Marc Perkel wrote:
>
> This is what works for me so far. I need the first From line with a
> newline in front of it which I added - and - I could use a blank line at
> the end of the headers if you could add that.
>
> And - another feature - can you also extract email addresses in the
> message? I think that embedded emal addresses would help in spam
> detection.
Sure, kind of, but grabbing email addresses is messy as nearly anything
can be a valid email address when properly qouted...
> For what it's worth. This is working EXTREMELY well and I have yet to
> add your code. I'm using spamprobe to create a bayesian database on just
> headers. Now I'm going to try to use just headers and links in the
> message - and embedded emails - (maybe phone numbers? can you extract
> phone numbers too?) This might be a breakthrough in spam detection!
I am not going to mess with phone numbers, I think that is messier than
email address, perhaps someone else can add that funtion. As it is, I
have modified things from the very small headers and
http://<domain>
only to a larger script. It should be documented well enough to get
you or someone else going. Just as a note, the email grabbing will
not be perfect but should be close enough.
By the way, where should I send the bill ;)
> #!/usr/bin/perl
> use strict;
> my ($_firstHeader,$_lastHeader,$_message) = (0,0,0);
> while (<STDIN>) {
> ($_firstHeader,$_lastHeader) = (1,0) and $_message++ if /^From
> /; ($_firstHeader,$_lastHeader) = (0,1) if /^$/;
> s/^From /\nFrom /;
> print STDOUT if $_firstHeader && !$_lastHeader && !/^\s*$/;
> print STDOUT "$1\n" if m{(http://[A-Za-z0-9.-]+).*$};
> }
And now, for the new script....
<PERL_SCRIPT>
#!/usr/bin/perl -w
use strict;
# Setup the variables that we are going to use
# $_firstHeader: Keep track of the start of a message
# $_lastHeader : The header//body seperator for a message
# $_isText : Skip any non text MIME parts
# %http : http://<domains>/ found in a message
# %addr : any email\@<domain> found in a message
my ($_firstHeader,$_lastHeader,$_isText,%http,%addr) = (0,0,1,(),());
# Grap the information from STDIN, such as "script <file"
# or "cat file|script" etc. The output is in the form
#
# <NEW LINE>
# From ....
# header: ....
# header: ....
# header: ....
# <NEW LINE>
# http://domain ...
# http://domain ...
# email://address ...
# email://address ...
# <NEW LINE>
while (<STDIN>) {
# MBOX messages start with "From " and check for End of File
# as we need to print the http domains and email links
# if we found any before exiting.
if ( /^From / || eof )
{
($_firstHeader,$_lastHeader) = (1,0) if !eof;
print STDOUT sort keys %http and %http = () if %http;
print STDOUT sort keys %addr and %addr = () if %addr;
s/^/\n/;
}
# If we are in the headers section of a message print the line.
# Set that we are done with the headers section at the first
# blank line we find, as that should be the message
# header//body seperator.
if ( $_firstHeader )
{
($_firstHeader,$_lastHeader) = (0,1) if /^\s*$/;
print STDOUT;
}
# We only want the headers of a message and then only check
# none 'binary' MIME parts, like text/html, text/plain,
# that type of thing. Or just comment these lines if you
# want the http and email stuff run on every part of a message
# no matter what the MIME part type is.
$_isText = 0 if /^Content-Type: /;
$_isText = 1 if /^Content-Type: text.*/i;
next if !$_isText;
# While a line contains http://<something> in it, strip out
# everything upto the first / (or space, etc) after http://
# this is not perfect but should do the job in most cases.
# If you only want http:// domains in the body parts only,
# then add '$_lastHeader && ' before the 'm{'. Only add
# unique domain strings.
while ( m{(http://[A-Za-z0-9.-]+)}g )
{
$http{"$1\n"}++ if !$http{"$1\n"};
}
# While a line contains anything@something then grab the stuff
# that looks like it is an email address. This is less
# perfect than the http stuff, but then again, checking for
# 'valid' email addressess in something is near impossible ;)
# We do try and make the email address thing as clean as
# possible. If you only want the email address looking strings
# from the body of a message then add '$_lastHeader && ' before
# the 'm{'. Only add unique email strings.
while ( $_lastHeader && m{\s+([^[:space:]]+\@[A-Za-z0-9.-]+)}g )
{
my $email = $1;
next if $email !~ /\@[A-Za-z0-9.-]+\..*/ || $email =~ /\@.*\@/;
$email =~ s/[)<\["'\]>(]|(?i:mailto:|href=)|\.$//g;
$addr{"email://$email\n"}++ if !$addr{"email://$email\n"};
}
}
exit 0;
__END__
</PERL_SCRIPT>
--
--EAL--
--