Re: [exim] Need a script to separate headers

Etusivu
Poista viesti
Vastaa
Lähettäjä: Edgar Lovecraft
Päiväys:  
Vastaanottaja: Marc Perkel, exim-users
Kopio: 
Aihe: Re: [exim] Need a script to separate headers
Marc Perkel wrote:
>
> This is what works for me so far. I need the first From line with a
> newline in front of it which I added - and - I could use a blank line at
> the end of the headers if you could add that.
>
> And - another feature - can you also extract email addresses in the
> message? I think that embedded emal addresses would help in spam
> detection.


Sure, kind of, but grabbing email addresses is messy as nearly anything
can be a valid email address when properly qouted...

> For what it's worth. This is working EXTREMELY well and I have yet to
> add your code. I'm using spamprobe to create a bayesian database on just
> headers. Now I'm going to try to use just headers and links in the
> message - and embedded emails - (maybe phone numbers? can you extract
> phone numbers too?) This might be a breakthrough in spam detection!


I am not going to mess with phone numbers, I think that is messier than
email address, perhaps someone else can add that funtion. As it is, I
have modified things from the very small headers and http://<domain>
only to a larger script. It should be documented well enough to get
you or someone else going. Just as a note, the email grabbing will
not be perfect but should be close enough.

By the way, where should I send the bill ;)

> #!/usr/bin/perl
> use strict;
> my ($_firstHeader,$_lastHeader,$_message) = (0,0,0);
> while (<STDIN>) {
>     ($_firstHeader,$_lastHeader) = (1,0) and $_message++ if /^From
> /;     ($_firstHeader,$_lastHeader) = (0,1) if /^$/;
>     s/^From /\nFrom /;
>     print STDOUT if $_firstHeader && !$_lastHeader && !/^\s*$/;
> print STDOUT "$1\n" if m{(http://[A-Za-z0-9.-]+).*$};
> }


And now, for the new script....

<PERL_SCRIPT>
#!/usr/bin/perl -w

use strict;

# Setup the variables that we are going to use
#    $_firstHeader: Keep track of the start of a message
#    $_lastHeader : The header//body seperator for a message
#    $_isText     : Skip any non text MIME parts
#    %http        : http://<domains>/ found in a message
#    %addr        : any email\@<domain> found in a message
my ($_firstHeader,$_lastHeader,$_isText,%http,%addr) = (0,0,1,(),());


# Grap the information from STDIN, such as "script <file"
#    or "cat file|script" etc.  The output is in the form
#
# <NEW LINE>
# From ....
# header: ....
# header: ....
# header: ....
# <NEW LINE>
# http://domain ...
# http://domain ...
# email://address ...
# email://address ...
# <NEW LINE>
while (<STDIN>) {


# MBOX messages start with "From " and check for End of File
#    as we need to print the http domains and email links
#    if we found any before exiting.
    if ( /^From / || eof )
    {
      ($_firstHeader,$_lastHeader)           = (1,0) if !eof;
      print STDOUT sort keys %http and %http = ()    if %http;
      print STDOUT sort keys %addr and %addr = ()    if %addr;
      s/^/\n/;
    }


# If we are in the headers section of a message print the line.
#    Set that we are done with the headers section at the first
#    blank line we find, as that should be the message
#    header//body seperator.
    if ( $_firstHeader )
    {
      ($_firstHeader,$_lastHeader) = (0,1) if /^\s*$/;
      print STDOUT;
    }


# We only want the headers of a message and then only check
#    none 'binary' MIME parts, like text/html, text/plain,
#    that type of thing.  Or just comment these lines if you
#    want the http and email stuff run on every part of a message
#    no matter what the MIME part type is.
    $_isText = 0 if /^Content-Type: /;
    $_isText = 1 if /^Content-Type: text.*/i;
    next if !$_isText;


# While a line contains http://<something> in it, strip out
#    everything upto the first / (or space, etc) after http://
#    this is not perfect but should do the job in most cases.
#    If you only want http:// domains in the body parts only,
#    then add '$_lastHeader && ' before the 'm{'.  Only add
#    unique domain strings.
    while ( m{(http://[A-Za-z0-9.-]+)}g )
    {
      $http{"$1\n"}++ if !$http{"$1\n"};
    }


# While a line contains anything@something then grab the stuff
#    that looks like it is an email address.  This is less
#    perfect than the http stuff, but then again, checking for
#    'valid' email addressess in something is near impossible ;)
#    We do try and make the email address thing as clean as
#    possible.  If you only want the email address looking strings
#    from the body of a message then add '$_lastHeader && ' before
#    the 'm{'.  Only add unique email strings.
    while ( $_lastHeader && m{\s+([^[:space:]]+\@[A-Za-z0-9.-]+)}g )
    {
      my $email =  $1;
      next if $email !~ /\@[A-Za-z0-9.-]+\..*/ || $email =~ /\@.*\@/;
      $email =~ s/[)<\["'\]>(]|(?i:mailto:|href=)|\.$//g;
      $addr{"email://$email\n"}++ if !$addr{"email://$email\n"};
    }


}

exit 0;

__END__
</PERL_SCRIPT>

--

--EAL--

--