Re: [Exim] Spam filter Q:

From: Michael J. Tubby B.Sc. G8TIC
Date:  
To: frank, C.B.Bayliss
Cc: Tamas TEVESZ, eramunds, exim-users
Subject: Re: [Exim] Spam filter Q:
----- Original Message -----
From: "Frank S. Bernhardt" <frank@???>
To: <C.B.Bayliss@???>
Cc: "Tamas TEVESZ" <ice@???>; <eramunds@???>;
<exim-users@???>
Sent: Thursday, November 22, 2001 4:04 PM
Subject: Re: [Exim] Spam filter Q:


> Yeah, but have you ever seen an e-mail that uses "tit" as a reference to a
> small bird (except in this list :-) )?
>
> But seriously, you're right about the word breaks. Without them you're
> gonna reject a whole lotta e-mail erroneously. Spammers are getting
> creative too. I'm receiving e-mails where the subject line contains, for
> example, not the word "sex" but "s.e.x.". That gets it through my filter.
> How do you handle that one?
>


Implement it as a system filter in Perl and use the ability to pack/unpack
against a template which removes all the punctuation/whitespace/numbers
first, so that $f!&,u77c[{[k<>! is still found as a rude word (this
technique is already used in the ham radio IRC-like program DXspider by
Dirk Koopman, which is written in Perl).
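
As a rough illustration of the "strip the noise first" step, here is a
minimal sketch. It swaps pack/unpack for Perl's tr/// operator, which is
the simpler way I know to delete characters; it is not taken from DXspider.
The names squash() and contains_listed_word() are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case the text, then delete every character
# that is not a letter a-z, so punctuation, whitespace and digits vanish.
sub squash {
    my ($text) = @_;
    my $s = lc $text;
    $s =~ tr/a-z//cd;    # c = complement the set, d = delete what matches
    return $s;
}

# Hypothetical helper: number of listed words found inside the squashed text.
sub contains_listed_word {
    my ($text, @words) = @_;
    my $s    = squash($text);
    my $hits = grep { index($s, $_) >= 0 } @words;
    return $hits;
}

print squash('s.e.x.'), "\n";    # prints "sex"
```

Note that deleting whitespace also deletes the word breaks, so a substring
match on the squashed text can fire across word boundaries (the Scunthorpe
problem again). Squashing each whitespace-delimited token separately keeps
the breaks, at the cost of missing words that are spelled out with spaces.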

Then in your system filter you can decide whether to match headers (e.g.
Subject) and/or the main body, and apply weightings to words. You could
read all words and their "badness quotients" into a hash, e.g.:

    my %badwords = (
        'f_uck' => 10,
        'f_cking' => 10,
        'p_orn' => 10,
        'p_enis' => 9,
        'h_ardcore' => 8,
        's_ex' => 7,
        't_it' => 5,
        't_een' => 5,
        'l_icking' => 2,
        # ...
    );


        ### extra '_'s added to protect the mailing list (Nigel) ###


and find them very quickly, running up a 'score' for the message, then dump
the message if the score exceeds a certain threshold. Actually you'd
probably want several hashes for various classes of message, i.e. one for
sex, one for marketing/promos, one for total junk, etc., and run an
accumulator for each message class... would probably make a nice little
project for a CS student ;-)
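
A minimal sketch of the per-class accumulator idea might look like the
following. The class names, words, weights and thresholds are all
placeholders made up for illustration, as are the function names
score_message() and classes_over_threshold(). It splits on non-letters so
that only whole words are looked up, and it needs Perl 5.10 or later for
the // operator.

```perl
use strict;
use warnings;

# Hypothetical word lists, one hash per message class (placeholder values).
my %classes = (
    promo => { 'free' => 3, 'winner' => 5, 'offer' => 2 },
    junk  => { 'viagra' => 8, 'lottery' => 6 },
);

# Hypothetical per-class thresholds.
my %threshold = ( promo => 6, junk => 8 );

# Return a hash ref of class => accumulated score for the given text.
sub score_message {
    my ($text) = @_;
    my %score = map { $_ => 0 } keys %classes;
    for my $word (split /[^a-z]+/, lc $text) {
        next unless length $word;    # skip the empty leading field
        for my $class (keys %classes) {
            # Unlisted words contribute nothing to the class score.
            $score{$class} += $classes{$class}{$word} // 0;
        }
    }
    return \%score;
}

# Return the sorted list of classes whose score reached its threshold.
sub classes_over_threshold {
    my ($text) = @_;
    my $score = score_message($text);
    my @hit   = grep { $score->{$_} >= $threshold{$_} } keys %$score;
    @hit = sort @hit;
    return @hit;
}
```

A filter would then call classes_over_threshold() on the subject line or
body and dump the message if the returned list is non-empty. Keeping the
thresholds per class lets you be stricter with one category than another.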


Mike




> Chris Bayliss wrote:
>
> > >
> > > On Thu, 22 Nov 2001, Jan Erik Amundsen wrote:
> > >
> > > > if $body_content-type: matches "fuck|shit|lesbian|tits|cash"
> > >
> > > honestly, how many mails have you ever seen (or even heard of) that
> > > have a content-type like these ?
> > >
> > >
> >
> > Loads of them have these words in the message or message headers.
> > $message_body or $header_subject would be a better place to search for
> > a match.
> >
> > However, unless words are delimited by breaks, you hit the Scunthorpe
> > problem. Filtering out tits seems a little harsh; according to my
> > dictionary the word tit is also used to mean a small bird.
> >
> > Chris Bayliss
> >
> > --
> >
> > ## List details at http://www.exim.org/mailman/listinfo/exim-users Exim details at http://www.exim.org/ ##
>
> --
>
> Regards
>
> Frank S. Bernhardt
> b.c.s.i.
> 14 Halton Court
> Markham, ON.
> L3P 6R3
>
> 905-471-1691 Voice
> 905-471-3016 FAX
>
> frank@???
>
>
>