Author: Ian Eiloart
Date:
To: Alun
CC: exim-users
Subject: Re: [exim] Estimating spam deliveries
--On 16 October 2007 15:51:54 +0100 Alun <auj@???> wrote:
> Hi Ian,
>
> This sounded interesting so I had a little play with my logs. It's an
> interesting approach, but I think you've probably just got too little
> information to make anything other than vague generalisations.
Yes, I think that's maybe true. Perhaps I can just look at how the
correlation varies with time. I've got logs going back for years...
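
A minimal sketch of tracking the correlation over time, assuming the logs have already been reduced to two equal-length lists of per-interval counts. The names `rejected`, `accepted` and `window` are hypothetical, and the windows here are consecutive and non-overlapping, which is only one of several reasonable choices:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one series is constant
    return sxy / math.sqrt(sxx * syy)


def windowed_correlation(rejected, accepted, window):
    """Correlation for each consecutive, non-overlapping window of points.

    A trailing partial window is dropped.
    """
    results = []
    for start in range(0, len(rejected) - window + 1, window):
        results.append(pearson(rejected[start:start + window],
                               accepted[start:start + window]))
    return results
```

With daily counts and `window=30`, this would give roughly one coefficient per month, which could then be compared against dates on which the filter rules changed.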
> Depending on your reporting interval you can end up with wildly
> different correlation coefficients - presumably a larger reporting
> interval (days rather than hours) gives a coefficient that's better
> when it comes to removing diurnal trends,
Yes, but why would I want to remove those? I haven't looked at diurnal
trends, but I think the better the temporal resolution, the safer my
conclusions can be.
> but then you end up with fewer
> data points with which to generate a coefficient and you can't track
> changes in the performance of your filters so well.
>
> Taking the past 6 weeks, I get the following from our logs:
>
>
> Correlation coefficient by day = 0.268
> Correlation coefficient by hour = 0.179
That's interesting. It would be interesting to see the significance levels.
It could be that the increased number of data points increases the
significance of the hourly data beyond that of the daily.
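
A rough sketch of that comparison, using the Fisher z-transformation as a large-sample approximation. The function name is hypothetical, and the sample sizes are assumptions: six weeks is taken to mean 42 daily points and 42 * 24 hourly points with no gaps. The test also treats the points as independent, which hourly mail counts almost certainly are not, so the real significance is likely weaker than this suggests.

```python
import math


def fisher_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis of zero correlation.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as
    a standard normal variate. Assumes independent, roughly bivariate-normal
    data points.
    """
    if n <= 3:
        raise ValueError("need more than 3 data points")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed sample sizes for a six-week period (not stated in the thread).
p_daily = fisher_p_value(0.268, 42)
p_hourly = fisher_p_value(0.179, 42 * 24)
print(f"daily:  p = {p_daily:.3g}")
print(f"hourly: p = {p_hourly:.3g}")
```

Under these assumptions the hourly coefficient, although smaller, comes out far more significant than the daily one, simply because it rests on 24 times as many points.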
>
> Neither shows any great correlation, so I guess I can assume my filters
> are doing well. Or are they?
Well, the statistical significance depends on the number of data points, as
well as the correlation coefficient. When I used daily datapoints for the
year, I got a correlation of 0.2, but significance at the 0.03% level! I
think this indicates quite high proportions of spam in mailboxes. If I were
rash, I'd say maybe 20% of incoming emails (internal email isn't measured
here).
However, I know that the correlation coefficient isn't linearly related to
the leakage. I was hoping someone might be able to point to some way of
estimating the amount of leakage, with confidence intervals.
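
As a partial step, a confidence interval can at least be put on the correlation coefficient itself. The sketch below uses the Fisher z-transformation again; the function name is hypothetical, and the 365 daily points for "the year" are an assumption. It says nothing directly about the leakage fraction, which would need a model relating leakage to correlation.

```python
import math
from statistics import NormalDist


def fisher_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation coefficient.

    Works on the Fisher z scale, where atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3), then maps the bounds back with tanh.
    Assumes independent data points.
    """
    if n <= 3:
        raise ValueError("need more than 3 data points")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Assumed: one data point per day for a year.
low, high = fisher_confidence_interval(0.2, 365)
print(f"95% interval for r: {low:.3f} to {high:.3f}")
```

An interval that excludes zero is consistent with a real, if weak, correlation, but the usual caveat about autocorrelated daily counts still applies.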
I think that means that my spam filters are definitely leaky (an
alternative hypothesis is that my users are more active when spammers are
active - but actually spammers seem to be more active at the weekends, and
my users and their correspondents are more active during the week).
>
> I wonder if it's possible to do something with your (presumably) known
> distinction between internally and externally generated e-mails.
Yes, I log locally generated emails in a different log file. So, all of the
email that I've measured is externally generated.
> If most of your institution's correspondence is with people in the same
> timezone
That seems likely.
> and your internal users don't generate spam
We don't filter them for spam. We do filter them for viruses, of course.
Internal users must use our MSA server; we block outbound port 25 for
almost all client machines. There's some exception for disabled users who
require external services.
> then is it reasonable to assume that internally generated e-mails should
> correlate strongly with accepted external mail? If you're letting through
> lots of spam then would it weaken this correlation?
>
> Doing this by hour, and taking one week in September:
>
> a) Correlation between internal and external accepted = 0.881
I'd expect a high correlation here, as much email is conversational.
> b) Correlation between internal and external rejected = 0.455
Hmm. That's very disturbing, isn't it?
> c) Correlation between internal and all external = 0.517
>
> Complete speculation, but if a) and c) above were close to equal
> then it would suggest to me that your spam filter was leaking badly.
>
> Cheers,
> Alun.
--
Ian Eiloart
IT Services, University of Sussex
x3148