[exim] Weird loads with maildir_use_size_file

Author: Gergely Nagy
To: exim-users
Subject: [exim] Weird loads with maildir_use_size_file

Greetings!

First of all, apologies in advance for the lack of information regarding
the problem - it's pretty darn hard to test the problem as it occurs on
a mission critical system which I can't freely play around with.

The gist of the issue is that I have a configuration which Works(tm):
it's fast and reliable. However, I would like to use maildirsize files,
but whenever I turn maildir_use_size_file on in the appropriate
transport, the load goes from the usual 10-20 to 600 and above within
half a minute. I believe it would rise even further, but so far I have
always turned the option off again before that.

Without much further ado, the relevant transport in my exim4.conf looks
like this:

local_delivery:
  driver = appendfile
  directory = /m/${l_1:$local_part}/${local_part}@${domain}
  quota_directory = /m/${l_1:$local_part}/${local_part}@${domain}
  quota = ${lookup mysql{...}}
  mailbox_size = 0
  maildir_format
  # maildir_use_size_file
  maildir_tag = ",S=$message_size"
  user = mail
  maildirfolder_create_regex = /\.[^/]+/$
  maildir_quota_directory_regex = ^(?:cur|new|\.(?!Trash).*)$
  check_string = ""
  message_prefix = ""
  message_suffix = ""
  no_mode_fail_narrower
  no_quota_is_inclusive
  quota_size_regex = ,S=(\d+)
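
To make the directory layout concrete, here is a small Python sketch of
how the directory expansion above would resolve for one address. It
assumes ${l_1:...} is the short form of Exim's length operator (keep the
first character). The address is made up for illustration, and the code
only mimics the string expansion, not Exim's implementation.

```python
def mailbox_dir(local_part: str, domain: str, root: str = "/m") -> str:
    """Mimic /m/${l_1:$local_part}/${local_part}@${domain}."""
    first = local_part[:1]  # ${l_1:...}: keep at most the first character
    return f"{root}/{first}/{local_part}@{domain}"


# Hypothetical address, used only to show the resulting path.
print(mailbox_dir("gergely", "example.org"))
```

With this layout, all mailboxes sharing a first letter live under the
same parent directory.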

The hardware in question is a dual quad-core Intel Xeon with 4 GB of RAM
and a couple of SAS disks appropriately set up to handle the load. It's
running Debian etch, with Exim 4.63 (+ whatever patches Debian applied).

The usual load is between 10 and 30 (during the busiest hours), handling
15-25 mails / second with maildir_use_size_file turned off.

Every mail that reaches this particular computer is stored locally,
there is no outgoing mail.

The only other service running is courier (imap & pop, both of which
update the maildirsize file on their own and do not cause significant
load, even when there are hundreds of users logging in roughly at the
same time).

Since courier updates the files without problems, and there are only
about 50-60 exim processes running at any given time, I very much doubt
this is a hardware issue. If the machine can handle 30 GB (roughly 700k
to 1 million individual mails) of incoming mail each day AND provide a
usable imap service too, the hardware should be adequate for the job.

However, I did do a few tests - keep in mind, this is a live system, and
I can't run longer tests if they push the load up:

Test #1
=======

I made a separate exim config file for testing purposes, which is
exactly the same as the live one, except it has maildir_use_size_file
turned on.

Then I sent a few thousand random messages through it - no problem,
deliveries went fast, and the load did not increase in any noticeable way.

So, after I had about 14k messages in the maildir, I deleted the
maildirsize file to force a regeneration - no problem at all, either.
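
For reference, a maildirsize file in the Maildir++ convention has a
quota definition on its first line, followed by lines that each hold a
byte count and a message count (either may be negative after
deletions). Current usage is the sum of those lines. The Python sketch
below shows that bookkeeping on made-up sample content. It illustrates
the file format only and is not taken from Exim's or courier's code.

```python
from typing import Tuple


def maildirsize_usage(text: str) -> Tuple[str, int, int]:
    """Return (quota definition, total bytes, total messages)."""
    lines = text.splitlines()
    quota_def = lines[0].strip() if lines else ""
    total_bytes = 0
    total_msgs = 0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        total_bytes += int(fields[0])
        total_msgs += int(fields[1])
    return quota_def, total_bytes, total_msgs


# Hypothetical file content: quota line, then three size/count entries.
sample = "10000000S,1000C\n   52000   14\n    3100    1\n   -3100   -1\n"
print(maildirsize_usage(sample))
```

Deleting the file discards these running totals, so the next writer has
to rebuild them by scanning the mailbox.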

(I did that particular step with exim -d+all -bs, and exim seemed to
behave correctly, only listing the contents of the directory, retrieving
the mail sizes from the filenames and not statting them at all.)
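
The size lookup described here can be sketched in Python as follows.
The pattern is the quota_size_regex from the transport; the file names
are invented, and falling back to stat() when no tag is present is
shown only as a comment, on the assumption that this is what a missing
tag would require.

```python
import re
from typing import Optional

SIZE_RE = re.compile(r",S=(\d+)")  # same pattern as quota_size_regex


def size_from_name(filename: str) -> Optional[int]:
    """Return the size encoded in the file name, or None if absent."""
    match = SIZE_RE.search(filename)
    return int(match.group(1)) if match else None


# Hypothetical maildir file names.
names = [
    "1200000000.H1P2.host,S=2048:2,S",
    "1200000001.H3P4.host,S=512",
    "1200000002.H5P6.host",  # no size tag: a stat() would be needed
]
sizes = [size_from_name(n) for n in names]
print(sizes)
```

Because maildir_tag appends ,S=$message_size at delivery time, every
message delivered through this transport should carry the tag.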

Afterwards, I created about a hundred directories (.randomdir.$n), and
distributed a few thousand (~5k) messages randomly between them, and
forced another maildirsize file rebuild - again, no problem.
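
As a quick check of which directory names the
maildir_quota_directory_regex from the transport would count, the
pattern can be tried in Python. Exim uses PCRE rather than Python's re
module, so this is only an approximation, though this pattern uses no
syntax on which the two differ. The directory names are examples.

```python
import re

# Pattern copied from maildir_quota_directory_regex in the transport.
QUOTA_DIR_RE = re.compile(r"^(?:cur|new|\.(?!Trash).*)$")


def counts_toward_quota(dirname: str) -> bool:
    """True if the directory name matches the quota directory pattern."""
    return QUOTA_DIR_RE.search(dirname) is not None


for name in ["cur", "new", "tmp", ".randomdir.7", ".Trash", ".Trash.old"]:
    print(f"{name:14} {counts_toward_quota(name)}")
```

With this pattern, every .randomdir.$n folder is included in a rebuild,
while anything whose name starts with .Trash is skipped.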

Then I launched a stress test, and bombed the mailbox with 10-15
concurrent deliveries, and the problem did not surface during this test,
either.

Test #2
=======

Since the tests I made with exim did not exhibit the problem, I tried to
see if I could make courier-imap fail, and deleted the maildirsize files
of the busiest and biggest maildirs we have, to force a recreation.

The load did rise a bit, but nowhere near as high as when I turned
maildir_use_size_file on.

Test #3
=======

Taking a different route, I figured I'd keep an eye on the various
system resources while maildir_use_size_file is on.

Since I can't leave it on for longer than a minute or two, the data is
probably rather unreliable, but I did see a slight increase in disk I/O,
which was expected. The increase was hardly noticeable, though, while
the load skyrocketed.

I'll probably perform more tests of this nature today to see if I can
spot anything in the ~30 seconds I have to investigate before having to
turn the option off again.
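
One thing that can be captured in such a short window is the state of
the exim processes. On Linux the load average counts tasks in
uninterruptible sleep (state D) as well as runnable ones, so a high load
with little extra disk I/O may mean many processes are blocked rather
than busy. The Python sketch below takes one snapshot from /proc. The
"exim" name prefix is an assumption about how the processes are named
on this system.

```python
import glob
from collections import Counter


def parse_stat(stat_text: str):
    """Return (command name, state letter) from a /proc/<pid>/stat line."""
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")  # the name may itself contain ')'
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return comm, state


def state_counts(prefix: str = "exim") -> Counter:
    """Count process states (R, S, D, ...) for matching command names."""
    counts = Counter()
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not parseable
        if comm.startswith(prefix):
            counts[state] += 1
    return counts


def snapshot(prefix: str = "exim"):
    """Return (1-minute load average or None, state counts as a dict)."""
    try:
        with open("/proc/loadavg") as fh:
            load1 = float(fh.read().split()[0])
    except (OSError, ValueError, IndexError):
        load1 = None  # not on Linux, or /proc is unavailable
    return load1, dict(state_counts(prefix))


print(snapshot())
```

Running it once with the option off and once with it on would show
whether the extra load comes from processes in state D or state R.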

Conclusion
==========

In conclusion, I have no idea where to look further, except trying to do
some more tests, and figuring out which maildirs get accessed during the
time maildir_use_size_file is on, copying them to a temporary test area,
and trying to deliver mail there under exim -d+all and spot anything
suspicious.
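
A cheap way to find out which maildirs were touched while the option was
on might be to look at the modification times of the maildirsize files
themselves. The Python sketch below assumes the
/m/<letter>/<user>@<domain>/ layout from the transport and only looks
two levels deep, so it never descends into cur or new. The 120-second
window is an arbitrary choice.

```python
import glob
import os
import time


def recently_touched(root: str = "/m", window: float = 120.0):
    """Return maildirsize paths modified in the last `window` seconds,
    most recently modified first."""
    now = time.time()
    hits = []
    for path in glob.glob(os.path.join(root, "*", "*", "maildirsize")):
        try:
            mtime = os.stat(path).st_mtime
        except OSError:
            continue  # file vanished between listing and stat
        if now - mtime <= window:
            hits.append((mtime, path))
    return [path for _, path in sorted(hits, reverse=True)]


for path in recently_touched():
    print(path)
```

The resulting list gives candidate mailboxes to copy into a test area.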

So, the big question is... what am I doing wrong, and where should I look?

If needed, I'll try to provide as much information about the system as
possible, but, as said above, it being a production system, I'm quite
limited in what I can do.

Thanks in advance,
--
Gergely Nagy <gergely.nagy@???>