> Actually, so far you failed to state a problem.
> Load is something like "number of processes waiting to be processed" and
> I have seen servers with load-values higher than 1000 which still
> reacted faster than my laptop.
>
> Load is not a problem. It might indicate a problem, but load itself is
> only a symptom.
>
> Disk-IO might be an explanation for "high" load-values. There are
> countless others. Find the real one before desperately trying out
> possible solutions... when I read your questions and the first answers
> which all circled around "optimising disk-io by tuning the kernel" I
> just felt desperation. Nobody even considered whether or not your
> assumptions sound reasonable. Nobody even asked questions....
>
> First, 2000 email-accounts does not sound like a big deal, the number of
> messages does. But of course, that's just 20 messages/mailbox. You
> mention one hard disk and that you can't afford a long downtime. That
> worries me. You can't provide reliable service without decent storage
> (RAID 1 or RAID 5, and backup, of course) and regular maintenance. Fix
> that.
>
> In order to understand the real problem, you need data about your
> system. For example, a recent top (I prefer atop) with a recent kernel
> will show a value "wa", the share of CPU time spent waiting for I/O.
> If you really have a problem regarding disk-io, "wa" should show this.
> Cpu(s): 1.4%us, 0.5%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Since applying the SpamAssassin "bayes_learn_to_journal 1" bayes tweak I
have not seen any serious load issues. Today is the closest I have come,
and it is not nearly as bad as what I have seen in the past. Here is a
portion of top. Previously the load average would break 100 at peak
times and stay there quite a while. This peak of ~33 did not last long
either: it quickly dropped back below 10 and is now below 5.
top - 11:27:13 up 1 day, 19:27, 1 user, load average: 31.33, 33.04, 18.19
Tasks: 251 total, 1 running, 241 sleeping, 0 stopped, 9 zombie
Cpu(s): 10.1% us, 3.3% sy, 0.0% ni, 25.4% id, 61.0% wa, 0.2% hi, 0.0% si
Mem: 4115344k total, 3785164k used, 330180k free, 267060k buffers
Swap: 2031608k total, 0k used, 2031608k free, 2203720k cached
PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND
346 nvcs 15 0 4 0:47.58 1.3 60580 51m 2964 S spamd
347 root 16 0 2 0:43.70 1.2 58764 49m 2952 S spamd
3536 clamav 16 0 1 21:21.45 2.6 176m 104m 1052 S clamd
4484 named 18 0 1 24:56.85 1.4 103m 57m 1936 S named
15223 nvcs 17 0 1 0:00.02 0.1 8376 3632 2020 S pyzor
493 root 15 0 0 3:28.21 0.0 0 0 0 D kjournald
3411 root 16 0 0 0:34.79 0.0 2392 492 360 S dovecot
15118 bbwi 18 0 0 0:00.06 0.0 8340 1148 700 D exim
15229 nvcs 18 0 0 0:00.01 0.0 7296 1152 700 D exim
28803 mail 15 0 0 0:40.12 0.0 7820 1236 852 S exim
28860 root 15 0 0 2:04.87 1.0 47192 38m 3252 S spamd
1 root 16 0 0 0:04.90 0.0 2860 548 468 S init
2 root RT 0 0 0:00.54 0.0 0 0 0 S migration/0
3 root 34 19 0 0:06.35 0.0 0 0 0 S ksoftirqd/0
4 root RT 0 0 0:00.40 0.0 0 0 0 S migration/1
5 root 34 19 0 0:05.23 0.0 0 0 0 S ksoftirqd/1
6 root 5 -10 0 0:00.19 0.0 0 0 0 S events/0
7 root 5 -10 0 0:00.08 0.0 0 0 0 S events/1
8 root 11 -10 0 0:00.00 0.0 0 0 0 S khelper
9 root 15 -10 0 0:00.00 0.0 0 0 0 S kacpid
43 root 5 -10 0 0:00.00 0.0 0 0 0 S kblockd/0
44 root 5 -10 0 0:00.00 0.0 0 0 0 S kblockd/1
45 root 15 0 0 0:00.00 0.0 0 0 0 S khubd
62 root 15 0 0 0:16.38 0.0 0 0 0 S pdflush
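To compare the "wa" figure between snapshots without eyeballing it, the
Cpu(s) line can be pulled apart with a few lines of Python. This is only
a minimal sketch: the function name parse_cpu_line is made up for
illustration, and it assumes the line looks like one of the two formats
pasted in this thread.

```python
import re

def parse_cpu_line(line):
    """Return a dict such as {'us': 10.1, 'wa': 61.0} from a top 'Cpu(s):' line."""
    # Accepts both "61.0% wa" (the top pasted above) and "0.0%wa" (newer top).
    pairs = re.findall(r"(\d+(?:\.\d+)?)%\s*([a-z]+)", line)
    return {name: float(value) for value, name in pairs}

cpu = parse_cpu_line(
    "Cpu(s): 10.1% us, 3.3% sy, 0.0% ni, 25.4% id, 61.0% wa, 0.2% hi, 0.0% si"
)
print(cpu["wa"])  # prints 61.0
```

In the snapshot above that gives 61.0% in "wa" against 25.4% idle.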
> I highly recommend running "munin" on every exim-server. It will gather
> lots of numbers regarding your server, which can be invaluable when
> facing problems.
>
> In my experience the most likely cause for high load-values on
> exim-servers is DNS-related. If you process 40k messages an hour and use
> SpamAssassin, more than 500k DNS-requests are likely. If you didn't
> worry about a local caching-DNS-daemon on your mailserver, then you
> should do that now. If you already have a local caching-DNS-daemon on
> your mailserver, consider moving it to a different server.
I am running bind on the server, and from what I have heard bind does
not create a significant system load.
> Any further advice would be wild guessing, so it's up to you to provide
> further data.
I still think this is a disk I/O issue but am no expert. Perhaps
there are more tweaks I can do to reduce disk I/O.
Do entries like this in exim.conf create more disk I/O?
  # deny email addresses listed in file
  deny recipients = lsearch;/etc/virtual/blocked_email
       message = Email account suspended due to inactivity
I use that entry to auto-suspend email accounts that have not been
used/checked in over 6 months. There are a number of other similar
entries used for things such as popb4smtp. I am not sure how efficiently
entries like that work.
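
As far as I understand the Exim documentation, an lsearch lookup reads
the file from the top and stops at the first line whose key matches, so
the cost grows with the length of the file. The Python below is only a
rough sketch of that idea and not Exim's actual code: it ignores quoted
keys and continuation lines, the case-insensitive comparison reflects my
reading of the docs, and the addresses and temporary file are made up
for illustration.

```python
import os
import re
import tempfile

def lsearch(path, key):
    """Linear scan: return the data stored for key, or None if it is absent."""
    key = key.lower()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # The key is the first field, ended by a colon or whitespace.
            parts = re.split(r"[:\s]+", line, maxsplit=1)
            if parts[0].lower() == key:
                return parts[1] if len(parts) > 1 else ""
    return None  # reached end of file without a match

# Hypothetical stand-in for /etc/virtual/blocked_email.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# hypothetical suspended accounts\n")
    tmp.write("old@example.com\n")
    tmp.write("unused@example.org: suspended\n")
    path = tmp.name

hit = lsearch(path, "Unused@Example.org")   # found on the last line
miss = lsearch(path, "active@example.com")  # whole file read, no match
os.remove(path)
print(hit, miss)
```

Note that a recipient who is not in the file is the expensive case,
since every line gets read. Whether that turns into real disk I/O
presumably depends on whether the file stays in the page cache.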
Matt