Re: [exim] SMTP processing hangs during bayes sync

Author: W B Hacker
Date:
To: exim users
Subject: Re: [exim] SMTP processing hangs during bayes sync
Bodo Gelbe wrote:

> On Thu, 31 Aug 2006, W B Hacker wrote:
>
>
>>Bodo Gelbe wrote:
>>
>>
>>>I have the following problem with exim and exiscan:
>>>
>>>Exim calls spamd in the DATA ACL (for messages from
>>>external hosts).
>>>
>>>spamassassin/spamd is configured with
>>>
>>>bayes_auto_learn 1
>>>bayes_journal_max_size 0
>>>bayes_auto_expire 0
>>>bayes_learn_to_journal 1
>>>
>>>
>>>From time to time (cronjob) the database and the journal are
>>>synchronized (sa-learn --sync). During the synchronization
>>>process SMTP requests are hanging for 20 to 50 seconds (even
>>>those from local hosts, which will not be scanned by spamd).
>>>The number of exim processes is increasing.
>>
>>Does 'exiwhat' confirm that local traffic is in the pile of slowed-down
>>processes? Does it tell you anything else that might be useful?
>
>
> It shows an increasing number of processes with status
> "handling incoming connection from ..."
>
>
>>What is CPU/memory load at the time?
>
> Low.


OK.

'top -U <your eximd UID>'

Should show these in, for example, the 'sbwait' state, using no significant resources.
Each is waiting for a resource or time-out.

How to determine who has the baton, and where:

- Temporarily add a 'logwrite' line to each present acl clause, and/or prefix
the acl with a bare 'warn' clause, identifying where in the SMTP process and
which acl is being entered.

Example of such for my second acl clause in the DATA phase (D2):

logwrite = D2 Checking $sender_address from $sender_host_address

This will make a log entry even if the acl clause does not 'trigger'.
It lets you know where a connection is sitting, with a time-stamp.
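
A minimal sketch of such a clause, assuming your DATA ACL is named
'acl_check_data' (use whatever name your configure file actually gives it):

acl_check_data:

  # A bare 'warn' with no conditions always succeeds, so this logwrite
  # fires for every message that reaches DATA, whether or not any later
  # clause triggers.
  warn
    logwrite = D2 Checking $sender_address from $sender_host_address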

Also add a line (or modify one) so as to write only when the acl clause DOES
trigger. Example:

log_message = D2 Passing $sender_address from $sender_host_address
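
As a hypothetical sketch, attach it to a clause whose conditions reflect the
setup described here (check only hosts outside an assumed '+relay_from_hosts'
list of local hosts); the log_message line is then written only when those
conditions are met:

  warn
    hosts       = ! +relay_from_hosts
    log_message = D2 Passing $sender_address from $sender_host_address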

Now set 'log_selector = +all'

Reload Exim, then:

'tail -f -n 1000 /var/log/exim/mainlog'

You will be able to see the TCP connection count rising, and what each child
process is doing as it proceeds. For now, the log entries will be interleaved.

Once you can see where the delay is occurring, you can comment out those of the
above tools not needed for more detailed inspection, restart Exim, and look more closely.

After running for a while, you may then grep on a connecting IP and/or sender,
HELO, message ID - whatever - to look at a particular session.
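
For example (the message ID here is the one quoted further down; the IP
address is purely illustrative):

'grep 1GJ2ZR-0000yT-OK /var/log/exim/mainlog'
'grep 192.0.2.45 /var/log/exim/mainlog'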

The time-stamps and the acl IDs will tell you how long, and WHERE, each process
is waiting.

When next you edit the configure file, searching on the acl ID involved will
take you directly to the problematic acl.
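
(e.g. something like 'grep -n "D2" /path/to/your/exim/configure' - the path to
the configure file varies with the install.)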

Temporarily commenting it out, reloading Exim, and repeating the observations
will confirm what is/is not the point of delay.

When finished, 'log_selector' may be set back to your normal state, and all of
the above clauses that are no longer needed commented out.

>
>
>>>Looks like something is locked within exim before spamd
>>>is called, which locks out processing for messages not
>>>undergoing spamd processing.
>>>
>>
>>It may be that those that *are* awaiting SA have used up enough resources (or hit
>>limits) that the others are simply waiting for earlier processes to finish?
>


With the technique above, 'looks like' will be much more accurate.
You should now know exactly where the delay arises and from what activity.

>
> I've tried a "sendmail -d+all ..." during bayes sync and it shows
> two delays:
>
> 08:32:05 3749 Date: Fri, 01 Sep 2006 08:32:05 +0200
> 08:32:05 3749
> 08:32:13 3749 Data file written for message 1GJ2ZR-0000yT-OK
>
>
> 08:32:13 3749 Writing spool header file
> 08:32:40 3749 Size of headers = 278
>
>
> It seems something in the spool area is locked before the call to
> exiscan, and exiscan is waiting for the lock on the bayes databases
> to be released -> processing of the incoming message gets stuck.
>
>
>>Are your various limits all at default, or?
>>
>>
>>>Any idea how to avoid the delay (don't want a transport
>>>because we're doing greylisting based upon the spam level)?
>>>
>>>kr, Bodo
>>>
>>
>>Changing the scheduled sync to low-load time-of-day, or providing greater
>>resources aside, I'll suggest a contrarian approach:
>>
>>- Check and see how many times the Bayes point contribution to the total SA
>>score has been sufficiently high to 'tip the balance' when a given message was
>>not *already* firmly tagged as Spam by other rules, OR deniable for protocol,
>>invalid recipient/domain, format, attachment, MIME, or viral reasons.
>>
>>Hint: In the as-issued SA, most Bayes scores are so minuscule they rarely
>>contribute enough to matter, yet Bayes consumes significant resources.
>>
>>Worse - Bayes on a server, where user traffic can be so very different from one
>>user to another, seems to get confused far more easily than when run on an
>>individual user's MUA. (It pays its way on Mozilla/Thunderbird and others...)
>>
>>Server-side, we've shut Bayes off entirely with more beneficial results than not.
>>
>>Greylisting, BTW, is likely to more than double your connection load (retries may
>>be idiotically rapid for zombies) - spawning child processes that may not go all
>>the way through to the DATA phase, but will certainly consume resources.
>
>
>
> Yes, but we're using greylisting only for selected mailboxes.
>


Or so you intend.... See above to make sure.

>
>
>>Might help to run 'queue_only', adjust your queue runner interval 'til the
>>resources balance better.
>
>
> Some users are complaining because of the pause during the SMTP session.
> They're using pine with SMTP support (not sendmail).
>


An SMTP delay can arise from *either* party sitting with its finger up.

See what new information the enhanced logging provides.

> It's not a resource problem. From my point of view it's a
> problem with setting/releasing locks.
>


What is important is not your point of view or my point of view, but what the
ENTIRE system (attached user and network included...) is actually doing/failing
to do.

;-)


Sometimes these things track down to network packet loss, ISP firewall rules
that interfere with MTU size negotiation, or even, in one actual case here, a
faulty ethernet NIC driver & code!

Try enhanced logging...

- But don't leave it running forever unless you have lots of spare space and CPU
cycles!


Bill