Philip Hazel wrote:
> On Tue, 25 Apr 2006, Michel Meyers wrote:
>
>> If there's anything I could run to debug it, please let me know (as I
>> can reproduce the problem pretty easily here).
>
> Perhaps compiling Exim with -g and taking a core dump of a stuck process
> would help. If it is in a loop, that should indicate where the loop is.
> You may have to do special things in order to get a dump of a setuid
> process such as Exim - seek advice from Debian experts.
OK, I'm not knowledgeable with debugging stuff at all so I will
definitely need help there.
>> Had to wait to get home to reproduce the problem, here's the result:
>>
>> fuser /var/spool/exim4/db/retry.lockfile
>> /var/spool/exim4/db/retry.lockfile: 9934 9963
>>
>> ps ax | grep 9934
>> 9934 ? R 1:32 /usr/sbin/exim4 -Mc 1FYTF5-0002a2-VT
>> 10034 pts/7 R+ 0:00 grep 9934
>>
>> ps ax | grep 9963
>> 9963 ? S 0:00 /usr/sbin/exim4 -Mc 1FYTFD-0002aU-BC
>> 10060 pts/7 S+ 0:00 grep 9963
>>
>> a little later:
>>
>> fuser /var/spool/exim4/db/retry.lockfile
>> /var/spool/exim4/db/retry.lockfile: 9934
>>
>> 9934 is the stuck process.
>
> OK, so it really is an Exim process that has the file open. This is
> *very* mysterious, because the code should close the file when it cannot
> get a lock. It really is very straightforward:
The thing is: The hanging process seems to have gotten a lock and then
hung up shortly after not being able to deliver the message (in whatever
cleanup it is doing afterwards). Here's a transcript of such an event:
2006-04-26 13:16:31 1FYi0V-0006vp-06 <= address@domain H=localhost
[127.0.0.1] P=esmtp S=824 id=444F568B.6010002@domain
2006-04-26 13:16:32 1FYi0V-0006vp-06 == destination@domain2 R=dnslookup
T=remote_smtp defer (-44): SMTP error from remote mail server after RCPT
TO:<destination@domain2>: host mail.domain2 [82.149.xx.xx]: 451 GL -
temporary problem. Please try again later.
After that, the process eats 100% CPU, keeps the lockfile open and is in
status 'tidying up after delivering 1FYi0V-0006vp-06'.
> The very next statement after writing to the log is to close the file!
> This should just exit without updating the retry database. (And I can't
> believe it's looping in the log-writing function, though anything is
> possible...)
>
>> 2006-04-25 21:30:58 1FYTFD-0002aU-BC <= apache@domain U=Debian-exim
>> P=spam-scanned S=2855 id=fccbb95ffed7a6394c5a8b23b6ed0547@domain
>> 2006-04-25 21:31:58 1FYTFD-0002aU-BC Failed to get write lock for
>> /var/spool/exim4/db/retry.lockfile: timed out
>> 2006-04-25 21:32:58 1FYTFD-0002aU-BC Failed to get write lock for
>> /var/spool/exim4/db/retry.lockfile: timed out
>> 2006-04-25 21:32:58 1FYTFD-0002aU-BC => user <address@domain>
>> R=local_user T=mail_spool
>>
>> This time I didn't call 'runq', but I did issue several 'mailq's.
>
> A 'mailq' should not cause that. Ah, but did that message get stuck in a
> loop? Looks like it did manage to do a delivery. Did it complete? If so,
> that is what I would expect at one level - it failed to look at the
> retry database when routing, and then failed again when trying to update
> it at the end. (But we don't know why it couldn't get the lock in the
> first place.)
No, as mentioned before, I get those error messages for mails that
actually get delivered. The message that is stuck is never seen again in
the log until I manually kill the hanging process but while it is stuck,
the others 'fail to get write lock'.
> This is all very weird. Sorry it's taking so long to sort out - but it
> must be something unusual otherwise many other hosts would be having the
> same trouble.
No problem, I'm not in a hurry. I found a similar report (prior to
reporting my problem to Debian's BTS) that was never really resolved in
bug
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=314356 (the problem
seems to just have disappeared there although the submitter mentions
having changed HDDs in between). I also found other reports scattered
over the web with seemingly similar symptoms but I can't recall the
references to those right now.
Greetings,
Michel