Philip Hazel wrote:
> On Fri, 21 Apr 2006, Andreas Metzler wrote:
>
> I wonder how we can track this down. There must be something different
> about Michel's system, because nobody else is reporting this, and there
> must be many cases of this kind of retrying happening to lots of people.
If there's anything I could run to debug it, please let me know (as I
can reproduce the problem pretty easily here).
>>> The given log had this:
>>>> 2006-04-04 09:13:48 1FQfhL-00035E-Ay Failed to get write lock for
>>>> /var/spool/exim4/db/retry.lockfile: timed out
>>>> 2006-04-04 09:14:48 1FQfhL-00035E-Ay Failed to get write lock for
>>>> /var/spool/exim4/db/retry.lockfile: timed out
>>> which suggests two tries for the same message, one minute apart. How
>>> often was the OP starting queue runners?
>> <the usual -q30m>
>
> Hmm. So why are there those two messages, I wonder?
Don't get too hung up on them. I do not recall the exact circumstances
under which those were generated (I might have called runq manually at
the time).
>> Note that I get those for mails that are not stuck
>
> They should just be getting read locks (and the message is wrong, as per
> the bug I found), but why are they failing? I guess the next question is
> what DBM library is in use?
I guess you mean libdb4.2 (package rev 4.2.52-23.1 is installed)?
> What kind of file system is used for
> /var/spool/exim4? I'm grasping at straws here.
/var is ext3
>> 2006-04-20 21:06:35 1FWeOO-000396-DP Spool file is locked (another
>> process is handling this message)
>
> At least *some* locking is working. :-)
>
> Does the OP have any kind of tool for looking at open files to see what
> process is using them? For example, fuser? The output of
>
> fuser /var/spool/exim4/db/retry.lockfile
>
> might be helpful.
I had to wait until I got home to reproduce the problem; here's the result:
fuser /var/spool/exim4/db/retry.lockfile
/var/spool/exim4/db/retry.lockfile: 9934 9963
ps ax | grep 9934
9934 ? R 1:32 /usr/sbin/exim4 -Mc 1FYTF5-0002a2-VT
10034 pts/7 R+ 0:00 grep 9934
ps ax | grep 9963
9963 ? S 0:00 /usr/sbin/exim4 -Mc 1FYTFD-0002aU-BC
10060 pts/7 S+ 0:00 grep 9963
A little later:
fuser /var/spool/exim4/db/retry.lockfile
/var/spool/exim4/db/retry.lockfile: 9934
9934 is the stuck process. The other one was a normal message that got
delivered.
2006-04-25 21:30:58 1FYTFD-0002aU-BC <= apache@domain U=Debian-exim P=spam-scanned S=2855 id=fccbb95ffed7a6394c5a8b23b6ed0547@domain
2006-04-25 21:31:58 1FYTFD-0002aU-BC Failed to get write lock for /var/spool/exim4/db/retry.lockfile: timed out
2006-04-25 21:32:58 1FYTFD-0002aU-BC Failed to get write lock for /var/spool/exim4/db/retry.lockfile: timed out
2006-04-25 21:32:58 1FYTFD-0002aU-BC => user <address@domain> R=local_user T=mail_spool
This time I didn't call 'runq', but I did issue several 'mailq's.
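
In case it helps anyone reproduce the symptom outside of Exim, here is a
rough Python sketch of a "try for a write lock, give up after a timeout"
loop. It is only an illustration: it assumes the lock file is protected by
a POSIX fcntl-style advisory lock, which I have not verified against the
Exim source, and the function name try_write_lock and its parameters are
made up for this example.

```python
import errno
import fcntl
import os
import time


def try_write_lock(path, timeout=5.0, poll=0.1):
    """Try to take an exclusive advisory lock on `path`.

    Hypothetical helper, not Exim code. Polls with a non-blocking
    fcntl lock until `timeout` seconds have passed.
    Returns the open file descriptor on success (the lock is held
    until that descriptor is closed), or None if the wait timed out.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes the call fail at once instead of blocking.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # EACCES/EAGAIN mean another process holds the lock;
            # anything else is a real error and is re-raised.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll)
```

If a call such as try_write_lock("/var/spool/exim4/db/retry.lockfile")
returns None while a stuck process like 9934 is running, that would at
least be consistent with that process holding the lock for the whole wait.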
Greetings,
Michel