Re: [exim] Bug#360696: Failed to get write lock for /var/spool/exim4/db/retry.lockfile: timed out

Author: Michel Meyers
Date:
To: exim-users, 360696-quiet
Cc: 360696-submitter, Andreas Metzler
Old topics: Re: [exim] Failed to get write lock for /var/spool/exim4/db/retry.lockfile: timed out
Subject: Re: [exim] Bug#360696: Failed to get write lock for /var/spool/exim4/db/retry.lockfile: timed out

Philip Hazel wrote:
> On Fri, 21 Apr 2006, Andreas Metzler wrote:
>
> I wonder how we can track this down. There must be something different
> about Michel's system, because nobody else is reporting this, and there
> must be many cases of this kind of retrying happening to lots of people.

If there's anything I could run to debug it, please let me know (as I
can reproduce the problem pretty easily here).

>>> The given log had this:
>>>> 2006-04-04 09:13:48 1FQfhL-00035E-Ay Failed to get write lock for
>>>> /var/spool/exim4/db/retry.lockfile: timed out
>>>> 2006-04-04 09:14:48 1FQfhL-00035E-Ay Failed to get write lock for
>>>> /var/spool/exim4/db/retry.lockfile: timed out
>>> which suggests two tries for the same message, one minute apart. How
>>> often was the OP starting queue runners?
>> <the usual -q30m>
>
> Hmm. So why are there those two messages, I wonder?

Don't get too hung up on them. I do not recall the exact circumstances
of when those were generated (I might have called runq manually at the
time).

>> Note that I get those for mails that are not stuck
>
> They should just be getting read locks (and the message is wrong, as per
> the bug I found), but why are they failing? I guess the next question is
> what DBM library is in use?

I guess you mean libdb4.2 (package rev 4.2.52-23.1 is installed)?

> What kind of file system is used for
> /var/spool/exim4? I'm grasping at straws here.

/var is ext3

>> 2006-04-20 21:06:35 1FWeOO-000396-DP Spool file is locked (another
>> process is handling this message)
>
> At least *some* locking is working. :-)
>
> Does the OP have any kind of tool for looking at open files to see what
> process is using them? For example, fuser? The output of
>
> fuser /var/spool/exim4/db/retry.lockfile
>
> might be helpful.

Had to wait to get home to reproduce the problem, here's the result:

fuser /var/spool/exim4/db/retry.lockfile
/var/spool/exim4/db/retry.lockfile: 9934 9963

  ps ax | grep 9934
  9934 ?        R      1:32 /usr/sbin/exim4 -Mc 1FYTF5-0002a2-VT
10034 pts/7    R+     0:00 grep 9934

  ps ax | grep 9963
  9963 ?        S      0:00 /usr/sbin/exim4 -Mc 1FYTFD-0002aU-BC
10060 pts/7    S+     0:00 grep 9963

a little later:

fuser /var/spool/exim4/db/retry.lockfile
/var/spool/exim4/db/retry.lockfile: 9934

9934 is the stuck process. The other one was a normal message that got
delivered.
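
For anyone who wants to probe the lock without waiting for Exim's own
timeout, here is a minimal sketch in Python. It is not taken from the Exim
source. It assumes the lockfile is protected by an fcntl-style advisory
lock, and the function name try_write_lock and its timeout and poll
parameters are made up for illustration. It polls for an exclusive lock and
reports whether one could be obtained within the timeout:

```python
import errno
import fcntl
import os
import time


def try_write_lock(path, timeout=5.0, poll=0.1):
    """Poll for an exclusive fcntl lock on an existing file.

    Returns True if the lock was obtained (it is released again at once),
    False if another process held a conflicting lock until the timeout.
    Hypothetical helper: assumes fcntl-style advisory locking on path.
    """
    # Open read-write without O_CREAT so nothing is created in the spool.
    fd = os.open(path, os.O_RDWR)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                # Non-blocking attempt; fails at once if the lock is held.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno not in (errno.EACCES, errno.EAGAIN):
                    raise  # a real error, not just "lock is held"
                if time.monotonic() >= deadline:
                    return False
                time.sleep(poll)
            else:
                fcntl.lockf(fd, fcntl.LOCK_UN)
                return True
    finally:
        os.close(fd)
```

Run as a user with write access to the file, for example
try_write_lock("/var/spool/exim4/db/retry.lockfile", timeout=5). A result
of False while fuser still lists the stuck process would point at that
process holding the lock. Note that a successful probe takes the lock for a
moment itself, so it should not be run in a tight loop on a busy system.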

2006-04-25 21:30:58 1FYTFD-0002aU-BC <= apache@domain U=Debian-exim
P=spam-scanned S=2855 id=fccbb95ffed7a6394c5a8b23b6ed0547@domain
2006-04-25 21:31:58 1FYTFD-0002aU-BC Failed to get write lock for
/var/spool/exim4/db/retry.lockfile: timed out
2006-04-25 21:32:58 1FYTFD-0002aU-BC Failed to get write lock for
/var/spool/exim4/db/retry.lockfile: timed out
2006-04-25 21:32:58 1FYTFD-0002aU-BC => user <address@domain>
R=local_user T=mail_spool

This time I didn't call 'runq', but I did issue several 'mailq's.

Greetings,
        Michel
