Re: [exim] Failed to get write lock for /var/spool/exim4/db/…

Author: Philip Hazel
Date:  
To: Andreas Metzler
CC: exim-users
Subject: Re: [exim] Failed to get write lock for /var/spool/exim4/db/retry.lockfile: timed out
On Sat, 15 Apr 2006, Andreas Metzler wrote:

> gets stuck with 100% CPU usage and the only way to get rid of it is to
> kill it with signal 9. While the stuck process is there, the mainlog
> keeps mentioning messages like these:
> 2006-04-04 09:13:33 1FQfgD-0002kv-EZ Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out


> Here's an example log entry of a message getting rejected (causing the
> process to go to 100% CPU):
> 2006-04-04 09:31:40 1FQ2wq-0005Kt-8R == removed@??? R=dnslookup
> T=remote_smtp defer (-44): SMTP error from remote mail server after RCPT
> TO:<removed@???>: host mail.removed.de [xx.xxx.xxx.xx]: 451 GL -
> temporary problem. Please try again later.


> I am a little bit at a loss on how to debug this. Upon asking, the
> submitter told us that the stuck process is listed as
> | 31441 tidying up after delivering 1FT0NS-0008AT-0D
> by exiwhat. According to Google there have been similar reports on
> exim-users, none of which ended with a definitive solution.


I found one bug when I first looked at this, but it isn't a processing
bug. It is just that it would always say "Failed to get write lock",
even when the failure was for a read lock. That was easily fixed.

I tried to simulate this problem by patching the code to pretend it had
failed to get a lock when trying to update the retry database while
tidying up after a 451 failure. (It is, in fact, a write lock here.)
Needless to say, I did not get a 100% CPU loop. It just did what it is
supposed to do - that is, failed to update the hints. But of course I
was using release 4.61, not 4.60.
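
To illustrate the behaviour described above, here is a minimal sketch in Python. It is not Exim's code (Exim is written in C), and the names try_lock, lock_failure_message and update_hints are hypothetical. It assumes a POSIX system and shows two things: the failure message names the kind of lock that was actually requested, and a timed-out lock simply means the hints update is skipped rather than retried in a loop.

```python
import errno
import fcntl
import os
import time


def lock_failure_message(path, write):
    # Name the lock type that was actually requested, so a failed
    # read lock is not reported as a failed write lock.
    kind = "write" if write else "read"
    return f"Failed to get {kind} lock for {path}: timed out"


def try_lock(fd, write, timeout, poll_interval=0.05):
    # Poll with a non-blocking lock request until the deadline passes.
    # (Illustrative only; a real implementation may block with a timer.)
    op = (fcntl.LOCK_EX if write else fcntl.LOCK_SH) | fcntl.LOCK_NB
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.lockf(fd, op)
            return True
        except OSError as exc:
            # EACCES/EAGAIN mean another process holds a conflicting lock.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                raise
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def update_hints(lock_path, update, timeout=1.0):
    # Hypothetical helper: run update() under a write lock, or log the
    # failure and give up without updating the hints.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        if not try_lock(fd, write=True, timeout=timeout):
            print(lock_failure_message(lock_path, write=True))
            return False
        update()
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

The point of the sketch is that a lock timeout has a bounded cost: the caller logs one line and moves on, which is why a simulated lock failure on its own would not be expected to produce a CPU-bound loop.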

I suppose we'll have to look at the configuration that was being used.
The given log had this:

> 2006-04-04 09:13:48 1FQfhL-00035E-Ay Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out
> 2006-04-04 09:14:48 1FQfhL-00035E-Ay Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out


which suggests two tries for the same message, one minute apart. How
often was the OP starting queue runners? I have a feeling this is going
to be a long haul...
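
The "one minute apart" observation can be checked mechanically. The following is a small sketch, assuming mainlog lines that start with a date, a time and a message ID, as in the excerpt above; the wrapped log lines are rejoined into single lines here, and the function names are hypothetical.

```python
from datetime import datetime

# The two entries quoted above, with the wrapped lines rejoined.
LOG_LINES = [
    "2006-04-04 09:13:48 1FQfhL-00035E-Ay Failed to get write lock for "
    "/var/spool/exim4/db/retry.lockfile: timed out",
    "2006-04-04 09:14:48 1FQfhL-00035E-Ay Failed to get write lock for "
    "/var/spool/exim4/db/retry.lockfile: timed out",
]


def parse_entry(line):
    # Split off the date, the time and the message ID; ignore the rest.
    date, clock, msg_id, _rest = line.split(" ", 3)
    stamp = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S")
    return stamp, msg_id


def gaps_by_message(lines):
    # For each message ID, collect the seconds between consecutive entries.
    last_seen = {}
    gaps = {}
    for line in lines:
        stamp, msg_id = parse_entry(line)
        if msg_id in last_seen:
            delta = (stamp - last_seen[msg_id]).total_seconds()
            gaps.setdefault(msg_id, []).append(delta)
        last_seen[msg_id] = stamp
    return gaps


print(gaps_by_message(LOG_LINES))  # {'1FQfhL-00035E-Ay': [60.0]}
```

Run over a full mainlog, the same grouping would show whether the 60-second spacing is consistent across messages, which could then be compared with how often queue runners were being started.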

Philip

--
Philip Hazel, University of Cambridge Computing Service.