Re: [exim] Bug#360696: Failed to get write lockfor /var/spo…

Startseite
Nachricht löschen
Nachricht beantworten
Autor: Philip Hazel
Datum:  
To: Michel Meyers
CC: exim-users, 360696-submitter, 360696-quiet, Andreas Metzler
Neue Treads: Re: [exim] Bug#360696: Bug#360696: Failed to get write lockfor /var/spool/exim4/db/retry.lockfile:timedout
Betreff: Re: [exim] Bug#360696: Failed to get write lockfor /var/spool/exim4/db/retry.lockfile:timed out
On Tue, 25 Apr 2006, Michel Meyers wrote:

> If there's anything I could run to debug it, please let me know (as I
> can reproduce the problem pretty easily here).


Perhaps compiling Exim with -g and taking a core dump of a stuck process
would help. If it is in a loop, that should indicate where the loop is.
You may have to do special things in order to get a dump of a setuid
process such as Exim - seek advice from Debian experts.

> Had to wait to get home to reproduce the problem, here's the result:
>
> fuser /var/spool/exim4/db/retry.lockfile
> /var/spool/exim4/db/retry.lockfile: 9934 9963
>
>   ps ax | grep 9934
>   9934 ?        R      1:32 /usr/sbin/exim4 -Mc 1FYTF5-0002a2-VT
> 10034 pts/7    R+     0:00 grep 9934

>
>   ps ax | grep 9963
>   9963 ?        S      0:00 /usr/sbin/exim4 -Mc 1FYTFD-0002aU-BC
> 10060 pts/7    S+     0:00 grep 9963

>
> a little later:
>
> fuser /var/spool/exim4/db/retry.lockfile
> /var/spool/exim4/db/retry.lockfile: 9934
>
> 9934 is the stuck process.


OK, so it really is an Exim process that has the file open. This is
*very* mysterious, because the code should close the file when it cannot
get a lock. It really is very straightforward:

sigalrm_seen = FALSE;
alarm(EXIMDB_LOCK_TIMEOUT);
rc = fcntl(dbblock->lockfd, F_SETLKW, &lock_data);
alarm(0);

  if (sigalrm_seen) errno = ETIMEDOUT;
  if (rc < 0)
    {
    log_write(0, LOG_MAIN, "Failed to get %s lock for %s: %s",
      read_only? "read" : "write", buffer,
      (errno == ETIMEDOUT)? "timed out" : strerror(errno));
    (void)close(dbblock->lockfd);
    errno = 0;       /* Indicates locking failure */
    return NULL;
    }


The very next statement after writing to the log is to close the file!
This should just exit without updating the retry database. (And I can't
believe it's looping in the log-writing function, though anything is
possible...)

> 2006-04-25 21:30:58 1FYTFD-0002aU-BC <= apache@domain U=Debian-exim
> P=spam-scanned S=2855 id=fccbb95ffed7a6394c5a8b23b6ed0547@domain
> 2006-04-25 21:31:58 1FYTFD-0002aU-BC Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out
> 2006-04-25 21:32:58 1FYTFD-0002aU-BC Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out
> 2006-04-25 21:32:58 1FYTFD-0002aU-BC => user <address@domain>
> R=local_user T=mail_spool
>
> This time I didn't call 'runq', but I did issue several 'mailq's.


A 'mailq' should not cause that. Ah, but did that message get stuck in a
loop? Looks like it did manage to do a delivery. Did it complete? If so,
that is what I would expect at one level - it failed to look at the
retry database when routing, and then failed again when trying to update
it at the end. (But we don't know why it couldn't get the lock in the
first place.)

This is all very weird. Sorry it's taking so long to sort out - but it
must be something unusual otherwise many other hosts would be having the
same trouble.

-- 
Philip Hazel            University of Cambridge Computing Service
Get the Exim 4 book:    http://www.uit.co.uk/exim-book