Re: [exim] Bug#360696: Failed to get write lockfor /var/spool/exim4/db/retry.lockfile:timed out

Autor: Philip Hazel
Data:
A: Michel Meyers
CC: exim-users, 360696-submitter, 360696-quiet, Andreas Metzler
Assumptes nous: Re: [exim] Bug#360696: Bug#360696: Failed to get write lockfor /var/spool/exim4/db/retry.lockfile:timedout
Assumpte: Re: [exim] Bug#360696: Failed to get write lockfor /var/spool/exim4/db/retry.lockfile:timed out

On Tue, 25 Apr 2006, Michel Meyers wrote:

> If there's anything I could run to debug it, please let me know (as I
> can reproduce the problem pretty easily here).

Perhaps compiling Exim with -g and taking a core dump of a stuck process
would help. If it is in a loop, that should indicate where the loop is.
You may have to do special things in order to get a dump of a setuid
process such as Exim - seek advice from Debian experts.

> Had to wait to get home to reproduce the problem, here's the result:
>
> fuser /var/spool/exim4/db/retry.lockfile
> /var/spool/exim4/db/retry.lockfile: 9934 9963
>
> ps ax | grep 9934 > 9934 ? R 1:32 /usr/sbin/exim4 -Mc 1FYTF5-0002a2-VT > 10034 pts/7 R+ 0:00 grep 9934

>
> ps ax | grep 9963 > 9963 ? S 0:00 /usr/sbin/exim4 -Mc 1FYTFD-0002aU-BC > 10060 pts/7 S+ 0:00 grep 9963

>
> a little later:
>
> fuser /var/spool/exim4/db/retry.lockfile
> /var/spool/exim4/db/retry.lockfile: 9934
>
> 9934 is the stuck process.

OK, so it really is an Exim process that has the file open. This is
*very* mysterious, because the code should close the file when it cannot
get a lock. It really is very straightforward:

sigalrm_seen = FALSE;
alarm(EXIMDB_LOCK_TIMEOUT);
rc = fcntl(dbblock->lockfd, F_SETLKW, &lock_data);
alarm(0);

  if (sigalrm_seen) errno = ETIMEDOUT;
  if (rc < 0)
    {
    log_write(0, LOG_MAIN, "Failed to get %s lock for %s: %s",
      read_only? "read" : "write", buffer,
      (errno == ETIMEDOUT)? "timed out" : strerror(errno));
    (void)close(dbblock->lockfd);
    errno = 0;       /* Indicates locking failure */
    return NULL;
    }

The very next statement after writing to the log is to close the file!
This should just exit without updating the retry database. (And I can't
believe it's looping in the log-writing function, though anything is
possible...)

> 2006-04-25 21:30:58 1FYTFD-0002aU-BC <= apache@domain U=Debian-exim
> P=spam-scanned S=2855 id=fccbb95ffed7a6394c5a8b23b6ed0547@domain
> 2006-04-25 21:31:58 1FYTFD-0002aU-BC Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out
> 2006-04-25 21:32:58 1FYTFD-0002aU-BC Failed to get write lock for
> /var/spool/exim4/db/retry.lockfile: timed out
> 2006-04-25 21:32:58 1FYTFD-0002aU-BC => user <address@domain>
> R=local_user T=mail_spool
>
> This time I didn't call 'runq', but I did issue several 'mailq's.

A 'mailq' should not cause that. Ah, but did that message get stuck in a
loop? Looks like it did manage to do a delivery. Did it complete? If so,
that is what I would expect at one level - it failed to look at the
retry database when routing, and then failed again when trying to update
it at the end. (But we don't know why it couldn't get the lock in the
first place.)

This is all very weird. Sorry it's taking so long to sort out - but it
must be something unusual otherwise many other hosts would be having the
same trouble.

-- 
Philip Hazel            University of Cambridge Computing Service
Get the Exim 4 book:    http://www.uit.co.uk/exim-book

Aquest missatge és part del següent fil:
	l'arbre de fils complet ordenat per data
	Michel Meyers en
	Michel Meyers en

Re: [exim] Bug#360696: Failed to get write lockfor /var/spo…