Re: [Exim] Runaway exim with zombie children

Author: Philip Hazel
Date:
To: Greg Ward
CC: exim-users
Subject: Re: [Exim] Runaway exim with zombie children

On Wed, 20 Nov 2002, Greg Ward wrote:

> Right now, over six days later, Exim is still stewing on one of
> those 11 outgoing copies, and sucking lots of CPU in the process:

Worrying.

> Examining each of those children:
>
> $ ps -fp 11199 112{08,09,10,11,12,13,14,15}
> UID    PID  PPID  C STIME TTY STAT    TIME CMD
> root 11199     1 96 Nov14 ?   R    8593:29 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
> exim 11208 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11209 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11210 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11211 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11212 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11213 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11214 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
> exim 11215 11199  0 Nov14 ?   Z       0:00 [exim <defunct>]
>
> Process 11199 doesn't show up in the output of 'exiwhat'.

Which suggests that it is in some kind of loop where it isn't responding
to SIGUSR1. Since it hasn't reaped the zombies, my guess is that the
main process is stuck in "waiting for a subprocess".

> I'm going to kill 11199 now;

I don't suppose you thought of attaching a trace to it before killing
it, to find out where it was in the code? If this happens again, that
would provide useful information. The fact that it is the main delivery
process that is in a mess, and is not responding to SIGUSR1, suggests
that it is in some kind of system call, but that is only a guess. The
call is likely to be waitpid().

The code for "wait for subprocess" is actually quite complex and there
is a huge comment around line 2660 in deliver.c for 4.05. However,
changes were made for 4.10 to cope with the way Linux handles
subprocesses when a process is being straced. But I don't think that
would affect your case (presumably you weren't stracing).
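
For readers unfamiliar with the pattern, here is a minimal sketch of a
parent process reaping its children with waitpid(). It is not the code
from deliver.c, which is considerably more involved. The helper names
spawn_children() and reap_all_children() are hypothetical and exist only
for this illustration. Children that have exited but have not yet been
collected this way are the ones ps shows as <defunct>.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from Exim): fork n children that exit at once.
   Returns how many were actually started. Until the parent waits for
   them, each one remains a zombie (state Z in ps). */
static int spawn_children(int n)
{
    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;              /* fork failed; stop starting more */
        if (pid == 0)
            _exit(0);           /* child: exit immediately */
        started++;
    }
    return started;
}

/* Hypothetical helper (not from Exim): block until every child has been
   reaped. Returns the number collected, or -1 on an unexpected error. */
static int reap_all_children(void)
{
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);
        if (pid > 0) {
            reaped++;           /* one zombie collected */
            continue;
        }
        if (errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (errno == ECHILD)
            return reaped;      /* no children left to wait for */
        return -1;
    }
}
```

Calling spawn_children(8) and then reap_all_children() should collect
all eight children. A parent that never gets round to the second step
leaves them as zombies, as in the ps listing above.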

> My first guess is to blame my rather complex
> local_scan()

No; local_scan() gets called when a message arrives. At that point there
is only one process. This state is clearly much later, during delivery.

--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
