Re: [Exim] Runaway exim with zombie children

Author: Philip Hazel
Date:  
To: Greg Ward
CC: exim-users
Subject: Re: [Exim] Runaway exim with zombie children
On Wed, 20 Nov 2002, Greg Ward wrote:

> Right now, over six days later, Exim is still stewing on one of
> those 11 outgoing copies, and sucking lots of CPU in the process:


Worrying.

> Examining each of those children:
>
> $ ps -fp 11199 112{08,09,10,11,12,13,14,15}
> UID        PID  PPID  C STIME TTY      STAT   TIME CMD
> root     11199     1 96 Nov14 ?        R    8593:29 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
> exim     11208 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11209 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11210 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11211 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11212 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11213 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11214 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
> exim     11215 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
>
> Process 11199 doesn't show up in the output of 'exiwhat'.


Which suggests that it is in some kind of loop where it isn't responding
to SIGUSR1. Since it hasn't reaped the zombies, my guess is that the
main process is stuck in "waiting for a subprocess".
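
For readers less familiar with why the children show up as <defunct>: a terminated child stays in the process table as a zombie until its parent collects its exit status with wait() or waitpid(). The following is a minimal sketch of that mechanism, not code from Exim; the function names spawn_children() and reap_children() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork n children that exit at once. Until the parent reaps them, each one
   appears in ps as a zombie. Returns n, or -1 if a fork() fails. */
int spawn_children(int n)
{
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);               /* child: terminate immediately */
    }
    return n;
}

/* Reap every child of this process; returns how many were collected. */
int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);   /* block until any child exits */
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        break;                      /* ECHILD: no children left */
    }
    return reaped;
}
```

A parent that calls spawn_children(8) and never reaches reap_children() would leave eight defunct entries, much like the listing quoted above.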

> I'm going to kill 11199 now;


I don't suppose you thought of attaching a trace to it before killing
it, to find out where it was in the code? If this happens again, that
would provide useful information. The fact that it is the main delivery
process that is in a mess, and is not responding to SIGUSR1, suggests
that it is in some kind of system call, but that is only a guess. The
call is likely to be waitpid().
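
As background on what "not responding to SIGUSR1" can mean in general: one common pattern is for the signal handler to do nothing but set a flag, which the main code acts on the next time it checks. A process that never gets back to the check then looks unresponsive. The sketch below shows that generic pattern only. It is an assumption for illustration and is not claimed to be how Exim handles SIGUSR1; the names install_status_handler() and take_status_request() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, cleared by the main code. */
static volatile sig_atomic_t status_requested = 0;

static void on_sigusr1(int sig)
{
    (void)sig;
    status_requested = 1;           /* only record the request */
}

/* Install the SIGUSR1 handler; returns 0 on success, -1 on failure. */
int install_status_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Called from the main code: returns 1 if a request was pending and
   clears it, otherwise returns 0. */
int take_status_request(void)
{
    if (!status_requested)
        return 0;
    status_requested = 0;
    return 1;
}
```

With this design the request is only answered when the main code next calls take_status_request(), so a process spinning somewhere else would stay silent.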

The code for "wait for subprocess" is actually quite complex and there
is a huge comment around line 2660 in deliver.c for 4.05. However,
changes were made for 4.10 to cope with the way Linux handles
subprocesses when a process is being straced. But I don't think that
would affect your case (presumably you weren't stracing).
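
To illustrate one general way of waiting for a subprocess without risking an unbounded block: waitpid() with WNOHANG returns 0 straight away if the child has not yet changed state, so the caller can poll a limited number of times. This is a sketch under my own assumptions and is not the logic in deliver.c; wait_child_bounded() and the 10 ms pause are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Poll for child `pid` at most max_polls times, pausing 10 ms after each
   unsuccessful poll. Returns 1 if the child was reaped, 0 if it was still
   running after the last poll, and -1 on error (for example ECHILD). */
int wait_child_bounded(pid_t pid, int max_polls)
{
    struct timespec pause = { 0, 10 * 1000 * 1000 };    /* 10 ms */

    for (int i = 0; i < max_polls; i++) {
        pid_t r = waitpid(pid, NULL, WNOHANG);  /* never blocks */
        if (r == pid)
            return 1;
        if (r < 0 && errno != EINTR)
            return -1;
        nanosleep(&pause, NULL);
    }
    return 0;
}
```

A return value of 0 gives the caller a chance to log what it is waiting for, or to give up, instead of sitting in the wait indefinitely.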

> My first guess is to blame my rather complex
> local_scan()


No; local_scan() gets called when a message arrives. At that point there
is only one process. This state is clearly much later, during delivery.


--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.