[Exim] Runaway exim with zombie children

Author: Greg Ward
Date:  
To: exim-users
Subject: [Exim] Runaway exim with zombie children
I just noticed an apparent runaway exim process with many zombie
children, and was wondering if anyone else had seen anything similar
with Exim 4.05. (Time to upgrade? Or look for bugs in my
local_scan()?)

Details: the message in question is an outgoing list post from a Mailman
2.0 list (python-dev@???). The message was initially received by
Exim at 2002-11-14 16:23:58 and duly passed on to Mailman, which turned
around and started delivering it to list members at 2002-11-14 16:24:39,
in ~11 separate messages with <= 50 recipients (Mailman's SMTP_MAX_RCPT)
each. Right now, over six days later, Exim is still stewing on one of
those 11 outgoing copies, and sucking lots of CPU in the process:

+++ 18CRTd-0002u8-00 not completed +++
2002-11-14 16:24:41 18CRTd-0002u8-00 <= python-dev-admin@??? H=localhost.localdomain (mail.python.org) [127.0.0.1] P=esmtp S=2707 id=3DD41426.7070806@???

$ ps -fp 11199
UID        PID  PPID  C STIME TTY          TIME CMD
root     11199     1 96 Nov14 ?        23:22:10 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00


$ pstree -p 11199
exim(11199)-+-exim(11208)
            |-exim(11209)
            |-exim(11210)
            |-exim(11211)
            |-exim(11212)
            |-exim(11213)
            |-exim(11214)
            `-exim(11215)


Examining each of those children:

$ ps -fp 11199 112{08,09,10,11,12,13,14,15}
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root     11199     1 96 Nov14 ?        R    8593:29 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
exim     11208 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11209 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11210 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11211 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11212 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11213 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11214 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11215 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
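
For anyone wanting to check for the same pattern on their own box, here is a minimal sketch that groups zombie PIDs by parent from ps output like the above. The function name zombies_by_parent and the embedded SAMPLE text are made up for illustration. It assumes a header line containing PID, PPID and STAT columns, and that no column before STAT contains embedded spaces, as in the listing above.

```python
from collections import defaultdict

# Hypothetical sample: a shortened copy of the ps listing in this post.
SAMPLE = """\
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root     11199     1 96 Nov14 ?        R    8593:29 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
exim     11208 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
exim     11209 11199  0 Nov14 ?        Z      0:00 [exim <defunct>]
"""


def zombies_by_parent(ps_output):
    """Return {parent_pid: [zombie_pid, ...]} from ps text with a STAT column."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header = lines[0].split()
    # Locate the columns by name rather than hard-coding positions.
    pid_i = header.index("PID")
    ppid_i = header.index("PPID")
    stat_i = header.index("STAT")
    result = defaultdict(list)
    for line in lines[1:]:
        fields = line.split()
        # A STAT value starting with "Z" marks a defunct (zombie) process.
        if fields[stat_i].startswith("Z"):
            result[int(fields[ppid_i])].append(int(fields[pid_i]))
    return dict(result)


if __name__ == "__main__":
    for parent, children in zombies_by_parent(SAMPLE).items():
        print(parent, "has", len(children), "zombie children:", children)
```

A zombie only goes away once its parent waits on it (or the parent itself exits), so a parent with many zombies that is also burning CPU is most likely spinning somewhere that never reaps its children.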


Process 11199 doesn't show up in the output of 'exiwhat'.

I'm going to kill 11199 now; but does anyone have any clue what might
cause this to happen? My first guess is to blame my rather complex
local_scan() -- which embeds a Python interpreter and calls a
non-trivial Python local_scan() -- but I thought Exim had a timeout on
local_scan(). Also, my Python local_scan() should bail out quite early,
since this message came from localhost.
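
On the timeout question: a common way to bound a call like this on Unix is an alarm signal. The sketch below shows the general technique in Python. It is not a claim about how Exim 4.05 implements its local_scan() timeout, and the names ScanTimeout, run_with_timeout and spin_forever are hypothetical. It must run in the main thread on a Unix-like system.

```python
import signal


class ScanTimeout(Exception):
    """Raised when the wrapped call runs past its time limit (hypothetical name)."""


def _on_alarm(signum, frame):
    raise ScanTimeout()


def run_with_timeout(func, seconds, *args):
    """Call func(*args), raising ScanTimeout if it runs longer than `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(*args)
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)


def spin_forever():
    """Stand-in for a scan function that never returns."""
    while True:
        pass


if __name__ == "__main__":
    try:
        run_with_timeout(spin_forever, 0.2)
    except ScanTimeout:
        print("scan timed out")
```

One caveat with this approach: Python only runs signal handlers between bytecode instructions, so a hang inside C code (for example, inside an extension module or the embedding glue) would not be interrupted by a Python-level handler like this one.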

Guesses anyone?

        Greg
--
Greg Ward <gward@???>                         http://www.gerg.ca/
Think honk if you're a telepath.