I just noticed an apparent runaway exim process with many zombie
children, and was wondering if anyone else had seen anything similar
with Exim 4.05. (Time to upgrade? Or look for bugs in my
local_scan()?)
Details: the message in question is an outgoing list post from a Mailman
2.0 list (python-dev@???). The message was initially received by
Exim at 2002-11-14 16:23:58 and duly passed on to Mailman, which turned
around and started delivering it to list members at 2002-11-14 16:24:39,
in ~11 separate messages with <= 50 recipients (Mailman's SMTP_MAX_RCPT)
each. Right now, over six days later, Exim is still stewing on one of
those 11 outgoing copies, and sucking lots of CPU in the process:
+++ 18CRTd-0002u8-00 not completed +++
2002-11-14 16:24:41 18CRTd-0002u8-00 <= python-dev-admin@??? H=localhost.localdomain (mail.python.org) [127.0.0.1] P=esmtp S=2707 id=3DD41426.7070806@???
$ ps -fp 11199
UID PID PPID C STIME TTY TIME CMD
root 11199 1 96 Nov14 ? 23:22:10 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
$ pstree -p 11199
exim(11199)-+-exim(11208)
            |-exim(11209)
            |-exim(11210)
            |-exim(11211)
            |-exim(11212)
            |-exim(11213)
            |-exim(11214)
            `-exim(11215)
Examining each of those children:
$ ps -fp 11199 112{08,09,10,11,12,13,14,15}
UID PID PPID C STIME TTY STAT TIME CMD
root 11199 1 96 Nov14 ? R 8593:29 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
exim 11208 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11209 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11210 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11211 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11212 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11213 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11214 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11215 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
Process 11199 doesn't show up in the output of 'exiwhat'.
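For anyone wanting to check their own box for the same symptom, here is a
minimal sketch that picks defunct children out of a ps listing like the
one above. It assumes the listing has PID, PPID and STAT columns in its
header; the function name zombie_children and the SAMPLE text are
illustrative only, not part of Exim or any existing tool.

```python
def zombie_children(ps_output, parent_pid):
    """Return PIDs of defunct (state Z) children of parent_pid.

    ps_output is text from a ps listing whose header line includes
    PID, PPID and STAT columns (an assumption of this sketch).
    """
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header = lines[0].split()
    pid_col = header.index("PID")
    ppid_col = header.index("PPID")
    stat_col = header.index("STAT")

    zombies = []
    for line in lines[1:]:
        # CMD is the last column and may contain spaces, so cap the split.
        fields = line.split(None, len(header) - 1)
        if len(fields) <= max(pid_col, ppid_col, stat_col):
            continue  # skip short or malformed rows
        is_child = int(fields[ppid_col]) == parent_pid
        if is_child and fields[stat_col].startswith("Z"):
            zombies.append(int(fields[pid_col]))
    return zombies


# Two child rows copied from the listing above, plus the parent row.
SAMPLE = """\
UID PID PPID C STIME TTY STAT TIME CMD
root 11199 1 96 Nov14 ? R 8593:29 /usr/local/exim4/bin/exim -Mc 18CRTd-0002u8-00
exim 11208 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
exim 11209 11199 0 Nov14 ? Z 0:00 [exim <defunct>]
"""

print(zombie_children(SAMPLE, 11199))  # -> [11208, 11209]
```

A non-empty result for a long-running parent suggests the parent is not
reaping its children, which fits a parent stuck in a busy loop.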
I'm going to kill 11199 now, but does anyone have any clue what might
cause this to happen? My first guess is to blame my rather complex
local_scan() -- which embeds a Python interpreter and calls a
non-trivial Python local_scan() -- but I thought Exim had a timeout on
local_scan(). Also, my Python local_scan() should bail out quite early,
since this message came from localhost.
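To make the timeout idea concrete, here is a generic sketch of an
alarm-based time limit around a Python call. This is not Exim's own
mechanism and not the actual local_scan() code; ScanTimeout,
run_with_timeout, slow_scan and quick_scan are hypothetical names made
up for the example. It is Unix-only and must run in the main thread,
since it relies on SIGALRM.

```python
import signal
import time


class ScanTimeout(Exception):
    """Raised when the wrapped call runs past its time limit."""


def run_with_timeout(func, seconds, *args):
    """Run func(*args); raise ScanTimeout if it takes longer than seconds."""
    def on_alarm(signum, frame):
        raise ScanTimeout("gave up after %s seconds" % seconds)

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(*args)
    finally:
        # Cancel any pending alarm and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)


def slow_scan():
    # Hypothetical stand-in for a scan that hangs.
    time.sleep(5)
    return "accept"


def quick_scan():
    # Hypothetical stand-in for a scan that bails out early.
    return "accept"


if __name__ == "__main__":
    print(run_with_timeout(quick_scan, 1))
    try:
        run_with_timeout(slow_scan, 0.2)
    except ScanTimeout as exc:
        print("timed out:", exc)
```

One caveat: Python only runs the handler between bytecodes, so a limit
like this can interrupt Python-level code but not a loop spinning inside
C code.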
Guesses, anyone?
Greg
--
Greg Ward <gward@???> http://www.gerg.ca/
Think honk if you're a telepath.