Re: [exim] Exim4 zombie processes are not killed

Author: Phil Pennock
Date:
To: Alexander Nagel
CC: exim-users
Subject: Re: [exim] Exim4 zombie processes are not killed

On 2008-03-15 at 12:44 +0100, Alexander Nagel wrote:
> I know what zombie processes are, but the problem is that the zombies are
> not getting killed by init (pid 1). They are still there and i can kill them only with -9
> I always try -15 to prevent data loss ;-)

Init doesn't kill them. If it's a zombie, listed with 'Z' in the ps
listing, then it's already dead. There is no risk of data loss as all
open files, etc, are already closed. kill -9 on the zombie process
itself won't change its state, since it's already dead.
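
To make that concrete, here is a minimal sketch, assuming Linux (it reads
/proc) and Python 3. The names proc_state and zombie_demo are invented
for this illustration and have nothing to do with Exim:

```python
import os
import signal


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ...". The command name may itself contain
    # spaces or ')', so take the field that follows the last ')'.
    return data[data.rfind(")") + 2:].split()[0]


def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(0)                    # child: exit immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before = proc_state(pid)

    os.kill(pid, signal.SIGKILL)       # accepted, but nothing is left to kill
    state_after_kill = proc_state(pid)

    os.waitpid(pid, 0)                 # the parent reaps: the zombie goes away
    state_after_wait = proc_state(pid)
    return state_before, state_after_kill, state_after_wait


if __name__ == "__main__":
    print(zombie_demo())
```

On a Linux box this should show the child in state 'Z' both before and
after the SIGKILL, and no process at all once the parent has called
waitpid().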

The presence of a zombie on the system is not in itself a reason to
panic and go around trying to clean things up: most processes pass
through a zombie state when they exit, but it normally lasts
milliseconds at most.
The only processes which don't become zombies are those where the parent
process currently has SIG_IGN as the handler for SIGCHLD, and even that
behaviour is a non-portable assumption.
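
A small sketch of the SIG_IGN case, assuming Linux semantics and Python 3.
The function name is made up for the example, and as said above, other
platforms may behave differently:

```python
import os
import signal


def child_is_auto_reaped():
    """Return True if an exited child leaves nothing behind to wait for."""
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)                # child: exit immediately
        try:
            # Assumption (Linux behaviour): with SIGCHLD ignored, this blocks
            # until the child has exited and then fails with ECHILD, because
            # the kernel has already discarded the child.
            os.waitpid(pid, 0)
        except ChildProcessError:
            return True
        return False
    finally:
        # Restore whatever disposition was in place before.
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print(child_is_auto_reaped())
```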

A zombie is a dead process which has not been reaped by its parent.
Unless init is broken (which hasn't been managed even by the most
obscure Linux distributions, AFAIK), if init isn't reaping the process
then the reason is that init is not the parent of the zombie process,
which means that the original process which called fork() still exists.
So the question is why _that_ process hasn't reaped the child.

Reasons for that include:
 * it will do, but ps happened to gather its data in the moments
   between the process becoming a zombie and the parent reaping it
 * it will do, but the parent is designed to let children be zombies for
   a little while before reaping them; how safe this is depends upon the
   patterns of how the parent spawns children:
    * one process max?  Not worth worrying about
    * arbitrary number of processes?  Perhaps less wise
 * it will do, but the parent is single-threaded and has since run some
   other child process, which it is blocked waiting on, and that
   process has unexpectedly hung; the zombie will be reaped when the
   later process exits (a stylistic variation on the previous point)
 * sloppy programming; perhaps the parent doesn't keep track of children
   because the programmer assumed there wouldn't be many and that the OS
   can clean them up afterwards (which might, sometimes, be an okay
   assumption, making the sloppiness harmless), or the programmer didn't
   understand that signal delivery can be unreliable (several pending
   SIGCHLDs can be merged into one) and relied upon a distinct SIGCHLD
   being delivered for each and every child process dying, or some
   other poor practice.
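
The usual way to avoid that last trap is to loop inside the SIGCHLD
handler until nothing more can be collected, rather than reaping one
child per signal. A minimal sketch in Python 3 follows; it illustrates
the general technique and is not Exim's actual code, and the names
reap_children and spawn_and_reap are invented here:

```python
import os
import signal
import time

reaped = []   # (pid, raw wait status) of every child collected so far


def reap_children(signum=None, frame=None):
    """Collect every child that has exited, not just one per signal."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return                     # no children left at all
        if pid == 0:
            return                     # children exist, none has exited yet
        reaped.append((pid, status))


def spawn_and_reap(count, timeout=5.0):
    """Fork `count` short-lived children; return (spawned, reaped) PID sets."""
    start = len(reaped)
    previous = signal.signal(signal.SIGCHLD, reap_children)
    try:
        pids = set()
        for _ in range(count):
            pid = os.fork()
            if pid == 0:
                os._exit(0)            # child: exit immediately
            pids.add(pid)
        # Several children may die between two handler runs; the loop in
        # reap_children still picks all of them up.
        deadline = time.monotonic() + timeout
        while len(reaped) - start < count and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        signal.signal(signal.SIGCHLD, previous)
    return pids, {pid for pid, _ in reaped[start:]}


if __name__ == "__main__":
    spawned, collected = spawn_and_reap(5)
    print(sorted(spawned) == sorted(collected))
```

A handler that called waitpid() exactly once per signal would leave
zombies behind whenever two children exited close enough together for
their signals to be merged.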

Exim is not sloppily programmed.

Exim was written with knowledge of the various SIGCHLD semantics and
written to handle this cleanly across multiple platforms.

Exim is single-threaded and may sometimes be waiting upon child
processes as it reads output from them (popen(,"r") style). One earlier
process may be lingering, but it's not spawning off large numbers of
children per process, out of control.

So, which other program is being run by the parent process in every
case? Is that program exiting when it should, or not?

If spamc can take up to 10 seconds to exit, then an earlier spawned
process can linger for up to 10 seconds as a zombie. That's not a
problem: there's only one of them, and it will be cleaned up when spamc
exits. If the spamc process is much older than that, then yes, you'll
see more zombies.

You've yet to present evidence of an actual problem, beyond monitoring
tuned inappropriately to the environment, unless it's that the spamc
processes are running for too long. In which case, again, why is spamc
running for too long?

If you have cases where spamc has exited and there are still zombie
processes, please post an example process tree.
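
If it helps in gathering that, here is a rough sketch (Linux only, since
it walks /proc; Python 3; list_zombies is a name invented for this
example) that prints each zombie together with the parent which ought to
be reaping it:

```python
import os


def list_zombies():
    """Return (pid, ppid, command) for every zombie visible in /proc."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue                   # process vanished while we looked
        # Format is "pid (comm) state ppid ..."; comm may contain spaces
        # or ')', so split around the first '(' and the last ')'.
        open_paren = data.find("(")
        close_paren = data.rfind(")")
        fields = data[close_paren + 2:].split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            command = data[open_paren + 1:close_paren]
            zombies.append((int(entry), ppid, command))
    return zombies


if __name__ == "__main__":
    for pid, ppid, command in list_zombies():
        print(f"zombie {pid} ({command}) not yet reaped by parent {ppid}")
```

The parent PID is the interesting part: that is the process to look at
when asking why the zombie is still around.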

Unless, of course, there are cases in which the spamcheck transport can
be called from a router even when $received_protocol is already
"spam-scanned", in which case you have created a loop of Exim calling
Exim calling Exim ...

Moving the spam-checking to ACL time instead of the router/transport
approach would reduce the risk of you missing a $received_protocol check
and creating loops.

See:
http://wiki.exim.org/EximContentScanning
for guides to using ACL-based scanning/blocking.

-Phil