[exim] occasional SIGSEGV in exim4 under heavy load

Author: Chris Lightfoot
Date:
To: exim-users
Subject: [exim] occasional SIGSEGV in exim4 under heavy load

(I've already submitted this as a bug in the Debian exim
package, but I've now reproduced it on stock 4.62, so it
may be of interest to this list as well.)

Under heavy SMTP load, we occasionally observe the exim4
daemon crashing (with the result that no further
connections can be accepted, obviously). We can reproduce
this here with the `postal' SMTP benchmark (package
postal) and the following command-line:

    postal -p 10 -c 10 -m 1 localhost users -

(running on the host running exim4). The file `users'
contains a single line with an email address; in this case
I used a local address aliased to /dev/null in
/etc/aliases. While running these tests I had the whole of
/var/spool/exim4 mounted on a tmpfs (to simulate a
hardware configuration with very fast disk); since the bug
is likely timing-related, you may have to do the same to
reproduce it. Here the crash typically shows up within the
first five minutes of postal's run.

Here's a stack trace from the exim4 daemon when it
crashes:

    #0  0x00000000 in ?? ()
    #1  0x40361825 in __pthread_sighandler () from /lib/libpthread.so.0
    #2  <signal handler called>
    #3  0x403d05d9 in __libc_sigaction () from /lib/libc.so.6
    #4  0x4035e828 in sigaction () from /lib/libpthread.so.0
    #5  0x080866c5 in os_non_restarting_signal (sig=17, handler=0x805c930 <main_sigchld_handler>) at os.c:267
    #6  0x0805e9f3 in daemon_go () at daemon.c:1842
    #7  0x0806e06b in main (argc=3, cargv=0xbfffdbc4) at exim.c:3922

-- for reasons related to the Debian package, the line
numbers in os.c in that trace don't correspond to those in
the official source tree. The corresponding line in the
official source is os.c:103.

Looking at the backtrace, it appears that a signal
(presumably SIGCHLD) arrived while os_non_restarting_signal
was running. The SIGCHLD handler itself calls
os_non_restarting_signal, and a crash results. I'm not sure
why, though -- there's nothing in the code for that function
that's obviously non-reentrant (it only uses automatic
variables and calls sigaction(2), which is
async-signal-safe).
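
To make the hypothesis above concrete, here is a minimal sketch of the
pattern, not exim's actual code. The names install_non_restarting and
on_sigchld are hypothetical stand-ins for os_non_restarting_signal and
main_sigchld_handler. What the real handler installs is not stated in
this message, so re-installing itself is an assumption made purely for
illustration.

```c
/* Ask for POSIX declarations when compiling in a strict ISO C mode. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for os_non_restarting_signal(): install a
   handler with sa_flags = 0, i.e. without SA_RESTART. Only automatic
   variables and sigaction() are used, as in the function discussed. */
static void install_non_restarting(int sig, void (*handler)(int))
{
    struct sigaction act;

    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(sig, &act, NULL);
}

/* Set by the handler so that the main flow can notice the signal. */
static volatile sig_atomic_t sigchld_seen = 0;

/* Hypothetical stand-in for main_sigchld_handler(). It calls the
   installer itself (assumption: it re-installs itself), so the
   installer can be entered a second time if the signal is delivered
   while the main flow is already inside it. */
static void on_sigchld(int sig)
{
    install_non_restarting(sig, on_sigchld);
    sigchld_seen = 1;
}
```

In a single-threaded program, calling install_non_restarting(SIGCHLD,
on_sigchld) and then raise(SIGCHLD) runs the handler once and sets
sigchld_seen. The case of interest is the one where the signal is
delivered while the main flow is itself inside install_non_restarting,
which this sketch makes possible but does not force.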

Note that exim in this case is linked against -lpthread,
presumably because it's also linked against -lpq. The
problem does not occur (or, if it does occur, does so
sufficiently rarely as not to have been caught by my
tests) in an exim which is not linked against -lpthread.

The following patch to src/os.c blocks the signal for which
a handler is being installed for the duration of the call to
sigaction. It appears to fix the problem, which is at least
consistent with the above hypothesis, though it is not a
great fix.

--- os.c.orig   2006-07-11 18:02:09.000000000 +0100
+++ os.c        2006-07-11 18:05:15.000000000 +0100
@@ -261,11 +261,20 @@

#ifdef SA_RESTART
struct sigaction act;
+sigset_t mask, curmask;
+
+sigemptyset(&mask);
+sigprocmask(SIG_BLOCK, &mask, &curmask);
+sigaddset(&mask, sig);
+sigprocmask(SIG_SETMASK, &mask, NULL);
+
act.sa_handler = handler;
sigemptyset(&(act.sa_mask));
act.sa_flags = 0;
sigaction(sig, &act, NULL);

+sigprocmask(SIG_SETMASK, &curmask, NULL);
+
#ifdef STAND_ALONE
printf("Used sigaction() with flags = 0\n");
#endif

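-- although looking at that patch now I'm left wondering,
what *was* I thinking? As posted, it saves the current mask
and then uses SIG_SETMASK to install a mask containing only
sig, which also unblocks any signal that was blocked before
the call. What I meant was,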

    sigemptyset(&mask);
    sigaddset(&mask, sig);
    sigprocmask(SIG_BLOCK, &mask, &curmask);

    /* ... */

    sigprocmask(SIG_SETMASK, &curmask, NULL);

or similar. Actually I don't think there'd be any harm in
blocking all signals over the call to sigaction. I haven't
tried that though.
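
For illustration, the untried "block all signals" variant might look
roughly like the sketch below. It is an assumption-laden example, not a
tested fix for exim: install_with_signals_blocked is a hypothetical
name, and the error handling is something added here, not part of the
patch above.

```c
/* Ask for POSIX declarations when compiling in a strict ISO C mode. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Sketch only: install `handler` for `sig` without SA_RESTART, with
   every blockable signal blocked for the duration of the sigaction()
   call. Returns 0 on success, -1 on failure. */
static int install_with_signals_blocked(int sig, void (*handler)(int))
{
    struct sigaction act;
    sigset_t all, saved;
    int rc;

    /* Add all signals to the current mask and remember the old mask.
       SIGKILL and SIGSTOP cannot be blocked and are silently skipped. */
    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;

    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    rc = sigaction(sig, &act, NULL);

    /* Restore the caller's mask whether or not sigaction() succeeded;
       signals that became pending in the meantime are delivered now. */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0)
        return -1;
    return rc;
}
```

One caveat on the design: POSIX leaves sigprocmask unspecified in a
process with more than one thread, where pthread_sigmask is the call to
use. Whether that matters here depends on whether the daemon actually
runs more than one thread or is merely linked against -lpthread.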

--
``Nothing so gives the illusion of intelligence
as personal association with large sums of money.''
(John Kenneth Galbraith)
