[exim] exim -qTIME not starting queue runners on some Linux …

Top Page
Delete this message
Reply to this message
Author: Lutz Preßler
Date:  
To: exim-users
CC: poeml
Subject: [exim] exim -qTIME not starting queue runners on some Linux system
Hello,

I see a very strange behaviour with exim on one x86_64
SLES10 Linux system. At the moment a daemon (exim -q15m -bd
or for tests exim -q15s) does not start the periodic queue runners.
Analysing the logs shows that this misfeature started after
updating to
http://download.opensuse.org/repositories/server:/mail/SLE_10/x86_64/exim-4.67-14.1.x86_64.rpm
and restarting the exim daemon afterwards (12 days ago).
The queue runners were started correctly with the previous
also 4.67 based package.
Release does not seem to be the cause anyway, as I tried
with different binary packages (including 4.51 from openSUSE 10.1
distribution) and manually built from streight exim.org sources
(4.67, 4.66) - it's always the same on this running system.
I cannot/don't want to reboot it at the moment. Uptime is
much longer than the start of this problem, as is last glibc
update.

Now some debugging data.

problem-systemi:~ # exim -d -q15s
Exim version 4.67 uid=0 gid=0 pid=23684 D=fbb95cfd
Berkeley DB: Sleepycat Software: Berkeley DB 4.3.29: (June 16, 2006)
Support for: crypteq iconv() IPv6 OpenSSL move_frozen_messages Content_Scanning Old_Demime
[...]
originator: uid=0 gid=0 login=root name=root
23721 changed uid/gid: running as a daemon
23721 uid=8 gid=12 pid=23721
23721 auxiliary group list: 12
23721 LOG: MAIN
23721 exim 4.67 daemon started: pid=23721, -q15s, not listening for SMTP
23721 set_process_info: 23721 daemon: -q15s, not listening
23721 daemon running with uid=8 gid=12 euid=8 egid=12
23721 SIGALRM received
23722 Starting queue-runner: pid 23722
23722 exec /usr/sbin/exim -q
23721 1 queue-runner process running
23721 child 23722 ended: status=0x0
23721 0 queue-runner processes now running
and that's it there - nothing more happens


ok-system:~ # exim -d -q15s
Exim version 4.67 uid=0 gid=0 pid=13291 D=fbb95cfd
Berkeley DB: Sleepycat Software: Berkeley DB 4.3.29: (June 16, 2006)
Support for: crypteq iconv() IPv6 OpenSSL move_frozen_messages Content_Scanning Old_Demime
[...]
originator: uid=0 gid=0 login=root name=root
13311 changed uid/gid: running as a daemon
13311 uid=8 gid=12 pid=13311
13311 auxiliary group list: 12
13311 LOG: MAIN
13311 exim 4.67 daemon started: pid=13311, -q15s, not listening for SMTP
13311 set_process_info: 13311 daemon: -q15s, not listening
13311 daemon running with uid=8 gid=12 euid=8 egid=12
13311 SIGALRM received
13312 Starting queue-runner: pid 13312
13312 exec /usr/sbin/exim -q
13311 1 queue-runner process running
13311 child 13312 ended: status=0x0
13311 0 queue-runner processes now running
13311 SIGALRM received
13313 Starting queue-runner: pid 13313
13313 exec /usr/sbin/exim -q
13311 1 queue-runner process running
13311 child 13313 ended: status=0x0
13311 0 queue-runner processes now running

last 6 lines repeated every 15 seconds as it should be.


Output of "strace exim -d -q15s" (for around 40s)
starting with the first or only output of "SIGALARM received"
(which is simulated in daemon.c) shows the underlying difference:

problematic-system:
write(2, "24249 daemon running with uid=8 "..., 5424249 daemon running with uid=8 gid=12 euid=8 egid=12) = 54
write(2, "24249 SIGALRM received\n", 2324249 SIGALRM received) = 23
clone(24250 Starting queue-runner: pid 24250
24250 exec /usr/sbin/exim -q
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b206d4102f0) = 24250
write(2, "24249 1 queue-runner process run"..., 3724249 1 queue-runner process running) = 37
alarm(15)                               = 0
select(0, NULL, NULL, NULL, {15, 0})    = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigaction(SIGCHLD, {SIG_DFL}, NULL, 8) = 0
rt_sigreturn(0x11)                      = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 24250
write(2, "24249 child 24250 ended: status="..., 3624249 child 24250 ended: status=0x0
) = 36
write(2, "24249 0 queue-runner processes n"..., 4324249 0 queue-runner processes now running
) = 43
wait4(-1, 0x7fff3eaa12f4, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigaction(SIGCHLD, {0x555555576310, [], SA_RESTORER, 0x2b206cfdfbf0}, NULL, 8) = 0
select(0, NULL, NULL, NULL, {15, 0})    = 0 (Timeout)
wait4(-1, 0x7fff3eaa12f4, WNOHANG, NULL) = -1 ECHILD (No child processes)
select(0, NULL, NULL, NULL, {15, 0})    = 0 (Timeout)
wait4(-1, 0x7fff3eaa12f4, WNOHANG, NULL) = -1 ECHILD (No child processes)
select(0, NULL, NULL, NULL, {15, 0} <unfinished ...>



ok-system:
write(2, "13379 daemon running with uid=8 "..., 5413379 daemon running with uid=8 gid=12 euid=8 egid=12) = 54
write(2, "13379 SIGALRM received\n", 2313379 SIGALRM received) = 23
clone(13380 Starting queue-runner: pid 13380
13380 exec /usr/sbin/exim -q
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b2f73a3c2f0) = 13380
write(2, "13379 1 queue-runner process run"..., 3713379 1 queue-runner process running) = 37
alarm(15)                               = 0
select(0, NULL, NULL, NULL, {15, 0})    = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigaction(SIGCHLD, {SIG_DFL}, NULL, 8) = 0
rt_sigreturn(0x11)                      = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 13380
write(2, "13379 child 13380 ended: status="..., 3613379 child 13380 ended: status=0x0) = 36
write(2, "13379 0 queue-runner processes n"..., 4313379 0 queue-runner processes now running) = 43
wait4(-1, 0x7fff38477f24, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigaction(SIGCHLD, {0x555555576310, [], SA_RESTORER, 0x2b2f73609c10}, NULL, 8) = 0
select(0, NULL, NULL, NULL, {15, 0})    = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigaction(SIGALRM, {0x555555585ca0, [], SA_RESTORER, 0x2b2f73609c10}, NULL, 8) = 0
rt_sigreturn(0xe)                       = -1 EINTR (Interrupted system call)
wait4(-1, 0x7fff38477f24, WNOHANG, NULL) = -1 ECHILD (No child processes)
write(2, "13379 SIGALRM received\n", 2313379 SIGALRM received) = 23
clone(13381 Starting queue-runner: pid 13381
13381 exec /usr/sbin/exim -q
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b2f73a3c2f0) = 13381
write(2, "13379 1 queue-runner process run"..., 3713379 1 queue-runner process running) = 37
alarm(15)                               = 0
select(0, NULL, NULL, NULL, {15, 0})    = ? ERESTARTNOHAND (To be restarted)
... repeating...


Any idea why SIGALRM is not properly delivered on the problematic
system?

Thanks,
Lutz

-- 
Lutz Preßler  <Lutz.Pressler@???>    http://www.SerNet.DE/
SerNet Service Network GmbH, Bahnhofsallee 1b, D-37081 Göttingen
Tel.: +49-551-370000-2,      FAX: +49-551-370000-9
AG Göttingen, HRB 2816,      GF: Dr. Johannes Loxen