Hi!
[ This is a bit verbose, but important, I think ]
First the situation:
Two boxes, both running exim 4.32 on Debian/GNU Linux, kernel 2.4.26. First
does about 70k deliveries/day (this one starts most of the deliveries
immediately), the other about 15k/day (this one does deliveries only from queue
runners - started every 30 seconds). Both have:
log_selector = -retry_defer -skip_delivery
About a week ago I set up a cronjob (every 2 minutes) on both, which logs
current queue length, number of outgoing SMTP connections and number of running
queue runners. (exim -bpc, exiwhat|grep|wc, and netstat|grep|wc)
Now the problem:
Since then I noticed two similar problems (one on each box).
The log contained a reception of a message, but no other activity for that
message (in case of the second box - for over 2 hours, and in case of the
second - for over 6 days, even though the final retry cutoff time is 4 days!)
So I did:
exim4 -M 1Bi6Ce-0004ct-Gw
But suprisingly it returned:
Spool file is locked (another process is handling this message)
And indeed, there was a delivery process running, with an ESTABLISHED tcp
connection to a remote host. Stracing it revealed the following:
| select(7, [6], NULL, NULL, {573, 210000}) = ? ERESTARTNOHAND (To be restarted)
| --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
| time(NULL) = 1089734281
| open("/var/spool/exim4/exim-process.info", O_WRONLY|O_APPEND|O_LARGEFILE) = 8
| fcntl64(8, F_GETFD) = 0
| fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
| write(8, "2004-07-13 17:58:01 17795 delive"..., 112) = 112
| close(8) = 0
| rt_sigaction(SIGUSR1, {0x403137a0, [], SA_RESTORER|SA_RESTART, 0x40382658}, NULL, 8) = 0
| sigreturn() = ? (mask now [RTMIN])
| select(7, [6], NULL, NULL, {600, 0}
FD number 6 was the TCP socket:
lrwx------ 1 root root 64 2004-07-13 18:04 6 -> socket:[1159031]
I don't know why the timeout was 600 seconds, since I have not changed the
default, which is documented to be 5 minutes (smtp_receive_timeout), but this
seemed to be the following code in src/ip.c:ip_recv (just interrupted by the
SIGUSR1 from my cronjob).
| /* Wait until the socket is ready */
|
| for (;;)
| {
| FD_ZERO (&select_inset);
| FD_SET (sock, &select_inset);
| tv.tv_sec = timeout;
| tv.tv_usec = 0;
| rc = select(sock + 1, (SELECT_ARG2_TYPE *)&select_inset, NULL, NULL, &tv);
|
| /* If some interrupt arrived, just retry. We presume this to be rare,
| but it can happen (e.g. the SIGUSR1 signal sent by exiwhat causes
| select() to exit). */
|
| if (rc < 0 && errno == EINTR)
| {
| HDEBUG(D_any) debug_printf("EINTR while selecting for socket read\n");
| continue;
| }
Apparently the delivery process has spent 6 days hanging like this, by having
its select() timeout reset to 10 minutes by exiwhat every two minutes. Soon
after I disabled the cronjob it got caught by a router for old messages, which
catches them to another spool, and delivered to the same remote host just a few
minutes later.
In case of this message, the delivery process tried to log:
write(4, "2004-07-13 18:18:02 SMTP timeout while connected to polonez.inetia.pl [195.114.161.60] after end of data (96398 bytes written): Connection timed out\ ", 149) = 149
But I can't find this message, possibly because the logfile had beed rotated
and the fd pointed to a deleted file, which would support the theory that the
process indeed was hanging there for a long time. (On the other hand, I
remember reading somewhere that exim does a stat() and reopens a log file in
case its inode has changed).
However another message was logged (probably by the process which spawned the delivery process):
2004-07-13 18:18:02 1Bi6Ce-0004ct-Gw Unexpected error return -1 (errno = 10) from waitpid() for process 17795
(pid 17795 was the hanging process)
In case of the other message I mentioned, the message was:
2004-07-14 17:01:01 1BkjXe-0004nE-4r SMTP timeout while connected to mx3.mail.yahoo.com [64.156.215.6] after initial connection: Connection timed out
I don't exactly understand why the TCP connection wasn't simply terminated by
the remote host? 195.114.161.60 is a "Sun Internet Mail Server
sims.4.0.1999.06.13.00.20", and now that I check, it does timeout sessions with
"421 4.4.2 Timeout while waiting for command.". Or maybe it was, but the event
was never delivered by select()? Some kind of a race?
Anyway, can this be worked around at least for the systems which update the
struct timeval in case of EINTR? Or maybe even for all other, by manually
comparing time() before select() and after it's interrupted?
regards,
Marcin
--
Marcin Owsiany
porridge@???