Re: [exim] stuck exim processes

Author: Jeremy Harris
Date:
To: exim-users
Subject: Re: [exim] stuck exim processes

On 16/02/2022 08:08, Martin Waschbüsch via Exim-users wrote:
> 07:45:34.638 57451 sync_responses expect rcpt for Recipient@???
> 07:45:34.638 57451   SMTP<< 550 5.1.1 <xn--recipient-wob@???>: Recipient address rejected: User unknown

far end rejected RCPT. We had pipelined MAIL, RCPT, BDAT, on a TLS conn.

> 07:45:34.638 57451 look for one response for BDAT
> 07:45:34.638 57451   SMTP<< 554 5.5.1 Error: no valid recipients

follow-on reject for the BDAT, reasonable given the RCPT situation

> 07:45:34.638 57451 ok=0 send_quit=1 send_rset=1 continue_more=0 yield=0 first_address is NULL

we think we're done with this message, and with the connection

> 07:45:34.638 57451   SMTP+> QUIT

set up the QUIT command

> 07:45:34.638 57451 cmd buf flush 6 bytes (more expected)
> 07:45:34.638 57451 tls_write(0x8011111a0, 6, more)
> 07:45:34.638 57451 tls_write((nil), 0)
> 07:45:34.638 57451 SSL_write(0x801600000, 0x801050768, 6)
> 07:45:34.638 57451 outbytes=6 error=0
> 07:45:34.638 57451   SMTP(TLS shutdown)>>

pipeline it with our side's TLS close notification

> 07:45:34.638 57451 SSL3 alert write:warning:close notify
> 07:45:34.638 57451   SMTP(shutdown)>>

Tell the socket we'll send no more data; this results in a TCP FIN send
(leaving the TCP conn in "half-closed" state; the peer could still
send data - and in fact ought to: an SMTP response to the QUIT within
the TLS session, followed by a TLS shutdown)

> 07:45:34.643 57451 Calling SSL_read(0x801600000, 0x8011101a0, 4096)
> 07:45:34.643 57451 read response data: size=15
> 07:45:34.643 57451   SMTP<< 221 2.0.0 Bye

Peer sent the response for QUIT, as expected

> 07:45:34.643 57451 Calling SSL_read(0x801600000, 0x8011101a0, 4096)
> 07:45:35.700 57451 tls_close(): shutting down TLS (with response-wait)
> 07:45:35.700 57451 tls_write((nil), 0)

waiting for the TLS close...

> 07:46:33.400 57448 polling subprocess pipes

(note the PID difference) This is the parent of the transport process,
waiting for status from there

> 07:47:33.420 57448 polling subprocess pipes
> 07:48:33.450 57448 polling subprocess pipes
> 07:49:33.520 57448 polling subprocess pipes
> 07:50:33.530 57448 polling subprocess pipes
> 07:51:33.550 57448 polling subprocess pipes
> 07:52:33.650 57448 polling subprocess pipes
> 07:53:33.730 57448 polling subprocess pipes
> 07:54:33.750 57448 polling subprocess pipes
> 07:55:33.830 57448 polling subprocess pipes
> 07:56:33.930 57448 polling subprocess pipes
> 07:57:33.950 57448 polling subprocess pipes

mmm, 10+ minutes elapsed - and 57448 has neither timed out nor
noticed that 57451 died. Or sent status, but then it should
have said *somthing* more. I wonder if it crashed (as seen in
one previous event) and 57448 didn't see it?

Aha. there was a recent for affecting FreeBSD which might
be relevant: https://bugs.exim.org/show_bug.cgi?id=2831

From
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259822
I suspect that The FreeBSD project has not picked up that
fix (and possibly also other more-recent ones).

The workaround for that particular issue was to periodically
restart the daemon. A SIGHUP suffices.
Preferable would be to incorporate fixes in the Exim you are running.

Of course, that bug might not be the one you're hitting,
but it's a possible.
--
Cheers,
Jeremy

This message is part of the following thread:
	the complete thread tree sorted by date
	Martin Waschbüsch at
	Martin Waschbüsch at