> On Feb 17, 2022, at 11:41 PM, Evgeniy Berdnikov via Exim-users <exim-users@???> wrote:
>
> On Thu, Feb 17, 2022 at 07:36:38PM -0800, Michael Tratz via Exim-users wrote:
>>> On Feb 16, 2022, at 4:17 PM, Jeremy Harris via Exim-users <exim-users@???> wrote:
>>>
>>> You don't even get a single line from truss as it attaches?
>>> I wonder if the process is spinning in userland?
>>> Does "top" or similar show it?
>>
>> That stuck process is just sitting there and not doing anything. It still shows in top, but it’s just idle.
>
> Process can't be "just idle", it must have its state and data structures,
> the most interesting is stack. Process state may be displayed with "ps wchan":
>
> ps -p <pid> -o pid,wchan,cmd
PID WCHAN COMMAND
52656 sbwait /usr/local/sbin/exim -Mc 1nNo2G-000Dgw-E1
All of those processes are stuck in sbwait.
>
> Stack may be printed by debugger:
>
> gdb -p <pid> -f /path/to/exim
> (gdb) bt full
>
Here is the debugger output:
https://pastebin.com/ZbbdmpF2
Some line numbers don’t match up with the original exim 4.95 release due to patching in the FreeBSD port.
For:
#12 0x0000000000341a93 in tls_close (ct_ctx=0x8014ac208, do_shutdown=2) at ./tls-openssl.c:4400
The correct line number is 4393 in the exim 4.95 release:
https://github.com/Exim/exim/blob/fb62e7a12be6593a5432fba4a9e4468c34feef5c/src/src/tls-openssl.c#L4393
And:
#13 0x0000000000394a4f in smtp_deliver (addrlist=0x80142b4a0, host=0x80142be60, host_af=2, defport=25, interface=0x0, tblock=0x801452368, message_defer=0x7fffffffc1a4, suppress_tls=0) at smtp.c:4808
https://github.com/Exim/exim/blob/fb62e7a12be6593a5432fba4a9e4468c34feef5c/src/src/transports/smtp.c#L4819
I finally had some spare time last week to look further into those stuck exim processes on FreeBSD. I had a nice list of remote SMTP servers that always triggered the issue, regardless of whether they returned a 5xx or a 2xx. I tried a more recent git version of exim, compiled exim on FreeBSD current, and used a default configure file. None of that fixed the issue. The only thing that helped was using GnuTLS instead of OpenSSL.
I rebuilt the port and the OS with debugging symbols. As soon as the SSL_shutdown function is called, the process hangs and never exits for certain hosts. Google turned up some reports of processes getting “stuck” in SSL_shutdown, but I’m not that familiar with OpenSSL. After some research, it looks like commit 001bf8f587 ("Pipeline QUIT after data") introduced the bug in src/src/transports/smtp.c. Line 4558 in that commit is:
tls_close(sx->cctx.tls_ctx, TLS_SHUTDOWN_WAIT);
The tls_shutdown_wr function was also introduced in that commit, and it also calls SSL_shutdown. When tls_close is called after tls_shutdown_wr, the first SSL_shutdown in src/tls-openssl.c causes the process to get stuck for some remote hosts. The hosts that triggered the hang were not using pipelining. So I also tried hosts_avoid_pipelining = * with hosts that don’t have the issue, but I couldn’t get the exim process to get stuck. I don’t know why the issue happens only with certain remote SMTP servers.
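As an illustration only, the sequence described above can be modeled with a small toy state machine. It assumes the documented SSL_shutdown behaviour on a blocking socket: the first call sends our close_notify, and a later call has to wait for the peer's close_notify. The names toy_tls, toy_result and toy_shutdown are hypothetical and are not part of exim or OpenSSL; the model reports "would block" instead of actually blocking, and it is not a confirmed diagnosis of the hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one TLS connection's shutdown state.
   This is NOT OpenSSL's real data structure. */
typedef struct {
  bool sent_close_notify;   /* we have already sent our close_notify */
  bool peer_close_notify;   /* the peer's close_notify has arrived */
} toy_tls;

typedef enum {
  TOY_DONE,         /* both directions closed (SSL_shutdown would return 1) */
  TOY_HALF,         /* ours sent, peer's still missing (would return 0) */
  TOY_WOULD_BLOCK   /* a blocking socket would sit in read here */
} toy_result;

/* Assumed semantics, modeled on SSL_shutdown with a blocking socket. */
static toy_result toy_shutdown(toy_tls *c)
{
  assert(c != NULL);
  if (c->peer_close_notify)
    {
    c->sent_close_notify = true;   /* answer the peer if we had not yet */
    return TOY_DONE;
    }
  if (!c->sent_close_notify)
    {
    c->sent_close_notify = true;   /* first call: send and return at once */
    return TOY_HALF;
    }
  /* Second call with no reply from the peer: nothing left to do but wait. */
  return TOY_WOULD_BLOCK;
}
```

In this model, a peer that never sends its own close_notify and keeps the TCP connection open leaves the second call waiting indefinitely, which would be consistent with a process sitting in sbwait.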
I have added the following patch:
diff --git a/src/src/transports/smtp.c b/src/src/transports/smtp.c
index 6a979a243..f97b0c625 100644
--- a/src/src/transports/smtp.c
+++ b/src/src/transports/smtp.c
@@ -4800,7 +4800,11 @@ if (sx->send_quit || tcw_done && !tcw)
# ifdef EXIM_TCP_CORK
(void) setsockopt(sx->cctx.sock, IPPROTO_TCP, EXIM_TCP_CORK, US &on, sizeof(on));
# endif
- tls_close(sx->cctx.tls_ctx, TLS_SHUTDOWN_WAIT);
+ if (sx->send_tlsclose)
+ {
+ tls_close(sx->cctx.tls_ctx, TLS_SHUTDOWN_WAIT);
+ sx->send_tlsclose = FALSE;
+ }
sx->cctx.tls_ctx = NULL;
}
#endif
Exim has been running for about a week using this patch and I haven't experienced any issues. I don’t know if that is the correct fix or if there is a better way. But I hope it helps in figuring out the root cause of the issue.
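For readers who want to see the guard from the patch in isolation, here is a minimal standalone sketch. Only the field name send_tlsclose comes from the patch; the struct toy_conn, the close_calls counter, and the functions toy_tls_close and toy_finish are hypothetical stand-ins for the real smtp transport code, and the sketch makes no claim about where exim sets the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport's per-connection state. */
typedef struct {
  void *tls_ctx;        /* non-NULL while a TLS session is open */
  bool  send_tlsclose;  /* true while a TLS close is still owed */
  int   close_calls;    /* illustration only: how often the close ran */
} toy_conn;

/* Stand-in for tls_close(ctx, TLS_SHUTDOWN_WAIT); it only counts calls. */
static void toy_tls_close(toy_conn *c)
{
  c->close_calls++;
}

/* Same shape as the patched block: close at most once, then drop the
   context pointer regardless of whether the close ran. */
static void toy_finish(toy_conn *c)
{
  assert(c != NULL);
  if (c->tls_ctx == NULL)
    return;
  if (c->send_tlsclose)
    {
    toy_tls_close(c);
    c->send_tlsclose = false;
    }
  c->tls_ctx = NULL;
}
```

If the flag has already been cleared by an earlier shutdown step, toy_finish skips the waiting close and just forgets the context, which is the behaviour the patch relies on.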
Thanks,
Michael Tratz