Re: [exim] stuck exim processes

Author: Michael Tratz
Date:  
To: exim-users
Subject: Re: [exim] stuck exim processes

> On Feb 17, 2022, at 11:41 PM, Evgeniy Berdnikov via Exim-users <exim-users@???> wrote:
>
> On Thu, Feb 17, 2022 at 07:36:38PM -0800, Michael Tratz via Exim-users wrote:
>>> On Feb 16, 2022, at 4:17 PM, Jeremy Harris via Exim-users <exim-users@???> wrote:
>>>
>>> You don't even get a single line from truss as it attaches?
>>> I wonder if the process is spinning in userland?
>>> Does "top" or similar show it?
>>
>> That stuck process is just sitting there and not doing anything. It still shows in top, but it’s just idle.
>
> Process can't be "just idle", it must have its state and data structures,
> the most interesting is stack. Process state may displayed with "ps wchan":
>
> ps -p <pid> -o pid,wchan,cmd


PID WCHAN COMMAND
52656 sbwait /usr/local/sbin/exim -Mc 1nNo2G-000Dgw-E1

All of those processes are stuck in sbwait, i.e. blocked waiting on a socket buffer.

>
> Stack may be printed by debugger:
>
> gdb -p <pid> -f /path/to/exim
> (gdb) bt full
>


Here is the debugger output:

https://pastebin.com/ZbbdmpF2

Some line numbers don’t match up with the original exim 4.95 release due to patching in the FreeBSD port.

For:

#12 0x0000000000341a93 in tls_close (ct_ctx=0x8014ac208, do_shutdown=2) at ./tls-openssl.c:4400

The correct line number is 4393 in the exim 4.95 release:

https://github.com/Exim/exim/blob/fb62e7a12be6593a5432fba4a9e4468c34feef5c/src/src/tls-openssl.c#L4393 <https://github.com/Exim/exim/blob/7980dd8917020521479f2bb28a2363e76fb551e2/src/src/tls-openssl.c#L4530>

And:

#13 0x0000000000394a4f in smtp_deliver (addrlist=0x80142b4a0, host=0x80142be60, host_af=2, defport=25, interface=0x0, tblock=0x801452368, message_defer=0x7fffffffc1a4, suppress_tls=0) at smtp.c:4808

https://github.com/Exim/exim/blob/fb62e7a12be6593a5432fba4a9e4468c34feef5c/src/src/transports/smtp.c#L4819

Last week I finally had some spare time to look further into those stuck exim processes on FreeBSD. I had a list of remote SMTP servers that reliably triggered the issue, regardless of whether they replied with a 5xx or a 2xx. I tried a more recent git version of exim, compiled exim on FreeBSD-CURRENT, and used a default configure file. None of that fixed the issue. The only thing that helped was using GnuTLS instead of OpenSSL.

I rebuilt the port and the OS with debugging symbols. For certain hosts, as soon as the SSL_shutdown function is called, the process never finishes shutting down. Google turned up some reports of processes getting “stuck” in SSL_shutdown, but I’m not that familiar with OpenSSL. After some research, it looks like commit 001bf8f587 (“Pipeline QUIT after data”) introduced the bug in src/src/transports/smtp.c, at line 4558 as of that commit:

tls_close(sx->cctx.tls_ctx, TLS_SHUTDOWN_WAIT);

The tls_shutdown_wr function was also introduced in that commit, and it also calls SSL_shutdown. When tls_close is called after tls_shutdown_wr, the first SSL_shutdown in src/tls-openssl.c causes the process to get stuck for some remote hosts. The hosts that triggered the hang did not seem to be using pipelining, so I also tried hosts_avoid_pipelining = * against hosts that don’t show the issue, but I couldn’t get the exim process to hang that way. I don’t know why the issue happens with only certain remote SMTP servers.
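
To illustrate the suspected failure mode, here is a minimal C sketch. It assumes OpenSSL's documented two-step shutdown, where the first SSL_shutdown call sends our close_notify and a later call on a blocking socket waits to read the peer's close_notify. If the peer never sends one and never closes the connection, that read can block indefinitely, which would match the sbwait state above. The sketch does not use OpenSSL or any exim code: wait_for_peer_close is a hypothetical helper that only shows how such a wait could be bounded with poll().

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of exim or OpenSSL.
   Wait up to timeout_ms for the peer to send something or to close
   the connection.  Returns 1 if the socket became readable (data or
   EOF), 0 if the timeout expired, -1 on a poll() error. */
int
wait_for_peer_close(int fd, int timeout_ms)
{
struct pollfd pfd = { .fd = fd, .events = POLLIN };
int rc;

assert(timeout_ms >= 0);

/* Retry if a signal interrupts poll().  This restarts the full
   timeout, which is good enough for a sketch. */
do
  rc = poll(&pfd, 1, timeout_ms);
while (rc < 0 && errno == EINTR);

if (rc < 0) return -1;
return rc > 0 ? 1 : 0;
}
```

With a socketpair() standing in for the SMTP connection, a silent peer makes this return 0 after the timeout instead of hanging, and a peer that closes its end makes it return 1. Whether exim should bound the wait like this or skip the second shutdown entirely is a separate question.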

I have added the following patch:

diff --git a/src/src/transports/smtp.c b/src/src/transports/smtp.c
index 6a979a243..f97b0c625 100644
--- a/src/src/transports/smtp.c
+++ b/src/src/transports/smtp.c
@@ -4800,7 +4800,11 @@ if (sx->send_quit || tcw_done && !tcw)
 # ifdef EXIM_TCP_CORK
     (void) setsockopt(sx->cctx.sock, IPPROTO_TCP, EXIM_TCP_CORK, US &on, sizeof(on));
 # endif
-    tls_close(sx->cctx.tls_ctx, TLS_SHUTDOWN_WAIT);
+    if (sx->send_tlsclose)
+      {
+      tls_close(sx->cctx.tls_ctx, TLS_SHUTDOWN_WAIT);
+      sx->send_tlsclose = FALSE;
+      }
     sx->cctx.tls_ctx = NULL;
     }
 #endif
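
The intent of the guard in the patch can be shown in isolation with a small C sketch. Everything in it is a hypothetical stand-in: struct conn is not exim's real SMTP context, and fake_tls_close only counts calls instead of doing any TLS work. It assumes that some earlier step has cleared send_tlsclose once a TLS shutdown was already sent, which is my reading of the code and not something I have confirmed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of the SMTP context. */
struct conn {
  bool send_tlsclose;   /* true while a TLS close is still owed */
  int  close_calls;     /* how many times the blocking close ran */
};

/* Stand-in for tls_close(..., TLS_SHUTDOWN_WAIT); it only counts. */
void
fake_tls_close(struct conn *c)
{
c->close_calls++;
}

/* Same shape as the patched code: run the waiting close at most
   once, and only if it is still owed. */
void
close_tls_once(struct conn *c)
{
if (c->send_tlsclose)
  {
  fake_tls_close(c);
  c->send_tlsclose = false;
  }
}
```

Calling close_tls_once twice on a connection that starts with send_tlsclose set runs the close exactly once, and a connection whose flag was already cleared never reaches the waiting close at all.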


Exim has been running with this patch for about a week and I haven’t experienced any issues. I don’t know whether this is the correct fix or whether there is a better way, but I hope it helps in figuring out the root cause of the issue.

Thanks,

Michael Tratz