Re: [exim] stuck exim processes

Αρχική Σελίδα
Delete this message
Reply to this message
Συντάκτης: Michael Tratz
Ημερομηνία:  
Προς: Jeremy Harris
Υ/ο: exim-users
Αντικείμενο: Re: [exim] stuck exim processes

> On Feb 16, 2022, at 1:28 PM, Jeremy Harris via Exim-users <exim-users@???> wrote:
>
> On 16/02/2022 20:42, Michael Tratz via Exim-users wrote:
>> I usually can find a few messages per day on a server so if you need me to test a patch, please let me know.
>
> Useful to know. I assume that since you're running
> with those patches, the periodic daemon restart wworkaround
> has no effect? How long has the daemon been up for?


No restarting the daemon doesn’t do anything. The only solution which works is to kill all exim processes and then restart the daemon.
exim will run through the queue and try again. The same messages which caused the delivery process to be stuck will again cause a stuck process. Once the retry rules have been exhausted (5 days) the message will be removed from the queue after a new restart and killing all exim processes. Otherwise those messages will just stick around. I had some which were in the queue for 20+ days until I noticed that issue.

This happens for messages which get a 4xx or 5xx error.

For messages which succeed with a 2xx code they stay stuck too. The exim mail queue output shows a D for delivered in the mail queue. But it never logs that the message was sent. Once you kill the stuck process, and use for example exim -d+all -M message_id the message will be removed from the mail queue. Again as soon as you use hosts_avoid_tls = * all these problem are gone. Here are some details for such a message with a 2xx code.

exiwhat | egrep -i 1nKPol-000KBV-Au
77630 delivering 1nKPol-000KBV-Au: waiting for a remote delivery subprocess to finish
77631 delivering 1nKPol-000KBV-Au to smtp.secureserver.net [68.178.213.37] (xxx@???)

ps output:

exim           77630   0.0  0.0  20620   7760  -  I    11:18        0:00.03 /usr/local/sbin/exim -Mc 1nKPol-000KBV-Au
exim           77631   0.0  0.1  20620   8588  -  I    11:18        0:00.02 /usr/local/sbin/exim -Mc 1nKPol-000KBV-Au


I tried truss to trace the system calls of both processes. 77631 is not printing anything. 77630 periodically writes:

truss -p 77630
wait4(-1,{ STOPPED,sig=127 },WNOHANG,0x0)        = 0 (0x0)
poll({ 24/POLLIN -1/POLLHUP|POLLNVAL|POLLINIGNEOF|0xe00 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 },15,60000) = 0 (0x0)
wait4(-1,{ STOPPED,sig=127 },WNOHANG,0x0)        = 0 (0x0)


mailq | grep -1 1nKPol-000KBV-Au

 4h   14K 1nKPol-000KBV-Au <xxx@???>
        D xxx@??? <mailto:spalen@crosslink.net>


The crosslink.net <http://crosslink.net/> is also the only recipient of that message. So the message should be removed from the queue after it was successfully delivered. Then killing process 77630 and 77631

exim d+all -M 1nKPol-000KBV-Au (I removed the beginning of that output)

15:20:04 80137 set_process_info: 80137 delivering specified messages
15:20:04 80137 set_process_info: 80137 delivering 1nKPol-000KBV-Au
15:20:04 80137 Trying spool file /var/spool/exim//input//1nKPol-000KBV-Au-D
15:20:04 80137 reading spool file 1nKPol-000KBV-Au-H
15:20:04 80137 user=exim uid=60 gid=6 sender=bryan@???
15:20:04 80137 sender_fullhost = ([192.168.1.2]) [68.227.83.85]:54443
15:20:04 80137 sender_rcvhost = [68.227.83.85] (port=54443 helo=[192.168.1.2])
15:20:04 80137 sender_local=0 ident=unset
15:20:04 80137 Non-recipients:
15:20:04 80137 Empty Tree
15:20:04 80137 ---- End of tree ----
15:20:04 80137 recipients_count=1
15:20:04 80137 **** SPOOL_IN - No additional fields
15:20:04 80137 body_linecount=345 message_linecount=22
15:20:04 80137 Previously delivered address xxx@??? taken from journal file
15:20:04 80137 Writing spool header file: /var/spool/exim//input//hdr.1nKPol-000KBV-Au
15:20:04 80137 DSN: **** SPOOL_OUT - address: <xxx@???> errorsto: <NULL> orcpt: <NULL> dsn_flags: 0x0
15:20:04 80137 Renaming spool header file: /var/spool/exim//input//1nKPol-000KBV-Au-H
15:20:04 80137 Size of headers = 1043
15:20:04 80137 Delivery address list:
15:20:04 80137 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
15:20:04 80137 After routing:
15:20:04 80137 Local deliveries:
15:20:04 80137 Remote deliveries:
15:20:04 80137 Failed addresses:
15:20:04 80137 Deferred addresses:
15:20:04 80137 search_tidyup called
15:20:04 80137 >>>>>>>>>>>>>>>> deliveries are done >>>>>>>>>>>>>>>>
15:20:04 80137 changed uid/gid: post-delivery tidying
15:20:04 80137 uid=60 gid=6 pid=80137
15:20:04 80137 auxiliary group list: 6
15:20:04 80137 set_process_info: 80137 tidying up after delivering 1nKPol-000KBV-Au
15:20:04 80137 Processing retry items
15:20:04 80137 Succeeded addresses:
15:20:04 80137 Failed addresses:
15:20:04 80137 Deferred addresses:
15:20:04 80137 end of retry processing
15:20:04 80137 LOG: MAIN
15:20:04 80137 Completed QT=4h1m41s
15:20:04 80137 end delivery of 1nKPol-000KBV-Au
15:20:04 80137 search_tidyup called
15:20:04 80137 search_tidyup called
15:20:04 80137 >>>>>>>>>>>>>>>> Exim pid=80137 (fresh-exec) terminating with rc=0 >>>>>>>>>>>>>>>>


The log entry looks like this:

2022-02-16 11:18:23 1nKPol-000KBV-Au <= xxx@??? H=([192.168.1.2]) [68.227.83.85]:54443 P=esmtpsa X=TLS1.2:ECDHE-RSA-AES256-GCM-SHA384:256 CV=no A=cram:xxx@??? S=14523 id=1CE7BCEE-54A5-4D63-9451-BD20AAEDDE9D@??? from <xxx@???> for xxx@???
2022-02-16 11:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 11:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 12:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 12:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 12:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 12:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 13:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 13:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 13:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 13:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 14:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 14:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 14:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 14:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 15:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 15:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 15:19:44 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked (another process is handling this message)
2022-02-16 15:20:04 1nKPol-000KBV-Au Completed QT=4h1m41s

But this issue doesn’t happen with every message which gets a 2xx code. Other messages get delivered just fine and log the missing delivery output and are removed from the queue after delivery is complete.

I have not received any complaints of messages not getting delivered. Even though there is no delivery log entry the messages seems to get delivered correctly to the recipient.

>
> If you see any indication of crashing processes, a stacktrace
> from a coredump would be of help.



The processes just sit around and do nothing. No crash. I have seen some processes hanging around since Saturday trying to deliver the message and they would stay that way if I don’t kill the process (daemon up since Saturday)

Thanks,

Michael