Hi,
(please no CC offlist, I'm here.)
Michael Fischer v. Mollard <lists@???> (Mo 21 Mär 2016 18:03:49 CET):
>
> # exiwhat
> […]
> 29077 delivering 1ai1bZ-0007Yw-Qf: waiting for a remote delivery subprocess
> to finish
> 29079 delivering 1ai1bZ-0007Yw-Qf to XXX [XXX] (XXX)
>
> # kill 29079
>
> # exigrep 1ai1bZ-0007Yw-Qf /var/log/exim4/mainlog
> 2016-03-21 16:18:54 [29074] 1ai1bZ-0007Yw-Qf <= XXX H=XXX [XXX] P=smtp S=17032 id=XXX
`-> Started the delivery process 29077, the process you kill @17:33:49
> 2016-03-21 16:26:04 [30608] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
> 2016-03-21 16:51:49 [3046] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
> 2016-03-21 16:54:17 [3424] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
> 2016-03-21 17:03:11 [4875] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
> 2016-03-21 17:08:48 [5951] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
> 2016-03-21 17:21:35 [9861] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
> 2016-03-21 17:26:05 [11602] 1ai1bZ-0007Yw-Qf Spool file is locked (another process is handling this message)
`-> Is what we'd expect, since 29077 is still running and holding the
lock on the spool file(s)
> 2016-03-21 17:33:49 [29077] 1ai1bZ-0007Yw-Qf == XXX R=dnslookup T=T_local_smtp defer (-1): smtp transport process returned non-zero status 0x000f: terminated by signal 15
> 2016-03-21 17:37:13 [16247] 1ai1bZ-0007Yw-Qf => XXX R=dnslookup
> T=T_local_smtp S=17197 H=XXX X=TLS1.2… CV=no C="250 OK id=1ai2pL-0001mF-Hs" QT=1h18m20s DT=2s
> 2016-03-21 17:37:13 [16247] 1ai1bZ-0007Yw-Qf Completed
`-> Ok, so your observations are right, the delivery process hangs there
for about 75 mintes (16:18 -> 17:33), until you killed it
> So if I'm not getting something terribly wrong the delevering process 29079
> which is started by the receiving process 29074 exist for about seventyfive
> minutes until I killed it manually.
Receiving process 29074
`--> Delivery process 29077
`--> SMTP transport process 29079
> After that a queue runner picked it up
> and delivered it smoothly.
Do you have any traces of the delivery attempt of 29077/29079 from
the remote site? Did the mail get through twice?
Can you use a `strace -p` on the delivery and on the transport process
next time?
I'm asking this, because some time ago we observed communication
problems between the smtp transport process and its parent (here 29079
talking back to 29077). If I remember well, the trigger were a long list
of recipients plus some lengthy TLS information.
I think, for us it resulted in crashing deliveries, but I can imagine
hanging deliveries too.
From the commit log:
commit bd21a787cdeef803334a6c7bf50d23b2a18cbd6f
Author: Wolfgang Breyha <wbreyha@???>
Date: Sun Sep 28 13:40:45 2014 +0100
Fix transport-results pipe for multiple recipients combined with certs.
The previous parsing failed when a result item split over a buffer boundary;
fix by prefixing sizes to items, and checking enough has been read as the
initial parsing stage.
....
+JH/05 Fix results-pipe from transport process. Several recipients, combined
+ with certificate use, exposed issues where response data items split
+ over buffer boundaries were not parsed properly. This eventually
+ resulted in duplicates being sent. This issue only became common enough
+ to notice due to the introduction of conection certificate information,
+ the item size being so much larger. Found and fixed by Wolfgang Breyha.
Best regards from Dresden/Germany
Viele Grüße aus Dresden
Heiko Schlittermann
--
SCHLITTERMANN.de ---------------------------- internet & unix support -
Heiko Schlittermann, Dipl.-Ing. (TU) - {fon,fax}: +49.351.802998{1,3} -
gnupg encrypted messages are welcome --------------- key ID: F69376CE -
! key id 7CBF764A and 972EAC9F are revoked since 2015-01 ------------ -