On Sun, May 26, 2013 at 06:15:41PM +0200, Mrten wrote:
> I'm getting a warning message for a cronjob for a few days now:
>
>
> This message was created automatically by mail delivery software.
> A message that you sent has not yet been delivered to one or more of its
> recipients after more than 4 hours on the queue on kimiko.ii.nl.
>
> The message identifier is: 1UgMol-0004Xa-Lb
> The date of the message is: Sun, 26 May 2013 05:12:01 +0200
> The subject of the message is: Cron <scripts@kimiko>
> $HOME/mysql/backup/sql2sshfs.sh
>
>
> The weird part is that all the headers indicate that the message was
> actually sent on 05:12:19 so was only on the queue for a few seconds:
>
>
> Received: from kimiko.ii.nl ([82.94.191.102]:59351)
> by mail1.ii.nl with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
> (Exim 4.76)
> id 1UgRNa-0001S8-Qo
> for scripts@???; Sun, 26 May 2013 05:12:19 +0200
> Received: from Debian-exim by kimiko.ii.nl with local (Exim 4.71)
> id 1UgRNJ-0005We-4H
> for scripts@???; Sun, 26 May 2013 05:12:01 +0200
> Date: Sun, 26 May 2013 05:12:01 +0200
These headers do NOT belong to the delayed message 1UgMol-0004Xa-Lb.
They belong to the notification message 1UgRNJ-0005We-4H cited above,
with content "Message 1UgMol-0004Xa-Lb was delayed for 4 hours".
This notification is reflected in the second set of your log records:
> 2013-05-26 05:12:01 [21244] 1UgRNJ-0005We-4H <= <> R=1UgMol-0004Xa-Lb
> U=Debian-exim P=local S=1180
> 2013-05-26 05:12:01 [21245] 1UgRNJ-0005We-4H == scripts@???
> R=dnslookup T=remote_smtp_dkim defer (-1): domain matches
> queue_smtp_domains, or -odqs set
> 2013-05-26 05:12:19 [21252] 1UgRNJ-0005We-4H => scripts@???
> R=dnslookup T=remote_smtp_dkim S=1210 H=mail1.ii.nl [82.94.191.120]*
> X=TLS1.0:RSA_AES_256_CBC_SHA1:32
> DN="2.5.4.13=#131065416d575a4872333637414c73385a46,C=NL,ST=Zuid-Holland,L=Delft,O=ii.nl
> (Chefren Hagens),CN=mail1.ii.nl,EMAIL=webmaster@???" C="250 OK
> id=1UgRNa-0001S8-Qo" QT=18s DT=1s
> 2013-05-26 05:12:19 [21252] 1UgRNJ-0005We-4H Completed QT=18s
I see nothing strange with this notification message 1UgRNJ-0005We-4H.
As for the problematic message 1UgMol-0004Xa-Lb, from the first set
of log records you can see that there is extremely large gap between
pid=[17458] of initial exim process and pid=[21242] of the second:
> 2013-05-26 05:12:01 [17458] 1UgMol-0004Xa-Lb <= scripts@??? U=scripts
> P=local S=4151
> 2013-05-26 05:12:01 [21242] 1UgMol-0004Xa-Lb == scripts@???
> R=dnslookup T=remote_smtp_dkim defer (-1): domain matches
> queue_smtp_domains, or -odqs set
> 2013-05-26 05:12:18 [21249] 1UgMol-0004Xa-Lb => scripts@???
> R=dnslookup T=remote_smtp_dkim S=4241 H=mail1.ii.nl [82.94.191.120]
> X=TLS1.0:RSA_AES_256_CBC_SHA1:32
> DN="2.5.4.13=#131065416d575a4872333637414c73385a46,C=NL,ST=Zuid-Holland,L=Delft,O=ii.nl
> (Chefren Hagens),CN=mail1.ii.nl,EMAIL=webmaster@???" C="250 OK
> id=1UgRNa-0001S8-AF" QT=4h52m15s DT=1s
> 2013-05-26 05:12:18 [21249] 1UgMol-0004Xa-Lb Completed QT=4h52m15s
This difference is anomaly, because typically exim forks forks so quickly
that 2nd process get pid greater by one then the pid of 1st process.
But in your case the difference is 3784. It means that something caused
exim to hang for a long time from the moment it records the message
submission time to the moment it could fork and write the next log record.
On the assumption that new process are created with the constant rate
we can estimate (very-very roughly) the order of this hangup time.
Records for 1UgRNJ-0005We-4H give us (21252-21245)=7 processes per 18 sec,
records for 1UgMol-0004Xa-Lb give (21249-21242)=7 processes per 17 sec,
so we have about 0.4 proc/sec, and the hangup time value is
3784/0.4=9460 sec (or 2.6 hour). Actual value could differ considerably
because we know nothing about hangup reason (swapping, disk failure, etc),
but surely time may be several hours, so QT=4h52m15s is not very fantastic.
It looks not as exim bug, but looks as a problem of your system.
--
Eugene Berdnikov