I've got the OK to share the anonymised logs with Jeremy and will try and
get them prepared and sent to him tomorrow.
The sad news is that I've had to give up on cutthrough_delivery and turn it
off. The problems we were seeing:
- Errors going to paniclog about failing to unlink msglog files
(although these could be muted from the nightly reports by setting
the E4BCD_PANICLOG_NOISE variable in /etc/default/exim4 to a suitable
pattern);
- A variety of client systems suffering from "stuck" connections:
waiting for a response over the TCP/IP connection but not getting it, and
one end or the other eventually timing out;
- Duplicate deliveries when our Exim sometimes received enough of the
message to spool into the queue but the far end didn't get the SMTP
acknowledgement because the connection had "stuck", so held the message in
their queues to retry later;
- Unusually high numbers of active Exim processes as reported by exiwhat
and confirmed by high TCP/IP counts logged in mainlog (caused by the
"stuck" connections).
We believe we've confirmed that cutthrough_delivery is the cause:
- Checking historic connection counts shows they started climbing within
a minute of enabling cutthrough. (Not noticed at the time because our test
messages all went through. Sigh!)
- After disabling cutthrough today, connection counts (reported by
exiwhat and logged in mainlog) immediately started returning to the normal
levels of just 1 or 2 at any given time. After 10 minutes (the timeout for
stuck jobs) all was back to blissful normality.
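
For anyone wanting to run the same check against their own logs, here is a
rough sketch of how the historic counts could be pulled out of mainlog. It is
not the script we used. It assumes connection logging is enabled so that the
"SMTP connection from" lines carry the "(TCP/IP connection count = N)" suffix,
and that lines start with the default "YYYY-MM-DD HH:MM:SS" timestamp. The
function name is made up for illustration.

```python
import re
from collections import defaultdict

# Counter that Exim appends to "SMTP connection from ..." lines in mainlog
# (assumes connection logging is switched on).
COUNT_RE = re.compile(r"\(TCP/IP connection count = (\d+)\)")


def peak_counts_per_minute(lines):
    """Return {"YYYY-MM-DD HH:MM": highest connection count seen that minute}."""
    peaks = defaultdict(int)
    for line in lines:
        match = COUNT_RE.search(line)
        if match is None:
            continue  # not a connection line, ignore it
        # Assumption: default timestamp format, so the first 16 characters
        # are "YYYY-MM-DD HH:MM".
        minute = line[:16]
        peaks[minute] = max(peaks[minute], int(match.group(1)))
    return dict(peaks)
```

Feeding it an open mainlog file and printing the sorted result should make a
step change like the one we saw (climbing from 1 or 2 within a minute of
enabling cutthrough) fairly obvious, though the mainlog path and any extra
timestamp options will differ between installations.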
Cheers,
Mike B.
--
Systems Administrator & Change Manager
IT Services, University of York, Heslington, York YO10 5DD, UK
Tel: +44-(0)1904-323811
Web: www.york.ac.uk/it-services
Disclaimer: www.york.ac.uk/docs/disclaimer/email.htm