Re: [exim] Google SMTP Timeouts on large mails

Author: Jeremy Harris
Date:  
To: exim-users
Subject: Re: [exim] Google SMTP Timeouts on large mails
On 29/04/2022 10:56, Graeme Coates via Exim-users wrote:
> I have a packet capture which is available here:


> https://tinyurl.com/742s855d


Thank you so much for gathering this.

It seems to show buggy behaviour in your Debian TCP implementation
(or possibly in a software firewall).
I don't see any way that Exim could be forcing this.

Specifically, we see (multiple) retries of a TCP segment for which
we saw both the original data and the ACK from the peer (Google).

There are no SACKs, despite further ACKs after the apparently missed one
(and it being a SACK-enabled connection). This implies *no* ACKs
from that point on were received by the TCP code.


We can't tell exactly what data was involved, lacking the TLS session
keys, but given the above it's probably moot. If you care to investigate
that, see the text around "Add SSLKEYLOGFILE to keep_environment in the exim config"
and feed the resulting file to wireshark.

> The Session log from Exim in debug mode is here (with redacted hosts,
> addresses, etc) - the message was delivered to the server, and is being
> forwarded onto an email in a Google workspace account (following a
> forwarding rule in an aliases file)


> https://tinyurl.com/22nn887u


It all looks reasonable there, up to the point that the GnuTLS library
tells us "The TLS connection was non-properly terminated." - which would
follow on from the pcap-observed problem at the TCP level.

> Is it possible from these traces to pin down the issue at all and maybe come
> up with a workround (without having to turn off tcp_window_scaling) or a
> pointer as to where I need to formally raise a bug, and I'll be happy to do
> so!


You already mentioned that IPv4 versus IPv6 makes no difference.
You could try disabling TFO (but I think it's unlikely to help),
TLSv1.3 (ditto), CHUNKING (more plausible, but again it's entirely the
wrong protocol layer), or PIPELINING (ditto).

The problem going away when you disable TCP window scaling is interesting,
but it might just be shifting the point at which it bites to flows of
some other size.
Exim has no facility to set a small transmit socket buffer size, I'm afraid
(which would have the same effect, without massacring performance for the
other network users on the host).
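
To illustrate what "a small transmit socket buffer size" means at the
socket level, here is a minimal Python sketch. It is not something Exim
does or can be configured to do; the function names and the 16 KiB figure
are assumptions chosen for the example. It only shows the standard
SO_SNDBUF socket option, which limits the buffer on one socket rather than
changing a system-wide setting.

```python
import socket


def make_small_sndbuf_socket(sndbuf_bytes: int = 16 * 1024) -> socket.socket:
    """Create an unconnected TCP socket with an explicitly sized send buffer.

    The 16 KiB default is an arbitrary value for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request a specific send-buffer size for this socket only.
    # The kernel may round or adjust the requested value.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    return sock


def effective_sndbuf(sock: socket.socket) -> int:
    """Return the send-buffer size the kernel actually applied."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


if __name__ == "__main__":
    s = make_small_sndbuf_socket()
    try:
        print("effective SO_SNDBUF:", effective_sndbuf(s))
    finally:
        s.close()
```

The value read back can differ from the one requested (Linux, for
instance, reports a larger figure to allow for bookkeeping overhead), so
treat the printed number as what the kernel chose, not what was asked for.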


I guess, if ACKs are not being seen by your TCP endpoint, the socket will
still be holding un-ack'd data in the transmit queue. If you can catch that
(use "ss -panmit dport = 25") it would confirm my interpretation.

If it's the firewall that's dropping inbound TCP ACK packets, I guess there's
the possibility of configuring it to log drops.
--
Cheers,
Jeremy