[Exim] Problems with PIPELINING advertisement and multiple c…

Author: Stephen Gran
Date:
To: exim-users
Subject: [Exim] Problems with PIPELINING advertisement and multiple concurrent client connections
Hello all,

I am hoping someone can assist me here. We are running exim 4.33 on an
MX, and doing several probably silly and overcomplicated things. I only
say this as a preface so that you understand that my analysis of the
situation may be wrong, but is the best I can come up with so far:)

The MX handles a small ISP's mail, and also does aliasing and forwarding
for several hundred virtual domains, and acts as secondary MX for several
client domains (we sell you box and bandwidth, and we'll throw in
secondary MX kind of thing). We also have a sendmail box in the mix for
virtual domains that have not yet been transitioned to the exim boxes.
There is also spam and virus scanning, as well as various lookups in SQL
and LDAP for half a dozen things.

OK, enough background. One day recently, we got a call from the manager
of the ISP, saying that he had had complaints from some of the users who
had virtual domain email, and the gist of it was that if they were
getting mail at all, it was taking forever (>6 hours). This seemed
unreasonable, as even on busy days (we normally average around 65,000
emails a day, and busy days are closer to 100,000) we can deliver >98%
of emails in under a minute.

Much digging later, we found that the queue on the sendmail box was over
3000 messages. It, of course, doesn't have the ACLs or the ability to do
callouts that the exim boxes do, and so was accepting some messages that
the exim boxes reject, so it always has some queue, but normally not
that high. We did some selective grep'ing, and found that all the users
who were complaining about slow email on virtual domains had some mail
stuck in the sendmail queue. No problem, I said at first - just run
sendmail's queue with a regex, and do it in the foreground once or twice
to make sure things go OK.

So I opened half a dozen windows and ran half a dozen queue runners with
regexes (much like exim's -R, for those of you mercifully unaware of
sendmail). What I noticed was that after three or maybe four successful
deliveries, exim 450'd all further messages in that run with a
synchronization error message. The problem, from my point of view, was
that there was no synchronization error, at least within each
connection. To paraphrase the transaction, it would go:

(Reusing cached connection to exim.box)
220 exim.box
ehlo sendmail.box
250-exim.box Hello sendmail.box
250-SIZE 52428800
250-PIPELINING
250-HELP
MAIL FROM: <some@???>
RCPT TO: <legit@???>
250 OK
DATA
550 Unroutable address
503 Valid RCPT must precede DATA
RSET

(repeat 3 or 4 times)

450 synchronization error
(repeated for all remaining messages in the queue)

Sendmail then rightfully concluded from the 4xx code that there was a
temporary problem with the exim machine and put the message back in the
queue to run later. Well, running a queue every half hour and only getting
3 mails out at a time had started to cause a serious backlog.

I think the problem is somewhere near:

    case DATA_CMD:
    if (!discarded && recipients_count <= 0)
      {
      if (pipelining_advertised && last_was_rcpt)
        smtp_printf("503 valid RCPT command must precede DATA\r\n");
      else
        done = synprot_error(L_smtp_protocol_error, 503, NULL,
          US"valid RCPT command must precede DATA");
      break;
      }


I have to say that I have not fully checked how last_was_rcpt is set,
but it appears that when sendmail 'reuses a cached connection', it does
something that makes exim think the commands are not coming in the order
that I could clearly see in both the sendmail debugging output and in
tcpdump, and the syntax error count is incremented.

The coda to this long tale is that no longer advertising PIPELINING to
this one box 'fixed' all of this, the users got their mail, and the
queue is again below 1000 on the sendmail box. I am not sure whether it
is the sendmail box's problem for 'reusing a cached connection',
whatever that exactly means, or exim's fault for not properly
interpreting the commands, but I wanted to report it and ask for
comments. If you've made it this far, and have comments or ideas, I
would appreciate them.
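
For anyone wanting to try the same workaround, it amounts to something
like the following in the main section of the Exim configuration file.
This is only a sketch: 192.0.2.25 is a placeholder address standing in
for the sendmail box, not the real one, and the rest of the host list
may need adjusting for your setup.

```
# Main configuration section.
# Advertise PIPELINING to every host except the sendmail box.
# 192.0.2.25 is a placeholder for that host's address.
pipelining_advertise_hosts = ! 192.0.2.25 : *
```

With this in place the EHLO response to that host should no longer
include the 250-PIPELINING line, while other clients still see it.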

Thanks all,
--
--------------------------------------------------------------------------
|  Stephen Gran                  | BOFH excuse #221:  The mainframe needs  |
|  steve@???             | to rest.  It's getting old, you know.   |
|  http://www.lobefin.net/~steve |                                         |

--------------------------------------------------------------------------