[exim-dev] [Bug 2273] New: Queue runners race with cutthroug…

Top Page
Delete this message
Reply to this message
Author: admin
Date:  
To: exim-dev
Subject: [exim-dev] [Bug 2273] New: Queue runners race with cutthrough delivery
https://bugs.exim.org/show_bug.cgi?id=2273

            Bug ID: 2273
           Summary: Queue runners race with cutthrough delivery
           Product: Exim
           Version: 4.91
          Hardware: x86
                OS: Linux
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Delivery in general
          Assignee: nigel@???
          Reporter: tstewart@???
                CC: exim-dev@???


Created attachment 1082
--> https://bugs.exim.org/attachment.cgi?id=1082&action=edit
Move fclose() for data_file to end of receive_msg()

I enabled cutthrough delivery on a busy Exim system, and noticed three messages
like the following in panic.log over an hour-long run:


2018-05-02 15:16:17 1fDxEf-0003ld-Dj failed to unlink
/var/spool/exim/msglog/1fDxEf-0003ld-Dj: No such file or directory


Exim was doing a queue run every minute during this test. An exigrep of that
message ID also reveals a duplicate delivery:


2018-05-02 15:16:17 1fDxEf-0003ld-Dj >> bar@??? R=r_prod T=t_prod
S=166844 H=int-mail.mydomain.com [1.2.3.4] C="250 Ok: id=Ba6g6B-VaF8wA-NF"
QT=0s DT=0s
2018-05-02 15:16:17 1fDxEf-0003ld-Dj <= foo@??? H=mail.example.com
[5.6.7.8] P=esmtps X=TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256 CV=no S=9807
id=msg123412341234@??? T="Ahh, what a fine day for science!"
2018-05-02 15:16:17 1fDxEf-0003ld-Dj Completed

+++ 1fDxEf-0003ld-Dj has not completed +++
2018-05-02 15:16:17 1fDxEf-0003ld-Dj => bar@??? R=r_prod T=t_prod
S=9957 H=int-mail.mydomain.com [1.2.3.4] C="250 Ok: id=Ba6g6B-VaF8wA-R2" QT=0s
DT=0s
2018-05-02 15:16:17 1fDxEf-0003ld-Dj failed to unlink
/var/spool/exim/msglog/1fDxEf-0003ld-Dj: No such file or directory


Downstream logs confirm that this mail was, in fact, delivered twice.

I looked a bit closer at receive_msg() and deliver_message() and saw that Exim
generally unlink()s spool files before closing them when it wants to remove a
spool file after successful delivery (or a receive error). This generally
prevents queue runners from being able to lock spool files that are actually in
use. However, for cutthrough delivery, spool files are closed and then later
unlink()ed when doing cutthrough-specific cleanup. See:

fclose() data file, freeing lock:
https://github.com/Exim/exim/blob/f04f90474bcc7b3fb1a6e03500259448de666f18/src/src/receive.c#L4252

unlink()s when cutthrough was successful:
https://github.com/Exim/exim/blob/f04f90474bcc7b3fb1a6e03500259448de666f18/src/src/receive.c#L4316

This opens a window in which a queue runner can open and lock a file and start
its own delivery, even though cutthrough delivery was already successful at
doing just that.

To close this window, I've tested a patch that simply moves the fclose() to the
end of the function, right before the final "return yield;", and initial
results are positive. Specifically, I reproduced the error with some sleep()s,
moved the fclose(), and retested. The error and dual delivery were replaced
with the expected "Spool file is locked (another process is handling this
message)". I've attached my test patch; let me know what you think.


P.S. I did some Git spelunking and it looks like
https://github.com/Exim/exim/commit/817d9f576cdfbc27cf0536be348645baf27d7836
may have introduced this bug, as it moves the unlink()s after the fclose().

--
You are receiving this mail because:
You are on the CC list for the bug.