[exim-dev] [Bug 1424] New: Exim stuck on $sender_host_name i…

Top Page
Delete this message
Reply to this message
Author: Dreas van Donselaar
Date:  
To: exim-dev
New-Topics: [exim-dev] [Bug 1424] Exim stuck on $sender_host_name in notquit ACL, [exim-dev] [Bug 1424] Exim stuck on $sender_host_name in notquit ACL, [exim-dev] [Bug 1424] Exim stuck on $sender_host_name in notquit ACL, [exim-dev] [Bug 1424] Exim stuck on $sender_host_name in notquit ACL, [exim-dev] [Bug 1424] Exim stuck on $sender_host_name in notquit ACL, [exim-dev] [Bug 1424] Exim stuck on $sender_host_name in notquit ACL
Subject: [exim-dev] [Bug 1424] New: Exim stuck on $sender_host_name in notquit ACL
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1424
           Summary: Exim stuck on $sender_host_name in notquit ACL
           Product: Exim
           Version: 4.82
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Logging
        AssignedTo: tlyons@???
        ReportedBy: dreas@???
                CC: exim-dev@???



When trying to write $sender_host_name to the logs in the notquit ACL, the Exim
process sometimes gets stuck and hangs forever with:
===
mx11:~# strace -p 19793
Process 19793 attached - interrupt to quit
futex(0x7f7cb37f39e0, FUTEX_WAIT_PRIVATE, 2, NULL
===

This only appears to happen for weirdly setup PTRs.

>From cannonball:

===
(02:24:13 PM) cannonball: Well, here is what I see happening:
(02:24:28 PM) cannonball: 1) normal processing, call local_scan
(02:26:57 PM) cannonball: 2) local_scan does a few things and makes calls back
in to exim using the expand_string() function
(02:31:12 PM) cannonball: 3) exim doesn't know about the previous cached lookup
(I don't know why) and it redoes the DNS resolution for that name, seeming to
result in the receive_bomb_out() in line labeled #35.
(02:32:41 PM) cannonball: 4) I am guessing that logging of that results in
another call to expand_string() which results in another DNS resolution, which
again does not used the cached lookup.
(02:36:00 PM) cannonball: that's about the best I can come up with at this
point.  I'm not familiar with Python really, and I have never used local_scan.
(02:40:01 PM) cannonball: I'm really not sure to be honest.  Let me look at the
source and see what receive_bomb_out() looks like and where it's called from.
(02:40:51 PM) cannonball: I think I guessed right:
(02:41:00 PM) cannonball: [exim-build@tlyons ~/projects/exim/src (odk_build)]$
ack receive_bomb_out src/
(02:41:00 PM) cannonball: src/receive.c
(02:41:00 PM) cannonball: 291:receive_bomb_out(uschar *reason, uschar *msg)
(02:41:00 PM) cannonball: 361:receive_bomb_out(US"data-timeout", msg);   /*
Does not return */
(02:41:00 PM) cannonball: 384:receive_bomb_out(US"local-scan-timeout", US"local
verification problem");
(02:41:00 PM) cannonball: 406:receive_bomb_out(US"local-scan-error", US"local
verification problem");
(02:41:02 PM) cannonball: 443:receive_bomb_out(US"signal-exit", msg);    /*
Does not return */
(02:41:04 PM) cannonball: 3849:to cause a call to receive_bomb_out() if the log
cannot be opened. */
(02:41:06 PM) cannonball: src/functions.h
(02:41:08 PM) cannonball: 258:extern void    receive_bomb_out(uschar *, uschar
*);
(02:41:10 PM) cannonball: src/log.c
(02:41:12 PM) cannonball: 137:of accepting a message, throw it away tidily by
calling receive_bomb_out();
(02:41:16 PM) cannonball: 159:if (receive_call_bombout) receive_bomb_out(NULL,
s2);  /* does not return */
(02:42:00 PM) cannonball: It's probably called from receive.c in the first one
on the stack #35, and it's probably called from log.c in the second one on the
stack #19.
(02:48:19 PM) cannonball: dreas Do you have a notquit ACL?
(02:48:38 PM) dreas: cannonball: Yes, that's where it gets stuck
(02:48:51 PM) dreas: http://pastebin.com/wYMr9We3 (using ACL "acl_notquit")
(02:52:06 PM) cannonball: Undoubtedly if you could get rid of the host name
logging in this acl, this problem would go away.  The problem is that the ACLs
should be bulletproof enough that you should not have to worry about it.
(02:56:31 PM) cannonball: In the not_quit acl, it's expected that the sending
system did not shut down properly, so it tries to close down things in an
orderly manner even though the remote side hung up (in this case, local_scan
timed out).
(02:57:27 PM) cannonball: It very likely is '$sender_host_name' in the insert
that is causing this.
(02:58:19 PM) cannonball: Maybe you could very early in the rcpt acl:   set
acl_c_sender_host_name = $sender_host_name, and then in the not_quit acl, use
$acl_c_sender_host_name (which won't have to get re-expanded since it's just a
text string at this point).
(02:58:56 PM) cannonball: that's the underlying issue I think, the expansions
that occur as a result.
(03:03:26 PM) cannonball: I think the big issue is here:
(03:03:29 PM) cannonball: >>> no IP address found for host
mail.transunionnorte.com
(03:03:29 PM) cannonball: >>> mail.transunionnorte.com in
dns_again_means_nonexist? no (option unset)
(03:03:29 PM) cannonball: >>> temporary error for host name lookup
(03:05:05 PM) cannonball: That makes the whole dns entry with all those cnames
(I think) get not cached, so it has to repeat it every time.  Some transient
error on their side or in transit to your colo could cause the local_scan
timeout *OR* the die sequence logging timeout, which then creates another call
to the die sequence.
(03:05:50 PM) cannonball: receive_bomb_out does seem to do the correct thing.
(03:08:19 PM) cannonball: Hmmm, I was thinking of a different subroutine.
(03:08:42 PM) cannonball: Maybe receive_bomb_out needs a global var to detect
when it's called more than once in a session.
===


This bug still persists in Exim 4.82. It only occurs for IP addresses with
weird/wrongly setup PTR records (hence fairly hard to reproduce). Please let me
know if further information is required.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email