Re: [exim-dev] local_scan timeout with troublesome dns

Top Page
Delete this message
Reply to this message
Author: Phil Pennock
Date:  
To: Todd Lyons
CC: exim-dev
Subject: Re: [exim-dev] local_scan timeout with troublesome dns
On 2013-03-05 at 08:29 -0800, Todd Lyons wrote:
> He was able to repeatedly produce it with this commandline (this is
> the badly resolving PTR records host) and pressing Ctrl-C immediately
> upon exim calling the local_scan function:
>
> exim -d -bh 193.173.111.241


I can reproduce long delays, simply by having "host_lookup = *" in the
configuration, but not hangs, on either FreeBSD or Ubuntu.

Do you have information on what they're _doing_ in their local_scan? Is
it code they can share? Python embedding is interesting ... someday in
the distant future when I have spare time, I'd really like to get back
to serious Exim hacking and adding Python to Perl is on my todo list.

The problem seems to come from the resolution of
mail2.persvragenbrabant.nl and mail2.slimmezorg.nl timing out. Looks
like problems with the DNS servers ns1.brabant.nl and ns2.brabant.nl

As to us: we're not doing anything weird with the system resolvers or
the process state, as far as I know, so there shouldn't be any races or
concurrency issues: by the time their code is called, all other DNS
resolutions are over.

So, they're using Python, with expand_string stuff analogous to what we
have for Perl already, to call back into Exim, to do checks in DNS;
things time out, a signal handler is called, it's _ours_, not theirs,
set up by one of the signal handlers in receive.c. SIGALRM, SIGTERM or
SIGINT. In this case, thanks to the timeouts, likely to be SIGALRM. So
perhaps the issue is that some FILE* object is open inside Python but
not inside Exim, thus the failure in Exim which only happens for them?

We should probably update chapter 44 with something on signal handler
state.

So, this suggests that they've set receive_timeout or
smtp_receive_timeout or local_scan_timeout to something lower than five
minutes, because even with the DNS delays, it's still 30s per host, so
they shouldn't be getting the ALRM.

Side issue:
Should they, in their Python setup, be setting something as a signal
handler on ALRM which does clean-up at the _Python_ layer, before
restoring the original handler and sending ALRM to themselves, to abort
out? I'd imagine something which throws a Python exception, a
descendant of BaseException and sibling of KeyboardInterrupt, to be
treated especially, rather than a normal Python usercode exception which
inherits from Exception, and in their C wrapping, do the catch of their
EximAlarmException or whatever, to restore the previous signal handler
and re-raise. This should let all Python cleanup happen, before Exim
code is called.

Can they run gcore to get the state of their process when it's hung,
*without* sending ^C to it and causing data_sigterm_sigint_handler() to
be called? That's causing bad interactions with what's actually
happening, but I think, in this case, you've mostly pinned it down.

> Looking at the backtrace, it is doing the rdns for the host once,
> calling local_scan, timing out, and then calling receive_bomb_out()
> which calls the notquit acl. But then that logs the
> $sender_host_name, which again calls rdns functions which time out
> again, and (I think mistakenly) calls receive_bomb_out() again. I had


Looks right to me, but I don't think the receive_bomb_out is an error.

> I suspect the fix will be to add a global var that tracks when
> receive_bomb_out() has been entered and immediately exit from the
> function if it detects that it has already been entered once. I think
> it's good that it gets called multiple times when things are really
> bad, we just need to guard it from actually trying to shut everything
> down more than once. Does anybody see any flaws or repercussions from
> this approach?


There are two issues here:

(1) receive_bomb_out is assuming it's about to exit, so some of the code
    was written with that assumption, I think, and predates the ACL
    callout.


    data_file is fclose()d but not set to NULL, so the actual bug
    causing the stdio lockup is that we will fclose(data_file) twice.
    Potentially, double-frees.


    This:
----------------------------8< cut here >8------------------------------
if (data_file != NULL) (void)fclose(data_file);
  else if (data_fd >= 0) (void)close(data_fd);
----------------------------8< cut here >8------------------------------
    needs to be this:
----------------------------8< cut here >8------------------------------
if (data_file != NULL) {
  (void)fclose(data_file);
  data_file = NULL;
} else if (data_fd >= 0) {
  (void)close(data_fd);
  data_fd = -1;
}
----------------------------8< cut here >8------------------------------


(2) The recursive shutdowns of SMTP; best fixed by not trying to call
    into the ACLs twice.  Instead, add a function-level static which is
    set true the first time around, and if true, we don't call those
    ACLs.


    Because the only thing that should be happening at the _top_ level
    after the ACLs are called is to exit, we can short-cut cleanly by
    just exiting immediately.


    We still need to NULL out the data_file/data_fd, to prevent their
    use elsewhere.


Note that global variables are unnecessary, function static variables
will work just fine here.

The attached patch is also in git in the main repo, on the
fix_receive_bombout branch. If it looks good to others, we can merge it
in.

-Phil
diff --git a/src/src/receive.c b/src/src/receive.c
index e0c1c73..56de9fb 100644
--- a/src/src/receive.c
+++ b/src/src/receive.c
@@ -286,32 +286,50 @@ Returns:     it doesn't
 void
 receive_bomb_out(uschar *reason, uschar *msg)
 {
+  static BOOL already_bombing_out;
+/* The smtp_notquit_exit() below can call ACLs which can trigger recursive
+timeouts, if someone has something slow in their quit ACL.  Since the only
+things we should be doing are to close down cleanly ASAP, on the second
+pass we also close down stuff that might be opened again, before bypassing
+the ACL call and exiting. */
+
 /* If spool_name is set, it contains the name of the data file that is being
 written. Unlink it before closing so that it cannot be picked up by a delivery
 process. Ensure that any header file is also removed. */


-if (spool_name[0] != 0)
+if (spool_name[0] != '\0')
{
Uunlink(spool_name);
spool_name[Ustrlen(spool_name) - 1] = 'H';
Uunlink(spool_name);
+ spool_name[0] = '\0';
}

/* Now close the file if it is open, either as a fd or a stream. */

-if (data_file != NULL) (void)fclose(data_file);
- else if (data_fd >= 0) (void)close(data_fd);
+if (data_file != NULL)
+ {
+ (void)fclose(data_file);
+ data_file = NULL;
+} else if (data_fd >= 0) {
+ (void)close(data_fd);
+ data_fd = -1;
+ }

/* Attempt to close down an SMTP connection tidily. For non-batched SMTP, call
smtp_notquit_exit(), which runs the NOTQUIT ACL, if present, and handles the
SMTP response. */

-if (smtp_input)
+if (!already_bombing_out)
   {
-  if (smtp_batched_input)
-    moan_smtp_batch(NULL, "421 %s - message abandoned", msg);  /* No return */
-  smtp_notquit_exit(reason, US"421", US"%s %s - closing connection.",
-    smtp_active_hostname, msg);
+  already_bombing_out = TRUE;
+  if (smtp_input)
+    {
+    if (smtp_batched_input)
+      moan_smtp_batch(NULL, "421 %s - message abandoned", msg);  /* No return */
+    smtp_notquit_exit(reason, US"421", US"%s %s - closing connection.",
+      smtp_active_hostname, msg);
+    }
   }


/* Exit from the program (non-BSMTP cases) */