Re: [exim-dev] local_scan timeout with troublesome dns

Top Page
Delete this message
Reply to this message
Author: Todd Lyons
Date:  
To: Todd Lyons, exim-dev
Subject: Re: [exim-dev] local_scan timeout with troublesome dns
On Tue, Mar 5, 2013 at 2:02 PM, Phil Pennock <pdp@???> wrote:
> On 2013-03-05 at 08:29 -0800, Todd Lyons wrote:
>> Looking at the backtrace, it is doing the rdns for the host once,
>> calling local_scan, timing out, and then calling receive_bomb_out()
>> which calls the notquit acl. But then that logs the
>> $sender_host_name, which again calls rdns functions which time out
>> again, and (I think mistakenly) calls receive_bomb_out() again. I had
>
> Looks right to me, but I don't think the receive_bomb_out is an error.
> There are two issues here:
>
> (1) receive_bomb_out is assuming it's about to exit, so some of the code
>     was written with that assumption, I think, and predates the ACL
>     callout.

<perfect patch snipped>
>
> (2) The recursive shutdowns of SMTP; best fixed by not trying to call
>     into the ACLs twice.  Instead, add a function-level static which is
>     set true the first time around, and if true, we don't call those
>     ACLs.


The guy I'm working with added the glibc devel packages so we get some
(minor) insight into what glibc resolver code path is triggering this.
The below backtrace is also running exim with the patch mentioned
above, and the frequency of his hung processes seems to be less than
it was before, so that's good, but non-answering authoritative
nameservers still are causing it. Look at the following backtrace
(gcore of a several hours hung process) :

Just got another backtrace on test123.com:
========================
(gdb) bt
#0 __lll_lock_wait_private () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1 0x00007f2fb03c682d in _L_lock_26 () from /lib/libc.so.6
#2 0x00007f2fb03c6695 in gethostbyname2 (name=0x7f2fb0640688 "\002",
af=10) at ../nss/getXXbyYY.c:98
#3 0x00000000004607b9 in host_find_byname ()
#4 0x000000000046145a in host_name_lookup ()
#5 0x000000000044fecd in ?? ()
#6 0x000000000045092b in ?? ()
#7 0x0000000000450fd9 in ?? ()
#8 0x00000000004562b7 in expand_string ()
#9 0x0000000000432990 in ?? ()
#10 0x00000000004320c0 in ?? ()
#11 0x0000000000432787 in acl_check ()
#12 0x0000000000487edb in smtp_notquit_exit ()
#13 0x0000000000474b87 in receive_bomb_out ()
#14 0x0000000000474bdf in ?? ()
#15 <signal handler called>
#16 0x00007f2fb03a4bd8 in *__GI___poll (fds=0x7fff2f62b540, nfds=1,
timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#17 0x00007f2fb1d7f8de in send_dg (statp=0x7f2fb0640300, buf=<value
optimized out>, buflen=<value optimized out>, buf2=0x0, buflen2=<value
optimized out>,
ans=0x7fff2f62c110 "?큂", anssiz=1024, ansp=0x7fff2f62c970, ansp2=0x0,
nansp2=0x0, resplen2=0x0) at res_send.c:1062
#18 __libc_res_nsend (statp=0x7f2fb0640300, buf=<value optimized out>,
buflen=<value optimized out>, buf2=0x0, buflen2=<value optimized out>,
ans=0x7fff2f62c110 "?큂",
anssiz=1024, ansp=0x7fff2f62c970, ansp2=0x0, nansp2=0x0, resplen2=0x0)
at res_send.c:553
#19 0x00007f2fb1d7da95 in *__GI___libc_res_nquery
(statp=0x7f2fb0640300, name=<value optimized out>, class=<value
optimized out>, type=<value optimized out>,
answer=0x7fff2f62c110 "?큂", anslen=<value optimized out>,
answerp=0x7fff2f62c970, answerp2=0x0, nanswerp2=0x0, resplen2=0x0) at
res_query.c:226
#20 0x00007f2fb1d7e031 in __libc_res_nquerydomain
(statp=0x7f2fb0640300, name=0x25bdc30 "test123.com", domain=0xa
<Address 0xa out of bounds>, class=<value optimized out>,
type=28, answer=0x7fff2f62c110 "?큂", anslen=1024,
answerp=0x7fff2f62c970, answerp2=0x0, nanswerp2=0x0, resplen2=0x0) at
res_query.c:578
#21 0x00007f2fb1d7e468 in *__GI___libc_res_nsearch
(statp=0x7f2fb0640300, name=<value optimized out>, class=<value
optimized out>, type=<value optimized out>,
answer=0x7fff2f62c110 "?큂", anslen=1024, answerp=0x7fff2f62c970,
answerp2=0x0, nanswerp2=0x0, resplen2=0x0) at res_query.c:378
#22 0x00007f2faea53fa8 in *__GI__nss_dns_gethostbyname3_r (name=<value
optimized out>, af=10, result=0x7f2fb0640660, buffer=<value optimized
out>, buflen=<value optimized out>,
errnop=<value optimized out>, h_errnop=0x7fff2f62cacc, ttlp=0x0,
canonp=0x0) at nss_dns/dns-host.c:197
#23 0x00007f2faea54264 in _nss_dns_gethostbyname2_r
(name=0x7fff2f62b540 "\v", af=1, result=0x1770,
buffer=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
buflen=0, errnop=0x0, h_errnop=0x7fff2f62cacc) at nss_dns/dns-host.c:257
#24 0x00007f2fb03c69e9 in __gethostbyname2_r (name=0x25bdc30
"test123.com", af=10, resbuf=0x7f2fb0640660, buffer=0x25cd4b0
"*\001\004\370\001", buflen=1024,
result=<value optimized out>, h_errnop=0x7fff2f62cacc) at
../nss/getXXbyYY_r.c:253
#25 0x00007f2fb03c6727 in gethostbyname2 (name=0x25bdc30
"test123.com", af=10) at ../nss/getXXbyYY.c:117
#26 0x00000000004607b9 in host_find_byname ()
#27 0x000000000046145a in host_name_lookup ()
#28 0x000000000044fecd in ?? ()
#29 0x000000000045092b in ?? ()
#30 0x00000000004562b7 in expand_string ()

Using dig +trace, it fails with this:

test123.com.        172800    IN    NS    ns4.pageup.net.
test123.com.        172800    IN    NS    ns2.pageup.net.
;; Received 107 bytes from 192.41.162.30#53(192.41.162.30) in 274 ms


;; connection timed out; no servers could be reached

So working through the code, from frame 8, it's calling
expand_string() on $sender_host_name, which is hitting this at line
1599 in expand.c:
    case vtype_host_lookup:                    /* Lookup if not done so */
    if (sender_host_name == NULL && sender_host_address != NULL &&
        !host_lookup_failed && host_name_lookup() == OK)
      host_build_sender_fullhost();
    return (sender_host_name == NULL)? US"" : sender_host_name;


Next, in dns,c, it works its way to host_name_lookup() which re-inits
the dns resolution process (I suspect without properly closing the
previous and still waiting dns resolution) and then finally calls
host_find_byname() :
1780: if ((rc = host_find_byname(&h, NULL, 0, NULL, FALSE)) == HOST_FOUND)

Which of course, hangs again. I don't think force closing the
previous dns resolution is going to fix the problem because the second
attempt is going to just hang again and cause the error handler to
fire. My suspicion is that the only clean solution is adding a module
level variable to detect that $sender_host_name expansion was
attempted once and skip subsequent expansion attempts, maybe returning
just the IP or merely US"" (which seems to be what the return does in
the expand.c segment quoted above. But then again, that _seems_ to be
what !host_lookup_failed in expand.c is attempting to detect, it just
doesn't handle the timeout case. Maybe we just need to set *that*
variable some place that we are forgetting it.

Thoughts? Am I missing a bigger picture?

...Todd

--
The total budget at all receivers for solving senders' problems is $0.
If you want them to accept your mail and manage it the way you want,
send it the way the spec says to. --John Levine