[exim-dev] local_scan timeout with troublesome dns

Kezdőlap
Üzenet törlése
Válasz az üzenetre
Szerző: Todd Lyons
Dátum:  
Címzett: exim-dev
Tárgy: [exim-dev] local_scan timeout with troublesome dns
I worked with a guy in IRC who had a weird issue:

"So we've already spend weeks fulltime on troubleshooting this issue,
but never managed to find out what's causing it. The Exim process
simply gets stuck in FUTEX_WAIT_PRIVATE and we need to kill it to get
rid of it. "lsof" always indicated it seemed to somehow hang on a UDP
query. I finally managed to catch it in Exim debug logging mode, which
shows it hangs on a domain with multiple PTR records that are badly
resolving.
It's Exim 4.80.1, we did customly compile it but there are no patches
except for using encrypted configuration. We've had [this symptom] for
4 years, so all previous versions of the last 4 years seem affected as
well."

He was able to repeatedly produce it with this commandline (this is
the badly resolving PTR records host) and pressing Ctrl-C immediately
upon exim calling the local_scan function:

exim -d -bh 193.173.111.241

We got a core from a stuck process and were able to get a backtrace
that looks like this:

    (gdb) bt
    #0  0x00007fd311d6662e in ?? () from /lib/libc.so.6
    #1  0x00007fd311d02798 in ?? () from /lib/libc.so.6
    #2  0x00007fd311d00b81 in free () from /lib/libc.so.6
    #3  0x00007fd3139474d2 in ?? () from /lib64/ld-linux-x86-64.so.2
    #4  0x00007fd3139424ff in ?? () from /lib64/ld-linux-x86-64.so.2
    #5  0x00007fd313947f5a in ?? () from /lib64/ld-linux-x86-64.so.2
    #6  0x00007fd313943906 in ?? () from /lib64/ld-linux-x86-64.so.2
    #7  0x00007fd31394792a in ?? () from /lib64/ld-linux-x86-64.so.2
    #8  0x00007fd311d90160 in ?? () from /lib/libc.so.6
    #9  0x00007fd313943906 in ?? () from /lib64/ld-linux-x86-64.so.2
    #10 0x00007fd311d901ff in ?? () from /lib/libc.so.6
    #11 0x00007fd311d902f7 in __libc_dlopen_mode () from /lib/libc.so.6
    #12 0x00007fd311d6e345 in ?? () from /lib/libc.so.6
    #13 0x00007fd312e3f7a3 in pthread_once () from /lib/libpthread.so.0
    #14 0x00007fd311d6e444 in backtrace () from /lib/libc.so.6
    #15 0x00007fd311cf25cf in ?? () from /lib/libc.so.6
    #16 0x00007fd311cfbe16 in ?? () from /lib/libc.so.6
    #17 0x00007fd311d00b8c in free () from /lib/libc.so.6
    #18 0x00007fd311cecd5d in fclose () from /lib/libc.so.6
    #19 0x0000000000474af3 in receive_bomb_out ()
    #20 0x00000000004750a5 in ?? ()
    #21 <signal handler called>
    #22 0x00007fd311d6662c in ?? () from /lib/libc.so.6
    #23 0x00007fd311d7082d in ?? () from /lib/libc.so.6
    #24 0x00007fd311d70695 in gethostbyname2 () from /lib/libc.so.6
    #25 0x00000000004607b9 in host_find_byname ()
    #26 0x000000000046145a in host_name_lookup ()
    #27 0x000000000044fecd in ?? ()
    #28 0x000000000045092b in ?? ()
    #29 0x0000000000450fd9 in ?? ()
    #30 0x00000000004562b7 in expand_string ()
    #31 0x0000000000432990 in ?? ()
    #32 0x00000000004320c0 in ?? ()
    #33 0x0000000000432787 in acl_check ()
    #34 0x0000000000487ebb in smtp_notquit_exit ()
    #35 0x0000000000474b5d in receive_bomb_out ()
    #36 0x0000000000474bb5 in ?? ()
    #37 <signal handler called>
    #38 0x00007fd311d4ebd8 in poll () from /lib/libc.so.6
    #39 0x00007fd3137298de in ?? () from /lib/libresolv.so.2
    #40 0x00007fd313727a95 in __libc_res_nquery () from /lib/libresolv.so.2
    #41 0x00007fd313728031 in ?? () from /lib/libresolv.so.2
    #42 0x00007fd313728468 in __libc_res_nsearch () from /lib/libresolv.so.2
    #43 0x00007fd3103fdfa8 in _nss_dns_gethostbyname3_r () from
/lib/libnss_dns.so.2
    #44 0x00007fd3103fe264 in _nss_dns_gethostbyname2_r () from
/lib/libnss_dns.so.2
    #45 0x00007fd311d709e9 in gethostbyname2_r () from /lib/libc.so.6
    #46 0x00007fd311d70727 in gethostbyname2 () from /lib/libc.so.6
    #47 0x00000000004607b9 in host_find_byname ()
    #48 0x000000000046145a in host_name_lookup ()
    #49 0x000000000044fecd in ?? ()
    #50 0x000000000045092b in ?? ()
    #51 0x00000000004562b7 in expand_string ()
    #52 0x000000000049dc6f in ?? ()
    ---Type <return> to continue, or q <return> to quit---
    #53 0x00000000005484d0 in PyEval_EvalFrameEx ()
    #54 0x0000000000549421 in PyEval_EvalFrameEx ()
    #55 0x0000000000549421 in PyEval_EvalFrameEx ()
    #56 0x0000000000549d08 in PyEval_EvalCodeEx ()
    #57 0x0000000000548097 in PyEval_EvalFrameEx ()
    #58 0x0000000000549d08 in PyEval_EvalCodeEx ()
    #59 0x0000000000548097 in PyEval_EvalFrameEx ()
    #60 0x0000000000549d08 in PyEval_EvalCodeEx ()
    #61 0x0000000000548097 in PyEval_EvalFrameEx ()
    #62 0x0000000000549d08 in PyEval_EvalCodeEx ()
    #63 0x00000000005b1921 in ?? ()
    #64 0x00000000004ce1b3 in PyObject_Call ()
    #65 0x00000000004d13af in PyObject_CallFunction ()
    #66 0x000000000049e4ce in local_scan ()
    #67 0x000000000047906e in receive_msg ()
    #68 0x0000000000438646 in daemon_go ()
    #69 0x000000000044cbe5 in main ()
    (gdb)


The acl_notquit stanza looks like this:

acl_notquit:

  warn
    # Log whatever information we may have.
    # There may actually be an existing classification here - if the message
    # got all the way through local scan, and then the connection was
    # dropped before the final QUIT.  In that case, replace the
    # classification.  See #6423.
    set acl_m_log_bin_result = ${lookup mysql{SET SQL_LOG_BIN=0}}
    set acl_c_log_result = ${lookup mysql{INSERT INTO IN_TABLE
(exim_id, last_touched, sender_host, sender_ip, pid, incoming_size,
outgoing_size, sender, recipient, domain, classification,
classification_details) VALUES ('$acl_c_unique_id', UTC_TIMESTAMP(),
'$sender_host_name', '$sender_host_address', '$pid', 0, 0,
'${quote_mysql:$sender_address}', '', '', 'unknown',
'${quote_mysql:$smtp_notquit_reason}') ON DUPLICATE KEY UPDATE
classification=VALUES(classification),
classification_details=VALUES(classification_details)}}
    set acl_m_log_bin_result = ${lookup mysql{SET SQL_LOG_BIN=1}}


# There is actually no choice but to accept, but be explicit anyway.
accept


Looking at the backtrace, it is doing the rdns for the host once,
calling local_scan, timing out, and then calling receive_bomb_out()
which calls the notquit acl. But then that logs the
$sender_host_name, which again calls rdns functions which time out
again, and (I think mistakenly) calls receive_bomb_out() again. I had
him remove the $sender_host_name from the logging statement and it
fixed the issue because it removed the failing DNS lookup from what I
interpret as the error handling portion of the code.

I suspect the fix will be to add a global var that tracks when
receive_bomb_out() has been entered and immediately exit from the
function if it detects that it has already been entered once. I think
it's good that it gets called multiple times when things are really
bad, we just need to guard it from actually trying to shut everything
down more than once. Does anybody see any flaws or repercussions from
this approach?

...Todd
--
The total budget at all receivers for solving senders' problems is $0.
If you want them to accept your mail and manage it the way you want,
send it the way the spec says to. --John Levine