[exim] dns queryid's

Author: Wolfgang Breyha
Date:  
To: exim users
Subject: [exim] dns queryid's
Hi!

At our site we wanted to copy_forward all incoming mail to another
server, so this backup server received connections only from the MX
machine. I ran into serious trouble because most of the DNS lookups
were delayed by 2 seconds and then fell back to the second nameserver
in resolv.conf. Since both have to handle about 1000 mails/minute,
this was a very big issue.

tcpdump showed that all the delayed DNS requests from exim carried the
same query id. Since our nameservers have "use-id-pool" set, they
ignored most of these requests, which explained the 2 second delays
and even the complete timeouts I was seeing.
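
For reference when reading the tcpdump output: the query id is the
16-bit field in the first two bytes of every DNS message, in network
byte order (RFC 1035). Below is a minimal sketch of pulling it out of
a captured payload and counting repeats. dns_query_id() and
count_same_as_first() are hypothetical helper names used only for
illustration; they are not part of exim or libresolv.

```c
#include <stddef.h>

/* Return the 16-bit query id held in the first two bytes of a DNS
   message (network byte order). Returns -1 if the buffer is too
   short to contain an id. */
int dns_query_id(const unsigned char *msg, size_t len)
{
    if (msg == NULL || len < 2)
        return -1;
    return (msg[0] << 8) | msg[1];
}

/* Hypothetical helper: count how many of the n ids equal the first
   one. A count close to n is the pattern described above. */
size_t count_same_as_first(const int *ids, size_t n)
{
    size_t i, same = 0;

    for (i = 0; i < n; i++)
        if (ids[i] == ids[0])
            same++;
    return same;
}
```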

So I dug deeper to find out why exim was reusing the same query id.
Since exim uses libresolv, I thought it couldn't be exim's fault, so
I wrote proof-of-concept code to reproduce the behaviour, and I was
successful.

I discovered that the first call to res_search() after every fork()
uses the same query id that the parent had in _res.id right before
the fork(). I've included my PoC code. Run it alongside
"tcpdump port 53" and you'll see what I mean. At least the libresolv
used on FC4 (2.3.6), Ubuntu Dapper (2.3.6) and FC5 (2.4) behaves that
way.
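
The inheritance itself can be seen without any network traffic. This
is a minimal sketch, assuming a glibc-style resolv.h where _res.id is
directly accessible; child_initial_queryid() is a hypothetical helper
name. It only shows that the child starts with the parent's _res.id.
Whether a given libresolv version then reuses that value for its
first query is what the full PoC checks.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <unistd.h>

/* Hypothetical helper: fork a child and return the value of _res.id
   that the child sees before it makes any resolver call.
   Returns -1 on error. */
int child_initial_queryid(void)
{
    int fds[2];
    int status;
    unsigned short id = 0;
    pid_t pid;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* child: report the inherited id through the pipe */
        unsigned short seen = _res.id;

        close(fds[0]);
        if (write(fds[1], &seen, sizeof seen) != (ssize_t)sizeof seen)
            _exit(1);
        _exit(0);
    }

    /* parent: read what the child saw, then reap it */
    close(fds[1]);
    n = read(fds[0], &id, sizeof id);
    close(fds[0]);
    (void) waitpid(pid, &status, 0);
    return n == (ssize_t)sizeof id ? (int)id : -1;
}
```

Setting _res.id to a known value in the parent and then calling the
helper should return that same value.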

Since the listening exim process does no lookups itself, all of its
forked children behave like that, and the reverse lookups for the MX
machine's IP all carry the same query id for as long as the listener
runs.

Since I'm not sure whether it's the responsibility of glibc/libresolv
to set a new query id on fork(), I'm reporting this bug here ;-) At
least a "couple of installations" out in the wild use these versions
of libresolv, and the workaround is pretty simple:

_res.id = res_randomid();
...after the fork. Calling res_init() again didn't help.
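
One way to apply this in a single place is to wrap fork(). This is
only a sketch, not how exim does it: fork_with_fresh_queryid() is a
hypothetical name, and it assumes the platform's resolv.h exposes
_res.id and res_randomid() as glibc does.

```c
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <unistd.h>

/* Hypothetical wrapper around fork(): the child gets a fresh query id
   before it can issue its first lookup. The parent's resolver state
   is left untouched. Return value is the same as fork(). */
pid_t fork_with_fresh_queryid(void)
{
    pid_t pid = fork();

    if (pid == 0)
        _res.id = res_randomid();   /* child only */
    return pid;
}
```

Callers would use it exactly like fork(); on older systems the
program may need to be linked with -lresolv.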

So I fixed my problem for now with:
---------------
--- src/child.c.orig    2006-09-02 21:13:48.000000000 +0200
+++ src/child.c 2006-09-02 21:16:26.000000000 +0200
@@ -78,6 +78,9 @@
  uschar **argv =
    store_get((extra + acount + MAX_CLMACROS + 16) * sizeof(char *));


+/* resolver bug workaround */
+_res.id = res_randomid();
+
/* In all case, the list starts out with the path, any macros, and a changed
config file. */

----------------
But I'm sure this is not the best place.

Regards, Wolfgang Breyha
University of Vienna

PS: the proof of concept....
----------------
#include <sys/types.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/nameser.h>
#include <resolv.h>

static void dodnslookup(const char *host);

int
main(int argc, char *argv[])
{
     int i;

     /* sanity check: one (and only one) argument? */
     if (argc != 2) {
         (void) fprintf(stderr, "usage: %s host\n", argv[0]);
         exit(1);
     }

     (void) res_init();
     printf("after init: %d\n", _res.id);

     /* one lookup in the parent, so _res.id holds an id already used */
     dodnslookup(argv[1]);
     printf("parent: %d\n", _res.id);

     for (i = 0; i < 3; i++) {
         int status;
         pid_t pid;

         /* flush first, so buffered output is not repeated by the child */
         fflush(stdout);
         pid = fork();
         if (pid < 0) {
             perror("fork");
             exit(1);
         }
         if (pid > 0) {
             /* parent: wait, so the children run one after another */
             (void) waitpid(pid, &status, 0);
         } else {
             /* child: uncomment the next line to test the workaround */
             /* _res.id = res_randomid(); */
             printf("child(%d) before: %d\n", i, _res.id);
             dodnslookup(argv[1]);
             printf("child(%d) after: %d\n", i, _res.id);
             exit(0);
         }
     }

     exit(0);
}

/* Send one A query for host through res_search(); exit on failure. */
static void
dodnslookup(const char *host)
{
     union {
         HEADER hdr;
         u_char buf[NS_PACKETSZ];
     } response;

     if (res_search(host, ns_c_in, ns_t_a,
                    (u_char *)&response, sizeof(response)) < 0)
         exit(1);
}
----------------
-- 
Wolfgang Breyha <wbreyha@???> | http://www.blafasel.at/
Vienna University Computer Center | Austria