[exim-dev] Fix for broken WITH_CONTENT_SCAN spam scanning on…

Top Page
Delete this message
Reply to this message
Author: Erik
Date:  
To: exim-dev
Subject: [exim-dev] Fix for broken WITH_CONTENT_SCAN spam scanning on OS X
( Greetings... Please forgive me if I misstep here. This is my
first post to a public project such as this. You can probably just
skip all the excess here and look at the patch below ;)

The SYMPTOM:

Getting "spam acl condition: Resource temporarily unavailable on
spamd socket" during WITH_CONTENT_SCAN (exiscan style) spam scanning
on Mac OS X causing a DEFER result to the ACL (or PASS if one uses /
defer_ok). One also sees "Warning: ACL "warn" statement skipped:
condition test deferred" due to the condition. This might date all
the way back to version 4.52 with the "poll() on OS X" fix. In the
spamd log, one notes a correlated "spamd: bad protocol: header error:
(Content-Length mismatch: Expected <actualMessageSize> bytes, got
<somethingSmaller> bytes)". I personally can reproduce the bug
simply by sending a big message through the spam scanning (>60K with
my TCP settings). A DEFER results in endless "temporarily rejected
after DATA" (451 SMTP result code?) or lets occasional spam through
if one uses /defer_ok on the DENY ACL.

The CAUSE:
In spam.c after a connection is established and the header is
sent, the socket is put into non-blocking mode, then all data is
rushed to it, yet EAGAIN is treated as a hard error (EAGAIN =
Resource temporarily unavailable). In fact, EAGAIN is expected to
occur -- particularly for large messages since they'll tend to fill
up the outgoing low level buffers before the data reaches the network
(or spamd/perl is able to suck it up on the other end?).

The CURE:
To avoid re-introducing blocking, avoid a CPU burning loop, and
avoid the spotty implementations of poll() on OS X, one can just use
the natural BSDish select() instead! In fact, exim itself uses select
() rather than poll() -- albeit for reading rather than sending. Of
course, select() is not identical to poll() but for this simple use
it happens to be close enough for an easy swap. Even the result code
and expected errors are close enough. I'm using this successfully in
production on 10.4 as we speak. I would expect select() to be solid
on all past and future OS X versions, other BSD's, and most other OS
too. I think one is supposed to #include <sys/select.h> to be
posixly correct, but exim only bothers with that for QNX. In any
case, #include exim.h covers it fine.

One might also rename NO_POLL_H to be USE_SELECT_NOT_POLL or
something similar. Or, if it is possible, one should drop the use of
poll(). Exim solely uses select(), albeit only for reading. Is it
somehow consistent across OS's for polling readability but not for
writability? I do see that 'nmap' uses select() for read, write and
extra monitoring, and that seems to work for people across enough
platforms.

On a separate note in checking the result of the poll() or select(),
and dealing with potential "signal" interruption, it probably should
"goto again;" rather than "continue;" I think "continue" will lose
the current unsent buffer's worth of data and corrupt the message
that makes it out to spamd! Coincidentally, this would also
precipitate the same symptoms described above since less data would
be sent than was announced in the header. A rare case, but also
included in the patch below.

REFERENCES:
http://www.erlenstar.demon.co.uk/unix/faq_3.html#SEC29
http://www.hmug.org/man/2/select.php
http://seth.positivism.org/man.cgi/select
http://developer.apple.com/technotes/tn2002/tn2071.html


PATCH against 4.62, (I used 'diff -cd', patch -p 0 from exim
directory should work):

*** src/spam.old        Wed Jun  7 15:20:42 2006
--- src/spam.c  Wed Jun  7 15:45:04 2006
***************
*** 42,47 ****
--- 42,50 ----
     struct sockaddr_un server;
   #ifndef NO_POLL_H
     struct pollfd pollfd;
+ #else
+   struct timeval select_tv;
+   fd_set select_fd;
   #endif


     /* stop compiler warning */
***************
*** 218,224 ****
      * and we poll the desciptor to make sure that we can write without
      * blocking.  Short writes are gracefully handled and if the whole
      * trasaction takes too long it is aborted.
!    * Note: poll() is not supported in OSX 10.2.
      */
   #ifndef NO_POLL_H
     pollfd.fd = spamd_sock;
--- 221,228 ----
      * and we poll the desciptor to make sure that we can write without
      * blocking.  Short writes are gracefully handled and if the whole
      * trasaction takes too long it is aborted.
!    * Note: poll() is not supported in OSX 10.2 and is reported to be
!    *       broken in more recent versions (up to 10.4).
      */
   #ifndef NO_POLL_H
     pollfd.fd = spamd_sock;
***************
*** 232,239 ****
   again:
   #ifndef NO_POLL_H
         result = poll(&pollfd, 1, 1000);
         if (result == -1 && errno == EINTR)
!         continue;
         else if (result < 1) {
           if (result == -1)
             log_write(0, LOG_MAIN|LOG_PANIC,
--- 236,250 ----
   again:
   #ifndef NO_POLL_H
         result = poll(&pollfd, 1, 1000);
+ #else
+       select_tv.tv_sec = 1;
+       select_tv.tv_usec = 0;
+       FD_ZERO(&select_fd);
+       FD_SET(spamd_sock, &select_fd);
+       result = select(spamd_sock+1, NULL, &select_fd, NULL,  
&select_tv);
+ #endif
         if (result == -1 && errno == EINTR)
!         goto again;
         else if (result < 1) {
           if (result == -1)
             log_write(0, LOG_MAIN|LOG_PANIC,
***************
*** 248,254 ****
           (void)fclose(mbox_file);
           return DEFER;
         }
- #endif
         wrote = send(spamd_sock,spamd_buffer + offset,read - offset,0);
         if (wrote == -1)
         {
--- 259,264 ----