In article <3D780861.2060301@???> you write:
>This is REALLY bugging me! it seems like for some reason exim will not respond to activity
>on it's assigned ports unless there's an active strace on the daemon process.
>[exim listening on port 4000]
If I'm reading the source correctly, there's an error in the way that
new connections are accepted that affects machines with two or more
SMTP listeners.
You can end up with multiple listeners because you are listening on
more than one port, more than one interface, or more than one address
family.
Exim's daemon does something similar to this:

    bind and listen on all relevant sockets
    set the alarm for the queue interval
    while (forever)
    {
        select on all listening sockets
        for (selected sockets)
        {
            accept the connection
            handle next connection if interrupted in accept
            if connection cannot be handled right now (eg. high load)
            {
                send an error, 421 .....
                close this connection
                handle next connection
            }
            fork
            if (child)
            {
                handle the smtp connection
                exit
            }
            close this connection
        }
    }

We get away with blocking writes for the SMTP error responses because
they are much smaller than the socket's send buffer.
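
To make the buffer-size argument concrete, here is a minimal C sketch
of the reject path. It is not Exim's code: the helper name
reject_with_421() and the reply text are made up for illustration, and
it assumes a freshly accepted connection whose send buffer is still
empty. It checks SO_SNDBUF and then sends the short reply with an
ordinary blocking write().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

/* Hypothetical helper: reject a freshly accepted connection with a short
 * SMTP reply, using an ordinary blocking write(), then close it.
 * Returns 0 if the whole reply was written, -1 otherwise. */
static int reject_with_421(int fd, const char *reply)
{
    int sndbuf = 0;
    socklen_t len = sizeof sndbuf;
    size_t n;
    int rc = -1;

    assert(reply != NULL);
    n = strlen(reply);
    /* The send buffer of a new connection is empty, so a reply much
     * smaller than SO_SNDBUF fits in it straight away and write()
     * returns without waiting for the client to read anything. */
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0
        && sndbuf > 0 && (size_t)sndbuf > n
        && write(fd, reply, n) == (ssize_t)n)
        rc = 0;
    close(fd);
    return rc;
}
```

If the client has already gone away, the write() may raise SIGPIPE, so
a daemon using this pattern would normally ignore that signal.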
The situation that may cause a problem is when a connection arrives
(so select() reports the listening socket as readable) but is reset
by the client before accept() is called. SUSv2 isn't clear on what the
correct behaviour for accept() is in that case; I believe that some
systems return an error from accept(), while others just block until
the next connection arrives.
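
A common defence against this race is to make the listening sockets
non-blocking, so that accept() on a socket whose pending connection
has vanished fails with EAGAIN/EWOULDBLOCK (or ECONNABORTED) instead of
blocking. Below is a minimal C sketch of one pass of the loop above
with that change. It is not Exim's code; make_nonblocking(),
open_loopback_listener() and service_listeners() are hypothetical
names, and closing the accepted socket stands in for the fork-and-handle
step.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Switch a listening socket to non-blocking mode. */
static int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: a non-blocking listener on 127.0.0.1 with a
 * kernel-chosen port, standing in for "bind and listen on all sockets". */
static int open_loopback_listener(void)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = htons(0);
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0
        || listen(fd, 16) < 0 || make_nonblocking(fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* One pass of the daemon loop: wait up to timeout_sec for any of the n
 * listeners to become readable, then try accept() on each ready one.
 * Returns the number of connections accepted, or -1 if select() fails. */
static int service_listeners(const int *fds, int n, int timeout_sec)
{
    fd_set readable;
    struct timeval tv = { timeout_sec, 0 };
    int maxfd = -1, accepted = 0;

    assert(n > 0);
    FD_ZERO(&readable);
    for (int i = 0; i < n; i++) {
        FD_SET(fds[i], &readable);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    if (select(maxfd + 1, &readable, NULL, NULL, &tv) < 0)
        return errno == EINTR ? 0 : -1;   /* interrupted: just loop again */

    for (int i = 0; i < n; i++) {
        if (!FD_ISSET(fds[i], &readable))
            continue;
        int conn = accept(fds[i], NULL, NULL);
        if (conn < 0) {
            /* EAGAIN/EWOULDBLOCK or ECONNABORTED: the connection that made
             * select() fire has gone (e.g. reset by the client), or we were
             * interrupted. Move on to the next listener instead of blocking. */
            if (errno != EAGAIN && errno != EWOULDBLOCK
                && errno != ECONNABORTED && errno != EINTR)
                perror("accept");
            continue;
        }
        close(conn);   /* a real daemon would fork() and hand conn to the child */
        accepted++;
    }
    return accepted;
}
```

On Linux an accepted socket does not inherit O_NONBLOCK from the
listener, but on BSD-derived systems it does, so a daemon that expects
blocking I/O in the child should clear the flag explicitly.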
If you are unlucky, this will cause the daemon to block, trying to
accept a connection on a socket that rarely receives connections. The
block will last until the alarm is received, i.e. when the next queue
run is due to start, which is far too long. Furthermore, if two
sockets were selected and accept() on the first socket is interrupted
by the alarm, I think the accept() on the second socket then ends up
in a blocking wait with no alarm set.
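
The alarm problem can be reproduced in isolation. This C sketch uses
hypothetical names and a loopback listener that nobody connects to; it
assumes the alarm handler is installed without SA_RESTART, which is
implied by accept() being "interrupted" but not confirmed from Exim's
source. It arms a one-shot alarm and blocks in accept(); the alarm
breaks it out with EINTR, after which no alarm is pending any more.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static volatile sig_atomic_t alarm_fired = 0;

static void on_alarm(int sig)
{
    (void)sig;
    alarm_fired = 1;
}

/* Hypothetical demo: block in accept() on a blocking loopback listener
 * that nobody connects to, with a one-shot alarm as the only way out.
 * Returns accept()'s errno (EINTR when the alarm interrupts it), 0 if
 * accept() somehow succeeded, or -1 if setup failed. */
static int accept_on_idle_listener(unsigned seconds)
{
    struct sigaction act;
    struct sockaddr_in sa;
    int conn, err, fd;

    assert(seconds > 0);            /* alarm(0) would arm nothing: block forever */
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = htons(0);         /* let the kernel pick a free port */
    memset(&act, 0, sizeof act);
    act.sa_handler = on_alarm;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;               /* no SA_RESTART, so accept() fails with EINTR */
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0
        || listen(fd, 4) < 0 || sigaction(SIGALRM, &act, NULL) < 0) {
        close(fd);
        return -1;
    }

    alarm(seconds);                 /* one-shot: fires once, then nothing is armed */
    conn = accept(fd, NULL, NULL);  /* blocks until the alarm interrupts it */
    err = conn < 0 ? errno : 0;
    if (conn >= 0)
        close(conn);
    close(fd);
    return err;
}
```

Once this returns, alarm(0) reports no time remaining, so a second
blocking accept() on another listener would have nothing to interrupt
it, which matches the two-socket case described above.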
Strace has a habit of interrupting blocking system calls, which is why
the program continues when you start tracing. Exim gets into this
unusual state because port 4000 is occasionally probed by script
kiddies; it is one of a series of ports (4000-4003) on which trojans
install listening daemons.
It would be helpful to see which system calls were made immediately
after tracing started. If you have ltrace, the corresponding output
from that would be useful too.
Peter