On Wed, 28 Aug 2002, Bernhard Erdmann wrote:
> Now it's clearly reproducible using the same bounce procedure:
> - no strace: Exim works well as expected
> - strace -p PID -f: "remote delivery process count got out of step"
The ChangeLog for 4.11 has this entry:
4. It has been discovered that, under Linux, when a process and its children
are being traced by "strace -f", the children are stolen from the parent
while they are being traced. A call to waitpid(-1,&x,WNOHANG), which Exim
uses to test for the completion of "any of my children" in a non-blocking
manner, returns as if there are no children in existence. Exim used to treat
this as a serious unexpected error state. What it does now is to use
kill(pid,0) to check explicitly for the continued existence of any of its
children. If it finds any, it assumes it is being traced, and proceeds as
if the return from waitpid() had been "none of your children have finished
yet". If it can't find any children, it gives the error as before.
... and if debugging, it says
process xxx still exists: assume stolen by strace
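For anyone who wants to see the shape of that check, here is a minimal sketch in C. It is not the Exim source: the function poll_children(), the enum child_state and the idea of passing in an array of known child pids are all made up for illustration. It assumes the "no children" return from waitpid() shows up as -1 with errno set to ECHILD, and it only illustrates the "waitpid() first, then kill(pid,0) on each known child" logic described in the ChangeLog entry.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical result codes for this sketch; these are not Exim's. */
enum child_state {
  CHILD_REAPED,         /* a child finished; its pid and status are returned */
  CHILD_STILL_RUNNING,  /* nothing has finished yet, or we assume a tracer */
  CHILD_ERROR           /* waitpid() failed and no known child exists */
};

/* Non-blocking test for completion of "any of my children".
   pids[] holds the pids this process believes it has started (an
   assumption of the sketch; the caller must keep that list itself). */
static enum child_state
poll_children(const pid_t *pids, size_t count, pid_t *reaped, int *status)
{
  pid_t pid = waitpid(-1, status, WNOHANG);

  if (pid > 0) {
    *reaped = pid;
    return CHILD_REAPED;
  }
  if (pid == 0)
    return CHILD_STILL_RUNNING;   /* children exist, none has finished */

  /* waitpid() claims there are no children at all. Before treating that
     as an error, probe each known child with signal 0, which delivers
     nothing but reports whether the process still exists. */
  if (errno == ECHILD) {
    for (size_t i = 0; i < count; i++) {
      if (kill(pids[i], 0) == 0) {
        fprintf(stderr, "process %d still exists: assume stolen by strace\n",
                (int)pids[i]);
        return CHILD_STILL_RUNNING;
      }
    }
  }
  return CHILD_ERROR;             /* genuinely out of step */
}
```

A caller would run this in its delivery loop and carry on waiting whenever it gets CHILD_STILL_RUNNING. Note that kill(pid,0) can also fail with EPERM for a process that exists but may not be signalled; the sketch treats that the same as "not found", which may or may not be what you want.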
This seems to be a Linux "feature". The same thing does not happen under
Solaris "truss", for example. I don't know about other OS.
--
Philip Hazel University of Cambridge Computing Service,
ph10@??? Cambridge, England. Phone: +44 1223 334714.