Re: [exim] exim dies on the interrupted system call

Author: W B Hacker
Date:  
To: exim-users
Subject: Re: [exim] exim dies on the interrupted system call
Phil Pennock wrote:
> On 2010-12-28 at 10:06 +0300, Артём Каялайнен wrote:
>> 28.12.2010 5:39, Phil Pennock wrote:
>>> This is a bug in Exim. Looking at the code, I'm rather shocked that
>>> it has never bitten us before now.
>>>
>>> I've filed a tracking bug for this:
>>> http://bugs.exim.org/show_bug.cgi?id=1053
>>>
>>> Feel free to add yourself to the CC list of the bug.
>>>
>>> Regards, -Phil
>> Hmm, quite unexpected. Thanks for the details. So the only solution so
>> far is to periodically check the status of exim from a cron job?
>
> That, or protect the service in a keep-alive wrapper, which is normally
> good policy for any critical daemon. Exim is normally much more stable
> than this. I'll try to find time tonight to write a patch.
>
> -Phil
>


The likes of monit, monitord, duende and similar tools work - but have not been
needed 'here' for Exim.

IF the patch is to loop, I'd suggest it be limited to, for example, 3 to 7
attempts, no more. Unlimited looping may create a larger (potential) problem
than it solves.

Thereafter, rather than letting the failure take down the Exim daemon, IMNSHO,
it should trigger a 'crit' message to the console and do its best to carry on.
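
To make the idea concrete, here is a minimal sketch in C of a bounded retry on
EINTR. It is NOT taken from the Exim source or from any patch; the names
retry_on_eintr, flaky_call and MAX_ATTEMPTS are invented for illustration, and
the fprintf() merely stands in for whatever 'crit' logging the daemon would
actually use.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical cap on attempts; anything in the 3 to 7 range would do. */
#define MAX_ATTEMPTS 5

/* Any call that returns -1 and sets errno on failure. */
typedef int (*call_fn)(void *arg);

/*
 * Call fn(arg) up to MAX_ATTEMPTS times, retrying only when it fails
 * with EINTR. Any other result (success or a different error) is
 * returned to the caller immediately.
 */
static int retry_on_eintr(call_fn fn, void *arg)
{
    int attempt;

    for (attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        int rc = fn(arg);
        if (rc != -1 || errno != EINTR)
            return rc;
    }

    /* Stand-in for a 'crit' log entry: report, do not exit. */
    fprintf(stderr, "crit: still interrupted after %d attempts\n",
            MAX_ATTEMPTS);
    errno = EINTR;  /* fprintf() may have changed errno */
    return -1;
}

/* Illustrative stand-in for a system call: fails with EINTR until the
 * counter pointed to by arg reaches zero, then succeeds. */
static int flaky_call(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

The caller still gets -1 with errno set to EINTR once the cap is reached, so it
can decide to skip that one operation and keep the daemon running instead of
bailing out.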

Mind - on many unattended boxen (all of mine) those are mapped back to a logfile
in /var/log/console.log *anyway*, so not a panacea.

.. and I am still of the opinion that something 'external' is amiss, and
unusually so. Perhaps a FreeBSD 8.X GEOM/VMFS GJOURNAL bug/limitation/'feature'
- or just a config issue?

I've had Exim 4.4X on FreeBSD 4.X run the box at 100% load and fill /var on an
ATACONTROL RAID1 to 110% under a 26-hour long DDoS attack as it reported
exhaustion of PostgreSQL connections and deferred [1].

But even at hundreds of thousands of such, Exim itself did not quit, even when
it could no longer log due to lack of space on the mount. Got terribly, terribly
slow, yes. But did not die.

Agree that perusal of the code shows it *possible*. But when, if ever, has it
actually *happened* to others, and what OS and fs+'extras' were in use?

JM2CW,

Bill Hacker

[1] Blackholing is a very bad idea for many reasons. One of them is that it can
make Botnets believe they have found a willing relay, share their glee, and
'pile on'.