[exim-dev] [Bug 1053] log_write() doesn't handle interrupted…

Author: Phil Pennock
Date:  
To: exim-dev
Old-Topics: [exim-dev] [Bug 1053] New: log_write() doesn't handle interrupted writes to log-files
Subject: [exim-dev] [Bug 1053] log_write() doesn't handle interrupted writes to log-files

http://bugs.exim.org/show_bug.cgi?id=1053




--- Comment #3 from Phil Pennock <pdp@???> 2011-11-10 09:11:17 ---
It looks as though some of the discussion relevant to this took place on our
exim-users list instead of in the bug ticket: see the mail thread "exim dies on
the interrupted system call" from December 2010 and January 2011.

Copy/pasting some of that mailing-list discussion into this ticket; note that
Robert Watson is one of the FreeBSD developers.

David Woodhouse:
----------------------------8< cut here >8------------------------------
On Mon, 2010-12-27 at 21:39 -0500, Phil Pennock wrote:
> This is a bug in Exim. Looking at the code, I'm rather shocked that
> it has never bitten us before now.


It doesn't bite because most operating systems don't actually return
short writes on a real file except on EOF. Even though POSIX permits
them to.

(The case you've seen is actually returning -1 / EINTR rather than a
short write where it writes fewer bytes than you asked, but that's just
a special case of the same thing.)

In Linux we avoid doing short writes because we *know* a lot of
userspace will break if we do that. Exim will not be the only program
which breaks on the FreeBSD system in question.

But yes, strictly speaking it *is* a bug in Exim. There are a bunch of
write() calls which we should wrap with our own function that loops
until it's either written all it had to write, or got a *real* error.
----------------------------8< cut here >8------------------------------
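
The wrapper David describes could look roughly like the following. This is
only a sketch under stated assumptions, not Exim's actual code: write_all is a
hypothetical name, and treating a zero-byte return as EIO is a choice made here
to avoid spinning forever.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX declarations under strict -std modes */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write(2) until all len bytes have been
   written or a real error occurs. Returns 0 on success, -1 with errno set. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* a real error; errno is left for the caller */
        }
        if (n == 0) {           /* no progress and no error: give up (assumption) */
            errno = EIO;
            return -1;
        }
        p += n;                 /* short write: skip what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would then only need to treat a -1 return as a failure worth
reporting; both EINTR and short writes are absorbed by the loop.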

Robert N.M. Watson:
----------------------------8< cut here >8------------------------------
Hi David, et al,

As you observe, returning (-1, EINTR) is probably technically to spec and
correct, but actually something you never want the file system to do. The only
cases I'm aware of where FreeBSD file systems intentionally return EINTR are
soft mounts of NFS, or in some rare edge cases, Coda (and maybe AFS by
implication). As such, I'd consider it a bug if EINTR is getting returned from
write(2) on a regular file in UFS2 -- and also a surprising one.

It would be worth tracking this down a bit more, since if such a bug does
exist, we want to fix it. Is there any chance the write(2) is being sent to a
FIFO in the file system, rather than a regular file, or even a socket? Could
Exim have its file descriptors mixed up? Is Exim using threading, in which case
we could be looking at a threading library bug?

(Normally sleeps performed inside the file system on block I/O are
non-interruptible, for all the reasons cited above).
----------------------------8< cut here >8------------------------------

Tony Finch:
----------------------------8< cut here >8------------------------------
28.12.2010 5:39, Phil Pennock wrote:
>
> This is a bug in Exim. Looking at the code, I'm rather shocked that
> it has never bitten us before now.
> http://bugs.exim.org/show_bug.cgi?id=1053


I'm not convinced this is correct, since at the point in the code
where the panic occurs, Exim believes it is writing to a disk file and
short writes should not occur.

Artem, can you reproduce the bug when running Exim under ktrace -di and/or
with the patch to log.c below? The st_mode value might tell us something
interesting.
----------------------------8< cut here >8------------------------------

Robert Watson in reply to Tony:
----------------------------8< cut here >8------------------------------
Agreed -- I think the important question to answer here is whether the
descriptor being written to is a file or not. It looks like there are
paths where stderr, for example, can end up being written to -- which
might well be a pipe/socket/fifo and hence legitimately get EINTR,
making this an Exim bug. Or it might be a file, in which case most
likely we're looking at a file system bug, which we need to track down
and fix.
----------------------------8< cut here >8------------------------------
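
To answer the "is it a file or not" question, the failing code path can call
fstat() on the descriptor and report st_mode alongside the error. The sketch
below shows the general idea only; it is not the actual patch to log.c, and
format_write_failure and its output layout are assumptions made for
illustration.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX declarations under strict -std modes */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical diagnostic helper: format the details of a failed write,
   including the descriptor's st_mode in hex, into out. Returns whatever
   snprintf() returns. */
int format_write_failure(char *out, size_t outlen, int fd,
                         size_t length, long result, int saved_errno)
{
    struct stat st;

    /* If fstat() itself fails we still report the write error. */
    if (fstat(fd, &st) != 0)
        return snprintf(out, outlen,
                        "length=%lu result=%ld errno=%d (%s) st_mode=unknown",
                        (unsigned long)length, result, saved_errno,
                        strerror(saved_errno));

    return snprintf(out, outlen,
                    "length=%lu result=%ld errno=%d (%s) st_mode=%04lx",
                    (unsigned long)length, result, saved_errno,
                    strerror(saved_errno), (unsigned long)st.st_mode);
}
```

The caller should save errno immediately after the failed write() and pass it
in, since fstat() and the formatting calls may overwrite it.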

In the one report we have with some extra diagnostics:
2011-01-22 19:18:46 [25517] failed to write to main log: length=115 result=-1
errno=4 (Interrupted system call) st_mode=81a0

Hex 81a0 is octal 100640, which, when masked with S_IFMT on FreeBSD, yields
octal 0100000, i.e. S_IFREG (a regular file).
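
The same check can be done in a few lines of C. This is an illustrative sketch
only: file_type_name is a hypothetical helper, and the numeric values behind
the S_IF* constants are platform-defined, so the code compares against the
macros rather than hard-coded octal numbers.

```c
#define _XOPEN_SOURCE 700   /* ask for S_IFMT and friends under strict -std modes */
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: name the file type encoded in an st_mode value by
   masking with S_IFMT and comparing against the type constants. */
const char *file_type_name(mode_t mode)
{
    switch (mode & S_IFMT) {
    case S_IFREG:  return "regular file";
    case S_IFIFO:  return "FIFO/pipe";
    case S_IFSOCK: return "socket";
    case S_IFCHR:  return "character device";
    case S_IFDIR:  return "directory";
    default:       return "other";
    }
}
```

Calling file_type_name(0x81a0) takes the S_IFREG branch on systems where
S_IFREG is octal 0100000, which includes FreeBSD; the remaining low bits, 0640,
are the permission bits.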

So, interrupted writes (result=-1, EINTR) when writing to a regular file.

Is your log file on a soft-mounted NFS filesystem? If not, which filesystem
type is it on? ZFS?

The evidence to date suggests that there's something hinky happening on
FreeBSD, such that it's POSIX compliant but not meeting common expectations of
Unix programming. So yes, we should get around to retrying these writes, but a
system doing this is likely to have wider problems.

