Author: Phil Pennock
Date:
To: Илья Антипов
CC: exim-dev
Subject: Re: [exim-dev] strange behaviour of server
On 2013-03-17 at 01:52 +0400, Илья Антипов wrote:
> This night server stopped responding for some time. About 3 hours. During
> this period of time hdd became completely full (there was about 1Tb of free
> space) and 90 of 230 (approx) processes became uninterruptible. And
> suddenly space was freed and everything became stable again. Here what was
> in dmesg.
So local filesystem operations don't typically result in short writes,
thus applications performing file I/O will block, uninterruptibly, if
the filesystem "disappears" (this is most commonly seen with NFS).
> [9417724.085437] [<ffffffff816590ff>] schedule+0x3f/0x60
> [9417724.085442] [<ffffffff81659f07>] __mutex_lock_slowpath+0xd7/0x150
> [9417724.085447] [<ffffffff81187361>] ? do_path_lookup+0x31/0xc0
> [9417724.085450] [<ffffffff81659b1a>] mutex_lock+0x2a/0x50
> [9417724.085454] [<ffffffff8118774e>] do_unlinkat+0x8e/0x1d0
> [9417724.085459] [<ffffffff81178290>] ? vfs_write+0x110/0x180
> [9417724.085462] [<ffffffff8117855a>] ? sys_write+0x4a/0x90
> [9417724.085466] [<ffffffff811883e6>] sys_unlink+0x16/0x20
> [9417724.085470] [<ffffffff81663602>] system_call_fastpath+0x16/0x1b
First: please bear in mind that I am not a kernel developer.
All of the traces go through __mutex_lock_slowpath+0xd7/0x150 after
do_path_lookup, so it looks to me as though the kernel is wedging on
trying to access a block to get a directory entry, to answer userland,
and every process trying to access files under that directory would
block.
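If it happens again, it may help to see which processes are stuck in uninterruptible sleep while the problem is ongoing. The following is only a rough sketch, assuming a Linux /proc filesystem; the function names are made up for illustration and this is not anything Exim ships. It reads the state field from /proc/<pid>/stat and reports processes in state "D".

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If most of the names printed are Exim processes, that would be consistent with the spool directory being the one affected, though it would not prove it.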
At this point, I think that the directory affected was probably the
spool directory for Exim, so every time a queue runner was started, it
would lock.
I would run disk diagnostics, to see if the disk is failing, and force a
complete file-system repair (old school "fsck") if your filesystem
supports such a thing.
I think either the disk is bad or you have file-system bugs. The fact
that the filesystem appeared to fill actually argues for a file-system
bug, and some kind of priority inversion so that more data could be
added but basic cleanup operations were blocked. But this is pure
speculation on my part.
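To get some data rather than speculation, one option is to log how full the filesystem is, in both blocks and inodes, at regular intervals, so that a sudden fill and release shows up with timestamps. A minimal sketch, assuming Python on the server; the spool path below is a guess for illustration and should be replaced with wherever your spool actually lives:

```python
import os
import time


def usage_fractions(st):
    """Return (block_fraction_used, inode_fraction_used) for a statvfs result."""
    # Some filesystems report 0 total inodes; treat that as "no data" (0.0).
    blocks_used = 1 - st.f_bfree / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_ffree / st.f_files if st.f_files else 0.0
    return blocks_used, inodes_used


if __name__ == "__main__":
    spool_dir = "/var/spool/exim"  # assumed path, adjust for your system
    blocks, inodes = usage_fractions(os.statvfs(spool_dir))
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} blocks={blocks:.1%} inodes={inodes:.1%}")
```

Run from cron every minute or so, the output would show whether the space really was consumed and then released, and how quickly.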
Are you running a modern hip filesystem instead of something old and
battle-tested?
Have you checked for updates and errata for the file-system, perhaps
fixed in a newer kernel?
No matter what Exim is doing, it shouldn't be causing all the filesystem
operations to lock up on a lock held, long term, inside the kernel. I
think you should have more luck asking in a community of kernel
maintainers/developers.