Re: [Exim] exim dies silently

Author: Thomas O'Dowd
Date:  
To: exim-users
Subject: Re: [Exim] exim dies silently
Hi all,

Previously, I was having a lot of problems with exim 4.30 dying on me, as
per my email below. I also noticed a few "pipe delivery process timed out"
messages in the exim logs, which seemed unrelated but were also worrying.

I have a pipe transport to deliver messages to a script which updates a
database and does some other processing. Due to a heavy load on the DB
machine and some badly optimised queries, it seems that now and again
mail delivery in that script would take over 1 hour (the default pipe
timeout), after which it was killed by exim.

Looking at the exim code for this timeout...

=========== from pipe.c

if ((rc = child_close(pid, timeout)) != 0)
{
  /* The process did not complete in time; kill its process group and fail
  the delivery. It appears to be necessary to kill the output process too, as
  otherwise it hangs on for some time if the actual pipe process is sleeping.
  (At least, that's what I observed on Solaris 2.5.1.) Since we are failing
  the delivery, that shouldn't cause any problem. */

  if (rc == -256)
    {
    killpg(pid, SIGKILL);
    kill(outpid, SIGKILL);
    addr->transport_return = FAIL;
    addr->message = string_sprintf("pipe delivery process timed out",
      tblock->name);
    }


===========

It kills the process group of the filter and exim's pipe process. I
wonder if it's possible that, under some conditions, exim would
accidentally kill itself in this case?
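
A minimal Python sketch of that concern, not taken from the Exim source:
a signal sent to a process group reaches every member of the group, so it
is only safe for the parent if the child sits in a group of its own. The
function name kill_child_group and the 60-second child are hypothetical,
and nothing here shows how Exim itself sets up its process groups.

```python
import os
import signal
import subprocess
import sys


def kill_child_group():
    """Start a child in its own process group, then kill that group."""
    # start_new_session=True makes the child call setsid(), so its
    # process group ID equals its own PID and differs from ours.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    child_pgid = os.getpgid(child.pid)
    parent_pgid = os.getpgid(0)

    # Only signal the group if it really is the child's own group.
    # If the two IDs matched, killpg() would hit this process as well.
    if child_pgid != parent_pgid:
        os.killpg(child_pgid, signal.SIGKILL)
    else:
        child.kill()

    status = child.wait()  # a negative value means "killed by that signal"
    return child.pid, child_pgid, parent_pgid, status
```

A process killed with SIGKILL cannot catch the signal, so it gets no
chance to write a log entry, and SIGKILL does not produce a core dump.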

I changed the script that the pipe calls to be more efficient, and also
added an alarm so that it kills itself if it runs for more than 15
minutes, returning a temporary-error exit code. Thus messages no longer
bounce on a timeout.
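
The alarm approach can be sketched roughly as follows. This is not the
actual script: the names run_with_deadline, on_alarm and slow_work are
hypothetical, and exit code 75 (EX_TEMPFAIL from sysexits.h) is an
assumption. Whichever code is used should be one that the pipe
transport's temp_errors setting treats as temporary.

```python
import signal
import sys
import time

EX_TEMPFAIL = 75            # sysexits.h "temporary failure"; assumed exit code
DEADLINE_SECONDS = 15 * 60  # the 15-minute limit described above


def on_alarm(signum, frame):
    # sys.exit() raises SystemExit in the main thread, so the script
    # unwinds and exits with the temporary-failure status.
    sys.exit(EX_TEMPFAIL)


def run_with_deadline(work, seconds=DEADLINE_SECONDS):
    """Run work(), exiting with EX_TEMPFAIL if it overruns the deadline."""
    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        return work()
    finally:
        signal.alarm(0)  # cancel the pending alarm on every exit path


def slow_work():
    # Hypothetical stand-in for a delivery that takes too long.
    time.sleep(5)
```

Wrapping the real message handler in run_with_deadline() means the script
gives up well before the transport's own timeout, so the message is
deferred and retried rather than bounced.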

Exim hasn't died on me for over 2 weeks now, whereas before I sped up
this script it was dying on me every other day or so.

What makes me think it was killing itself is that there were no messages
in the logfiles, nothing sent to stdout/stderr and it never generated a
core dump.

I tried replicating the timeout problem with the following...

testpipe:
  driver = accept
  transport = test_pipe
  local_part_prefix = test-

# Test pipe that sleeps for 6 minutes and ignores mail.
# The pipe times out after 3 minutes.
test_pipe:
  driver = pipe
  return_path_add
  home_directory = /home/tom/projects
  command = /home/tom/projects/sleep.sh
  timeout = 3m
  user = tom
  group = tom
  log_output = true
  return_fail_output
  check_string = "From "
  escape_string = ">From "

I have a script which just execs a python script. In the real script I
just set some environment variables as well.
========= sleep.sh
#!/bin/bash
# Exec the python mhandler prog
PATHTO=$(dirname "$0")
exec "${PATHTO}/sleep.py"

This script calls sleep.py which just sleeps for 6 mins and then tries
to read from stdin.

========== sleep.py
#!/usr/bin/env python

import sys
import time

# sleep 6 mins
time.sleep(360)

# read all input from pipe
line = sys.stdin.readline()
while line:
    line = sys.stdin.readline()

===========

Unfortunately, I can't replicate the dying problem though. Perhaps I
need to send a bit more email through it or something. Anyway, just
thought I'd share in case someone spots something that I don't.

BTW. exim version was 4.30 running on RH8 linux using kernel version
2.4.20-20.8smp.

Tom.

On Wed, 2004-01-07 at 11:24, Thomas O'Dowd wrote:
> Dear all,
>
> I've been experiencing problems with EXIM dying silently on a Redhat 8
> machine. I'm running 4.30 which I updated to recently. Previously I was
> running 3.36 which was also crashing (main reason I decided to update).
>
> Exim 3.x had been running on a different machine running RH6.2 for me
> for a long time with the same configuration. Recently we moved hardware
> and OS version. Only after the move, did we start experiencing the
> problem.
>
> This seems to suggest either a hardware problem or a problem with some
> library on the new OS or some setting. The only problem is that I have other
> long term running processes on this machine, including a relational
> database and a few java based engines that don't crash. Exim is the only
> process that experiences these problems. The new machine itself was also
> in production previously and is running below capacity in terms of
> memory and disk space. It's a twin cpu machine with 2GB of ram.
>
> Doing a "ulimit -a" just before exim is started by the start script,
> shows the following.
>
> core file size        (blocks, -c) unlimited
> data seg size         (kbytes, -d) unlimited
> file size             (blocks, -f) unlimited
> max locked memory     (kbytes, -l) unlimited
> max memory size       (kbytes, -m) unlimited
> open files                    (-n) 1024
> pipe size          (512 bytes, -p) 8
> stack size            (kbytes, -s) 8192
> cpu time             (seconds, -t) unlimited
> max user processes            (-u) 7168
> virtual memory        (kbytes, -v) unlimited
> Starting exim: [  OK  ]
>
> So cpu time shouldn't be a problem. I've checked lsof and it's not
> running out of file descriptors. It just disappears from the process
> list. Also, even though I allow it to dump core, it doesn't???
>
> There are no unusual entries in the mainlog/paniclog or rejectlog files.
> There are a few frozen mails in the mailq but nothing extraordinary.
>
> I'm running out of ideas trying to debug this problem. Has anyone
> experienced anything like this before?
>
> Best regards,
>
> Tom.
>
>
> --
>
> ## List details at http://www.exim.org/mailman/listinfo/exim-users Exim details at http://www.exim.org/ ##
>