[exim-dev] [Bug 2715] Degradation over time

Top Page
Delete this message
Reply to this message
Author: admin
Date:  
To: exim-dev
Subject: [exim-dev] [Bug 2715] Degradation over time
https://bugs.exim.org/show_bug.cgi?id=2715

a.key <a.key@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |a.key@???


--- Comment #6 from a.key <a.key@???> ---
Hi,
It was my work mate who created this ticket so I'm just gonna continue here.

We haven't enabled the debugging yet as this may skew the results we're getting
anyways but will be our next step.

What we observe is the following:

We send an email over smtp to local exim using local php's swiftmailer script.
We're catching exim's children with the following:

while ! pgrep -P 1307 -a exim -n;do sleep 0.1 ;done ; strace -s 300 -f -p
$(pgrep -P 1307 exim -n)

where 1307 is the parent exim process at the time.

This catches the following:
...
write(4, "2021-05-14 16:14:57.594 Received from webmaster@???
H=localhost [127.0.0.1]:47476 I=[127.0.0.1]:25 P=esmtp S=545 RT=0.045s
id=60197d6565e9b0544ebe1fdfa477ebff@??? T=\"Wonderful Subject\"\n",
208) = 208
close(4)                                = 0
munmap(0x7fb85a18e000, 4096)            = 0
select(9, [8], NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
stat("", 0x7ffed09ad840)                = -1 ENOENT (No such file or directory)
close(9)                                = 0
open("/var/log/exim/main.log", O_WRONLY|O_APPEND|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0640, st_size=1485802, ...}) = 0
write(4, "2021-05-14 16:14:57.595 1lhZWj-0001T1-Dr <= webmaster@???
H=localhost [127.0.0.1]:47476 I=[127.0.0.1]:25 P=esmtp S=545 RT=0.045s
id=60197d6565e9b0544ebe1fdfa477ebff@??? T=\"Wonderful Subject\"
from <webmaster@???> for a.key@???\n", 265) = 265
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=43,
tv_usec=625500}}, NULL) = 0
rt_sigsuspend(~[ALRM RTMIN RT_1], 8


-----
~ 30s delay
-----

)    = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, {sa_handler=0x55a88a19f6a0, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7fb85763f630}, NULL, 8) = 0
rt_sigreturn({mask=[ALRM]})             = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(3)                                = 0
munmap(0x7fb85a18f000, 4096)            = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM],
sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb857298400},
{sa_handler=SIG_IGN, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART,
sa_restorer=0x7fb857298400}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT],
sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb857298400},
{sa_handler=SIG_IGN, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART,
sa_restorer=0x7fb857298400}, 8) = 0
write(7, "250 OK id=1lhZWj-0001T1-Dr\r\n", 28) = 28
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7fb85a170b50) = 5908
rt_sigaction(SIGTERM, {sa_handler=0x55a88a1ea340, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7fb85763f630}, NULL, 8) = 0
setsockopt(8, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
rt_sigaction(SIGALRM, {sa_handler=0x55a88a1ea330, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7fb85763f630}, NULL, 8) = 0
alarm(300)                              = 0
read(8, strace: Process 5908 attached
 <unfinished ...>






I cut out the unnecessary bits but basically...
After the final . (dot) is send to finish writing the body, exim does a little
bit processing (AV+spamassassin) scanning and eventually get to the point where
it sleeps for about 30 or more seconds.
So as seen above it sends a log message and stops at:

rt_sigsuspend(~[ALRM RTMIN RT_1], 8

This lasts about 30s or more and eventually continues. But the 30s (increasing
with the lifetime of the exim process) delay is causing the swiftmailer and the
php itself to timeout and error out.


I've managed to find similar bug reports:
https://www.mail-archive.com/exim-users@exim.org/msg54137.html

And more recently:
https://lists.exim.org/lurker/message/20210330.140234.fe889005.en.html

...both mentioning the delay at "rt_sigsuspend(~[ALRM RTMIN RT_1], 8"




We do realize the debug level log would be more helpful here but please
understand that this issue can only be observed after some time (days) of
starting of the exim service hence debugging is difficult as we'd need to leave
it running in this mode for days. Simple restart of the service fixes this
issue straight away but it's not something that we'd like to deploy as a
permanent fix.

--
You are receiving this mail because:
You are on the CC list for the bug.