[exim-dev] Re: [Bug 3085] New: Allow UTF-8 for log output

Author: Andrew C Aitchison
Date:  
To: exim-dev
Subject: [exim-dev] Re: [Bug 3085] New: Allow UTF-8 for log output
On Sat, 23 Mar 2024, Exim Bugzilla via Exim-dev wrote:

> https://bugs.exim.org/show_bug.cgi?id=3085
>
>            Bug ID: 3085
>           Summary: Allow UTF-8 for log output
>           Product: Exim
>           Version: N/A
>          Hardware: All
>                OS: Linux
>            Status: NEW
>          Severity: bug
>          Priority: medium
>         Component: Logging
>          Assignee: unallocated@???
>          Reporter: forza@???
>                CC: exim-dev@???
>
> This is probably not a bug, but more of a request for comments.
>
> I am logging to syslog instead of files. The syslog is handled by syslog-ng,
> and I parse the logfiles with Fail2Ban.
>
> The exim.conf:
>
> ### Logging
> log_selector            = +all
> log_file_path            = syslog
> syslog_timestamp        = false
> syslog_duplication        = false
> syslog_processname        = exim
> SYSLOG_LONG_LINES        = yes
>
>
> Now, my issue is that sometimes Fail2Ban fails to read some of the lines and
> outputs a warning like this:
>
> 2024-03-17T19:23:33.870+00:00 warning fail2ban.filter[2922]: WARNING Error
> decoding line from '/var/log/exim.log' with 'UTF-8'.
> 2024-03-17T19:23:33.870+00:00 warning fail2ban.filter[2922]: WARNING Consider
> setting logencoding to appropriate encoding for this jail. Continuing to
> process line ignoring invalid characters: b'2024-03-17T19:23:33.698+00:00
> notice exim[5673]: [12\\21] F From: "\xbe\xe7\xb9\xcc\xbc\xf8"
> <msoony@???>\n'


[ So syslog-ng is writing exim's logging to /var/log/exim.log
   I guess there are reasons to go the indirect way.      ]

How do the relevant lines look in /var/log/exim.log? You could check with
    grep "2024-03-17T19:23:33.698+00:00" /var/log/exim.log
I guess the result would be something like:
2024-03-17T19:23:33.698+00:00 notice exim[5673]: [12\\21] F From: "��̼�" <msoony@???>
?

> So, this leads me to my current question. Can Exim be set to output
> UTF-8 encoded logs to syslog?


> Apparently, the syslog format
> according to RFC-5424 says " MSG SHOULD be UNICODE, encoded using
> UTF-8", but it seems to allow plain US-ASCII too.


[ For a piece of text, if the plain US-ASCII encoding is correct
then that byte stream is automatically valid UTF-8 and represents
that text correctly.
It is impossible to support UTF-8 and not handle
(true 7bit) plain US-ASCII correctly ! ]
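
The point above can be checked in a few lines of Python. This is only a minimal sketch; the sample log line and its address are made up for illustration.

```python
# Hypothetical log line containing only 7-bit US-ASCII characters.
line = 'notice exim[5673]: F From: "Andrew" <andrew@example.org>'

ascii_bytes = line.encode("ascii")
utf8_bytes = line.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes,
# so any UTF-8 reader handles the line without special treatment.
print(ascii_bytes == utf8_bytes)
print(ascii_bytes.decode("utf-8") == line)
```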

> https://datatracker.ietf.org/doc/html/rfc5424#section-6.4
>
> I believe syslog-ng could handle non-UTF-8 messages, using flags(sanitize-utf8)
> on the source, however the manual specifies:
>
> "The HEADER part of the message must be in plain ASCII format, the parameter
> values of the STRUCTURED-DATA part must be in UTF-8, while the MSG part should
> be in UTF-8. The different parts of the message are explained in the following
> sections."
>
> Perhaps I am overthinking all of this. I'd appreciate some thoughts on correct
> logging configurations.


I think you are looking in the wrong place for the problem.
It is not that exim is disallowing UTF-8 output in the log,
but that occasionally the output is not valid UTF-8.

The fundamental issue is we have "garbage in",
so will inevitably have "garbage out".

Exim is trying to log some "text" - the display-name of the From: header -
which should be ASCII (unless SMTPUTF8 is enabled, in which case it can be
UTF-8) but in this case is not UTF-8 or ASCII, but some unknown byte-stream.
[ Do you happen to know what language or
   character set this sender writes their name in ? ]

As I understand it, exim logs this byte-stream as-is and there is nothing that
syslog-ng or fail2ban could reasonably do to interpret it correctly.
I believe that if you reverted to having exim log to a file,
the same issue would be there, probably with exactly the same byte-stream
as the syslog.
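
One rough way to explore the character-set question is to try the bytes from the fail2ban warning against a few legacy encodings. This is only a sketch: the list of candidate codecs is an assumption, and a decode that succeeds does not prove the sender used that encoding.

```python
# Display-name bytes copied from the fail2ban warning quoted above.
raw = b"\xbe\xe7\xb9\xcc\xbc\xf8"

# Candidate encodings are a guess; nothing in the log identifies the charset.
candidates = ["utf-8", "euc_kr", "gb2312", "shift_jis", "iso8859_15"]

def try_decodings(data, encodings):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

results = try_decodings(raw, candidates)
for enc, text in results.items():
    print(f"{enc:12} {text!r}")
```

A single-byte codec such as iso8859_15 accepts these bytes but can still yield nonsense, so the output needs a human reader to judge which candidate looks like a real name.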

The best "fix" might be for exim to log this byte-stream coded as hex,
but in many cases that would be less readable than doing nothing.
For example
     From: "André Aitchison" <andrew@???>
where the e-acute was encoded in LATIN-9 is not valid UTF-8,
but it is much clearer left like that than logged as
     From: "\x41\x6e\x64\x72\xe9\x20\x41\x69\x74\x63\x68\x69\x73\x6f\x6e" <andrew@???>
- and then exim would have to spend time figuring out when the display-name
was not valid UTF-8.
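
A possible middle ground would be to escape only the bytes that fail UTF-8 decoding and leave everything else readable. The sketch below is in Python rather than Exim's own C, and it does not describe anything Exim actually does. The function name `escape_invalid_utf8` is hypothetical.

```python
def escape_invalid_utf8(data: bytes) -> str:
    """Decode as UTF-8, rendering each undecodable byte as \\xNN."""
    return data.decode("utf-8", errors="backslashreplace")

# e-acute as the single byte 0xE9 (LATIN-9): invalid as UTF-8.
latin9 = b'From: "Andr\xe9 Aitchison"'
# e-acute as the two bytes 0xC3 0xA9: valid UTF-8, passed through untouched.
utf8 = 'From: "André Aitchison"'.encode("utf-8")

print(escape_invalid_utf8(latin9))  # From: "Andr\xe9 Aitchison"
print(escape_invalid_utf8(utf8))    # From: "André Aitchison"
```

This still costs a validation pass over every logged string, which is the overhead mentioned above.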

I have not used fail2ban for email logs.
Is the message merely annoying, or is this stopping you from blocking 
<msoony@???> because other lines in the log indicate a problem ?

-- 
Andrew C. Aitchison                      Kendal, UK
                    andrew@???

