Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

Author: Axel Rau
Date:
To: Phil Pennock
CC: exim-dev
Subject: Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

On 24.07.2013 at 00:24, Axel Rau <Axel.Rau@???> wrote:

> On 21.07.2013 at 05:35, Phil Pennock <pdp@???> wrote:
>
>> On 2013-07-20 at 19:05 +0200, Axel Rau wrote:
>>> As exim works with utf-8 strings, my naive assumption was, that a header like
>>> Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
>>> (RFC 2047) will be converted to utf-8 by exim before I access it via $h_Subject: .
>>> Looking at the complexity of expand.c, this seems to be proved.
>>> Can anybody confirm this?
>>
>> Exim's behaviour depends upon what value was defined for HEADERS_CHARSET
>> in Local/Makefile when Exim was built. You also need HAVE_ICONV=yes but
>> that's supplied by default on some OSes.
>>
>> The sample configuration supplied in src/EDITME sets
>> HEADERS_CHARSET="ISO-8859-1".
>>
>> For myself, I always set HEADERS_CHARSET="UTF-8".
>
> Indeed, the FreeBSD ports system defaults to ISO-8859-1.
> I reinstalled with UTF-8. Unfortunately exim -bV does not list the HEADERS_CHARSET.
> From my simple tests, it seems to work:
> Subject: TEST =?ISO-8859-1?q?Gl=FCckliche_m=F6gliche_=C4chtung?=
> was recorded correctly in UTF-8 in the DB.
>>
>>> If the header contains non-ASCII 8-bit characters (= illegal), I would like exim to replace them with "?".
>>> Can this be done in the exim config or do we need a new expansion function for that?
>>
>> I *suspect* that a new expansion function would be needed, but I could
>> be proven wrong by a particularly clever hack. I also suspect that, if
>> we were to implement this, we'd default the replacement character to be
>> codepoint 0xFFFD, the Unicode REPLACEMENT CHARACTER.
>
>
> Wouldn't this be a reasonable enhancement of the existing conversion functionality anyway?

After 10 days running with HEADERS_CHARSET="UTF-8" in Local/Makefile and
PQsetClientEncoding(pg_conn, "UTF8"); in lookups/pgsql.c, I still get tons of
'invalid byte sequence for encoding "UTF8"' errors (as expected from malformed mails).
I would like to prepare a patch for a bug report that replaces each illegal sequence
with the Unicode REPLACEMENT CHARACTER.

Is there any place in the exim code base, where this information is available?
(I looked at rfc2047.c, expand.c…)

I need a solution in order to log header contents in a pgsql backend.
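
To make the idea concrete, here is a minimal sketch of the kind of replacement discussed above. It is not taken from the Exim sources: utf8_sanitize() and utf8_seq_len() are hypothetical names, and the policy of emitting one U+FFFD (bytes EF BF BD) per offending byte is only one of several reasonable choices. It assumes the header has already been decoded to what is supposed to be UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Exim.
   Returns the length (1..4) of the well-formed UTF-8 sequence starting at s,
   or 0 if the bytes at s do not form one. avail is the number of bytes left. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
    size_t need, i;

    if (c < 0x80) return 1;               /* plain ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        need = 1;
    } else if (c >= 0xE0 && c <= 0xEF) {
        need = 2;
        if (c == 0xE0) lo = 0xA0;         /* reject overlong forms */
        else if (c == 0xED) hi = 0x9F;    /* reject UTF-16 surrogates */
    } else if (c >= 0xF0 && c <= 0xF4) {
        need = 3;
        if (c == 0xF0) lo = 0x90;         /* reject overlong forms */
        else if (c == 0xF4) hi = 0x8F;    /* reject values above U+10FFFF */
    } else {
        return 0;                         /* stray continuation or bad lead byte */
    }

    if (avail < need + 1) return 0;       /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i <= need; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return need + 1;
}

/* Hypothetical helper, not part of Exim.
   Copies in[0..in_len) to out, replacing each byte that is not part of a
   well-formed UTF-8 sequence with U+FFFD. The result is NUL-terminated.
   Returns the number of bytes written (excluding the NUL), or (size_t)-1
   if out_size is too small. */
size_t utf8_sanitize(const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t out_size)
{
    size_t i = 0, o = 0;

    while (i < in_len) {
        size_t n = utf8_seq_len(in + i, in_len - i);
        if (n == 0) {
            if (o + 3 >= out_size) return (size_t)-1;
            out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
            i++;                          /* skip only the offending byte */
        } else {
            if (o + n >= out_size) return (size_t)-1;
            memcpy(out + o, in + i, n);
            o += n;
            i += n;
        }
    }
    if (o >= out_size) return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Since every input byte yields at most three output bytes, an output buffer of 3 * in_len + 1 bytes is always sufficient. Whether something like this belongs in the header decoding path or in a new expansion operator is exactly the open question.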

Thanks, Axel

---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius
