Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

Author: Phil Pennock
Date:
To: Axel Rau
CC: exim-dev
Subject: Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

On 2013-07-30 at 13:26 +0200, Axel Rau wrote:
> After 10 days running with HEADERS_CHARSET="UTF-8" in Local/Makefile and
> PQsetClientEncoding(pg_conn, "UTF8"); in lookups/pgsql.c, I still get tons of
> 'invalid byte sequence for encoding "UTF8"' (as expected from malformed mails).
> I would like to prepare a patch for a bug report to replace each illegal
> sequence with the Unicode REPLACEMENT CHARACTER (U+FFFD).
>
> Is there any place in the exim code base, where this information is available?
> (I looked at rfc2047.c, expand.c…)
>
> I need a solution in order to log header contents in a pgsql backend.

As far as Exim is concerned, it's a string of bytes, which happen to
have been prepared by normalising through iconv from another stream of
bytes.

I suspect that the cleanest approach that fits with Exim would be to add
a new expansion operator ${utf8clean:...} in expand.c, with the core
routine living either in that file or in string.c.

I'm not aware of any standard library code which can check the string,
although UTF-8 is so clean and well-defined that it should be around 10
lines of code with a switch inside a loop. Looking around I see that
the Perl Unicode::CheckUTF8 module describes itself as a wrapper around
some Unicode Consortium code for accomplishing this. I suspect that
digging that out, checking the license, and using the same base code in
Exim is the way to go.
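
To make the "switch inside a loop" idea concrete, here is a rough,
untested-in-Exim sketch in plain C of what such a core routine might look
like. The names utf8_seq_len() and utf8_clean() are hypothetical, not
existing Exim functions. It uses malloc() where Exim would presumably use
its own store allocator, and it assumes the simplest policy: every byte
that cannot start or continue a well-formed sequence is replaced by one
U+FFFD.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes there are not well-formed.
 * Rejects overlong forms, surrogates and code points above U+10FFFF. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for 2nd byte */
    size_t need;                          /* continuation bytes required */

    if (c < 0x80) {
        return 1;                         /* plain ASCII */
    } else if (c >= 0xC2 && c <= 0xDF) {
        need = 1;
    } else if (c >= 0xE0 && c <= 0xEF) {
        need = 2;
        if (c == 0xE0) lo = 0xA0;         /* would be overlong */
        else if (c == 0xED) hi = 0x9F;    /* would be a surrogate */
    } else if (c >= 0xF0 && c <= 0xF4) {
        need = 3;
        if (c == 0xF0) lo = 0x90;         /* would be overlong */
        else if (c == 0xF4) hi = 0x8F;    /* would exceed U+10FFFF */
    } else {
        return 0;                         /* stray continuation, C0/C1, F5+ */
    }

    if (avail < need + 1) return 0;       /* truncated at end of input */
    if (s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i <= need; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return need + 1;
}

/* Hypothetical helper: returns a malloc'd, NUL-terminated copy of the
 * first len bytes of in, with each offending byte replaced by U+FFFD
 * (EF BF BD). Returns NULL on allocation failure; caller frees. */
char *utf8_clean(const char *in, size_t len)
{
    const unsigned char *s = (const unsigned char *)in;
    char *out = malloc(len * 3 + 1);      /* worst case: all bytes bad */
    size_t i = 0, o = 0;

    if (out == NULL) return NULL;
    while (i < len) {
        size_t n = utf8_seq_len(s + i, len - i);
        if (n == 0) {
            memcpy(out + o, "\xEF\xBF\xBD", 3);
            o += 3;
            i += 1;                       /* resync on the next byte */
        } else {
            memcpy(out + o, s + i, n);
            o += n;
            i += n;
        }
    }
    out[o] = '\0';
    return out;
}
```

Under those assumptions, utf8_clean("a\xFF" "b", 3) would give "a",
U+FFFD, "b", and already-valid input comes back unchanged. Whether one
replacement per bad byte or one per bad sequence is preferable is a
policy choice that the real patch would need to settle.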

-Phil
