Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

Author: Axel Rau
Date:  
To: exim-dev
Subject: Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

On 20.07.2013 at 17:25, Marcin Mirosław <marcin@???> wrote:

> On 2013-07-20 14:17, Axel Rau wrote:
>> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.
>>
>> Occasionally, though, PostgreSQL rejects an INSERT of header items or main log events (our log host uses PostgreSQL as its backend) with an "invalid byte sequence" error, like here:
>>
>> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc
>>
>> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
>> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info@???>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))
>>
>> Does this come from bad encoding of original mail headers?
>> Is there an easy solution to skip bad characters before sending them to the DB?
>>
>> In lookups/pgsql.c:258 I see:
>> PQsetClientEncoding(pg_conn, "SQL_ASCII");
>>
>> but I think it's not related.
>
> Hi Axel!
> I suspect it is related. If you insert text into PostgreSQL, you
> need to know which encoding that text uses. If you know the
> inserted text is in UTF-8, you should set "client_encoding" to UTF-8.
> But with emails you never know which encoding will be used. In theory,
> headers should contain only basic ASCII characters.
> You can:
> a) reject mail with non-ASCII characters in the Subject,
> b) encode the Subject (e.g. with base64) and then insert it into the database,
> c) guess which encoding was used in the Subject and set the
> "client_encoding" parameter accordingly, or
> d) use the "C" collation for the given database/table in PostgreSQL; it allows
> you to insert any characters into the table. But you will lose the ability
> to retrieve tuples in your preferred charset. (E.g. you can keep text in UTF-8
> in the database, and when you set "client_encoding" to e.g. 8859-2 you will
> get the text in 8859-2. With the "C" collation, PostgreSQL does not convert
> to e.g. ISO 8859-2.)
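
To make options (a) and (b) concrete, here is a minimal sketch in Python rather than in exim's own configuration language. The function names is_pure_ascii and to_db_safe are hypothetical and only illustrate the idea; the sample bytes mimic the Latin-1 "ü" (0xfc) from the error message above.

```python
import base64


def is_pure_ascii(raw: bytes) -> bool:
    """Option (a): True only if every byte is 7-bit ASCII."""
    return all(b < 0x80 for b in raw)


def to_db_safe(raw: bytes) -> str:
    """Option (b): base64-encode the raw header bytes.

    The result is plain ASCII, so it is valid in any client encoding.
    """
    return base64.b64encode(raw).decode("ascii")


# Hypothetical raw Subject containing a Latin-1 byte (0xfc = "ü").
subject = b"Newsletter Staatstheater N\xfcrnberg"
print(is_pure_ascii(subject))
print(to_db_safe(subject))
```

The trade-off with (b) is that the stored value is no longer searchable or readable as text without decoding it first.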

Since exim works with UTF-8 strings, my naive assumption was that a header like
    Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
(RFC 2047) is converted to UTF-8 by exim before I access it via $h_Subject:.
Judging from the complexity of expand.c, this seems to be the case.
Can anybody confirm this?
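
For illustration, this is the conversion I would expect, sketched with Python's standard email.header module. It is not exim's implementation; header_to_utf8 is a hypothetical helper, and the fallback for unknown charsets is my own assumption.

```python
from email.header import decode_header


def header_to_utf8(value: str) -> bytes:
    """Decode RFC 2047 encoded words and return the header as UTF-8 bytes."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            try:
                text = chunk.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset label: assumed fallback, keep ASCII only.
                text = chunk.decode("ascii", errors="replace")
        else:
            # Headers without encoded words come back as plain str.
            text = chunk
        parts.append(text)
    return "".join(parts).encode("utf-8")


print(header_to_utf8("Neue =?ISO-8859-1?q?Gl=E4ser?="))
```

Here the ISO-8859-1 byte 0xE4 ("ä") ends up as the two-byte UTF-8 sequence 0xC3 0xA4.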


If the header contains non-ASCII 8-bit characters (which is illegal), I would like exim to replace them with "?".
Can this be done in the exim config, or do we need a new expansion function for that?

I must ensure valid utf-8 at the DB interface.
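
The behaviour I have in mind is roughly the following, again sketched in Python and not as an existing exim feature. sanitize_utf8 is a hypothetical name. Note one limitation of this sketch: a genuine U+FFFD already present in valid input would also become "?".

```python
def sanitize_utf8(raw: bytes) -> str:
    """Return text that is guaranteed to encode as valid UTF-8.

    Invalid byte sequences are decoded to U+FFFD by errors="replace"
    and then mapped to "?".
    """
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


# Raw Latin-1 byte 0xfc, as in the PostgreSQL error above.
print(sanitize_utf8(b"Staatstheater N\xfcrnberg"))
# Already valid UTF-8 passes through unchanged.
print(sanitize_utf8("Staatstheater Nürnberg".encode("utf-8")))
```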

Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius