Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

Author: Marcin Mirosław
Date:
To: exim-dev
Subject: Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog
On 2013-07-20 14:17, Axel Rau wrote:
> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.
>
> Occasionally PostgreSQL complains during INSERT of header items or main log events (our log host uses PostgreSQL as backend) about an invalid byte sequence, like here:
>
> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc
>
> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info@???>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))
>
> Does this come from bad encoding of original mail headers?
> Is there an easy solution to skip bad characters before sending them to the DB?
>
> In lookups/pgsql.c:258 I see:
> PQsetClientEncoding(pg_conn, "SQL_ASCII");
>
> but I think it's not related.


Hi Axel!
I suspect it is related. When you insert text into PostgreSQL, you need
to know which encoding that text uses. If you know the inserted text is
UTF-8, you should set client_encoding to UTF8. But with email you never
know which encoding a header actually uses. In theory, headers should
contain only plain ASCII characters.
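
To illustrate why the encoding matters, here is a minimal Python sketch. It
assumes the rejected byte 0xfc from the log above is the "ü" of "Nürnberg"
sent as ISO 8859-1 instead of UTF-8; the sample byte string and the helper
name is_valid_utf8 are made up for illustration.

```python
# Hypothetical raw header bytes: "Nürnberg" encoded as ISO 8859-1,
# where "ü" is the single byte 0xfc (an assumption about the failing mail).
raw = b"Staatstheater N\xfcrnberg"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same bytes are invalid as UTF-8 but readable as ISO 8859-1.
print(is_valid_utf8(raw))
print(raw.decode("iso-8859-1"))
# Properly encoded UTF-8 passes the check.
print(is_valid_utf8("Nürnberg".encode("utf-8")))
```

A server that expects UTF-8 has to reject the first form, which would match
the "invalid byte sequence" error quoted above.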
You can:
a) reject mail with non-ASCII characters in the Subject
b) encode the Subject with e.g. base64 before inserting it into the database
c) guess which encoding was used in the Subject, then set client_encoding
accordingly
d) use the "C" collation for the given database in PostgreSQL, together
with the SQL_ASCII database encoding, which is what makes the server
accept arbitrary bytes. You lose the ability to fetch rows in your
preferred charset, though. (For example, if the database stores text as
UTF-8 and you set client_encoding to LATIN2, i.e. ISO 8859-2, you get the
text back in ISO 8859-2. With SQL_ASCII, PostgreSQL performs no such
conversion.)
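
Options a) to c) could be sketched in Python roughly as follows. This is
only an illustration done outside Exim: the function names, the candidate
encoding list and the sample Subject bytes are assumptions, and guessing an
encoding is a heuristic that can pick the wrong 8-bit charset.

```python
import base64


def is_ascii(data: bytes) -> bool:
    """Option a): True if every byte is 7-bit ASCII; otherwise reject."""
    return all(byte < 0x80 for byte in data)


def to_base64(data: bytes) -> str:
    """Option b): ASCII-only form that is valid under any client encoding."""
    return base64.b64encode(data).decode("ascii")


def guess_decode(data: bytes, candidates=("utf-8",)):
    """Option c): try candidate encodings in order, return (text, encoding).

    Falls back to ISO 8859-1, which maps every byte value and so never
    fails, but may be the wrong guess for other 8-bit charsets.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("iso-8859-1"), "iso-8859-1"


# Hypothetical Subject bytes with "ü" sent as the single byte 0xfc.
subject = b"Newsletter Staatstheater N\xfcrnberg"

print(is_ascii(subject))        # False: option a) would reject this mail
print(to_base64(subject))       # option b): store this string instead
text, encoding = guess_decode(subject)
print(encoding, text)           # option c): decoded text plus the guess
```

With option c), the decoded text can be re-encoded as UTF-8 before the
INSERT, so the client encoding stays fixed at UTF8.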

Regards