Re: [exim] transport filter with know charset encoding

Author: Phil Pennock
Date:
To: Cyborg
CC: exim-users
New topics: [exim] Solved : Re: transport filter with know charset encoding
Subject: Re: [exim] transport filter with know charset encoding
On 2013-08-26 at 15:48 +0200, Cyborg wrote:
> Is there any chance that a transport filter gets knowledge about the
> message's charset before it is processed?


There is no such thing as a "message charset".

The headers are supposed to be in ASCII, with RFC 2047 encoded-words used
to encapsulate non-ASCII characters. Thus you might see:

  =?utf-8?q?=E2=98=83?=    or    =?utf-8?b?4piD?=


to get the Unicode character SNOWMAN in a message header.
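
As a small illustration (a Python sketch, not something from the original
thread), the standard library's email.header module decodes both forms
shown above to the same character. The variable names are made up for the
example.

```python
from email.header import decode_header, make_header

# The two encoded-words from the text: "q" is quoted-printable style,
# "b" is base64. Both wrap the UTF-8 octets E2 98 83.
encoded_words = ["=?utf-8?q?=E2=98=83?=", "=?utf-8?b?4piD?="]

decoded = []
for word in encoded_words:
    # decode_header() yields (octets, charset) pairs;
    # make_header() turns them back into text.
    decoded.append(str(make_header(decode_header(word))))

print(decoded)
```

Note that the charset travels inside each encoded-word, so nothing has to
be guessed.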

The _body_ of an email can have different character sets in different
MIME parts. There is no one overriding character set across the entire
body. Just as you often see (in the "Western" world)
multipart/alternative messages with text/plain and text/html, you could
equally see a message which is multipart/alternative with text/plain
twice, once charset=UTF-8 and once charset=KOI8-R.

Anything which needs to parse message bodies has to treat them as a
stream of octets ("bytes" on most modern computers), pick apart the
MIME encapsulation, and then convert each part's octets to characters
using the charset declared for that part.
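
A minimal sketch of that process, assuming Python's standard email
package does the MIME parsing. The sample message is invented for the
example and mirrors the UTF-8 plus KOI8-R case described above.

```python
from email import message_from_bytes, policy

# A hand-built multipart/alternative message with two text/plain parts
# in different charsets. Note that it is built as bytes, not str.
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Grüße\r\n".encode("utf-8")
    + b"--b\r\n"
    b"Content-Type: text/plain; charset=KOI8-R\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Привет\r\n".encode("koi8-r")
    + b"--b--\r\n"
)

msg = message_from_bytes(raw, policy=policy.default)

parts = []
for part in msg.walk():
    if part.get_content_maintype() != "text":
        continue  # skip the multipart container itself
    octets = part.get_payload(decode=True)            # undo the transfer encoding
    charset = part.get_content_charset() or "us-ascii"  # per-part charset
    parts.append((charset, octets.decode(charset, errors="replace")))

for charset, text in parts:
    print(charset, text.strip())
```

Each part is decoded with its own charset parameter; there is no single
value that would have worked for the whole body.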

For the _headers_, where we occasionally see raw, unencoded non-ASCII
octets, there is no portable way to map those octets to characters.
You can make assumptions, which will work for some messages but not
for others -- this is _why_ charset-indicating encodings such as
RFC 2047 are used. But any real-world message might contain arbitrary
octets and it's up to you to decide how you want to deal with such data
and whether or not to try to salvage anything from it.
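
To make the ambiguity concrete, here is a short Python sketch with an
invented header line containing two raw 8-bit octets. The same octets
give different text depending on which charset the reader assumes.

```python
# A header line with raw, unlabelled 8-bit octets (invented example).
raw_header = b"Subject: Gr\xfc\xdfe"

# Assumption 1: the sender meant ISO-8859-1.
as_latin1 = raw_header.decode("iso-8859-1")

# Assumption 2: the sender meant KOI8-R. Also decodes without error,
# but produces Cyrillic letters instead of the umlaut and sharp s.
as_koi8r = raw_header.decode("koi8-r")

# Assumption 3: the sender meant UTF-8. These octets are not valid UTF-8.
try:
    raw_header.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1)
print(as_koi8r)
print("valid UTF-8:", utf8_ok)
```

Two of the three guesses "succeed", and nothing in the octets themselves
says which one is right.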

> But, if you write a filter in java and our message is encoded in
> iso-8859-1, but your system charset is utf8,
> the message has garbled umlauts ( all non ascii chars like üöäß etc. ).


Don't conflate byte-arrays and strings. It only ever leads to pain.
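
A hedged sketch of what that means for a filter, in Python rather than
Java: read bytes, write bytes, and never let the platform default
charset touch the data. The function name run_filter and the
X-Filtered header are hypothetical, chosen only for this example.

```python
import io


def run_filter(src, dst):
    """Copy a message from src to dst as bytes, adding one header.

    src and dst are binary streams. Nothing is ever decoded to str,
    so the system's default charset cannot garble the content.
    """
    in_headers = True
    for line in src:  # iterating a binary stream yields bytes lines
        if in_headers and line in (b"\r\n", b"\n"):
            # The blank line ends the header block; add the
            # (hypothetical) header just before it.
            dst.write(b"X-Filtered: yes" + line)
            in_headers = False
        dst.write(line)


# Demonstration with in-memory streams. A real filter would pass
# sys.stdin.buffer and sys.stdout.buffer instead.
message = b"Subject: test\r\n\r\n" + "üöäß\r\n".encode("iso-8859-1")
out = io.BytesIO()
run_filter(io.BytesIO(message), out)
print(out.getvalue())
```

The ISO-8859-1 body octets come out exactly as they went in. In Java
the equivalent is to stay with InputStream/OutputStream and byte[], and
to pass an explicit Charset whenever a String is really needed.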

MIME is not simple to parse if you want to deal with all the complexity.
Use a good library to manage that for you. Make sure that whatever
generates the messages sets appropriate MIME headers, correctly
indicating character sets, if you allow non-ASCII characters.
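
On the generating side, a minimal sketch using Python's standard
email.message.EmailMessage, which fills in the MIME headers itself. The
subject and body text are invented for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

msg = EmailMessage()
# Non-ASCII header text is RFC 2047 encoded when the message is serialized.
msg["Subject"] = "Grüße ☃"
# set_content() adds Content-Type (with charset) and
# Content-Transfer-Encoding headers for the body.
msg.set_content("üöäß", charset="utf-8")

wire = msg.as_bytes()  # the octets that would go on the wire
print(wire.decode("utf-8"))

# A receiver parsing those octets recovers the same text,
# because the charsets are declared rather than guessed.
parsed = message_from_bytes(wire, policy=policy.default)
print(parsed["Subject"], parsed.get_content().strip())
```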

(I don't program Java enough to have library recommendations)

-Phil