Re: [exim] Internationalized email

Author: John C Klensin
Date:
To: exim-users
CC: Viktor Dukhovni
Subject: Re: [exim] Internationalized email

Hi.

I've been using Exim for many years, but have managed to stay
off this list until Viktor sent me a heads-up about the recent
thread.

Purely for identification, I'm the editor of RFC 5321, co-author
of RFC 6530, and former co-chair of the EAI WG that is
responsible for this stuff.

I'm going to try to avoid a long-winded explanation (yes, this
is the short version) unless asked but will try to summarize the
three main points.

Executive summary for those who don't want to read even that
much: Just don't try to encode local parts. Really. You will
hurt the sites and systems that support non-ASCII addresses and
headers all the way to the client desktop and won't really help
those who don't.

(1) There is an inherent conflict in any sort of
internationalization work, especially work that affects things
at the inter-machine transport level, between making things more
widely available (even if painfully) and making them better
adapted to individual needs (including language and writing
system). That, and one of its implications, were very
carefully considered by the EAI WG and resolved in favor of
early failure: As Viktor has essentially pointed out in the case
of Delivery and Non-delivery notifications, it is a very bad
situation if a message somehow gets to a point in the network
where it cannot be moved forward and all of the choices about
how to try to notify the sender are wrong. So the plan is that,
if the next relay system in line is not fully SMTPUTF8-capable,
one gives up rather than trying to figure out how to modify the
message to get it a little further, with the possibility that
the next system in line won't understand the modifications.

One example of this, and part of the thinking, is that, if a
SMTPUTF8-capable SMTP relay receives a message that includes
non-ASCII address information but then needs to send it on to
something that cannot handle such messages, it should just
reject (or, if needed, bounce) the message. If it somehow
"downgrades" it, i.e., by encoding non-ASCII local-parts, there
is no guarantee that the next hop, or the hop after that, or an
MUA after the delivery MTA, will understand the encoding used
--or even understand that it _is_ an encoding-- since the SMTP
extension option will be missing. After a bit of
experimentation and a lot of thought and discussion, the WG also
concluded that the amount of work needed to produce
decoding-sensitive receiving systems or clients was likely to be
almost as much as that required to fully support SMTPUTF8 and
that the encoding route would cause problems in the future and
be a lot more risky. Bad idea all around.
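
A minimal sketch of the "fail early, do not downgrade" rule
described above, in Python. The function names, the return
strings, and the way EHLO keywords are passed in are illustrative
assumptions, not Exim configuration or any real library's API. It
only looks at envelope addresses; a fuller check would also
consider non-ASCII header fields.

```python
def needs_smtputf8(sender, recipients):
    """True if any envelope address contains non-ASCII characters."""
    return any(not addr.isascii() for addr in [sender, *recipients])


def relay_decision(next_hop_ehlo_lines, sender, recipients):
    """Decide what to do with a message for the next hop.

    next_hop_ehlo_lines: keyword lines from the next hop's EHLO
    response, e.g. ["PIPELINING", "SIZE 52428800", "SMTPUTF8"].
    Returns "send", "send-smtputf8", or "reject" (hypothetical labels).
    """
    # Keep only the keyword itself; some extensions carry parameters.
    keywords = {
        line.split()[0].upper()
        for line in next_hop_ehlo_lines
        if line.strip()
    }
    if not needs_smtputf8(sender, recipients):
        return "send"            # plain ASCII envelope, nothing special
    if "SMTPUTF8" in keywords:
        return "send-smtputf8"   # next hop advertises the extension
    # No re-encoding of the local part: give up here instead.
    return "reject"
```

For example, relay_decision(["PIPELINING", "8BITMIME"],
"a@example.com", ["用户@example.com"]) yields "reject", while
the same call with "SMTPUTF8" in the list yields "send-smtputf8".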

(2) As Viktor has pointed out, the set of algorithms associated
with the Punycode algorithm and the "xn--" prefix approach were
designed very specifically for IDNA and the needs and
constraints of DNS internationalization. They assume preparatory
and filtering steps that restrict the characters used, something
we have never done with local-parts, encoding on a
label-by-label basis, and so on. Also, when the "two letters
followed by two hyphens" indicator was chosen, it was chosen
after a search through whatever parts of the DNS people could
find looking for conflicts and, even then, some labels were
found that used that sequence. Do the same thing for
local-parts, where we have no idea what strings have been used
and with what intended meaning, and the risk is very high that
something will be misinterpreted (or fail to be interpreted) in
a way that creates opportunities for attack.
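
To make the ambiguity concrete, here is a small Python sketch of
a hypothetical decoder that treats any local-part starting with
"xn--" as Punycode. The function name and the scheme itself are
assumptions for illustration only; no standard defines this for
local-parts. It uses the "punycode" codec from the Python
standard library.

```python
def naive_decode_local_part(local_part):
    """Hypothetical: decode 'xn--' local-parts as if they were IDNA labels."""
    if local_part.lower().startswith("xn--"):
        try:
            return local_part[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            # Not valid Punycode after all: fall back to the raw string.
            return local_part
    return local_part


# A sender that encodes "münchen" this way produces:
encoded = "xn--" + "münchen".encode("punycode").decode("ascii")

# But a mailbox whose owner picked that exact ASCII string years ago
# is indistinguishable from it, and gets silently rewritten:
legacy_mailbox = "xn--mnchen-3ya"
rewritten = naive_decode_local_part(legacy_mailbox)
```

Here encoded and legacy_mailbox are the same string, so the
decoder has no way to tell an intentional ASCII local-part from
an encoded one.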

If you must encode, consider Base64, some variation on encoded
words, or read RFC 5137, not the Punycode algorithm and IDNA
prefix. But, for the other reasons given in this note, "don't"
would be better.

(3) The email community connected to the Internet learned (very
painfully) in the 1980s that email was used for all sorts of
interesting things that few people had anticipated. Subject
lines and address local parts were used to transmit and perform
command functions as well as other clever things. As both a
privacy technique and a validation/authorization one, some of
those strings were signed in various ways and/or encrypted,
typically using shared secrets. Various characters were used as
delimiters or indicators for, e.g., inter-system message routing
(including but not limited to "%" and "!") and delivery-system
message handling, classification, or filtering (e.g.,
subaddresses with "+" and other characters), as well as control
sequences. Backward-pointing addresses associated with mail
exploders were modified to contain list management information
(most notably by qmail a decade or so later). Special mappings
were invented for mapping between X.400 and SMTP/822 systems
that used conventions that might well be interpreted as simply
local parts by non-gateway systems (see RFC 987 and the more
broadly adopted RFC 1138 and its successors). The latter used
syntax characters that would previously have been interpreted as
something entirely different, such as various job control
languages. And so on.

The lesson was that any system other than the final delivery MTA
(or a gateway that knew what was going on by virtue of being in
the routing path for the mail) that tampered with a local-part
on the assumption that it knew what it meant was likely to cause
a big mess (at best). That lesson led to stringent restrictions
on tampering with the local part when messages were in transit
-- even to convert or match letter case -- that were ultimately
reflected in strong requirement language in RFC 2821.
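
As a rough illustration of that rule, a system that is not the
final delivery host can compare two addresses like this: domains
match without regard to case, local-parts only if they are
byte-for-byte identical. This is a Python sketch with a
hypothetical function name, and it assumes ASCII domains.

```python
def same_mailbox_in_transit(addr_a, addr_b):
    """Compare addresses without interpreting the local-part."""
    # Split on the last "@": a quoted local-part may contain "@",
    # a domain may not.
    local_a, _, domain_a = addr_a.rpartition("@")
    local_b, _, domain_b = addr_b.rpartition("@")
    # Domain: case-insensitive. Local-part: opaque, exact match only.
    return local_a == local_b and domain_a.lower() == domain_b.lower()
```

So "User@Example.COM" and "User@example.com" compare equal, but
"User@example.com" and "user@example.com" do not; only the
final delivery system knows whether those two are the same
mailbox.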

The same reasoning should apply to encoding non-ASCII
local-parts. Even if Exim is (as far as it knows) the final
delivery server, it has no knowledge of what an IMAP or POP
server will do, much less what will happen in MUAs that connect
to them or that access the mail store in other ways. If they
don't recognize and support the same encoding conventions, the
end user will see gibberish. And, of course, if they think
"xn--" followed by some characters really means "set off the
designed fire alarms" (rather than being some encoding we
invented), _really_ bad problems could occur.

The above actually identifies a small advantage Microsoft has
when they use a customized encoding: as many of us have
painfully found out, they are willing to behave as if the SMTP
sender and receiver, the IMAP and POP servers, and the MUAs are
all their software. If someone tries to use different
components with their systems and there are failures, the
response is either that those external-party components are
broken because they didn't conform to Microsoft Standards or
that the users wouldn't be encountering problems if they, and
their correspondents, were all using Outlook and Exchange
Server. Exim, as an MTA that does not support captive IMAP/POP
servers, much less its own MUAs, does not have that advantage.

thanks for listening,
    best,
      john
