Re: [exim-dev] UTF-8 and Exim string operations

Author: Jasen Betts
Date:
To: exim-dev
Subject: Re: [exim-dev] UTF-8 and Exim string operations

On 2018-08-17, Phil Pennock via Exim-dev <exim-dev@???> wrote:
> Anyone have strong feelings on how Exim should handle UTF-8 with
> operators such as ${length_1:STR} ?
>
> Document that the current operators work on bytes

Yeah, stay with treating strings as NUL-terminated arrays of octets,
the same unit the RFCs use to define email and SMTP.

> and add ulength_1 for being UTF-8 aware?

That would also need UTF-8-aware versions of substr and strlen.
Is it going to count code points or glyphs?

> Look at the top-bit being set and assume UTF-8, or
> will that break too much with all the places which are still ISO-8859-1?

Just looking at that bit won't tell you enough to count code points or
glyphs. You then need to group the octets together, and you need to do
something when you hit an invalid octet.
Parts of ${utf8clean can probably be reused.
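
To make the grouping step concrete, here is a rough sketch in C of counting
code points in a NUL-terminated octet string. The names utf8_seq_len and
utf8_count are made up for illustration and are not Exim functions, and
counting each invalid octet as one unit is only one possible policy
(rejecting the string or substituting U+FFFD would be others). It counts
code points, not glyphs, so a base letter plus a combining mark still
counts as two.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Exim): length in octets of the UTF-8
   sequence starting at s, with n octets available, or 0 if the octets
   there do not form a valid sequence. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned long cp = 0;
    size_t need, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;                       /* plain ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { need = 2; cp = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { need = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { need = 4; cp = s[0] & 0x07; }
    else return 0;   /* stray continuation octet, 0xC0, 0xC1, or 0xF5..0xFF */

    if (n < need) return 0;                          /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;         /* not a continuation */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    if (need == 3 && cp < 0x800) return 0;           /* overlong encoding */
    if (need == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* UTF-16 surrogates */
    return need;
}

/* Hypothetical helper: count code points in str. Each octet that is not
   part of a valid sequence is counted as one unit and tallied in *invalid
   (which may be NULL). */
size_t utf8_count(const char *str, size_t *invalid)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t n = strlen(str), i = 0, count = 0, bad = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) { bad++; len = 1; }            /* skip one bad octet */
        count++;
        i += len;
    }
    if (invalid) *invalid = bad;
    return count;
}
```

With this policy an ISO-8859-1 string such as "caf\xe9" comes out as four
units with one invalid octet, which is at least a usable signal that the
input was not UTF-8 in the first place.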

"${lc", "${uc" and "${if eqi" need consideration too.

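As an illustration of why the case operators matter, here is a sketch of a
byte-wise, ASCII-only case-insensitive comparison in C. The function name
bytewise_eqi is hypothetical, and this is an assumption about what octet-based
folding looks like in general, not a description of how Exim implements
"${if eqi". Octets with the top bit set are compared as-is, so the UTF-8
encodings of U+00C9 and U+00E9 do not match.

```c
#include <stddef.h>

/* Hypothetical helper (not Exim code): returns 1 if a and b are equal
   when only ASCII A-Z is folded to a-z, 0 otherwise. Octets >= 0x80
   are never folded. */
int bytewise_eqi(const char *a, const char *b)
{
    for (;; a++, b++) {
        unsigned char x = (unsigned char)*a;
        unsigned char y = (unsigned char)*b;

        if (x >= 'A' && x <= 'Z') x = (unsigned char)(x - 'A' + 'a');
        if (y >= 'A' && y <= 'Z') y = (unsigned char)(y - 'A' + 'a');

        if (x != y) return 0;      /* mismatch, or one string ended early */
        if (x == 0) return 1;      /* both strings ended together */
    }
}
```

So bytewise_eqi("Exim", "EXIM") is true, while comparing "\xc3\x89" with
"\xc3\xa9" is false even though a reader would call them the same letter
in different case.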
-- 
     ت
