Re: [exim-dev] UTF-8 and Exim string operations

Author: Phil Pennock
Date:  
To: Jasen Betts
CC: exim-dev
Subject: Re: [exim-dev] UTF-8 and Exim string operations
On 2018-08-17 at 10:36 -0000, Jasen Betts via Exim-dev wrote:
> > and add ulength_1 for being UTF-8 aware?
>
> Would also need utf8-aware substr and strlen.


Yes, I was using length as an exemplar, not as an exhaustive list. :)

I favored ulength too, but didn't want to just add a slew of new
expansion operators, items and conditions without at least mentioning it
somewhere first.

> is it going to count code-points or glyphs?


Code-points. Exim has no business knowing about how a layout engine
might or might not choose to render code-points to glyphs. I could see
a possibility for normalization handling as another function, for
correct SASLprep for authentication.

I'd really rather not, though. Exim is setuid root and the main system
for handling such things, ICU, does lots of tricky sensitive stuff with
a history of security problems.
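
To make "code-points, not glyphs" concrete, here is a minimal sketch of
such a count. The function name is hypothetical (it is not an existing
Exim function), and it assumes the input is already known to be valid
UTF-8; it counts every byte that is not a continuation byte.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Exim: count the code-points in a
   NUL-terminated string that is ASSUMED to be valid UTF-8.  Every
   code-point has exactly one lead byte, so counting the bytes that
   are not continuation bytes (10xxxxxx) gives the answer. */
size_t
utf8_codepoint_count(const unsigned char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if ((*s & 0xC0) != 0x80)    /* not a continuation byte */
            n++;
    return n;
}
```

With this, "e" followed by U+0301 (combining acute accent) counts as two
code-points, even though a layout engine would normally draw one glyph.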

> > Look at the top-bit being set and assume UTF-8, or
> > will that break too much with all the places which are still ISO-8859-1?
>
> Just looking at that bit won't tell you enough to count code-points or
> glyphs.


I know; this was a suggestion for deciding whether a string should be
treated as UTF-8, so that the current expansion operators, items and
conditions could change behavior. It sucks, but it was the only viable
alternative I could think of, and I wanted to at least present an _idea_
of something else, to incite feedback.
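
For reference, the top-bit heuristic amounts to something like the sketch
below. The function name is made up for illustration and nothing like it
is claimed to exist in Exim.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Exim: return 1 if any of the len
   bytes at s has its top bit set (i.e. the data is not pure ASCII),
   otherwise 0. */
int
has_top_bit_set(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] & 0x80)
            return 1;
    return 0;
}
```

The weakness is visible straight away: an ISO-8859-1 "é" (the single byte
0xE9) and a UTF-8 "é" (0xC3 0xA9) both trip the check, so it only says
"not plain ASCII", not "this is UTF-8".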

I know a fair bit about UTF-8 internals and how to work with the various
aspects in multiple programming languages. :)

> parts of ${utf8clean can probably be re-used.


Yes, I thought of that, when pondering a new `utf8valid` expansion
condition.
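
As a rough sketch of what a validity check has to cover (this is NOT how
${utf8clean is implemented, and the function name is hypothetical), a
well-formedness test along the lines of RFC 3629 could look like this:

```c
#include <stddef.h>

/* Hypothetical sketch, not Exim code: return 1 if the len bytes at s
   are well-formed UTF-8 in the RFC 3629 sense, otherwise 0. */
int
utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;          /* continuation bytes expected */
        unsigned long cp;     /* code-point being decoded */
        unsigned long min;    /* smallest value legal at this length */

        if (c < 0x80) {       /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;         /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;         /* sequence truncated at end of input */

        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;     /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;         /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;         /* UTF-16 surrogate, not allowed in UTF-8 */
        if (cp > 0x10FFFF)
            return 0;         /* beyond the Unicode range */

        i += need + 1;
    }
    return 1;
}
```

A lone ISO-8859-1 byte such as 0xE9 fails this check because it looks like
the start of a three-byte sequence with nothing after it, which is what
would make such a condition more useful than the top-bit test for telling
the two encodings apart.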

> "${lc" "${uc" and "${if eqi" need consideration too


Only if we go the ICU route and include normalization forms. Which ...
is more bloat than I'm happy with in Exim's current architecture.

-Phil