Re: [pcre-dev] Problems and questions about locales

Author: Patrice Guérin
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Problems and questions about locales
Hello Philip,

Thank you very much for your answer,

My programs run on both Windows and Linux (Debian), so I expect them to
behave the same way on both.

ph10@??? wrote:
> On Thu, 13 Feb 2020, Patrice Guérin wrote:
>
>> I'm facing some problems with the locale character table definitions.
> Locales are a nightmare. We will all be able to rejoice when Unicode is
> everywhere. I'm afraid I know very little about locales, and as I'm a
> Linux user, I know nothing about Windows versions except that there are
> differences.

I agree. But locales are much more than just a 256-byte character
table such as ISO-8859-X or Windows-1252 (a.k.a. CP1252, cf.
http://www.tachyonsoft.com/cp01252.htm).

>>     I really don't know who is right or not, and though the following
>>     characters are not widely used (except euro symbol), I think this
>>     can lead to inconsistencies.
> I'm sure it can, but I suspect there isn't anything that can be done
> about it.

>
>>       * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
>>         90, 9D as Ctrl, Linux does not.
>>       * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
>>         Windows does not.
>>       * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
>>         Print, Windows does not.
> I do not understand what you mean by "0x88 (U+02c6)" because locales
> handle only 256 characters.

Windows-1252 is almost the same as ISO-8859-1 except for the first 32
bytes of the extended table (that is, 0x80 to 0x9F).
For example, the euro symbol € is byte 0x80, which corresponds to U+20AC
in Unicode.
In ISO-8859-15, the euro symbol is 0xA4 (it replaces the currency symbol).
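
To make the "0x88 (U+02c6)" notation concrete, here is a minimal sketch in C
of the byte-to-Unicode mapping described above. The function names
cp1252_to_unicode() and latin9_to_unicode() and the CP1252_UNASSIGNED
sentinel are hypothetical, chosen only for illustration; they are not part
of PCRE2 or of any system library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel for the five bytes Windows-1252 leaves unassigned
   (0x81, 0x8D, 0x8F, 0x90, 0x9D). U+FFFF is a Unicode noncharacter. */
#define CP1252_UNASSIGNED 0xFFFFu

/* Unicode code points for Windows-1252 bytes 0x80..0x9F. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0xFFFF, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFF, 0x017D, 0xFFFF,
    0xFFFF, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFF, 0x017E, 0x0178
};

/* Map one Windows-1252 byte to its Unicode code point.
   Outside 0x80..0x9F the mapping is the same as ISO-8859-1 (identity). */
uint32_t cp1252_to_unicode(unsigned int byte)
{
    assert(byte <= 0xFF);
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];
    return byte;
}

/* Map one ISO-8859-15 byte to its Unicode code point.
   ISO-8859-15 differs from ISO-8859-1 in eight positions only. */
uint32_t latin9_to_unicode(unsigned int byte)
{
    assert(byte <= 0xFF);
    switch (byte) {
    case 0xA4: return 0x20AC;  /* euro sign replaces the currency sign */
    case 0xA6: return 0x0160;
    case 0xA8: return 0x0161;
    case 0xB4: return 0x017D;
    case 0xB8: return 0x017E;
    case 0xBC: return 0x0152;
    case 0xBD: return 0x0153;
    case 0xBE: return 0x0178;
    default:   return byte;
    }
}
```

With this sketch, cp1252_to_unicode(0x80) and latin9_to_unicode(0xA4) both
yield 0x20AC, while cp1252_to_unicode(0x81) yields CP1252_UNASSIGNED.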
>
>>       * Linux defines char 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
>>         Print and Punct, Windows does not.
>>       * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
>>         Windows defines it as Space, Blank and Print.
>>       * Windows defines chars 0xAA (ª), 0xB5(µ) and 0xBA (º) as Punct,
>>         Linux does not.
>>       * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
>>       * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
>>         and Digit, Linux does not.
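
Differences like those listed above can be reproduced with the standard C
ctype functions alone. The following is a minimal sketch, not PCRE2 code:
the CLS_* flags and the functions classify_byte(), build_class_table() and
dump_high_half() are hypothetical names introduced for illustration. It
records the ctype classes of every byte value under a named locale, so the
output from a Windows build and a Linux build can be compared line by line.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical class flags, one bit per ctype predicate. */
enum {
    CLS_ALPHA = 1 << 0, CLS_DIGIT = 1 << 1, CLS_ALNUM = 1 << 2,
    CLS_PUNCT = 1 << 3, CLS_SPACE = 1 << 4, CLS_BLANK = 1 << 5,
    CLS_CNTRL = 1 << 6, CLS_GRAPH = 1 << 7, CLS_PRINT = 1 << 8
};

/* Classify one byte value under the currently active LC_CTYPE locale. */
unsigned classify_byte(int c)
{
    unsigned flags = 0;
    assert(c >= 0 && c <= 255);
    if (isalpha(c)) flags |= CLS_ALPHA;
    if (isdigit(c)) flags |= CLS_DIGIT;
    if (isalnum(c)) flags |= CLS_ALNUM;
    if (ispunct(c)) flags |= CLS_PUNCT;
    if (isspace(c)) flags |= CLS_SPACE;
    if (isblank(c)) flags |= CLS_BLANK;
    if (iscntrl(c)) flags |= CLS_CNTRL;
    if (isgraph(c)) flags |= CLS_GRAPH;
    if (isprint(c)) flags |= CLS_PRINT;
    return flags;
}

/* Fill out[256] with class flags for locale_name, then restore the
   previous LC_CTYPE locale. Returns 0 on success, -1 on failure.
   Note: in standard C, setlocale() changes the locale of the whole
   process, not of a single thread. */
int build_class_table(const char *locale_name, unsigned out[256])
{
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved;
    int c;

    if (current == NULL) return -1;
    /* Copy the name: the returned string may be overwritten later. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL) return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        free(saved);
        return -1;              /* locale not available on this system */
    }
    for (c = 0; c < 256; c++)
        out[c] = classify_byte(c);

    setlocale(LC_CTYPE, saved);
    free(saved);
    return 0;
}

/* Print one line per byte in 0x80..0xFF: the byte, then its flags in hex. */
void dump_high_half(const unsigned table[256], FILE *stream)
{
    int c;
    for (c = 0x80; c < 256; c++)
        fprintf(stream, "0x%02X 0x%03X\n", c, table[c]);
}
```

Locale names are platform-specific and depend on what is installed, so the
name passed to build_class_table() has to be chosen per system; running
dump_high_half() on each system and comparing the two outputs shows exactly
which bytes are classified differently.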

>>
>> Now, I have some questions:
>>
>> 1. If I correctly understood the process of the chartables build at
>>     runtime,
>>      1. the ctype functions are used only in pcre2_maketables() so the
>>         locale can be set just before this call at thread level.
> Yes.

>
>>      2. the char table returned should be freed after the calls to
>>         pcre2_match()
> Yes.

>
>>      3. A compilation context is to be created to associate the char table.
> Yes.

>
>>      4. Can it be freed just after the call to pcre2_compile() ?
> The compilation context can be freed, but the tables themselves must be
> retained until all matches are done.

>
>> 2. What do you think about the availability of a function to load the
>>     char tables as a binary file ?
>>     This could be useful to get exactly the same tables in different OS.
> I suspect that this is a very specialist requirement, and I would rather
> encourage people to switch to Unicode. Also, moving binary things
> between OS is not that simple because of endian issues. And indeed
> 8/16/32 bit issues.

The work to switch to UTF-8 is in progress.
However, it is not very efficient to convert several MiB of external data
to UTF-8 each time I need to run a regexp, and I cannot permanently modify
the sources.
In my opinion, the output of pcre2_maketables() is independent of the
8/16/32-bit code unit width, since it is defined as uint8_t (i.e. bytes).
For the same reason, I think there is no endianness issue in the
computation of the table.
Saving and loading it in binary form should be OK.
If I'm wrong (I will return to school...), the same restrictions as for
the serialization of REs can be applied.
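
As an illustration of what I mean, here is a minimal sketch in C of saving
and reloading such a byte table. The functions save_tables() and
load_tables() are hypothetical, not part of PCRE2, and the table length is
passed in by the caller because it is an assumption of this sketch that the
application knows the size used by its PCRE2 build. Since the data is a
plain array of uint8_t, it is written and read byte by byte with no
conversion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write `length` bytes of character tables to `path`.
   Returns 0 on success, -1 on any I/O error. */
int save_tables(const char *path, const uint8_t *tables, size_t length)
{
    FILE *f;
    size_t written;
    int close_failed;

    assert(path != NULL && tables != NULL && length > 0);
    f = fopen(path, "wb");          /* binary mode matters on Windows */
    if (f == NULL) return -1;
    written = fwrite(tables, 1, length, f);
    close_failed = fclose(f);
    return (written == length && close_failed == 0) ? 0 : -1;
}

/* Read exactly `length` bytes of character tables from `path`.
   Returns a malloc()ed buffer the caller must free(), or NULL if the
   file cannot be read or does not have exactly the expected size. */
uint8_t *load_tables(const char *path, size_t length)
{
    FILE *f;
    uint8_t *buffer;

    assert(path != NULL && length > 0);
    f = fopen(path, "rb");
    if (f == NULL) return NULL;

    buffer = malloc(length);
    if (buffer != NULL) {
        /* Reject files that are shorter or longer than expected. */
        if (fread(buffer, 1, length, f) != length || fgetc(f) != EOF) {
            free(buffer);
            buffer = NULL;
        }
    }
    fclose(f);
    return buffer;
}
```

The buffer returned by load_tables() would then be associated with a compile
context in the same way as the result of pcre2_maketables(), and it must be
kept until all matches are done, as you said above.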
>
> Philip
>

Kind regards,
Patrice.