Re: [pcre-dev] Problems and questions about locales

Author: ph10
Date:  
To: Patrice Guérin
CC: pcre-dev
Subject: Re: [pcre-dev] Problems and questions about locales
On Thu, 13 Feb 2020, Patrice Guérin wrote:

> I'm facing some problems with the locale character table definitions.


Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.

>    I really don't know who is right or not, and though the following
>    characters are not widely used (except euro symbol), I think this
>    can lead to inconsistencies.


I'm sure it can, but I suspect there isn't anything that can be done
about it.

>      * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
>        90, 9D as Ctrl, Linux does not.
>      * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
>        Windows does not.
>      * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
>        Print, Windows does not.


I do not understand what you mean by "0x88 (U+02c6)" because locales
handle only 256 characters.

>      * Linux defines char 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
>        Print and Punct, Windows does not.
>      * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
>        Windows defines it as Space, Blank and Print.
>      * Windows defines chars 0xAA (ª), 0xB5(µ) and 0xBA (º) as Punct,
>        Linux does not.
>      * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
>      * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
>        and Digit, Linux does not.
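
The differences listed above can be checked directly, since the tables
are built from the standard <ctype.h> functions. The following is a
minimal sketch, not part of PCRE2: the names class_mask(),
dump_classes() and the CL_* bits are hypothetical and exist only for
this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical bit names for this sketch; they are not PCRE2's flags. */
enum {
  CL_ALPHA = 1 << 0,
  CL_DIGIT = 1 << 1,
  CL_ALNUM = 1 << 2,
  CL_PUNCT = 1 << 3,
  CL_SPACE = 1 << 4,
  CL_BLANK = 1 << 5,
  CL_CNTRL = 1 << 6,
  CL_GRAPH = 1 << 7,
  CL_PRINT = 1 << 8
};

/* Return the <ctype.h> classes of byte c (0..255) in the current locale. */
unsigned class_mask(int c)
{
  unsigned m = 0;
  assert(c >= 0 && c <= 255);
  if (isalpha(c)) m |= CL_ALPHA;
  if (isdigit(c)) m |= CL_DIGIT;
  if (isalnum(c)) m |= CL_ALNUM;
  if (ispunct(c)) m |= CL_PUNCT;
  if (isspace(c)) m |= CL_SPACE;
  if (isblank(c)) m |= CL_BLANK;
  if (iscntrl(c)) m |= CL_CNTRL;
  if (isgraph(c)) m |= CL_GRAPH;
  if (isprint(c)) m |= CL_PRINT;
  return m;
}

/* Write one "byte mask" line (both in hex) for each byte in [first, last]. */
void dump_classes(FILE *out, int first, int last)
{
  for (int c = first; c <= last; c++)
    fprintf(out, "%02X %03X\n", (unsigned)c, class_mask(c));
}
```

Calling setlocale(LC_CTYPE, ...) with the locale under test and then
dump_classes(stdout, 0x80, 0xFF) on each system gives two listings that
can be compared with diff.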

>
> Now, I have some questions:
>
> 1. If I have understood correctly how the character tables are built
>    at runtime:
>     1. The ctype functions are used only in pcre2_maketables(), so the
>        locale can be set just before this call at thread level.


Yes.

>     2. the char table returned should be freed after the calls to
>        pcre2_match()


Yes.

>     3. A compilation context is to be created to associate the char table.


Yes.

>     4. Can it be freed just after the call to pcre2_compile()?


The compilation context can be freed, but the tables themselves must be
retained until all matches are done.

> 2. What do you think about the availability of a function to load the
>    char tables from a binary file?
>    This could be useful to get exactly the same tables on different
>    operating systems.


I suspect that this is a very specialist requirement, and I would rather
encourage people to switch to Unicode. Also, moving binary data between
operating systems is not that simple because of endianness issues, and
indeed 8/16/32-bit issues.

Philip

--
Philip Hazel