Re: [pcre-dev] Problems and questions about locales

Author: zatlas1@yahoo.com
Date:  
To: pcre-dev@exim.org, ph10@hermes.cam.ac.uk, Patrice Guérin
CC: pcre-dev@exim.org
Subject: Re: [pcre-dev] Problems and questions about locales
As Phil mentioned, this is a locale vs. Unicode issue. My port to z/OS (and thus EBCDIC) deals with such issues, as there are many locales in EBCDIC. I use iconv (the IBM version) to convert between the various locales, and I have developed a whole API to help handle this. It is available in EBCDIC only, though (I really should publish it in ASCII as well). Patrice, if you contact me directly, I would be glad to share code and ideas with you, and maybe, if useful, I would publish them to a wider audience.
Ze'ev Atlas

On Fri, Feb 14, 2020 at 12:15 PM, ph10@??? <ph10@???> wrote:
On Thu, 13 Feb 2020, Patrice Guérin wrote:

> I'm facing some problems with the locale character table definitions.


Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.

>    I really don't know who is right or not, and though the following
>    characters are not widely used (except euro symbol), I think this
>    can lead to inconsistencies.


I'm sure it can, but I suspect there isn't anything that can be done
about it.

>      * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
>        90, 9D as Ctrl, Linux does not.
>      * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
>        Windows does not.
>      * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
>        Print, Windows does not.


I do not understand what you mean by "0x88 (U+02c6)" because locales
handle only 256 characters.

>      * Linux defines char 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
>        Print and Punct, Windows does not.
>      * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
>        Windows defines it as Space, Blank and Print.
>      * Windows defines chars 0xAA (ª), 0xB5(µ) and 0xBA (º) as Punct,
>        Linux does not.
>      * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
>      * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
>        and Digit, Linux does not.
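
A comparison like the list above can be reproduced with a few lines of standard C. This is only a sketch: the CT_* flags and the two helper names are made up for illustration, and the result depends entirely on which locale is active when the helpers are called.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical flag values for illustration; not part of any library. */
enum {
    CT_ALPHA = 1,  CT_DIGIT = 2,  CT_ALNUM = 4,
    CT_PUNCT = 8,  CT_GRAPH = 16, CT_PRINT = 32,
    CT_SPACE = 64, CT_BLANK = 128, CT_CNTRL = 256
};

/* Return the ctype classes of one byte in the currently active locale. */
unsigned ctype_mask(unsigned char c)
{
    unsigned m = 0;
    if (isalpha(c)) m |= CT_ALPHA;
    if (isdigit(c)) m |= CT_DIGIT;
    if (isalnum(c)) m |= CT_ALNUM;
    if (ispunct(c)) m |= CT_PUNCT;
    if (isgraph(c)) m |= CT_GRAPH;
    if (isprint(c)) m |= CT_PRINT;
    if (isspace(c)) m |= CT_SPACE;
    if (isblank(c)) m |= CT_BLANK;
    if (iscntrl(c)) m |= CT_CNTRL;
    return m;
}

/* Print one line per byte in 0x80..0xBF: the byte and its class mask. */
void dump_high_bytes(FILE *out)
{
    for (int c = 0x80; c <= 0xBF; c++)
        fprintf(out, "0x%02X 0x%03X\n", c, ctype_mask((unsigned char)c));
}
```

Calling setlocale(LC_CTYPE, ...) with the locale of interest and then dump_high_bytes(stdout) on each system gives two listings that can be compared with diff.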
>
> Now, I've some questions :
>
> 1. If I correctly understood the process of the chartables build at
>    runtime,
>    1. the ctype functions are used only in pcre2_maketables() so the
>        locale can be set just before this call at thread level.


Yes.

>    2. the char table returned should be freed after the calls to
>        pcre2_match()


Yes.

>    3. A compilation context is to be created to associate the char table.


Yes.

>    4. Can it be freed just after the call to pcre2_compile() ?


The compilation context can be freed, but the tables themselves must be
retained until all matches are done.

> 2. What do you think about the availability of a function to load the
>    char tables as a binary file ?
>    This could be useful to get exactly the same tables in different OS.


I suspect that this is a very specialist requirement, and I would rather
encourage people to switch to Unicode. Also, moving binary things
between OS is not that simple because of endian issues. And indeed
8/16/32 bit issues.

Philip

--
Philip Hazel
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev