As Phil mentioned, this is a locale vs. Unicode issue. My port to z/OS (and thus EBCDIC) deals with such issues, as there are many locales in EBCDIC. I use iconv (the IBM version) to convert between the various locales, and I developed a whole API to help handle this. It is available in EBCDIC only, though (I really should publish it in ASCII as well).
Patrice, if you contact me directly, I would be glad to share code and ideas with you, and maybe, if useful, I would publish them to a wider audience.
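For readers who have not used iconv, here is a minimal sketch of a charset-to-UTF-8 conversion using the POSIX iconv interface. It is not the z/OS API mentioned above. The helper name to_utf8 is hypothetical, and charset names such as "CP1252" are an assumption, since the accepted names differ between iconv implementations.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes from from_charset to UTF-8.
   Returns the number of bytes written to out, or -1 on any failure. */
long to_utf8(const char *from_charset, const char *in, size_t inlen,
             char *out, size_t outcap)
{
    assert(from_charset != NULL && in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return -1;              /* charset name unknown to this iconv */

    char *inp = (char *)in;     /* iconv() wants char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;              /* invalid input or output buffer too small */
    return (long)(outcap - outleft);
}
```

In Windows-1252, byte 0x88 corresponds to U+02C6, so a call such as to_utf8("CP1252", "\x88", 1, buf, sizeof buf) should produce the two UTF-8 bytes CB 86, provided the local iconv knows that charset name.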
Ze'ev Atlas
On Fri, Feb 14, 2020 at 12:15 PM, ph10@??? <ph10@???> wrote:
On Thu, 13 Feb 2020, Patrice Guérin wrote:
> I'm facing some problems with the locale character table definitions.
Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.
> I really don't know who is right or not, and though the following
> characters are not widely used (except euro symbol), I think this
> can lead to inconsistencies.
I'm sure it can, but I suspect there isn't anything that can be done
about it.
> * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
> 90, 9D as Ctrl, Linux does not.
> * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
> Windows does not.
> * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
> Print, Windows does not.
I do not understand what you mean by "0x88 (U+02c6)" because locales
handle only 256 characters.
> * Linux defines char 0x98 (U+02dc) and 0x99 (U+02122) as Graph,
> Print and Punct, Windows does not.
> * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
> Windows defines it as Space, Blank and Print.
> * Windows defines chars 0xAA (ª), 0xB5(µ) and 0xBA (º) as Punct,
> Linux does not.
> * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
> * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
> and Digit, Linux does not.
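A comparison like the list above can be reproduced with the standard <ctype.h> functions. The following is only a sketch: the names classify and dump_classes are hypothetical, and the output depends entirely on which LC_CTYPE locale is active when it runs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* One bit per class named in the list above. */
enum {
    CLS_ALPHA = 1 << 0, CLS_ALNUM = 1 << 1, CLS_DIGIT = 1 << 2,
    CLS_PUNCT = 1 << 3, CLS_SPACE = 1 << 4, CLS_BLANK = 1 << 5,
    CLS_CNTRL = 1 << 6, CLS_GRAPH = 1 << 7, CLS_PRINT = 1 << 8
};

/* Names in the same order as the bits above. */
static const char *const cls_names[] = {
    "Alpha", "Alnum", "Digit", "Punct", "Space",
    "Blank", "Ctrl", "Graph", "Print"
};

/* Classify one byte value (0..255) under the current LC_CTYPE locale. */
unsigned classify(int c)
{
    assert(c >= 0 && c <= 255);
    unsigned m = 0;
    if (isalpha(c)) m |= CLS_ALPHA;
    if (isalnum(c)) m |= CLS_ALNUM;
    if (isdigit(c)) m |= CLS_DIGIT;
    if (ispunct(c)) m |= CLS_PUNCT;
    if (isspace(c)) m |= CLS_SPACE;
    if (isblank(c)) m |= CLS_BLANK;
    if (iscntrl(c)) m |= CLS_CNTRL;
    if (isgraph(c)) m |= CLS_GRAPH;
    if (isprint(c)) m |= CLS_PRINT;
    return m;
}

/* Print one line per byte in [first, last] so two platforms can be diffed. */
void dump_classes(FILE *fp, int first, int last)
{
    for (int c = first; c <= last; c++) {
        unsigned m = classify(c);
        fprintf(fp, "0x%02X:", c);
        for (int bit = 0; bit < 9; bit++)
            if (m & (1u << bit))
                fprintf(fp, " %s", cls_names[bit]);
        fputc('\n', fp);
    }
}
```

To use it, call setlocale(LC_CTYPE, name) from <locale.h> with a locale name that is valid on the platform, then dump_classes(stdout, 0x80, 0xFF), and diff the Windows and Linux outputs.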
>
> Now, I've some questions :
>
> 1. If I correctly understood the process of the chartables build at
> runtime,
> 1. the ctype functions are used only in pcre2_maketables() so the
> locale can be set just before this call at thread level.
Yes.
> 2. the char table returned should be freed after the calls to
> pcre2_match()
Yes.
> 3. A compilation context is to be created to associate the char table.
Yes.
> 4. Can it be freed just after the call to pcre2_compile() ?
The compilation context can be freed, but the tables themselves must be
retained until all matches are done.
> 2. What do you think about the availability of a function to load the
> char tables as a binary file ?
> This could be useful to get exactly the same tables in different OS.
I suspect that this is a very specialist requirement, and I would rather
encourage people to switch to Unicode. Also, moving binary things
between OS is not that simple because of endian issues. And indeed
8/16/32 bit issues.
Philip
--
Philip Hazel
--
## List details at
https://lists.exim.org/mailman/listinfo/pcre-dev