[pcre-dev] Problems and questions about locales

From: Patrice Guérin
Date:
To: pcre-dev
Subject: [pcre-dev] Problems and questions about locales
Hello to all,

My name is Patrice, and I am a new user of PCRE2 and of this mailing list.
I use Visual C++ 2013 on Windows and g++ on GNU/Linux Debian 9.

I built the 10.34 release with CMake on Windows and with configure on
Debian, using the enable-rebuild-chartables option in both cases.

The default locales are french_France.1252 on Windows and fr_FR.UTF-8 on
Debian.

I'm facing some problems with the locale character table definitions. I
need these tables because I have to handle French data files both in
UTF-8 (in which case PCRE2_UTF is set) and in CP1252.

 1. On Debian, the locale fr_FR or french is not available (until
    fr_FR.iso88591 is installed), so test3 can fail.
 2. I've installed fr_FR.cp1252 and ran dftables -L to produce the
    default chartables data. I found some differences with the same file
    produced on Windows. I wrote a small program to get all ctype isxxx
    properties for each character.
    I really don't know which one is right, and although the following
    characters are not widely used (except the euro symbol), I think
    this can lead to inconsistencies.
      * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
        90, 9D as Ctrl, Linux does not.
      * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
        Windows does not.
      * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
        Print, Windows does not.
      * Linux defines chars 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
        Print and Punct, Windows does not.
      * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
        Windows defines it as Space, Blank and Print.
      * Windows defines chars 0xAA (ª), 0xB5 (µ) and 0xBA (º) as Punct,
        Linux does not.
      * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
      * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
        and Digit, Linux does not.
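
A minimal sketch of such a comparison program is given here for
illustration. It is not necessarily the exact program used to produce
the list above, and the names ctype_mask, dump_ctype and the CLS_* flags
are hypothetical. It only relies on the standard C ctype functions and
setlocale(), and prints the classes reported for each byte in a range.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical bit flags, one per ctype class (order matches names[] below). */
enum {
    CLS_ALPHA = 1 << 0, CLS_ALNUM = 1 << 1, CLS_DIGIT = 1 << 2,
    CLS_UPPER = 1 << 3, CLS_LOWER = 1 << 4, CLS_SPACE = 1 << 5,
    CLS_BLANK = 1 << 6, CLS_PUNCT = 1 << 7, CLS_GRAPH = 1 << 8,
    CLS_PRINT = 1 << 9, CLS_CNTRL = 1 << 10
};

/* Collect the ctype classes of byte c (0..255) under the current LC_CTYPE. */
unsigned ctype_mask(int c)
{
    unsigned m = 0;
    if (isalpha(c)) m |= CLS_ALPHA;
    if (isalnum(c)) m |= CLS_ALNUM;
    if (isdigit(c)) m |= CLS_DIGIT;
    if (isupper(c)) m |= CLS_UPPER;
    if (islower(c)) m |= CLS_LOWER;
    if (isspace(c)) m |= CLS_SPACE;
    if (isblank(c)) m |= CLS_BLANK;
    if (ispunct(c)) m |= CLS_PUNCT;
    if (isgraph(c)) m |= CLS_GRAPH;
    if (isprint(c)) m |= CLS_PRINT;
    if (iscntrl(c)) m |= CLS_CNTRL;
    return m;
}

/* Print one line per byte in [first, last] for the given locale.
   Returns 0 on success, -1 if the locale is not available. */
int dump_ctype(const char *locale_name, int first, int last, FILE *out)
{
    static const char *const names[] = {
        "Alpha", "Alnum", "Digit", "Upper", "Lower", "Space",
        "Blank", "Punct", "Graph", "Print", "Ctrl"
    };
    assert(out != NULL);
    assert(first >= 0 && last <= 255);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    for (int c = first; c <= last; c++) {
        unsigned m = ctype_mask(c);
        fprintf(out, "0x%02X:", (unsigned)c);
        for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
            if (m & (1u << i))
                fprintf(out, " %s", names[i]);
        fputc('\n', out);
    }
    return 0;
}
```

Calling dump_ctype("fr_FR.cp1252", 0x80, 0xFF, stdout) on Debian and
dump_ctype("french_France.1252", 0x80, 0xFF, stdout) on Windows, then
comparing the two outputs, should show differences of the kind listed
above.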


Now, I have some questions:

 1. If I have correctly understood how the chartables are built at
    runtime:
     1. The ctype functions are used only in pcre2_maketables(), so the
        locale can be set just before this call, at thread level.
     2. The returned char table should be freed after the calls to
        pcre2_match().
     3. A compile context has to be created to associate the char table.
     4. Can this context be freed just after the call to pcre2_compile()?
 2. What do you think about providing a function to load the char
    tables from a binary file?
    This could be useful to get exactly the same tables on different OSes.


For information, I wrote another program that gets the properties and
the lower/uppercase mappings using only the fr_FR.UTF-8 locale.
It converts every character of the iso8859-1 or cp1252 charset to wchar
with the iconv library, then uses the wide ctype functions to get the
properties and the cased characters (these are converted back to the
charset to see whether they fit in the table). This works well and gives
the same results as using the specific locale.
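
The round trip described above can be sketched as follows. This is only
an illustration, not the actual program: the helper names byte_to_wide,
wide_to_byte and lower_in_charset are hypothetical, and the "WCHAR_T"
encoding name is assumed to be supported by the iconv implementation (it
is with glibc). The caller is expected to have selected a suitable
locale, for example setlocale(LC_CTYPE, "fr_FR.UTF-8"), before relying
on the results for non-ASCII characters.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Convert one byte of `charset` to a wide character.
   Returns 0 on success, -1 if the byte is unassigned or iconv fails. */
int byte_to_wide(const char *charset, unsigned char byte, wchar_t *out)
{
    assert(charset != NULL && out != NULL);
    iconv_t cd = iconv_open("WCHAR_T", charset);
    if (cd == (iconv_t)-1)
        return -1;
    char in = (char)byte;
    char *inp = &in;
    size_t inleft = 1;
    char *outp = (char *)out;
    size_t outleft = sizeof *out;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return (rc != 0 || inleft != 0) ? -1 : 0;
}

/* Convert a wide character back to a single byte of `charset`.
   Returns 0 on success, -1 if it does not fit in one byte. */
int wide_to_byte(const char *charset, wchar_t wc, unsigned char *out)
{
    assert(charset != NULL && out != NULL);
    iconv_t cd = iconv_open(charset, "WCHAR_T");
    if (cd == (iconv_t)-1)
        return -1;
    char *inp = (char *)&wc;
    size_t inleft = sizeof wc;
    char *outp = (char *)out;
    size_t outleft = 1;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return (rc != 0 || inleft != 0 || outleft != 0) ? -1 : 0;
}

/* Lowercase a byte of `charset` through the wide ctype functions.
   Returns -1 when the lowercase form is not representable in one byte. */
int lower_in_charset(const char *charset, unsigned char byte,
                     unsigned char *lower)
{
    wchar_t wc;
    if (byte_to_wide(charset, byte, &wc) != 0)
        return -1;
    return wide_to_byte(charset, (wchar_t)towlower((wint_t)wc), lower);
}
```

The same pattern applies to towupper() and to the isw*() property
functions: a byte whose converted form cannot be mapped back is simply
left unchanged in the table.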

Thank you in advance.

Kind regards,
Patrice.