Re: [pcre-dev] Proposal for a new API for PCRE

Author: ph10
Date:  
To: Ze'ev Atlas
CC: pcre-dev@exim.org
Subject: Re: [pcre-dev] Proposal for a new API for PCRE
On Wed, 28 Aug 2013, Ze'ev Atlas wrote:

> The first point I have is that in the old PCRE, UTF8 and EBCDIC were
> mutually exclusive in some 'magical' way.  I would suggest that this
> should be in the context rather, so that the two code-pages could live
> happily together.  


Unfortunately, this is not just a matter of "code pages". The character
table that PCRE contains is used only for quickly deciding which
low-valued (less than 256) characters are letters, digits, etc. It is not
used in parsing patterns.
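
To illustrate what such a table does, here is a minimal sketch that builds a
256-entry classification table from the standard C ctype functions under
whatever locale is current. The sk_ names and the flag bits are hypothetical
and chosen only for this example; PCRE's real table layout is different.

```c
#include <ctype.h>

/* Hypothetical flag bits for this sketch; not PCRE's real layout. */
enum { SK_LETTER = 0x01, SK_DIGIT = 0x02, SK_SPACE = 0x04 };

/* Fill a 256-entry table using the current C locale's ctype functions. */
static void sk_build_table(unsigned char table[256])
{
  for (int c = 0; c < 256; c++)
    {
    unsigned char flags = 0;
    if (isalpha(c)) flags |= SK_LETTER;
    if (isdigit(c)) flags |= SK_DIGIT;
    if (isspace(c)) flags |= SK_SPACE;
    table[c] = flags;
    }
}

/* The table only covers values below 256; anything higher is not looked up. */
static int sk_is_letter(const unsigned char table[256], unsigned int c)
{
  return c < 256 && (table[c] & SK_LETTER) != 0;
}
```

Note that the table only answers "what kind of character is this value?". It
plays no part in deciding that a given value is the '[' that opens a
character class.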

PCRE was originally written for an ASCII environment, then extended for
Unicode (originally UTF-8, subsequently UTF-16 and UTF-32 as well).
Unicode is a superset of ASCII, so code fragments such as

  switch (character)
    { 
    case '[':


(which occur when compiling a pattern - this one for detecting a
character class) work for both ASCII and Unicode.

To make it possible to compile PCRE in an EBCDIC environment, all
references to individual literal characters were changed to use macros,
so the code above now reads

  switch (character)
    { 
    case CHAR_LEFT_SQUARE_BRACKET:


The CHAR_xxx macros have different numerical values in an EBCDIC build
compared with an ASCII/Unicode build. Therefore, it is not possible to
have a single PCRE library that can handle both EBCDIC and ASCII/Unicode
input.
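
A rough sketch of why this is a build-time choice follows. SK_EBCDIC and the
SK_/sk_ names are hypothetical, not PCRE's actual option or macro names, and
the EBCDIC value assumes code page 1047; other EBCDIC code pages put '[' at a
different value.

```c
/* SK_EBCDIC is a hypothetical build switch used only in this sketch. */
#ifdef SK_EBCDIC
#define SK_CHAR_LEFT_SQUARE_BRACKET 0xAD  /* '[' in EBCDIC code page 1047 (assumed) */
#else
#define SK_CHAR_LEFT_SQUARE_BRACKET 0x5B  /* '[' in ASCII and Unicode */
#endif

/* Returns 1 if the character opens a character class, otherwise 0. */
static int sk_starts_class(unsigned int character)
{
  switch (character)
    {
    case SK_CHAR_LEFT_SQUARE_BRACKET:
      return 1;
    default:
      return 0;
    }
}
```

The case label is a constant fixed by the preprocessor, so the compiled
function recognizes exactly one of the two values. Supporting both in one
library would mean replacing every such constant with a run-time lookup.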

> Even fancier, EBCDIC has few code-pages in different countries (Greek,
> Turkish, Hebrew, Russian and probably some more) and so does ASCII.


Yes, but these are getting more and more obsolete. These differences
were (and still are) handled by "locales" in the ASCII world - and
indeed the functions for managing locales have been part of standard C
since 1990. PCRE has support for locales. However, code pages are not
good enough. Think what happens if you are trying to process (for
example) a Turkish-German dictionary. How can you have both character
sets simultaneously available?

The solution to this is of course Unicode, which does away with code
pages. I get the feeling that gradually everything in the ASCII world
will move over to Unicode, and code pages will be a thing of the past
(and of EBCDIC :-). Locales are still useful for specifying how to
format dates, currency, and so on, but not for character sets.

> I know that I am mixing two related issues (UTF8 vs. EBCDIC and ASCII
> vs. EBCDIC), but what I am trying to say is, standardize the chartable
> handling and manage it via the context.  


The proposal already does manage the character table via the context,
using this function:

pcre2_set_character_tables(context, unsigned char *tables);

Philip

--
Philip Hazel