Author: Philip Hazel
Date:
To: 791
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 791] New: UTF-8 support does not work on EBCDIC platforms

On Tue, 16 Dec 2008, Martin Jerabek wrote:
> We use PCRE 7.8 on z/OS (formerly known as OS/390 and MVS) which uses
> EBCDIC as its character encoding. We pass only UTF-8 strings to PCRE
> using the PCRE_UTF8 flag, both as patterns and subject strings,
> because our internal string class is based on UTF-8. We configured
> PCRE with both --enable-utf8 and --enable-ebcdic, and also created the
> appropriate EBCDIC character table with dftables.
Aarrgghh! I have to admit that I *never* considered the possibility of
anybody wanting to use UTF-8 with EBCDIC, especially as "classical"
EBCDIC uses bytes with values greater than 127. (I was an OS/370
programmer for many years, so I do have some EBCDIC experience. But not
in the last 15 years or so.)
> The real problem is now that the PCRE code compares the characters
> taken from the pattern with normal C character literals such as '\\',
> '*', etc. This works fine on non-EBCDIC (ASCII) platforms because the
> 7-bit ASCII subset of UTF-8 is identical to ASCII.
Exactly - I thought that was the whole point of UTF-8.
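
To make the mismatch concrete, here is a minimal Python sketch of what a C character literal would compile to on an EBCDIC compiler versus the byte that arrives in a UTF-8 pattern. It assumes code page 037 (Python's "cp037" codec) as a stand-in for "EBCDIC"; other EBCDIC code pages differ in places, and the helper name literal_bytes is hypothetical.

```python
# Assumption: code page 037 stands in for "EBCDIC"; real systems may use another page.
EBCDIC = "cp037"

def literal_bytes(ch):
    """Return (EBCDIC byte value, UTF-8 byte value) for one ASCII character."""
    return ch.encode(EBCDIC)[0], ch.encode("utf-8")[0]

# Compare the value of the literal on an EBCDIC compiler with the UTF-8 byte.
for ch in "\\*":
    ebcdic_value, utf8_value = literal_bytes(ch)
    print(f"{ch!r}: EBCDIC literal 0x{ebcdic_value:02X}, UTF-8 byte 0x{utf8_value:02X}")

# A UTF-8 backslash byte, interpreted as EBCDIC, is a different character entirely.
misread = bytes([literal_bytes("\\")[1]]).decode(EBCDIC)
```

Under this assumption the UTF-8 backslash byte has the same value as the EBCDIC '*', so a comparison against a C literal can match the wrong metacharacter rather than simply failing.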
> - Convert the pattern characters to EBCDIC as they are retrieved with
> one of the GETCHAR macros. In this way they could be compared to C
> character literals. This would work for the meta characters but what
> about other characters from the pattern which cannot be represented in
> EBCDIC but only in Unicode?
I do not understand how that could work, because I do not see how EBCDIC
and Unicode can be used simultaneously. Either your input is Unicode, or
it is EBCDIC, surely? I *do* see that you could use UTF-8 to encode
codes in the range 0 to 255 and then interpret them as EBCDIC, but I
don't get it for codes greater than 255. The subset of Unicode whose
codepoints are < 256 is not the same set of characters as EBCDIC.
Consider, for example, the character A-with-grave-accent. The Unicode
codepoint is 00C0, but there is no such character in EBCDIC. So how
would you represent this character, given that C0 is already in use as
an EBCDIC codepoint for something else?
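
The A-with-grave-accent clash can be checked with a short Python sketch. It assumes code page 037 (Python's "cp037" codec) purely to show what the value C0 means in one EBCDIC variant; the variable names are illustrative only.

```python
# Assumption: code page 037 is used only as an example EBCDIC code page.
EBCDIC = "cp037"

a_grave = "\u00C0"                # A-with-grave-accent, Unicode codepoint 00C0
utf8 = a_grave.encode("utf-8")    # how the character travels in a UTF-8 string

# The same numeric value, read as a single EBCDIC byte, is another character.
as_ebcdic = bytes([ord(a_grave)]).decode(EBCDIC)

print(f"U+{ord(a_grave):04X} as UTF-8: {utf8.hex()}")
print(f"byte 0x{ord(a_grave):02X} as EBCDIC: {as_ebcdic!r}")
```

So decoding UTF-8 to a code value below 256 and then treating that value as EBCDIC gives a different character from the one Unicode assigns to it.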
> I am writing this bug report mainly to get input how to solve the
> problem best. I am aware that you probably do not have access to an
> EBCDIC platform and that I would have to code a fix myself.
Indeed. Our mainframe was shut down in 1994, and I thankfully stopped
having to deal with EBCDIC. :-)
My suggestion is that you translate the other way, that is, you
translate everything from EBCDIC into Unicode before passing it to PCRE,
especially if you want to deal with characters that are not part of
EBCDIC (unless I've misunderstood what you meant above). Of course, you
would still have to fix the literals in PCRE. Hmmm. Not straightforward
at all.
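
As a rough sketch of the "translate the other way" idea, the following Python converts EBCDIC bytes to UTF-8 before any matching is done. It assumes code page 037 (Python's "cp037" codec), uses Python's own re module as a stand-in for PCRE, and the helper name ebcdic_to_utf8 is hypothetical. It says nothing about fixing the literals inside PCRE itself.

```python
import re

# Assumption: the EBCDIC data is in code page 037; substitute the real code page.
EBCDIC = "cp037"

def ebcdic_to_utf8(data, codepage=EBCDIC):
    """Decode EBCDIC bytes to Unicode, then re-encode them as UTF-8."""
    return data.decode(codepage).encode("utf-8")

# Simulate pattern and subject arriving as EBCDIC bytes.
pattern = ebcdic_to_utf8("a\\*b".encode(EBCDIC))
subject = ebcdic_to_utf8("xa*by".encode(EBCDIC))

# Python's re stands in for PCRE here; both inputs are now UTF-8 throughout.
match = re.search(pattern.decode("utf-8"), subject.decode("utf-8"))
```

After the conversion, the metacharacters in the pattern have their ASCII values, which is what a regex engine built for ASCII/UTF-8 expects.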
As I see it, the first thing to decide is exactly which characters you
want to deal with and how to represent them - the A-with-grave-accent
problem that I mentioned above.