[pcre-dev] [Bug 791] New: UTF-8 support does not work on EBC…

Author: Martin Jerabek
Date:
To: pcre-dev
New Topics: [pcre-dev] [Bug 791] UTF-8 support does not work on EBCDIC platforms
Subject: [pcre-dev] [Bug 791] New: UTF-8 support does not work on EBCDIC platforms
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=791
           Summary: UTF-8 support does not work on EBCDIC platforms
           Product: PCRE
           Version: 7.8
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: martin.jerabek@???
                CC: pcre-dev@???



We use PCRE 7.8 on z/OS (formerly known as OS/390 and MVS), which uses EBCDIC
as its character encoding. We pass only UTF-8 strings to PCRE using the
PCRE_UTF8 flag, both as patterns and as subject strings, because our internal
string class is based on UTF-8. We configured PCRE with both --enable-utf8 and
--enable-ebcdic, and also created the appropriate EBCDIC character table with
dftables.

Tests showed that compiling UTF-8 patterns failed. The pattern we used resulted
in an 'unmatched parentheses' error but the exact error does not matter. The
pattern was of course well-formed and worked on non-EBCDIC platforms. I dug
deeper into the PCRE source, and I think I found the cause of the problem:

PCRE processes the pattern either by reading the characters directly from the
pattern buffer, as compile_branch() (the function I analyzed) does, or by using
the GETCHAR macros from pcre_internal.h. No matter how a character is
retrieved, it is always in the same encoding as the pattern passed to
pcre_compile(). In our case this is UTF-8.

The real problem is that the PCRE code compares the characters taken from the
pattern with ordinary C character literals such as '\\', '*', etc. This works
fine on non-EBCDIC (ASCII) platforms because the 7-bit subset of UTF-8 is
identical to ASCII.

On an EBCDIC platform such character literals will never match UTF-8
characters, because the characters in the "ASCII subset" of UTF-8 have
different codes in EBCDIC. For example, the backslash has code 0x5C in ASCII
and UTF-8 but 0xE0 or 0xEC in EBCDIC, depending on the EBCDIC codepage, so the
C compiler would generate code that compares the retrieved UTF-8 character with
0xE0 or 0xEC.
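
The mismatch can be illustrated with a minimal sketch in C. Because '\\' is
0x5C on an ASCII host, the sketch does not rely on the compiler's own literals
and spells out both code points as constants instead. The names count_byte,
utf8_pattern, UTF8_BACKSLASH and EBCDIC_BACKSLASH are hypothetical and not part
of PCRE; 0xE0 is the value quoted above for some EBCDIC codepages.

```c
#include <stddef.h>

/* Backslash code points as stated in the report: 0x5C in ASCII/UTF-8,
   0xE0 in some EBCDIC codepages. Spelled out as constants so that the
   sketch behaves the same on any host. */
enum { UTF8_BACKSLASH = 0x5C, EBCDIC_BACKSLASH = 0xE0 };

/* The UTF-8 pattern \d+ as raw bytes. */
const unsigned char utf8_pattern[] = { 0x5C, 0x64, 0x2B };

/* Hypothetical helper (not a PCRE function): count the bytes in a
   pattern that equal the value a character literal was compiled to. */
size_t count_byte(const unsigned char *pattern, size_t len,
                  unsigned char literal)
{
    size_t n = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        if (pattern[i] == literal)
            n++;
    }
    return n;
}
```

Searching utf8_pattern for UTF8_BACKSLASH finds one backslash, while searching
for EBCDIC_BACKSLASH finds none, which is what a comparison against '\\'
amounts to when the literal is compiled on an EBCDIC platform.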

I see three possible solutions to this problem:

- On EBCDIC platforms, do not use C character literals in the PCRE code but use
lookup tables to get the appropriate character code: one for non-UTF-8 (EBCDIC)
mode, the other for UTF-8 mode. The lookup tables would return either the
EBCDIC code of a character or the UTF-8 code. They would have to be
user-definable because, as I mentioned above, some metacharacters such as the
backslash have different codes in different EBCDIC codepages.

- Convert the pattern characters to EBCDIC as they are retrieved with one of
the GETCHAR macros. In this way they could be compared to C character literals.
This would work for the meta characters but what about other characters from
the pattern which cannot be represented in EBCDIC but only in Unicode?

- Convert the entire pattern from UTF-8 to EBCDIC before compiling it. This is
clearly undesirable because it would negate all advantages of using UTF-8. I
do not know PCRE well enough to say whether the subsequent matching would even
work in this case.
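
A minimal sketch of the first option, in C. Everything in it is hypothetical
and not PCRE code: the enum meta_char, the two tables, and the helpers
meta_code() and set_ebcdic_meta(). The EBCDIC values are assumed to be those of
codepage 037 and would need to be checked against the target codepage. The
EBCDIC table is writable so that entries such as the backslash can be
overridden.

```c
/* Hypothetical indices for a few metacharacters (not PCRE names). */
enum meta_char { META_BACKSLASH, META_STAR, META_LPAREN, META_RPAREN,
                 META_COUNT };

/* Codes in UTF-8 mode: identical to 7-bit ASCII. */
const unsigned char meta_utf8[META_COUNT] = { 0x5C, 0x2A, 0x28, 0x29 };

/* Codes in EBCDIC mode. Assumed values for codepage 037; writable so
   that other codepages can be configured at run time. */
unsigned char meta_ebcdic[META_COUNT] = { 0xE0, 0x5C, 0x4D, 0x5D };

/* Return the code to compare against instead of a C character literal. */
unsigned char meta_code(int utf8_mode, enum meta_char m)
{
    return utf8_mode ? meta_utf8[m] : meta_ebcdic[m];
}

/* Let the user override one EBCDIC entry, e.g. backslash as 0xEC. */
void set_ebcdic_meta(enum meta_char m, unsigned char code)
{
    meta_ebcdic[m] = code;
}
```

Under such a scheme, a comparison like c == '\\' would become
c == meta_code(utf8_mode, META_BACKSLASH). Whether this fits PCRE's internals,
and what it costs in the compile loop, is not established here.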

I am writing this bug report mainly to get input on how best to solve the
problem. I am aware that you probably do not have access to an EBCDIC platform
and that I would have to code a fix myself.

BTW, Bugzilla does not offer appropriate choices for Hardware (zSeries) and OS
(z/OS), so I had to use the wrong ones.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email