http://bugs.exim.org/show_bug.cgi?id=791
Summary: UTF-8 support does not work on EBCDIC platforms
Product: PCRE
Version: 7.8
Platform: Other
OS/Version: All
Status: NEW
Severity: bug
Priority: medium
Component: Code
AssignedTo: ph10@???
ReportedBy: martin.jerabek@???
CC: pcre-dev@???
We use PCRE 7.8 on z/OS (formerly known as OS/390 and MVS) which uses EBCDIC as
its character encoding. We pass only UTF-8 strings to PCRE using the PCRE_UTF8
flag, both as patterns and subject strings, because our internal string class
is based on UTF-8. We configured PCRE with both --enable-utf8 and
--enable-ebcdic, and also created the appropriate EBCDIC character table with
dftables.
Tests showed that compiling UTF-8 patterns failed. The pattern we used resulted
in an 'unmatched parentheses' error but the exact error does not matter. The
pattern was of course well-formed and worked on non-EBCDIC platforms. I dug
deeper into the PCRE source, and I think I found the cause of the problem:
PCRE processes the pattern either by reading characters directly from the
pattern buffer, as compile_branch() does (this is the function I analyzed), or
by using the GETCHAR macros from pcre_internal.h. No matter how a character is
retrieved, it is always in the same encoding as the pattern passed to
pcre_compile(). In our case this is UTF-8.
The real problem is that the PCRE code compares the characters taken from the
pattern with ordinary C character literals such as '\\', '*', etc. This works
on non-EBCDIC (ASCII) platforms because the 7-bit subset of UTF-8 is identical
to ASCII.
On an EBCDIC platform such character literals do not match the UTF-8
characters they are meant to match, because the characters in the "ASCII
subset" of UTF-8 have different codes in EBCDIC. For example, the backslash has
code 0x5C in ASCII and UTF-8 but 0xE0 or 0xEC in EBCDIC, depending on the
EBCDIC codepage, so the C compiler generates code that compares the retrieved
UTF-8 character with 0xE0 or 0xEC instead of 0x5C.
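To illustrate the mismatch, here is a minimal sketch. The names are
hypothetical and are not PCRE identifiers. The character codes are written
numerically so that the example behaves the same on any compiler; 0xE0 is one
of the two EBCDIC backslash codes mentioned above.

```c
#include <stddef.h>

/* Hypothetical constants for illustration; these are not PCRE identifiers. */
#define UTF8_BACKSLASH   0x5C  /* backslash in ASCII and UTF-8 */
#define EBCDIC_BACKSLASH 0xE0  /* backslash in some EBCDIC codepages */

/* Count the bytes in a pattern buffer that equal the given code.
   On an EBCDIC compiler the literal for a backslash has the value
   EBCDIC_BACKSLASH, so a scan written with that literal behaves like
   count_code(pattern, len, EBCDIC_BACKSLASH). */
static size_t count_code(const unsigned char *pattern, size_t len,
                         unsigned char code)
{
    size_t hits = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        if (pattern[i] == code)
            hits++;
    }
    return hits;
}

/* The three UTF-8 bytes of a pattern consisting of a backslash, the
   letter d and a plus sign, given numerically so that they do not
   depend on the compiler's execution character set. */
static const unsigned char sample_pattern[] = { 0x5C, 0x64, 0x2B };
```

Scanning sample_pattern for UTF8_BACKSLASH finds the escape; scanning it for
EBCDIC_BACKSLASH, which is what a character literal amounts to on our
platform, finds nothing.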
I see three possible solutions to this problem:
- On EBCDIC platforms, do not use C character literals in the PCRE code but use
lookup tables to get the appropriate character code, one for non-UTF-8 (EBCDIC)
mode, the other one for UTF-8 mode. The lookup tables would return either the
EBCDIC code of a character or the UTF-8 code. The lookup tables would have to
be user-definable because as I mentioned above some meta characters like the
backslash have different codes in different EBCDIC codepages.
- Convert the pattern characters to EBCDIC as they are retrieved with one of
the GETCHAR macros. In this way they could be compared to C character literals.
This would work for the meta characters but what about other characters from
the pattern which cannot be represented in EBCDIC but only in Unicode?
- Convert the entire pattern from UTF-8 to EBCDIC before compiling it. This is
clearly undesirable because it would negate all advantages of using UTF-8. I
do not know PCRE well enough to say whether the subsequent matching would
even work in this case.
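A rough sketch of what the first option (lookup tables) might look like
follows. All names are hypothetical and none of them exist in PCRE. The EBCDIC
values are assumed to be those of one particular codepage (code page 037) and
would have to be user-definable in a real fix. The parenthesis check is only a
toy that shows how the wrong table could produce a spurious error; I do not
claim it is what actually happens inside compile_branch().

```c
#include <stddef.h>

/* All names here are hypothetical; none of them exist in PCRE. */
typedef enum {
    META_BACKSLASH,
    META_STAR,
    META_LPAREN,
    META_RPAREN,
    META_COUNT
} meta_char;

/* Codes for UTF-8 mode: the same as ASCII. */
static const unsigned char meta_utf8[META_COUNT] = {
    0x5C, 0x2A, 0x28, 0x29
};

/* Codes for non-UTF-8 mode, assuming EBCDIC code page 037.
   Other codepages differ, e.g. in the backslash code. */
static const unsigned char meta_ebcdic[META_COUNT] = {
    0xE0, 0x5C, 0x4D, 0x5D
};

/* Select the table that matches the encoding of the pattern. */
static const unsigned char *meta_table(int utf8_mode)
{
    return utf8_mode ? meta_utf8 : meta_ebcdic;
}

/* Toy check: return 1 if parentheses balance, 0 otherwise. A backslash
   escapes the following byte. Character classes and every other piece
   of regex syntax are deliberately ignored. */
static int parens_balanced(const unsigned char *pattern, size_t len,
                           int utf8_mode)
{
    const unsigned char *t = meta_table(utf8_mode);
    long depth = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        unsigned char c = pattern[i];
        if (c == t[META_BACKSLASH]) {
            i++;                 /* skip the escaped byte */
        } else if (c == t[META_LPAREN]) {
            depth++;
        } else if (c == t[META_RPAREN]) {
            if (--depth < 0)
                return 0;        /* closing parenthesis without opener */
        }
    }
    return depth == 0;
}

/* UTF-8 bytes of a bracketed class containing the letter a:
   0x5B is '[', 0x61 is 'a', 0x5D is ']'. */
static const unsigned char class_pattern[] = { 0x5B, 0x61, 0x5D };
```

With the UTF-8 table, class_pattern contains no parentheses and the check
passes. With the EBCDIC table, the byte 0x5D (']' in UTF-8) is taken for ')'
and the check fails, which is the kind of misreading that comparing UTF-8
data against EBCDIC literals can cause.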
I am writing this bug report mainly to get input on how best to solve the
problem. I am aware that you probably do not have access to an EBCDIC platform
and that I would have to code a fix myself.
BTW, Bugzilla does not offer appropriate choices for Hardware (zSeries) and OS
(z/OS), so I had to pick inaccurate ones.