[pcre-dev] [Bug 791] UTF-8 support does not work on EBCDIC platforms

Author: Martin Jerabek
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 791] UTF-8 support does not work on EBCDIC platforms
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=791

--- Comment #2 from Martin Jerabek <martin.jerabek@???> 2008-12-17 09:24:02 ---
On 16.12.2008 22:19, Philip Hazel wrote:
> Aarrgghh! I have to admit that I *never* considered the possibility of
> anybody wanting to use UTF-8 with EBCDIC, especially as "classical"
> EBCDIC uses bytes with values greater than 127. (I was an OS/370
> programmer for many years, so I do have some EBCDIC experience. But not
> in the last 15 years or so.)
>

The fact that EBCDIC uses codes > 127 does not really matter; what matters is
that EBCDIC assigns different codes than ASCII to the characters that are
present in both character sets. A bit of background: our products run on many
platforms (Windows, various Unixen, z/OS), so we decided to use UTF-8 as the
internal representation of our own string class on all platforms, because it
is the only encoding that depends on neither code pages nor byte order.
>> - Convert the pattern characters to EBCDIC as they are retrieved with
>> one of the GETCHAR macros. In this way they could be compared to C
>> character literals. This would work for the meta characters but what
>> about other characters from the pattern which cannot be represented in
>> EBCDIC but only in Unicode?
>>
>
> I do not understand how that could work, because I do not see how EBCDIC
> and Unicode can be used simultaneously. Either your input is Unicode, or
> it is EBCDIC, surely?

Yes. In our case it will always be UTF-8, but in the general case the type of
input depends on the PCRE_UTF8 flag: if it is set, the input is UTF-8; if
not, it is EBCDIC (on EBCDIC platforms).
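
For reference, this distinction is just the options argument of
pcre_compile(); a minimal sketch (the UTF-8 bytes are spelled out explicitly
because on z/OS a string literal would itself compile to EBCDIC bytes):

    #include <pcre.h>
    #include <stdio.h>

    int main(void)
    {
        const char *error;
        int erroffset;

        /* "ab*" as explicit UTF-8 bytes (0x61 0x62 0x2A); on z/OS the
           literal "ab*" would be EBCDIC bytes instead. */
        const char *pattern = "\x61\x62\x2A";

        /* PCRE_UTF8 set: pattern and subjects are UTF-8.
           Flag clear: they are in the native code set (EBCDIC on z/OS). */
        pcre *re = pcre_compile(pattern, PCRE_UTF8, &error, &erroffset, NULL);
        if (re == NULL)
            fprintf(stderr, "compile failed at offset %d: %s\n",
                    erroffset, error);
        return 0;
    }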

> I *do* see that you could use UTF-8 to encode
> codes in the range 0 to 255 and then interpret them as EBCDIC, but I
> don't get it for codes greater than 255. The subset of Unicode whose
> codepoints are < 256 is not the same set of characters as EBCDIC.
> Consider, for example, the character A-with-grave-accent. The Unicode
> codepoint is 00C0, but there is no such character in EBCDIC. So how
> would you represent this character, because 00C0 *is* an EBCDIC
> codepoint?
>

You are of course correct that there are many Unicode characters which
cannot be represented in EBCDIC, so the conversion would be lossy. (Your
example is actually a bad one, because À and à do have EBCDIC codepoints,
0x64 and 0x44, but I get your point.) For the regular expression meta
characters this would not matter, because they have EBCDIC counterparts;
otherwise PCRE would not work in EBCDIC at all. If the GETCHAR macros were
used only for meta characters, this could work, but of course not if they
are used for all characters, because the conversion from UTF-8 to EBCDIC
would lose information.
> My suggestion is that you translate the other way, that is, you
> translate everything from EBCDIC into Unicode before passing it to PCRE,
> especially if you want to deal with characters that are not part of
> EBCDIC (unless I've misunderstood what you meant above).

We already do this by using only UTF-8 internally, even on EBCDIC
platforms, and passing only UTF-8 strings together with the PCRE_UTF8 flag.
That is exactly what causes the problem on EBCDIC platforms: PCRE compares
the UTF-8 characters it is passed against C character literals, which are of
course EBCDIC on z/OS.
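
For completeness, the conversion itself at the PCRE boundary is easy with
iconv(); a minimal sketch (encoding names are platform-dependent; I assume
here that code page 273 is registered as "IBM-273"):

    #include <iconv.h>
    #include <string.h>

    /* Convert an EBCDIC (CP273) string to UTF-8 before handing it to
       PCRE. Returns 0 on success, -1 on failure. */
    int ebcdic273_to_utf8(const char *in, char *out, size_t outsize)
    {
        iconv_t cd = iconv_open("UTF-8", "IBM-273");
        char *inp = (char *)in, *outp = out;
        size_t inleft = strlen(in), outleft = outsize - 1;
        size_t rc;

        if (cd == (iconv_t)-1) return -1;
        rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);
        if (rc == (size_t)-1) return -1;
        *outp = '\0';
        return 0;
    }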
> Of course, you
> would still have to fix the literals in PCRE. Hmmm. Not straightforward
> at all.
>

Exactly. The idea with the lookup tables mentioned in my original post is
still the best one I can think of: I would replace all character constants
with lookup-table accesses. There would be two lookup tables, one for ASCII
platforms and for EBCDIC platforms in UTF-8 mode (PCRE_UTF8 set), containing
the ASCII/UTF-8 codes of the characters, and one for EBCDIC platforms in
EBCDIC mode (PCRE_UTF8 not set), containing the EBCDIC codes.
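
Something like this (a hypothetical sketch; the names are mine, and the
EBCDIC column assumes code page 1047):

    /* Each pattern-syntax character gets a symbolic index; the active
       table supplies its code in the current mode. */
    enum { META_BACKSLASH, META_ASTERISK, META_PLUS, META_DOT, META_COUNT };

    /* ASCII/UTF-8 codes: used on ASCII platforms, and on EBCDIC
       platforms when PCRE_UTF8 is set. */
    static const unsigned char meta_ascii[META_COUNT] =
        { 0x5C, 0x2A, 0x2B, 0x2E };

    /* EBCDIC codes (code page 1047 here): used on EBCDIC platforms
       when PCRE_UTF8 is not set. */
    static const unsigned char meta_ebcdic[META_COUNT] =
        { 0xE0, 0x5C, 0x4E, 0x4B };

A test like (c == '*') would then become (c == meta[META_ASTERISK]), where
meta points at whichever table matches the compile options.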

As mentioned before, these tables would probably have to be
user-configurable, like the table generated with dftables, because the
EBCDIC codes depend on the code page in use. We always use EBCDIC code page
273 (Germany/Austria), in which, for example, the backslash has code 0xEC,
whereas in the international EBCDIC code pages 500 and 1047 it has code
0xE0.
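
With the sketch above, a CP273 build would simply supply a different EBCDIC
table, e.g.:

    /* EBCDIC code page 273 (Germany/Austria): the backslash moves to
       0xEC; the invariant characters *, + and . keep their codes. */
    static const unsigned char meta_ebcdic_cp273[META_COUNT] =
        { 0xEC, 0x5C, 0x4E, 0x4B };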

All this would only fix the problem if the C character literals really are
the only cause of our difficulties. It would also require slight code
modifications, because lookup-table entries, unlike character literals,
cannot be used as case labels in switch statements.
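
For example (again using the hypothetical names from the sketch above):

    /* Character literals are compile-time constants and therefore
       valid case labels:

           switch (c) { case '*': ...; case '+': ...; }

       Table entries are not constant expressions, so the dispatch has
       to become an if/else chain instead: */
    static void dispatch(unsigned char c, const unsigned char *meta)
    {
        if (c == meta[META_ASTERISK]) { /* handle repetition */ }
        else if (c == meta[META_PLUS]) { /* handle one-or-more */ }
        /* ... and so on for the other meta characters */
    }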

I will try this and report back whether it worked or not.
> As I see it, the first thing to make a decision on is exactly what
> characters you want to deal with and how to represent them - the
> A-with-grave-accent problem that I mentioned above.
>

We already decided to use only UTF-8 so that we can represent all
characters, so for our purposes it would be sufficient to replace all
character constants with their ASCII/UTF-8 counterparts (e.g. '\x2A'
instead of '*'), because we never pass EBCDIC strings to PCRE. This
modification could not be included in the official PCRE sources, however,
because it would make PCRE unusable with EBCDIC strings.
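
For example (macro names invented for illustration):

    /* On an EBCDIC compiler '*' compiles to 0x5C, on an ASCII compiler
       to 0x2A; the explicit escapes pin the ASCII/UTF-8 values
       regardless of the compiler's character set. */
    #define CHAR_ASTERISK  '\x2A'   /* '*' */
    #define CHAR_PLUS      '\x2B'   /* '+' */
    #define CHAR_BACKSLASH '\x5C'   /* '\' */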

Thanks for your quick response!
Martin Jerabek


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email