[pcre-dev] [Bug 791] UTF-8 support does not work on EBCDIC p…

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 791] UTF-8 support does not work on EBCDIC platforms
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=791




--- Comment #3 from Philip Hazel <ph10@???> 2008-12-17 11:03:39 ---
On Wed, 17 Dec 2008, Martin Jerabek wrote:

> Yes. In our case it will always be UTF-8 but in the general case the
> type of input depends on the PCRE_UTF8 flag. If it is set, the input is
> UTF-8, if not, it is EBCDIC (on EBCDIC platforms).


Aha! I now understand. What you really need is a version of PCRE that
assumes EBCDIC in non-UTF-8 mode, and Unicode in UTF-8 mode, compiled on
a system where EBCDIC is the default (so character constants in the
source are EBCDIC). Oh dear, that is indeed tricky because of the
character constants, which would need to be different in the two cases.

> (Your example is actually a bad one because À and à do have EBCDIC
> codepoints:


They didn't in the days I used EBCDIC. :-)

> I would replace all character constants with lookup table accesses.


That would undoubtedly be a performance hit, and I do worry about
performance with PCRE. I am sure we do not want this overhead for
non-EBCDIC systems ... oh, but I guess it doesn't matter if it is only
in pcre_compile(). I've just grepped the source, and there seem to be no
occurrences of character constants in pcre_exec(), so perhaps I am
worrying unduly.

> All this would only fix the problem if the C character literals are
> really the only cause of our difficulties. It would also require slight
> code modifications because the lookup tables could not be used in
> switch/case statements as it is possible with character literals.


Yes - that would be the messiest change needed, and no doubt where any
performance hit would be worst.
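
To illustrate why the lookup-table approach rules out switch/case, here is a minimal sketch. It is not taken from the PCRE sources: the enum, table names, and classify() are hypothetical, and the EBCDIC values assume code page 037. Because the table is selected at run time, its entries are not constant expressions, so they cannot serve as case labels and the test becomes a chain of comparisons.

```c
/* Hypothetical logical character names (not from the PCRE sources). */
enum logical_char { LC_BACKSLASH, LC_LPAREN, LC_RPAREN, LC_STAR, LC_COUNT };

/* One table per encoding, indexed by logical character.
   EBCDIC values assume code page 037. */
static const unsigned char ascii_codes[LC_COUNT]  = { 0x5C, 0x28, 0x29, 0x2A };
static const unsigned char ebcdic_codes[LC_COUNT] = { 0xE0, 0x4D, 0x5D, 0x5C };

/* Classify one pattern byte. Returns LC_COUNT if it is none of the
   characters above. */
static enum logical_char classify(unsigned char c, int utf8_mode)
{
    /* The table is chosen at run time, so codes[...] is not a constant
       expression and cannot be used as a case label. */
    const unsigned char *codes = utf8_mode ? ascii_codes : ebcdic_codes;

    if (c == codes[LC_BACKSLASH]) return LC_BACKSLASH;
    if (c == codes[LC_LPAREN])    return LC_LPAREN;
    if (c == codes[LC_RPAREN])    return LC_RPAREN;
    if (c == codes[LC_STAR])      return LC_STAR;
    return LC_COUNT;
}
```

Note that the byte 0x5C is a backslash in UTF-8 mode but an asterisk in the EBCDIC table, which is exactly the kind of clash that source-level character constants cannot express.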

> I will try this and report back whether it worked or not.


One other thought struck me: have you considered compiling two different
versions of PCRE? One would be for use in the EBCDIC case, the other for
use in the UTF-8 case. It would be easy to rename all the functions by
defining some macros, so that you could link both versions into the same
product. The only difficulty would be compiling the UTF-8 version in an
EBCDIC environment. I suppose you could replace all the character
constants with macros that could be defined appropriately - that is a
change that would not be a problem in the normal PCRE sources.
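
A minimal sketch of the two-build idea, under stated assumptions: the flag SKETCH_UTF8_BUILD, the CHAR_* macro names, the SKETCH_NAME() renaming macro, and the is_meta() function are all hypothetical and only illustrate the technique. Since the macros expand to compile-time constants in either build, switch/case keeps working, and the name prefix lets both builds be linked into one program.

```c
/* Hypothetical build flag: define SKETCH_UTF8_BUILD when compiling the
   UTF-8 flavour, even on a host whose native character set is EBCDIC. */
#ifdef SKETCH_UTF8_BUILD
/* Fixed Unicode/ASCII code points, independent of the host charset. */
#define CHAR_BACKSLASH 0x5C
#define CHAR_LPAREN    0x28
#define CHAR_RPAREN    0x29
#define CHAR_STAR      0x2A
#define SKETCH_NAME(f) sketch_utf8_##f
#else
/* Native build: let the compiler supply the host's own code points. */
#define CHAR_BACKSLASH '\\'
#define CHAR_LPAREN    '('
#define CHAR_RPAREN    ')'
#define CHAR_STAR      '*'
#define SKETCH_NAME(f) sketch_native_##f
#endif

/* Expands to sketch_utf8_is_meta or sketch_native_is_meta, so the two
   builds do not collide at link time. */
int SKETCH_NAME(is_meta)(unsigned char c)
{
    switch (c) {
    case CHAR_BACKSLASH:
    case CHAR_LPAREN:
    case CHAR_RPAREN:
    case CHAR_STAR:
        return 1;   /* the macros are constants, so case labels are fine */
    default:
        return 0;
    }
}
```

On an ASCII host the two branches produce identical values, which matches the remark that such a change would not be a problem in the normal sources; only on an EBCDIC host do the builds differ.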

Regards,
Philip


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email