Re: [pcre-dev] Using PCRE upon Asian and other two-byte national codings

Author: Zoltán Herczeg
Date:
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Using PCRE upon Asian and other two-byte national codings

Hi,

PCRE supports 2- or 4-byte character encodings, but character properties are supported only for character codes 0-255. In other words, you need to use explicit character ranges instead of \s, \d, \w, etc. to detect these character types.
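
As a rough illustration of the "ranges instead of \d" point, here is a minimal sketch in Python rather than PCRE itself. It assumes Python's re.ASCII flag is an acceptable stand-in for an engine whose \d only knows low character codes, and it uses Unicode code points for the fullwidth digits; with a national two-byte coding the range endpoints would be that coding's own code values. The sample text is made up for the example.

```python
import re

# Stand-in for an engine whose \d only covers low character codes:
# re.ASCII limits \d to the ASCII digits 0-9 (an approximation, not PCRE).
shorthand_digits = re.compile(r"\d+", re.ASCII)

# Explicit ranges: ASCII digits plus fullwidth digits U+FF10..U+FF19.
# In a national two-byte coding, substitute that coding's code values.
ranged_digits = re.compile("[0-9\uFF10-\uFF19]+")

# Hypothetical sample text mixing fullwidth and ASCII digits.
text = "注文番号 １２３ and 456"

print(shorthand_digits.findall(text))  # the shorthand misses the fullwidth run
print(ranged_digits.findall(text))     # the explicit range finds both runs
```

The same idea applies to \s and \w: list the ranges that count as whitespace or word characters in the coding in use.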

There is another way, but it requires a lot of knowledge about PCRE internals. You can build a custom PCRE in which the UTF tables are replaced by data generated from the locale, and the UTF character-read macros are changed to behave like fixed-width character reads. The JIT also needs modifications to use fixed-width character reads.
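
To make the difference between a UTF read and a fixed-width read concrete, here is a conceptual sketch in Python. It is not PCRE's actual macro code: the function names read_utf16 and read_fixed are hypothetical, and the sample values are arbitrary. It assumes 16-bit units and shows why a UTF-16 style read is unsuitable for a national two-byte coding in which values in the surrogate range may be ordinary characters.

```python
def read_utf16(units, i):
    """Variable-width read: combine a surrogate pair into one code point."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
        low = units[i + 1]
        if 0xDC00 <= low <= 0xDFFF:
            code = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
            return code, i + 2
    return u, i + 1


def read_fixed(units, i):
    """Fixed-width read: every 16-bit unit is exactly one character."""
    return units[i], i + 1


# Arbitrary 16-bit values; the middle two happen to look like a surrogate pair.
units = [0x0041, 0xD83D, 0xDE00, 0x8140]

print(read_utf16(units, 1))  # consumes two units as one character
print(read_fixed(units, 1))  # consumes one unit, value unchanged
```

With the fixed-width read, the character value is used directly as an index into whatever property data was generated for the locale.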

I once thought about supporting Unicode properties without Unicode character decoding in native 16-bit mode, but that is easy only in the JIT; the interpreter would need a lot of rework.

Regards,
Zoltan

ND <nadenj@???> wrote:
>Good day!
>
>Please clarify just one thing.
>As I understand it, there is no way to use PCRE patterns written in Asian
>(Japanese Shift-JIS and so on) and other national two-byte codings against
>texts that are written in such codings. Both the text and the pattern must
>be converted to UTF-8 first. If I need to change the text that was found,
>the text must be recoded to UTF-8, then processed with PCRE to find the
>needed fragment, then the fragment changed, and then all the text converted
>back to Shift-JIS.
>Is that right?
>
>Thanks a lot.
>
>--
>## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
>
