Re: [pcre-dev] Using PCRE upon Asian and other two-byte national codings

Author: Zoltán Herczeg
Date:
To: pcre-dev

Hi,

currently PCRE character tables can only hold the lowercase / flipped-case mappings and the various type bits for the first 256 characters. Supporting the whole 64K character set in 16-bit mode would take 409600 bytes of memory, which is less than half a megabyte. Today, even smartphones can afford that cost. The trade-off would be that the same tables could no longer be shared between the 8-, 16- and 32-bit modes, since the entries of the lowercase / flipped-case tables would have to be as wide as the natural character size of each mode. Hence a table covering only 256 characters would be bigger in 16/32-bit mode than it is now. (Note: the table size would always be divisible by 256. This way nothing would have to change in 8-bit mode, and we could also support character sets that have fewer than 64K characters in 16-bit mode, and especially in 32-bit mode, where there are 4096M possible character codes.)
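
The 409600 figure can be reproduced with a small sketch. It assumes that the layout of the current 1088-byte table set (a lowercase table, a flipped-case table, ten character-class bitmaps, and a one-byte-per-character type table) is simply scaled up, with the two case tables holding one code unit per character. That breakdown is a guess that happens to match the number quoted above, not a design; `scaled_table_bytes` and its parameters are hypothetical names and not part of the PCRE API.

```c
#include <stddef.h>

/* Number of character-class bitmaps in the existing tables
   (space, xdigit, digit, upper, lower, word, graph, print, punct, cntrl). */
#define CLASS_BITMAPS 10

/* Hypothetical helper: bytes needed for tables covering `nchars` characters
   when each case-table entry is `unit_bytes` wide (1, 2 or 4).
   `nchars` is assumed to be a multiple of 256. */
size_t scaled_table_bytes(size_t nchars, size_t unit_bytes)
{
    size_t lower_case = nchars * unit_bytes;          /* lowercase mapping */
    size_t flip_case  = nchars * unit_bytes;          /* flipped-case mapping */
    size_t class_bits = CLASS_BITMAPS * (nchars / 8); /* one bit per character per class */
    size_t type_bytes = nchars;                       /* one type byte per character */

    return lower_case + flip_case + class_bits + type_bytes;
}
```

Under these assumptions, `scaled_table_bytes(256, 1)` gives 1088 (the size of the tables today), `scaled_table_bytes(256, 2)` gives 1600 (the same 256 characters in 16-bit mode), and `scaled_table_bytes(65536, 2)` gives 409600.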

I am sure we cannot do this for 8.34 (this is not an easy task), but if this is important for many people, we might think about this later.

Regards,
Zoltan

ph10@??? wrote:
>On Sat, 23 Nov 2013, Zoltán Herczeg wrote:
>
>> PCRE supports 2 or 4 byte character encodings, but character
>> properties are only supported for 0-255 character codes.
>
>I think I had better clarify that, for the record. The 16-bit and 32-bit
>PCRE libraries do support Unicode character properties, just like the
>8-bit library. However, locale-based properties apply only to 0-255
>character codes.
>
>Philip
>
>--
>Philip Hazel
