Re: [pcre-dev] [Bug 1049] Add support for UTF-16

Author: Tom Bishop, Wenlin Institute
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] [Bug 1049] Add support for UTF-16

On Nov 15, 2011, at 2:32 AM, Philip Hazel wrote:

> ...The problem with encapsulating too much is performance. For example, we
> *could* change PCRE so that every time it needs to "load next character"
> it calls a function. The function could check for 8,16,32, UTF status,
> and then load the data as required. However, I reckon this would kill
> PCRE's performance, which does matter in many applications.


Right. I was picturing configuration at compile time rather than run time, and macros instead of functions where that significantly improves performance.
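To make the compile-time idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration, not PCRE's actual code: `CODE_UNIT_BITS`, `code_unit`, `NEXT_UNIT`, and `count_units` are hypothetical names invented for this example. The width is fixed when the library is built, so the "load next character" step expands inline and carries no run-time check for 8, 16, or 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CODE_UNIT_BITS is a hypothetical build-time switch (for example
   -DCODE_UNIT_BITS=32); it is not a real PCRE configuration macro. */
#ifndef CODE_UNIT_BITS
#define CODE_UNIT_BITS 16
#endif

#if CODE_UNIT_BITS == 8
typedef uint8_t code_unit;
#elif CODE_UNIT_BITS == 16
typedef uint16_t code_unit;
#elif CODE_UNIT_BITS == 32
typedef uint32_t code_unit;
#else
#error "CODE_UNIT_BITS must be 8, 16, or 32"
#endif

/* Checked by the compiler, so it costs nothing at run time. */
static_assert(sizeof(code_unit) * 8 == CODE_UNIT_BITS,
              "code_unit width does not match CODE_UNIT_BITS");

/* Load the next code unit and advance the pointer. This is a macro,
   so it expands inline: no function call, no run-time width test. */
#define NEXT_UNIT(p) ((uint32_t)*(p)++)

/* Count the code units before a zero terminator. The same source
   compiles unchanged for all three widths. */
static size_t count_units(const code_unit *p)
{
    size_t n = 0;
    while (NEXT_UNIT(p) != 0)
        n++;
    return n;
}
```

Building the same source three times with different values of the switch would give three libraries, which is roughly the libpcre / libpcre16 / libpcre32 arrangement discussed in this thread.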

> PCRE supports up to 6 for UTF-8. :-)


Not 13 bytes like Perl 5? :-/
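For reference, the 6-byte limit comes from the original UTF-8 design, in which the lead byte announces a sequence of 1 to 6 bytes and the longest form carries 31 bits (up to 0x7FFFFFFF). RFC 3629 later restricted UTF-8 to 4 bytes and U+10FFFF. The following is a rough sketch of the lead-byte rule only; `utf8_seq_len` is a name made up for this example, not a PCRE function, and it makes no attempt to reject overlong encodings.

```c
#include <assert.h>

/* Return the total sequence length implied by a UTF-8 lead byte under
   the original 1-to-6-byte scheme, or 0 if the byte cannot start a
   sequence (a continuation byte 0x80..0xBF, or 0xFE/0xFF). */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;  /* 0xxxxxxx */
    if (lead < 0xC0) return 0;  /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;  /* 110xxxxx */
    if (lead < 0xF0) return 3;  /* 1110xxxx */
    if (lead < 0xF8) return 4;  /* 11110xxx */
    if (lead < 0xFC) return 5;  /* 111110xx */
    if (lead < 0xFE) return 6;  /* 1111110x */
    return 0;                   /* 0xFE and 0xFF are unused here */
}
```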

> I agree that we
> should do our best so that the work to provide libpcre16 can trivially
> be extended to provide libpcre32 if that is ever wanted.


That's great.

> [Aside: I've been a programmer for over 40 years. Today's machines are
> vastly faster, with huge memories, and yet we still worry about
> performance. I suspect we always will. :-]


Probably so! Speed, compactness, and convenience are sometimes in competition with each other. I suspect UTF-32 will eventually be used more, at least for processing in RAM if not for long-term storage, since it can be more convenient for programming and might even give faster performance in some cases. I also suspect that code points larger than U+10FFFF will eventually be used in important applications, partly because regex will be useful for kinds of information other than natural language, and we'll reuse and extend the existing regex and encoding technologies rather than invent something completely separate.
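As a small illustration of the convenience point, compare what it takes to fetch one code point from each encoding. This is just a sketch under my own assumptions: `utf16_next` and `utf32_next` are hypothetical helpers, the input is assumed to be zero-terminated, and an unpaired surrogate is simply returned as-is rather than reported as an error.

```c
#include <assert.h>
#include <stdint.h>

/* Fetch one code point from UTF-16 and advance *pp. A high surrogate
   (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF) is
   combined into one code point at or above U+10000. Peeking at the
   next unit is safe only because the input is zero-terminated. */
static uint32_t utf16_next(const uint16_t **pp)
{
    const uint16_t *p = *pp;
    uint32_t c = *p++;
    if (c >= 0xD800 && c <= 0xDBFF && *p >= 0xDC00 && *p <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(*p++ - 0xDC00);
    }
    *pp = p;
    return c;
}

/* Fetch one code point from UTF-32 and advance *pp. Every code point
   is exactly one unit, so there is nothing to test or combine. */
static uint32_t utf32_next(const uint32_t **pp)
{
    return *(*pp)++;
}
```

The UTF-32 version has no branch at all, which is where a speed advantage might come from in some cases, at the cost of memory.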

PCRE is really fantastic, thank you!

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: wenlin@???     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯