On Nov 15, 2011, at 2:32 AM, Philip Hazel wrote:
> ...The problem with encapsulating too much is performance. For example, we
> *could* change PCRE so that every time it needs to "load next character"
> it calls a function. The function could check for 8,16,32, UTF status,
> and then load the data as required. However, I reckon this would kill
> PCRE's performance, which does matter in many applications.
Right. I was picturing configuration at compile time rather than run time, and macros instead of functions where that significantly improves performance.
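As a rough sketch of the compile-time approach (the names CODE_UNIT_WIDTH, code_unit, LOAD_NEXT_UNIT, and count_units are hypothetical, made up for illustration, and are not PCRE's actual identifiers), the width could be fixed when the library is built, so the "load next character" step expands inline with no run-time check:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time setting, e.g. -DCODE_UNIT_WIDTH=16.
   This is an illustrative name, not a real PCRE option. */
#ifndef CODE_UNIT_WIDTH
#define CODE_UNIT_WIDTH 8
#endif

#if CODE_UNIT_WIDTH == 8
typedef uint8_t code_unit;
#elif CODE_UNIT_WIDTH == 16
typedef uint16_t code_unit;
#elif CODE_UNIT_WIDTH == 32
typedef uint32_t code_unit;
#else
#error "CODE_UNIT_WIDTH must be 8, 16, or 32"
#endif

/* Load the next code unit and advance the pointer. Because the width
   is chosen by the preprocessor, no width test happens at run time. */
#define LOAD_NEXT_UNIT(p) ((uint32_t)*(p)++)

/* Example use: count code units up to (not including) a zero terminator. */
static size_t count_units(const code_unit *p)
{
    size_t n = 0;
    while (LOAD_NEXT_UNIT(p) != 0)
        n++;
    return n;
}
```

Compiling the same source three times with different -DCODE_UNIT_WIDTH values would then give three libraries from one code base, which is the kind of arrangement I had in mind.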
> PCRE supports up to 6 for UTF-8. :-)
Not 13 bytes like Perl 5? :-/
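For reference, the 6-byte limit comes from the original UTF-8 design, which covers values up to 31 bits; the current standard (RFC 3629) stops at 4 bytes, i.e. U+10FFFF. A minimal sketch of the length calculation under the original scheme, with utf8_len being a hypothetical helper name and not a PCRE function:

```c
#include <stdint.h>

/* Number of bytes needed to encode a value under the original
   (pre-RFC 3629) UTF-8 scheme, which allows up to 31 bits in 6 bytes.
   Returns 0 if the value does not fit in 31 bits. */
static int utf8_len(uint32_t cp)
{
    if (cp < 0x80)        return 1;  /*  7 bits */
    if (cp < 0x800)       return 2;  /* 11 bits */
    if (cp < 0x10000)     return 3;  /* 16 bits */
    if (cp < 0x200000)    return 4;  /* 21 bits */
    if (cp < 0x4000000)   return 5;  /* 26 bits */
    if (cp < 0x80000000u) return 6;  /* 31 bits */
    return 0;                        /* beyond the 6-byte scheme */
}
```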
> I agree that we
> should do our best so that the work to provide libpcre16 can trivially
> be extended to provide libpcre32 if that is ever wanted.
That's great.
> [Aside: I've been a programmer for over 40 years. Today's machines are
> vastly faster, with huge memories, and yet we still worry about
> performance. I suspect we always will. :-]
Probably so! Speed, compactness, and convenience are sometimes in competition with each other. I suspect UTF-32 will eventually be used more, at least for processing in RAM if not for long-term storage, since it can be more convenient for programming and might even give faster performance in some cases. I also suspect that code points larger than U+10FFFF will eventually be used in important applications, partly because regexes will be useful for kinds of information other than natural language, and we'll re-use and extend the existing regex and encoding technologies rather than invent something completely separate.
PCRE is really fantastic, thank you!
Tom
文林 Wenlin Institute, Inc. Software for Learning Chinese
E-mail: wenlin@??? Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯