[pcre-dev] [Bug 897] \w and others based on Unicode properti…

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 897] \w and others based on Unicode properties
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=897




--- Comment #1 from Philip Hazel <ph10@???> 2009-10-19 19:39:34 ---
On Mon, 19 Oct 2009, Pavel Kostromitinov wrote:

> However, having to deal with international characters almost constantly, I
> would really appreciate something like a compile-time option (for compiling
> pcre) to force it into using Unicode properties always.
> I cannot just replace all the "\b" with complex constructions based on \p{},
> since I don't write patterns myself - end-users do it. And parsing their
> patterns just to make correct replacement doesn't look appealing to me either.
>
> At least, I would greatly appreciate a hint on where should I look in pcre
> sources to try and change this behaviour myself.


Look at all the places in pcre_exec.c where any of the following opcodes
is mentioned:

OP_WORD_BOUNDARY, OP_NOT_WORD_BOUNDARY, OP_DIGIT, OP_NOT_DIGIT,
OP_WHITESPACE, OP_NOT_WHITESPACE, OP_WORDCHAR, OP_NOT_WORDCHAR.

There are 44 places in the code where you would have to make changes.
They would be quite substantial changes because not only does the
current code use a look-up table, it knows that it just needs to test
one byte from the subject instead of looking for a general UTF-8
character.
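
As a rough illustration of that difference (a sketch only, not the code in
pcre_exec.c): the byte-oriented test is a single table lookup, while a
property-based test first has to decode a whole UTF-8 character. The names
word_table, init_word_table, is_word_byte, decode_utf8 and is_word_codepoint
are hypothetical, the decoder assumes well-formed input, and the property
lookup is a stand-in that knows only ASCII and the basic Cyrillic letters.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-entry table, filled once; not PCRE's real ctypes table. */
static unsigned char word_table[256];

static void init_word_table(void)
{
    for (int c = 0; c < 256; c++)
        word_table[c] = (c == '_' || (c < 128 && isalnum(c))) ? 1 : 0;
}

/* Byte-oriented test: look at one subject byte, do one table lookup. */
static int is_word_byte(const unsigned char *p)
{
    return word_table[*p];
}

/* Decode one UTF-8 character (assumes well-formed input).
   Returns the code point and stores its length in bytes in *len. */
static uint32_t decode_utf8(const unsigned char *p, size_t *len)
{
    if (p[0] < 0x80) { *len = 1; return p[0]; }
    if ((p[0] & 0xE0) == 0xC0) {
        *len = 2;
        return (uint32_t)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *len = 3;
        return (uint32_t)(((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6)
                          | (p[2] & 0x3F));
    }
    *len = 4;
    return (uint32_t)(((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12)
                      | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F));
}

/* Stand-in for a Unicode property lookup: ASCII word characters plus
   the basic Cyrillic letters U+0410..U+044F. A real lookup would
   consult the full Unicode property tables. */
static int is_word_codepoint(uint32_t c)
{
    if (c < 128) return c == '_' || isalnum((int)c);
    return c >= 0x0410 && c <= 0x044F;
}
```

With the bytes 0xD0 0x96 (U+0416) as the subject, is_word_byte sees only the
lead byte and says "not a word character", whereas decoding first and then
calling is_word_codepoint accepts it.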

I suppose a compile-time option would be better than a runtime option,
because that would save testing the option many times during a run.
However, in theory there would still have to be a test for UTF-8 mode at
run time. Some of the tests are inside loops - you don't want to test
the flag every time round the loop, so two copies of the loop will
probably be needed.
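
The "two copies of the loop" idea might look roughly like the sketch below,
where the mode flag is tested once and each branch has its own loop. This is
an assumption-laden illustration, not PCRE code: match_word_run and
is_ascii_word are hypothetical names, and the UTF-8 branch simply treats
every non-ASCII character as a word character in place of a real property
lookup.

```c
#include <stddef.h>

static int is_ascii_word(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

/* Returns the length in bytes of the run of word characters at the
   start of s. The utf8 flag is tested once, outside both loops. */
static size_t match_word_run(const unsigned char *s, size_t len, int utf8)
{
    size_t i = 0;

    if (utf8) {
        /* UTF-8 copy of the loop: advance by whole characters. */
        while (i < len) {
            unsigned char c = s[i];
            if (c < 0x80) {
                if (!is_ascii_word(c)) break;
                i++;
            } else {
                size_t n = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : 2;
                if (i + n > len) break;   /* truncated character */
                /* Placeholder: every non-ASCII character counts as a
                   word character; a real version would look up its
                   Unicode property here. */
                i += n;
            }
        }
    } else {
        /* Byte copy of the loop: one byte per step, no mode test. */
        while (i < len && is_ascii_word(s[i])) i++;
    }
    return i;
}
```

For the six bytes "ab", 0xD0, 0x96, " ", "c", the byte loop stops after
"ab", while the UTF-8 loop also consumes the two-byte character.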

Hmmm.... Maybe the compile-time option should be "force UTF-8 mode
always and use Unicode properties always". Then a lot of testing for
UTF-8 mode could be cut out and the PCRE_UTF8 option would be redundant.
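
One way such a build option could remove the run-time tests is sketched
below. Everything here is hypothetical: SKETCH_ALWAYS_UTF8_UCP,
SKETCH_OPT_UTF8, UTF8_MODE and uses_utf8_path are made-up names for
illustration, and PCRE defines no such macro.

```c
/* Hypothetical option bit standing in for a run-time UTF-8 flag. */
#define SKETCH_OPT_UTF8 0x0800u

/* Hypothetical build flag. When it is defined, the mode test becomes a
   constant and the compiler can drop the non-UTF-8 branches. */
#ifdef SKETCH_ALWAYS_UTF8_UCP
#define UTF8_MODE(opts) 1
#else
#define UTF8_MODE(opts) (((opts) & SKETCH_OPT_UTF8) != 0)
#endif

/* Returns 1 if the UTF-8 code path would be taken for these options. */
static int uses_utf8_path(unsigned int options)
{
    (void)options;   /* unused when the build flag forces UTF-8 */
    return UTF8_MODE(options) ? 1 : 0;
}
```

In a default build the result still depends on the option bit; compiled with
-DSKETCH_ALWAYS_UTF8_UCP, the function always returns 1 and the option bit
no longer matters.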

I have just (today) released PCRE 8.00 and I don't plan on working on
PCRE now for some time, except to fix any important bugs that show up.
I have, however, noted this item for thinking about sometime in the
future.

Philip


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email