Re: [pcre-dev] \b bug with extended Unicode characters?

Author: Ralf Junker
Date:
To: pcre-dev
Subject: Re: [pcre-dev] \b bug with extended Unicode characters?

Philip Hazel wrote:

>Did this ever get answered? The answer is that it is a limitation of
>PCRE. I have upgraded the documentation about \b to make it even
>clearer. It now says this:
>
> In UTF-8 mode, characters with values greater than 128 never match
> \d, \s, or \w, and always match \D, \S, and \W. This is true
> even when Unicode character property support is available. These
> sequences retain their original meanings from before UTF-8 support was
> available, mainly for efficiency reasons. Note that this also affects
> \b, because it is defined in terms of \w and \W.

Many thanks. I welcome this addition to (I suppose?) the PCRE pattern
page. With it, I would not have had to read the PCRE source code to
answer my question.

Only later did I find that the PCRE help already documents the limitation:

6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set
as before, all with values less than 256. This remains true even when
PCRE includes Unicode property support, because to do otherwise would
slow down PCRE in many common cases. If you really want to test for a
wider sense of, say, "digit", you must use Unicode property tests
such as \p{Nd}.

However, this paragraph is in pcre.html#utf8support, in the section
"General comments about UTF-8 mode". I believe that most pattern
writers will miss it there, so the extra mention in the pattern
documentation is very much appreciated!
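
For anyone who wants to see the effect on \b without building PCRE, here
is a minimal sketch in Python. It is only an approximation: Python's re
module is not PCRE, and I am assuming that its re.ASCII flag (which
restricts \w to [A-Za-z0-9_]) is a fair stand-in for the PCRE behaviour
quoted above. The helper name boundary_positions is my own, made up for
illustration.

```python
import re

def boundary_positions(s, flags=0):
    """Return the offsets in s where \\b (a word boundary) matches."""
    return [m.start() for m in re.finditer(r"\b", s, flags)]

text = "ñana"  # one non-ASCII letter followed by ASCII letters

# With re.ASCII, "ñ" is not a word character, so it counts as \W.
# A word boundary therefore appears between "ñ" and "a", inside
# what a human reader would call a single word.
ascii_match = re.search(r"\bana", text, flags=re.ASCII)

# Python's default for str patterns is Unicode-aware: "ñ" is a word
# character, there is no boundary after it, and the pattern fails.
unicode_match = re.search(r"\bana", text)

print(boundary_positions(text, re.ASCII))
print(boundary_positions(text))
print(ascii_match is not None, unicode_match is None)
```

The first search finds "ana" in the middle of the word, which is the kind
of surprising \b match that started this thread; the second search does
not.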

Ralf
