Re: [pcre-dev] \b bug with extended Unicode characters?

Author: Philip Hazel
Date:  
To: Ralf Junker
CC: pcre-dev
Old topics: [pcre-dev] \b bug with extended Unicode characters?
Subject: Re: [pcre-dev] \b bug with extended Unicode characters?
On Wed, 21 Jan 2009, Ralf Junker wrote:

> it appears the word boundary anchor fails to work when it bounds a
> word using extended Unicode characters (PCRE 7.8, UTF-8 enabled):
>
> ŐŰőű -> Matches
> \bŐŰőű\b -> Fails
> \bNAME\b -> Matches
>
> Can anybody confirm this?


Did this ever get answered? The answer is that it is a limitation of
PCRE. I have upgraded the documentation about \b to make it even
clearer. It now says this:

In UTF-8 mode, characters with values greater than 128 never match
\d, \s, or \w, and always match \D, \S, and \W. This is true
even when Unicode character property support is available. These
sequences retain their original meanings from before UTF-8 support was
available, mainly for efficiency reasons. Note that this also affects
\b, because it is defined in terms of \w and \W.
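The effect described above can be reproduced outside PCRE. The following is only an illustrative sketch, not PCRE itself: it assumes Python's standard re module, whose re.ASCII flag restricts \w (and therefore \b) to [a-zA-Z0-9_], which approximates the ASCII-only \w described in the quoted documentation. The helper name matches() is made up for this example.

```python
import re

# The word from the original report, written with escapes so the
# example does not depend on the source file's encoding.
WORD = "\u0150\u0170\u0151\u0171"


def matches(pattern, subject, flags=0):
    """Return True if pattern is found anywhere in subject (hypothetical helper)."""
    return re.search(pattern, subject, flags) is not None


# With re.ASCII, \w covers only [a-zA-Z0-9_]. This is an approximation of
# the PCRE behaviour discussed here, not the PCRE implementation.
ascii_plain = matches(WORD, WORD, re.ASCII)
ascii_bounded = matches(r"\b" + WORD + r"\b", WORD, re.ASCII)
ascii_name = matches(r"\bNAME\b", "NAME", re.ASCII)

# Python's default for str patterns is a Unicode-aware \w, so \b works here.
unicode_bounded = matches(r"\b" + WORD + r"\b", WORD)

print("plain word, ASCII \\w:    ", ascii_plain)
print("\\bword\\b, ASCII \\w:     ", ascii_bounded)
print("\\bNAME\\b, ASCII \\w:     ", ascii_name)
print("\\bword\\b, Unicode \\w:   ", unicode_bounded)
```

With an ASCII-only \w, every character of the word counts as \W, so there is no \w/\W transition at either end of it and \b cannot match there. That is the same reasoning that makes the second pattern in the report fail.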

Philip

--
Philip Hazel