Re: [pcre-dev] \b bug with extended Unicode characters?

Author: Philip Hazel
Date:
To: Ralf Junker
CC: pcre-dev
Old topics: [pcre-dev] \b bug with extended Unicode characters?
Subject: Re: [pcre-dev] \b bug with extended Unicode characters?

On Wed, 21 Jan 2009, Ralf Junker wrote:

> it appears the word boundary anchor fails to work when it bounds a
> word using extended Unicode characters (PCRE 7.8, UTF-8 enabled):
>
> ŐŰőű -> Matches
> \bŐŰőű\b -> Fails
> \bNAME\b -> Matches
>
> Can anybody confirm this?

Did this ever get answered? The answer is that it is a limitation of
PCRE. I have upgraded the documentation about \b to make it even
clearer. It now says this:

In UTF-8 mode, characters with values greater than 128 never match
\d, \s, or \w, and always match \D, \S, and \W. This is true
even when Unicode character property support is available. These
sequences retain their original meanings from before UTF-8 support was
available, mainly for efficiency reasons. Note that this also affects
\b, because it is defined in terms of \w and \W.
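
The effect can be reproduced outside PCRE. The sketch below is only an analogy, not PCRE itself: it uses Python's standard `re` module, whose `re.ASCII` flag restricts \w (and therefore \b) to ASCII characters, which is roughly the behaviour the paragraph above describes. The test word ŐŰőű and the helper name `matches` are assumptions chosen for illustration.

```python
import re

# Illustrative test word made of non-ASCII letters (an assumption, not
# necessarily the exact string from the original report).
WORD = "ŐŰőű"


def matches(pattern: str, text: str, ascii_only: bool) -> bool:
    """Return True if pattern is found in text.

    With ascii_only=True, \\w, \\W and \\b only treat ASCII letters,
    digits and underscore as word characters, similar to the PCRE
    limitation described above.
    """
    flags = re.ASCII if ascii_only else 0
    return re.search(pattern, text, flags) is not None


if __name__ == "__main__":
    bounded = r"\b" + WORD + r"\b"
    # The literal matches: no word-boundary test is involved.
    print(matches(WORD, WORD, ascii_only=True))
    # Fails: every character counts as \W, so \b finds no boundary.
    print(matches(bounded, WORD, ascii_only=True))
    # ASCII words are unaffected.
    print(matches(r"\bNAME\b", "NAME", ascii_only=True))
    # With Unicode-aware \w the boundary is found.
    print(matches(bounded, WORD, ascii_only=False))
```

The second call fails because \b needs a \w character on one side and a \W character (or the string edge) on the other; when the whole word is classed as \W, no such position exists.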

Philip

--
Philip Hazel
