Philip Hazel wrote:
>On Wed, 28 Mar 2007, Ralf Junker wrote:
>
>> Given this pcretest syntax:
>>
>> /\w+(?<!\z)/g
>> abcde
>> one two three
>>
>> The very last character of a subject never matches. I was expecting
>> that it would: The assertion at the end of the subject should find
>> that the last character is not the end of the subject.
>
>> Is the observed behaviour correct (7.0 as well as 7.1 RC 3)?
>
>The same behaviour is observed with Perl. (That's always the first test
>I do when people ask this kind of question.)
>
>It is also the correct behaviour. It swallows abcde, then does a
>backward assertion (lookbehind). In this case, the amount to move back
>is zero characters, as there are no characters to match in the
>assertion. So it is still at the end of the string. The assertion is
>that it should not be at the end of the string. So the assertion fails.
>So it backs up one of the \w characters.
Even though my initial feeling told me otherwise, your explanation makes complete sense. Many thanks!
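For the record, the same effect can be reproduced outside pcretest. This is only a minimal sketch in Python, on the assumption that its re module handles a zero-width lookbehind the same way in this case; \Z is used as Python's end-of-subject anchor in place of \z:

```python
import re

# \w+ followed by a negative lookbehind on end-of-subject.
# \Z stands in for PCRE's \z here (an assumption of this sketch).
pattern = re.compile(r"\w+(?<!\Z)")

for subject in ("abcde", "one two three"):
    # findall() plays the role of the /g modifier in pcretest.
    print(subject, "->", pattern.findall(subject))
```

This should print ['abcd'] for the first subject and ['one', 'two', 'thre'] for the second: the last character is given back each time, exactly as you describe.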
>What were you actually trying to match?
When implementing "whole word only" regex searches in UTF-8, I tried to work around the problem with the \b assertion which only works with characters defined in pcre_chartables.c. Word boundaries for Greek / Cyrillic characters are not detected by this:
\b(YOUR_PATTERN)\b
To solve this, I tried this instead:
(?<=\pZ|\t|\n|\x0b|\f|\r|\x85|\A)(YOUR_PATTERN)(?=\pZ|\n|\t|\x0b|\f|\r|\x85|\z)
and found that it includes whitespace if YOUR_PATTERN is something like
.*
So I changed it to apply the surrounding assertions twice: as a positive lookbehind plus a negative lookahead before the pattern, and as a negative lookbehind plus a positive lookahead after it. This led to the observation reported above.
To "correct" it, I now use this one, which works just fine:
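(?<=\pZ|\t|\n|\x0b|\f|\r|\x85|\A)(?!\pZ|\t|\n|\x0b|\f|\r|\x85)(YOUR_PATTERN)(?<!\pZ|\n|\t|\x0b|\f|\r|\x85)(?=\pZ|\n|\t|\x0b|\f|\r|\x85|\z)
Notice that the \A is missing from the 2nd assertion and the \z from the 2nd last one. Both are zero-width, so keeping them would make the negative assertions fail at the very start or end of the subject, as you explained.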
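The double-assertion idea can also be tried out in Python. This is only a rough sketch and not the PCRE pattern itself: Python's re has no \pZ and insists on fixed-width lookbehind, so \s and \S stand in for the whitespace alternation and the "whitespace or \A / \z" tests are written as negated lookarounds. The helper name whole_word is made up for the illustration:

```python
import re

def whole_word(pattern):
    """Hypothetical helper: wrap `pattern` so that it only matches between
    whitespace (or the subject boundaries) and cannot itself start or end
    on whitespace. \\s / \\S approximate the \\pZ|\\t|\\n|... alternation."""
    return re.compile(
        r"(?<!\S)"   # preceded by whitespace or the start of the subject
        r"(?=\S)"    # first matched character is not whitespace
        r"(?:" + pattern + r")"
        r"(?<=\S)"   # last matched character is not whitespace
        r"(?!\S)"    # followed by whitespace or the end of the subject
    )

# Cyrillic word surrounded by spaces: found as a whole word.
print(whole_word("слово").search("это слово тут"))
# Same letters glued to another word: no match.
print(whole_word("слово").search("этослово"))
# A greedy .* no longer drags in the surrounding whitespace.
print(whole_word(".*").search("  one two  "))
```

With these subjects the first search should match "слово", the second should return None, and the third should match "one two" without the leading and trailing blanks.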
I am sorry if this was longer than you might have expected, but you were asking for what I am actually trying to match ;-)
Ralf