Re: [pcre-dev] Backreference Question

Author: Ralf Junker
Date:
To: pcre-dev
Subject: Re: [pcre-dev] Backreference Question
Philip Hazel wrote:

>On Wed, 28 Mar 2007, Ralf Junker wrote:
>
>> Given this pcretest syntax:
>>
>> /\w+(?<!\z)/g
>>     abcde
>>     one two three
>>
>> The very last character of a subject never matches. I was expecting
>> that it would: The assertion at the end of the subject should find
>> that the last character is not the end of the subject.
>
>> Is the observed behaviour correct (7.0 as well as 7.1 RC 3)?
>
>The same behaviour is observed with Perl. (That's always the first test
>I do when people ask this kind of question.)
>
>It is also the correct behaviour. It swallows abcde, then does a
>backward assertion (lookbehind). In this case, the amount to move back
>is zero characters, as there are no characters to match in the
>assertion. So it is still at the end of the string. The assertion is
>that it should not be at the end of the string. So the assertion fails.
>So it backs up one of the \w characters.


Even though my initial feeling told me otherwise, your explanation makes complete sense. Many thanks!
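
The behaviour Philip describes can be reproduced outside pcretest. The following is only a sketch in Python's standard `re` module, not PCRE itself; it assumes that Python's `\Z` (end of string only) is an acceptable stand-in for PCRE's `\z`, and the helper name `matches` is made up for illustration. Because the lookbehind contains nothing that consumes characters, it is tested at the current position, so it fails at the very end of the subject and `\w+` has to give back one character.

```python
import re

# Assumption: Python's \Z matches only at the very end of the string,
# which is what PCRE's \z does. The lookbehind body is zero-width, so the
# engine steps back zero characters before testing it.
PATTERN = re.compile(r"\w+(?<!\Z)")


def matches(subject):
    """Return all non-overlapping matches, similar to pcretest's /g modifier."""
    return PATTERN.findall(subject)


if __name__ == "__main__":
    for subject in ("abcde", "one two three"):
        print(repr(subject), "->", matches(subject))
```

Running it on the two subject lines from the original question shows that only the final word loses its last character; words followed by more text match in full.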

>What were you actually trying to match?


When implementing "whole word only" regex searches in UTF-8, I was trying to work around a limitation of the \b assertion, which only recognizes the word characters defined in pcre_chartables.c. As a result, word boundaries next to Greek or Cyrillic characters are not detected by this:

\b(YOUR_PATTERN)\b

To solve this, I tried this instead:

(?<=\pZ|\t|\n|\x0b|\f|\r|\x85|\A)(YOUR_PATTERN)(?=\pZ|\n|\t|\x0b|\f|\r|\x85|\z)

and found that it includes whitespace if YOUR_PATTERN is something like

.*

So I changed it to apply each surrounding assertion twice, once as a positive and once as a negative lookahead or lookbehind. This led to the observation reported above.

To "correct" it, I now use this one, which works just fine:

(?<=\pZ|\t|\n|\x0b|\f|\r|\x85|\A)(?!\pZ|\t|\n|\x0b|\f|\r|\x85)(YOUR_PATTERN)(?<!\pZ|\n|\t|\x0b|\f|\r|\x85)(?=\pZ|\n|\t|\x0b|\f|\r|\x85|\z)

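Notice that \z is missing from the second-to-last assertion: a negative lookbehind containing \z would fail at the very end of the subject, as explained above.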
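
For readers who want to experiment with the four-assertion structure, here is a rough Python sketch of the same idea. It is not the PCRE pattern itself and rests on several assumptions: Python's `re` has no `\pZ`, so `\s` is used as a stand-in for the whole whitespace alternation; `re` only accepts fixed-width lookbehinds, so the leading `(?<=...|\A)` is split into `(?:\A|(?<=\s))`; `\Z` replaces `\z`; and the second group is written as a plain negative lookahead `(?!...)`. The function names `whole_word_regex` and `find_whole` are hypothetical.

```python
import re


def whole_word_regex(inner):
    """Wrap `inner` so that it only matches when delimited by whitespace
    or by the subject boundaries, without starting or ending on whitespace.

    Assumption: \\s stands in for the \\pZ|\\t|\\n|\\x0b|\\f|\\r|\\x85 class.
    """
    return re.compile(
        r"(?:\A|(?<=\s))"   # preceded by whitespace or the subject start
        r"(?!\s)"           # the match itself must not start with whitespace
        r"(" + inner + r")"
        r"(?<!\s)"          # the match must not end with whitespace (no \Z here)
        r"(?=\s|\Z)"        # followed by whitespace or the subject end
    )


def find_whole(inner, subject):
    """Return the first whole-word match of `inner` in `subject`, or None."""
    m = whole_word_regex(inner).search(subject)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(find_whole("бета", "альфа бета гамма"))  # Cyrillic word between spaces
    print(find_whole("бета", "альфабета"))         # embedded in a longer word
    print(find_whole(".*", "  foo bar  "))         # greedy pattern, padded subject
```

The two inner assertions are what keep a greedy YOUR_PATTERN such as `.*` from swallowing the surrounding whitespace, while the two outer ones enforce the boundary.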

I am sorry if this was longer than you might have expected, but you were asking for what I am actually trying to match ;-)

Ralf