Re: [pcre-dev] Backtracking atomic blocks

Top Page
Delete this message
Author: Philip Hazel
Date:  
To: Herczeg Zoltán
CC: pcre-dev
Subject: Re: [pcre-dev] Backtracking atomic blocks
On Sun, 10 Apr 2011, Herczeg Zoltán wrote:

> the current backtracking behaviour of PCRE for atomic blocks is not
> exactly logical to me, so I tried to compare it with PERL (v5.12.1). I
> don't know PERL (yet), this is actually my first PERL script:
>
> my $str = "aaaabaaabaabab";
> @myarray = ($str =~ m/((?>(a+)b)+aabab)/);
> print join(",", @myarray), "\n";
>
> Its result is: aaaabaaabaabab,aaa
>
> I don't know how to extract the starting/ending positions, that is why
> I choose this tricky example, since I can see that the result of the
> internal capturing bracket is the second block. Even I don' understand
> why the array does not contain the full match...


The result of assigning a pattern match to an array in Perl is a list of
the captured substrings (only). It does not contain the full match. That
is why your array has only two elements set.

> With PCRE, the result is the following: 0, 14, 0, 14, 12, 13 - the
> internal capturing bracket holds the last 'a'. Am I interpret right
> that this is a different behaviour?


There seems to be a bug in PCRE. Using pcretest (and getting rid of your
outermost parentheses, which just add complication), I see this:

/(?>(a+)b)+aabab/
aaaabaaabaabab
0: aaaabaaabaabab
1: a

Running the same thing through my Perl test script, I get this:

/(?>(a+)b)+aabab/
aaaabaaabaabab
0: aaaabaaabaabab
1: aaa

That is the same as you get. I think what is happening is that PCRE is
matching the atomic group four times, thereby setting the inner
parentheses to "a", but then when it has to backtrack, it does not reset
the value of the parentheses. However, I haven't actually looked at the
code to confirm this.

I have added this issue to my list of things to look at, which is now
getting quite long. I am planning on restarting work on PCRE some time
before the end of this month. It will probably take me several weeks to
work through all the issues at my slow, retired rate of progress. :-(
I will then put out an 8.13 release candidate followed by a proper
release. After that, I want to start straight away on integrating your
JIT code before anything else crops up.

Philip

--
Philip Hazel