Re: [pcre-dev] [Bug 1303] New: Recursion and greedy matching

Top Page
Delete this message
Author: Philip Hazel
Date:  
To: 1303
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1303] New: Recursion and greedy matching
On Thu, 4 Oct 2012, Jille Timmermans wrote:

> When I match this regex, I would expect the second } to be included in 0.
>
> re> /{[A-Z]+(\|param:[^|}]*(?R)?[^|}]*)*}/
> data> A{X|param:{Y}}B
> 0: {X|param:{Y}
> 1: |param:{Y


I am afraid you are wrong. I have now analysed this example. This is
what happens:

. It scans along the subject till it finds the first {
. [A-Z]+ matches X
. \|param: matches literally
. [^|}]* matches {Y, leaving }}B
. We then read (?R), which recurses
. This fails because the next character is not {
. However, since it is optional, we can carry on
. [^|}]* matches zero characters
. The final } matches the first } in }}B

> If I make de (?R) mandatory, it does work as expected:


That's because when it fails the first time, there is a backup into
[^|}]* which causes it to try again and again, until the recursion
matches.

In cases like this, it is instructive to use the /C option in pcretest
to see how it is matching:

/{[A-Z]+(\|param:[^|}]*(?R)?[^|}]*)*}/C
    A{X|param:{Y}}B
--->A{X|param:{Y}}B
 +0  ^                  {
 +1  ^^                 [A-Z]+
 +7  ^ ^                (\|param:[^|}]*(?R)?[^|}]*)*
 +8  ^ ^                \|
+10  ^  ^               p
+11  ^   ^              a
+12  ^    ^             r
+13  ^     ^            a
+14  ^      ^           m
+15  ^       ^          :
+16  ^        ^         [^|}]*
+22  ^          ^       (?R)?
 +0  ^          ^       {
+27  ^          ^       [^|}]*
+33  ^          ^       )
 +8  ^          ^       \|
+35  ^          ^       }
+36  ^           ^      


This clearly shows that the first [^|{]* is matching {Y. If you do the
same without the ? you will see it backing up.

Philip

--
Philip Hazel