https://bugs.exim.org/show_bug.cgi?id=2309
Bug ID: 2309
Summary: Definition order of self-referencing capturing
duplicate named groups respective to their initial
initialization changes behavior
Product: PCRE
Version: 10.31 (PCRE2)
Hardware: x86
OS: All
Status: NEW
Severity: wishlist
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: bobwei9@???
CC: pcre-dev@???
Given "aa" as input, consider the following pattern, with the J option set.
\A(?<x>)(?<x>a\k<x>)\k<x>
It matches only the first "a" instead of "aa": the trailing \k<x> resolves to
the first definition of x (the empty string).
Interestingly, if the subpattern is defined before the (?<x>) that initializes
x to an empty string, the pattern returns the expected match of "aa":
\A(?!(?<q> # dummy assertion (which succeeds) to define our capturing group
           # before the (?<x>)
(?<x>a\k<x>)\k<x>
))(?<x>)(?&q)
This apparently is due to the following behavior as outlined in the man page:
If you make a back reference to a non-unique named subpattern from
elsewhere in the pattern, the subpatterns to which the name refers are
checked in the order in which they appear in the overall pattern. The
first one that is set is used for the reference. For example, this pattern
matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
(?:(?<n>foo)|(?<n>bar))\k<n>
Because (?<x>) is the first group named x that is set, every \k<x> in the
first pattern above, including the one inside (?<x>a\k<x>), always references
the empty (?<x>). This also explains the second pattern above: on its first
execution, the self-referencing capturing group looks at itself first, notices
that it is not set yet, and then finds the capturing group which has been
initialized to an empty string. Once the self-referencing group is set, it
comes first in pattern order, so the trailing \k<x> resolves to "a".
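
To make the lookup rule concrete, here is a minimal Python sketch that
hand-unrolls the first pattern. It is only a model of the rule quoted from the
man page, not PCRE2 code, and the names resolve_first_set and
match_first_pattern are made up for illustration. No backtracking is modelled,
since this pattern does not need any.

```python
from typing import List, Optional


def resolve_first_set(captures: List[Optional[str]]) -> Optional[str]:
    # Scan the same-named groups in pattern order and return the first
    # one that is set (None means "not set yet").
    for value in captures:
        if value is not None:
            return value
    return None


def match_first_pattern(subject: str) -> Optional[str]:
    # Hand-unrolled model of: \A(?<x>)(?<x>a\k<x>)\k<x>
    captures: List[Optional[str]] = [None, None]
    pos = 0

    # First (?<x>): always matches the empty string.
    captures[0] = ""

    # Second (?<x>a\k<x>): literal "a", then a back reference to x.
    if not subject.startswith("a", pos):
        return None
    inner = pos + 1
    ref = resolve_first_set(captures)
    if ref is None or not subject.startswith(ref, inner):
        return None
    inner += len(ref)
    captures[1] = subject[pos:inner]
    pos = inner

    # Trailing \k<x>: the first set group is still the empty one.
    ref = resolve_first_set(captures)
    if ref is None or not subject.startswith(ref, pos):
        return None
    pos += len(ref)
    return subject[:pos]


print(match_first_pattern("aa"))
```

In this model both back references resolve to the empty first group, so only
the leading "a" is consumed, which is the behavior reported above.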
Now, I'd like to request a behavior change here to make it useful in general:
*Back references should always refer to the most recently set capturing group.*
The reasoning is:
- the current behavior is not intuitive (the first defined group that is set?!
Before reading this up, I was pretty sure I had found a bug...)
- it is impossible to use duplicate named capturing groups in repeating
patterns; e.g. repeating the example from the documentation
(?:(?:(?<n>foo)|(?<n>bar))\k<n>){2} - this matches:
* foofoofoofoo
* barbarbarbar
* barbarfoofoo
but not: foofoobarbar
- the behavior is also different from branch reset groups, where [referring to
the previous example] (?:(?|(foo)|(bar))\1){2} just works fine. And I expected
named duplicate subpatterns to be a simple alternative for these - and the
documentation also makes it sound that way.
- the BC break would be very, very minor; I cannot think of a reasonable use
case for the current behavior which would break if we switched to the most
recently set capturing group.
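
The difference between the two rules on the repeated example can be sketched
in Python as follows. Again this is only a model under stated assumptions, not
PCRE2 code: the helper names are hypothetical, each capture carries a counter
recording when it was set, and the model checks a whole-string match rather
than an unanchored search.

```python
from typing import Callable, List, Optional, Tuple

# A capture is None (unset) or (value, stamp); a larger stamp means the
# group was set more recently.
Capture = Optional[Tuple[str, int]]
Resolver = Callable[[List[Capture]], Optional[str]]


def first_set(caps: List[Capture]) -> Optional[str]:
    # Current rule: first set group in pattern order.
    for cap in caps:
        if cap is not None:
            return cap[0]
    return None


def most_recent(caps: List[Capture]) -> Optional[str]:
    # Requested rule: the group that was set last.
    set_caps = [cap for cap in caps if cap is not None]
    if not set_caps:
        return None
    return max(set_caps, key=lambda cap: cap[1])[0]


def matches_repeated(subject: str, resolve: Resolver) -> bool:
    # Model of (?:(?:(?<n>foo)|(?<n>bar))\k<n>){2} against the whole
    # subject. Captures persist from one iteration to the next.
    caps: List[Capture] = [None, None]
    pos = 0
    clock = 0
    for _ in range(2):
        for index, word in enumerate(("foo", "bar")):
            if subject.startswith(word, pos):
                clock += 1
                caps[index] = (word, clock)
                pos += len(word)
                break
        else:
            return False  # neither alternative matched
        ref = resolve(caps)
        if ref is None or not subject.startswith(ref, pos):
            return False
        pos += len(ref)
    return pos == len(subject)


for s in ("foofoofoofoo", "barbarbarbar", "barbarfoofoo", "foofoobarbar"):
    print(s, matches_repeated(s, first_set), matches_repeated(s, most_recent))
```

With first_set the model rejects only "foofoobarbar", because the "foo"
captured in the first iteration still shadows the "bar" captured in the second;
with most_recent all four subjects are accepted.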