[pcre-dev] [Bug 2309] New: Definition order of self-referencing capturing duplicate named groups respective to their initial initialization changes behavior

Author: admin
Date:
To: pcre-dev

https://bugs.exim.org/show_bug.cgi?id=2309

            Bug ID: 2309
           Summary: Definition order of self-referencing capturing
                    duplicate named groups respective to their initial
                    initialization changes behavior
           Product: PCRE
           Version: 10.31 (PCRE2)
          Hardware: x86
                OS: All
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: bobwei9@???
                CC: pcre-dev@???

Given "aa" as input, consider the following pattern, compiled with the J option:

\A(?<x>)(?<x>a\k<x>)\k<x>

It matches only the first "a" instead of "aa": the final \k<x> resolves to the
first definition of x (the empty string).

Interestingly, if the subpattern is defined before the (?<x>) that initializes
x to the empty string, the pattern returns the expected match of "aa":

\A(?!(?<q> # dummy assertion (which succeeds) to define our capturing group before the (?<x>)
(?<x>a\k<x>)\k<x>
))(?<x>)(?&q)

This is apparently due to the following behavior, as described in the man page:

       If  you  make  a  back  reference to a non-unique named subpattern from
       elsewhere in the pattern, the subpatterns to which the name refers  are
       checked  in  the order in which they appear in the overall pattern. The
       first one that is set is used for the reference. For example, this pat-
       tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":

         (?:(?<n>foo)|(?<n>bar))\k<n>

Because (?<x>) comes first in the pattern and is already set, the \k<x> inside
(?<x>a\k<x>) in the first pattern always refers to the empty (?<x>), and so
does the trailing \k<x>. The same rule explains the second pattern: on its
first execution, the self-referencing capturing group looks at itself first,
finds that it is not set yet, and falls through to the capturing group that has
been initialized to the empty string. Once it has captured "a", it is the first
set group in pattern order, so the trailing \k<x> matches "a" as well.
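
The lookup rule quoted from the man page can be modeled in a few lines of
Python. This is only a sketch of the documented rule, not PCRE2 code:
resolve_first_set and the list x are hypothetical names, and the matching steps
for the first pattern are written out by hand rather than produced by a regex
engine.

```python
def resolve_first_set(groups):
    """Return the value of the first set group, scanning in pattern order.

    `groups` holds the captures of all groups sharing one name, in the order
    they appear in the pattern; None means "not set".
    """
    for value in groups:
        if value is not None:
            return value
    return None  # in this model, a reference to an unset group fails


# Hand-traced model of \A(?<x>)(?<x>a\k<x>)\k<x> against "aa".
subject = "aa"
x = [None, None]              # the two groups named x, in pattern order

x[0] = ""                     # (?<x>) captures the empty string
inner = resolve_first_set(x)  # \k<x> inside the second group sees x[0]
x[1] = "a" + inner            # second group captures "a" plus the empty string
outer = resolve_first_set(x)  # trailing \k<x> still sees x[0], not x[1]

matched = x[0] + x[1] + outer
assert subject.startswith(matched)
print(repr(matched))
```

In this model the trailing reference never reaches x[1], because x[0] is set
and comes earlier in pattern order, which is why only the first "a" is matched.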

Now, I'd like to request a behavior change here to make it useful in general:

*Back references should always refer to the most recently set capturing group.*

The reasoning is:
- The current behavior is not intuitive (the first defined group that is set?!
Before reading up on this, I was pretty sure I had found a bug...).
- It makes duplicate named capturing groups impossible to use in repeated
patterns. For instance, repeating the example from the documentation as
(?:(?:(?<n>foo)|(?<n>bar))\k<n>){2} matches:
* foofoofoofoo
* barbarbarbar
* barbarfoofoo
but not: foofoobarbar
- The behavior also differs from branch reset groups, where [referring to the
previous example] (?:(?|(foo)|(bar))\1){2} works just fine. I expected
duplicate named subpatterns to be a simple alternative to these, and the
documentation also makes it sound that way.
- It would be a very, very minor BC break; I cannot think of a reasonable use
case for the current behavior that would break if we switched to the most
recently set capturing group.
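
To make the difference between the two rules concrete, here is a small Python
model of the repeated pattern under both lookup rules. It is a sketch under
stated assumptions, not PCRE2 code: first_set, most_recent and matches_repeated
are hypothetical names, the model anchors the match at both ends of the
subject, and it assumes that captures persist from one iteration of the
repeated group to the next.

```python
def first_set(groups, last_index):
    """Current rule: the first set group in pattern order wins."""
    for value in groups:
        if value is not None:
            return value
    return None


def most_recent(groups, last_index):
    """Requested rule: the most recently set group wins."""
    return None if last_index is None else groups[last_index]


def matches_repeated(subject, resolve):
    # Models (?:(?:(?<n>foo)|(?<n>bar))\k<n>){2} over the whole subject.
    groups = [None, None]  # captures of the two groups named n
    last_index = None      # index of the group that captured most recently
    pos = 0
    for _ in range(2):
        # The alternation: try (?<n>foo), then (?<n>bar).
        for index, literal in enumerate(("foo", "bar")):
            if subject.startswith(literal, pos):
                groups[index] = literal
                last_index = index
                pos += len(literal)
                break
        else:
            return False
        # The back reference \k<n>, resolved by the chosen rule.
        ref = resolve(groups, last_index)
        if ref is None or not subject.startswith(ref, pos):
            return False
        pos += len(ref)
    return pos == len(subject)


for s in ("foofoofoofoo", "barbarbarbar", "barbarfoofoo", "foofoobarbar"):
    print(s, matches_repeated(s, first_set), matches_repeated(s, most_recent))
```

Under first_set the model rejects only "foofoobarbar", because the "foo"
captured in the first iteration still shadows the "bar" captured in the second.
Under most_recent all four subjects are accepted.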

