Author: ph10
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Incompatibility with different names for subpatterns of the same number
On Sun, 22 Jul 2018, ND via Pcre-dev wrote:

> Why can't PCRE2_INFO_NAMETABLE entries have the same number? I see no
> drawbacks to this.


Consider /(?| (?<A>foo) | (?<B>bar) )/x

The table will tell you "group 1 is called A" and "group 1 is called B".
What happens if you match the pattern with "foo" and then ask "what is
the value of group B?". The table will tell you that group B is group 1,
and group 1 has matched "foo". But this is not right.

At least, I don't think it's right! I would expect a group identified as
B to be "unset".
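
To make the aliasing concrete, here is a minimal sketch in Python. It is a toy
model, not PCRE2's actual name table or API: the `name_table` dict and the
`group_by_name` helper are hypothetical, and because Python's `re` module has
no branch reset group, the single numbered group in `(foo|bar)` stands in for
the shared group 1.

```python
import re

# Hypothetical name table for /(?| (?<A>foo) | (?<B>bar) )/x:
# both names are aliases for the same group number.
name_table = {"A": 1, "B": 1}

def match_toy_pattern(subject):
    """Return {group_number: text} or None.

    Stand-in for the branch reset pattern: either alternative
    fills the one shared group, number 1.
    """
    m = re.search(r"(foo|bar)", subject)
    return {1: m.group(1)} if m else None

def group_by_name(captures, name):
    """Resolve a name to its number, then return that group's text."""
    return captures.get(name_table[name])

captures = match_toy_pattern("foo")
value_a = group_by_name(captures, "A")
value_b = group_by_name(captures, "B")
print("A =", value_a, "B =", value_b)
```

Because the lookup goes name -> number -> text, nothing records which
alternative actually matched, so B cannot come back as "unset".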

BUT.....

I had never experimented with Perl on this, but I have now done so, and
Perl (v5.26.2) behaves exactly as I have described, reporting "foo" for
group B:

$ perl -e 'if ("foo" =~ /(?| (?<A>foo) | (?<B>bar) )/x) { print "yes >$& A=$+{A} B=$+{B} 1=$1<\n"; } else { print "no \n"; }'
yes >foo A=foo B=foo 1=foo<

Oh dear. Looks like Perl implemented named groups in a similar way to
PCRE, but allowed different names for the same number. I think it's very
confusing.

AHA! I have found this in the Perl documentation:

Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same
names, in the same order, in each of the alternations:

     /(?|  (?<a> x ) (?<b> y )
        |  (?<a> z ) (?<b> w )) /x


Not doing so may lead to surprises:

        "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
        say $+{a};    # Prints '12'
        say $+{b};    # *Also* prints '12'.


The problem here is that both the group named "a" and the group
named "b" are aliases for the group belonging to $1.

I think it is better to avoid the "surprises" by not allowing different
named aliases for the same number.
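
The restriction can be sketched as a check made while the names are being
collected. This is only an illustration in Python under assumed names
(`register_name`, `build_table`, and a number -> name dict are all
hypothetical); it is not how PCRE2 itself is written.

```python
def register_name(names_by_number, number, name):
    """Record a name for a group number.

    Repeating the same name for the same number is fine; a second,
    different name for that number is rejected.
    """
    existing = names_by_number.get(number)
    if existing is not None and existing != name:
        raise ValueError(
            f"group {number} is already named {existing!r}; "
            f"cannot also name it {name!r}"
        )
    names_by_number[number] = name

def build_table(named_groups):
    """Build a number -> name table from (number, name) pairs."""
    table = {}
    for number, name in named_groups:
        register_name(table, number, name)
    return table

# Same names, in the same order, in each alternative: accepted.
ok_table = build_table([(1, "a"), (2, "b"), (1, "a"), (2, "b")])

# Names A and B both attached to group 1: rejected.
try:
    build_table([(1, "A"), (1, "B")])
    rejected = False
except ValueError:
    rejected = True

print(ok_table, rejected)
```

With this rule, the layout the Perl documentation recommends still works, and
the "surprising" pattern is refused up front instead of returning a misleading
value later.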

Philip

--
Philip Hazel