Author: Ahmad Amireh
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Capturing duplicate named subpatterns in conditional branches always refers to the last

On 07/13/2012 12:33 PM, Philip Hazel wrote:
> On Thu, 12 Jul 2012, Ahmad Amireh wrote:
>
>> I'm having trouble figuring out how to capture duplicate named subpatterns (as
>> allowed by the PCRE_DUPNAMES option). My initial understanding from reading
>> the manual was that I could refer to a subpattern(s) capture using a name
>> instead of the capture order number. While that is true, it seems the named
>> subpattern capture always refers to the _last_ branch in which it was
>> _defined_ and not necessarily matched. The example in the manual is almost
>> exactly like what I'm trying to do so I'll just use it to explain:
>>
>> (?J)(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?
> This example works for me when I test it using the pcretest program:
>
> PCRE version 8.31 2012-07-06
>
> /(?J)(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?/
>    Monday\CDN
>   0: Monday
>   1: Mon
>    C Mon (3) DN
>    Tuesday\CDN
>   0: Tuesday
>   1: <unset>
>   2: Tue
>    C Tue (3) DN
>    Wednesday\CDN
>   0: Wednesday
>   1: <unset>
>   2: <unset>
>   3: Wed
>    C Wed (3) DN
>    Thursday\CDN
>   0: Thursday
>   1: <unset>
>   2: <unset>
>   3: <unset>
>   4: Thu
>    C Thu (3) DN
>    Friday\CDN
>   0: Friday
>   1: Fri
>    C Fri (3) DN
>    Saturday\CDN
>   0: Saturday
>   1: <unset>
>   2: <unset>
>   3: <unset>
>   4: <unset>
>   5: Sat
>    C Sat (3) DN
>
> The \CDN option on the data lines means "use pcre_copy_named_substring
> to collect the value of substring DN after the match". The same test
> works with \GDN (using pcre_get_named_substring).
>
> How are you trying to extract the named substring? The two functions
> mentioned above return the first substring with the given name that is
> actually set.
>
> Philip
>

Confirmed. It works as expected in pcretest -- PCRE version 8.30 2012-02-04.
> The \CDN option on the data lines means "use pcre_copy_named_substring
> to collect the value of substring DN after the match". The same test
> works with \GDN (using pcre_get_named_substring).
>
> How are you trying to extract the named substring? The two functions
> mentioned above return the first substring with the given name that is
> actually set.

I did not know about those functions; that's my bad. I'm using the Lua
library lrexlib-pcre <http://rrthomas.github.com/lrexlib/manual.htm>,
but unfortunately I don't see the named-substring extraction API exposed
to Lua.

Thank you for your help. I appreciate the pointers (I didn't even know
about pcretest, actually). I will attempt to expose this functionality
in the Lua library and contact its author accordingly.

For those interested, here's a small test that shows the current
behaviour of lrexlib-pcre:

     #!/usr/bin/env lua

     -- Available as a rock named lrexlib-pcre.
     local rex_pcre = require 'rex_pcre'

     local ptrn = [[(?J)(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?]]
     local regex = rex_pcre.new(ptrn)

     if not regex then
       return print("Invalid PCRE regex '" .. ptrn .. "'")
     end

     -- Print every named capture that exec() reports for the subject.
     -- Returns itself so that calls can be chained.
     function test(subject)
       local _,__,captures = regex:exec(subject)
       print("Captures from '" .. subject .. "':")
       for k,v in pairs(captures or {}) do
         if type(k) ~= "number" and v then print("  " .. k .. " => " .. v) end
       end

       return test
     end

     test("Sunday")("Saturday")
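
Until the extraction functions are exposed, the rule Philip describes
(return the first substring with the given name that is actually set) can
be sketched in plain Lua. This is only an illustration and not part of
lrexlib's API: first_set and DN_GROUPS are hypothetical names, the list of
group numbers is written by hand from the pattern above, and the capture
tables are hand-made stand-ins that assume unset groups show up as false
or nil.

```lua
-- Sketch only: emulates "first substring with the given name that is set"
-- on plain tables of numbered captures. No lrexlib call is made here.

-- Group numbers that carry the name DN in the pattern, left to right
-- (hand-written assumption, not queried from the compiled regex).
local DN_GROUPS = { 1, 2, 3, 4, 5 }

-- captures: array of numbered captures; unset groups are false or nil.
-- groups:   list of group numbers that share one name.
-- Returns the first set capture and its group number, or nil if none is set.
local function first_set(captures, groups)
  for _, n in ipairs(groups) do
    local v = captures[n]
    if v then return v, n end
  end
  return nil
end

-- Hand-made stand-in for "Saturday": groups 1-4 unset, group 5 = "Sat"
-- (matches the pcretest output quoted above).
local saturday = { false, false, false, false, "Sat" }

-- Hand-made stand-in for "Sunday": group 1 = "Sun", the rest unset.
local sunday = { "Sun", false, false, false, false }

print(first_set(saturday, DN_GROUPS))
print(first_set(sunday, DN_GROUPS))
```

With real data, the capture table would come from the regex library
rather than being typed in, and the group numbers for a name would
ideally be looked up from the compiled pattern instead of hard-coded.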


Ahmad