Re: [pcre-dev] Oniguruma subroutines

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Oniguruma subroutines
Philip Hazel wrote:
> On Thu, 22 May 2008, Sheri wrote:
>
>
>> I was trying to make it capturing, so your second suggestion answers my
>> question. I didn't realize that a negative number could end up referring
>> to the current subpattern if it is capturing. I suppose it is necessary
>> that the 1st subpattern itself be capturing in order to give it the
>> capacity to serve as a subroutine.
>>
>
> Yes. Non-capturing subpatterns do not get allocated numbers (and cannot
> be named); consequently, there is no way to refer to them.
>
>
>> Suppose one's goal is to capture each repeat of (abc) and it is unknown
>> how many there will be. Is this about the best that can be done?:
>>
>> 1. Choose the maximum number of repeats that will be supported
>> 2. Construct the repeating pattern
>> 3. Add supported repeats as (\g<1>)?
>> Example (abc)(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?
>>
>
> Without using callouts, yes. Performance will be more efficient (but the
> pattern will be less readable) if written as
>
> (abc)(\g1(\g1(\g1)?)?)?...etc
>
> although of course the captured contents will be different in detail,
> and I guess could be ambiguous, depending on the pattern. With this
> scheme, it only tries for a second \g1 if it has found the first. With
> your original scheme, it tries again and again. Ah, I guess the thing to
> do is to use
>
> (abc)(?:(\g1)(?:(\g1)(?:(\g1))?)?)?...etc
>
> Then you get the same captured values but also the added efficiency, at
> the price of readability.
>
> If you are prepared to program for callouts, you can catch as many as
> you like, by using a repeat with a callout at the end, but of course
> this is not feasible if you are in an environment that doesn't allow you
> to use callouts.
>
>
>> Is there a workable approach if the repeating pattern itself has desired
>> subpatterns? With the following I find the "b" captured only once:
>> /(a(b)c)(\g1)?/ with data abcabc
>>
>
> That's correct. Nobody has invented a scheme where the contents of
> capturing subpatterns are expressed as vectors. PCRE returns the value
> from the last time the parens captured something. So if you had
> /(a(b|d)c)(\g1)?/ with data abcadc you would get "d".
>

Hmm, not so, probably because it's a 1, not a -1. But we've also been
omitting the angle brackets or single quotes around the subroutine number,
and that makes for different results here:

re> /(a(b|d)c)(\g1)?/
data> abcadc

0: abc
1: abc
2: b

re> /(a(b|d)c)(\g'1')?/
data> abcadc

0: abcadc
1: abc
2: b
3: adc
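
For anyone wanting to reproduce the difference outside pcretest, here is
a rough Python sketch. It rests on two assumptions that are not stated in
this thread: that PCRE reads \g1 without delimiters as a back reference
(the same as \1), and that captures set inside a subroutine call are not
kept once the call returns. Python's standard re module has no subroutine
calls, so the second pattern spells the call out as a literal copy of
group 1 with its inner group made non-capturing. That is an analogy, not
PCRE itself.

```python
import re

data = "abcadc"

# Back reference: (\1)? can only match the text "abc" again, so it fails
# on "adc" and the optional group matches nothing.
backref = re.match(r"(a(b|d)c)(\1)?", data)
print(backref.group(0), backref.groups())   # abc ('abc', 'b', None)

# Stand-in for the subroutine call: a literal copy of group 1's pattern,
# with the inner group non-capturing so group 2 keeps its first value.
unrolled = re.match(r"(a(b|d)c)(a(?:b|d)c)?", data)
print(unrolled.group(0), unrolled.groups())  # abcadc ('abc', 'b', 'adc')

# Plain repetition, for contrast: group 2 is overwritten on each pass,
# so only the last value survives.
repeated = re.match(r"(a(b|d)c)+", data)
print(repeated.group(2))                     # d
```

The first two results line up with the two pcretest runs above; the third
shows the "last value wins" rule for an ordinary repeated group.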

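Separately, the nested form suggested in the quoted text gets tedious to
type for more than a few repeats, so it can be generated. This is only a
sketch: nested_repeats is a made-up helper name, and the \g<1> spelling
is my assumption about which delimiter form to use. Because Python's re
cannot run subroutine calls, the check at the end substitutes a literal
"abc" for the call.

```python
import re

def nested_repeats(first, call, extra):
    # Build `first` followed by `extra` nested optional groups, so that a
    # later repeat is only attempted once the previous one has matched.
    tail = ""
    for _ in range(extra):
        tail = "(?:(" + call + ")" + tail + ")?"
    return first + tail

# Pattern text intended for PCRE (not runnable in Python's re).
print(nested_repeats("(abc)", r"\g<1>", 3))
# (abc)(?:(\g<1>)(?:(\g<1>)(?:(\g<1>))?)?)?

# Same shape with a literal copy instead of the call, to check the captures.
m = re.fullmatch(nested_repeats("(abc)", "abc", 3), "abcabcabc")
print(m.groups())   # ('abc', 'abc', 'abc', None)
```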

Regards,
Sheri