Author: Philip Hazel
Date:
To: Sheri
CC: pcre-dev
Subject: Re: [pcre-dev] Oniguruma subroutines
On Thu, 22 May 2008, Sheri wrote:
> I was trying to make it capturing, so your second suggestion answers my
> question. I didn't realize that a negative number could end up referring
> to the current subpattern if it is capturing. I suppose it is necessary
> that the 1st subpattern itself be capturing in order to give it the
> capacity to serve as a subroutine.
Yes. Non-capturing subpatterns do not get allocated numbers (and cannot
be named); consequently, there is no way to refer to them.
> Suppose one's goal is to capture each repeat of (abc) and it is unknown
> how many there will be. Is this about the best that can be done?:
>
> 1. Choose the maximum number of repeats that will be supported
> 2. Construct the repeating pattern
> 3. Add supported repeats as (\g<1>)?
> Example (abc)(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?(\g1)?
Without using callouts, yes. Matching will be more efficient (but the
pattern will be less readable) if written as
(abc)(\g1(\g1(\g1)?)?)?...etc
although of course the captured contents will be different in detail,
and I guess could be ambiguous, depending on the pattern. With this
scheme, it only tries for a second \g1 if it has found the first. With
your original scheme, it tries again and again. Ah, I guess the thing to do
is to use
(abc)(?:(\g1)(?:(\g1)(?:(\g1))?)?)?...etc
Then you get the same captured values but also the added efficiency, at
the price of readability.
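As a rough illustration of the two layouts, here is a minimal Python sketch. Python's standard re module has no subroutine calls, so the sketch pastes the subpattern text inline wherever the patterns above use \g1; that substitution, and the helper names flat_pattern and nested_pattern, are assumptions made for this example only. It shows the shape of the two patterns and that they fill the same capture slots, not PCRE's own behaviour.

```python
import re


def flat_pattern(sub, max_repeats):
    """Build (sub)(sub)?(sub)?... : every optional group is tried in turn."""
    return f"({sub})" + f"({sub})?" * (max_repeats - 1)


def nested_pattern(sub, max_repeats):
    """Build (sub)(?:(sub)(?:(sub))?)?... : a later group is only tried
    once the one before it has matched."""
    tail = ""
    for _ in range(max_repeats - 1):
        # Wrap what has been built so far inside one more optional layer.
        tail = f"(?:({sub}){tail})?"
    return f"({sub})" + tail


subject = "abcabcabc"
flat_groups = re.match(flat_pattern("abc", 4), subject).groups()
nested_groups = re.match(nested_pattern("abc", 4), subject).groups()
print(flat_pattern("abc", 4))
print(nested_pattern("abc", 4))
print(flat_groups == nested_groups)
```

The non-capturing (?: ... ) wrappers carry the nesting, so the capturing groups keep the same numbers as in the flat form.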
If you are prepared to program for callouts, you can catch as many as
you like, by using a repeat with a callout at the end, but of course
this is not feasible if you are in an environment that doesn't allow you
to use callouts.
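Where callouts are not available, one workaround (not something proposed in this thread, just a sketch) is to do the collecting in the host language in two passes: first match the whole run of repeats, then walk over the run to pick out each repeat. The example below uses Python's standard re module rather than PCRE, and the function name capture_repeats is hypothetical. It assumes the subpattern splits the run the same way on the second pass as the repeat did on the first, which holds for a simple literal like abc but may not for an ambiguous subpattern.

```python
import re


def capture_repeats(sub, text):
    """Return each consecutive occurrence of `sub` at the start of `text`."""
    # Pass 1: find the whole run of repeats, however many there are.
    run = re.match(f"(?:{sub})+", text)
    if run is None:
        return []
    # Pass 2: walk the run and collect each individual repeat.
    return [m.group(0) for m in re.finditer(sub, run.group(0))]


print(capture_repeats("abc", "abcabcabcxyz"))
```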
> Is there a workable approach if the repeating pattern itself has desired
> subpatterns? With the following I find the "b" captured only once:
> /(a(b)c)(\g1)?/ with data abcabc
That's correct. Nobody has invented a scheme where the contents of
capturing subpatterns are expressed as vectors. PCRE returns the value
from the last time the parens captured something. So if you had
/(a(b|d)c)(\g1)?/ with data abcadc you would get "d".
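The "last time the parens captured something" rule can be seen in a small sketch. This one uses Python's standard re module and a + repeat in place of a PCRE subroutine call, so it is an analogy under that assumption rather than a test of PCRE itself; how captures set inside a subroutine call are treated should be checked against the documentation of the PCRE release in use.

```python
import re

# The group (b|d) takes part in two iterations of the repeat,
# matching "b" first and "d" second.
m = re.match(r"(a(b|d)c)+", "abcadc")

whole = m.group(0)        # everything the repeat consumed
last_outer = m.group(1)   # outer group: value from the final iteration
last_inner = m.group(2)   # inner group: value from the final iteration

print(whole, last_outer, last_inner)
```

Only one value per capturing group is kept; the "b" from the first iteration is overwritten.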