Re: [pcre-dev] PCRE2: Unnecessary substring number checks?

Top Page
Delete this message
Author: Ralf Junker
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] PCRE2: Unnecessary substring number checks?
On 10.12.2014 12:30, ph10@??? wrote:

>> In various parts, PCRE2 runs the following to check if a substring number if
>> valid:
>>
>>    if (stringnumber >= match_data->oveccount ||
>>        stringnumber > match_data->code->top_bracket ||
>>        match_data->ovector[stringnumber*2] == PCRE2_UNSET)

>>
>> I wonder if it is indeed necessary to compare with top_bracket. Rationale is
>> this comment in pcre2_match.c:
>>
>>    If there is space in the offset vector, set any unused pairs at the
>>    end to PCRE2_UNSET for backwards compatibility.

>>
>> Provided that the above holds true, should it not be sufficient to test for
>> PCRE2_UNSET?
>
> I don't think so, because by "unused pairs" it actually means "unused
> pairs up to the maximum capturing groups in the pattern". Consider the
> case when oveccount is large (say 100), and pattern contains, say, 5
> capturing parentheses. Suppose a match operation passes through only
> groups 1,2,3. In that case, the code that follows that comment will set
> the vector for groups 4 and 5 to PCRE2_UNSET. However, what if
> stringnumber is set to 20?
>
> Obviously, I should make that comment in pcre2_match.c a bit clearer.


My assumption was that stringnumber 6 to 20 would also be set to
PCRE2_UNSET. It was not clear to me from reading the comment that they
are left untouched.

If the behavior is like you described, the additional test for
top_bracket is of course necessary.

>> In addition, the code snippet makes pcre2_match_data depend on pcre2_code. If
>> pcre2_code is freed before pcre2_match_data, the outcome of the code snipped
>> is undetermined. I have searched the documentation, but have not found it
>> mentioning this.
>
> That is a good point. The documentation should tell you not to free the
> code while still making use of the match data.


+1

>> A related thought:
>>
>> The above code extract reappears identically multiple times in
>> pcre2_substring.c. Would it make sense to refactor substring validity checking
>> into its own, dedicated function?
>
> I would be more tempted to make it into a macro.


+1

>> IMO, this would also be a welcome addition to the public API.
>
> You mean something like:
>
> int pcre2_substring_isset(match_data, stringnumber)


Exactly.

> ? Is this anything more than just a cosmetic version of
> pcre2_substring_length_by{number,name}() ? For those two functions, we
> could add the specification that sizeptr could be NULL, in which case
> they are exactly "test for this group being set".


The dedicated and properly named function would probably be easier to
find by developers. And it would perform slightly faster. Maybe
insignificantly so, but many small improvements add up.

Other than that, your suggestion sounds fine!

Ralf