Author: Philip Hazel
Date:
To: Ralf Junker
CC: pcre-dev
Subject: Re: [pcre-dev] pcre_exec() with out-of-bounds offset
On Tue, 2 Nov 2010, Ralf Junker wrote:
> pcre_exec() returns strange values if I pass an offset greater than the
> length of the subject string.
I have committed a patch that fixes this. I am amazed that nobody
noticed before that there was no check on the value of the starting
offset. It now returns PCRE_ERROR_BADOFFSET if the value is negative or
greater than the length of the subject. (It may, of course, be equal to
the length of the subject.)
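For illustration only, the rule described above can be sketched as a small range check. This is not the committed patch: the helper name check_start_offset is hypothetical, and SKETCH_ERROR_BADOFFSET is a local stand-in whose numeric value is an assumption. Real code should use PCRE_ERROR_BADOFFSET from pcre.h.

```c
/* Local stand-in for PCRE_ERROR_BADOFFSET; the numeric value here is an
   assumption for illustration. Real code should take it from pcre.h. */
#define SKETCH_ERROR_BADOFFSET (-24)

/* Hypothetical helper, not part of PCRE: returns 0 if start_offset is an
   acceptable starting point in a subject of the given length, or the
   stand-in error code otherwise. */
static int check_start_offset(int start_offset, int length)
{
    /* Negative offsets and offsets past the end are rejected. */
    if (start_offset < 0 || start_offset > length)
        return SKETCH_ERROR_BADOFFSET;

    /* An offset equal to the length is allowed: it points just past the
       last character of the subject. */
    return 0;
}
```

With a subject of length 9, offsets 0 through 9 pass this check, while 10 (the value from the original report) and -1 do not.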
> pcre_exec() with offset = 10 returns 1. Plus, it sets ovector to these
> values:
>
> ovector[1] = 9
> ovector[2] = -2147483640
> ovector[3] = -2139062144
> ovector[4] = -2139062144
> ... and so on ...
>
> This indicates to me that some of the ovector elements are not
> initialized. Is this intended?
Yes. The documentation says that it sets only those values that are
relevant for the regex. Your regex has no capturing parentheses, so it
should set only 0 and 1. The comment in the code says this:
/* Compute the minimum number of offsets that we need to reset each time. Doing
this makes a huge difference to execution time when there aren't many brackets
in the pattern. */
... and the documentation says this:
The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair
that has been set.
... and also this:
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For
example, if the string "abc" is matched against the pattern
(a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3
are matched, but 2 is not. When this happens, both values in the
offset pairs corresponding to unused subpatterns are set to -1. Offset
values that correspond to unused subpatterns at the end of the
expression are also set to -1.
Hmm. Maybe it should say explicitly that if a pattern has n capturing
parentheses, ovector slots that correspond to capturing parens n+1 and
greater are left alone.
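
On the calling side, this means the return value should decide which ovector pairs are looked at. A minimal sketch, assuming a positive return value rc from pcre_exec() and an int ovector as in the report; the helper name group_is_set is hypothetical and not part of PCRE:

```c
/* Hypothetical helper, not part of PCRE: reports whether capturing
   subpattern n (0 = the whole match) was set by a successful match.

   rc is the value returned by pcre_exec(). Pairs numbered rc and above
   were not written for this match, so they are never read here. Pairs
   below rc that correspond to unused subpatterns hold -1 in both slots. */
static int group_is_set(const int *ovector, int rc, int n)
{
    /* No usable pairs (error, no match, or rc == 0), or n out of range. */
    if (rc <= 0 || n < 0 || n >= rc)
        return 0;

    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For the documented example, "abc" matched against (a|(z))(bc) returns 4, so pairs 0 to 3 are meaningful: pairs 0, 1 and 3 hold offsets, and pair 2 holds -1, -1. Anything stored in pair 4 or later is whatever the caller's memory happened to contain.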