Re: [pcre-dev] pcre_exec() with out-of-bounds offset

Author: Philip Hazel
Date:  
To: Ralf Junker
CC: pcre-dev
Subject: Re: [pcre-dev] pcre_exec() with out-of-bounds offset
On Tue, 2 Nov 2010, Ralf Junker wrote:

> pcre_exec() returns strange values if I pass an offset greater than the
> length of the subject string.


I have committed a patch that fixes this. I am amazed that nobody
noticed before that there was no check on the value of the starting
offset. It now returns PCRE_ERROR_BADOFFSET if the value is negative or
greater than the length of the subject. (It may, of course, be equal to
the length of the subject.)
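
For readers who want to guard against this in code that may run with an older library, the rule described above can be sketched as a caller-side check. This is only an illustration: check_start_offset() and SKETCH_ERROR_BADOFFSET are hypothetical names, not part of PCRE, and the value -24 is what pcre.h is believed to define for PCRE_ERROR_BADOFFSET, so verify it against your own header.

```c
/* Assumed to mirror PCRE_ERROR_BADOFFSET from pcre.h; redefined here only
   so that the sketch compiles without the PCRE headers. */
#define SKETCH_ERROR_BADOFFSET (-24)

/* Hypothetical helper, not PCRE's own code: returns 0 if start_offset is
   acceptable for a subject of the given length, otherwise the error code. */
int check_start_offset(int length, int start_offset)
{
    if (start_offset < 0 || start_offset > length)
        return SKETCH_ERROR_BADOFFSET;
    return 0;   /* start_offset == length is allowed */
}
```

A caller would run this check on the same length and start_offset values it is about to pass to pcre_exec(), and skip the call if the result is non-zero.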

> pcre_exec() with offset = 10 returns 1. Plus, it sets ovector to these
> values:
>
> ovector[1] = 9
> ovector[2] = -2147483640
> ovector[3] = -2139062144
> ovector[4] = -2139062144
> ... and so on ...


> This indicates to me that some of the ovector elements are not
> initialized. Is this intended?


Yes. The documentation says that it sets only those values that are
relevant for the regex. Your regex has no capturing parentheses, so it
should set only 0 and 1. The comment in the code says this:

/* Compute the minimum number of offsets that we need to reset each time. Doing
this makes a huge difference to execution time when there aren't many brackets
in the pattern. */

... and the documentation says this:

The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair
that has been set.

... and also this:

It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For
example, if the string "abc" is matched against the pattern
(a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3
are matched, but 2 is not. When this happens, both values in the
offset pairs corresponding to unused subpatterns are set to -1. Offset
values that correspond to unused subpatterns at the end of the
expression are also set to -1.
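
The two documentation extracts above suggest a simple way to test whether a given subpattern took part in a match: it must be below the return value, and its pair must not be -1. A minimal sketch, assuming a positive return value rc from pcre_exec(); group_is_set() is a hypothetical helper, not a PCRE function.

```c
/* Hypothetical helper: returns 1 if capturing subpattern `group` was set,
   given the positive return value rc of pcre_exec() and its ovector. */
int group_is_set(const int *ovector, int rc, int group)
{
    if (group < 0 || group >= rc)
        return 0;                     /* beyond the highest pair that was set */
    return ovector[2 * group] >= 0;   /* unused subpatterns hold -1 in both slots */
}
```

For the (a|(z))(bc) example, rc is 4 and the first eight ovector values would be 0,3, 0,1, -1,-1, 1,3, so the helper reports subpatterns 1 and 3 as set and 2 as unset.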

Hmm. Maybe it should say explicitly that if a pattern has n capturing
parentheses, ovector slots that correspond to capturing parens n+1 and
greater are left alone.
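
A caller who prefers not to see leftover values in those untouched slots can fill the vector before the call. This is a sketch of one possible caller-side workaround, not something PCRE does or requires; ovector_prefill() is a hypothetical name.

```c
/* Hypothetical helper: mark every slot as "unset" before calling
   pcre_exec(), so that slots the library leaves alone read as -1. */
void ovector_prefill(int *ovector, int ovecsize)
{
    for (int i = 0; i < ovecsize; i++)
        ovector[i] = -1;
}
```

Call it on the same ovector and ovecsize that are then passed to pcre_exec(). Note that pcre_exec() uses the last third of the vector as workspace, so only the first two-thirds can be expected to keep the pre-filled values.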

Philip

--
Philip Hazel