[pcre-dev] [Bug 1537] New: pcre_exec does not fill offsets f…

Author: Stanislav Malyshev
Date:  
To: pcre-dev
New-Topics: [pcre-dev] [Bug 1537] pcre_exec does not fill offsets for certain regexps
Subject: [pcre-dev] [Bug 1537] New: pcre_exec does not fill offsets for certain regexps

http://bugs.exim.org/show_bug.cgi?id=1537
           Summary: pcre_exec does not fill offsets for certain regexps
           Product: PCRE
           Version: 8.36
          Platform: x86-64
        OS/Version: All
            Status: NEW
          Severity: bug
          Priority: high
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: smalyshev@???
                CC: pcre-dev@???



In the PCRE documentation, in the section "How pcre_exec() returns captured
substrings", the returned offsets are described as follows:

The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3.
/.../
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs corresponding
to unused subpatterns are set to -1.
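
To make the documented contract concrete, here is a minimal sketch of how a caller might test whether a given subpattern was set, given the return value and the offset vector. The helper name group_is_set is hypothetical and not part of the PCRE API; it relies only on the rules quoted above (the return value is one more than the highest set pair, and unset pairs hold -1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   rc      - value returned by pcre_exec()
   ovector - offset vector that was passed to pcre_exec()
   n       - subpattern number (0 is the whole match)
   Returns 1 if pair n holds a usable range, 0 otherwise. */
static int group_is_set(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (rc <= 0 || n < 0 || n >= rc)
        return 0;   /* no match, vector too small, or pair not reported */
    /* Per the documentation, an unused subpattern has both values -1. */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= ovector[2 * n];
}
```

For the documented example, "abc" matched against (a|(z))(bc) returns 4 with pairs (0,3), (0,1), (-1,-1), (1,3), so the helper would report subpatterns 1 and 3 as set and 2 as unset.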

Now if we consider the following pattern:

/(?:((abcd))|(((?:(?:(?:(?:abc|(?:abcdef))))b)abcdefghi)abc)|((*ACCEPT)))/

and run pcretest with this pattern and the data "1234abcd", with a breakpoint
set on pcre_exec, then just before the pcre_exec call we have this:

#1  0x0000000100007ea6 in main (argc=1606415904, argv=0x7fff5fbff620) at
pcretest.c:5207
5207            PCRE_EXEC(count, re, extra, bptr, len, start_offset,
(gdb) l 5207
5202            }
5203    #endif
5204    
5205          else
5206            {
5207            PCRE_EXEC(count, re, extra, bptr, len, start_offset,
5208              options | g_notempty, use_offsets, use_size_offsets);
5209            if (count == 0)
5210              {
5211              fprintf(outfile, "Matched, but too many substrings\n");


So we are about to call pcre_exec, and the offsets array is use_offsets. Since
pcre_exec does not require the offsets array to be initialized, I filled the
array with junk data:

(gdb) x/12wx use_offsets
0x100100a00:    0xbeef5555      0xbeef5555      0xbeef5555      0xbeef5555
0x100100a10:    0xbeef5555      0xbeef5555      0xbeef5555      0xbeef5555
0x100100a20:    0xbeef5555      0xbeef5555      0xbeef5555      0xbeef5555
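
The same poisoning can be done in code rather than from the debugger. The following is only a sketch: the helper names poison_ovector and count_poisoned are hypothetical, and the fill word is simply the junk value used in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_WORD 0xbeef5555u   /* junk value used in this report */

/* Hypothetical helper: overwrite every int in the vector with the
   poison bit pattern. memcpy avoids a signed conversion. */
static void poison_ovector(int *ovector, size_t count)
{
    const unsigned int word = POISON_WORD;
    assert(ovector != NULL);
    for (size_t i = 0; i < count; i++)
        memcpy(&ovector[i], &word, sizeof word);
}

/* Hypothetical helper: count the ints that still hold the poison
   pattern, i.e. the ones nothing has written to since poisoning. */
static size_t count_poisoned(const int *ovector, size_t count)
{
    const unsigned int word = POISON_WORD;
    size_t untouched = 0;
    assert(ovector != NULL);
    for (size_t i = 0; i < count; i++)
        if (memcmp(&ovector[i], &word, sizeof word) == 0)
            untouched++;
    return untouched;
}
```

Calling poison_ovector before the match and count_poisoned on the first 2 * count entries afterwards would show how many reported offsets were never written.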


Now after returning from the pcre_exec call we get:

0x000000010000a32a in main (argc=1, argv=0x7fff5fbff648) at pcretest.c:5207
5207            PCRE_EXEC(count, re, extra, bptr, len, start_offset,
Value returned is $17 = 6
(gdb) p count
$19 = 6


So pcre_exec returned 6, which means we should expect six offset pairs: one
for the whole match plus one for each of the 5 capturing subpatterns. However,
if we look at the offsets data, we get this:
(gdb) x/12wx use_offsets
0x100100a00:    0x00000000      0x00000000      0xbeef5555      0xbeef5555
0x100100a10:    0xbeef5555      0xbeef5555      0xbeef5555      0xbeef5555
0x100100a20:    0xbeef5555      0xbeef5555      0x00000000      0x00000000


So only the first pair and the last pair of offsets were written; the rest
still contain the junk I put there. This is not only unexpected, it can also
have very bad consequences if the client trusts pcre_exec and passes the
offset table as is to pcre_get_substring_list(), since that function does not
check the offsets and computes sizes directly from them. In fact, pcretest
itself proceeds with:

0:
ERROR: bad negative value -1091611307 for offset 2
ERROR: bad negative value -1091611307 for offset 3
1: <unset>
ERROR: bad negative value -1091611307 for offset 4
ERROR: bad negative value -1091611307 for offset 5
2: <unset>
ERROR: bad negative value -1091611307 for offset 6
ERROR: bad negative value -1091611307 for offset 7
3: <unset>
ERROR: bad negative value -1091611307 for offset 8
ERROR: bad negative value -1091611307 for offset 9
4: <unset>
5:

This is not what one expects from simply matching a pattern.
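
Until this is resolved, a client could check the offsets itself before handing them to pcre_get_substring_list(). The sketch below is one possible guard, not anything PCRE provides: the name first_bad_pair is hypothetical, and it assumes that every set pair satisfies 0 <= start <= end <= subject length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical guard, not part of PCRE.
   Returns the number of the first offset pair that is neither the
   documented unset marker (-1, -1) nor a range inside the subject,
   or -1 if all rc pairs look sane. */
static int first_bad_pair(const int *ovector, int rc, int subject_len)
{
    assert(ovector != NULL);
    for (int n = 0; n < rc; n++) {
        int start = ovector[2 * n];
        int end = ovector[2 * n + 1];
        if (start == -1 && end == -1)
            continue;                     /* documented "unset" pair */
        if (start < 0 || end < start || end > subject_len)
            return n;                     /* uninitialized or corrupt */
    }
    return -1;
}
```

With the vector shown above (count 6, subject length 8) this would flag pair 1, the first one still holding junk.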

I would expect pcre_exec to set every offset pair up to the returned count,
either to correct values or at least to -1 as described in the docs quoted
above. If that is not the intent, the documentation should state clearly that
the offset array must be initialized (for example, zeroed out) before calling
pcre_exec.
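
As a caller-side workaround, the array can be pre-filled before each call. The sketch below fills it with -1 rather than zero, so that any pair pcre_exec leaves untouched reads as the documented "unset" value; the helper name ovector_reset is hypothetical, and whether this is the intended usage is exactly the open question of this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: mark every slot as unset
   before the vector is handed to pcre_exec(). */
static void ovector_reset(int *ovector, int ovecsize)
{
    assert(ovector != NULL);
    for (int i = 0; i < ovecsize; i++)
        ovector[i] = -1;
}
```

A caller would invoke ovector_reset(ovector, ovecsize) immediately before pcre_exec(re, extra, subject, length, 0, 0, ovector, ovecsize), using the same ovecsize for both.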

