Author: Richard Smith
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1032] New: With MULTILINE option, cannot correctly handle searching for $ over CRLF
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1032
           Summary: With MULTILINE option, cannot correctly handle searching
                    for $ over CRLF
           Product: PCRE
           Version: 8.00
          Platform: x86-64
        OS/Version: Linux
            Status: NEW
          Severity: bug
          Priority: low
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: exim@???
                CC: pcre-dev@???



The search character $ is an anchor; it doesn't consume any characters.
Therefore the search pattern "$" will only ever return zero-length matches.
This report applies to that pattern and to more complex patterns that can
also return zero-length matches.

For a MULTILINE search of a string containing a single embedded CRLF and the
search pattern "(*ANY)$" you end up with matches on both the CR and LF, even
though CRLF is a single entity and there is only one line-end. The reason for
this is as follows:

When encountering a non-zero-length match, the procedure is to resume searching
from immediately after the match. For a zero-length match the procedure is to
retry for a non-zero-length match, or advance to the next character if there is
none.

When you hit the zero-length match at the CR, you end up resuming the search at
the LF, which is also a valid line-ending character, so you get another,
erroneous, match there.
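
The effect can be reproduced without PCRE itself. The following is only a
minimal sketch under stated assumptions: dollar_matches_at() and
naive_find_all() are hypothetical names, and the model treats just CR and LF
as line endings, ignoring the other newline forms that (*ANY) recognises. It
shows the "advance one character after a zero-length match" rule on its own.

```c
#include <stddef.h>

/* Hypothetical model (not PCRE code) of where a multiline "$" may match
   under (*ANY): at the end of the subject, or just before a CR or an LF.
   Only CR and LF are modelled. */
int dollar_matches_at(const char *s, size_t len, size_t pos)
{
    return pos == len || s[pos] == '\r' || s[pos] == '\n';
}

/* Naive global search: "$" can only match with zero length, so after each
   match the search simply moves on by one character. Match offsets are
   stored in out[] (at most max of them); the count is returned. */
size_t naive_find_all(const char *s, size_t len, size_t *out, size_t max)
{
    size_t n = 0;
    for (size_t pos = 0; pos <= len; pos++) {
        if (dollar_matches_at(s, len, pos) && n < max)
            out[n++] = pos;
    }
    return n;
}
```

For the six-character subject "ab\r\ncd" this model records offsets 2, 3 and
6: one match before the CR, a second before the LF, and one at the end of the
subject, although the subject contains only one embedded line-end.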

Theoretically it is possible for the caller of pcre_exec to spot the CR LF at
the zero-length match position and itself advance two characters - but this
requires that the caller knows that CRLF is a single entity, and that would
require that it parse the start of the search pattern for sequences such as
(*ANY).
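
A caller-side workaround of this kind might look roughly like the sketch
below. It is a model, not PCRE code: all three function names are
hypothetical, only CR and LF are modelled as line endings, and the
crlf_is_newline flag stands for exactly the knowledge that the caller would
have to obtain by some other means, such as parsing the pattern.

```c
#include <stddef.h>

/* Hypothetical model of multiline "$" under (*ANY); only CR and LF are
   modelled as line endings. */
int dollar_matches_at(const char *s, size_t len, size_t pos)
{
    return pos == len || s[pos] == '\r' || s[pos] == '\n';
}

/* Hypothetical helper: offset at which to resume after a zero-length match
   at pos. If the caller knows that CRLF counts as one newline and pos is at
   a CR followed by an LF, step over both characters. */
size_t resume_after_empty_match(const char *s, size_t len, size_t pos,
                                int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < len &&
        s[pos] == '\r' && s[pos + 1] == '\n')
        return pos + 2;
    return pos + 1;
}

/* Global search that uses the helper above. Match offsets are stored in
   out[] (at most max of them); the count is returned. */
size_t find_all_crlf_aware(const char *s, size_t len, size_t *out,
                           size_t max, int crlf_is_newline)
{
    size_t n = 0, pos = 0;
    while (pos <= len && n < max) {
        if (dollar_matches_at(s, len, pos)) {
            out[n++] = pos;
            pos = resume_after_empty_match(s, len, pos, crlf_is_newline);
        } else {
            pos++;
        }
    }
    return n;
}
```

With crlf_is_newline set, the subject "ab\r\ncd" yields offsets 2 and 6 only;
with the flag clear, the extra match at offset 3 comes back.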

A possible resolution would be for pcre_exec to report the zero-length match on
the CR _and_ somehow indicate that searching should resume two characters
further on. At present ovector[0] and ovector[1] return the position of the
start and end of the match within the string (when there's a match) and -1, -1
when there is none. When there is none, perhaps ovector[1] should indicate the
resume position.
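
To make the idea concrete, here is a rough sketch of a matcher that reports
the resume position itself, so the caller needs no knowledge of the newline
convention. Everything in it is hypothetical: struct model_result and
model_exec() do not exist in PCRE, the resume offset is carried in a separate
field rather than in ovector[1] purely to keep the illustration simple, and
only CR, LF and CRLF are modelled.

```c
#include <stddef.h>

/* Hypothetical result type; nothing like it exists in the PCRE API. */
struct model_result {
    int start;      /* match start, or -1 when there is no match */
    int end;        /* match end, or -1 when there is no match */
    size_t resume;  /* offset at which the caller should resume searching */
};

/* Hypothetical model of a multiline "$" search under (*ANY), starting at
   offset `from`. The matcher knows the newline convention, so it can work
   out the resume offset on the caller's behalf. */
struct model_result model_exec(const char *s, size_t len, size_t from)
{
    struct model_result r = { -1, -1, len + 1 };
    for (size_t pos = from; pos <= len; pos++) {
        if (pos == len || s[pos] == '\r' || s[pos] == '\n') {
            r.start = r.end = (int)pos;
            /* Step over the whole newline so CRLF is reported only once. */
            r.resume = (pos + 1 < len && s[pos] == '\r' && s[pos + 1] == '\n')
                       ? pos + 2 : pos + 1;
            return r;
        }
    }
    return r;
}
```

A caller would then loop, passing each result's resume value back in as the
next `from`, until start comes back as -1.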

