[pcre-dev] [Bug 1032] With MULTILINE option, cannot correct…

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1032] With MULTILINE option, cannot correctly handle searching for $ over CRLF
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1032




--- Comment #1 from Philip Hazel <ph10@???> 2010-10-20 16:41:21 ---
On Wed, 20 Oct 2010, Richard Smith wrote:

> For a MULTILINE search of a string containing a single embedded CRLF and the
> search pattern "(*ANY)$" you end up with matches on both the CR and LF, even
> though CRLF is a single entity and there is only one line-end.


Please state which of PCRE's newline-selection options you have set, if
any. That is, are any of these options set?

PCRE_NEWLINE_CR     
PCRE_NEWLINE_LF     
PCRE_NEWLINE_CRLF   
PCRE_NEWLINE_ANY    
PCRE_NEWLINE_ANYCRLF


> When encountering a non zero-length match, the procedure is to resume
> searching from immediately after the match. For a zero length match
> the procedure is to retry for a non-zero-length match, or advance to
> the next character if there is none.


The pcre_exec() code contains the following comment:

/* If we have just passed a CR and we are now at a LF, and the pattern does
not contain any explicit matches for \r or \n, and the newline option is CRLF
or ANY or ANYCRLF, advance the match position by one more character. */

In other words, if CRLF is a valid line-ending sequence, it does try to
skip two bytes when advancing the start of the match.
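
As a rough illustration of the rule in that comment, here is a minimal sketch of the advance step. It is not the actual pcre_exec() source. The function name bump_start and both flag parameters are hypothetical, and the sketch assumes single-byte characters (no UTF-8 handling).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API: compute the next start
   position to try after the attempt at `start` has been dealt with.
   crlf_is_newline:      nonzero when the newline option is CRLF, ANY or ANYCRLF.
   pattern_has_cr_or_lf: nonzero when the pattern explicitly matches \r or \n. */
size_t bump_start(const char *subject, size_t length, size_t start,
                  int crlf_is_newline, int pattern_has_cr_or_lf)
{
    assert(subject != NULL && start < length);

    start++;                                 /* normal one-character advance */

    if (crlf_is_newline && !pattern_has_cr_or_lf &&
        subject[start - 1] == '\r' &&        /* we have just passed a CR ... */
        start < length &&
        subject[start] == '\n')              /* ... and are now at a LF */
        start++;                             /* advance by one more character */

    return start;
}
```

With the subject "a\r\nb" and a start position of 1 (the CR), this sketch returns 3 when CRLF is a valid newline and the pattern has no explicit \r or \n, and 2 otherwise.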

However, I get the impression this is not what you are concerned about,
because of this remark:

> A possible resolution would be for pcre_exec to report the zero-length
> match on the CR _and_ somehow indicate that searching should resume
> two characters further on. At present ovector[0] and ovector[1] return
> the position of the start and end of the match within the string (when
> there's a match) and -1, -1 when there is none. When there is none,
> perhaps ovector[1] should indicate the resume position.


I am not clear which bit of code is doing the resuming. If it is user
code (which presumably knows it has set CRLF as a valid line ending),
surely it can decide for itself to skip over CRLF instead of just CR?
The pcretest program, when used with the /g option, does just that. It
contains the following comment:

Complication arises in the case when the newline option is "any" or
"anycrlf". If the previous match was at the end of a line terminated by
CRLF, an advance of one character just passes the \r, whereas we should
prefer the longer newline sequence, as does the code in pcre_exec().
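
The effect on a calling loop can be sketched as follows. This is not the pcretest source. To stay self-contained it replaces pcre_exec() with a hypothetical stand-in, dollar_matches_at(), which assumes that a multiline "$" with newline type "any" matches before a CR, before an LF, and at the end of the subject. The name find_line_ends and its parameters are likewise made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one anchored attempt at a multiline "$" with
   newline type "any": it matches with zero length at the end of the
   subject or immediately before a CR or an LF. */
int dollar_matches_at(const char *s, size_t len, size_t pos)
{
    return pos == len || s[pos] == '\r' || s[pos] == '\n';
}

/* Record up to `max` match offsets for "$" over the whole subject, in the
   manner of a global (/g style) loop, and return how many were recorded.
   When crlf_aware is nonzero, an advance that starts at a CR followed by
   an LF steps over both characters. */
size_t find_line_ends(const char *s, size_t len, int crlf_aware,
                      size_t *out, size_t max)
{
    size_t count = 0;
    size_t pos = 0;

    assert(s != NULL && out != NULL);

    while (pos <= len) {
        if (dollar_matches_at(s, len, pos) && count < max)
            out[count++] = pos;

        /* Every match of "$" is empty, so the loop advances by hand. */
        if (crlf_aware && pos + 1 < len && s[pos] == '\r' && s[pos + 1] == '\n')
            pos += 2;        /* prefer the longer newline sequence */
        else
            pos += 1;
    }
    return count;
}
```

For the subject "a\r\nb", the naive loop (crlf_aware set to 0) records matches at offsets 1, 2 and 4, which is the double match described in the report. The CRLF-aware loop records only offsets 1 and 4.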

Philip


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email