Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
Sheri wrote:
> Philip Hazel wrote:
>
>> On Mon, 13 Aug 2007, Sheri wrote:
>>
>>
>>
>>>> but this fails:
>>>>
>>>> re> /(\n|\r)a/
>>>> data> \r\na
>>>> No match
>>>>
>>>>
>> This is because PCRE is in an impossible situation when CRLF is one of
>> the line ending possibilities. Looking for explicit \r and \n characters
>> in this case is bound to cause trouble.
>>
>> There is no match because of the following change that was made for 7.0:
>>
>> 46. For an unanchored pattern, if a match attempt fails at the start of 
>>     a newline sequence, and the newline setting is CRLF or ANY, and the
>>     next two characters are CRLF, advance by two characters instead of
>>     one.
>>
>> (ANYCRLF was subsequently added as a subset of ANY, with the same
>> behaviour in this situation.)
>>
>> So when PCRE fails to match at the start of your data string, it
>> advances over the CRLF (treating it as one composite character) and so
>> fails to match.
>>
>> Not doing this causes other problems, needless to say. The original
>> motivation for the change was for the newline=CRLF case, when running
>> the pattern ".+" against the string "\r\nfoo" would match "\nfoo" rather
>> than "foo".
>>
>> OK, in the ANY case that example would be different (because \n on its
>> own is also a valid newline), but I don't think the treatment should be
>> different for CRLF-only vs ANY or ANYCRLF.
>>
>> The bottom line is what I've said in the top line :-) which is that
>> using explicit \r and \n when in any of the CRLF modes is probably going
>> to cause problems. It's similar to using the \C escape in UTF-8 mode to
>> match one single byte.
>>
>> If anyone has any bright ideas about this, please post them. Otherwise,
>> I don't think there is anything I can do.
>>
>> Philip
>>
>>
>>
> Hi Philip,
>
> Change 46 need not apply if dot matches everything (?s) or if the user
> pattern has no dots in it. Could those conditions be exempted?
>

Having given this more thought, I would say:

Change 46 need not apply if ((dot matches everything) or (not multiline
and pattern has no dots in it)). Otherwise, I think there would be the
potential for empty matches between the \r and \n in a multiline pattern.
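
To spell that condition out, here is a rough sketch in Python. The function
and parameter names are made up for illustration only; they are not PCRE
internals, and the sketch assumes the newline setting is already one of the
CRLF-aware ones (CRLF, ANY or ANYCRLF).

```python
def change_46_applies(dotall, multiline, has_dot):
    """Return True if the CRLF double-advance should still be used.

    dotall    -- dot matches everything, as with (?s)
    multiline -- (?m) is in effect
    has_dot   -- the pattern contains at least one dot
    All three parameters are hypothetical names for this sketch.
    """
    # Proposed exemption: dot matches everything, or the pattern is
    # not multiline and contains no dots.
    exempt = dotall or (not multiline and not has_dot)
    return not exempt
```

So a non-multiline pattern with no dots, such as /(\n|\r)a/, would be
exempt, while any multiline pattern without (?s) would keep the current
behaviour.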

For patterns such as "(?m)s*" and data "\r\n", change 46 is desirable
when CRLF is a valid line ending.
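
For anyone who wants to see the effect of change 46 on the /(\n|\r)a/ case
without building PCRE, the advance step can be modelled roughly in Python.
This is only a sketch: it uses Python's re module for the anchored match
attempts, the function name and the skip_crlf flag are my own inventions,
and it models nothing of PCRE except the "advance by two over CRLF" rule.

```python
import re

def search_with_bumpalong(pattern, data, skip_crlf):
    """Model an unanchored search: try an anchored match at each start
    offset and advance after every failed attempt (hypothetical helper)."""
    compiled = re.compile(pattern)
    pos = 0
    while pos <= len(data):
        m = compiled.match(data, pos)  # anchored attempt at this offset
        if m is not None:
            return m
        # Change 46 as described: if the failed attempt started at a
        # CRLF pair, step over both characters instead of one.
        if skip_crlf and data.startswith("\r\n", pos):
            pos += 2
        else:
            pos += 1
    return None

data = "\r\na"
with_skip = search_with_bumpalong(r"(\n|\r)a", data, skip_crlf=True)
without_skip = search_with_bumpalong(r"(\n|\r)a", data, skip_crlf=False)
print(with_skip)            # None: the attempt at the \n is never made
print(without_skip.span())  # (1, 3): matches "\na"
```

With the double advance the start offset jumps from 0 to 2, so the only
offset where the pattern could match (the \n at offset 1) is skipped.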

Regards,
Sheri