Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
Philip Hazel wrote:
> On Mon, 13 Aug 2007, Sheri wrote:
>
>
>>> but this fails:
>>>
>>> re> /(\n|\r)a/
>>> data> \r\na
>>> No match
>>>
>
> This is because PCRE is in an impossible situation when CRLF is one of
> the line ending possibilities. Looking for explicit \r and \n characters
> in this case is bound to cause trouble.
>
> There is no match because of the following change that was made for 7.0:
>
> 46. For an unanchored pattern, if a match attempt fails at the start of 
>     a newline sequence, and the newline setting is CRLF or ANY, and the
>     next two characters are CRLF, advance by two characters instead of
>     one.
>
> (ANYCRLF was subsequently added as a subset of ANY, with the same
> behaviour in this situation.)
>
> So when PCRE fails to match at the start of your data string, it
> advances over the CRLF (treating it as one composite character) and so
> fails to match.
>
> Not doing this causes other problems, needless to say. The original
> motivation for the change was for the newline=CRLF case, when running
> the pattern ".+" against the string "\r\nfoo" would match "\nfoo" rather
> than "foo".
>
> OK, in the ANY case that example would be different (because \n on its
> own is also a valid newline), but I don't think the treatment should be
> different for CRLF-only vs ANY or ANYCRLF.
>
> The bottom line is what I've said in the top line :-) which is that
> using explicit \r and \n when in any of the CRLF modes is probably going
> to cause problems. It's similar to using the \C escape in UTF-8 mode to
> match one single byte.
>
> If anyone has any bright ideas about this, please post them. Otherwise,
> I don't think there is anything I can do.
>
> Philip
>
>

Hi Philip,

Change 46 need not apply when dot matches everything, i.e. (?s) is set,
or when the pattern contains no dots at all. Could those cases be
exempted?
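
For anyone following along, the effect of change 46 can be sketched roughly in Python. This is only an illustration, not PCRE's actual code: the `search` function, its `crlf_skip` flag, and the `DOT_CRLF` stand-in for ".+" under newline=CRLF are names made up for this sketch, and Python's `re` module is used in place of PCRE's matcher.

```python
import re

# Hypothetical stand-in for ".+" under newline=CRLF: match any character,
# except that a match may not step onto a CR that is followed by LF.
DOT_CRLF = r"(?:(?!\r\n)[\s\S])+"

def search(pattern, subject, crlf_skip=True):
    # Unanchored search: try a match at each start offset in turn.
    # With crlf_skip=True, a failed attempt at an offset where the subject
    # has CR LF advances by two characters instead of one, imitating the
    # rule quoted as change 46.
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(subject):
        m = regex.match(subject, pos)
        if m:
            return m
        if crlf_skip and subject.startswith("\r\n", pos):
            pos += 2  # step over the CRLF as one composite character
        else:
            pos += 1
    return None

# The failing case from this thread: the only possible match starts at
# the LF, which the two-character advance never tries.
print(search(r"(\n|\r)a", "\r\na"))                   # skip on
print(search(r"(\n|\r)a", "\r\na", crlf_skip=False))  # skip off

# The case that motivated the change.
print(repr(search(DOT_CRLF, "\r\nfoo").group()))
print(repr(search(DOT_CRLF, "\r\nfoo", crlf_skip=False).group()))
```

With the skip enabled the first pattern finds nothing, while the second finds "foo" rather than "\nfoo", which is the trade-off under discussion.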

Another approach would be to make the newline options available as
internal options, set within the pattern itself (particularly
newline=LF). That way the user could override a default newline setting
that triggers change 46 when using the POSIX functions or an app that
doesn't support the external options.

If all the newline options were available as internal options, users
would gain that control in the wide variety of apps that do not
currently support the external options that are already available.

Regards,
Sheri