Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Philip Hazel
Date:  
To: Sheri
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
On Mon, 13 Aug 2007, Sheri wrote:

> > but this fails:
> >
> > re> /(\n|\r)a/
> > data> \r\na
> > No match


This is because PCRE is in an impossible situation when CRLF is one of
the line ending possibilities. Looking for explicit \r and \n characters
in this case is bound to cause trouble.

There is no match because of the following change that was made for 7.0:

46. For an unanchored pattern, if a match attempt fails at the start of 
    a newline sequence, and the newline setting is CRLF or ANY, and the
    next two characters are CRLF, advance by two characters instead of
    one.


(ANYCRLF was subsequently added as a subset of ANY, with the same
behaviour in this situation.)

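So when PCRE fails to match at the start of your data string, it
advances over the CRLF (treating it as one composite character) and
never tries a match starting at the \n, which is the only position
where your pattern could match. The next attempt starts at "a", and so
the overall match fails.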
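
The effect of that rule can be imitated outside PCRE. What follows is
only a rough sketch in Python, not PCRE's own code: Python's re module
stands in for the matcher, and the function name search_skipping_crlf
is made up for illustration. It tries the pattern at each start
position in turn and, after a failure at a CRLF, steps over both
characters.

```python
import re

def search_skipping_crlf(pattern, subject):
    """Unanchored search that steps over CRLF as one unit after a failure.

    Hypothetical helper: it imitates the start-position advance described
    in change 46, using Python's re as a stand-in for PCRE.
    """
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(subject):
        m = regex.match(subject, pos)  # try a match starting exactly at pos
        if m:
            return m
        # After a failure, advance by two if we are sitting on CRLF,
        # otherwise by one.
        pos += 2 if subject.startswith("\r\n", pos) else 1
    return None

pattern = r"(\n|\r)a"
subject = "\r\na"

print(search_skipping_crlf(pattern, subject))  # start positions tried: 0, 2, 3
print(re.search(pattern, subject))             # start positions tried: 0, 1
```

With the skip in place, position 1 (the \n) is never used as a starting
point, so the first call finds nothing, while the ordinary
one-character advance finds "\na".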

Not doing this causes other problems, needless to say. The original
motivation for the change was the newline=CRLF case, where running
the pattern ".+" against the string "\r\nfoo" would match "\nfoo" rather
than "foo".
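
That motivating case can be approximated in the same spirit. This is a
sketch under assumptions, not PCRE itself: Python's "." does not behave
like PCRE's dot with newline=CRLF, so the expression below is a
stand-in that refuses to match at a CR immediately followed by LF and
otherwise matches any character. The names DOT_PLUS_CRLF and
first_match are made up for illustration.

```python
import re

# Stand-in for ".+" under newline=CRLF (an approximation): each character
# may be anything, as long as we are not positioned at the start of CRLF.
DOT_PLUS_CRLF = re.compile(r"(?:(?!\r\n)[\s\S])+")

def first_match(subject, skip_crlf):
    """Return the first matched text, trying each start position in turn.

    skip_crlf=False advances one character after a failure;
    skip_crlf=True advances over CRLF as a unit.
    """
    pos = 0
    while pos <= len(subject):
        m = DOT_PLUS_CRLF.match(subject, pos)
        if m:
            return m.group(0)
        if skip_crlf and subject.startswith("\r\n", pos):
            pos += 2
        else:
            pos += 1
    return None

print(repr(first_match("\r\nfoo", skip_crlf=False)))  # one-character advance
print(repr(first_match("\r\nfoo", skip_crlf=True)))   # CRLF skipped as a unit
```

Advancing one character at a time lets the match start in the middle of
the CRLF and pick up the stray \n; skipping the pair starts the match
at "foo".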

OK, in the ANY case that example would be different (because \n on its
own is also a valid newline), but I don't think the treatment should be
different for CRLF-only vs ANY or ANYCRLF.

The bottom line is what I've said in the top line :-) which is that
using explicit \r and \n when in any of the CRLF modes is probably going
to cause problems. It's similar to using the \C escape in UTF-8 mode to
match one single byte.

If anyone has any bright ideas about this, please post them. Otherwise,
I don't think there is anything I can do.

Philip

--
Philip Hazel, University of Cambridge Computing Service.