Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
Philip Hazel wrote:
> On Mon, 13 Aug 2007, Sheri wrote:
>
>
>>> but this fails:
>>>
>>> re> /(\n|\r)a/
>>> data> \r\na
>>> No match
>>>
>
> This is because PCRE is in an impossible situation when CRLF is one of
> the line ending possibilities. Looking for explicit \r and \n characters
> in this case is bound to cause trouble.
>
> There is no match because of the following change that was made for 7.0:
>
> 46. For an unanchored pattern, if a match attempt fails at the start of 
>     a newline sequence, and the newline setting is CRLF or ANY, and the
>     next two characters are CRLF, advance by two characters instead of
>     one.
>
> (ANYCRLF was subsequently added as a subset of ANY, with the same
> behaviour in this situation.)
>
> So when PCRE fails to match at the start of your data string, it
> advances over the CRLF (treating it as one composite character) and so
> fails to match.
>
>

Hi Philip,

data> \r\na

What allows /\na/<crlf> to match the data but not /(\n|\r)a/<crlf>?

Why does the use of alternation cause No match? Doesn't it seem they
would both fail at the start of the data?
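
To make the question concrete, here is a rough model of change 46 as I
read it. It is only a sketch: it uses Python's re module rather than
PCRE, the function name and the crlf_aware flag are made up for
illustration, and it ignores whatever else PCRE does internally before
a match attempt. Taken literally, the rule predicts No match for both
patterns, so this model does not account for /\na/ matching.

```python
import re


def search_with_rule_46(pattern, data, crlf_aware=True):
    """Hypothetical model of an unanchored search under change 46.

    Tries an anchored match at each start offset. When an attempt fails
    and the next two characters are CRLF, the start offset advances by
    two instead of one (only if crlf_aware is True).
    """
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(data):
        m = regex.match(data, pos)  # anchored attempt at offset pos
        if m:
            return m
        if crlf_aware and data.startswith("\r\n", pos):
            pos += 2  # treat CRLF as one composite character
        else:
            pos += 1
    return None


data = "\r\na"

# With the CRLF skip, neither pattern is ever tried at offset 1.
print(search_with_rule_46(r"(\n|\r)a", data))
print(search_with_rule_46(r"\na", data))

# Advancing one character at a time, both patterns match at offset 1.
print(search_with_rule_46(r"(\n|\r)a", data, crlf_aware=False))
print(search_with_rule_46(r"\na", data, crlf_aware=False))
```

If the rule were applied uniformly, I would expect the two patterns to
behave alike, which is why the difference surprises me.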

It seems to me that when searching for \r or \n (or any other way of
specifying carriage return or linefeed, e.g., \x0a), the user is
searching for non-composite characters rather than line breaks. Because
there is no metacharacter for matching CRLFs or line breaks in general,
users will often be looking for explicit \r and \n when the line ending
choice includes CRLF.

Could you suspend change 46 when the dollar-endonly option is in effect
(and the pattern is not multiline)? Isn't that option set by the POSIX
API?

> ...though of course you could do something like that outside in your
> application and set the options for PCRE appropriately. In other words,
> write your own wrapper that picked apart a string into options and
> regex.

The whole idea was to preserve our existing functions (which use the
POSIX API) unmodified, and support preexisting usages, while also
providing new functions (using the native API) with enhanced
capabilities. The way it's working out (unless something changes),
preexisting usage through the POSIX API is not really supported.

I really think it should be possible to get the same results with any of
the newline options (as the default) when the pattern contains no ^, $,
dot or \Z and doesn't match the empty string. Perhaps tying change 46 to
the dollar-endonly option could achieve that.

Regards,
Sheri