Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
Hi Philip,

Philip Hazel wrote:
> On Fri, 17 Aug 2007, Sheri wrote:
>
>
>> data> \r\na
>> What allows /\na/<crlf> to match the data but not /(\n|\r)a/<crlf>?
>>
>
> Oh dear, this shows up yet another darn complication of using CRLF. What
> a pain! I'm beginning to regret implementing the CRLF stuff...
>
> At first I expected /\na/<crlf> not to match \r\na because, after having
> failed to match at the first character, it would skip over the whole
> CRLF. However, there's an optimization in PCRE. When it can, it records
> the first byte that must be present for any match. In this case, it
> knows the match must begin with \n, so it skips along to there right at
> the start before trying to match. Sigh.
>
> In the /(\n|\r)a/<crlf> case there is no "fixed first character", so it
> starts the first match at offset 0. You'd get the same effect with
> /[\r\n]a/<crlf>.
>
>
>> Why does the use of alternation cause No match? Doesn't it seem they
>> would both fail at the start of the data?
>>
>
> Indeed it does. Another Sigh. I *could* discard the optimization if the
> first byte is \n and <crlf> is in effect, I suppose. However, I'm not
> sure that tinkering at the edges like this is right. As I said
> previously, I really don't think that searching for explicit \r or \n
> characters is sensible with <CRLF> in effect.
>
>
>> It seems to me that when searching for \r or \n (or any other way of
>> specifying carriage return or linefeed, e.g., \x0a) the user is
>> searching for non composite characters rather than line breaks.
>>
>
> Indeed.
>
>
>> Because there is no metacharacter for matching crlfs or line breaks in
>> general,
>>
>
> But there is! From release 7.0 \R matches any Unicode linebreak
> sequence.
>
>

I can't recommend that people use \R for that purpose, because it will
match non-linebreaks as well. In non-UTF8 data it matches vertical tabs
and horizontal ellipsis. I think we had a previous discussion about it: I
suggested that the meaning of \R be made to vary with the line ending in
effect, but that didn't fly.

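To illustrate the point about \R, here is a minimal sketch in Python. It is not PCRE itself. The pcrepattern documentation describes \R in non-UTF-8 mode as equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85); the sketch assumes a non-atomic version of that alternation is a close enough stand-in, and the names BACKSLASH_R and matches_as_linebreak are made up for the example.

```python
import re

# Assumed stand-in for \R in non-UTF-8 mode. PCRE documents the atomic
# form (?>\r\n|\n|\x0b|\f|\r|\x85); a plain group is used here instead.
BACKSLASH_R = re.compile(rb"(?:\r\n|\n|\x0b|\f|\r|\x85)")


def matches_as_linebreak(byte_string):
    """Return True if the whole byte string is consumed by the stand-in."""
    return BACKSLASH_R.fullmatch(byte_string) is not None


# CRLF and LF are accepted, but so are vertical tab (0x0b) and byte 0x85.
for sample in (b"\r\n", b"\n", b"\x0b", b"\x85", b"a"):
    print(sample, matches_as_linebreak(sample))

# In Windows-1252, byte 0x85 is the horizontal ellipsis character.
print(b"\x85".decode("cp1252") == "\u2026")
```

Under these assumptions, a vertical tab or a Windows-1252 ellipsis in ordinary text would be treated as a line break by a pattern that relies on \R.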
>> Could you exempt change 46 when dollar endonly is in effect (when the
>> pattern is not multiline)? Isn't that set by POSIX?
>>
>
> The POSIX interface doesn't set any specific options.
>
>
>> I really think it should be possible to get the same results with any of
>> the newline options (as default), when the pattern lacks any ^, $, dot
>> or \Z and doesn't match the empty string. Perhaps tying it to dollar
>> endonly option could achieve that.
>>
>
> I think you have to say "lacks \r and \n" as well - remember, there is
> \R if people want to match linebreaks explicitly.
>
> I suppose one could say "if the pattern contains any explicit \r or \n
> characters, don't do the change 46 thing". That seems to me to be
> such a special case that it doesn't really feel "clean", but perhaps
> it's the best that can be done.
>
> Philip
>
>

Will give that some thought and get back to you later. Also will test
and review the character class vs alternation approach (for user
patterns) as currently implemented.

Regards,
Sheri
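
The behaviour Philip describes in the quoted text (a recorded first byte versus stepping over a whole CRLF after a failed attempt) can be modelled with a toy search loop. This is only a sketch in Python under stated assumptions, not PCRE's actual code: toy_search and its first_char parameter are hypothetical names, and Python's re module stands in for the matcher at each start offset.

```python
import re


def toy_search(pattern, subject, first_char=None):
    """Toy model of an unanchored search with the newline set to CRLF.

    first_char is a hypothetical stand-in for a recorded "first byte".
    Returns the offset where a match starts, or None if there is none.
    """
    regex = re.compile(pattern)
    start = 0
    while start <= len(subject):
        if first_char is not None:
            # Optimization: jump straight to the next required first character.
            start = subject.find(first_char, start)
            if start == -1:
                return None
        if regex.match(subject, start):
            return start
        # After a failure, step over a whole CRLF pair, not a single character.
        start += 2 if subject.startswith("\r\n", start) else 1
    return None


data = "\r\na"
print(toy_search(r"\na", data, first_char="\n"))  # optimization applies
print(toy_search(r"\na", data))                   # same pattern, no optimization
print(toy_search(r"(\n|\r)a", data))              # no fixed first character
print(toy_search(r"[\r\n]a", data))               # character class, same effect
```

In this model only the first call finds a match (at offset 1). The other three start at offset 0, fail, and then step past the \n together with the \r, which mirrors the difference between /\na/<crlf> and /(\n|\r)a/<crlf> discussed above.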