Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Sheri
Date:
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing

Hi Philip,

Philip Hazel wrote:
> On Fri, 17 Aug 2007, Sheri wrote:
>
>
>> data> \r\na
>> What allows /\na/<crlf> to match the data but not /(\n|\r)a/<crlf>?
>>
>
> Oh dear, this shows up yet another darn complication of using CRLF. What
> a pain! I'm beginning to regret implementing the CRLF stuff...
>
> At first I expected /\na/<crlf> not to match \r\na because, after having
> failed to match at the first character, it would skip over the whole
> CRLF. However, there's an optimization in PCRE. When it can, it records
> the first byte that must be present for any match. In this case, it
> knows the match must begin with \n, so it skips along to there right at
> the start before trying to match. Sigh.
>
> In the /(\n|\r)a/<crlf> case there is no "fixed first character", so it
> starts the first match at offset 0. You'd get the same effect with
> /[\r\n]a/<crlf>.
>
>
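For anyone following along, the difference described above can be reproduced with a toy model. This is only a minimal sketch in Python, not PCRE's actual code: `search_crlf` and its `first_char` parameter are hypothetical names, and Python's `re` is used purely to attempt an anchored match at each start offset. It assumes that the start offset steps over a whole CRLF after a failed attempt, and that a known first character lets the search jump straight to it.

```python
import re

def search_crlf(pattern, data, first_char=None):
    """Toy model of an unanchored search with CRLF as the newline.

    first_char, if given, stands in for the "fixed first character"
    optimization: the search jumps to its next occurrence before
    each match attempt.
    """
    rx = re.compile(pattern)
    pos = 0
    while pos <= len(data):
        if first_char is not None:
            nxt = data.find(first_char, pos)
            if nxt < 0:
                return None          # required first character never occurs
            pos = nxt                # may land in the middle of a CRLF
        m = rx.match(data, pos)      # anchored attempt at this offset
        if m:
            return m.span()
        # No match here: advance, treating CRLF as a single unit.
        pos += 2 if data.startswith("\r\n", pos) else 1
    return None

data = "\r\na"
print(search_crlf(r"\na", data, first_char="\n"))  # (1, 3)
print(search_crlf(r"(\n|\r)a", data))              # None
print(search_crlf(r"[\r\n]a", data))               # None
```

In this model the first pattern matches only because the jump to \n bypasses the CRLF skip; the other two start at offset 0, fail, and then step over both characters.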
>> Why does the use of alternation cause No match? Doesn't it seem they
>> would both fail at the start of the data?
>>
>
> Indeed it does. Another Sigh. I *could* discard the optimization if the
> first byte is \n and <crlf> is in effect, I suppose. However, I'm not
> sure that tinkering at the edges like this is right. As I said
> previously, I really don't think that searching for explicit \r or \n
> characters is sensible with <CRLF> in effect.
>
>
>> It seems to me that when searching for \r or \n (or any other way of
>> specifying carriage return or linefeed, e.g., \x0a) the user is
>> searching for non composite characters rather than line breaks.
>>
>
> Indeed.
>
>
>> Because there is no metacharacter for matching crlfs or line breaks in
>> general,
>>
>
> But there is! From release 7.0 \R matches any Unicode linebreak
> sequence.
>
>
I can't recommend that people use \R for that purpose because it will
match non-linebreaks as well. In non-UTF8 data it matches vertical tabs
and the horizontal ellipsis. I think we had a previous discussion about
it: I suggested that the meaning of \R be made to vary with the line
ending in effect, but that didn't fly.
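To illustrate, here is a rough Python sketch of what \R accepts in non-UTF8 mode. Python's `re` has no \R, so the bytes pattern below is my understanding of the documented equivalent (PCRE uses an atomic group; a plain group behaves the same for these single-sequence checks). `R_NON_UTF8` is a made-up name, and the Windows-1252 decoding is an assumption about the data's code page.

```python
import re

# Approximation of \R for non-UTF8 data: CRLF, LF, VT, FF, CR, or 0x85.
R_NON_UTF8 = re.compile(rb"(?:\r\n|[\n\x0b\x0c\r\x85])")

for seq in (b"\r\n", b"\n", b"\r", b"\x0b", b"\x85", b"a"):
    print(seq, bool(R_NON_UTF8.fullmatch(seq)))

# In Windows-1252, byte 0x85 is the horizontal ellipsis (U+2026).
print(b"\x85".decode("cp1252") == "\u2026")
```

So a pattern relying on \R would also stop at a vertical tab, or at an ellipsis in Windows-1252 text.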
>> Could you exempt change 46 when dollar endonly is in effect (when the
>> pattern is not multiline)? Isn't that set by posix?
>>
>
> The POSIX interface doesn't set any specific options.
>
>
>> I really think it should be possible to get the same results with any of
>> the newline options (as default), when the pattern lacks any ^, $, dot
>> or \Z and doesn't match the empty string. Perhaps tying it to dollar
>> endonly option could achieve that.
>>
>
> I think you have to say "lacks \r and \n" as well - remember, there is
> \R if people want to match linebreaks explicitly.
>
> I suppose one could say "if the pattern contains any explicit \r or \n
> characters, don't do the change 46 thing". That seems to me to be
> such a special-case thing that doesn't really feel "clean", but perhaps
> it's the best that can be done.
>
> Philip
>
>
Will give that some thought and get back to you later. Also will test
and review the character class vs alternation approach (for user
patterns) as currently implemented.

Regards,
Sheri
