Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Philip Hazel
Date:  
To: Sheri
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
On Fri, 17 Aug 2007, Sheri wrote:

> data> \r\na
> What allows /\na/<crlf> to match the data but not /(\n|\r)a/<crlf>?


Oh dear, this shows up yet another darn complication of using CRLF. What
a pain! I'm beginning to regret implementing the CRLF stuff...

At first I expected /\na/<crlf> not to match \r\na because, after having
failed to match at the first character, it would skip over the whole
CRLF. However, there's an optimization in PCRE. When it can, it records
the first byte that must be present for any match. In this case, it
knows the match must begin with \n, so it skips along to there right at
the start before trying to match. Sigh.

In the /(\n|\r)a/<crlf> case there is no "fixed first character", so it
starts the first match at offset 0. You'd get the same effect with
/[\r\n]a/<crlf>.
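
The two cases can be mimicked with a toy model. This is only a sketch in
Python, not PCRE's actual code: `toy_search` and its `first_char` argument
are made-up names, Python's `re` stands in for the anchored match attempt at
each offset, and the "step over the whole CRLF after a failure" rule is
assumed to behave as described above.

```python
import re

def toy_search(pattern, data, first_char=None):
    """Toy unanchored search with CRLF as the newline convention.

    first_char models the "required first byte" optimization: when it is
    given, the scan jumps straight to the next occurrence of that character
    before each match attempt. (Hypothetical helper, not a PCRE function.)
    """
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(data):
        if first_char is not None:
            nxt = data.find(first_char, pos)
            if nxt == -1:
                return None          # required first char never appears
            pos = nxt
        m = regex.match(data, pos)   # anchored attempt at this offset
        if m:
            return m.start()
        # After a failure, step over a whole CRLF pair rather than one char.
        pos += 2 if data.startswith("\r\n", pos) else 1
    return None

data = "\r\na"
print(toy_search(r"\na", data, first_char="\n"))  # jumps to the \n first
print(toy_search(r"\na", data))                   # no optimization
print(toy_search(r"(\n|\r)a", data))              # no fixed first character
```

In this model the first call finds a match at offset 1, while the other two
fail at offset 0, skip the CRLF pair, and never get to try offset 1.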

> Why does the use of alternation cause No match? Doesn't it seem they
> would both fail at the start of the data?


Indeed it does. Another Sigh. I *could* discard the optimization if the
first byte is \n and <crlf> is in effect, I suppose. However, I'm not
sure that tinkering at the edges like this is right. As I said
previously, I really don't think that searching for explicit \r or \n
characters is sensible with <crlf> in effect.

> It seems to me that when searching for \r or \n (or any other way of
> specifying carriage return or linefeed, e.g., \x0a) the user is
> searching for non composite characters rather than line breaks.


Indeed.

> Because there is no metacharacter for matching crlfs or line breaks in
> general,


But there is! From release 7.0 \R matches any Unicode linebreak
sequence.
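
For illustration, here is roughly what \R covers, spelled out as an
explicit alternation. The sketch uses Python's `re` module, which (as far as
I know) has no \R of its own, so the name `LINEBREAK` and the helper below
are assumptions for the example, not anything in PCRE. The CRLF pair is
listed first so that it is consumed as a single unit.

```python
import re

# Approximate spelled-out equivalent of \R: CRLF as one unit, otherwise any
# single character from LF, VT, FF, CR, NEL, LS, PS.
LINEBREAK = r"(?:\r\n|[\n\x0b\x0c\r\x85\u2028\u2029])"

def linebreak_then_a(data):
    """Return the span of 'linebreak followed by a', or None."""
    m = re.search(LINEBREAK + "a", data)
    return m.span() if m else None

for sample in ("\r\na", "\na", "\ra", "xa"):
    print(repr(sample), linebreak_then_a(sample))
```

Unlike an explicit \n, this matches the whole CRLF in "\r\na" starting at
offset 0, so the result does not depend on where the scan resumes after a
failed attempt.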

> Could you exempt 46 when dollar endonly is in effect (when pattern is
> not multiline). Isn't that set by posix?


The POSIX interface doesn't set any specific options.

> I really think it should be possible to get the same results with any of
> the newline options (as default), when the pattern lacks any ^, $, dot
> or \Z and doesn't match the empty string. Perhaps tying it to dollar
> endonly option could achieve that.


I think you have to say "lacks \r and \n" as well - remember, there is
\R if people want to match linebreaks explicitly.

I suppose one could say "if the pattern contains any explicit \r or \n
characters, don't do the change 46 thing". That seems to me to be such a
special-case thing that it doesn't really feel "clean", but perhaps
it's the best that can be done.
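
A rough sketch of that rule, again as a toy Python model rather than
anything in PCRE: `names_cr_or_lf` is a hypothetical and deliberately crude
textual check (it only looks for \r, \n, \x0a and \x0d escapes), and the
"change 46 thing" is assumed to be the step over a whole CRLF pair after a
failed attempt.

```python
import re

def names_cr_or_lf(pattern):
    """Crude check: does the pattern text spell out \\r, \\n, \\x0a or \\x0d?"""
    return re.search(r"\\[rn]|\\x0[aAdD]", pattern) is not None

def toy_search(pattern, data):
    """Toy unanchored search, CRLF newline convention, with the proposed rule."""
    regex = re.compile(pattern)
    # Proposed rule: only step over a whole CRLF when the pattern does not
    # mention \r or \n explicitly.
    skip_crlf = not names_cr_or_lf(pattern)
    pos = 0
    while pos <= len(data):
        m = regex.match(data, pos)   # anchored attempt at this offset
        if m:
            return m.start()
        if skip_crlf and data.startswith("\r\n", pos):
            pos += 2                 # treat CRLF as one unit
        else:
            pos += 1                 # plain one-character step
    return None

data = "\r\na"
print(toy_search(r"(\n|\r)a", data))  # explicit \n and \r: no CRLF skip
print(toy_search(r"\sa", data))       # no explicit \r or \n: CRLF skipped
```

Under these assumptions /(\n|\r)a/ now matches at offset 1, whereas /\sa/
still skips the CRLF pair and fails, which rather shows why the rule feels
like a special case.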

Philip

--
Philip Hazel, University of Cambridge Computing Service.