Re: [pcre-dev] Newline feature request -- matching metachara…

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Newline feature request -- matching metacharacter?
Philip Hazel wrote:
> On Fri, 20 Apr 2007, Sheri wrote:
>
>
>> Here's a revolutionary thought:
>>
>> Make \R match whichever linebreak is in effect. So
>> if <lf> it would match \n,
>> if <cr> it would match \r,
>> if <crlf> it would match (?>\r\n),
>> if <anycrlf> it would match (?>\r\n|\n|\r),
>> if <any> it would match (?>\r\n|\n|\x0b|\f|\r|\x85)
>>
>> Maybe this way it would be possible to continue to allow the newline
>> option to change between compile and exec and still have \R match (as
>> foolish users might expect) only the linebreaks in effect.
>>
>
> I had a similar thought, but I don't like it.
>
> The choice of linebreak affects the behaviour of ^ in multiline mode, $
> in all modes, and . in non-dotall mode. I do not think it should be
> extended to affect anything else. I have a gut feeling that that is just
> going to give trouble later.
>
> I then thought about who might be using \R. For myself, I think it is
> very unlikely that I would write a pattern that both used the "newline"
> features of ^ and $ and also contained explicit newline matching such as
> \n or \R. It would be either one or the other.
>

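For reference, the proposal quoted above can be written out as a small lookup table. This is only a sketch in Python, whose re module has neither \R nor PCRE's newline options, so the names PROPOSED_R and find_breaks and the setting keys are purely illustrative. The alternations are copied from the list above, with plain non-capturing groups standing in for the atomic groups (?>...), which older Python versions do not accept.

```python
import re

# Hypothetical table: newline setting -> what \R would match under the
# proposal. Python's re has no \R, so each alternative is spelled out.
# (?:...) is used instead of PCRE's atomic (?>...) for portability.
PROPOSED_R = {
    "lf": r"\n",
    "cr": r"\r",
    "crlf": r"(?:\r\n)",
    "anycrlf": r"(?:\r\n|\n|\r)",
    "any": r"(?:\r\n|\n|\x0b|\f|\r|\x85)",
}


def find_breaks(setting, text):
    """Return every linebreak in text that the proposed \\R would match."""
    return re.findall(PROPOSED_R[setting], text)


# One string containing CRLF, LF, CR and NEL (\x85), in that order.
sample = "a\r\nb\nc\rd\x85e"

for setting in PROPOSED_R:
    print(setting, find_breaks(setting, sample))
```

Because "\r\n" is listed first in each alternation, a CRLF pair is reported as one linebreak rather than two.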
Maybe not in the same patterns as those that use "$", but it would be
used very frequently on Windows: firstly because there is no
metacharacter for "\r\n", and secondly because we manipulate files with
both Unix and Windows-style line ends. The way I generally address it
myself is by using "\r?\n". I use an editor (with no Unicode support)
that has implemented PCRE. It runs in multiline mode by default and has
recently been updated to version 7. Everyone was very pleased with the
new \R metacharacter. The only fly in the ointment is that occasionally
it is going to match those other characters, which are not line endings
in our files. I don't see what the point would be of searching for
Unicode linebreaks in non-Unicode files (this must have some utility
that I'm just not grasping). There is no question that \R finds them,
but in non-Unicode files they are false positives, since on Windows
those characters are used for other purposes (e.g., horizontal
ellipsis). Oh well.
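
A small illustration of the false positive, sketched in Python rather than PCRE. Python's re module has no \R, so the "any"-style alternation is written out by hand as an emulation, and the sample text and the names ANY_STYLE and CRLF_OR_LF are made up for the example. It assumes the file is encoded as Windows-1252, where byte 0x85 is the horizontal ellipsis.

```python
import re

# Made-up sample: an ellipsis followed by a Windows line end, then a Unix
# one. In Windows-1252 the ellipsis (U+2026) is stored as byte 0x85.
data = "Wait\u2026\r\nDone\n".encode("cp1252")

# Hand-written emulation of what \R matches on 8-bit data.
ANY_STYLE = re.compile(rb"(?:\r\n|\n|\x0b|\f|\r|\x85)")

# The workaround mentioned above.
CRLF_OR_LF = re.compile(rb"\r?\n")

print(ANY_STYLE.findall(data))   # includes the 0x85 byte of the ellipsis
print(CRLF_OR_LF.findall(data))  # only the two real line ends
```

The first pattern reports three "linebreaks" in a two-line file, because it cannot tell the ellipsis byte from a NEL.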

Regards,
Sheri

> If I wanted to search a string for any Unicode newline sequence, I would
> use \R but I wouldn't care what linebreak setting was in effect, because
> it wouldn't matter. That is why I think tying up linebreak and \R is not
> the right thing to do.
>
> If I wanted to search for an internal newline, defined according to the
> current linebreak setting, then I would use $[^x] in multiline mode.
> $. won't work unless you also set dotall mode. Or possibly $[^x][^x]?^
> I suppose.
>
> Philip
>
>
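
The $[^x] idea quoted above can be tried out in Python as a rough sketch. Python's re module treats only "\n" as a line end and has no equivalent of PCRE's newline setting, so this shows the LF case only; the CRLF variant $[^x][^x]?^ is not reproduced here. The sample text and the helper name internal_newlines are made up for the example.

```python
import re

# Made-up sample with two internal newlines and none at the end.
text = "one\ntwo\nthree"


def internal_newlines(pattern, subject):
    """Return the substrings matched by pattern in subject."""
    return re.findall(pattern, subject)


# Multiline mode: $ matches just before each "\n", and [^x] then consumes
# it, since a negated class matches a newline as well.
print(internal_newlines(r"(?m)$[^x]", text))

# $. finds nothing unless dotall is also set, because . skips newlines.
print(internal_newlines(r"(?m)$.", text))
print(internal_newlines(r"(?ms)$.", text))
```

At the very end of the string $ still matches, but there is no character left for [^x] to consume, so only internal newlines are returned.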