Re: [pcre-dev] Newline feature request -- matching metachara…

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Newline feature request -- matching metacharacter?
Philip Hazel wrote:
>
> (Nothing to stop PCRE users from doing the same, of course, but maybe
> that's an impractical suggestion.) At least, I think that's how it deals
> with CR, LF, CRLF line endings. I don't know what it does about the
> other Unicode line breakers.
>
>

It would break the use of offsets and startoffsets, or make them much
more complicated. Remember that \r\n is two bytes, not one. It would also
break PCRE user patterns that expect \n to be \x0a.
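
A rough sketch of the offset problem, using Python's re module as a stand-in rather than PCRE itself. The normalization step is only an assumption about how such a newline translation might be done:

```python
import re

# Subject text with CRLF line endings, kept as bytes so offsets are byte offsets.
subject = b"one\r\ntwo\r\nthree"

# Hypothetical preprocessing step: collapse every CRLF to a single LF.
normalized = subject.replace(b"\r\n", b"\n")

raw_start = re.search(rb"three", subject).start()
norm_start = re.search(rb"three", normalized).start()

# Each collapsed CRLF shifts every later offset down by one byte.
print(raw_start, norm_start)  # 10 8
```

An offset found in the translated text no longer points at the same place in the caller's original buffer, so any startoffset passed back in would have to be mapped by hand.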

>> You invented \R to service line breaks on Unicode.
>>
>
> <pedantry>
> I didn't invent it; the Unicode folks did. I just implemented it.
> </pedantry>
>

OK, I see they suggested the letter "R".

> Any other opinions on this? The classic way out of this dilemma would be
> to invent a new PCRE option such as PCRE_BACKSLASH_R_IS_ASCII_ONLY so
> that the behaviour could be controlled by the application.
>
>

If it had to be an option on every compile, it would be easier to write
a pattern without the metacharacter; the metacharacter wouldn't be a
viable shortcut. And what about users of the POSIX API? What would their
\R be? I guess you'd have to have a DLL default of
PCRE_BACKSLASH_R_IS_ASCII_ONLY vs. PCRE_BACKSLASH_R_IS_UNICODE. Ugh. Then
the native API would still need compile options to override the DLL default.

I think it should be enough to default to PCRE_BACKSLASH_R_IS_UNICODE
if the default linebreak is ANY, and to PCRE_BACKSLASH_R_IS_ASCII_ONLY
if the default linebreak is ANYCRLF or CRLF; and if a compile option
for <any> or <anycrlf> is processed, to automatically set
PCRE_BACKSLASH_R_IS_UNICODE or PCRE_BACKSLASH_R_IS_ASCII_ONLY respectively.

Then the only question is what's appropriate for \R when the newline in
effect is LF or CR. I suppose it should be PCRE_BACKSLASH_R_IS_UNICODE,
because for both of those there are alternative metacharacters for
matching the linebreak itself (\n or \r). IOW, \R isn't really needed
for either of those.
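
The defaults proposed above could be sketched roughly as follows. This is Python for illustration only: the enum and function names are hypothetical, the two behaviours mirror the option names floated in this thread rather than any existing PCRE API, and the expansions are my assumption of what "ASCII only" and "Unicode" would match:

```python
import re
from enum import Enum

class Newline(Enum):
    """Newline convention in effect (hypothetical names)."""
    CR = "cr"
    LF = "lf"
    CRLF = "crlf"
    ANYCRLF = "anycrlf"
    ANY = "any"

class BackslashR(Enum):
    """The two proposed behaviours for \\R (hypothetical names)."""
    ASCII_ONLY = "ascii_only"
    UNICODE = "unicode"

def default_backslash_r(newline):
    """Pick the \\R behaviour from the newline setting, as proposed above."""
    if newline in (Newline.CRLF, Newline.ANYCRLF):
        return BackslashR.ASCII_ONLY
    # ANY, and also LF and CR (which already have \n and \r), get Unicode.
    return BackslashR.UNICODE

# Assumed expansions. PCRE treats \R as an atomic group; a plain
# non-capturing group is used here, so backtracking may differ.
EXPANSION = {
    BackslashR.ASCII_ONLY: r"(?:\r\n|\n|\r)",
    BackslashR.UNICODE: r"(?:\r\n|[\n\x0b\f\r\x85\u2028\u2029])",
}

pattern = EXPANSION[default_backslash_r(Newline.ANYCRLF)]
print(re.fullmatch(pattern, "\r\n") is not None)    # True
print(re.fullmatch(pattern, "\u2028") is not None)  # False
```

With this scheme nobody has to set a separate \R option on every compile; an explicit newline option (or the build default) decides it.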

So maybe you could tie the behaviour of \R to the choice of linebreak
compile option (including the linebreak default) instead of to UTF-8 mode.

> What do others on this list think?
>
> Philip
>
>

Regards,
Sheri