Re: [pcre-dev] Newline feature request -- matching metachara…

Author: Philip Hazel
Date:
To: pcre-dev
Subject: Re: [pcre-dev] Newline feature request -- matching metacharacter?
On Fri, 20 Apr 2007, Sheri wrote:

> Perl already has a metacharacter that matches whichever line ending the
> (ascii) file uses: \n. Perl is smart enough to do it based on the data
> content (I know there's more to it under the covers, but that is how it
> looks to the user).


As far as I know, Perl does it by translating line endings when it reads
and writes files, so internally it sees one standardized line ending.
(Nothing to stop PCRE users from doing the same, of course, but maybe
that's an impractical suggestion.) At least, I think that's how it deals
with CR, LF, CRLF line endings. I don't know what it does about the
other Unicode line breakers.
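
To make the "do the same" suggestion concrete, here is a minimal sketch of
normalizing CR, LF, and CRLF to a single LF before matching. It is in Python
rather than against the PCRE API, and the helper name normalize_newlines is
made up for illustration; it is not how Perl itself is implemented.

```python
import re

def normalize_newlines(text: str) -> str:
    """Map CRLF and lone CR to LF.

    CRLF is replaced first so that the pair becomes one LF, not two.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")

# Hypothetical sample input that mixes all three conventions.
sample = "one\r\ntwo\rthree\nfour"
normalized = normalize_newlines(sample)

# After normalization, a pattern only has to know about "\n".
lines = re.findall(r"^.*$", normalized, flags=re.MULTILINE)
```

The cost is a full pass over (and a copy of) the subject text, which may be
why this is impractical for some applications.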

> You invented \R to service line breaks on Unicode.


<pedantry>
I didn't invent it; the Unicode folks did. I just implemented it.
</pedantry>

> PCRE doesn't process Unicode except in utf8 mode.


In some sense yes, and in some sense, since ASCII and 8859-1 are
subsets, no. :-)

> T'would be convenient to have a metacharacter that matches Ascii line
> breaks. T'would address a shortcoming PCRE has already in comparison
> with Perl. If you have a philosophical objection to \R, maybe you could
> choose something else.


The problem with "choosing something else" is compatibility with other
regex implementations. I'm very loath to use up another \ escape at
random.

Any other opinions on this? The classic way out of this dilemma would be
to invent a new PCRE option such as PCRE_BACKSLASH_R_IS_ASCII_ONLY so
that the behaviour could be controlled by the application.
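
As a rough illustration of what such an option might make \R mean, the
sketch below writes the match out as an explicit alternation, in Python
rather than PCRE. It assumes "ASCII only" means CRLF, LF, or CR and nothing
else; the option name above is only a proposal, and the names
ASCII_LINE_BREAK and split_lines are invented for this example.

```python
import re

# Assumption: an "ASCII-only" line break is CRLF (as one unit), or a lone
# LF, or a lone CR. CRLF is listed first so the pair is not split in two.
ASCII_LINE_BREAK = re.compile(r"\r\n|\n|\r")

def split_lines(text: str) -> list:
    """Split text at every ASCII line break, whichever convention is used."""
    return ASCII_LINE_BREAK.split(text)

parts = split_lines("one\r\ntwo\rthree\nfour")
```

Under this assumption, other Unicode line breaks such as NEL (U+0085) are
treated as ordinary characters and do not end a line.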

What do others on this list think?

Philip

--
Philip Hazel, University of Cambridge Computing Service.