On Fri, 20 Apr 2007, Sheri wrote:
> Perl already has a metacharacter that matches whichever line ending the
> (ascii) file uses: \n. Perl is smart enough to do it based on the data
> content (I know there's more to it under the covers, but that is how it
> looks to the user).
As far as I know, Perl does it by translating line endings when it reads
and writes files, so that internally it always sees a single standard
newline character. (There is nothing to stop PCRE users from doing the
same, of course, but maybe that's an impractical suggestion.) At least, I
think that's how it deals with CR, LF, and CRLF line endings. I don't
know what it does about the other Unicode line-break characters.
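
For illustration only, here is a minimal sketch of the "translate first,
then match" approach, written in Python rather than against the PCRE API.
The helper names normalize_newlines and split_lines are hypothetical, and
the sketch handles only CR, LF, and CRLF, not the other Unicode line
breaks.

```python
import re

def normalize_newlines(text: str) -> str:
    """Hypothetical helper: convert CRLF and lone CR line endings to LF."""
    # Handle CRLF first so that the pair does not become two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def split_lines(text: str) -> list:
    """Split on \\n only, relying on the normalization done beforehand."""
    return re.split(r"\n", normalize_newlines(text))

if __name__ == "__main__":
    mixed = "one\r\ntwo\rthree\nfour"
    print(split_lines(mixed))
```

Once the subject has been normalized, a plain \n in the pattern is enough,
at the cost of an extra pass over (and a copy of) the data.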
> You invented \R to service line breaks on Unicode.
<pedantry>
I didn't invent it; the Unicode folks did. I just implemented it.
</pedantry>
> PCRE doesn't process Unicode except in utf8 mode.
In some sense yes, and in some sense no, since ASCII and ISO 8859-1 are
subsets of Unicode. :-)
> T'would be convenient to have a metacharacter that matches Ascii line
> breaks. T'would address a shortcoming PCRE has already in comparison
> with Perl. If you have a philosophical objection to \R, maybe you could
> choose something else.
The problem with "choosing something else" is compatibility with other
regex implementations. I'm very loath to use up another \ escape at
random.
Any other opinions on this? The classic way out of this dilemma would be
to invent a new PCRE option such as PCRE_BACKSLASH_R_IS_ASCII_ONLY so
that the behaviour could be controlled by the application.
What do others on this list think?
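
To make the difference such an option would control more concrete, here
is a rough Python emulation of the two behaviours using explicit
alternations. It assumes that "ASCII only" would mean CR, LF, and the
CRLF pair; that reading, and the names ASCII_BREAK, UNICODE_BREAK, and
split_on_breaks, are assumptions made for this sketch, not part of PCRE.

```python
import re

# Assumption: "ASCII only" means CR, LF, and the CRLF pair.
ASCII_BREAK = r"(?:\r\n|\n|\r)"

# The Unicode line breaks: the above plus VT, FF, NEL, LS, and PS.
UNICODE_BREAK = r"(?:\r\n|[\n\x0b\x0c\r\x85\u2028\u2029])"

def split_on_breaks(text: str, ascii_only: bool) -> list:
    """Split text on line breaks, emulating one of the two behaviours."""
    pattern = ASCII_BREAK if ascii_only else UNICODE_BREAK
    return re.split(pattern, text)

if __name__ == "__main__":
    sample = "a\r\nb\u2028c\x85d"
    print(split_on_breaks(sample, ascii_only=True))
    print(split_on_breaks(sample, ascii_only=False))
```

With the restricted form, U+2028 and U+0085 in the sample are treated as
ordinary data; with the full form they end a line. The groups here are
ordinary non-capturing groups, so backtracking into a CRLF pair can differ
from a real \R.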
Philip
--
Philip Hazel, University of Cambridge Computing Service.