Author: Philip Hazel
Date:
To: pcre-dev
Subject: Re: [pcre-dev] Newline feature request -- matching metacharacter?
On Thu, 19 Apr 2007, Sheri wrote:
> Hi Philip,
>
> Any way we could get a matching metacharacter, like the new \R? :) Or
> better yet make \R match
>
> "(?>\r\n|\n|\r)" when not in utf8 mode and the "any" alternates when in utf8 mode?
I don't think so, sorry. I took \R from the Unicode web page that talks
about regular expressions, and I understand that it will also probably
be in Perl 5.10. It is clearly a Unicode thing, so must always match all
the Unicode newlines. The fact that you can't have characters > 255 in
non-UTF-8 mode is not, IMO, relevant. I know people conflate the use of
UTF-8 and Unicode, but in my mind they are separate issues.
UTF-8 is a way of encoding numbers with values from 0 to 0x7FFFFFFF as
variable-length strings of bytes. OK, the RFC does argue its case in terms of
ISO 10646 rather than Unicode, but I think it's helpful to keep it
separate and think of it as a kind of compression method.
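To illustrate the "numbers as variable-length byte strings" view, here is a minimal sketch of the original 1-to-6-byte UTF-8 scheme covering 0 to 0x7FFFFFFF. The function name utf8_encode_extended is hypothetical, and this is not PCRE's own code; it only shows the bit layout.

```python
def utf8_encode_extended(value: int) -> bytes:
    """Encode an integer in 0..0x7FFFFFFF with the original 1-to-6-byte UTF-8 layout.

    Hypothetical helper for illustration only; not taken from PCRE.
    """
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if value < 0x80:
        # Values below 0x80 are stored as a single byte, unchanged.
        return bytes([value])

    # Largest value that fits in a 2-, 3-, 4-, 5-, or 6-byte sequence,
    # paired with the marker bits of the corresponding lead byte.
    limits = [0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    leads = [0xC0, 0xE0, 0xF0, 0xF8, 0xFC]

    for extra, (limit, lead) in enumerate(zip(limits, leads), start=1):
        if value <= limit:
            tail = []
            for _ in range(extra):
                # Each continuation byte carries 6 payload bits: 10xxxxxx.
                tail.append(0x80 | (value & 0x3F))
                value >>= 6
            # Whatever bits remain go into the lead byte.
            return bytes([lead | value] + tail[::-1])

    raise AssertionError("unreachable: range was checked above")
```

For example, utf8_encode_extended(0x85) gives the two bytes C2 85, which is how the 0x85 newline character appears in a UTF-8 subject string.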
Unicode is a character set that uses code points from 0 to whatever
its limit is - less than UTF-8 can handle - 0x10FFFF if
I recall rightly.
When PCRE is not in UTF-8 mode, it can still get characters with values
up to 255, and this includes the Unicode 0x85 character (NEL) - which is of
course the same character in ISO 8859-1.
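A rough sketch of that point, written with Python's re module on byte strings since it has no \R of its own. The pattern below is an assumed stand-in for the subset of \R that fits in one byte (CRLF, LF, VT, FF, CR and 0x85); the names NEWLINE_BYTES and split_lines are hypothetical, and unlike PCRE's \R the pattern is not wrapped in an atomic group.

```python
import re

# Byte-mode approximation of \R: try CRLF first so it is consumed as one
# newline, then fall back to any single newline byte, including 0x85 (NEL).
# Assumption: this mirrors only the \R alternatives with values <= 255.
NEWLINE_BYTES = re.compile(rb"\r\n|[\n\x0b\x0c\r\x85]")


def split_lines(data: bytes) -> list:
    """Split a byte string at every newline sequence matched above."""
    return NEWLINE_BYTES.split(data)
```

Calling split_lines(b"a\r\nb\x85c") treats both the CRLF pair and the lone 0x85 byte as line breaks, with no UTF-8 decoding involved.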
Philip
--
Philip Hazel, University of Cambridge Computing Service.