On Wed, 21 Nov 2007, Juergen Leising wrote:
> I wonder, why certain hex codes do not match certain
> UTF-8 script names. For example, with pcre-7.4:
>
> re> /^[\p{Arabic}]/8
> data> \x{06e9}
> 0: \x{6e9}
> data> \x{0654}
> No match
> data> \x{0658}
> No match
> data> \x{0656}
> No match
>
>
> I had a glance at UnicodeData.txt from
>
> ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> and therefore expected these numbers to match:
>
> 0654;ARABIC HAMZA ABOVE;Mn;230;NSM;;;;;N;;;;;
> 0656;ARABIC SUBSCRIPT ALEF;Mn;220;NSM;;;;;N;;;;;
> 0658;ARABIC MARK NOON GHUNNA;Mn;230;NSM;;;;;N;;;;;
>
> But they don't. Do I miss something? Wrong table or version or
> syntax or whatever?
I have now looked at this, and I'm afraid you *have* missed something
(and so did I when I took a quick look before). The definitions of which
characters belong to which scripts are in the Unicode file Scripts.txt,
not UnicodeData.txt. In the line

    0654;ARABIC HAMZA ABOVE;Mn;230;NSM;;;;;N;;;;;

the string "ARABIC HAMZA ABOVE" is the name of the character; it doesn't
say anything about the script to which it belongs (if any). In the
Scripts.txt file, 0654 etc. are not listed as being in the Arabic script.
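
To make the distinction concrete, here is a minimal Python sketch (not part
of PCRE) that splits the UnicodeData.txt record quoted above into its
semicolon-separated fields. The field labels and the helper name
parse_record are choices made for this illustration only. The point is that
the record carries a character name, a general category and so on, but no
field that names a script.

```python
# The record from the original message, copied verbatim.
RECORD = "0654;ARABIC HAMZA ABOVE;Mn;230;NSM;;;;;N;;;;;"

# Labels for the first five fields of a UnicodeData.txt record.
# These label strings are illustrative names chosen for this sketch.
FIELD_NAMES = (
    "code_point",
    "name",
    "general_category",
    "combining_class",
    "bidi_class",
)


def parse_record(line):
    """Split one record on ';' and label its first five fields."""
    fields = line.strip().split(";")
    return dict(zip(FIELD_NAMES, fields))


record = parse_record(RECORD)
for label, value in record.items():
    print(f"{label:17} {value}")
```

The word "ARABIC" appears only inside the name field, which is why it says
nothing about the Script property.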
Having said that, I notice there are still some differences between PCRE
and Perl (though Perl also doesn't match 0654 as Arabic), and I will do
some more double checking just in case there is a problem in PCRE.
Note also that Unicode Scripts are slightly odd things. I found this
text in the description on the Unicode site:

    A character is assigned a specific Unicode script value (as opposed to
    Common or Inherited) only when it is clearly not used with other
    scripts. This facilitates the use of the Script property for common
    tasks such as regular expressions, but means that some characters that
    are definite members of a given script by their graphology
    nevertheless are assigned one of the generic values.

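
As an illustration of how the generic values sit alongside specific
scripts, the following Python sketch parses lines in the Scripts.txt layout
and looks up a code point. The three sample lines are an assumed excerpt in
that file's layout and should be checked against the real Scripts.txt for
the Unicode version in use. The helper names parse_scripts and script_of
are hypothetical, and returning None for unlisted code points is a
simplification made for this sketch.

```python
# Assumed excerpt in the Scripts.txt layout: "range ; script # comment".
# Check these lines against the real file before relying on them.
SAMPLE = """\
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0300..036F    ; Inherited # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
"""


def parse_scripts(text):
    """Return a list of (start, end, script) tuples from Scripts.txt-style text."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        span, script = (part.strip() for part in data.split(";"))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point has no ".."
        ranges.append((start, end, script))
    return ranges


def script_of(code_point, ranges):
    """Return the script listed for code_point, or None if it is not listed."""
    for start, end, script in ranges:
        if start <= code_point <= end:
            return script
    return None


ranges = parse_scripts(SAMPLE)
for cp in (0x0035, 0x0041, 0x0301):
    print(f"U+{cp:04X} {script_of(cp, ranges)}")
```

With the real file loaded in place of SAMPLE, script_of(0x0654, ranges)
would report whatever Scripts.txt assigns to that code point.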
Regards,
Philip
--
Philip Hazel