On Thu, Nov 22, 2007 at 10:22:20AM +0000, Philip Hazel wrote:
> I can reproduce the problem, and I think I can guess its cause. In the
> table, there are some ARABIC-INDIC characters in the middle of the
> ARABIC characters. And also an AFGHANI character. I suspect that the
> program that reads the Unicode table and turns it into a table for PCRE
> to use gets confused by this, and "loses" some ARABIC characters. (...)
I see. OK. By the way, in the meantime I stumbled over further
characters that PCRE does not recognize as I would have
expected (just random tests, not a complete survey):
re> /\p{Armenian}/8
data> \x{0589} No match
UnicodeData.txt:
0589;ARMENIAN FULL STOP;Po;0;L;;;;;N;ARMENIAN PERIOD;;;;
re> /\p{Cyrillic}/8
data> \x{1D2B} No match
1D2B;CYRILLIC LETTER SMALL CAPITAL EL;Ll;0;L;;;;;N;;;;;
re> /\p{Devanagari}/8
data> \x{0970} No match
0970;DEVANAGARI ABBREVIATION SIGN;Po;0;L;;;;;N;;;;;
re> /\p{Devanagari}/8
data> \x{0965} No match
0965;DEVANAGARI DOUBLE DANDA;Po;0;L;;;;;N;;;;;
re> /\p{Devanagari}/8
data> \x{0964} No match
0964;DEVANAGARI DANDA;Po;0;L;;;;;N;;;;;
re> /\p{Georgian}/8
data> \x{10FB} No match
10FB;GEORGIAN PARAGRAPH SEPARATOR;Po;0;L;;;;;N;;;;;
re> /\p{Greek}/8
data> \x{1D68} No match
1D68;GREEK SUBSCRIPT SMALL LETTER RHO;Ll;0;L;<sub> 03C1;;;;N;;;;;
re> /\p{Greek}/8
data> \x{1D29} No match
1D29;GREEK LETTER SMALL CAPITAL RHO;Ll;0;L;;;;;N;;;;;
re> /\p{Greek}/8
data> \x{0387} No match
0387;GREEK ANO TELEIA;Po;0;ON;00B7;;;;N;;;;;
re> /\p{Katakana}/8
data> \x{30FE} No match
30FE;KATAKANA VOICED ITERATION MARK;Lm;0;L;30FD 3099;;;;N;;;;;
re> /\p{Osmanya}/8
data> \x{10486} No match
10486;OSMANYA LETTER DEEL;Lo;0;L;;;;;N;;;;;
data> \x{1049A} No match
1049A;OSMANYA LETTER U;Lo;0;L;;;;;N;;;;;
data> \x{10485} No match
10485;OSMANYA LETTER KHA;Lo;0;L;;;;;N;;;;;
re> /\p{Runic}/8
data> \x{16ED} No match
16ED;RUNIC CROSS PUNCTUATION;Po;0;L;;;;;N;;;;;
re> /\p{Thai}/8
data> \x{0E3F} No match
0E3F;THAI CURRENCY SYMBOL BAHT;Sc;0;ET;;;;;N;THAI BAHT SIGN;;;;
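
For anyone who wants to cross-check more entries, here is a minimal Python sketch that splits UnicodeData.txt lines like the ones quoted above into code point, name and general category. It is not part of PCRE, and the names `Record` and `parse_record` are made up for illustration. It only reads the first three semicolon-separated fields; the leading word of the name is just a label and is not necessarily the script that PCRE assigns to the character.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical container for the fields used in this report."""
    code_point: int
    name: str
    category: str


def parse_record(line: str) -> Record:
    """Split one UnicodeData.txt line into code point, name and category."""
    fields = line.strip().split(";")
    if len(fields) < 3:
        raise ValueError(f"not a UnicodeData.txt record: {line!r}")
    # Field 0 is the code point in hex, field 1 the character name,
    # field 2 the general category (Po, Ll, Sc, ...).
    return Record(int(fields[0], 16), fields[1], fields[2])


if __name__ == "__main__":
    samples = [
        "0589;ARMENIAN FULL STOP;Po;0;L;;;;;N;ARMENIAN PERIOD;;;;",
        "1D2B;CYRILLIC LETTER SMALL CAPITAL EL;Ll;0;L;;;;;N;;;;;",
        "0E3F;THAI CURRENCY SYMBOL BAHT;Sc;0;ET;;;;;N;THAI BAHT SIGN;;;;",
    ]
    for sample in samples:
        rec = parse_record(sample)
        # Print in the same \x{...} notation that pcretest accepts as data.
        print(f"\\x{{{rec.code_point:04X}}} {rec.category} {rec.name}")
```

The printed `\x{...}` values can be pasted as data lines into pcretest to repeat the tests above for other characters.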