Revision: 354
http://vcs.pcre.org/viewvc?view=rev&revision=354
Author: ph10
Date: 2008-07-07 17:30:33 +0100 (Mon, 07 Jul 2008)
Log Message:
-----------
Fix caseless backreferences for non-ASCII characters.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/pcre_exec.c
code/trunk/testdata/testinput6
code/trunk/testdata/testoutput6
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2008-07-07 15:44:24 UTC (rev 353)
+++ code/trunk/ChangeLog 2008-07-07 16:30:33 UTC (rev 354)
@@ -17,6 +17,10 @@
3. Change 12 for 7.7 introduced a bug in pcre_study() when a pattern contained
a group with a zero qualifier. The result of the study could be incorrect,
or the function might crash, depending on the pattern.
+
+4. Caseless matching was not working for non-ASCII characters in back
+ references. For example, /(\x{de})\1/8i was not matching \x{de}\x{fe}.
+ It now works when Unicode Property Support is available.
Version 7.7 07-May-08
Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c 2008-07-07 15:44:24 UTC (rev 353)
+++ code/trunk/pcre_exec.c 2008-07-07 16:30:33 UTC (rev 354)
@@ -158,13 +158,39 @@
if (length > md->end_subject - eptr) return FALSE;
-/* Separate the caselesss case for speed */
+/* Separate the caseless case for speed. In UTF-8 mode we can only do this
+properly if Unicode properties are supported. Otherwise, we can check only
+ASCII characters. */
if ((ims & PCRE_CASELESS) != 0)
{
+#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UCP
+ if (md->utf8)
+ {
+ USPTR endptr = eptr + length;
+ while (eptr < endptr)
+ {
+ int c, d;
+ GETCHARINC(c, eptr);
+ GETCHARINC(d, p);
+ if (c != d && c != UCD_OTHERCASE(d)) return FALSE;
+ }
+ }
+ else
+#endif
+#endif
+
+ /* The same code works when not in UTF-8 mode and in UTF-8 mode when there
+ is no UCP support. */
+
while (length-- > 0)
- if (md->lcc[*p++] != md->lcc[*eptr++]) return FALSE;
+ { if (md->lcc[*p++] != md->lcc[*eptr++]) return FALSE; }
}
+
+/* In the caseful case, we can just compare the bytes, whether or not we
+are in UTF-8 mode. */
+
else
{ while (length-- > 0) if (*p++ != *eptr++) return FALSE; }
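
Note: the UTF-8 branch added above can be illustrated outside PCRE with a minimal, self-contained sketch. Everything here is an assumption made for illustration: next_char() stands in for GETCHARINC and decodes only well-formed sequences of up to three bytes, toy_othercase() is a hypothetical stand-in for UCD_OTHERCASE that covers only ASCII, Latin-1 letters and U+0100 to U+012F, and match_ref_caseless() is not a PCRE function. As in the patch, the loop relies on both case variants occupying the same number of bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for GETCHARINC: decode one UTF-8 character at *pp
   and advance *pp past it. Assumes well-formed input of at most three bytes
   per character; four-byte sequences are not handled in this sketch. */
static unsigned int next_char(const unsigned char **pp)
{
  const unsigned char *p = *pp;
  unsigned int c = *p++;
  if (c >= 0xe0)
    {
    c = ((c & 0x0f) << 12) | ((p[0] & 0x3fu) << 6) | (p[1] & 0x3fu);
    p += 2;
    }
  else if (c >= 0xc0)
    {
    c = ((c & 0x1f) << 6) | (p[0] & 0x3fu);
    p += 1;
    }
  *pp = p;
  return c;
}

/* Hypothetical stand-in for UCD_OTHERCASE: a toy table covering ASCII,
   Latin-1 letters, and the upper/lower pairs in U+0100..U+012F only. */
static unsigned int toy_othercase(unsigned int c)
{
  if (c >= 'A' && c <= 'Z') return c + 32;
  if (c >= 'a' && c <= 'z') return c - 32;
  if (c >= 0xc0 && c <= 0xde && c != 0xd7) return c + 32;
  if (c >= 0xe0 && c <= 0xfe && c != 0xf7) return c - 32;
  if (c >= 0x100 && c <= 0x12f) return c ^ 1u;
  return c;
}

/* Compare the captured text at ref with the subject text, one character at
   a time. length is the byte length of the captured text. Returns 1 when
   every character is equal or is the other case of its counterpart. */
static int match_ref_caseless(const unsigned char *ref,
  const unsigned char *subject, size_t length)
{
  const unsigned char *end = subject + length;
  assert(ref != NULL && subject != NULL);
  while (subject < end)
    {
    unsigned int c = next_char(&subject);
    unsigned int d = next_char(&ref);
    if (c != d && c != toy_othercase(d)) return 0;
    }
  return 1;
}
```

With this sketch, comparing the UTF-8 bytes of \x{de} (C3 9E) against those of \x{fe} (C3 BE) over a length of 2 reports a match, which mirrors the /(\x{de})\1/8i case named in the ChangeLog entry. A byte-wise lookup in a 256-entry lower-casing table cannot give that result, because it treats each byte of a multibyte character independently.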
Modified: code/trunk/testdata/testinput6
===================================================================
--- code/trunk/testdata/testinput6 2008-07-07 15:44:24 UTC (rev 353)
+++ code/trunk/testdata/testinput6 2008-07-07 16:30:33 UTC (rev 354)
@@ -925,4 +925,22 @@
** Failers
\x{1d79}\x{a77d}
+/(A)\1/8i
+ AA
+ Aa
+ aa
+ aA
+
+/(\x{de})\1/8i
+ \x{de}\x{de}
+ \x{de}\x{fe}
+ \x{fe}\x{fe}
+ \x{fe}\x{de}
+
+/(\x{10a})\1/8i
+ \x{10a}\x{10a}
+ \x{10a}\x{10b}
+ \x{10b}\x{10b}
+ \x{10b}\x{10a}
+
/ End of testinput6 /
Modified: code/trunk/testdata/testoutput6
===================================================================
--- code/trunk/testdata/testoutput6 2008-07-07 15:44:24 UTC (rev 353)
+++ code/trunk/testdata/testoutput6 2008-07-07 16:30:33 UTC (rev 354)
@@ -1705,4 +1705,46 @@
\x{1d79}\x{a77d}
No match
+/(A)\1/8i
+ AA
+ 0: AA
+ 1: A
+ Aa
+ 0: Aa
+ 1: A
+ aa
+ 0: aa
+ 1: a
+ aA
+ 0: aA
+ 1: a
+
+/(\x{de})\1/8i
+ \x{de}\x{de}
+ 0: \x{de}\x{de}
+ 1: \x{de}
+ \x{de}\x{fe}
+ 0: \x{de}\x{fe}
+ 1: \x{de}
+ \x{fe}\x{fe}
+ 0: \x{fe}\x{fe}
+ 1: \x{fe}
+ \x{fe}\x{de}
+ 0: \x{fe}\x{de}
+ 1: \x{fe}
+
+/(\x{10a})\1/8i
+ \x{10a}\x{10a}
+ 0: \x{10a}\x{10a}
+ 1: \x{10a}
+ \x{10a}\x{10b}
+ 0: \x{10a}\x{10b}
+ 1: \x{10a}
+ \x{10b}\x{10b}
+ 0: \x{10b}\x{10b}
+ 1: \x{10b}
+ \x{10b}\x{10a}
+ 0: \x{10b}\x{10a}
+ 1: \x{10b}
+
/ End of testinput6 /