Revision: 574
http://vcs.pcre.org/viewvc?view=rev&revision=574
Author: ph10
Date: 2010-11-20 17:47:27 +0000 (Sat, 20 Nov 2010)
Log Message:
-----------
Give error if \c is followed by a byte > 127 (in ASCII/UTF-8 modes).
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcrepattern.3
code/trunk/pcre_compile.c
code/trunk/pcre_internal.h
code/trunk/pcreposix.c
code/trunk/testdata/testinput1
code/trunk/testdata/testinput2
code/trunk/testdata/testinput5
code/trunk/testdata/testoutput1
code/trunk/testdata/testoutput2
code/trunk/testdata/testoutput5
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/ChangeLog 2010-11-20 17:47:27 UTC (rev 574)
@@ -108,6 +108,11 @@
loops, in order to improve performance in some environments. At the same
time, I abstracted some of the common code into auxiliary macros to save
repetition (this should not affect the compiled code).
+
+19. If \c was followed by a multibyte UTF-8 character, bad things happened. A
+ compile-time error is now given if \c is not followed by an ASCII
+ character, that is, a byte less than 128. (In EBCDIC mode, the code is
+ different, and any byte value is allowed.)
Version 8.10 25-Jun-2010
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/doc/pcrepattern.3 2010-11-20 17:47:27 UTC (rev 574)
@@ -182,9 +182,9 @@
.rs
.sp
The backslash character has several uses. Firstly, if it is followed by a
-non-alphanumeric character, it takes away any special meaning that character
-may have. This use of backslash as an escape character applies both inside and
-outside character classes.
+character that is not a number or a letter, it takes away any special meaning
+that character may have. This use of backslash as an escape character applies
+both inside and outside character classes.
.P
For example, if you want to match a * character, you write \e* in the pattern.
This escaping action applies whether or not the following character would
@@ -192,6 +192,10 @@
non-alphanumeric with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \e\e.
.P
+In UTF-8 mode, only ASCII numbers and letters have any special meaning after a
+backslash. All other characters (in particular, those whose codepoints are
+greater than 127) are treated as literals.
+.P
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
pattern (other than in a character class) and characters between a # outside
a character class and the next newline are ignored. An escaping backslash can
@@ -225,7 +229,7 @@
one of the following escape sequences than the binary character it represents:
.sp
\ea alarm, that is, the BEL character (hex 07)
- \ecx "control-x", where x is any character
+ \ecx "control-x", where x is any ASCII character
\ee escape (hex 1B)
\ef formfeed (hex 0C)
\en linefeed (hex 0A)
@@ -237,8 +241,12 @@
.sp
The precise effect of \ecx is as follows: if x is a lower case letter, it
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
-Thus \ecz becomes hex 1A, but \ec{ becomes hex 3B, while \ec; becomes hex
-7B.
+Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while
+\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater
+than 127, a compile-time error occurs. This locks out non-ASCII characters in
+both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte
+values are valid. A lower case letter is converted to upper case, and then the
+0xc0 bits are flipped.)
.P
After \ex, from zero to two hexadecimal digits are read (letters can be in
upper or lower case). Any number of hexadecimal digits may appear between \ex{
@@ -2718,6 +2726,6 @@
.rs
.sp
.nf
-Last updated: 17 November 2010
+Last updated: 20 November 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/pcre_compile.c 2010-11-20 17:47:27 UTC (rev 574)
@@ -408,6 +408,7 @@
"different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0"
"this version of PCRE is not compiled with PCRE_UCP support\0"
+ "\\c must be followed by an ASCII character\0"
;
/* Table to identify digits and hex digits. This is used when compiling
@@ -841,7 +842,8 @@
break;
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
- This coding is ASCII-specific, but then the whole concept of \cx is
+ An error is given if the byte following \c is not an ASCII character. This
+ coding is ASCII-specific, but then the whole concept of \cx is
ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
case CHAR_c:
@@ -851,11 +853,15 @@
*errorcodeptr = ERR2;
break;
}
-
-#ifndef EBCDIC /* ASCII/UTF-8 coding */
+#ifndef EBCDIC /* ASCII/UTF-8 coding */
+ if (c > 127) /* Excludes all non-ASCII in either mode */
+ {
+ *errorcodeptr = ERR68;
+ break;
+ }
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
c ^= 0x40;
-#else /* EBCDIC coding */
+#else /* EBCDIC coding */
if (c >= CHAR_a && c <= CHAR_z) c += 64;
c ^= 0xC0;
#endif
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/pcre_internal.h 2010-11-20 17:47:27 UTC (rev 574)
@@ -1569,7 +1569,8 @@
ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
- ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
+ ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68,
+ ERRCOUNT };
/* The real format of the start of the pcre block; the index of names and the
code vector run on as long as necessary after the end. We store an explicit
Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/pcreposix.c 2010-11-20 17:47:27 UTC (rev 574)
@@ -151,6 +151,7 @@
REG_BADPAT, /* different names for subpatterns of the same number are not allowed */
REG_BADPAT, /* (*MARK) must have an argument */
REG_INVARG, /* this version of PCRE is not compiled with PCRE_UCP support */
+ REG_BADPAT, /* \c must be followed by an ASCII character */
};
/* Table of texts corresponding to POSIX error codes */
Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testinput1 2010-11-20 17:47:27 UTC (rev 574)
@@ -4076,4 +4076,7 @@
/[\x00-\xff\s]+/
\x0a\x0b\x0c\x0d
+/^\c/
+ ?
+
/-- End of testinput1 --/
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testinput2 2010-11-20 17:47:27 UTC (rev 574)
@@ -3552,4 +3552,6 @@
abc\>4
abc\>-4
+/^\cģ/
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testinput5 2010-11-20 17:47:27 UTC (rev 574)
@@ -829,4 +829,6 @@
a\x{123}aa\>5
a\x{123}aa\>6
+/^\cģ/8
+
/-- End of testinput5 --/
Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testoutput1 2010-11-20 17:47:27 UTC (rev 574)
@@ -6662,4 +6662,8 @@
\x0a\x0b\x0c\x0d
0: \x0a\x0b\x0c\x0d
+/^\c/
+ ?
+ 0: ?
+
/-- End of testinput1 --/
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testoutput2 2010-11-20 17:47:27 UTC (rev 574)
@@ -11231,4 +11231,7 @@
abc\>-4
Error -24
+/^\cģ/
+Failed: \c must be followed by an ASCII character at offset 3
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5 2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testoutput5 2010-11-20 17:47:27 UTC (rev 574)
@@ -2314,4 +2314,7 @@
a\x{123}aa\>6
Error -24
+/^\cģ/8
+Failed: \c must be followed by an ASCII character at offset 3
+
/-- End of testinput5 --/
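
The ASCII/UTF-8 branch changed in pcre_compile.c above can be sketched as a standalone function. This is a minimal illustration only, not the PCRE source: the name control_escape and the -1 return value are hypothetical, standing in for the `*errorcodeptr = ERR68` path. It assumes an ASCII execution character set and does not model the EBCDIC branch.

```c
/* Hypothetical helper, not part of PCRE: mirrors the ASCII/UTF-8 handling
   of \cx after revision 574. Returns the resulting character value, or -1
   where PCRE would now report "\c must be followed by an ASCII character". */
int control_escape(int c)
{
  if (c < 0 || c > 127) return -1;      /* non-ASCII byte: compile-time error */
  if (c >= 'a' && c <= 'z') c -= 32;    /* a lower case letter is upper-cased */
  return c ^ 0x40;                      /* then bit 6 (hex 40) is inverted */
}
```

With these assumptions, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, matching the examples in the pcrepattern.3 hunk. A byte such as 0xC4, the first byte of the UTF-8 encoding of U+0123 used in the new tests, gives -1.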