[Pcre-svn] [574] code/trunk: Give error if \c is followed by…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [574] code/trunk: Give error if \c is followed by a byte > 127 (in ASCII/UTF-8 modes).
Revision: 574
          http://vcs.pcre.org/viewvc?view=rev&revision=574
Author:   ph10
Date:     2010-11-20 17:47:27 +0000 (Sat, 20 Nov 2010)


Log Message:
-----------
Give error if \c is followed by a byte > 127 (in ASCII/UTF-8 modes).

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcreposix.c
    code/trunk/testdata/testinput1
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput5
    code/trunk/testdata/testoutput1
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput5


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/ChangeLog    2010-11-20 17:47:27 UTC (rev 574)
@@ -108,6 +108,11 @@
     loops, in order to improve performance in some environments. At the same 
     time, I abstracted some of the common code into auxiliary macros to save 
     repetition (this should not affect the compiled code).
+    
+19. If \c was followed by a multibyte UTF-8 character, bad things happened. A 
+    compile-time error is now given if \c is not followed by an ASCII 
+    character, that is, a byte less than 128. (In EBCDIC mode, the code is 
+    different, and any byte value is allowed.)



Version 8.10 25-Jun-2010

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/doc/pcrepattern.3    2010-11-20 17:47:27 UTC (rev 574)
@@ -182,9 +182,9 @@
 .rs
 .sp
 The backslash character has several uses. Firstly, if it is followed by a
-non-alphanumeric character, it takes away any special meaning that character
-may have. This use of backslash as an escape character applies both inside and
-outside character classes.
+character that is not a number or a letter, it takes away any special meaning
+that character may have. This use of backslash as an escape character applies
+both inside and outside character classes. 
 .P
 For example, if you want to match a * character, you write \e* in the pattern.
 This escaping action applies whether or not the following character would
@@ -192,6 +192,10 @@
 non-alphanumeric with backslash to specify that it stands for itself. In
 particular, if you want to match a backslash, you write \e\e.
 .P
+In UTF-8 mode, only ASCII numbers and letters have any special meaning after a
+backslash. All other characters (in particular, those whose codepoints are 
+greater than 127) are treated as literals.
+.P
 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
 pattern (other than in a character class) and characters between a # outside
 a character class and the next newline are ignored. An escaping backslash can
@@ -225,7 +229,7 @@
 one of the following escape sequences than the binary character it represents:
 .sp
   \ea        alarm, that is, the BEL character (hex 07)
-  \ecx       "control-x", where x is any character
+  \ecx       "control-x", where x is any ASCII character
   \ee        escape (hex 1B)
   \ef        formfeed (hex 0C)
   \en        linefeed (hex 0A)
@@ -237,8 +241,12 @@
 .sp
 The precise effect of \ecx is as follows: if x is a lower case letter, it
 is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
-Thus \ecz becomes hex 1A, but \ec{ becomes hex 3B, while \ec; becomes hex
-7B.
+Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while
+\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater 
+than 127, a compile-time error occurs. This locks out non-ASCII characters in 
+both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte 
+values are valid. A lower case letter is converted to upper case, and then the 
+0xc0 bits are flipped.)
 .P
 After \ex, from zero to two hexadecimal digits are read (letters can be in
 upper or lower case). Any number of hexadecimal digits may appear between \ex{
@@ -2718,6 +2726,6 @@
 .rs
 .sp
 .nf
-Last updated: 17 November 2010
+Last updated: 20 November 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/pcre_compile.c    2010-11-20 17:47:27 UTC (rev 574)
@@ -408,6 +408,7 @@
   "different names for subpatterns of the same number are not allowed\0"
   "(*MARK) must have an argument\0"
   "this version of PCRE is not compiled with PCRE_UCP support\0"
+  "\\c must be followed by an ASCII character\0" 
   ;


 /* Table to identify digits and hex digits. This is used when compiling
@@ -841,7 +842,8 @@
     break;


     /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
-    This coding is ASCII-specific, but then the whole concept of \cx is
+    An error is given if the byte following \c is not an ASCII character. This
+    coding is ASCII-specific, but then the whole concept of \cx is
     ASCII-specific. (However, an EBCDIC equivalent has now been added.) */


     case CHAR_c:
@@ -851,11 +853,15 @@
       *errorcodeptr = ERR2;
       break;
       }
-
-#ifndef EBCDIC  /* ASCII/UTF-8 coding */
+#ifndef EBCDIC    /* ASCII/UTF-8 coding */
+    if (c > 127)  /* Excludes all non-ASCII in either mode */
+      {
+      *errorcodeptr = ERR68;
+      break;  
+      }      
     if (c >= CHAR_a && c <= CHAR_z) c -= 32;
     c ^= 0x40;
-#else           /* EBCDIC coding */
+#else             /* EBCDIC coding */
     if (c >= CHAR_a && c <= CHAR_z) c += 64;
     c ^= 0xC0;
 #endif


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/pcre_internal.h    2010-11-20 17:47:27 UTC (rev 574)
@@ -1569,7 +1569,8 @@
        ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
        ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
        ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
-       ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
+       ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, 
+       ERRCOUNT };


 /* The real format of the start of the pcre block; the index of names and the
 code vector run on as long as necessary after the end. We store an explicit

Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/pcreposix.c    2010-11-20 17:47:27 UTC (rev 574)
@@ -151,6 +151,7 @@
   REG_BADPAT,  /* different names for subpatterns of the same number are not allowed */
   REG_BADPAT,  /* (*MARK) must have an argument */
   REG_INVARG,  /* this version of PCRE is not compiled with PCRE_UCP support */
+  REG_BADPAT,  /* \c must be followed by an ASCII character */ 
 };


 /* Table of texts corresponding to POSIX error codes */

Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testinput1    2010-11-20 17:47:27 UTC (rev 574)
@@ -4076,4 +4076,7 @@
 /[\x00-\xff\s]+/
     \x0a\x0b\x0c\x0d


+/^\c/
+    ?
+
 /-- End of testinput1 --/


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testinput2    2010-11-20 17:47:27 UTC (rev 574)
@@ -3552,4 +3552,6 @@
     abc\>4
     abc\>-4 


+/^\cģ/
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testinput5    2010-11-20 17:47:27 UTC (rev 574)
@@ -829,4 +829,6 @@
     a\x{123}aa\>5
     a\x{123}aa\>6


+/^\cģ/8
+
 /-- End of testinput5 --/

Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testoutput1    2010-11-20 17:47:27 UTC (rev 574)
@@ -6662,4 +6662,8 @@
     \x0a\x0b\x0c\x0d
  0: \x0a\x0b\x0c\x0d


+/^\c/
+    ?
+ 0: ?
+
 /-- End of testinput1 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testoutput2    2010-11-20 17:47:27 UTC (rev 574)
@@ -11231,4 +11231,7 @@
     abc\>-4 
 Error -24


+/^\cģ/
+Failed: \c must be followed by an ASCII character at offset 3
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2010-11-19 19:29:44 UTC (rev 573)
+++ code/trunk/testdata/testoutput5    2010-11-20 17:47:27 UTC (rev 574)
@@ -2314,4 +2314,7 @@
     a\x{123}aa\>6
 Error -24


+/^\cģ/8
+Failed: \c must be followed by an ASCII character at offset 3
+
 /-- End of testinput5 --/
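
Illustration (not part of the commit): the ASCII/UTF-8 branch of the \c
handling changed in pcre_compile.c can be sketched as a standalone function.
This is only a sketch under stated assumptions: the name control_escape_ascii
and the -1 return value are hypothetical, standing in for the real code's
setting of *errorcodeptr = ERR68, and the character constants assume an ASCII
host (the EBCDIC branch is not shown).

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function. Given the byte x that follows
   \c in a pattern, return the character that \cx stands for in ASCII/UTF-8
   mode, or -1 if x is not an ASCII character. The -1 stands in for the
   ERR68 compile-time error added in this revision. */
int control_escape_ascii(int c)
{
  assert(c >= 0 && c <= 255);          /* c is a single byte of the pattern */
  if (c > 127) return -1;              /* excludes all non-ASCII bytes */
  if (c >= 'a' && c <= 'z') c -= 32;   /* a lower case letter is upper-cased */
  return c ^ 0x40;                     /* then the 0x40 bit is flipped */
}
```

With this sketch, 'z' (hex 7A) maps to hex 1A, '{' (hex 7B) to hex 3B, and
';' (hex 3B) to hex 7B, matching the examples in pcrepattern.3. Any byte
above 127, such as the first byte of a multibyte UTF-8 character, yields -1
instead of being transformed.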