[Pcre-svn] [1392] code/trunk: Give errors for [A-\d] and [a-…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1392] code/trunk: Give errors for [A-\d] and [a-[:digit:]] etc.
Revision: 1392
          http://vcs.pcre.org/viewvc?view=rev&revision=1392
Author:   ph10
Date:     2013-11-06 18:00:09 +0000 (Wed, 06 Nov 2013)


Log Message:
-----------
Give errors for [A-\d] and [a-[:digit:]] etc.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrecompat.3
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcreposix.c
    code/trunk/testdata/testinput1
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput8
    code/trunk/testdata/testoutput1
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput8


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/ChangeLog    2013-11-06 18:00:09 UTC (rev 1392)
@@ -159,6 +159,10 @@
     This limit is imposed to control the amount of system stack used at compile 
     time. It can be changed at build time by --with-parens-nest-limit=xxx or 
     the equivalent in CMake. 
+    
+34. Character classes such as [A-\d] or [a-[:digit:]] now cause compile-time 
+    errors. Perl warns for these when in warning mode, but PCRE has no facility 
+    for giving warnings. 



Version 8.33 28-May-2013

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/doc/pcreapi.3    2013-11-06 18:00:09 UTC (rev 1392)
@@ -978,6 +978,8 @@
   79  non-hex character in \ex{} (closing brace missing?)
   80  non-octal character in \eo{} (closing brace missing?)
   81  missing opening brace after \eo 
+  82  parentheses are too deeply nested
+  83  invalid range in character class
 .sp
 The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
 be used if the limits were changed when PCRE was built.


Modified: code/trunk/doc/pcrecompat.3
===================================================================
--- code/trunk/doc/pcrecompat.3    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/doc/pcrecompat.3    2013-11-06 18:00:09 UTC (rev 1392)
@@ -1,4 +1,4 @@
-.TH PCRECOMPAT 3 "19 March 2013" "PCRE 8.33"
+.TH PCRECOMPAT 3 "05 November 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "DIFFERENCES BETWEEN PCRE AND PERL"
@@ -125,13 +125,18 @@
 Perl allows white space between ( and ? but PCRE never does, even if the
 PCRE_EXTENDED option is set.
 .P
-16. In PCRE, the upper/lower case character properties Lu and Ll are not
+16. Perl, when in warning mode, gives warnings for character classes such as
+[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no 
+warning features, so it gives an error in these cases because they are almost 
+certainly user mistakes.
+.P
+17. In PCRE, the upper/lower case character properties Lu and Ll are not
 affected when case-independent matching is specified. For example, \ep{Lu}
 always matches an upper case letter. I think Perl has changed in this respect;
 in the release at the time of writing (5.16), \ep{Lu} and \ep{Ll} match all
 letters, regardless of case, when case independence is specified.
 .P
-17. PCRE provides some extensions to the Perl regular expression facilities.
+18. PCRE provides some extensions to the Perl regular expression facilities.
 Perl 5.10 includes new features that are not in earlier versions of Perl, some
 of which (such as named parentheses) have been in PCRE for some time. This list
 is with respect to Perl 5.10:
@@ -190,6 +195,6 @@
 .rs
 .sp
 .nf
-Last updated: 19 March 2013
+Last updated: 05 November 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/doc/pcrepattern.3    2013-11-06 18:00:09 UTC (rev 1392)
@@ -1235,7 +1235,9 @@
 character class. For example, [d-m] matches any letter between d and m,
 inclusive. If a minus character is required in a class, it must be escaped with
 a backslash or appear in a position where it cannot be interpreted as
-indicating a range, typically as the first or last character in the class.
+indicating a range, typically as the first or last character in the class, or 
+immediately after a range. For example, [b-d-z] matches letters in the range b
+to d, a hyphen character, or z.
 .P
 It is not possible to have the literal character "]" as the end character of a
 range. A pattern such as [W-]46] is interpreted as a class of two characters
@@ -1245,6 +1247,11 @@
 followed by two other characters. The octal or hexadecimal representation of
 "]" can also be used to end a range.
 .P
+An error is generated if a POSIX character class (see below) or an escape
+sequence other than one that defines a single character appears at a point
+where a range ending character is expected. For example, [z-\exff] is valid,
+but [A-\ed] and [A-[:digit:]] are not.
+.P
 Ranges operate in the collating sequence of character values. They can also be
 used for characters specified numerically, for example [\e000-\e037]. Ranges
 can include any characters that are valid for the current mode.
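
Note on the pcrepattern.3 change above: the rule for what may follow the
hyphen in a class can be sketched as a small stand-alone classifier. This is
only an illustration, not the code in pcre_compile.c. The names
range_end_kind, looks_like_posix_item and classify_range_end are hypothetical,
the list of class-type escapes is deliberately partial, and the POSIX check is
much simpler than PCRE's own check_posix_syntax().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; they are not PCRE identifiers. */
typedef enum {
  RANGE_END_OK,             /* a single character, literal or escaped */
  RANGE_END_INVALID,        /* class-type escape or POSIX item: error 83 */
  RANGE_END_LITERAL_HYPHEN  /* hyphen is last in the class, so it is literal */
} range_end_kind;

/* Return nonzero if s starts with an item shaped like [:digit:], [.xxx.] or
   [=xxx=]: '[', a delimiter, then later the same delimiter followed by ']'.
   Simplified assumption: no attempt is made to validate the name inside. */
static int looks_like_posix_item(const char *s)
{
  char delim;
  const char *p;
  if (s[0] != '[') return 0;
  delim = s[1];
  if (delim != ':' && delim != '.' && delim != '=') return 0;
  for (p = s + 2; *p != '\0'; p++)
    if (p[0] == delim && p[1] == ']') return 1;
  return 0;
}

/* Classify the text that follows the hyphen in "[X-...". */
static range_end_kind classify_range_end(const char *after_hyphen)
{
  assert(after_hyphen != NULL);
  if (after_hyphen[0] == ']') return RANGE_END_LITERAL_HYPHEN;
  if (after_hyphen[0] == '\\')
    {
    /* Partial list of escapes that do not define a single character. */
    if (after_hyphen[1] != '\0' &&
        strchr("dDsSwWhHvVRN", after_hyphen[1]) != NULL)
      return RANGE_END_INVALID;
    return RANGE_END_OK;    /* for example \xff, \n, or \b (backspace) */
    }
  if (looks_like_posix_item(after_hyphen)) return RANGE_END_INVALID;
  return RANGE_END_OK;
}
```

With this sketch, classify_range_end("\\xff]") yields RANGE_END_OK, while
"\\d]" and "[:digit:]]" both yield RANGE_END_INVALID, matching the
[z-\xff], [A-\d] and [A-[:digit:]] examples in the documentation text.
Whether the two ends of a valid range are in the right order is a separate
check and is not modelled here.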


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/pcre_compile.c    2013-11-06 18:00:09 UTC (rev 1392)
@@ -532,6 +532,7 @@
   "non-octal character in \\o{} (closing brace missing?)\0"
   "missing opening brace after \\o\0"
   "parentheses are too deeply nested\0"
+  "invalid range in character class\0" 
   ;


/* Table to identify digits and hex digits. This is used when compiling
@@ -3793,7 +3794,7 @@
below handles the special case of \], but does not try to do any other escape
processing. This makes it different from Perl for cases such as [:l\ower:]
where Perl recognizes it as the POSIX class "lower" but PCRE does not recognize
-"l\ower". This is a lesser evil that not diagnosing bad classes when Perl does,
+"l\ower". This is a lesser evil than not diagnosing bad classes when Perl does,
I think.

 A user pointed out that PCRE was rejecting [:a[:digit:]] whereas Perl was not.
@@ -5143,28 +5144,45 @@
         else
 #endif
         d = *ptr;  /* Not UTF-8 mode */
+        
+        /* The second part of a range can be a single-character escape
+        sequence, but not any of the other escapes. Perl treats a hyphen as a
+        literal in such circumstances. However, in Perl's warning mode, a
+        warning is given, so PCRE now faults it as it is almost certainly a 
+        mistake on the user's part. */


-        /* The second part of a range can be a single-character escape, but
-        not any of the other escapes. Perl 5.6 treats a hyphen as a literal
-        in such circumstances. */
-
-        if (!inescq && d == CHAR_BACKSLASH)
-          {
-          int descape;
-          descape = check_escape(&ptr, &d, errorcodeptr, cd->bracount, options, TRUE);
-          if (*errorcodeptr != 0) goto FAILED;
-
-          /* \b is backspace; any other special means the '-' was literal. */
-
-          if (descape != 0)
+        if (!inescq)
+          { 
+          if (d == CHAR_BACKSLASH)
             {
-            if (descape == ESC_b) d = CHAR_BS; else
+            int descape;
+            descape = check_escape(&ptr, &d, errorcodeptr, cd->bracount, options, TRUE);
+            if (*errorcodeptr != 0) goto FAILED;
+          
+            /* 0 means a character was put into d; \b is backspace; any other
+            special causes an error. */
+          
+            if (descape != 0)
               {
-              ptr = oldptr;
-              goto CLASS_SINGLE_CHARACTER;  /* A few lines below */
+              if (descape == ESC_b) d = CHAR_BS; else
+                {
+                *errorcodeptr = ERR83;
+                goto FAILED; 
+                }
               }
             }
-          }
+        
+          /* A hyphen followed by a POSIX class is treated in the same way. */
+          
+          else if (d == CHAR_LEFT_SQUARE_BRACKET && 
+                   (ptr[1] == CHAR_COLON || ptr[1] == CHAR_DOT ||
+                    ptr[1] == CHAR_EQUALS_SIGN) && 
+                   check_posix_syntax(ptr, &tempptr))
+            {
+            *errorcodeptr = ERR83;
+            goto FAILED;          
+            }  
+          }   


         /* Check that the two values are in the correct order. Optimize
         one-character ranges. */


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/pcre_internal.h    2013-11-06 18:00:09 UTC (rev 1392)
@@ -2335,7 +2335,7 @@
        ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
        ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69,
        ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79,
-       ERR80, ERR81, ERR82, ERRCOUNT };
+       ERR80, ERR81, ERR82, ERR83, ERRCOUNT };


/* JIT compiling modes. The function list is indexed by them. */
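
Note on the pcre_compile.c and pcre_internal.h changes above: the error texts
are one long string with a \0 after each message, and the ERRnn enum supplies
the index. That is presumably why ERR83 is added immediately before ERRCOUNT
and its text is added at the very end of the string: the two must stay in the
same order. A minimal sketch of such a lookup follows. The table
demo_error_texts and the function nth_text are hypothetical names made up for
this illustration; PCRE's own lookup routine is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical three-entry table in the same layout as the diff: every
   message ends with an explicit \0, and the string literal's own terminator
   then acts as an empty entry that marks the end of the table. */
static const char demo_error_texts[] =
  "missing opening brace after \\o\0"
  "parentheses are too deeply nested\0"
  "invalid range in character class\0";

/* Return the n-th message (0-based), or NULL if n is past the last entry. */
static const char *nth_text(const char *table, int n)
{
  assert(table != NULL && n >= 0);
  while (n-- > 0)
    {
    if (*table == '\0') return NULL;   /* already at the end marker */
    table += strlen(table) + 1;        /* skip one message and its \0 */
    }
  return (*table == '\0') ? NULL : table;
}
```

Here nth_text(demo_error_texts, 2) points at "invalid range in character
class", and an index of 3 returns NULL because only the end marker is left.
A layout like this needs no separate array of pointers, at the cost of a
linear walk, which is acceptable on an error path.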


Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/pcreposix.c    2013-11-06 18:00:09 UTC (rev 1392)
@@ -168,7 +168,8 @@
   /* 80 */ 
   REG_BADPAT,  /* non-octal character in \o{} (closing brace missing?) */ 
   REG_BADPAT,  /* missing opening brace after \o */
-  REG_BADPAT   /* parentheses too deeply nested */
+  REG_BADPAT,  /* parentheses too deeply nested */
+  REG_BADPAT   /* invalid range in character class */ 
 };


/* Table of texts corresponding to POSIX error codes */

Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/testdata/testinput1    2013-11-06 18:00:09 UTC (rev 1392)
@@ -3661,13 +3661,6 @@
 /a*/g
     abbab


-/^[a-\d]/
-    abcde
-    -things
-    0digit
-    *** Failers
-    bcdef    
-
 /^[\d-a]/
     abcde
     -things


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/testdata/testinput2    2013-11-06 18:00:09 UTC (rev 1392)
@@ -3492,6 +3492,8 @@


/a[B-\Nc]/

+/a[B\Nc]/
+
/(a)(?2){0,1999}?(b)/

/(a)(?(DEFINE)(b))(?2){0,1999}?(?2)/
@@ -3977,4 +3979,19 @@

/a{4}+/BZOi

+/[a-[:digit:]]+/
+
+/[A-[:digit:]]+/
+
+/[a-[.xxx.]]+/
+
+/[a-[=xxx=]]+/
+
+/[a-[!xxx!]]+/
+
+/[A-[!xxx!]]+/
+    A]]]
+
+/[a-\d]+/
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/testdata/testinput8    2013-11-06 18:00:09 UTC (rev 1392)
@@ -3830,13 +3830,6 @@
 /a*/g
     abbab


-/^[a-\d]/
-    abcde
-    -things
-    0digit
-    *** Failers
-    bcdef    
-
 /^[\d-a]/
     abcde
     -things


Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/testdata/testoutput1    2013-11-06 18:00:09 UTC (rev 1392)
@@ -5991,18 +5991,6 @@
  0: 
  0: 


-/^[a-\d]/
-    abcde
- 0: a
-    -things
- 0: -
-    0digit
- 0: 0
-    *** Failers
-No match
-    bcdef    
-No match
-
 /^[\d-a]/
     abcde
  0: a


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/testdata/testoutput2    2013-11-06 18:00:09 UTC (rev 1392)
@@ -11944,8 +11944,11 @@
 Failed: \N is not supported in a class at offset 3


/a[B-\Nc]/
-Failed: \N is not supported in a class at offset 5
+Failed: invalid range in character class at offset 5

+/a[B\Nc]/
+Failed: \N is not supported in a class at offset 4
+
/(a)(?2){0,1999}?(b)/

 /(a)(?(DEFINE)(b))(?2){0,1999}?(?2)/
@@ -13987,4 +13990,26 @@
         End
 ------------------------------------------------------------------


+/[a-[:digit:]]+/
+Failed: invalid range in character class at offset 3
+
+/[A-[:digit:]]+/
+Failed: invalid range in character class at offset 3
+
+/[a-[.xxx.]]+/
+Failed: invalid range in character class at offset 3
+
+/[a-[=xxx=]]+/
+Failed: invalid range in character class at offset 3
+
+/[a-[!xxx!]]+/
+Failed: range out of order in character class at offset 3
+
+/[A-[!xxx!]]+/
+    A]]]
+ 0: A]]]
+
+/[a-\d]+/
+Failed: invalid range in character class at offset 4
+
 /-- End of testinput2 --/
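
Note on the two [!xxx!] tests above: "[!" does not start a POSIX-style item,
so the "[" after the hyphen is an ordinary range end, and the usual ordering
check applies. The sketch below shows only that comparison, assuming ASCII
code points ('A' is 0x41, '[' is 0x5B, 'a' is 0x61). The names range_order
and check_range_order are hypothetical and do not appear in PCRE.

```c
#include <assert.h>

/* Hypothetical result type for this sketch. */
typedef enum { ORDER_OK, ORDER_REVERSED } range_order;

/* A range first-last is acceptable only if last is not below first. */
static range_order check_range_order(unsigned int first, unsigned int last)
{
  return (last < first) ? ORDER_REVERSED : ORDER_OK;
}
```

Under the ASCII assumption, check_range_order('a', '[') is ORDER_REVERSED,
which corresponds to the "range out of order" failure for /[a-[!xxx!]]+/.
check_range_order('A', '[') is ORDER_OK, so /[A-[!xxx!]]+/ compiles: the
class holds the range A-[ plus "!" and "x", it is followed by a literal "]"
repeated one or more times, and the subject A]]] therefore matches.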


Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8    2013-11-06 16:43:07 UTC (rev 1391)
+++ code/trunk/testdata/testoutput8    2013-11-06 18:00:09 UTC (rev 1392)
@@ -6000,18 +6000,6 @@
  0: 
  0: 


-/^[a-\d]/
-    abcde
- 0: a
-    -things
- 0: -
-    0digit
- 0: 0
-    *** Failers
-No match
-    bcdef    
-No match
-
 /^[\d-a]/
     abcde
  0: a