[Pcre-svn] [1369] code/trunk: Update \8 and \9 handling to m…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1369] code/trunk: Update \8 and \9 handling to match most recent Perl.
Revision: 1369
          http://vcs.pcre.org/viewvc?view=rev&revision=1369
Author:   ph10
Date:     2013-10-08 16:06:46 +0100 (Tue, 08 Oct 2013)


Log Message:
-----------
Update \8 and \9 handling to match most recent Perl.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcresyntax.3
    code/trunk/pcre_compile.c
    code/trunk/testdata/testinput1
    code/trunk/testdata/testinput8
    code/trunk/testdata/testoutput1
    code/trunk/testdata/testoutput8


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/ChangeLog    2013-10-08 15:06:46 UTC (rev 1369)
@@ -105,6 +105,11 @@


       (2) Add missing HAVE_STDINT_H and HAVE_INTTYPES_H to config-cmake.h.in; 
           without these the CMake build does not work on Solaris.
+          
+21. Perl has changed its handling of \8 and \9. If there is no previously 
+    encountered capturing group of those numbers, they are treated as the 
+    literal characters 8 and 9 instead of a binary zero followed by the 
+    literals. PCRE now does the same.



Version 8.33 28-May-2013

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/doc/pcrepattern.3    2013-10-08 15:06:46 UTC (rev 1369)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "05 October 2013" "PCRE 8.34"
+.TH PCREPATTERN 3 "08 October 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -359,9 +359,10 @@
 sure you supply two digits after the initial zero if the pattern character that
 follows is itself an octal digit.
 .P
-The handling of a backslash followed by a digit other than 0 is complicated.
-Outside a character class, PCRE reads it and any following digits as a decimal
-number. If the number is less than 10, or if there have been at least that many
+The handling of a backslash followed by a digit other than 0 is complicated,
+and Perl has changed in recent releases, causing PCRE also to change. Outside a
+character class, PCRE reads the digit and any following digits as a decimal
+number. If the number is less than 8, or if there have been at least that many
 previous capturing left parentheses in the expression, the entire sequence is
 taken as a \fIback reference\fP. A description of how this works is given
 .\" HTML <a href="#backreferences">
@@ -374,12 +375,13 @@
 parenthesized subpatterns.
 .\"
 .P
-Inside a character class, or if the decimal number is greater than 9 and there
-have not been that many capturing subpatterns, PCRE re-reads up to three octal
-digits following the backslash, and uses them to generate a data character. Any
-subsequent digits stand for themselves. The value of the character is
-constrained in the same way as characters specified in hexadecimal.
-For example:
+Inside a character class, or if the decimal number following \e is greater than
+7 and there have not been that many capturing subpatterns, PCRE handles \e8 and
+\e9 as the literal characters "8" and "9", and otherwise re-reads up to three
+octal digits following the backslash, using them to generate a data character.
+Any subsequent digits stand for themselves. The value of the character is
+constrained in the same way as characters specified in hexadecimal. For
+example:
 .sp
   \e040   is another way of writing an ASCII space
 .\" JOIN
@@ -398,8 +400,8 @@
   \e377   might be a back reference, otherwise
             the value 255 (decimal)
 .\" JOIN
-  \e81    is either a back reference, or a binary zero
-            followed by the two characters "8" and "1"
+  \e81    is either a back reference, or the two
+            characters "8" and "1"
 .sp
 Note that octal values of 100 or greater must not be introduced by a leading
 zero, because no more than three octal digits are ever read.
@@ -3156,6 +3158,6 @@
 .rs
 .sp
 .nf
-Last updated: 05 October 2013
+Last updated: 08 October 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/doc/pcresyntax.3    2013-10-08 15:06:46 UTC (rev 1369)
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "05 October 2013" "PCRE 8.34"
+.TH PCRESYNTAX 3 "08 October 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -32,6 +32,9 @@
   \eddd       character with octal code ddd, or backreference
   \exhh       character with hex code hh
   \ex{hhh..}  character with hex code hhh..
+.sp
+Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal 
+characters "8" and "9".   
 .
 .
 .SH "CHARACTER TYPES"
@@ -498,6 +501,6 @@
 .rs
 .sp
 .nf
-Last updated: 05 October 2013
+Last updated: 08 October 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/pcre_compile.c    2013-10-08 15:06:46 UTC (rev 1369)
@@ -879,16 +879,15 @@
 *************************************************/


 /* This function is called when a \ has been encountered. It either returns a
-positive value for a simple escape such as \n, or 0 for a data character
-which will be placed in chptr. A backreference to group n is returned as
-negative n. When UTF-8 is enabled, a positive value greater than 255 may
-be returned in chptr.
-On entry,ptr is pointing at the \. On exit, it is on the final character of the
-escape sequence.
+positive value for a simple escape such as \n, or 0 for a data character which
+will be placed in chptr. A backreference to group n is returned as negative n.
+When UTF-8 is enabled, a positive value greater than 255 may be returned in
+chptr. On entry, ptr is pointing at the \. On exit, it is on the final
+character of the escape sequence.

 Arguments:
   ptrptr         points to the pattern position pointer
-  chptr          points to the data character
+  chptr          points to a returned data character
   errorcodeptr   points to the errorcode variable
   bracount       number of previous extracting brackets
   options        the options bits
@@ -1092,16 +1091,20 @@
     break;


     /* The handling of escape sequences consisting of a string of digits
-    starting with one that is not zero is not straightforward. By experiment,
-    the way Perl works seems to be as follows:
+    starting with one that is not zero is not straightforward. Perl has changed 
+    over the years. Nowadays \g{} for backreferences and \o{} for octal are 
+    recommended to avoid the ambiguities in the old syntax.


     Outside a character class, the digits are read as a decimal number. If the
-    number is less than 10, or if there are that many previous extracting
-    left brackets, then it is a back reference. Otherwise, up to three octal
-    digits are read to form an escaped byte. Thus \123 is likely to be octal
-    123 (cf \0123, which is octal 012 followed by the literal 3). If the octal
-    value is greater than 377, the least significant 8 bits are taken. Inside a
-    character class, \ followed by a digit is always an octal number. */
+    number is less than 8 (used to be 10), or if there are that many previous
+    extracting left brackets, then it is a back reference. Otherwise, up to
+    three octal digits are read to form an escaped byte. Thus \123 is likely to
+    be octal 123 (cf \0123, which is octal 012 followed by the literal 3). If
+    the octal value is greater than 377, the least significant 8 bits are
+    taken. \8 and \9 are treated as the literal characters 8 and 9.
+     
+    Inside a character class, \ followed by a digit is always either a literal
+    8 or 9 or an octal number. */


     case CHAR_1: case CHAR_2: case CHAR_3: case CHAR_4: case CHAR_5:
     case CHAR_6: case CHAR_7: case CHAR_8: case CHAR_9:
@@ -1128,7 +1131,7 @@
         *errorcodeptr = ERR61;
         break;
         }
-      if (s < 10 || s <= bracount)
+      if (s < 8 || s <= bracount)  /* Check for back reference */
         {
         escape = -s;
         break;
@@ -1136,16 +1139,14 @@
       ptr = oldptr;      /* Put the pointer back and fall through */
       }


-    /* Handle an octal number following \. If the first digit is 8 or 9, Perl
-    generates a binary zero byte and treats the digit as a following literal.
-    Thus we have to pull back the pointer by one. */
+    /* Handle a digit following \ when the number is not a back reference. If
+    the first digit is 8 or 9, Perl used to generate a binary zero byte and
+    then treat the digit as a following literal. At least by Perl 5.18 this
+    changed so as not to insert the binary zero. */


-    if ((c = *ptr) >= CHAR_8)
-      {
-      ptr--;
-      c = 0;
-      break;
-      }
+    if ((c = *ptr) >= CHAR_8) break;
+    
+    /* Fall through with a digit less than 8 */   


     /* \0 always starts an octal number, but we may drop through to here with a
     larger first octal digit. The original code used just to take the least


Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testinput1    2013-10-08 15:06:46 UTC (rev 1369)
@@ -1483,15 +1483,20 @@
     abc\100\x30
     abc\100\060
     abc\100\60
+    
+/^A\8B\9C$/
+    A8B9C
+    *** Failers
+    A\08B\09C  
+    
+/^(A)(B)(C)(D)(E)(F)(G)(H)(I)\8\9$/
+    ABCDEFGHIHI 


-/abc\81/
-    abc\081
-    abc\0\x38\x31
+/^[A\8B\9C]+$/
+    A8B9C
+    *** Failers 
+    A8B9C\x00


-/abc\91/
-    abc\091
-    abc\0\x39\x31
-
 /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/
     abcdefghijkllS


@@ -5626,5 +5631,8 @@
 /^ (?:(?<A>A)|(?'B'B)(?<A>A)) (?('A')x) (?(<B>)y)$/xJ
     Ax
     BAxy 
+    
+/^A\xZ/
+    A\0Z 


 /-- End of testinput1 --/

Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testinput8    2013-10-08 15:06:46 UTC (rev 1369)
@@ -1923,14 +1923,16 @@
     abc\100\060
     abc\100\60


-/abc\81/
-    abc\081
-    abc\0\x38\x31
-
-/abc\91/
-    abc\091
-    abc\0\x39\x31
-
+/^A\8B\9C$/
+    A8B9C
+    *** Failers
+    A\08B\09C  
+    
+/^[A\8B\9C]+$/
+    A8B9C
+    *** Failers 
+    A8B9C\x00
+    
 /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/
     abcdefghijk\12S



Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testoutput1    2013-10-08 15:06:46 UTC (rev 1369)
@@ -2149,19 +2149,36 @@
     abc\100\60
  0: abc@0
  1: abc
+    
+/^A\8B\9C$/
+    A8B9C
+ 0: A8B9C
+    *** Failers
+No match
+    A\08B\09C  
+No match
+    
+/^(A)(B)(C)(D)(E)(F)(G)(H)(I)\8\9$/
+    ABCDEFGHIHI 
+ 0: ABCDEFGHIHI
+ 1: A
+ 2: B
+ 3: C
+ 4: D
+ 5: E
+ 6: F
+ 7: G
+ 8: H
+ 9: I


-/abc\81/
-    abc\081
- 0: abc\x0081
-    abc\0\x38\x31
- 0: abc\x0081
+/^[A\8B\9C]+$/
+    A8B9C
+ 0: A8B9C
+    *** Failers 
+No match
+    A8B9C\x00
+No match


-/abc\91/
-    abc\091
- 0: abc\x0091
-    abc\0\x39\x31
- 0: abc\x0091
-
 /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/
     abcdefghijkllS
  0: abcdefghijkllS
@@ -9246,5 +9263,9 @@
  1: <unset>
  2: B
  3: A
+    
+/^A\xZ/
+    A\0Z 
+ 0: A\x00Z


 /-- End of testinput1 --/

Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8    2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testoutput8    2013-10-08 15:06:46 UTC (rev 1369)
@@ -2930,18 +2930,22 @@
     abc\100\60
  0: abc@0


-/abc\81/
-    abc\081
- 0: abc\x0081
-    abc\0\x38\x31
- 0: abc\x0081
-
-/abc\91/
-    abc\091
- 0: abc\x0091
-    abc\0\x39\x31
- 0: abc\x0091
-
+/^A\8B\9C$/
+    A8B9C
+ 0: A8B9C
+    *** Failers
+No match
+    A\08B\09C  
+No match
+    
+/^[A\8B\9C]+$/
+    A8B9C
+ 0: A8B9C
+    *** Failers 
+No match
+    A8B9C\x00
+No match
+    
 /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/
     abcdefghijk\12S
  0: abcdefghijk\x0aS
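
For readers who want the new rules in one place, the following is a minimal standalone sketch of the decision order described in the revised pcre_compile.c comment. It is not the PCRE code: the function name classify_digit_escape, the esc_result type and the in_class flag are hypothetical names invented for this illustration, and the real function's error reporting (ERR61) and character range limits are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for this sketch; they are not part of PCRE. */
typedef enum { ESC_BACKREF, ESC_CHAR } esc_kind;

typedef struct {
  esc_kind kind;  /* back reference or data character */
  int value;      /* group number, or character code */
  size_t used;    /* number of digits consumed after the backslash */
} esc_result;

/* Classify the digits that follow a backslash. digits[0] must be '1'..'9'
   (\0 is handled separately in the real code). bracount is the number of
   capturing groups seen so far; in_class is non-zero inside [...]. */
esc_result classify_digit_escape(const char *digits, int bracount, int in_class)
{
  esc_result r;
  size_t i = 0;
  int s = 0;
  int c = 0;

  assert(digits != NULL && digits[0] >= '1' && digits[0] <= '9');

  if (!in_class) {
    /* Outside a class, read the whole digit string as a decimal number. */
    while (digits[i] >= '0' && digits[i] <= '9') {
      /* Crude overflow guard for the sketch (assumes bracount < 100000);
         the real code reports an error instead. */
      if (s < 100000) s = s * 10 + (digits[i] - '0');
      i++;
    }
    if (s < 8 || s <= bracount) {  /* the threshold used to be 10 */
      r.kind = ESC_BACKREF; r.value = s; r.used = i;
      return r;
    }
  }

  /* Not a back reference: re-read from the first digit. \8 and \9 are now
     the literal digit, with no binary zero inserted before it. */
  if (digits[0] >= '8') {
    r.kind = ESC_CHAR; r.value = digits[0]; r.used = 1;
    return r;
  }

  /* Otherwise up to three octal digits form one data character. */
  for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
    c = c * 8 + (digits[i] - '0');
  r.kind = ESC_CHAR; r.value = c; r.used = i;
  return r;
}
```

Under these assumptions, classify_digit_escape("81", 0, 0) reports the character '8' with one digit consumed, so the following 1 is an ordinary literal, as in the revised pcrepattern example. With nine capturing groups before it, classify_digit_escape("8", 9, 0) reports a back reference to group 8, which corresponds to the new /^(A)(B)(C)(D)(E)(F)(G)(H)(I)\8\9$/ test.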