[Pcre-svn] [754] code/trunk: Support \C in lookbehinds and D…

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [754] code/trunk: Support \C in lookbehinds and DFA matching when not in UTF-8 mode.
Revision: 754
          http://vcs.pcre.org/viewvc?view=rev&revision=754
Author:   ph10
Date:     2011-11-19 18:32:18 +0000 (Sat, 19 Nov 2011)


Log Message:
-----------
Support \C in lookbehinds and DFA matching when not in UTF-8 mode.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcrematching.3
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcre_study.c
    code/trunk/testdata/testinput1
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput4
    code/trunk/testdata/testinput7
    code/trunk/testdata/testinput8
    code/trunk/testdata/testoutput1
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput4
    code/trunk/testdata/testoutput7
    code/trunk/testdata/testoutput8


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/ChangeLog    2011-11-19 18:32:18 UTC (rev 754)
@@ -49,6 +49,8 @@
 12. Updated pcre-config so that it no longer shows -L/usr/lib, which seems
     best practice nowadays, and helps with cross-compiling. (If the exec_prefix 
     is anything other than /usr, -L is still shown). 
+    
+13. In non-UTF-8 mode, \C is now supported in lookbehinds and DFA matching.



Version 8.20 21-Oct-2011

Modified: code/trunk/doc/pcrematching.3
===================================================================
--- code/trunk/doc/pcrematching.3    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/doc/pcrematching.3    2011-11-19 18:32:18 UTC (rev 754)
@@ -132,9 +132,9 @@
 always 1, and the value of the \fIcapture_last\fP field is always -1.
 .P
 7. The \eC escape sequence, which (in the standard algorithm) matches a single
-byte, even in UTF-8 mode, is not supported because the alternative algorithm
-moves through the subject string one character at a time, for all active paths
-through the tree.
+byte, even in UTF-8 mode, is not supported in UTF-8 mode, because the
+alternative algorithm moves through the subject string one character at a time,
+for all active paths through the tree.
 .P
 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
@@ -191,6 +191,6 @@
 .rs
 .sp
 .nf
-Last updated: 17 November 2010
+Last updated: 19 November 2011
 Copyright (c) 1997-2010 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/doc/pcrepattern.3    2011-11-19 18:32:18 UTC (rev 754)
@@ -1003,9 +1003,9 @@
 PCRE does not allow \eC to appear in lookbehind assertions
 .\" HTML <a href="#lookbehind">
 .\" </a>
-(described below),
+(described below)
 .\"
-because in UTF-8 mode this would make it impossible to calculate the length of
+in UTF-8 mode, because this would make it impossible to calculate the length of
 the lookbehind.
 .P
 In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
@@ -1970,10 +1970,10 @@
 match. If there are insufficient characters before the current position, the
 assertion fails.
 .P
-PCRE does not allow the \eC escape (which matches a single byte in UTF-8 mode)
-to appear in lookbehind assertions, because it makes it impossible to calculate
-the length of the lookbehind. The \eX and \eR escapes, which can match
-different numbers of bytes, are also not permitted.
+In UTF-8 mode, PCRE does not allow the \eC escape (which matches a single byte,
+even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
+impossible to calculate the length of the lookbehind. The \eX and \eR escapes,
+which can match different numbers of bytes, are also not permitted.
 .P
 .\" HTML <a href="#subpatternsassubroutines">
 .\" </a>
@@ -2874,6 +2874,6 @@
 .rs
 .sp
 .nf
-Last updated: 14 November 2011
+Last updated: 19 November 2011
 Copyright (c) 1997-2011 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/pcre_compile.c    2011-11-19 18:32:18 UTC (rev 754)
@@ -1528,7 +1528,7 @@


 Returns:   the fixed length,
              or -1 if there is no fixed length,
-             or -2 if \C was encountered
+             or -2 if \C was encountered (in UTF-8 mode only)
              or -3 if an OP_RECURSE item was encountered and atend is FALSE
              or -4 if an unknown opcode was encountered (internal error)
 */
@@ -1702,7 +1702,8 @@
     cc++;
     break;


-    /* The single-byte matcher isn't allowed */
+    /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode; 
+    otherwise \C is coded as OP_ALLANY. */


     case OP_ANYBYTE:
     return -2;
@@ -5600,8 +5601,8 @@


         /* ------------------------------------------------------------ */
         case CHAR_C:                 /* Callout - may be followed by digits; */
-        previous_callout = code;  /* Save for later completion */
-        after_manual_callout = 1; /* Skip one item before completing */
+        previous_callout = code;     /* Save for later completion */
+        after_manual_callout = 1;    /* Skip one item before completing */
         *code++ = OP_CALLOUT;
           {
           int n = 0;
@@ -6478,9 +6479,12 @@
           }
         else
 #endif
-          {
+        /* In non-UTF-8 mode, we turn \C into OP_ALLANY instead of OP_ANYBYTE
+        so that it works in DFA mode and in lookbehinds. */
+         
+          {  
           previous = (-c > ESC_b && -c < ESC_Z)? code : NULL;
-          *code++ = -c;
+          *code++ = (!utf8 && c == -ESC_C)? OP_ALLANY : -c;
           }
         }
       continue;


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/pcre_internal.h    2011-11-19 18:32:18 UTC (rev 754)
@@ -1252,8 +1252,8 @@
 their negation. Also, they must appear in the same order as in the opcode
 definitions below, up to ESC_z. There's a dummy for OP_ALLANY because it
 corresponds to "." in DOTALL mode rather than an escape sequence. It is also
-used for [^] in JavaScript compatibility mode. In non-DOTALL mode, "." behaves
-like \N.
+used for [^] in JavaScript compatibility mode, and for \C in non-utf8 mode. In
+non-DOTALL mode, "." behaves like \N.


 The special values ESC_DU, ESC_du, etc. are used instead of ESC_D, ESC_d, etc.
 when PCRE_UCP is set, when replacement of \d etc by \p sequences is required.

Modified: code/trunk/pcre_study.c
===================================================================
--- code/trunk/pcre_study.c    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/pcre_study.c    2011-11-19 18:32:18 UTC (rev 754)
@@ -286,7 +286,9 @@
     cc++;
     break;


-    /* The single-byte matcher means we can't proceed in UTF-8 mode */
+    /* The single-byte matcher means we can't proceed in UTF-8 mode. (In 
+    non-UTF-8 mode \C will actually be turned into OP_ALLANY, so won't ever 
+    appear, but leave the code, just in case.) */


     case OP_ANYBYTE:
 #ifdef SUPPORT_UTF8


Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testinput1    2011-11-19 18:32:18 UTC (rev 754)
@@ -4305,4 +4305,17 @@
     ** Failers
     aaaaaa


+/ab\Cde/
+    abXde
+    
+/(?<=ab\Cde)X/
+    abZdeX
+
+/a[\CD]b/
+    aCb
+    aDb 
+
+/a[\C-X]b/
+    aJb
+
 /-- End of testinput1 --/


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testinput2    2011-11-19 18:32:18 UTC (rev 754)
@@ -4007,4 +4007,6 @@


 /(?(?=c)c|d)*+Y/BZ

+/(?<=ab\Cde)X/8
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testinput4
===================================================================
--- code/trunk/testdata/testinput4    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testinput4    2011-11-19 18:32:18 UTC (rev 754)
@@ -650,4 +650,7 @@
 /(abc)\1/8
    abc


+/ab\Cde/8
+    abXde
+
 /-- End of testinput4 --/


Modified: code/trunk/testdata/testinput7
===================================================================
--- code/trunk/testdata/testinput7    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testinput7    2011-11-19 18:32:18 UTC (rev 754)
@@ -4703,4 +4703,10 @@
     \O6aaaa
     \O8aaaa


+/ab\Cde/
+    abXde
+    
+/(?<=ab\Cde)X/
+    abZdeX
+
 /-- End of testinput7 --/


Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testinput8    2011-11-19 18:32:18 UTC (rev 754)
@@ -700,4 +700,9 @@
     a\x{123}aa\>5
     a\x{123}aa\>6


+/ab\Cde/8
+    abXde
+
+/(?<=ab\Cde)X/8
+
 /-- End of testinput8 --/ 


Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testoutput1    2011-11-19 18:32:18 UTC (rev 754)
@@ -7035,4 +7035,22 @@
     aaaaaa
 No match


+/ab\Cde/
+    abXde
+ 0: abXde
+    
+/(?<=ab\Cde)X/
+    abZdeX
+ 0: X
+
+/a[\CD]b/
+    aCb
+ 0: aCb
+    aDb 
+ 0: aDb
+
+/a[\C-X]b/
+    aJb
+ 0: aJb
+
 /-- End of testinput1 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testoutput2    2011-11-19 18:32:18 UTC (rev 754)
@@ -12591,4 +12591,7 @@
         End
 ------------------------------------------------------------------


+/(?<=ab\Cde)X/8
+Failed: \C not allowed in lookbehind assertion at offset 10
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testoutput4
===================================================================
--- code/trunk/testdata/testoutput4    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testoutput4    2011-11-19 18:32:18 UTC (rev 754)
@@ -1136,4 +1136,8 @@
    abc
 No match


+/ab\Cde/8
+    abXde
+ 0: abXde
+
 /-- End of testinput4 --/


Modified: code/trunk/testdata/testoutput7
===================================================================
--- code/trunk/testdata/testoutput7    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testoutput7    2011-11-19 18:32:18 UTC (rev 754)
@@ -7858,4 +7858,12 @@
  2: aa
  3: a


+/ab\Cde/
+    abXde
+ 0: abXde
+    
+/(?<=ab\Cde)X/
+    abZdeX
+ 0: X
+
 /-- End of testinput7 --/


Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8    2011-11-19 17:03:35 UTC (rev 753)
+++ code/trunk/testdata/testoutput8    2011-11-19 18:32:18 UTC (rev 754)
@@ -1348,4 +1348,11 @@
     a\x{123}aa\>6
 Error -24 (bad offset value)


+/ab\Cde/8
+    abXde
+Error -16 (item unsupported for DFA matching)
+
+/(?<=ab\Cde)X/8
+Failed: \C not allowed in lookbehind assertion at offset 10
+
 /-- End of testinput8 --/
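
The core of this revision is a single decision at compile time: outside UTF-8 mode, \C is emitted as OP_ALLANY rather than OP_ANYBYTE, so the fixed-length check for lookbehinds no longer rejects it. The following is a minimal, standalone sketch of that decision and is not code from the PCRE tree. The `sketch_` and `SK_` names are hypothetical stand-ins (the real opcodes live in pcre_internal.h), and the assumption that OP_ALLANY counts as exactly one unit of lookbehind length is mine. Only the -2 and -4 return codes are taken from the comment in pcre_compile.c shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two PCRE opcodes involved; these are not
   the real numeric values from pcre_internal.h. */
enum sketch_opcode { SK_OP_ALLANY, SK_OP_ANYBYTE };

/* Mirrors the expression added in pcre_compile.c:
     (!utf8 && c == -ESC_C)? OP_ALLANY : -c
   restricted to the case where the escape is \C. */
static enum sketch_opcode sketch_opcode_for_backslash_C(bool utf8)
{
  return utf8 ? SK_OP_ANYBYTE : SK_OP_ALLANY;
}

/* Contribution of \C to a lookbehind's fixed length, following the return
   convention documented in the diff: -2 means "\C was encountered", which
   after this change can only happen in UTF-8 mode. Treating OP_ALLANY as
   length 1 is an assumption of this sketch. */
static int sketch_fixed_length_of_backslash_C(bool utf8)
{
  switch (sketch_opcode_for_backslash_C(utf8))
    {
    case SK_OP_ALLANY:  return 1;   /* one byte is one character here */
    case SK_OP_ANYBYTE: return -2;  /* length cannot be computed */
    }
  assert(!"unreachable opcode in sketch");
  return -4;                        /* unknown opcode (internal error) */
}
```

Under these assumptions, `sketch_fixed_length_of_backslash_C(false)` yields 1, which corresponds to /(?<=ab\Cde)X/ compiling and matching in testoutput1, while `sketch_fixed_length_of_backslash_C(true)` yields -2, which corresponds to the "\C not allowed in lookbehind assertion" failure for the same pattern with the /8 option in testoutput2.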