[Pcre-svn] [335] code/trunk: Do not discard subpatterns with…

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [335] code/trunk: Do not discard subpatterns with {0} quantifiers, as they may be called as
Revision: 335
          http://vcs.pcre.org/viewvc?view=rev&revision=335
Author:   ph10
Date:     2008-04-12 15:36:14 +0100 (Sat, 12 Apr 2008)


Log Message:
-----------
Do not discard subpatterns with {0} quantifiers, as they may be called as
subroutines.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/HACKING
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_dfa_exec.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_internal.h
    code/trunk/pcre_study.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/ChangeLog    2008-04-12 14:36:14 UTC (rev 335)
@@ -50,6 +50,13 @@
     which, however, unlike Perl's \g{...}, are subroutine calls, not back 
     references. PCRE supports relative numbers with this syntax (I don't think
     Oniguruma does).
+    
+12. Previously, a group with a zero repeat such as (...){0} was completely 
+    omitted from the compiled regex. However, this means that if the group
+    was called as a subroutine from elsewhere in the pattern, things went wrong
+    (an internal error was given). Such groups are now left in the compiled 
+    pattern, with a new opcode that causes them to be skipped at execution 
+    time.



Version 7.6 28-Jan-08

Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/HACKING    2008-04-12 14:36:14 UTC (rev 335)
@@ -318,9 +318,12 @@
 positive number) the offset back to the matching bracket opcode.


If a subpattern is quantified such that it is permitted to match zero times, it
-is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte
-opcodes which tell the matcher that skipping this subpattern entirely is a
-valid branch.
+is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
+single-byte opcodes that tell the matcher that skipping the following
+subpattern entirely is a valid branch. In the case of the first two, not
+skipping the pattern is also valid (greedy and non-greedy). The third is used
+when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
+because it may be called as a subroutine from elsewhere in the regex.

A subpattern with an indefinite maximum repetition is replicated in the
compiled data its minimum number of times (or once with OP_BRAZERO if the
@@ -411,4 +414,4 @@
data.

Philip Hazel
-August 2007
+April 2008

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/doc/pcrepattern.3    2008-04-12 14:36:14 UTC (rev 335)
@@ -1259,7 +1259,14 @@
 which may be several bytes long (and they may be of different lengths).
 .P
 The quantifier {0} is permitted, causing the expression to behave as if the
-previous item and the quantifier were not present.
+previous item and the quantifier were not present. This may be useful for 
+subpatterns that are referenced as 
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+subroutines
+.\"
+from elsewhere in the pattern. Items other than subpatterns that have a {0} 
+quantifier are omitted from the compiled pattern.
 .P
 For convenience, the three most common quantifiers have single-character
 abbreviations:
@@ -2232,6 +2239,6 @@
 .rs
 .sp
 .nf
-Last updated: 10 April 2008
+Last updated: 12 April 2008
 Copyright (c) 1997-2008 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/pcre_compile.c    2008-04-12 14:36:14 UTC (rev 335)
@@ -1567,7 +1567,7 @@


/* Groups with zero repeats can of course be empty; skip them. */

-  if (c == OP_BRAZERO || c == OP_BRAMINZERO)
+  if (c == OP_BRAZERO || c == OP_BRAMINZERO || c == OP_SKIPZERO)
     {
     code += _pcre_OP_lengths[c];
     do code += GET(code, 1); while (*code == OP_ALT);
@@ -1847,11 +1847,12 @@
 that is referenced. This means that groups can be replicated for fixed
 repetition simply by copying (because the recursion is allowed to refer to
 earlier groups that are outside the current group). However, when a group is
-optional (i.e. the minimum quantifier is zero), OP_BRAZERO is inserted before
-it, after it has been compiled. This means that any OP_RECURSE items within it
-that refer to the group itself or any contained groups have to have their
-offsets adjusted. That one of the jobs of this function. Before it is called,
-the partially compiled regex must be temporarily terminated with OP_END.
+optional (i.e. the minimum quantifier is zero), OP_BRAZERO or OP_SKIPZERO is
+inserted before it, after it has been compiled. This means that any OP_RECURSE
+items within it that refer to the group itself or any contained groups have to
+have their offsets adjusted. That one of the jobs of this function. Before it
+is called, the partially compiled regex must be temporarily terminated with
+OP_END.


This function has been extended with the possibility of forward references for
recursions and subroutine calls. It must also check the list of such references
@@ -3842,28 +3843,38 @@

       if (repeat_min == 0)
         {
-        /* If the maximum is also zero, we just omit the group from the output
-        altogether. */
+        /* If the maximum is also zero, we used to just omit the group from the
+        output altogether, like this:


-        if (repeat_max == 0)
-          {
-          code = previous;
-          goto END_REPEAT;
-          }
+        ** if (repeat_max == 0)
+        **   {
+        **   code = previous;
+        **   goto END_REPEAT;
+        **   }
+        
+        However, that fails when a group is referenced as a subroutine from 
+        elsewhere in the pattern, so now we stick in OP_SKIPZERO in front of it 
+        so that it is skipped on execution. As we don't have a list of which 
+        groups are referenced, we cannot do this selectively. 


-        /* If the maximum is 1 or unlimited, we just have to stick in the
-        BRAZERO and do no more at this point. However, we do need to adjust
-        any OP_RECURSE calls inside the group that refer to the group itself or
-        any internal or forward referenced group, because the offset is from
-        the start of the whole regex. Temporarily terminate the pattern while
-        doing this. */
+        If the maximum is 1 or unlimited, we just have to stick in the BRAZERO
+        and do no more at this point. However, we do need to adjust any
+        OP_RECURSE calls inside the group that refer to the group itself or any
+        internal or forward referenced group, because the offset is from the
+        start of the whole regex. Temporarily terminate the pattern while doing
+        this. */


-        if (repeat_max <= 1)
+        if (repeat_max <= 1)    /* Covers 0, 1, and unlimited */
           {
           *code = OP_END;
           adjust_recurse(previous, 1, utf8, cd, save_hwm);
           memmove(previous+1, previous, len);
           code++;
+          if (repeat_max == 0)
+            {
+            *previous++ = OP_SKIPZERO;
+            goto END_REPEAT;
+            }    
           *previous++ = OP_BRAZERO + repeat_type;
           }


@@ -6198,7 +6209,7 @@
   if (groupptr == NULL) errorcode = ERR53;
     else PUT(((uschar *)codestart), offset, groupptr - codestart);
   }
-
+  
 /* Give an error if there's back reference to a non-existent capturing
 subpattern. */



Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/pcre_dfa_exec.c    2008-04-12 14:36:14 UTC (rev 335)
@@ -694,6 +694,13 @@
       break;


       /*-----------------------------------------------------------------*/
+      case OP_SKIPZERO:
+      code += 1 + GET(code, 2);
+      while (*code == OP_ALT) code += GET(code, 1);
+      ADD_ACTIVE(code - start_code + 1 + LINK_SIZE, 0);
+      break;
+
+      /*-----------------------------------------------------------------*/
       case OP_CIRC:
       if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
           ((ims & PCRE_MULTILINE) != 0 &&


Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/pcre_exec.c    2008-04-12 14:36:14 UTC (rev 335)
@@ -1148,11 +1148,11 @@
     do ecode += GET(ecode,1); while (*ecode == OP_ALT);
     break;


-    /* BRAZERO and BRAMINZERO occur just before a bracket group, indicating
-    that it may occur zero times. It may repeat infinitely, or not at all -
-    i.e. it could be ()* or ()? in the pattern. Brackets with fixed upper
-    repeat limits are compiled as a number of copies, with the optional ones
-    preceded by BRAZERO or BRAMINZERO. */
+    /* BRAZERO, BRAMINZERO and SKIPZERO occur just before a bracket group,
+    indicating that it may occur zero times. It may repeat infinitely, or not
+    at all - i.e. it could be ()* or ()? or even (){0} in the pattern. Brackets
+    with fixed upper repeat limits are compiled as a number of copies, with the
+    optional ones preceded by BRAZERO or BRAMINZERO. */


     case OP_BRAZERO:
       {
@@ -1174,6 +1174,14 @@
       }
     break;


+    case OP_SKIPZERO:
+      {
+      next = ecode+1;
+      do next += GET(next,1); while (*next == OP_ALT);
+      ecode = next + 1 + LINK_SIZE;
+      }
+    break;
+
     /* End of a group, repeated or non-repeating. */


     case OP_KET:


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/pcre_internal.h    2008-04-12 14:36:14 UTC (rev 335)
@@ -773,7 +773,11 @@
   /* These are forced failure and success verbs */


   OP_FAIL,           /* 108 */
-  OP_ACCEPT          /* 109 */
+  OP_ACCEPT,         /* 109 */
+  
+  /* This is used to skip a subpattern with a {0} quantifier */
+ 
+  OP_SKIPZERO        /* 110 */
 };



@@ -798,7 +802,8 @@
   "AssertB", "AssertB not", "Reverse",                            \
   "Once", "Bra", "CBra", "Cond", "SBra", "SCBra", "SCond",        \
   "Cond ref", "Cond rec", "Cond def", "Brazero", "Braminzero",    \
-  "*PRUNE", "*SKIP", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT"
+  "*PRUNE", "*SKIP", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT",      \
+  "Skip zero" 



 /* This macro defines the length of fixed length operations in the compiled
@@ -863,7 +868,7 @@
   1,                             /* DEF                                    */ \
   1, 1,                          /* BRAZERO, BRAMINZERO                    */ \
   1, 1, 1, 1,                    /* PRUNE, SKIP, THEN, COMMIT,             */ \
-  1, 1                           /* FAIL, ACCEPT                           */
+  1, 1, 1                        /* FAIL, ACCEPT, SKIPZERO                 */



/* A magic value for OP_RREF to indicate the "any recursion" condition. */

Modified: code/trunk/pcre_study.c
===================================================================
--- code/trunk/pcre_study.c    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/pcre_study.c    2008-04-12 14:36:14 UTC (rev 335)
@@ -217,6 +217,13 @@
       tcode += 1 + LINK_SIZE;
       break;


+      /* SKIPZERO skips the bracket. */
+
+      case OP_SKIPZERO:
+      do tcode += GET(tcode,1); while (*tcode == OP_ALT);
+      tcode += 1 + LINK_SIZE;
+      break;
+
       /* Single-char * or ? sets the bit and tries the next item */


       case OP_STAR:


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/testdata/testinput2    2008-04-12 14:36:14 UTC (rev 335)
@@ -2649,4 +2649,10 @@
    abc
    accccbbb 


+/^(?+1)(?<a>x|y){0}z/
+    xzxx
+    yzyy 
+    ** Failers
+    xxz  
+
 / End of testinput2 /


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2008-04-11 15:48:14 UTC (rev 334)
+++ code/trunk/testdata/testoutput2    2008-04-12 14:36:14 UTC (rev 335)
@@ -9515,4 +9515,16 @@
  0: accccbbb
  1: a


+/^(?+1)(?<a>x|y){0}z/
+    xzxx
+ 0: xz
+ 1: <unset>
+    yzyy 
+ 0: yz
+ 1: <unset>
+    ** Failers
+No match
+    xxz  
+No match
+
 / End of testinput2 /