[Pcre-svn] [550] code/trunk: Fix problem with (*THEN) not ba…

Página Inicial
Delete this message
Autor: Subversion repository
Data:  
Para: pcre-svn
Assunto: [Pcre-svn] [550] code/trunk: Fix problem with (*THEN) not backing up far enough.
Revision: 550
          http://vcs.pcre.org/viewvc?view=rev&revision=550
Author:   ph10
Date:     2010-10-10 17:24:11 +0100 (Sun, 10 Oct 2010)


Log Message:
-----------
Fix problem with (*THEN) not backing up far enough.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/HACKING
    code/trunk/configure.ac
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_internal.h
    code/trunk/pcre_printint.src
    code/trunk/pcre_study.c
    code/trunk/testdata/testinput10
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput10
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/ChangeLog    2010-10-10 16:24:11 UTC (rev 550)
@@ -1,6 +1,17 @@
 ChangeLog for PCRE
 ------------------


+Version 8.11 10-Oct-2010
+------------------------
+
+1.  (*THEN) was not working properly if there were untried alternatives prior
+    to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it 
+    backtracked to try for "b" instead of moving to the next alternative branch 
+    at the same level (in this case, to look for "c"). The Perl documentation 
+    is clear that when (*THEN) is backtracked onto, it goes to the "next 
+    alternative in the innermost enclosing group".
+
+
 Version 8.10 25-Jun-2010
 ------------------------



Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/HACKING    2010-10-10 16:24:11 UTC (rev 550)
@@ -4,6 +4,7 @@
 These are very rough technical notes that record potentially useful information 
 about PCRE internals.


+
Historical note 1
-----------------

@@ -22,6 +23,7 @@
not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.

+
Historical note 2
-----------------

@@ -34,6 +36,7 @@
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.

+
OK, here's the real stuff
-------------------------

@@ -44,6 +47,7 @@
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.

+
Computing the memory requirement: how it was
--------------------------------------------

@@ -54,6 +58,7 @@
the first pass is degenerate and the second pass can just store stuff straight
into the vector, which it knows is big enough.

+
Computing the memory requirement: how it is
-------------------------------------------

@@ -75,6 +80,7 @@
is doing a full analysis of the pattern. My hope was that this would not be a
big issue, and in the event, nobody has commented on it.

+
Traditional matching function
-----------------------------

@@ -84,6 +90,7 @@
as compatible with Perl as possible. This is the function most users of PCRE
will use most of the time.

+
Supplementary matching function
-------------------------------

@@ -119,7 +126,6 @@

A list of the opcodes follows:

-
Opcodes with no following data
------------------------------

@@ -151,14 +157,26 @@
   OP_EXTUNI              match an extended Unicode character 
   OP_ANYNL               match any Unicode newline sequence 


-  OP_ACCEPT              ) These are Perl 5.10's "backtracking    
-  OP_COMMIT              ) control verbs". If OP_ACCEPT is inside
-  OP_FAIL                ) capturing parentheses, it may be preceded 
-  OP_PRUNE               ) by one or more OP_CLOSE, followed by a 2-byte 
-  OP_SKIP                ) number, indicating which parentheses must be
-  OP_THEN                ) closed.
+  OP_ACCEPT              ) These are Perl 5.10's "backtracking control   
+  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
+  OP_FAIL                ) parentheses, it may be preceded by one or more
+  OP_PRUNE               ) OP_CLOSE, followed by a 2-byte number,
+  OP_SKIP                ) indicating which parentheses must be closed.



+Backtracking control verbs with data
+------------------------------------
+
+OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
+start of the current branch.
+
+OP_MARK is followed by the mark name, preceded by a one-byte length, and
+followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
+the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
+two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
+offset value.
+
+
Repeating single characters
---------------------------

@@ -419,4 +437,4 @@
data.

Philip Hazel
-October 2009
+October 2010

Modified: code/trunk/configure.ac
===================================================================
--- code/trunk/configure.ac    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/configure.ac    2010-10-10 16:24:11 UTC (rev 550)
@@ -9,9 +9,9 @@
 dnl be defined as -RC2, for example. For real releases, it should be empty.


m4_define(pcre_major, [8])
-m4_define(pcre_minor, [10])
-m4_define(pcre_prerelease, [])
-m4_define(pcre_date, [2010-06-25])
+m4_define(pcre_minor, [11])
+m4_define(pcre_prerelease, [-RC1])
+m4_define(pcre_date, [2010-10-09])

# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/doc/pcrepattern.3    2010-10-10 16:24:11 UTC (rev 550)
@@ -2630,10 +2630,10 @@
 .sp
   (*THEN) or (*THEN:NAME)
 .sp
-This verb causes a skip to the next alternation if the rest of the pattern does
-not match. That is, it cancels pending backtracking, but only within the
-current alternation. Its name comes from the observation that it can be used
-for a pattern-based if-then-else block:
+This verb causes a skip to the next alternation in the innermost enclosing 
+group if the rest of the pattern does not match. That is, it cancels pending
+backtracking, but only within the current alternation. Its name comes from the
+observation that it can be used for a pattern-based if-then-else block:
 .sp
   ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
 .sp
@@ -2666,6 +2666,6 @@
 .rs
 .sp
 .nf
-Last updated: 18 May 2010
+Last updated: 10 October 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_compile.c    2010-10-10 16:24:11 UTC (rev 550)
@@ -1724,9 +1724,12 @@
       case OP_MARK:
       case OP_PRUNE_ARG:
       case OP_SKIP_ARG:
-      case OP_THEN_ARG:
       code += code[1];
       break;
+
+      case OP_THEN_ARG:
+      code += code[1+LINK_SIZE];
+      break;
       }


     /* Add in the fixed length from the table */
@@ -1827,9 +1830,12 @@
       case OP_MARK:
       case OP_PRUNE_ARG:
       case OP_SKIP_ARG:
-      case OP_THEN_ARG:
       code += code[1];
       break;
+
+      case OP_THEN_ARG:
+      code += code[1+LINK_SIZE];
+      break;
       }


     /* Add in the fixed length from the table */
@@ -2105,10 +2111,13 @@
     case OP_MARK:
     case OP_PRUNE_ARG:
     case OP_SKIP_ARG:
-    case OP_THEN_ARG:
     code += code[1];
     break;


+    case OP_THEN_ARG:
+    code += code[1+LINK_SIZE];
+    break;
+
     /* None of the remaining opcodes are required to match a character. */


     default:
@@ -4808,7 +4817,12 @@
               *errorcodeptr = ERR66;
               goto FAILED;
               }
-            *code++ = verbs[i].op;
+            *code = verbs[i].op;
+            if (*code++ == OP_THEN)
+              {
+              PUT(code, 0, code - bcptr->current_branch - 1);
+              code += LINK_SIZE; 
+              }  
             }


           else
@@ -4818,7 +4832,12 @@
               *errorcodeptr = ERR59;
               goto FAILED;
               }
-            *code++ = verbs[i].op_arg;
+            *code = verbs[i].op_arg;
+            if (*code++ == OP_THEN_ARG)
+              {
+              PUT(code, 0, code - bcptr->current_branch - 1);
+              code += LINK_SIZE; 
+              }  
             *code++ = arglen;
             memcpy(code, arg, arglen);
             code += arglen;


Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_exec.c    2010-10-10 16:24:11 UTC (rev 550)
@@ -748,18 +748,25 @@


     md->start_match_ptr = ecode + 2;
     RRETURN(MATCH_SKIP_ARG);
+    
+    /* For THEN (and THEN_ARG) we pass back the address of the bracket or
+    the alt that is at the start of the current branch. This makes it possible 
+    to skip back past alternatives that precede the THEN within the current 
+    branch. */ 


     case OP_THEN:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
       ims, eptrb, flags, RM54);
     if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    md->start_match_ptr = ecode - GET(ecode, 1);
     MRRETURN(MATCH_THEN);


     case OP_THEN_ARG:
-    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
-      ims, eptrb, flags, RM58);
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE], 
+      offset_top, md, ims, eptrb, flags, RM58);
     if (rrc != MATCH_NOMATCH) RRETURN(rrc);
-    md->mark = ecode + 2;
+    md->start_match_ptr = ecode - GET(ecode, 1);
+    md->mark = ecode + LINK_SIZE + 2;
     RRETURN(MATCH_THEN);


     /* Handle a capturing bracket. If there is space in the offset vector, save
@@ -804,7 +811,9 @@
         {
         RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
           ims, eptrb, flags, RM1);
-        if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+        if (rrc != MATCH_NOMATCH &&
+            (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+          RRETURN(rrc);
         md->capture_last = save_capture_last;
         ecode += GET(ecode, 1);
         }
@@ -865,7 +874,9 @@


       RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
         eptrb, flags, RM2);
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode, 1);
       }
     /* Control never reaches here. */
@@ -1066,7 +1077,8 @@
         ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
         while (*ecode == OP_ALT) ecode += GET(ecode, 1);
         }
-      else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+      else if (rrc != MATCH_NOMATCH &&
+              (rrc != MATCH_THEN || md->start_match_ptr != ecode))
         {
         RRETURN(rrc);         /* Need braces because of following else */
         }
@@ -1194,7 +1206,9 @@
         mstart = md->start_match_ptr;   /* In case \K reset it */
         break;
         }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode, 1);
       }
     while (*ecode == OP_ALT);
@@ -1228,7 +1242,9 @@
         do ecode += GET(ecode,1); while (*ecode == OP_ALT);
         break;
         }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode,1);
       }
     while (*ecode == OP_ALT);
@@ -1365,7 +1381,8 @@
             (pcre_free)(new_recursive.offset_save);
           MRRETURN(MATCH_MATCH);
           }
-        else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+        else if (rrc != MATCH_NOMATCH &&
+                (rrc != MATCH_THEN || md->start_match_ptr != ecode))
           {
           DPRINTF(("Recursion gave error %d\n", rrc));
           if (new_recursive.offset_save != stacksave)
@@ -1408,7 +1425,9 @@
         mstart = md->start_match_ptr;
         break;
         }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode,1);
       }
     while (*ecode == OP_ALT);


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_internal.h    2010-10-10 16:24:11 UTC (rev 550)
@@ -1514,8 +1514,9 @@
   3, 3,                          /* RREF, NRREF                            */ \
   1,                             /* DEF                                    */ \
   1, 1,                          /* BRAZERO, BRAMINZERO                    */ \
-  3, 1, 3,                       /* MARK, PRUNE, PRUNE_ARG,                */ \
-  1, 3, 1, 3,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG,        */ \
+  3, 1, 3,                       /* MARK, PRUNE, PRUNE_ARG                 */ \
+  1, 3,                          /* SKIP, SKIP_ARG                         */ \
+  1+LINK_SIZE, 3+LINK_SIZE,      /* THEN, THEN_ARG                         */ \
   1, 1, 1, 3, 1                  /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO  */




Modified: code/trunk/pcre_printint.src
===================================================================
--- code/trunk/pcre_printint.src    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_printint.src    2010-10-10 16:24:11 UTC (rev 550)
@@ -537,11 +537,26 @@
     case OP_MARK:
     case OP_PRUNE_ARG:
     case OP_SKIP_ARG:
-    case OP_THEN_ARG:
     fprintf(f, "    %s %s", OP_names[*code], code + 2);
     extra += code[1];
     break;
+    
+    case OP_THEN:
+    if (print_lengths) 
+      fprintf(f, "    %s %d", OP_names[*code], GET(code, 1)); 
+    else   
+      fprintf(f, "    %s", OP_names[*code]); 
+    break;  


+    case OP_THEN_ARG:
+    if (print_lengths) 
+      fprintf(f, "    %s %d %s", OP_names[*code], GET(code, 1),
+        code + 2 + LINK_SIZE);
+    else     
+      fprintf(f, "    %s %s", OP_names[*code], code + 2 + LINK_SIZE);
+    extra += code[1+LINK_SIZE];
+    break;
+
     /* Anything else is just an item with no data*/


     default:


Modified: code/trunk/pcre_study.c
===================================================================
--- code/trunk/pcre_study.c    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_study.c    2010-10-10 16:24:11 UTC (rev 550)
@@ -419,10 +419,13 @@
     case OP_MARK:
     case OP_PRUNE_ARG:
     case OP_SKIP_ARG:
-    case OP_THEN_ARG:
     cc += _pcre_OP_lengths[op] + cc[1];
     break;


+    case OP_THEN_ARG:
+    cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
+    break;
+
     /* For the record, these are the opcodes that are matched by "default":
     OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
     OP_THEN. */


Modified: code/trunk/testdata/testinput10
===================================================================
--- code/trunk/testdata/testinput10    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testinput10    2010-10-10 16:24:11 UTC (rev 550)
@@ -132,4 +132,6 @@


/[[:^alpha:]\S]+/8WB

+/abc(d|e)(*THEN)x(123(*THEN)4|567(b|q)(*THEN)xx)/B
+
/-- End of testinput10 --/

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testinput2    2010-10-10 16:24:11 UTC (rev 550)
@@ -3464,22 +3464,22 @@
     abcde


 /A\NB./BZ
-  ACBD
-  ** Failers
-  A\nB
-  ACB\n   
+    ACBD
+    *** Failers
+    A\nB
+    ACB\n   


 /A\NB./sBZ
-  ACBD
-  ACB\n 
-  ** Failers
-  A\nB  
+    ACBD
+    ACB\n 
+    *** Failers
+    A\nB  


 /A\NB/<crlf>
-  A\nB
-  A\rB
-  ** Failers
-  A\r\nB    
+    A\nB
+    A\rB
+    ** Failers
+    A\r\nB    


/\R+b/BZ

@@ -3491,4 +3491,12 @@

/\s*\R/BZ

+/-- Perl treats this one differently, not failing the second string. I believe
+    that is a bug in Perl. --/
+
+/^((abc|abcx)(*THEN)y|abcd)/
+    abcd
+    *** Failers 
+    abcxy 
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testoutput10
===================================================================
--- code/trunk/testdata/testoutput10    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testoutput10    2010-10-10 16:24:11 UTC (rev 550)
@@ -707,4 +707,33 @@
  18     End
 ------------------------------------------------------------------


+/abc(d|e)(*THEN)x(123(*THEN)4|567(b|q)(*THEN)xx)/B
+------------------------------------------------------------------
+  0  79 Bra
+  3     abc
+  9   7 CBra 1
+ 14     d
+ 16   5 Alt
+ 19     e
+ 21  12 Ket
+ 24     *THEN 24
+ 27     x
+ 29  16 CBra 2
+ 34     123
+ 40     *THEN 11
+ 43     4
+ 45  31 Alt
+ 48     567
+ 54   7 CBra 3
+ 59     b
+ 61   5 Alt
+ 64     q
+ 66  12 Ket
+ 69     *THEN 24
+ 72     xx
+ 76  47 Ket
+ 79  79 Ket
+ 82     End
+------------------------------------------------------------------
+
 /-- End of testinput10 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testoutput2    2010-10-10 16:24:11 UTC (rev 550)
@@ -11043,13 +11043,13 @@
         Ket
         End
 ------------------------------------------------------------------
-  ACBD
+    ACBD
  0: ACBD
-  ** Failers
+    *** Failers
 No match
-  A\nB
+    A\nB
 No match
-  ACB\n   
+    ACB\n   
 No match


 /A\NB./sBZ
@@ -11062,23 +11062,23 @@
         Ket
         End
 ------------------------------------------------------------------
-  ACBD
+    ACBD
  0: ACBD
-  ACB\n 
+    ACB\n 
  0: ACB\x0a
-  ** Failers
+    *** Failers
 No match
-  A\nB  
+    A\nB  
 No match


 /A\NB/<crlf>
-  A\nB
+    A\nB
  0: A\x0aB
-  A\rB
+    A\rB
  0: A\x0dB
-  ** Failers
+    ** Failers
 No match
-  A\r\nB    
+    A\r\nB    
 No match


 /\R+b/BZ
@@ -11126,4 +11126,16 @@
         End
 ------------------------------------------------------------------


+/-- Perl treats this one differently, not failing the second string. I believe
+    that is a bug in Perl. --/
+
+/^((abc|abcx)(*THEN)y|abcd)/
+    abcd
+ 0: abcd
+ 1: abcd
+    *** Failers 
+No match
+    abcxy 
+No match
+
 /-- End of testinput2 --/