[Pcre-svn] [702] code/trunk: Restore tail-recursion optimiza…

Startseite
Nachricht löschen
Autor: Subversion repository
Datum:  
To: pcre-svn
Betreff: [Pcre-svn] [702] code/trunk: Restore tail-recursion optimizations when no (*THEN) in pattern.
Revision: 702
          http://vcs.pcre.org/viewvc?view=rev&revision=702
Author:   ph10
Date:     2011-09-20 16:45:06 +0100 (Tue, 20 Sep 2011)


Log Message:
-----------
Restore tail-recursion optimizations when no (*THEN) in pattern.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/pcre_compile.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_internal.h
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2011-09-20 11:30:56 UTC (rev 701)
+++ code/trunk/ChangeLog    2011-09-20 15:45:06 UTC (rev 702)
@@ -59,6 +59,12 @@


 10. A pathological pattern such as /(*ACCEPT)a/ was miscompiled, thinking that
     the first byte in a match must be "a".
+    
+11. Change 17 for 8.13 increased the recursion depth for patterns like
+    /a(?:.)*?a/ drastically. I've improved things by remembering whether a
+    pattern contains any instances of (*THEN). If it does not, the old 
+    optimizations are restored. It would be nice to do this on a per-group 
+    basis, but at the moment that is not feasible.



 Version 8.13 16-Aug-2011
@@ -158,7 +164,7 @@
     tail recursion to cut down on stack usage. Unfortunately, now that there is
     the possibility of (*THEN) occurring in these branches, tail recursion is
     no longer possible because the return has to be checked for (*THEN). These
-    two optimizations have therefore been removed.
+    two optimizations have therefore been removed. [But see 8.20/11 above.]


 18. If a pattern containing \R was studied, it was assumed that \R always
     matched two bytes, thus causing the minimum subject length to be


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2011-09-20 11:30:56 UTC (rev 701)
+++ code/trunk/pcre_compile.c    2011-09-20 15:45:06 UTC (rev 702)
@@ -5064,6 +5064,7 @@
               {
               PUT(code, 0, code - bcptr->current_branch - 1);
               code += LINK_SIZE;
+              cd->external_flags |= PCRE_HASTHEN; 
               }
             }



Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2011-09-20 11:30:56 UTC (rev 701)
+++ code/trunk/pcre_exec.c    2011-09-20 15:45:06 UTC (rev 702)
@@ -870,12 +870,17 @@
     /* VVVVVVVVVVVVVVVVVVVVVVVVV */


     /* Non-capturing or atomic group, except for possessive with unlimited
-    repeat. Loop for all the alternatives. When we get to the final alternative
-    within the brackets, we used to return the result of a recursive call to
-    match() whatever happened so it was possible to reduce stack usage by
-    turning this into a tail recursion, except in the case of a possibly empty
-    group. However, now that there is the possiblity of (*THEN) occurring in
-    the final alternative, this optimization is no longer possible.
+    repeat. Loop for all the alternatives. 
+    
+    When we get to the final alternative within the brackets, we used to return
+    the result of a recursive call to match() whatever happened so it was
+    possible to reduce stack usage by turning this into a tail recursion,
+    except in the case of a possibly empty group. However, now that there is
+    the possiblity of (*THEN) occurring in the final alternative, this
+    optimization is no longer always possible.
+    
+    We can optimize if we know there are no (*THEN)s in the pattern; at present 
+    this is the best that can be done. 


     MATCH_ONCE is returned when the end of an atomic group is successfully
     reached, but subsequent matching fails. It passes back up the tree (causing
@@ -892,7 +897,20 @@
     for (;;)
       {
       if (op >= OP_SBRA || op == OP_ONCE) md->match_function_type = MATCH_CBEGROUP;
-      RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, eptrb,
+
+      /* If this is not a possibly empty group, and there are no (*THEN)s in
+      the pattern, and this is the final alternative, optimize as described 
+      above. */
+
+      else if (!md->hasthen && ecode[GET(ecode, 1)] != OP_ALT)
+        {
+        ecode += _pcre_OP_lengths[*ecode];
+        goto TAIL_RECURSE;
+        }  
+
+      /* In all other cases, we have to make another call to match(). */
+
+      RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, eptrb, 
         RM2);
       if (rrc != MATCH_NOMATCH &&
           (rrc != MATCH_THEN || md->start_match_ptr != ecode))
@@ -1264,16 +1282,25 @@
       }


     /* We are now at the branch that is to be obeyed. As there is only one,
-    we used to use tail recursion to avoid using another stack frame, except
-    when there was unlimited repeat of a possibly empty group. However, that
-    strategy no longer works because of the possibilty of (*THEN) being
-    encountered in the branch. A recursive call to match() is always required,
-    unless the second alternative doesn't exist, in which case we can just
-    plough on. */
+    we used always to use tail recursion to avoid using another stack frame,
+    except when there was unlimited repeat of a possibly empty group. However,
+    that strategy no longer works because of the possibilty of (*THEN) being
+    encountered in the branch. However, we can still use tail recursion if
+    there are no (*THEN)s in the pattern. Otherwise, a recursive call to
+    match() is always required, unless the second alternative doesn't exist, in
+    which case we can just plough on. */


     if (condition || *ecode == OP_ALT)
       {
       if (op == OP_SCOND) md->match_function_type = MATCH_CBEGROUP;
+      else if (!md->hasthen)
+        {
+        ecode += 1 + LINK_SIZE;
+        goto TAIL_RECURSE;
+        }  
+
+      /* A call to match() is required. */
+ 
       RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, eptrb, RM49);


       /* If the result is THEN from within the "true" branch of the condition,
@@ -1292,7 +1319,10 @@
         }  
       RRETURN(rrc);
       }
-    else                         /* Condition false & no alternative */
+    
+     /* Condition false & no alternative; continue after the group. */
+      
+    else
       {
       ecode += 1 + LINK_SIZE;
       }
@@ -5945,6 +5975,7 @@
 md->mark = NULL;                        /* In case never set */


 md->recursive = NULL;                   /* No recursion at top level */
+md->hasthen = (re->flags & PCRE_HASTHEN) != 0;


md->lcc = tables + lcc_offset;
md->ctypes = tables + ctypes_offset;

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2011-09-20 11:30:56 UTC (rev 701)
+++ code/trunk/pcre_internal.h    2011-09-20 15:45:06 UTC (rev 702)
@@ -594,6 +594,7 @@
 #define PCRE_STARTLINE     0x0008  /* start after \n for multiline */
 #define PCRE_JCHANGED      0x0010  /* j option used in regex */
 #define PCRE_HASCRORLF     0x0020  /* explicit \r or \n in pattern */
+#define PCRE_HASTHEN       0x0040  /* pattern contains (*THEN) */


/* Flags for the "extra" block produced by pcre_study(). */

@@ -1820,6 +1821,7 @@
   BOOL   notempty_atstart;      /* Empty string match at start not wanted */
   BOOL   hitend;                /* Hit the end of the subject at some point */
   BOOL   bsr_anycrlf;           /* \R is just any CRLF, not full Unicode */
+  BOOL   hasthen;               /* Pattern contains (*THEN) */ 
   const  uschar *start_code;    /* For use when recursing */
   USPTR  start_subject;         /* Start of the subject string */
   USPTR  end_subject;           /* End of the subject string */


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2011-09-20 11:30:56 UTC (rev 701)
+++ code/trunk/testdata/testinput2    2011-09-20 15:45:06 UTC (rev 702)
@@ -3880,4 +3880,7 @@
 /z(*ACCEPT)a/+I
     baxzbx


+/a(?:.)*?a/ims                                                                  
+    \Mabbbbbbbbbbbbbbbbbbbbba
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2011-09-20 11:30:56 UTC (rev 701)
+++ code/trunk/testdata/testoutput2    2011-09-20 15:45:06 UTC (rev 702)
@@ -4430,12 +4430,12 @@
 Need char = 'z'
   aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaazzbbbbbb\M
 Minimum match() limit = 8
-Minimum match() recursion limit = 7
+Minimum match() recursion limit = 6
  0: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaazz
  1: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
   aaaaaaaaaaaaaz\M
 Minimum match() limit = 32768
-Minimum match() recursion limit = 43
+Minimum match() recursion limit = 42
 No match


 /(aaa(?C1)bbb|ab)/I
@@ -6666,7 +6666,7 @@
 No need char
    /* this is a C style comment */\M
 Minimum match() limit = 120
-Minimum match() recursion limit = 35
+Minimum match() recursion limit = 6
  0: /* this is a C style comment */
  1: /* this is a C style comment */


@@ -11951,22 +11951,22 @@
 /^(?>a)++/
     aa\M
 Minimum match() limit = 5
-Minimum match() recursion limit = 3
+Minimum match() recursion limit = 2
  0: aa
     aaaaaaaaa\M 
 Minimum match() limit = 12
-Minimum match() recursion limit = 3
+Minimum match() recursion limit = 2
  0: aaaaaaaaa


 /(a)(?1)++/
     aa\M
 Minimum match() limit = 7
-Minimum match() recursion limit = 5
+Minimum match() recursion limit = 4
  0: aa
  1: a
     aaaaaaaaa\M  
 Minimum match() limit = 21
-Minimum match() recursion limit = 5
+Minimum match() recursion limit = 4
  0: aaaaaaaaa
  1: a


@@ -12322,4 +12322,10 @@
0: z
0+ bx

+/a(?:.)*?a/ims                                                                  
+    \Mabbbbbbbbbbbbbbbbbbbbba
+Minimum match() limit = 65
+Minimum match() recursion limit = 2
+ 0: abbbbbbbbbbbbbbbbbbbbba
+
 /-- End of testinput2 --/