[Pcre-svn] [701] code/trunk: Fix miscompile of /(*ACCEPT)a/,…

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [701] code/trunk: Fix miscompile of /(*ACCEPT)a/, which thought a match had to start with "a".
Revision: 701
          http://vcs.pcre.org/viewvc?view=rev&revision=701
Author:   ph10
Date:     2011-09-20 12:30:56 +0100 (Tue, 20 Sep 2011)


Log Message:
-----------
Fix miscompile of /(*ACCEPT)a/, which thought a match had to start with "a".

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/pcre_compile.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/ChangeLog    2011-09-20 11:30:56 UTC (rev 701)
@@ -4,56 +4,61 @@
 Version 8.20 12-Sep-2011
 ------------------------


-1. Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
-   a POSIX class. After further experiments with Perl, which convinced me that
-   Perl has bugs and confusions, a closing square bracket is no longer allowed
-   in a POSIX name. This bug also affected patterns with classes that started
-   with full stops.
-
-2. If a pattern such as /(a)b|ac/ is matched against "ac", there is no captured
-   substring, but while checking the failing first alternative, substring 1 is
-   temporarily captured. If the output vector supplied to pcre_exec() was not
-   big enough for this capture, the yield of the function was still zero
-   ("insufficient space for captured substrings"). This cannot be totally fixed
-   without adding another stack variable, which seems a lot of expense for a
-   edge case. However, I have improved the situation in cases such as
-   /(a)(b)x|abc/ matched against "abc", where the return code indicates that
-   fewer than the maximum number of slots in the ovector have been set.
-
-3. Related to (2) above: when there are more back references in a pattern than
-   slots in the output vector, pcre_exec() uses temporary memory during
-   matching, and copies in the captures as far as possible afterwards. It was
-   using the entire output vector, but this conflicts with the specification
-   that only 2/3 is used for passing back captured substrings. Now it uses only
-   the first 2/3, for compatibility. This is, of course, another edge case.
-
-4. Zoltan Herczeg's just-in-time compiler support has been integrated into the
-   main code base, and can be used by building with --enable-jit. When this is
-   done, pcregrep automatically uses it unless --disable-pcregrep-jit or the
-   runtime --no-jit option is given.
-
-5. When the number of matches in a pcre_dfa_exec() run exactly filled the
-   ovector, the return from the function was zero, implying that there were
-   other matches that did not fit. The correct "exactly full" value is now
-   returned.
-
-6. If a subpattern that was called recursively or as a subroutine contained
-   (*PRUNE) or any other control that caused it to give a non-standard return,
-   invalid errors such as "Error -26 (nested recursion at the same subject
-   position)" or even infinite loops could occur.
+1.  Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
+    a POSIX class. After further experiments with Perl, which convinced me that
+    Perl has bugs and confusions, a closing square bracket is no longer allowed
+    in a POSIX name. This bug also affected patterns with classes that started
+    with full stops.
+    
+2.  If a pattern such as /(a)b|ac/ is matched against "ac", there is no
+    captured substring, but while checking the failing first alternative,
+    substring 1 is temporarily captured. If the output vector supplied to
+    pcre_exec() was not big enough for this capture, the yield of the function
+    was still zero ("insufficient space for captured substrings"). This cannot
+    be totally fixed without adding another stack variable, which seems a lot
+    of expense for a edge case. However, I have improved the situation in cases
+    such as /(a)(b)x|abc/ matched against "abc", where the return code
+    indicates that fewer than the maximum number of slots in the ovector have
+    been set.
+    
+3.  Related to (2) above: when there are more back references in a pattern than
+    slots in the output vector, pcre_exec() uses temporary memory during
+    matching, and copies in the captures as far as possible afterwards. It was
+    using the entire output vector, but this conflicts with the specification
+    that only 2/3 is used for passing back captured substrings. Now it uses
+    only the first 2/3, for compatibility. This is, of course, another edge
+    case.
+    
+4.  Zoltan Herczeg's just-in-time compiler support has been integrated into the
+    main code base, and can be used by building with --enable-jit. When this is
+    done, pcregrep automatically uses it unless --disable-pcregrep-jit or the
+    runtime --no-jit option is given.
+    
+5.  When the number of matches in a pcre_dfa_exec() run exactly filled the
+    ovector, the return from the function was zero, implying that there were
+    other matches that did not fit. The correct "exactly full" value is now
+    returned.
+    
+6.  If a subpattern that was called recursively or as a subroutine contained
+    (*PRUNE) or any other control that caused it to give a non-standard return,
+    invalid errors such as "Error -26 (nested recursion at the same subject
+    position)" or even infinite loops could occur.
+    
+7.  If a pattern such as /a(*SKIP)c|b(*ACCEPT)|/ was studied, it stopped
+    computing the minimum length on reaching *ACCEPT, and so ended up with the
+    wrong value of 1 rather than 0. Further investigation indicates that
+    computing a minimum subject length in the presence of *ACCEPT is difficult
+    (think back references, subroutine calls), and so I have changed the code
+    so that no minimum is registered for a pattern that contains *ACCEPT.
+    
+8.  If (*THEN) was present in the first (true) branch of a conditional group,
+    it was not handled as intended.
+    
+9.  Replaced RunTest.bat with the much improved version provided by Sheri
+    Pierce.


-7. If a pattern such as /a(*SKIP)c|b(*ACCEPT)|/ was studied, it stopped 
-   computing the minimum length on reaching *ACCEPT, and so ended up with the 
-   wrong value of 1 rather than 0. Further investigation indicates that
-   computing a minimum subject length in the presence of *ACCEPT is difficult
-   (think back references, subroutine calls), and so I have changed the code so
-   that no minimum is registered for a pattern that contains *ACCEPT.
-   
-8. If (*THEN) was present in the first (true) branch of a conditional group,
-   it was not handled as intended. 
-   
-9. Replaced RunTest.bat with the much improved version provided by Sheri 
-   Pierce. 
+10. A pathological pattern such as /(*ACCEPT)a/ was miscompiled, thinking that
+    the first byte in a match must be "a".



Version 8.13 16-Aug-2011

Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/pcre_compile.c    2011-09-20 11:30:56 UTC (rev 701)
@@ -5045,6 +5045,9 @@
               PUT2INC(code, 0, oc->number);
               }
             *code++ = (cd->assert_depth > 0)? OP_ASSERT_ACCEPT : OP_ACCEPT;
+            
+            /* Do not set firstbyte after *ACCEPT */
+            if (firstbyte == REQ_UNSET) firstbyte = REQ_NONE;
             }


           /* Handle other cases with/without an argument */
@@ -6323,7 +6326,7 @@
     byte, set it from this character, but revert to none on a zero repeat.
     Otherwise, leave the firstbyte value alone, and don't change it on a zero
     repeat. */
-
+    
     if (firstbyte == REQ_UNSET)
       {
       zerofirstbyte = REQ_NONE;
@@ -6340,7 +6343,7 @@
       else firstbyte = reqbyte = REQ_NONE;
       }


-    /* firstbyte was previously set; we can set reqbyte only the length is
+    /* firstbyte was previously set; we can set reqbyte only if the length is
     1 or the matching is caseful. */


     else
@@ -7287,7 +7290,7 @@
 re->top_backref = cd->top_backref;
 re->flags = cd->external_flags;


-if (cd->had_accept) reqbyte = -1; /* Must disable after (*ACCEPT) */
+if (cd->had_accept) reqbyte = REQ_NONE; /* Must disable after (*ACCEPT) */

/* If not reached end of pattern on success, there's an excess bracket. */


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/testdata/testinput2    2011-09-20 11:30:56 UTC (rev 701)
@@ -3874,4 +3874,10 @@
     abc\N\N
     bbb\N\N 


+/(*ACCEPT)a/+I
+    bax
+
+/z(*ACCEPT)a/+I
+    baxzbx
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/testdata/testoutput2    2011-09-20 11:30:56 UTC (rev 701)
@@ -12304,4 +12304,22 @@
  0: 
  0+ bb


+/(*ACCEPT)a/+I
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+    bax
+ 0: 
+ 0+ bax
+
+/z(*ACCEPT)a/+I
+Capturing subpattern count = 0
+No options
+First char = 'z'
+No need char
+    baxzbx
+ 0: z
+ 0+ bx
+
 /-- End of testinput2 --/