Revision: 701
http://vcs.pcre.org/viewvc?view=rev&revision=701
Author: ph10
Date: 2011-09-20 12:30:56 +0100 (Tue, 20 Sep 2011)
Log Message:
-----------
Fix miscompile of /(*ACCEPT)a/: the compiler wrongly thought a match had to start with "a".
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/pcre_compile.c
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput2
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/ChangeLog 2011-09-20 11:30:56 UTC (rev 701)
@@ -4,56 +4,61 @@
Version 8.20 12-Sep-2011
------------------------
-1. Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
- a POSIX class. After further experiments with Perl, which convinced me that
- Perl has bugs and confusions, a closing square bracket is no longer allowed
- in a POSIX name. This bug also affected patterns with classes that started
- with full stops.
-
-2. If a pattern such as /(a)b|ac/ is matched against "ac", there is no captured
- substring, but while checking the failing first alternative, substring 1 is
- temporarily captured. If the output vector supplied to pcre_exec() was not
- big enough for this capture, the yield of the function was still zero
- ("insufficient space for captured substrings"). This cannot be totally fixed
- without adding another stack variable, which seems a lot of expense for a
- edge case. However, I have improved the situation in cases such as
- /(a)(b)x|abc/ matched against "abc", where the return code indicates that
- fewer than the maximum number of slots in the ovector have been set.
-
-3. Related to (2) above: when there are more back references in a pattern than
- slots in the output vector, pcre_exec() uses temporary memory during
- matching, and copies in the captures as far as possible afterwards. It was
- using the entire output vector, but this conflicts with the specification
- that only 2/3 is used for passing back captured substrings. Now it uses only
- the first 2/3, for compatibility. This is, of course, another edge case.
-
-4. Zoltan Herczeg's just-in-time compiler support has been integrated into the
- main code base, and can be used by building with --enable-jit. When this is
- done, pcregrep automatically uses it unless --disable-pcregrep-jit or the
- runtime --no-jit option is given.
-
-5. When the number of matches in a pcre_dfa_exec() run exactly filled the
- ovector, the return from the function was zero, implying that there were
- other matches that did not fit. The correct "exactly full" value is now
- returned.
-
-6. If a subpattern that was called recursively or as a subroutine contained
- (*PRUNE) or any other control that caused it to give a non-standard return,
- invalid errors such as "Error -26 (nested recursion at the same subject
- position)" or even infinite loops could occur.
+1. Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
+ a POSIX class. After further experiments with Perl, which convinced me that
+ Perl has bugs and confusions, a closing square bracket is no longer allowed
+ in a POSIX name. This bug also affected patterns with classes that started
+ with full stops.
+
+2. If a pattern such as /(a)b|ac/ is matched against "ac", there is no
+ captured substring, but while checking the failing first alternative,
+ substring 1 is temporarily captured. If the output vector supplied to
+ pcre_exec() was not big enough for this capture, the yield of the function
+ was still zero ("insufficient space for captured substrings"). This cannot
+ be totally fixed without adding another stack variable, which seems a lot
+ of expense for a edge case. However, I have improved the situation in cases
+ such as /(a)(b)x|abc/ matched against "abc", where the return code
+ indicates that fewer than the maximum number of slots in the ovector have
+ been set.
+
+3. Related to (2) above: when there are more back references in a pattern than
+ slots in the output vector, pcre_exec() uses temporary memory during
+ matching, and copies in the captures as far as possible afterwards. It was
+ using the entire output vector, but this conflicts with the specification
+ that only 2/3 is used for passing back captured substrings. Now it uses
+ only the first 2/3, for compatibility. This is, of course, another edge
+ case.
+
+4. Zoltan Herczeg's just-in-time compiler support has been integrated into the
+ main code base, and can be used by building with --enable-jit. When this is
+ done, pcregrep automatically uses it unless --disable-pcregrep-jit or the
+ runtime --no-jit option is given.
+
+5. When the number of matches in a pcre_dfa_exec() run exactly filled the
+ ovector, the return from the function was zero, implying that there were
+ other matches that did not fit. The correct "exactly full" value is now
+ returned.
+
+6. If a subpattern that was called recursively or as a subroutine contained
+ (*PRUNE) or any other control that caused it to give a non-standard return,
+ invalid errors such as "Error -26 (nested recursion at the same subject
+ position)" or even infinite loops could occur.
+
+7. If a pattern such as /a(*SKIP)c|b(*ACCEPT)|/ was studied, it stopped
+ computing the minimum length on reaching *ACCEPT, and so ended up with the
+ wrong value of 1 rather than 0. Further investigation indicates that
+ computing a minimum subject length in the presence of *ACCEPT is difficult
+ (think back references, subroutine calls), and so I have changed the code
+ so that no minimum is registered for a pattern that contains *ACCEPT.
+
+8. If (*THEN) was present in the first (true) branch of a conditional group,
+ it was not handled as intended.
+
+9. Replaced RunTest.bat with the much improved version provided by Sheri
+ Pierce.
-7. If a pattern such as /a(*SKIP)c|b(*ACCEPT)|/ was studied, it stopped
- computing the minimum length on reaching *ACCEPT, and so ended up with the
- wrong value of 1 rather than 0. Further investigation indicates that
- computing a minimum subject length in the presence of *ACCEPT is difficult
- (think back references, subroutine calls), and so I have changed the code so
- that no minimum is registered for a pattern that contains *ACCEPT.
-
-8. If (*THEN) was present in the first (true) branch of a conditional group,
- it was not handled as intended.
-
-9. Replaced RunTest.bat with the much improved version provided by Sheri
- Pierce.
+10. A pathological pattern such as /(*ACCEPT)a/ was miscompiled, thinking that
+ the first byte in a match must be "a".
Version 8.13 16-Aug-2011
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/pcre_compile.c 2011-09-20 11:30:56 UTC (rev 701)
@@ -5045,6 +5045,9 @@
PUT2INC(code, 0, oc->number);
}
*code++ = (cd->assert_depth > 0)? OP_ASSERT_ACCEPT : OP_ACCEPT;
+
+ /* Do not set firstbyte after *ACCEPT */
+ if (firstbyte == REQ_UNSET) firstbyte = REQ_NONE;
}
/* Handle other cases with/without an argument */
@@ -6323,7 +6326,7 @@
byte, set it from this character, but revert to none on a zero repeat.
Otherwise, leave the firstbyte value alone, and don't change it on a zero
repeat. */
-
+
if (firstbyte == REQ_UNSET)
{
zerofirstbyte = REQ_NONE;
@@ -6340,7 +6343,7 @@
else firstbyte = reqbyte = REQ_NONE;
}
- /* firstbyte was previously set; we can set reqbyte only the length is
+ /* firstbyte was previously set; we can set reqbyte only if the length is
1 or the matching is caseful. */
else
@@ -7287,7 +7290,7 @@
re->top_backref = cd->top_backref;
re->flags = cd->external_flags;
-if (cd->had_accept) reqbyte = -1; /* Must disable after (*ACCEPT) */
+if (cd->had_accept) reqbyte = REQ_NONE; /* Must disable after (*ACCEPT) */
/* If not reached end of pattern on success, there's an excess bracket. */
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/testdata/testinput2 2011-09-20 11:30:56 UTC (rev 701)
@@ -3874,4 +3874,10 @@
abc\N\N
bbb\N\N
+/(*ACCEPT)a/+I
+ bax
+
+/z(*ACCEPT)a/+I
+ baxzbx
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2011-09-20 10:51:26 UTC (rev 700)
+++ code/trunk/testdata/testoutput2 2011-09-20 11:30:56 UTC (rev 701)
@@ -12304,4 +12304,22 @@
0:
0+ bb
+/(*ACCEPT)a/+I
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+ bax
+ 0:
+ 0+ bax
+
+/z(*ACCEPT)a/+I
+Capturing subpattern count = 0
+No options
+First char = 'z'
+No need char
+ baxzbx
+ 0: z
+ 0+ bx
+
/-- End of testinput2 --/
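
Note on the effect of the fix: the following is a minimal toy model, not code from pcre_compile.c or pcre_exec.c, of why a wrongly recorded first byte changes the result for /(*ACCEPT)a/. The names toy_match_accept_a, TOY_REQ_NONE and toy_self_check are hypothetical and exist only for this sketch. It assumes that a first-byte hint makes the matcher skip every start position whose byte differs from the hint, and that (*ACCEPT) ends the match immediately with an empty string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for "no first byte known"; not PCRE's real value. */
#define TOY_REQ_NONE (-1)

/* Toy matcher for /(*ACCEPT)a/. Because (*ACCEPT) succeeds at once, the
   match is empty and begins at the first start offset that is tried.
   Returns that offset, or -1 if the first-byte scan rejects every start
   position in the subject. */
static int toy_match_accept_a(const char *subject, int firstbyte)
{
    size_t len = strlen(subject);
    size_t start;

    for (start = 0; start <= len; start++)
    {
        /* First-byte optimization: skip positions that cannot start a
           match according to the hint recorded at compile time. */
        if (firstbyte != TOY_REQ_NONE &&
            (start == len || (unsigned char)subject[start] != firstbyte))
            continue;

        return (int)start;   /* (*ACCEPT): empty match begins here */
    }
    return -1;
}

/* Small self-check covering the hinted and unhinted cases. */
static void toy_self_check(void)
{
    /* No hint (the behaviour after this revision): match at offset 0. */
    assert(toy_match_accept_a("bax", TOY_REQ_NONE) == 0);

    /* Wrong hint 'a' (the miscompile): the scan skips the leading "b". */
    assert(toy_match_accept_a("bax", 'a') == 1);

    /* Wrong hint and no "a" in the subject: the match is missed. */
    assert(toy_match_accept_a("xyz", 'a') == -1);
}
```

In this model the unhinted call on "bax" starts at offset 0, which agrees with the "0+ bax" line in the new testoutput2 entry. The hinted call starts at offset 1, and on a subject with no "a" it finds nothing at all.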