Revision: 550
http://vcs.pcre.org/viewvc?view=rev&revision=550
Author: ph10
Date: 2010-10-10 17:24:11 +0100 (Sun, 10 Oct 2010)
Log Message:
-----------
Fix problem with (*THEN) not backing up far enough.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/HACKING
code/trunk/configure.ac
code/trunk/doc/pcrepattern.3
code/trunk/pcre_compile.c
code/trunk/pcre_exec.c
code/trunk/pcre_internal.h
code/trunk/pcre_printint.src
code/trunk/pcre_study.c
code/trunk/testdata/testinput10
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput10
code/trunk/testdata/testoutput2
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/ChangeLog 2010-10-10 16:24:11 UTC (rev 550)
@@ -1,6 +1,17 @@
ChangeLog for PCRE
------------------
+Version 8.11 10-Oct-2010
+------------------------
+
+1. (*THEN) was not working properly if there were untried alternatives prior
+ to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it
+ backtracked to try for "b" instead of moving to the next alternative branch
+ at the same level (in this case, to look for "c"). The Perl documentation
+ is clear that when (*THEN) is backtracked onto, it goes to the "next
+ alternative in the innermost enclosing group".
+
+
Version 8.10 25-Jun-2010
------------------------
Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/HACKING 2010-10-10 16:24:11 UTC (rev 550)
@@ -4,6 +4,7 @@
These are very rough technical notes that record potentially useful information
about PCRE internals.
+
Historical note 1
-----------------
@@ -22,6 +23,7 @@
not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.
+
Historical note 2
-----------------
@@ -34,6 +36,7 @@
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.
+
OK, here's the real stuff
-------------------------
@@ -44,6 +47,7 @@
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.
+
Computing the memory requirement: how it was
--------------------------------------------
@@ -54,6 +58,7 @@
the first pass is degenerate and the second pass can just store stuff straight
into the vector, which it knows is big enough.
+
Computing the memory requirement: how it is
-------------------------------------------
@@ -75,6 +80,7 @@
is doing a full analysis of the pattern. My hope was that this would not be a
big issue, and in the event, nobody has commented on it.
+
Traditional matching function
-----------------------------
@@ -84,6 +90,7 @@
as compatible with Perl as possible. This is the function most users of PCRE
will use most of the time.
+
Supplementary matching function
-------------------------------
@@ -119,7 +126,6 @@
A list of the opcodes follows:
-
Opcodes with no following data
------------------------------
@@ -151,14 +157,26 @@
OP_EXTUNI match an extended Unicode character
OP_ANYNL match any Unicode newline sequence
- OP_ACCEPT ) These are Perl 5.10's "backtracking
- OP_COMMIT ) control verbs". If OP_ACCEPT is inside
- OP_FAIL ) capturing parentheses, it may be preceded
- OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
- OP_SKIP ) number, indicating which parentheses must be
- OP_THEN ) closed.
+ OP_ACCEPT ) These are Perl 5.10's "backtracking control
+ OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
+ OP_FAIL ) parentheses, it may be preceded by one or more
+ OP_PRUNE ) OP_CLOSE, followed by a 2-byte number,
+ OP_SKIP ) indicating which parentheses must be closed.
+Backtracking control verbs with data
+------------------------------------
+
+OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
+start of the current branch.
+
+OP_MARK is followed by the mark name, preceded by a one-byte length, and
+followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
+the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
+two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
+offset value.
+
+
Repeating single characters
---------------------------
@@ -419,4 +437,4 @@
data.
Philip Hazel
-October 2009
+October 2010
Modified: code/trunk/configure.ac
===================================================================
--- code/trunk/configure.ac 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/configure.ac 2010-10-10 16:24:11 UTC (rev 550)
@@ -9,9 +9,9 @@
dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8])
-m4_define(pcre_minor, [10])
-m4_define(pcre_prerelease, [])
-m4_define(pcre_date, [2010-06-25])
+m4_define(pcre_minor, [11])
+m4_define(pcre_prerelease, [-RC1])
+m4_define(pcre_date, [2010-10-09])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/doc/pcrepattern.3 2010-10-10 16:24:11 UTC (rev 550)
@@ -2630,10 +2630,10 @@
.sp
(*THEN) or (*THEN:NAME)
.sp
-This verb causes a skip to the next alternation if the rest of the pattern does
-not match. That is, it cancels pending backtracking, but only within the
-current alternation. Its name comes from the observation that it can be used
-for a pattern-based if-then-else block:
+This verb causes a skip to the next alternation in the innermost enclosing
+group if the rest of the pattern does not match. That is, it cancels pending
+backtracking, but only within the current alternation. Its name comes from the
+observation that it can be used for a pattern-based if-then-else block:
.sp
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
.sp
@@ -2666,6 +2666,6 @@
.rs
.sp
.nf
-Last updated: 18 May 2010
+Last updated: 10 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_compile.c 2010-10-10 16:24:11 UTC (rev 550)
@@ -1724,9 +1724,12 @@
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
code += code[1];
break;
+
+ case OP_THEN_ARG:
+ code += code[1+LINK_SIZE];
+ break;
}
/* Add in the fixed length from the table */
@@ -1827,9 +1830,12 @@
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
code += code[1];
break;
+
+ case OP_THEN_ARG:
+ code += code[1+LINK_SIZE];
+ break;
}
/* Add in the fixed length from the table */
@@ -2105,10 +2111,13 @@
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
code += code[1];
break;
+ case OP_THEN_ARG:
+ code += code[1+LINK_SIZE];
+ break;
+
/* None of the remaining opcodes are required to match a character. */
default:
@@ -4808,7 +4817,12 @@
*errorcodeptr = ERR66;
goto FAILED;
}
- *code++ = verbs[i].op;
+ *code = verbs[i].op;
+ if (*code++ == OP_THEN)
+ {
+ PUT(code, 0, code - bcptr->current_branch - 1);
+ code += LINK_SIZE;
+ }
}
else
@@ -4818,7 +4832,12 @@
*errorcodeptr = ERR59;
goto FAILED;
}
- *code++ = verbs[i].op_arg;
+ *code = verbs[i].op_arg;
+ if (*code++ == OP_THEN_ARG)
+ {
+ PUT(code, 0, code - bcptr->current_branch - 1);
+ code += LINK_SIZE;
+ }
*code++ = arglen;
memcpy(code, arg, arglen);
code += arglen;
Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_exec.c 2010-10-10 16:24:11 UTC (rev 550)
@@ -748,18 +748,25 @@
md->start_match_ptr = ecode + 2;
RRETURN(MATCH_SKIP_ARG);
+
+ /* For THEN (and THEN_ARG) we pass back the address of the bracket or
+ the alt that is at the start of the current branch. This makes it possible
+ to skip back past alternatives that precede the THEN within the current
+ branch. */
case OP_THEN:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM54);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ md->start_match_ptr = ecode - GET(ecode, 1);
MRRETURN(MATCH_THEN);
case OP_THEN_ARG:
- RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
- ims, eptrb, flags, RM58);
+ RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
+ offset_top, md, ims, eptrb, flags, RM58);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- md->mark = ecode + 2;
+ md->start_match_ptr = ecode - GET(ecode, 1);
+ md->mark = ecode + LINK_SIZE + 2;
RRETURN(MATCH_THEN);
/* Handle a capturing bracket. If there is space in the offset vector, save
@@ -804,7 +811,9 @@
{
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM1);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
md->capture_last = save_capture_last;
ecode += GET(ecode, 1);
}
@@ -865,7 +874,9 @@
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
eptrb, flags, RM2);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode, 1);
}
/* Control never reaches here. */
@@ -1066,7 +1077,8 @@
ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
while (*ecode == OP_ALT) ecode += GET(ecode, 1);
}
- else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+ else if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
{
RRETURN(rrc); /* Need braces because of following else */
}
@@ -1194,7 +1206,9 @@
mstart = md->start_match_ptr; /* In case \K reset it */
break;
}
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode, 1);
}
while (*ecode == OP_ALT);
@@ -1228,7 +1242,9 @@
do ecode += GET(ecode,1); while (*ecode == OP_ALT);
break;
}
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode,1);
}
while (*ecode == OP_ALT);
@@ -1365,7 +1381,8 @@
(pcre_free)(new_recursive.offset_save);
MRRETURN(MATCH_MATCH);
}
- else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+ else if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
{
DPRINTF(("Recursion gave error %d\n", rrc));
if (new_recursive.offset_save != stacksave)
@@ -1408,7 +1425,9 @@
mstart = md->start_match_ptr;
break;
}
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode,1);
}
while (*ecode == OP_ALT);
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_internal.h 2010-10-10 16:24:11 UTC (rev 550)
@@ -1514,8 +1514,9 @@
3, 3, /* RREF, NRREF */ \
1, /* DEF */ \
1, 1, /* BRAZERO, BRAMINZERO */ \
- 3, 1, 3, /* MARK, PRUNE, PRUNE_ARG, */ \
- 1, 3, 1, 3, /* SKIP, SKIP_ARG, THEN, THEN_ARG, */ \
+ 3, 1, 3, /* MARK, PRUNE, PRUNE_ARG */ \
+ 1, 3, /* SKIP, SKIP_ARG */ \
+ 1+LINK_SIZE, 3+LINK_SIZE, /* THEN, THEN_ARG */ \
1, 1, 1, 3, 1 /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO */
Modified: code/trunk/pcre_printint.src
===================================================================
--- code/trunk/pcre_printint.src 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_printint.src 2010-10-10 16:24:11 UTC (rev 550)
@@ -537,11 +537,26 @@
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
fprintf(f, " %s %s", OP_names[*code], code + 2);
extra += code[1];
break;
+
+ case OP_THEN:
+ if (print_lengths)
+ fprintf(f, " %s %d", OP_names[*code], GET(code, 1));
+ else
+ fprintf(f, " %s", OP_names[*code]);
+ break;
+ case OP_THEN_ARG:
+ if (print_lengths)
+ fprintf(f, " %s %d %s", OP_names[*code], GET(code, 1),
+ code + 2 + LINK_SIZE);
+ else
+ fprintf(f, " %s %s", OP_names[*code], code + 2 + LINK_SIZE);
+ extra += code[1+LINK_SIZE];
+ break;
+
/* Anything else is just an item with no data*/
default:
Modified: code/trunk/pcre_study.c
===================================================================
--- code/trunk/pcre_study.c 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/pcre_study.c 2010-10-10 16:24:11 UTC (rev 550)
@@ -419,10 +419,13 @@
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
cc += _pcre_OP_lengths[op] + cc[1];
break;
+ case OP_THEN_ARG:
+ cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
+ break;
+
/* For the record, these are the opcodes that are matched by "default":
OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
OP_THEN. */
Modified: code/trunk/testdata/testinput10
===================================================================
--- code/trunk/testdata/testinput10 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testinput10 2010-10-10 16:24:11 UTC (rev 550)
@@ -132,4 +132,6 @@
/[[:^alpha:]\S]+/8WB
+/abc(d|e)(*THEN)x(123(*THEN)4|567(b|q)(*THEN)xx)/B
+
/-- End of testinput10 --/
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testinput2 2010-10-10 16:24:11 UTC (rev 550)
@@ -3464,22 +3464,22 @@
abcde
/A\NB./BZ
- ACBD
- ** Failers
- A\nB
- ACB\n
+ ACBD
+ *** Failers
+ A\nB
+ ACB\n
/A\NB./sBZ
- ACBD
- ACB\n
- ** Failers
- A\nB
+ ACBD
+ ACB\n
+ *** Failers
+ A\nB
/A\NB/<crlf>
- A\nB
- A\rB
- ** Failers
- A\r\nB
+ A\nB
+ A\rB
+ ** Failers
+ A\r\nB
/\R+b/BZ
@@ -3491,4 +3491,12 @@
/\s*\R/BZ
+/-- Perl treats this one differently, not failing the second string. I believe
+ that is a bug in Perl. --/
+
+/^((abc|abcx)(*THEN)y|abcd)/
+ abcd
+ *** Failers
+ abcxy
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testoutput10
===================================================================
--- code/trunk/testdata/testoutput10 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testoutput10 2010-10-10 16:24:11 UTC (rev 550)
@@ -707,4 +707,33 @@
18 End
------------------------------------------------------------------
+/abc(d|e)(*THEN)x(123(*THEN)4|567(b|q)(*THEN)xx)/B
+------------------------------------------------------------------
+ 0 79 Bra
+ 3 abc
+ 9 7 CBra 1
+ 14 d
+ 16 5 Alt
+ 19 e
+ 21 12 Ket
+ 24 *THEN 24
+ 27 x
+ 29 16 CBra 2
+ 34 123
+ 40 *THEN 11
+ 43 4
+ 45 31 Alt
+ 48 567
+ 54 7 CBra 3
+ 59 b
+ 61 5 Alt
+ 64 q
+ 66 12 Ket
+ 69 *THEN 24
+ 72 xx
+ 76 47 Ket
+ 79 79 Ket
+ 82 End
+------------------------------------------------------------------
+
/-- End of testinput10 --/
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2010-06-25 14:43:53 UTC (rev 549)
+++ code/trunk/testdata/testoutput2 2010-10-10 16:24:11 UTC (rev 550)
@@ -11043,13 +11043,13 @@
Ket
End
------------------------------------------------------------------
- ACBD
+ ACBD
0: ACBD
- ** Failers
+ *** Failers
No match
- A\nB
+ A\nB
No match
- ACB\n
+ ACB\n
No match
/A\NB./sBZ
@@ -11062,23 +11062,23 @@
Ket
End
------------------------------------------------------------------
- ACBD
+ ACBD
0: ACBD
- ACB\n
+ ACB\n
0: ACB\x0a
- ** Failers
+ *** Failers
No match
- A\nB
+ A\nB
No match
/A\NB/<crlf>
- A\nB
+ A\nB
0: A\x0aB
- A\rB
+ A\rB
0: A\x0dB
- ** Failers
+ ** Failers
No match
- A\r\nB
+ A\r\nB
No match
/\R+b/BZ
@@ -11126,4 +11126,16 @@
End
------------------------------------------------------------------
+/-- Perl treats this one differently, not failing the second string. I believe
+ that is a bug in Perl. --/
+
+/^((abc|abcx)(*THEN)y|abcd)/
+ abcd
+ 0: abcd
+ 1: abcd
+ *** Failers
+No match
+ abcxy
+No match
+
/-- End of testinput2 --/