Revision: 553
http://vcs.pcre.org/viewvc?view=rev&revision=553
Author: ph10
Date: 2010-10-22 16:57:50 +0100 (Fri, 22 Oct 2010)
Log Message:
-----------
Change the way PCRE_PARTIAL_HARD handles \z, \Z, \b, \B, and $.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre.3
code/trunk/doc/pcre_exec.3
code/trunk/doc/pcreapi.3
code/trunk/doc/pcrematching.3
code/trunk/doc/pcrepartial.3
code/trunk/doc/pcretest.1
code/trunk/pcre.h.in
code/trunk/pcre_dfa_exec.c
code/trunk/pcre_exec.c
code/trunk/pcretest.c
code/trunk/testdata/testinput2
code/trunk/testdata/testinput7
code/trunk/testdata/testinput8
code/trunk/testdata/testoutput2
code/trunk/testdata/testoutput7
code/trunk/testdata/testoutput8
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/ChangeLog 2010-10-22 15:57:50 UTC (rev 553)
@@ -21,8 +21,26 @@
the class, even if it had been included by some previous item, for example
in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
of \s, but is part of the POSIX "space" class.)
+
+4. A partial match never returns an empty string (because you can always
+ match an empty string at the end of the subject); however the checking for
+ an empty string was starting at the "start of match" point. This has been
+ changed to the "earliest inspected character" point, because the returned
+ data for a partial match starts at this character. This means that, for
+ example, /(?<=abc)def/ gives a partial match for the subject "abc"
+ (previously it gave "no match").
+5. Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
+ of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
+ previously a full match would be given. However, setting PCRE_PARTIAL_HARD
+ has an implication that the given string is incomplete (because a partial
+ match is preferred over a full match). For this reason, these items now
+ give a partial match in this situation. [Aside: previously, the one case
+ /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
+ match rather than a full match, which was wrong by the old rules, but is
+ now correct.]
+
Version 8.10 25-Jun-2010
------------------------
Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcre.3 2010-10-22 15:57:50 UTC (rev 553)
@@ -254,12 +254,12 @@
recognizes as digits, spaces, or word characters remain the same set as before,
all with values less than 256. This remains true even when PCRE is built to
include Unicode property support, because to do otherwise would slow down PCRE
-in many common cases. Note that this also applies to \eb, because it is defined
-in terms of \ew and \eW. If you really want to test for a wider sense of, say,
-"digit", you can use explicit Unicode property tests such as \ep{Nd}.
-Alternatively, if you set the PCRE_UCP option, the way that the character
-escapes work is changed so that Unicode properties are used to determine which
-characters match. There are more details in the section on
+in many common cases. Note in particular that this applies to \eb and \eB,
+because they are defined in terms of \ew and \eW. If you really want to test
+for a wider sense of, say, "digit", you can use explicit Unicode property tests
+such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, the way that
+the character escapes work is changed so that Unicode properties are used to
+determine which characters match. There are more details in the section on
.\" HTML <a href="pcrepattern.html#genericchartypes">
.\" </a>
generic character types
@@ -306,6 +306,6 @@
.rs
.sp
.nf
-Last updated: 12 May 2010
+Last updated: 22 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre_exec.3
===================================================================
--- code/trunk/doc/pcre_exec.3 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcre_exec.3 2010-10-22 15:57:50 UTC (rev 553)
@@ -53,7 +53,7 @@
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ if that is found before a full match
.sp
For details of partial matching, see the
.\" HREF
Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcreapi.3 2010-10-22 15:57:50 UTC (rev 553)
@@ -1532,12 +1532,21 @@
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match
occurs if the end of the subject string is reached successfully, but there are
not enough subject characters to complete the match. If this happens when
-PCRE_PARTIAL_HARD is set, \fBpcre_exec()\fP immediately returns
-PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
-by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
-returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
-was inspected when the partial match was found is set as the first matching
-string. There is a more detailed discussion in the
+PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by
+testing any remaining alternatives. Only if no complete match can be found is
+PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words,
+PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
+but only if no complete match can be found.
+.P
+If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a
+partial match is found, \fBpcre_exec()\fP immediately returns
+PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,
+when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
+important than an alternative complete match.
+.P
+In both cases, the portion of the string that was inspected when the partial
+match was found is set as the first matching string. There is a more detailed
+discussion of partial and multi-segment matching, with examples, in the
.\" HREF
\fBpcrepartial\fP
.\"
@@ -2059,6 +2068,12 @@
there have been no complete matches, but there is still at least one matching
possibility. The portion of the string that was inspected when the longest
partial match was found is set as the first matching string in both cases.
+There is a more detailed discussion of partial and multi-segment matching, with
+examples, in the
+.\" HREF
+\fBpcrepartial\fP
+.\"
+documentation.
.sp
PCRE_DFA_SHORTEST
.sp
@@ -2178,6 +2193,6 @@
.rs
.sp
.nf
-Last updated: 21 June 2010
+Last updated: 22 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
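The soft/hard distinction documented above can be modelled with a small sketch.
This is not PCRE code: the event list, the type names, and resolve() are all
hypothetical, and stand in for "the order in which the matcher runs out of
subject or completes a match". It only illustrates the rule that
PCRE_PARTIAL_SOFT prefers any complete match, while PCRE_PARTIAL_HARD returns
at the first partial match it meets.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; none of them are part of the PCRE API. */
typedef enum { EV_PARTIAL, EV_COMPLETE } event_t;
typedef enum { OUT_NOMATCH, OUT_PARTIAL, OUT_MATCH } outcome_t;

/* events[] lists, in the order the matcher meets them, each point where it
   either runs out of subject (EV_PARTIAL) or completes a match (EV_COMPLETE).
   hard != 0 models PCRE_PARTIAL_HARD; hard == 0 models PCRE_PARTIAL_SOFT. */
static outcome_t resolve(const event_t *events, size_t n, int hard)
{
  int partial_seen = 0;
  for (size_t i = 0; i < n; i++)
    {
    if (events[i] == EV_COMPLETE) return OUT_MATCH;  /* complete match wins */
    if (hard) return OUT_PARTIAL;       /* hard: stop at the first partial */
    partial_seen = 1;                   /* soft: remember it and carry on */
    }
  return partial_seen ? OUT_PARTIAL : OUT_NOMATCH;
}
```

For a greedy pattern such as /dog(sbody)?/ against "dog", the longer
alternative runs out of subject before the shorter one completes, so the
modelled event order is { EV_PARTIAL, EV_COMPLETE }: soft yields a match and
hard yields a partial match.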
Modified: code/trunk/doc/pcrematching.3
===================================================================
--- code/trunk/doc/pcrematching.3 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcrematching.3 2010-10-22 15:57:50 UTC (rev 553)
@@ -151,11 +151,13 @@
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
-time. The
+time. It is possible to do multi-segment matching using \fBpcre_exec()\fP (by
+retaining partially matched substrings), but it is more complicated. The
.\" HREF
\fBpcrepartial\fP
.\"
-documentation gives details of partial matching.
+documentation gives details of partial matching and discusses multi-segment
+matching.
.
.
.SH "DISADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@@ -187,6 +189,6 @@
.rs
.sp
.nf
-Last updated: 29 September 2009
-Copyright (c) 1997-2009 University of Cambridge.
+Last updated: 22 October 2010
+Copyright (c) 1997-2010 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcrepartial.3 2010-10-22 15:57:50 UTC (rev 553)
@@ -21,8 +21,8 @@
as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
-entered. Partial matching can also sometimes be useful when the subject string
-is very long and is not all available at once.
+entered. Partial matching can also be useful when the subject string is very
+long and is not all available at once.
.P
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
@@ -44,19 +44,21 @@
.SH "PARTIAL MATCHING USING pcre_exec()"
.rs
.sp
-A partial match occurs during a call to \fBpcre_exec()\fP whenever the end of
-the subject string is reached successfully, but matching cannot continue
-because more characters are needed. However, at least one character must have
-been matched. (In other words, a partial match can never be an empty string.)
+A partial match occurs during a call to \fBpcre_exec()\fP when the end of the
+subject string is reached successfully, but matching cannot continue because
+more characters are needed. However, at least one character in the subject must
+have been inspected. This character need not form part of the final matched
+string; lookbehind assertions and the \eK escape sequence provide ways of
+inspecting characters before the start of a matched substring. The requirement
+for inspecting at least one character exists because an empty string can always
+be matched; without such a restriction there would always be a partial match of
+an empty string at the end of the subject.
.P
-If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
-continues as normal, and other alternatives in the pattern are tried. If no
-complete match can be found, \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL
-instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
-vector, the first of them is set to the offset of the earliest character that
-was inspected when the partial match was found. For convenience, the second
-offset points to the end of the string so that a substring can easily be
-identified.
+If there are at least two slots in the offsets vector when \fBpcre_exec()\fP
+returns with a partial match, the first slot is set to the offset of the
+earliest character that was inspected when the partial match was found. For
+convenience, the second offset points to the end of the subject so that a
+substring can easily be identified.
.P
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
@@ -68,8 +70,26 @@
This pattern matches "123", but only if it is preceded by "abc". If the subject
string is "xyzabc12", the offsets after a partial match are for the substring
"abc12", because all these characters are needed if another match is tried
-with extra characters added.
+with extra characters added to the subject.
.P
+What happens when a partial match is identified depends on which of the two
+partial matching options is set.
+.
+.
+.SS "PCRE_PARTIAL_SOFT with pcre_exec()"
+.rs
+.sp
+If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP identifies a partial match,
+the partial match is remembered, but matching continues as normal, and other
+alternatives in the pattern are tried. If no complete match can be found,
+\fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
+.P
+This option is "soft" because it prefers a complete match over a partial match.
+All the various matching items in a pattern behave as if the subject string is
+potentially complete. For example, \ez, \eZ, and $ match at the end of the
+subject, as normal, and for \eb and \eB the end of the subject is treated as a
+non-alphanumeric.
+.P
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
.sp
@@ -77,16 +97,30 @@
.sp
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
-matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
-offsets are set to 3 and 9, identifying "123dog" as the first partial match
-that was found. (In this example, there are two partial matches, because "dog"
-on its own partially matches the second alternative.)
-.P
+matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
+identifying "123dog" as the first partial match that was found. (In this
+example, there are two partial matches, because "dog" on its own partially
+matches the second alternative.)
+.
+.
+.SS "PCRE_PARTIAL_HARD with pcre_exec()"
+.rs
+.sp
If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
-search for possible complete matches. The difference between the two options
-can be illustrated by a pattern such as:
+search for possible complete matches. This option is "hard" because it prefers
+an earlier partial match over a later complete match. For this reason, the
+assumption is made that the end of the supplied subject string may not be the
+true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
+encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+.
+.
+.SS "Comparing hard and soft partial matching"
+.rs
.sp
+The difference between the two partial matching options can be illustrated by a
+pattern such as:
+.sp
/dog(sbody)?/
.sp
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
@@ -115,7 +149,7 @@
character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of the
pattern, there is the possibility of a partial match, again provided that at
-least one character has matched.
+least one character has been inspected.
.P
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
@@ -240,11 +274,13 @@
matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
previous match with a new segment of data. Instead, new data must be added to
the previous subject string, and the entire match re-run, starting from the
-point where the partial match occurred. Earlier data can be discarded.
-Consider an unanchored pattern that matches dates:
+point where the partial match occurred. Earlier data can be discarded. It is
+best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
+end of a segment as the end of the subject when matching \ez, \eZ, \eb, \eB,
+and $. Consider an unanchored pattern that matches dates:
.sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
- data> The date is 23ja\eP
+ data> The date is 23ja\eP\eP
Partial match: 23ja
.sp
At this stage, an application could discard the text preceding "23ja", add on
@@ -265,9 +301,11 @@
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
.P
-1. If the pattern contains tests for the beginning or end of a line, you need
-to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
-subject string for any call does not contain the beginning or end of a line.
+1. If the pattern contains a test for the beginning of a line, you need to pass
+the PCRE_NOTBOL option when the subject string for any call does not start at
+the beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
+doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
+includes the effect of PCRE_NOTEOL.
.P
2. Lookbehind assertions at the start of a pattern are catered for in the
offsets that are returned for a partial match. However, in theory, a lookbehind
@@ -281,10 +319,10 @@
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\eb or \eB. Another kind of difference may occur when there are multiple
-matching possibilities, because a partial match result is given only when there
-are no completed matches. This means that as soon as the shortest match has
-been found, continuation to a new subject segment is no longer possible.
-Consider again this \fBpcretest\fP example:
+matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
+is given only when there are no completed matches. This means that as soon as
+the shortest match has been found, continuation to a new subject segment is no
+longer possible. Consider again this \fBpcretest\fP example:
.sp
re> /dog(sbody)?/
data> dogsb\eP
@@ -306,8 +344,8 @@
the other hand, if "dogsbody" is presented as a single string,
\fBpcre_dfa_exec()\fP finds both matches.
.P
-Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
-matching multi-segment data. The example above then behaves differently:
+Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
+multi-segment data. The example above then behaves differently:
.sp
re> /dog(sbody)?/
data> dogsb\eP\eP
@@ -341,12 +379,12 @@
each time:
.sp
re> /1234|3789/
- data> ABC123\eP
+ data> ABC123\eP\eP
Partial match: 123
data> 1237890
0: 3789
.sp
-Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running
+Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
the entire match can also be used with \fBpcre_dfa_exec()\fP. Another
possibility is to work with two buffers. If a partial match at offset \fIn\fP
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
@@ -368,6 +406,6 @@
.rs
.sp
.nf
-Last updated: 19 October 2009
-Copyright (c) 1997-2009 University of Cambridge.
+Last updated: 22 October 2010
+Copyright (c) 1997-2010 University of Cambridge.
.fi
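The buffer handling for multi-segment matching with pcre_exec(), as described
in the documentation change above, might look roughly like the following. This
is a sketch under assumptions: carry_over() is a hypothetical helper, not a
PCRE function, and it assumes the caller already holds the two offsets that
were returned for the partial match. It keeps the inspected substring, discards
everything before it, and appends the next segment so that the whole match can
be re-run.

```c
#include <string.h>

/* Hypothetical helper for this sketch; not part of PCRE.
   Copies subject[partial_start..partial_end) followed by the NUL-terminated
   next segment into out. Returns the new subject length, or -1 if the
   offsets are invalid or the result does not fit in outsize bytes. */
static int carry_over(char *out, size_t outsize, const char *subject,
                      int partial_start, int partial_end, const char *next)
{
  size_t keep, add;
  if (partial_start < 0 || partial_end < partial_start) return -1;
  keep = (size_t)(partial_end - partial_start);
  add = strlen(next);
  if (keep + add + 1 > outsize) return -1;
  memcpy(out, subject + partial_start, keep);  /* retained partial match */
  memcpy(out + keep, next, add + 1);           /* new data, including NUL */
  return (int)(keep + add);
}
```

With the subject "The date is 23ja" and partial-match offsets 12 and 16, a
made-up next segment "n19 and more" produces the new subject
"23jan19 and more", which can then be matched again from its start.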
Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcretest.1 2010-10-22 15:57:50 UTC (rev 553)
@@ -499,9 +499,11 @@
\fBpcre_exec()\fP returns, starting with number 0 for the string that matched
the whole pattern. Otherwise, it outputs "No match" when the return is
PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
-substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. For any other
-returns, it outputs the PCRE negative error number. Here is an example of an
-interactive \fBpcretest\fP run.
+substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. (Note that this is
+the entire substring that was inspected during the partial match; it may
+include characters before the actual match start if a lookbehind assertion,
+\eK, \eb, or \eB was involved.) For any other returns, it outputs the PCRE
+negative error number. Here is an example of an interactive \fBpcretest\fP run.
.sp
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -584,7 +586,9 @@
(Using the normal matching function on this data finds only "tang".) The
longest matching string is always given first (and numbered zero). After a
PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
-partially matching substring.
+partially matching substring. (Note that this is the entire substring that was
+inspected during the partial match; it may include characters before the actual
+match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
.P
If \fB/g\fP is present on the pattern, the search for further matches resumes
at the end of the longest match. For example:
@@ -763,6 +767,6 @@
.rs
.sp
.nf
-Last updated: 14 June 2010
+Last updated: 22 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcre.h.in 2010-10-22 15:57:50 UTC (rev 553)
@@ -96,42 +96,44 @@
#endif
/* Options. Some are compile-time only, some are run-time only, and some are
-both, so we keep them all distinct. */
+both, so we keep them all distinct. However, almost all the bits in the options
+word are now used. In the long run, we may have to re-use some of the
+compile-time only bits for runtime options, or vice versa. */
-#define PCRE_CASELESS 0x00000001
-#define PCRE_MULTILINE 0x00000002
-#define PCRE_DOTALL 0x00000004
-#define PCRE_EXTENDED 0x00000008
-#define PCRE_ANCHORED 0x00000010
-#define PCRE_DOLLAR_ENDONLY 0x00000020
-#define PCRE_EXTRA 0x00000040
-#define PCRE_NOTBOL 0x00000080
-#define PCRE_NOTEOL 0x00000100
-#define PCRE_UNGREEDY 0x00000200
-#define PCRE_NOTEMPTY 0x00000400
-#define PCRE_UTF8 0x00000800
-#define PCRE_NO_AUTO_CAPTURE 0x00001000
-#define PCRE_NO_UTF8_CHECK 0x00002000
-#define PCRE_AUTO_CALLOUT 0x00004000
-#define PCRE_PARTIAL_SOFT 0x00008000
+#define PCRE_CASELESS 0x00000001 /* Compile */
+#define PCRE_MULTILINE 0x00000002 /* Compile */
+#define PCRE_DOTALL 0x00000004 /* Compile */
+#define PCRE_EXTENDED 0x00000008 /* Compile */
+#define PCRE_ANCHORED 0x00000010 /* Compile, exec, DFA exec */
+#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile */
+#define PCRE_EXTRA 0x00000040 /* Compile */
+#define PCRE_NOTBOL 0x00000080 /* Exec, DFA exec */
+#define PCRE_NOTEOL 0x00000100 /* Exec, DFA exec */
+#define PCRE_UNGREEDY 0x00000200 /* Compile */
+#define PCRE_NOTEMPTY 0x00000400 /* Exec, DFA exec */
+#define PCRE_UTF8 0x00000800 /* Compile */
+#define PCRE_NO_AUTO_CAPTURE 0x00001000 /* Compile */
+#define PCRE_NO_UTF8_CHECK 0x00002000 /* Compile, exec, DFA exec */
+#define PCRE_AUTO_CALLOUT 0x00004000 /* Compile */
+#define PCRE_PARTIAL_SOFT 0x00008000 /* Exec, DFA exec */
#define PCRE_PARTIAL 0x00008000 /* Backwards compatible synonym */
-#define PCRE_DFA_SHORTEST 0x00010000
-#define PCRE_DFA_RESTART 0x00020000
-#define PCRE_FIRSTLINE 0x00040000
-#define PCRE_DUPNAMES 0x00080000
-#define PCRE_NEWLINE_CR 0x00100000
-#define PCRE_NEWLINE_LF 0x00200000
-#define PCRE_NEWLINE_CRLF 0x00300000
-#define PCRE_NEWLINE_ANY 0x00400000
-#define PCRE_NEWLINE_ANYCRLF 0x00500000
-#define PCRE_BSR_ANYCRLF 0x00800000
-#define PCRE_BSR_UNICODE 0x01000000
-#define PCRE_JAVASCRIPT_COMPAT 0x02000000
-#define PCRE_NO_START_OPTIMIZE 0x04000000
-#define PCRE_NO_START_OPTIMISE 0x04000000
-#define PCRE_PARTIAL_HARD 0x08000000
-#define PCRE_NOTEMPTY_ATSTART 0x10000000
-#define PCRE_UCP 0x20000000
+#define PCRE_DFA_SHORTEST 0x00010000 /* DFA exec */
+#define PCRE_DFA_RESTART 0x00020000 /* DFA exec */
+#define PCRE_FIRSTLINE 0x00040000 /* Compile */
+#define PCRE_DUPNAMES 0x00080000 /* Compile */
+#define PCRE_NEWLINE_CR 0x00100000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_LF 0x00200000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_CRLF 0x00300000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANY 0x00400000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANYCRLF 0x00500000 /* Compile, exec, DFA exec */
+#define PCRE_BSR_ANYCRLF 0x00800000 /* Compile, exec, DFA exec */
+#define PCRE_BSR_UNICODE 0x01000000 /* Compile, exec, DFA exec */
+#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile */
+#define PCRE_NO_START_OPTIMIZE 0x04000000 /* Exec, DFA exec */
+#define PCRE_NO_START_OPTIMISE 0x04000000 /* Synonym */
+#define PCRE_PARTIAL_HARD 0x08000000 /* Exec, DFA exec */
+#define PCRE_NOTEMPTY_ATSTART 0x10000000 /* Exec, DFA exec */
+#define PCRE_UCP 0x20000000 /* Compile */
/* Exec-time and get/set-time error codes */
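As an illustration of how the two partial-matching bits above relate, here is a
minimal sketch. The SK_ names and partial_level() are hypothetical; only the
bit values are taken from the header. The levels 0, 1, and 2 are an assumption
chosen to agree with the "md->partial > 1" test used for hard partial matching
in this revision and with the documented rule that PCRE_PARTIAL_HARD overrides
PCRE_PARTIAL_SOFT.

```c
/* Bit values copied from the header; the SK_ prefix is hypothetical and is
   used only to avoid clashing with the real PCRE macros. */
#define SK_PARTIAL_SOFT 0x00008000
#define SK_PARTIAL_HARD 0x08000000

/* Assumed mapping: 0 = no partial matching, 1 = soft, 2 = hard.
   Hard takes precedence when both bits are set. */
static int partial_level(unsigned long options)
{
  if ((options & SK_PARTIAL_HARD) != 0) return 2;
  if ((options & SK_PARTIAL_SOFT) != 0) return 1;
  return 0;
}
```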
Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcre_dfa_exec.c 2010-10-22 15:57:50 UTC (rev 553)
@@ -831,7 +831,12 @@
/*-----------------------------------------------------------------*/
case OP_EOD:
- if (ptr >= end_subject) { ADD_ACTIVE(state_offset + 1, 0); }
+ if (ptr >= end_subject)
+ {
+ if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
+ could_continue = TRUE;
+ else { ADD_ACTIVE(state_offset + 1, 0); }
+ }
break;
/*-----------------------------------------------------------------*/
@@ -871,7 +876,9 @@
/*-----------------------------------------------------------------*/
case OP_EODN:
- if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
+ if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
+ could_continue = TRUE;
+ else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
{ ADD_ACTIVE(state_offset + 1, 0); }
break;
@@ -879,7 +886,9 @@
case OP_DOLL:
if ((md->moptions & PCRE_NOTEOL) == 0)
{
- if (clen == 0 ||
+ if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
+ could_continue = TRUE;
+ else if (clen == 0 ||
((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
((ims & PCRE_MULTILINE) != 0 || ptr == end_subject - md->nllen)
))
@@ -2744,8 +2753,8 @@
((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
match_count < 0) /* no matches */
) && /* And... */
- ptr >= end_subject && /* Reached end of subject */
- ptr > current_subject) /* Matched non-empty string */
+ ptr >= end_subject && /* Reached end of subject */
+ ptr > md->start_used_ptr) /* Inspected non-empty string */
{
if (offsetcount >= 2)
{
Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcre_exec.c 2010-10-22 15:57:50 UTC (rev 553)
@@ -422,17 +422,18 @@
the subject. */
#define CHECK_PARTIAL()\
- if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
- {\
- md->hitend = TRUE;\
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+ if (md->partial != 0 && eptr >= md->end_subject && \
+ eptr > md->start_used_ptr) \
+ { \
+ md->hitend = TRUE; \
+ if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
#define SCHECK_PARTIAL()\
- if (md->partial != 0 && eptr > mstart)\
- {\
- md->hitend = TRUE;\
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+ if (md->partial != 0 && eptr > md->start_used_ptr) \
+ { \
+ md->hitend = TRUE; \
+ if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
@@ -711,18 +712,18 @@
MRRETURN(MATCH_NOMATCH);
/* COMMIT overrides PRUNE, SKIP, and THEN */
-
+
case OP_COMMIT:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM52);
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
- rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
- rrc != MATCH_THEN)
+ rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
+ rrc != MATCH_THEN)
RRETURN(rrc);
MRRETURN(MATCH_COMMIT);
/* PRUNE overrides THEN */
-
+
case OP_PRUNE:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM51);
@@ -737,11 +738,11 @@
RRETURN(MATCH_PRUNE);
/* SKIP overrides PRUNE and THEN */
-
+
case OP_SKIP:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM53);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
md->start_match_ptr = eptr; /* Pass back current position */
MRRETURN(MATCH_SKIP);
@@ -749,7 +750,7 @@
case OP_SKIP_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM57);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
/* Pass back the current skip name by overloading md->start_match_ptr and
@@ -759,11 +760,11 @@
md->start_match_ptr = ecode + 2;
RRETURN(MATCH_SKIP_ARG);
-
+
/* For THEN (and THEN_ARG) we pass back the address of the bracket or
- the alt that is at the start of the current branch. This makes it possible
- to skip back past alternatives that precede the THEN within the current
- branch. */
+ the alt that is at the start of the current branch. This makes it possible
+ to skip back past alternatives that precede the THEN within the current
+ branch. */
case OP_THEN:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
@@ -773,7 +774,7 @@
MRRETURN(MATCH_THEN);
case OP_THEN_ARG:
- RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
+ RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
offset_top, md, ims, eptrb, flags, RM58);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
md->start_match_ptr = ecode - GET(ecode, 1);
@@ -1704,37 +1705,40 @@
if (eptr < md->end_subject)
{ if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
else
- { if (md->noteol) MRRETURN(MATCH_NOMATCH); }
+ {
+ if (md->noteol) MRRETURN(MATCH_NOMATCH);
+ SCHECK_PARTIAL();
+ }
ecode++;
break;
}
- else
+ else /* Not multiline */
{
if (md->noteol) MRRETURN(MATCH_NOMATCH);
- if (!md->endonly)
- {
- if (eptr != md->end_subject &&
- (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
- MRRETURN(MATCH_NOMATCH);
- ecode++;
- break;
- }
+ if (!md->endonly) goto ASSERT_NL_OR_EOS;
}
+
/* ... else fall through for endonly */
/* End of subject assertion (\z) */
case OP_EOD:
if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
+ SCHECK_PARTIAL();
ecode++;
break;
/* End of subject or ending \n assertion (\Z) */
case OP_EODN:
- if (eptr != md->end_subject &&
+ ASSERT_NL_OR_EOS:
+ if (eptr < md->end_subject &&
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
MRRETURN(MATCH_NOMATCH);
+
+ /* Either at end of string or \n before end. */
+
+ SCHECK_PARTIAL();
ecode++;
break;
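The effect of the added SCHECK_PARTIAL() calls on an end-of-subject assertion
such as \z can be summarised in a simplified sketch. This is not the code from
pcre_exec.c: the function and result names are hypothetical, the "partial"
argument assumes the convention that a value greater than 1 means hard partial
matching, and the md->hitend bookkeeping for soft partial matching is left out.

```c
/* Hypothetical result codes for this sketch. */
typedef enum { RES_MATCH, RES_NOMATCH, RES_PARTIAL } result_t;

/* Simplified model of \z:
   - before the end of the subject the assertion fails;
   - at the end, hard partial matching (partial > 1) reports a partial match,
     provided at least one character has been inspected, that is, the current
     position lies beyond the earliest inspected character;
   - otherwise the assertion succeeds and matching continues. */
static result_t end_of_subject_assertion(const char *eptr,
                                         const char *end_subject,
                                         const char *start_used_ptr,
                                         int partial)
{
  if (eptr < end_subject) return RES_NOMATCH;
  if (partial > 1 && eptr > start_used_ptr) return RES_PARTIAL;
  return RES_MATCH;
}
```

This agrees with the new test data for /abc\z/ against "abc": a full match
without partial matching or with \P, and a partial match with \P\P.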
Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcretest.c 2010-10-22 15:57:50 UTC (rev 553)
@@ -2313,7 +2313,9 @@
#endif
use_dfa = 1;
continue;
+#endif
+#if !defined NODFA
case 'F':
options |= PCRE_DFA_SHORTEST;
continue;
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testinput2 2010-10-22 15:57:50 UTC (rev 553)
@@ -3499,4 +3499,40 @@
*** Failers
abcxy
+/(?<=abc)def/
+ abc\P\P
+
+/abc$/
+ abc
+ abc\P
+ abc\P\P
+
+/abc$/m
+ abc
+ abc\n
+ abc\P\P
+ abc\n\P\P
+ abc\P
+ abc\n\P
+
+/abc\z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\Z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\b/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\B/
+ abc
+ abc\P
+ abc\P\P
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testinput7
===================================================================
--- code/trunk/testdata/testinput7 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testinput7 2010-10-22 15:57:50 UTC (rev 553)
@@ -4560,4 +4560,40 @@
/^(?(?!a(*SKIP)b))/
ac
+/(?<=abc)def/
+ abc\P\P
+
+/abc$/
+ abc
+ abc\P
+ abc\P\P
+
+/abc$/m
+ abc
+ abc\n
+ abc\P\P
+ abc\n\P\P
+ abc\P
+ abc\n\P
+
+/abc\z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\Z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\b/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\B/
+ abc
+ abc\P
+ abc\P\P
+
/-- End of testinput7 --/
Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testinput8 2010-10-22 15:57:50 UTC (rev 553)
@@ -685,4 +685,8 @@
xxxxabcde\P
xxxxabcde\P\P
+/\bthe cat\b/8
+ the cat\P
+ the cat\P\P
+
/-- End of testinput8 --/
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testoutput2 2010-10-22 15:57:50 UTC (rev 553)
@@ -11138,4 +11138,62 @@
abcxy
No match
+/(?<=abc)def/
+ abc\P\P
+Partial match: abc
+
+/abc$/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc$/m
+ abc
+ 0: abc
+ abc\n
+ 0: abc
+ abc\P\P
+Partial match: abc
+ abc\n\P\P
+ 0: abc
+ abc\P
+ 0: abc
+ abc\n\P
+ 0: abc
+
+/abc\z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\Z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\b/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\B/
+ abc
+No match
+ abc\P
+Partial match: abc
+ abc\P\P
+Partial match: abc
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testoutput7
===================================================================
--- code/trunk/testdata/testoutput7 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testoutput7 2010-10-22 15:57:50 UTC (rev 553)
@@ -7610,4 +7610,62 @@
ac
Error -16
+/(?<=abc)def/
+ abc\P\P
+Partial match: abc
+
+/abc$/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc$/m
+ abc
+ 0: abc
+ abc\n
+ 0: abc
+ abc\P\P
+Partial match: abc
+ abc\n\P\P
+ 0: abc
+ abc\P
+ 0: abc
+ abc\n\P
+ 0: abc
+
+/abc\z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\Z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\b/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\B/
+ abc
+No match
+ abc\P
+Partial match: abc
+ abc\P\P
+Partial match: abc
+
/-- End of testinput7 --/
Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8 2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testoutput8 2010-10-22 15:57:50 UTC (rev 553)
@@ -1320,4 +1320,10 @@
xxxxabcde\P\P
Partial match: abcde
+/\bthe cat\b/8
+ the cat\P
+ 0: the cat
+ the cat\P\P
+Partial match: the cat
+
/-- End of testinput8 --/