[Pcre-svn] [553] code/trunk: Change the way PCRE_PARTIAL

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [553] code/trunk: Change the way PCRE_PARTIAL_HARD handles \z, \Z, \b, \B, and $.

Revision: 553

          http://vcs.pcre.org/viewvc?view=rev&revision=553
Author:   ph10
Date:     2010-10-22 16:57:50 +0100 (Fri, 22 Oct 2010)

Log Message:
-----------
Change the way PCRE_PARTIAL_HARD handles \z, \Z, \b, \B, and $.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre.3
    code/trunk/doc/pcre_exec.3
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrematching.3
    code/trunk/doc/pcrepartial.3
    code/trunk/doc/pcretest.1
    code/trunk/pcre.h.in
    code/trunk/pcre_dfa_exec.c
    code/trunk/pcre_exec.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput7
    code/trunk/testdata/testinput8
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput7
    code/trunk/testdata/testoutput8

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/ChangeLog    2010-10-22 15:57:50 UTC (rev 553)
@@ -21,8 +21,26 @@
     the class, even if it had been included by some previous item, for example
     in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part 
     of \s, but is part of the POSIX "space" class.)  
+    
+4.  A partial match never returns an empty string (because you can always
+    match an empty string at the end of the subject); however the checking for
+    an empty string was starting at the "start of match" point. This has been
+    changed to the "earliest inspected character" point, because the returned
+    data for a partial match starts at this character. This means that, for
+    example, /(?<=abc)def/ gives a partial match for the subject "abc"
+    (previously it gave "no match").

+5.  Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
+    of $, \z, \Z, \b, and \B. If the match point is at the end of the string, 
+    previously a full match would be given. However, setting PCRE_PARTIAL_HARD
+    has an implication that the given string is incomplete (because a partial 
+    match is preferred over a full match). For this reason, these items now 
+    give a partial match in this situation. [Aside: previously, the one case
+    /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
+    match rather than a full match, which was wrong by the old rules, but is 
+    now correct.]

+
Version 8.10 25-Jun-2010
------------------------

Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcre.3    2010-10-22 15:57:50 UTC (rev 553)
@@ -254,12 +254,12 @@
 recognizes as digits, spaces, or word characters remain the same set as before,
 all with values less than 256. This remains true even when PCRE is built to
 include Unicode property support, because to do otherwise would slow down PCRE
-in many common cases. Note that this also applies to \eb, because it is defined
-in terms of \ew and \eW. If you really want to test for a wider sense of, say,
-"digit", you can use explicit Unicode property tests such as \ep{Nd}.
-Alternatively, if you set the PCRE_UCP option, the way that the character
-escapes work is changed so that Unicode properties are used to determine which
-characters match. There are more details in the section on
+in many common cases. Note in particular that this applies to \eb and \eB,
+because they are defined in terms of \ew and \eW. If you really want to test
+for a wider sense of, say, "digit", you can use explicit Unicode property tests
+such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, the way that
+the character escapes work is changed so that Unicode properties are used to
+determine which characters match. There are more details in the section on
 .\" HTML <a href="pcrepattern.html#genericchartypes">
 .\" </a>
 generic character types
@@ -306,6 +306,6 @@
 .rs
 .sp
 .nf
-Last updated: 12 May 2010
+Last updated: 22 October 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi

Modified: code/trunk/doc/pcre_exec.3
===================================================================
--- code/trunk/doc/pcre_exec.3    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcre_exec.3    2010-10-22 15:57:50 UTC (rev 553)
@@ -53,7 +53,7 @@
   PCRE_PARTIAL           ) Return PCRE_ERROR_PARTIAL for a partial
   PCRE_PARTIAL_SOFT      )   match if no full matches are found
   PCRE_PARTIAL_HARD      Return PCRE_ERROR_PARTIAL for a partial match
-                           even if there is a full match as well
+                           if that is found before a full match
 .sp
 For details of partial matching, see the
 .\" HREF

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcreapi.3    2010-10-22 15:57:50 UTC (rev 553)
@@ -1532,12 +1532,21 @@
 compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match
 occurs if the end of the subject string is reached successfully, but there are
 not enough subject characters to complete the match. If this happens when
-PCRE_PARTIAL_HARD is set, \fBpcre_exec()\fP immediately returns
-PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
-by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
-returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
-was inspected when the partial match was found is set as the first matching
-string. There is a more detailed discussion in the
+PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by
+testing any remaining alternatives. Only if no complete match can be found is
+PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words,
+PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
+but only if no complete match can be found.
+.P
+If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a
+partial match is found, \fBpcre_exec()\fP immediately returns
+PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,
+when PCRE_PARTIAL_HARD is set, a partial match is considered to be more 
+important than an alternative complete match.
+.P
+In both cases, the portion of the string that was inspected when the partial
+match was found is set as the first matching string. There is a more detailed
+discussion of partial and multi-segment matching, with examples, in the
 .\" HREF
 \fBpcrepartial\fP
 .\"
@@ -2059,6 +2068,12 @@
 there have been no complete matches, but there is still at least one matching
 possibility. The portion of the string that was inspected when the longest
 partial match was found is set as the first matching string in both cases.
+There is a more detailed discussion of partial and multi-segment matching, with
+examples, in the
+.\" HREF
+\fBpcrepartial\fP
+.\"
+documentation.
 .sp
   PCRE_DFA_SHORTEST
 .sp
@@ -2178,6 +2193,6 @@
 .rs
 .sp
 .nf
-Last updated: 21 June 2010
+Last updated: 22 October 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi

Modified: code/trunk/doc/pcrematching.3
===================================================================
--- code/trunk/doc/pcrematching.3    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcrematching.3    2010-10-22 15:57:50 UTC (rev 553)
@@ -151,11 +151,13 @@
 2. Because the alternative algorithm scans the subject string just once, and
 never needs to backtrack, it is possible to pass very long subject strings to
 the matching function in several pieces, checking for partial matching each
-time. The
+time. It is possible to do multi-segment matching using \fBpcre_exec()\fP (by 
+retaining partially matched substrings), but it is more complicated. The
 .\" HREF
 \fBpcrepartial\fP
 .\"
-documentation gives details of partial matching.
+documentation gives details of partial matching and discusses multi-segment 
+matching.
 .
 .
 .SH "DISADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@@ -187,6 +189,6 @@
 .rs
 .sp
 .nf
-Last updated: 29 September 2009
-Copyright (c) 1997-2009 University of Cambridge.
+Last updated: 22 October 2010
+Copyright (c) 1997-2010 University of Cambridge.
 .fi

Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcrepartial.3    2010-10-22 15:57:50 UTC (rev 553)
@@ -21,8 +21,8 @@
 as soon as a mistake is made, by beeping and not reflecting the character that
 has been typed, for example. This immediate feedback is likely to be a better
 user interface than a check that is delayed until the entire string has been
-entered. Partial matching can also sometimes be useful when the subject string
-is very long and is not all available at once.
+entered. Partial matching can also be useful when the subject string is very
+long and is not all available at once.
 .P
 PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
 PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
@@ -44,19 +44,21 @@
 .SH "PARTIAL MATCHING USING pcre_exec()"
 .rs
 .sp
-A partial match occurs during a call to \fBpcre_exec()\fP whenever the end of
-the subject string is reached successfully, but matching cannot continue
-because more characters are needed. However, at least one character must have
-been matched. (In other words, a partial match can never be an empty string.)
+A partial match occurs during a call to \fBpcre_exec()\fP when the end of the
+subject string is reached successfully, but matching cannot continue because
+more characters are needed. However, at least one character in the subject must
+have been inspected. This character need not form part of the final matched
+string; lookbehind assertions and the \eK escape sequence provide ways of
+inspecting characters before the start of a matched substring. The requirement 
+for inspecting at least one character exists because an empty string can always 
+be matched; without such a restriction there would always be a partial match of 
+an empty string at the end of the subject.
 .P
-If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
-continues as normal, and other alternatives in the pattern are tried. If no
-complete match can be found, \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL
-instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
-vector, the first of them is set to the offset of the earliest character that
-was inspected when the partial match was found. For convenience, the second
-offset points to the end of the string so that a substring can easily be
-identified.
+If there are at least two slots in the offsets vector when \fBpcre_exec()\fP
+returns with a partial match, the first slot is set to the offset of the
+earliest character that was inspected when the partial match was found. For
+convenience, the second offset points to the end of the subject so that a
+substring can easily be identified. 
 .P
 For the majority of patterns, the first offset identifies the start of the
 partially matched string. However, for patterns that contain lookbehind
@@ -68,8 +70,26 @@
 This pattern matches "123", but only if it is preceded by "abc". If the subject
 string is "xyzabc12", the offsets after a partial match are for the substring
 "abc12", because all these characters are needed if another match is tried
-with extra characters added.
+with extra characters added to the subject.
 .P
+What happens when a partial match is identified depends on which of the two
+partial matching options is set.
+.
+.
+.SS "PCRE_PARTIAL_SOFT with pcre_exec()"
+.rs
+.sp
+If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP identifies a partial match,
+the partial match is remembered, but matching continues as normal, and other
+alternatives in the pattern are tried. If no complete match can be found,
+\fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
+.P
+This option is "soft" because it prefers a complete match over a partial match. 
+All the various matching items in a pattern behave as if the subject string is 
+potentially complete. For example, \ez, \eZ, and $ match at the end of the
+subject, as normal, and for \eb and \eB the end of the subject is treated as a 
+non-alphanumeric.
+.P
 If there is more than one partial match, the first one that was found provides
 the data that is returned. Consider this pattern:
 .sp
@@ -77,16 +97,30 @@
 .sp
 If this is matched against the subject string "abc123dog", both
 alternatives fail to match, but the end of the subject is reached during
-matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
-offsets are set to 3 and 9, identifying "123dog" as the first partial match
-that was found. (In this example, there are two partial matches, because "dog"
-on its own partially matches the second alternative.)
-.P
+matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
+identifying "123dog" as the first partial match that was found. (In this
+example, there are two partial matches, because "dog" on its own partially
+matches the second alternative.)
+.
+.
+.SS "PCRE_PARTIAL_HARD with pcre_exec()"
+.rs
+.sp
 If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
 PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
-search for possible complete matches. The difference between the two options
-can be illustrated by a pattern such as:
+search for possible complete matches. This option is "hard" because it prefers 
+an earlier partial match over a later complete match. For this reason, the
+assumption is made that the end of the supplied subject string may not be the
+true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
+encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+.
+.
+.SS "Comparing hard and soft partial matching"
+.rs
 .sp
+The difference between the two partial matching options can be illustrated by a
+pattern such as:
+.sp
   /dog(sbody)?/
 .sp
 This matches either "dog" or "dogsbody", greedily (that is, it prefers the
@@ -115,7 +149,7 @@
 character, without backtracking, searching for all possible matches
 simultaneously. If the end of the subject is reached before the end of the
 pattern, there is the possibility of a partial match, again provided that at
-least one character has matched.
+least one character has been inspected.
 .P
 When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
 have been no complete matches. Otherwise, the complete matches are returned.
@@ -240,11 +274,13 @@
 matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
 previous match with a new segment of data. Instead, new data must be added to
 the previous subject string, and the entire match re-run, starting from the
-point where the partial match occurred. Earlier data can be discarded.
-Consider an unanchored pattern that matches dates:
+point where the partial match occurred. Earlier data can be discarded. It is 
+best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the 
+end of a segment as the end of the subject when matching \ez, \eZ, \eb, \eB,
+and $. Consider an unanchored pattern that matches dates:
 .sp
     re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
-  data> The date is 23ja\eP
+  data> The date is 23ja\eP\eP
   Partial match: 23ja
 .sp
 At this stage, an application could discard the text preceding "23ja", add on
@@ -265,9 +301,11 @@
 Certain types of pattern may give problems with multi-segment matching,
 whichever matching function is used.
 .P
-1. If the pattern contains tests for the beginning or end of a line, you need
-to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
-subject string for any call does not contain the beginning or end of a line.
+1. If the pattern contains a test for the beginning of a line, you need to pass
+the PCRE_NOTBOL option when the subject string for any call does not start at the
+beginning of a line. There is also a PCRE_NOTEOL option, but in practice when 
+doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which 
+includes the effect of PCRE_NOTEOL.
 .P
 2. Lookbehind assertions at the start of a pattern are catered for in the
 offsets that are returned for a partial match. However, in theory, a lookbehind
@@ -281,10 +319,10 @@
 especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
 Word Boundaries" above describes an issue that arises if the pattern ends with
 \eb or \eB. Another kind of difference may occur when there are multiple
-matching possibilities, because a partial match result is given only when there
-are no completed matches. This means that as soon as the shortest match has
-been found, continuation to a new subject segment is no longer possible.
-Consider again this \fBpcretest\fP example:
+matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
+is given only when there are no completed matches. This means that as soon as
+the shortest match has been found, continuation to a new subject segment is no
+longer possible. Consider again this \fBpcretest\fP example:
 .sp
     re> /dog(sbody)?/
   data> dogsb\eP
@@ -306,8 +344,8 @@
 the other hand, if "dogsbody" is presented as a single string,
 \fBpcre_dfa_exec()\fP finds both matches.
 .P
-Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
-matching multi-segment data. The example above then behaves differently:
+Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
+multi-segment data. The example above then behaves differently:
 .sp
     re> /dog(sbody)?/
   data> dogsb\eP\eP
@@ -341,12 +379,12 @@
 each time:
 .sp
     re> /1234|3789/
-  data> ABC123\eP
+  data> ABC123\eP\eP
   Partial match: 123
   data> 1237890
    0: 3789
 .sp
-Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running
+Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
 the entire match can also be used with \fBpcre_dfa_exec()\fP. Another
 possibility is to work with two buffers. If a partial match at offset \fIn\fP
 in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
@@ -368,6 +406,6 @@
 .rs
 .sp
 .nf
-Last updated: 19 October 2009
-Copyright (c) 1997-2009 University of Cambridge.
+Last updated: 22 October 2010
+Copyright (c) 1997-2010 University of Cambridge.
 .fi

Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/doc/pcretest.1    2010-10-22 15:57:50 UTC (rev 553)
@@ -499,9 +499,11 @@
 \fBpcre_exec()\fP returns, starting with number 0 for the string that matched
 the whole pattern. Otherwise, it outputs "No match" when the return is
 PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
-substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. For any other
-returns, it outputs the PCRE negative error number. Here is an example of an
-interactive \fBpcretest\fP run.
+substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. (Note that this is
+the entire substring that was inspected during the partial match; it may
+include characters before the actual match start if a lookbehind assertion,
+\eK, \eb, or \eB was involved.) For any other returns, it outputs the PCRE
+negative error number. Here is an example of an interactive \fBpcretest\fP run.
 .sp
   $ pcretest
   PCRE version 7.0 30-Nov-2006
@@ -584,7 +586,9 @@
 (Using the normal matching function on this data finds only "tang".) The
 longest matching string is always given first (and numbered zero). After a
 PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
-partially matching substring.
+partially matching substring. (Note that this is the entire substring that was
+inspected during the partial match; it may include characters before the actual
+match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
 .P
 If \fB/g\fP is present on the pattern, the search for further matches resumes
 at the end of the longest match. For example:
@@ -763,6 +767,6 @@
 .rs
 .sp
 .nf
-Last updated: 14 June 2010
+Last updated: 22 October 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi

Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcre.h.in    2010-10-22 15:57:50 UTC (rev 553)
@@ -96,42 +96,44 @@
 #endif

/* Options. Some are compile-time only, some are run-time only, and some are
-both, so we keep them all distinct. */
+both, so we keep them all distinct. However, almost all the bits in the options
+word are now used. In the long run, we may have to re-use some of the
+compile-time only bits for runtime options, or vice versa. */

-#define PCRE_CASELESS           0x00000001
-#define PCRE_MULTILINE          0x00000002
-#define PCRE_DOTALL             0x00000004
-#define PCRE_EXTENDED           0x00000008
-#define PCRE_ANCHORED           0x00000010
-#define PCRE_DOLLAR_ENDONLY     0x00000020
-#define PCRE_EXTRA              0x00000040
-#define PCRE_NOTBOL             0x00000080
-#define PCRE_NOTEOL             0x00000100
-#define PCRE_UNGREEDY           0x00000200
-#define PCRE_NOTEMPTY           0x00000400
-#define PCRE_UTF8               0x00000800
-#define PCRE_NO_AUTO_CAPTURE    0x00001000
-#define PCRE_NO_UTF8_CHECK      0x00002000
-#define PCRE_AUTO_CALLOUT       0x00004000
-#define PCRE_PARTIAL_SOFT       0x00008000
+#define PCRE_CASELESS           0x00000001  /* Compile */
+#define PCRE_MULTILINE          0x00000002  /* Compile */
+#define PCRE_DOTALL             0x00000004  /* Compile */
+#define PCRE_EXTENDED           0x00000008  /* Compile */
+#define PCRE_ANCHORED           0x00000010  /* Compile, exec, DFA exec */
+#define PCRE_DOLLAR_ENDONLY     0x00000020  /* Compile */
+#define PCRE_EXTRA              0x00000040  /* Compile */
+#define PCRE_NOTBOL             0x00000080  /* Exec, DFA exec */
+#define PCRE_NOTEOL             0x00000100  /* Exec, DFA exec */
+#define PCRE_UNGREEDY           0x00000200  /* Compile */
+#define PCRE_NOTEMPTY           0x00000400  /* Exec, DFA exec */
+#define PCRE_UTF8               0x00000800  /* Compile */
+#define PCRE_NO_AUTO_CAPTURE    0x00001000  /* Compile */
+#define PCRE_NO_UTF8_CHECK      0x00002000  /* Compile, exec, DFA exec */
+#define PCRE_AUTO_CALLOUT       0x00004000  /* Compile */
+#define PCRE_PARTIAL_SOFT       0x00008000  /* Exec, DFA exec */
 #define PCRE_PARTIAL            0x00008000  /* Backwards compatible synonym */
-#define PCRE_DFA_SHORTEST       0x00010000
-#define PCRE_DFA_RESTART        0x00020000
-#define PCRE_FIRSTLINE          0x00040000
-#define PCRE_DUPNAMES           0x00080000
-#define PCRE_NEWLINE_CR         0x00100000
-#define PCRE_NEWLINE_LF         0x00200000
-#define PCRE_NEWLINE_CRLF       0x00300000
-#define PCRE_NEWLINE_ANY        0x00400000
-#define PCRE_NEWLINE_ANYCRLF    0x00500000
-#define PCRE_BSR_ANYCRLF        0x00800000
-#define PCRE_BSR_UNICODE        0x01000000
-#define PCRE_JAVASCRIPT_COMPAT  0x02000000
-#define PCRE_NO_START_OPTIMIZE  0x04000000
-#define PCRE_NO_START_OPTIMISE  0x04000000
-#define PCRE_PARTIAL_HARD       0x08000000
-#define PCRE_NOTEMPTY_ATSTART   0x10000000
-#define PCRE_UCP                0x20000000
+#define PCRE_DFA_SHORTEST       0x00010000  /* DFA exec */
+#define PCRE_DFA_RESTART        0x00020000  /* DFA exec */
+#define PCRE_FIRSTLINE          0x00040000  /* Compile */
+#define PCRE_DUPNAMES           0x00080000  /* Compile */
+#define PCRE_NEWLINE_CR         0x00100000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_LF         0x00200000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_CRLF       0x00300000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANY        0x00400000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANYCRLF    0x00500000  /* Compile, exec, DFA exec */
+#define PCRE_BSR_ANYCRLF        0x00800000  /* Compile, exec, DFA exec */
+#define PCRE_BSR_UNICODE        0x01000000  /* Compile, exec, DFA exec */
+#define PCRE_JAVASCRIPT_COMPAT  0x02000000  /* Compile */
+#define PCRE_NO_START_OPTIMIZE  0x04000000  /* Exec, DFA exec */
+#define PCRE_NO_START_OPTIMISE  0x04000000  /* Synonym */
+#define PCRE_PARTIAL_HARD       0x08000000  /* Exec, DFA exec */
+#define PCRE_NOTEMPTY_ATSTART   0x10000000  /* Exec, DFA exec */
+#define PCRE_UCP                0x20000000  /* Compile */

/* Exec-time and get/set-time error codes */

Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcre_dfa_exec.c    2010-10-22 15:57:50 UTC (rev 553)
@@ -831,7 +831,12 @@

       /*-----------------------------------------------------------------*/
       case OP_EOD:
-      if (ptr >= end_subject) { ADD_ACTIVE(state_offset + 1, 0); }
+      if (ptr >= end_subject) 
+        { 
+        if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
+          could_continue = TRUE;
+        else { ADD_ACTIVE(state_offset + 1, 0); }
+        }
       break;

       /*-----------------------------------------------------------------*/
@@ -871,7 +876,9 @@

       /*-----------------------------------------------------------------*/
       case OP_EODN:
-      if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
+      if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
+        could_continue = TRUE;
+      else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
         { ADD_ACTIVE(state_offset + 1, 0); }
       break;

@@ -879,7 +886,9 @@
       case OP_DOLL:
       if ((md->moptions & PCRE_NOTEOL) == 0)
         {
-        if (clen == 0 ||
+        if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
+          could_continue = TRUE;
+        else if (clen == 0 ||
             ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
                ((ims & PCRE_MULTILINE) != 0 || ptr == end_subject - md->nllen)
             ))
@@ -2744,8 +2753,8 @@
         ((md->moptions & PCRE_PARTIAL_SOFT) != 0 &&  /* Soft partial and */
          match_count < 0)                            /* no matches */
         ) &&                                         /* And... */
-        ptr >= end_subject &&                     /* Reached end of subject */
-        ptr > current_subject)                    /* Matched non-empty string */
+        ptr >= end_subject &&                  /* Reached end of subject */
+        ptr > md->start_used_ptr)              /* Inspected non-empty string */
       {
       if (offsetcount >= 2)
         {

Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcre_exec.c    2010-10-22 15:57:50 UTC (rev 553)
@@ -422,17 +422,18 @@
 the subject. */

 #define CHECK_PARTIAL()\
-  if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
-    {\
-    md->hitend = TRUE;\
-    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+  if (md->partial != 0 && eptr >= md->end_subject && \
+      eptr > md->start_used_ptr) \
+    { \
+    md->hitend = TRUE; \
+    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
     }

 #define SCHECK_PARTIAL()\
-  if (md->partial != 0 && eptr > mstart)\
-    {\
-    md->hitend = TRUE;\
-    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+  if (md->partial != 0 && eptr > md->start_used_ptr) \
+    { \
+    md->hitend = TRUE; \
+    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
     }

@@ -711,18 +712,18 @@
     MRRETURN(MATCH_NOMATCH);

     /* COMMIT overrides PRUNE, SKIP, and THEN */
-     
+
     case OP_COMMIT:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
       ims, eptrb, flags, RM52);
     if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
-        rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG && 
-        rrc != MATCH_THEN) 
+        rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
+        rrc != MATCH_THEN)
       RRETURN(rrc);
     MRRETURN(MATCH_COMMIT);

     /* PRUNE overrides THEN */
-     
+
     case OP_PRUNE:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
       ims, eptrb, flags, RM51);
@@ -737,11 +738,11 @@
     RRETURN(MATCH_PRUNE);

     /* SKIP overrides PRUNE and THEN */
-     
+
     case OP_SKIP:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
       ims, eptrb, flags, RM53);
-    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN) 
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
       RRETURN(rrc);
     md->start_match_ptr = eptr;   /* Pass back current position */
     MRRETURN(MATCH_SKIP);
@@ -749,7 +750,7 @@
     case OP_SKIP_ARG:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
       ims, eptrb, flags, RM57);
-    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN) 
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
       RRETURN(rrc);

     /* Pass back the current skip name by overloading md->start_match_ptr and
@@ -759,11 +760,11 @@

     md->start_match_ptr = ecode + 2;
     RRETURN(MATCH_SKIP_ARG);
-    
+
     /* For THEN (and THEN_ARG) we pass back the address of the bracket or
-    the alt that is at the start of the current branch. This makes it possible 
-    to skip back past alternatives that precede the THEN within the current 
-    branch. */ 
+    the alt that is at the start of the current branch. This makes it possible
+    to skip back past alternatives that precede the THEN within the current
+    branch. */

     case OP_THEN:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
@@ -773,7 +774,7 @@
     MRRETURN(MATCH_THEN);

     case OP_THEN_ARG:
-    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE], 
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
       offset_top, md, ims, eptrb, flags, RM58);
     if (rrc != MATCH_NOMATCH) RRETURN(rrc);
     md->start_match_ptr = ecode - GET(ecode, 1);
@@ -1704,37 +1705,40 @@
       if (eptr < md->end_subject)
         { if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
       else
-        { if (md->noteol) MRRETURN(MATCH_NOMATCH); }
+        { 
+        if (md->noteol) MRRETURN(MATCH_NOMATCH); 
+        SCHECK_PARTIAL();
+        }
       ecode++;
       break;
       }
-    else
+    else  /* Not multiline */
       {
       if (md->noteol) MRRETURN(MATCH_NOMATCH);
-      if (!md->endonly)
-        {
-        if (eptr != md->end_subject &&
-            (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
-          MRRETURN(MATCH_NOMATCH);
-        ecode++;
-        break;
-        }
+      if (!md->endonly) goto ASSERT_NL_OR_EOS;
       }
+ 
     /* ... else fall through for endonly */

     /* End of subject assertion (\z) */

     case OP_EOD:
     if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
+    SCHECK_PARTIAL();
     ecode++;
     break;

     /* End of subject or ending \n assertion (\Z) */

     case OP_EODN:
-    if (eptr != md->end_subject &&
+    ASSERT_NL_OR_EOS:
+    if (eptr < md->end_subject &&
         (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
       MRRETURN(MATCH_NOMATCH);
+      
+    /* Either at end of string or \n before end. */
+ 
+    SCHECK_PARTIAL();
     ecode++;
     break;

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/pcretest.c    2010-10-22 15:57:50 UTC (rev 553)
@@ -2313,7 +2313,9 @@
 #endif
           use_dfa = 1;
         continue;
+#endif

+#if !defined NODFA
         case 'F':
         options |= PCRE_DFA_SHORTEST;
         continue;

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testinput2    2010-10-22 15:57:50 UTC (rev 553)
@@ -3499,4 +3499,40 @@
     *** Failers 
     abcxy

+/(?<=abc)def/
+    abc\P\P
+
+/abc$/
+    abc
+    abc\P
+    abc\P\P
+
+/abc$/m
+    abc
+    abc\n
+    abc\P\P
+    abc\n\P\P 
+    abc\P
+    abc\n\P
+
+/abc\z/
+    abc
+    abc\P
+    abc\P\P
+
+/abc\Z/
+    abc
+    abc\P
+    abc\P\P
+
+/abc\b/
+    abc
+    abc\P
+    abc\P\P
+
+/abc\B/
+    abc
+    abc\P
+    abc\P\P
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testinput7
===================================================================
--- code/trunk/testdata/testinput7    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testinput7    2010-10-22 15:57:50 UTC (rev 553)
@@ -4560,4 +4560,40 @@
 /^(?(?!a(*SKIP)b))/
     ac

+/(?<=abc)def/
+    abc\P\P
+
+/abc$/
+    abc
+    abc\P
+    abc\P\P
+
+/abc$/m
+    abc
+    abc\n
+    abc\P\P
+    abc\n\P\P 
+    abc\P
+    abc\n\P
+
+/abc\z/
+    abc
+    abc\P
+    abc\P\P
+
+/abc\Z/
+    abc
+    abc\P
+    abc\P\P
+
+/abc\b/
+    abc
+    abc\P
+    abc\P\P
+
+/abc\B/
+    abc
+    abc\P
+    abc\P\P
+
 /-- End of testinput7 --/

Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testinput8    2010-10-22 15:57:50 UTC (rev 553)
@@ -685,4 +685,8 @@
     xxxxabcde\P
     xxxxabcde\P\P

+/\bthe cat\b/8
+    the cat\P
+    the cat\P\P
+
 /-- End of testinput8 --/

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testoutput2    2010-10-22 15:57:50 UTC (rev 553)
@@ -11138,4 +11138,62 @@
     abcxy 
 No match

+/(?<=abc)def/
+    abc\P\P
+Partial match: abc
+
+/abc$/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc$/m
+    abc
+ 0: abc
+    abc\n
+ 0: abc
+    abc\P\P
+Partial match: abc
+    abc\n\P\P 
+ 0: abc
+    abc\P
+ 0: abc
+    abc\n\P
+ 0: abc
+
+/abc\z/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc\Z/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc\b/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc\B/
+    abc
+No match
+    abc\P
+Partial match: abc
+    abc\P\P
+Partial match: abc
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testoutput7
===================================================================
--- code/trunk/testdata/testoutput7    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testoutput7    2010-10-22 15:57:50 UTC (rev 553)
@@ -7610,4 +7610,62 @@
     ac
 Error -16

+/(?<=abc)def/
+    abc\P\P
+Partial match: abc
+
+/abc$/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc$/m
+    abc
+ 0: abc
+    abc\n
+ 0: abc
+    abc\P\P
+Partial match: abc
+    abc\n\P\P 
+ 0: abc
+    abc\P
+ 0: abc
+    abc\n\P
+ 0: abc
+
+/abc\z/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc\Z/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc\b/
+    abc
+ 0: abc
+    abc\P
+ 0: abc
+    abc\P\P
+Partial match: abc
+
+/abc\B/
+    abc
+No match
+    abc\P
+Partial match: abc
+    abc\P\P
+Partial match: abc
+
 /-- End of testinput7 --/

Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8    2010-10-13 10:15:41 UTC (rev 552)
+++ code/trunk/testdata/testoutput8    2010-10-22 15:57:50 UTC (rev 553)
@@ -1320,4 +1320,10 @@
     xxxxabcde\P\P
 Partial match: abcde

+/\bthe cat\b/8
+    the cat\P
+ 0: the cat
+    the cat\P\P
+Partial match: the cat
+
 /-- End of testinput8 --/
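
The test outputs above suggest a simple rule for the end-of-subject assertions. The following is a minimal C sketch of that rule, not the code from pcre_exec.c. Every name in it is hypothetical, the newline is assumed to be a single "\n", and the "at least one character inspected" condition is an assumption based on ChangeLog item 4 (a partial match is never empty).

```c
#include <stddef.h>

/* All names here are hypothetical and exist only for this sketch. */
enum partial_mode { PARTIAL_OFF, PARTIAL_SOFT, PARTIAL_HARD };
enum outcome { OUTCOME_NOMATCH, OUTCOME_CONTINUE, OUTCOME_PARTIAL };

/* Assumed rule: once the end of the subject is reached, hard partial
   matching reports a partial match, provided at least one character
   has been inspected. Otherwise matching carries on as usual. */
static enum outcome check_partial_at_end(size_t pos, size_t first_inspected,
                                         enum partial_mode mode)
{
    if (mode == PARTIAL_HARD && pos > first_inspected)
        return OUTCOME_PARTIAL;
    return OUTCOME_CONTINUE;
}

/* Model of \z: fails before the end of the subject; at the end it
   defers to the partial check. */
enum outcome assert_eod(size_t len, size_t pos, size_t first_inspected,
                        enum partial_mode mode)
{
    if (pos < len)
        return OUTCOME_NOMATCH;
    return check_partial_at_end(pos, first_inspected, mode);
}

/* Model of $ in multiline mode: a newline at the current position
   satisfies the assertion outright, so no partial check is made.
   Only the true end of the subject triggers the partial check. */
enum outcome assert_dollar_multiline(const char *subject, size_t len,
                                     size_t pos, size_t first_inspected,
                                     enum partial_mode mode)
{
    if (pos < len)
        return subject[pos] == '\n' ? OUTCOME_CONTINUE : OUTCOME_NOMATCH;
    return check_partial_at_end(pos, first_inspected, mode);
}
```

With pos = 3 after matching "abc", this model agrees with the new tests for /abc\z/ and /abc$/m: the subject "abc" with \P\P yields a partial match, while "abc\n" with \P\P continues to a complete match because the newline is seen before the end of the subject.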
