Revision: 932
http://vcs.pcre.org/viewvc?view=rev&revision=932
Author: ph10
Date: 2012-02-24 18:54:43 +0000 (Fri, 24 Feb 2012)
Log Message:
-----------
Add support for PCRE_INFO_MAXLOOKBEHIND.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcreapi.3
code/trunk/doc/pcrepartial.3
code/trunk/pcre.h.in
code/trunk/pcre_compile.c
code/trunk/pcre_fullinfo.c
code/trunk/pcre_internal.h
code/trunk/pcretest.c
code/trunk/testdata/testinput5
code/trunk/testdata/testoutput2
code/trunk/testdata/testoutput5
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/ChangeLog 2012-02-24 18:54:43 UTC (rev 932)
@@ -46,6 +46,8 @@
10. The command "./RunTest list" lists the available tests without actually
running any of them. (Because I keep forgetting what they all are.)
+
+11. Add PCRE_INFO_MAXLOOKBEHIND.
Version 8.30 04-February-2012
Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/doc/pcreapi.3 2012-02-24 18:54:43 UTC (rev 932)
@@ -1235,6 +1235,13 @@
/^a\ed+z\ed+/ the returned value is "z", but for /^a\edz\ed/ the returned value
is -1.
.sp
+ PCRE_INFO_MAXLOOKBEHIND
+.sp
+Return the number of characters (NB not bytes) in the longest lookbehind
+assertion in the pattern. Note that the simple assertions \eb and \eB require a
+one-character lookbehind. This information is useful when doing multi-segment
+matching using the partial matching facilities.
+.sp
PCRE_INFO_MINLENGTH
.sp
If the pattern was studied and a minimum length for matching subject strings
@@ -2646,6 +2653,6 @@
.rs
.sp
.nf
-Last updated: 22 February 2012
+Last updated: 24 February 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/doc/pcrepartial.3 2012-02-24 18:54:43 UTC (rev 932)
@@ -302,14 +302,16 @@
.sp
At this stage, an application could discard the text preceding "23ja", add on
text from the next segment, and call the matching function again. Unlike the
-DFA matching functions the entire matching string must always be available, and
-the complete matching process occurs for each call, so more memory and more
+DFA matching functions, the entire matching string must always be available,
+and the complete matching process occurs for each call, so more memory and more
processing time is needed.
.P
\fBNote:\fP If the pattern contains lookbehind assertions, or \eK, or starts
with \eb or \eB, the string that is returned for a partial match includes
characters that precede the partially matched string itself, because these must
be retained when adding on more characters for a subsequent matching attempt.
+However, in some cases you may need to retain even earlier characters, as
+discussed in the next section.
.
.
.SH "ISSUES WITH MULTI-SEGMENT MATCHING"
@@ -324,14 +326,31 @@
doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
includes the effect of PCRE_NOTEOL.
.P
-2. Lookbehind assertions at the start of a pattern are catered for in the
-offsets that are returned for a partial match. However, in theory, a lookbehind
-assertion later in the pattern could require even earlier characters to be
-inspected, and it might not have been reached when a partial match occurs. This
-is probably an extremely unlikely case; you could guard against it to a certain
-extent by always including extra characters at the start.
+2. Lookbehind assertions that have already been obeyed are catered for in the
+offsets that are returned for a partial match. However a lookbehind assertion
+later in the pattern could require even earlier characters to be inspected. You
+can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the
+\fBpcre_fullinfo()\fP or \fBpcre16_fullinfo()\fP functions to obtain the length
+of the largest lookbehind in the pattern. This length is given in characters,
+not bytes. If you always retain at least that many characters before the
+partially matched string, all should be well. (Of course, near the start of the
+subject, fewer characters may be present; in that case all characters should be
+retained.)
.P
-3. Matching a subject string that is split into multiple segments may not
+3. Because a partial match must always contain at least one character, what
+might be considered a partial match of an empty string actually gives a "no
+match" result. For example:
+.sp
+ re> /c(?<=abc)x/
+ data> ab\eP
+ No match
+.sp
+If the next segment begins "cx", a match should be found, but this will only
+happen if characters from the previous segment are retained. For this reason, a
+"no match" result should be interpreted as "partial match of an empty string"
+when the pattern contains lookbehinds.
+.P
+4. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
@@ -372,7 +391,7 @@
data> gsb\eR\eP\eP\eD
Partial match: gsb
.sp
-4. Patterns that contain alternatives at the top level which do not all start
+5. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE_DFA_RESTART is
used. For example, consider this pattern:
.sp
@@ -421,6 +440,6 @@
.rs
.sp
.nf
-Last updated: 18 February 2012
+Last updated: 24 February 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre.h.in 2012-02-24 18:54:43 UTC (rev 932)
@@ -234,6 +234,7 @@
#define PCRE_INFO_MINLENGTH 15
#define PCRE_INFO_JIT 16
#define PCRE_INFO_JITSIZE 17
+#define PCRE_INFO_MAXLOOKBEHIND 18
/* Request types for pcre_config(). Do not re-arrange, in order to remain
compatible. */
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre_compile.c 2012-02-24 18:54:43 UTC (rev 932)
@@ -6852,10 +6852,13 @@
/* For the rest (including \X when Unicode properties are supported), we
can obtain the OP value by negating the escape value in the default
situation when PCRE_UCP is not set. When it *is* set, we substitute
- Unicode property tests. */
+ Unicode property tests. Note that \b and \B do a one-character
+ lookbehind. */
else
{
+ if ((-c == ESC_b || -c == ESC_B) && cd->max_lookbehind == 0)
+ cd->max_lookbehind = 1;
#ifdef SUPPORT_UCP
if (-c >= ESC_DU && -c <= ESC_wu)
{
@@ -7163,7 +7166,12 @@
*ptrptr = ptr;
return FALSE;
}
- else { PUT(reverse_count, 0, fixed_length); }
+ else
+ {
+ if (fixed_length > cd->max_lookbehind)
+ cd->max_lookbehind = fixed_length;
+ PUT(reverse_count, 0, fixed_length);
+ }
}
}
@@ -7833,6 +7841,7 @@
cd->end_pattern = (const pcre_uchar *)(pattern + STRLEN_UC((const pcre_uchar *)pattern));
cd->req_varyopt = 0;
cd->assert_depth = 0;
+cd->max_lookbehind = 0;
cd->external_options = options;
cd->external_flags = 0;
cd->open_caps = NULL;
@@ -7883,7 +7892,6 @@
re->size = (int)size;
re->options = cd->external_options;
re->flags = cd->external_flags;
-re->dummy1 = 0;
re->first_char = 0;
re->req_char = 0;
re->name_table_offset = sizeof(REAL_PCRE) / sizeof(pcre_uchar);
@@ -7903,6 +7911,7 @@
cd->final_bracount = cd->bracount; /* Save for checking forward references */
cd->assert_depth = 0;
cd->bracount = 0;
+cd->max_lookbehind = 0;
cd->names_found = 0;
cd->name_table = (pcre_uchar *)re + re->name_table_offset;
codestart = cd->name_table + re->name_entry_size * re->name_count;
@@ -7924,6 +7933,7 @@
&firstchar, &reqchar, NULL, cd, NULL);
re->top_bracket = cd->bracount;
re->top_backref = cd->top_backref;
+re->max_lookbehind = cd->max_lookbehind;
re->flags = cd->external_flags | PCRE_MODE;
if (cd->had_accept) reqchar = REQ_NONE; /* Must disable after (*ACCEPT) */
@@ -8011,6 +8021,7 @@
(fixed_length == -4)? ERR70 : ERR25;
break;
}
+ if (fixed_length > cd->max_lookbehind) cd->max_lookbehind = fixed_length;
PUT(cc, 1, fixed_length);
}
cc += 1 + LINK_SIZE;
Modified: code/trunk/pcre_fullinfo.c
===================================================================
--- code/trunk/pcre_fullinfo.c 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre_fullinfo.c 2012-02-24 18:54:43 UTC (rev 932)
@@ -192,6 +192,10 @@
case PCRE_INFO_HASCRORLF:
*((int *)where) = (re->flags & PCRE_HASCRORLF) != 0;
break;
+
+ case PCRE_INFO_MAXLOOKBEHIND:
+ *((int *)where) = re->max_lookbehind;
+ break;
default: return PCRE_ERROR_BADOPTION;
}
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre_internal.h 2012-02-24 18:54:43 UTC (rev 932)
@@ -1974,16 +1974,15 @@
pcre_uint32 size; /* Total that was malloced */
pcre_uint32 options; /* Public options */
pcre_uint16 flags; /* Private flags */
- pcre_uint16 dummy1; /* For future use */
- pcre_uint16 top_bracket;
- pcre_uint16 top_backref;
+ pcre_uint16 max_lookbehind; /* Longest lookbehind (characters) */
+ pcre_uint16 top_bracket; /* Highest numbered group */
+ pcre_uint16 top_backref; /* Highest numbered back reference */
pcre_uint16 first_char; /* Starting character */
pcre_uint16 req_char; /* This character must be seen */
pcre_uint16 name_table_offset; /* Offset to name table that follows */
pcre_uint16 name_entry_size; /* Size of any name items */
pcre_uint16 name_count; /* Number of name items */
pcre_uint16 ref_count; /* Reference count */
-
const pcre_uint8 *tables; /* Pointer to tables or NULL for std */
const pcre_uint8 *nullpad; /* NULL padding */
} REAL_PCRE;
@@ -2029,6 +2028,7 @@
int workspace_size; /* Size of workspace */
int bracount; /* Count of capturing parens as we compile */
int final_bracount; /* Saved value after first pass */
+ int max_lookbehind; /* Maximum lookbehind (characters) */
int top_backref; /* Maximum back reference */
unsigned int backref_map; /* Bitmap of low back refs */
int assert_depth; /* Depth of nested assertions */
Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcretest.c 2012-02-24 18:54:43 UTC (rev 932)
@@ -3099,7 +3099,7 @@
{
unsigned long int all_options;
int count, backrefmax, first_char, need_char, okpartial, jchanged,
- hascrorlf;
+ hascrorlf, maxlookbehind;
int nameentrysize, namecount;
const pcre_uint8 *nametable;
@@ -3113,7 +3113,8 @@
new_info(re, NULL, PCRE_INFO_NAMETABLE, (void *)&nametable) +
new_info(re, NULL, PCRE_INFO_OKPARTIAL, &okpartial) +
new_info(re, NULL, PCRE_INFO_JCHANGED, &jchanged) +
- new_info(re, NULL, PCRE_INFO_HASCRORLF, &hascrorlf)
+ new_info(re, NULL, PCRE_INFO_HASCRORLF, &hascrorlf) +
+ new_info(re, NULL, PCRE_INFO_MAXLOOKBEHIND, &maxlookbehind)
!= 0)
goto SKIP_DATA;
@@ -3252,6 +3253,9 @@
fprintf(outfile, "%s\n", caseless);
}
}
+
+ if (maxlookbehind > 0)
+ fprintf(outfile, "Max lookbehind = %d\n", maxlookbehind);
/* Don't output study size; at present it is in any case a fixed
value, but it varies, depending on the computer architecture, and
Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/testdata/testinput5 2012-02-24 18:54:43 UTC (rev 932)
@@ -793,4 +793,6 @@
/[^\x{100}]*[^\x{10000}]+[^\x{10ffff}]??[^\x{8000}]{4,}[^\x{7fff}]{2,9}?[^\x{fffff}]{5,6}+/8BZi
+/(?<=\x{1234}\x{1234})\bxy/I8
+
/-- End of testinput5 --/
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/testdata/testoutput2 2012-02-24 18:54:43 UTC (rev 932)
@@ -451,6 +451,7 @@
No options
First char = 'f'
Need char = 'o'
+Max lookbehind = 6
foo
0: foo
catfoo
@@ -658,6 +659,7 @@
No options
No first char
No need char
+Max lookbehind = 3
Subject length lower bound = 1
Starting byte set: a b
@@ -666,6 +668,7 @@
No options
No first char
Need char = 'a'
+Max lookbehind = 3
Subject length lower bound = 5
Starting byte set: a o
@@ -683,6 +686,7 @@
Options: multiline
No first char
Need char = 'r'
+Max lookbehind = 4
foo\nbarbar
0: bar
***Failers
@@ -700,6 +704,7 @@
Options: multiline
First char at start or follows newline
Need char = 'r'
+Max lookbehind = 4
foo\nbarbar
0: bar
***Failers
@@ -741,6 +746,7 @@
No options
First char = '-'
Need char = 't'
+Max lookbehind = 7
the bullock-cart
0: -cart
a donkey-cart race
@@ -757,12 +763,14 @@
No options
No first char
No need char
+Max lookbehind = 3
/(?>.*)(?<=(abcd)|(xyz))/I
Capturing subpattern count = 2
No options
First char at start or follows newline
No need char
+Max lookbehind = 4
alphabetabcd
0: alphabetabcd
1: abcd
@@ -776,6 +784,7 @@
No options
First char = 'Z'
Need char = 'Z'
+Max lookbehind = 4
abxyZZ
0: ZZ
abXyZZ
@@ -804,6 +813,7 @@
No options
First char = 'b'
Need char = 'r'
+Max lookbehind = 4
bar
0: bar
foobbar
@@ -1205,6 +1215,7 @@
No options
First char = 'i'
Need char = 's'
+Max lookbehind = 1
Mississippi
0: iss
0+ issippi
@@ -1225,6 +1236,7 @@
No options
First char = 'i'
Need char = 's'
+Max lookbehind = 1
Mississippi
0: iss
0+ issippi
@@ -1234,6 +1246,7 @@
No options
First char = 'i'
Need char = 's'
+Max lookbehind = 1
Mississippi
0: iss
0+ issippi
@@ -1249,6 +1262,7 @@
No options
First char = 'i'
Need char = 's'
+Max lookbehind = 1
Mississippi
0: iss
0+ issippi
@@ -1260,6 +1274,7 @@
No options
First char = 'i'
Need char = 's'
+Max lookbehind = 1
Mississippi
0: iss
0+ issippi
@@ -1440,6 +1455,7 @@
No options
No first char
No need char
+Max lookbehind = 3
/abc(?!pqr)/I
Capturing subpattern count = 0
@@ -3220,6 +3236,7 @@
No options
First char = '8'
Need char = 'X'
+Max lookbehind = 1
|\$\<\.X\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\<EjmhUZ\?\.akp2dF\>qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|IDZ
------------------------------------------------------------------
@@ -3233,6 +3250,7 @@
No options
First char = '$'
Need char = 'X'
+Max lookbehind = 1
/(.*)\d+\1/I
Capturing subpattern count = 1
@@ -3748,6 +3766,7 @@
No options
First char = 'x'
Need char = 'z'
+Max lookbehind = 3
abcxyz\C+
Callout 0: last capture = 1
0: <unset>
@@ -5395,6 +5414,7 @@
No options
No first char
No need char
+Max lookbehind = 1
ab cd\>1
0: cd
@@ -5403,6 +5423,7 @@
Options: dotall
No first char
No need char
+Max lookbehind = 1
ab cd\>1
0: cd
@@ -11596,6 +11617,7 @@
No options
First char = 't'
Need char = 't'
+Max lookbehind = 1
Subject length lower bound = 18
No set of starting bytes
@@ -11604,6 +11626,7 @@
No options
No first char
No need char
+Max lookbehind = 1
Subject length lower bound = 8
Starting byte set: < o t u
Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5 2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/testdata/testoutput5 2012-02-24 18:54:43 UTC (rev 932)
@@ -1873,4 +1873,11 @@
End
------------------------------------------------------------------
+/(?<=\x{1234}\x{1234})\bxy/I8
+Capturing subpattern count = 0
+Options: utf
+First char = 'x'
+Need char = 'y'
+Max lookbehind = 2
+
/-- End of testinput5 --/
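
The pcrepartial.3 text added in this revision says to retain at least "max lookbehind" characters (not bytes) before the partially matched string, and to treat "no match" as a partial match of an empty string when the pattern contains lookbehinds. A minimal sketch of that bookkeeping follows. The helper names retain_from() and retain_from_no_match() are hypothetical and not part of PCRE. The sketch assumes an 8-bit subject, that max_lookbehind was obtained via PCRE_INFO_MAXLOOKBEHIND, and that partial_start is the start offset reported for the partial match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function). Returns the byte offset from
   which subject text should be kept before the next segment is appended.
   max_lookbehind is a count of characters; when utf8 is non-zero, each
   character may occupy several bytes. */
size_t retain_from(const unsigned char *subject, size_t partial_start,
                   int max_lookbehind, int utf8)
{
  size_t pos = partial_start;
  assert(subject != NULL || partial_start == 0);

  /* Step back one character at a time, stopping at the subject start
     ("near the start of the subject ... all characters should be retained"). */
  while (max_lookbehind-- > 0 && pos > 0)
    {
    pos--;
    /* In UTF-8 mode, continuation bytes look like 10xxxxxx; keep moving
       back until the first byte of the character is reached. */
    if (utf8)
      while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
    }
  return pos;
}

/* Hypothetical helper for the "no match" case: treat it as a partial match
   of an empty string at the end of the current subject. */
size_t retain_from_no_match(const unsigned char *subject, size_t length,
                            int max_lookbehind, int utf8)
{
  return retain_from(subject, length, max_lookbehind, utf8);
}
```

For the /c(?<=abc)x/ example above, with the first segment "ab" and a maximum lookbehind of 3, retain_from_no_match() returns 0, so both characters are kept and a following segment beginning "cx" can match. The 16-bit library would need the same logic applied to 16-bit code units, which this sketch does not cover.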