Revision: 1028
http://www.exim.org/viewvc/pcre2?view=rev&revision=1028
Author: ph10
Date: 2018-10-17 09:33:38 +0100 (Wed, 17 Oct 2018)
Log Message:
-----------
Implement PCRE2_COPY_MATCHED_SUBJECT.
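
A rough illustration of the ownership rule this option introduces, as a
self-contained toy model rather than PCRE2 code. Every name below
(toy_match, toy_match_data, TOY_COPY_MATCHED_SUBJECT and so on) is
hypothetical, and "matching" is just a byte search. The sketch only shows
the behaviour described in the documentation changes: the subject is copied
after a successful match, and the copy is released when the block is freed
or re-used. Here toy_match_data_reset() stands in for both of those events.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout: a toy model, not PCRE2 source code. */
#define TOY_COPY_MATCHED_SUBJECT 0x1u
#define TOY_ERROR_NOMATCH  (-1)
#define TOY_ERROR_NOMEMORY (-2)

typedef struct {
    const char *subject;  /* string that extraction reads from */
    char *owned_copy;     /* non-NULL only when the block owns a copy */
    size_t start, end;    /* offsets of the last successful match */
} toy_match_data;

/* Forget the remembered subject, freeing a private copy if there is one. */
void toy_match_data_reset(toy_match_data *md) {
    assert(md != NULL);
    free(md->owned_copy);
    md->owned_copy = NULL;
    md->subject = NULL;
    md->start = md->end = 0;
}

/* "Match" here means: find the first occurrence of needle in the subject. */
int toy_match(const char *needle, const char *subject, size_t length,
              unsigned options, toy_match_data *md) {
    size_t nlen = strlen(needle);
    toy_match_data_reset(md);           /* re-use releases any earlier copy */
    for (size_t i = 0; i + nlen <= length; i++) {
        const char *remembered = subject;   /* default: caller's pointer */
        if (memcmp(subject + i, needle, nlen) != 0) continue;
        if (options & TOY_COPY_MATCHED_SUBJECT) {
            /* Copy only now that the match is known to have succeeded. */
            char *copy = malloc(length + 1);
            if (copy == NULL) return TOY_ERROR_NOMEMORY;
            memcpy(copy, subject, length);
            copy[length] = '\0';
            md->owned_copy = copy;
            remembered = copy;
        }
        md->subject = remembered;
        md->start = i;
        md->end = i + nlen;
        return 0;
    }
    return TOY_ERROR_NOMATCH;
}

/* Copy the matched text into buf as a zero-terminated string. */
int toy_substring_copy(const toy_match_data *md, char *buf, size_t buflen) {
    assert(md != NULL && buf != NULL);
    if (md->subject == NULL) return TOY_ERROR_NOMATCH;
    size_t len = md->end - md->start;
    if (len + 1 > buflen) return TOY_ERROR_NOMEMORY;
    memcpy(buf, md->subject + md->start, len);
    buf[len] = '\0';
    return 0;
}
```

With the flag set, a caller of this model may free its own subject buffer
straight after a successful toy_match() and still call toy_substring_copy().
Without the flag, the block holds the caller's pointer and the buffer must
outlive every extraction call.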
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/html/pcre2_dfa_match.html
code/trunk/doc/html/pcre2_match.html
code/trunk/doc/html/pcre2_match_data_free.html
code/trunk/doc/html/pcre2api.html
code/trunk/doc/html/pcre2jit.html
code/trunk/doc/pcre2.txt
code/trunk/doc/pcre2_dfa_match.3
code/trunk/doc/pcre2_match.3
code/trunk/doc/pcre2_match_data_free.3
code/trunk/doc/pcre2api.3
code/trunk/doc/pcre2jit.3
code/trunk/src/pcre2.h.in
code/trunk/src/pcre2_dfa_match.c
code/trunk/src/pcre2_internal.h
code/trunk/src/pcre2_intmodedep.h
code/trunk/src/pcre2_jit_match.c
code/trunk/src/pcre2_match.c
code/trunk/src/pcre2_match_data.c
code/trunk/src/pcre2test.c
code/trunk/testdata/testinput17
code/trunk/testdata/testinput2
code/trunk/testdata/testinput6
code/trunk/testdata/testoutput17
code/trunk/testdata/testoutput2
code/trunk/testdata/testoutput6
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/ChangeLog 2018-10-17 08:33:38 UTC (rev 1028)
@@ -37,7 +37,11 @@
ranges such as a-z in EBCDIC environments. The original code probably never
worked, though there were no bug reports.
+10. Implement PCRE2_COPY_MATCHED_SUBJECT for pcre2_match() (including JIT via
+pcre2_match()) and pcre2_dfa_match(), but *not* the pcre2_jit_match() fast
+path.
+
Version 10.32 10-September-2018
-------------------------------
Modified: code/trunk/doc/html/pcre2_dfa_match.html
===================================================================
--- code/trunk/doc/html/pcre2_dfa_match.html 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/html/pcre2_dfa_match.html 2018-10-17 08:33:38 UTC (rev 1028)
@@ -51,6 +51,8 @@
characters. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
+ PCRE2_COPY_MATCHED_SUBJECT
+ On success, make a private subject copy
PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
Modified: code/trunk/doc/html/pcre2_match.html
===================================================================
--- code/trunk/doc/html/pcre2_match.html 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/html/pcre2_match.html 2018-10-17 08:33:38 UTC (rev 1028)
@@ -55,11 +55,13 @@
Change the backtracking depth limit
Set custom memory management specifically for the match
</pre>
-The <i>length</i> and <i>startoffset</i> values are code
-units, not characters. The length may be given as PCRE2_ZERO_TERMINATE for a
-subject that is terminated by a binary zero code unit. The options are:
+The <i>length</i> and <i>startoffset</i> values are code units, not characters.
+The length may be given as PCRE2_ZERO_TERMINATED for a subject that is
+terminated by a binary zero code unit. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
+ PCRE2_COPY_MATCHED_SUBJECT
+ On success, make a private subject copy
PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
Modified: code/trunk/doc/html/pcre2_match_data_free.html
===================================================================
--- code/trunk/doc/html/pcre2_match_data_free.html 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/html/pcre2_match_data_free.html 2018-10-17 08:33:38 UTC (rev 1028)
@@ -31,6 +31,11 @@
with which it was created, or <b>free()</b> if that was not set.
</P>
<P>
+If the PCRE2_COPY_MATCHED_SUBJECT option was used for a successful match using this
+match data block, the copy of the subject that was remembered with the block is
+also freed.
+</P>
+<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
Modified: code/trunk/doc/html/pcre2api.html
===================================================================
--- code/trunk/doc/html/pcre2api.html 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/html/pcre2api.html 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1305,10 +1305,13 @@
NOTE: When one of the matching functions is called, pointers to the compiled
pattern and the subject string are set in the match data block so that they can
be referenced by the substring extraction functions. After running a match, you
-must not free a compiled pattern (or a subject string) until after all
+must not free a compiled pattern or a subject string until after all
operations on the
<a href="#matchdatablock">match data block</a>
-have taken place.
+have taken place, unless, in the case of the subject string, you have used the
+PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
+"Option bits for <b>pcre2_match()</b>"
+<a href="#matchoptions">below.</a>
</P>
<P>
The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit
@@ -2419,7 +2422,10 @@
and the subject string are set in the match data block so that they can be
referenced by the extraction functions. After running a match, you must not
free a compiled pattern or a subject string until after all operations on the
-match data block (for that match) have taken place.
+match data block (for that match) have taken place, unless, in the case of the
+subject string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
+described in the section entitled "Option bits for <b>pcre2_match()</b>"
+<a href="#matchoptions">below.</a>
</P>
<P>
When a match data block itself is no longer needed, it should be freed by
@@ -2531,10 +2537,10 @@
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be
-zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
-Their action is described below.
+zero. The only bits that may be set are PCRE2_ANCHORED,
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
+PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,
+PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
</P>
<P>
Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
@@ -2550,6 +2556,22 @@
matching time. Note that setting the option at match time disables JIT
matching.
<pre>
+ PCRE2_COPY_MATCHED_SUBJECT
+</pre>
+By default, a pointer to the subject is remembered in the match data block so
+that, after a successful match, it can be referenced by the substring
+extraction functions. This means that the subject's memory must not be freed
+until all such operations are complete. For some applications where the
+lifetime of the subject string is not guaranteed, it may be necessary to make a
+copy of the subject string, but it is wasteful to do this unless the match is
+successful. After a successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the
+subject is copied and the new pointer is remembered in the match data block
+instead of the original subject pointer. The memory allocator that was used for
+the match block itself is used. The copy is automatically freed when
+<b>pcre2_match_data_free()</b> is called to free the match data block. It is also
+automatically freed if the match data block is re-used for another match
+operation.
+<pre>
PCRE2_ENDANCHORED
</pre>
If the PCRE2_ENDANCHORED option is set, any string that <b>pcre2_match()</b>
@@ -2954,7 +2976,8 @@
If a pattern contains many nested backtracking points, heap memory is used to
remember them. This error is given when the memory allocation function (default
or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given
-if the amount of memory needed exceeds the heap limit.
+if the amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
+also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
<pre>
PCRE2_ERROR_NULL
</pre>
@@ -3584,11 +3607,12 @@
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
-be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST,
-and PCRE2_DFA_RESTART. All but the last four of these are exactly the same as
-for <b>pcre2_match()</b>, so their description is not repeated here.
+be zero. The only bits that may be set are PCRE2_ANCHORED,
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
+PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
+PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last
+four of these are exactly the same as for <b>pcre2_match()</b>, so their
+description is not repeated here.
<pre>
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -3732,7 +3756,7 @@
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 21 September 2018
+Last updated: 16 October 2018
<br>
Copyright © 1997-2018 University of Cambridge.
<br>
Modified: code/trunk/doc/html/pcre2jit.html
===================================================================
--- code/trunk/doc/html/pcre2jit.html 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/html/pcre2jit.html 2018-10-17 08:33:38 UTC (rev 1028)
@@ -147,9 +147,10 @@
<br><a name="SEC4" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P>
The <b>pcre2_match()</b> options that are supported for JIT matching are
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
-PCRE2_ANCHORED option is not supported at match time.
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
+PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options are not
+supported at match time.
</P>
<P>
If the PCRE2_NO_JIT option is passed to <b>pcre2_match()</b> it disables the
@@ -402,10 +403,13 @@
</P>
<P>
The fast path function is called <b>pcre2_jit_match()</b>, and it takes exactly
-the same arguments as <b>pcre2_match()</b>. The return values are also the same,
-plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
-requested that was not compiled. Unsupported option bits (for example,
-PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
+the same arguments as <b>pcre2_match()</b>. However, the subject string must be
+specified with a length; PCRE2_ZERO_TERMINATED is not supported. Unsupported
+option bits (for example, PCRE2_ANCHORED, PCRE2_ENDANCHORED and
+PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is the PCRE2_NO_JIT option. The
+return values are also the same as for <b>pcre2_match()</b>, plus
+PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
+that was not compiled.
</P>
<P>
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
@@ -434,7 +438,7 @@
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 28 June 2018
+Last updated: 16 October 2018
<br>
Copyright © 1997-2018 University of Cambridge.
<br>
Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/pcre2.txt 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1302,9 +1302,11 @@
NOTE: When one of the matching functions is called, pointers to the
compiled pattern and the subject string are set in the match data block
so that they can be referenced by the substring extraction functions.
- After running a match, you must not free a compiled pattern (or a sub-
- ject string) until after all operations on the match data block have
- taken place.
+ After running a match, you must not free a compiled pattern or a sub-
+ ject string until after all operations on the match data block have
+ taken place, unless, in the case of the subject string, you have used
+ the PCRE2_COPY_MATCHED_SUBJECT option, which is described in the sec-
+ tion entitled "Option bits for pcre2_match()" below.
The options argument for pcre2_compile() contains various bit settings
that affect the compilation. It should be zero if no options are
@@ -2388,7 +2390,9 @@
they can be referenced by the extraction functions. After running a
match, you must not free a compiled pattern or a subject string until
after all operations on the match data block (for that match) have
- taken place.
+ taken place, unless, in the case of the subject string, you have used
+ the PCRE2_COPY_MATCHED_SUBJECT option, which is described in the sec-
+ tion entitled "Option bits for pcre2_match()" below.
When a match data block itself is no longer needed, it should be freed
by calling pcre2_match_data_free(). If this function is called with a
@@ -2488,25 +2492,43 @@
Option bits for pcre2_match()
The unused bits of the options argument for pcre2_match() must be zero.
- The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
- PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
- PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR-
- TIAL_SOFT. Their action is described below.
+ The only bits that may be set are PCRE2_ANCHORED,
+ PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,
+ PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
+ PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their
+ action is described below.
- Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
- ported by the just-in-time (JIT) compiler. If it is set, JIT matching
- is disabled and the interpretive code in pcre2_match() is run. Apart
- from PCRE2_NO_JIT (obviously), the remaining options are supported for
+ Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
+ ported by the just-in-time (JIT) compiler. If it is set, JIT matching
+ is disabled and the interpretive code in pcre2_match() is run. Apart
+ from PCRE2_NO_JIT (obviously), the remaining options are supported for
JIT matching.
PCRE2_ANCHORED
The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
- matching position. If a pattern was compiled with PCRE2_ANCHORED, or
- turned out to be anchored by virtue of its contents, it cannot be made
- unanchored at matching time. Note that setting the option at match time
+ matching position. If a pattern was compiled with PCRE2_ANCHORED, or
+ turned out to be anchored by virtue of its contents, it cannot be made
+ unanchored at matching time. Note that setting the option at match time
disables JIT matching.
+ PCRE2_COPY_MATCHED_SUBJECT
+
+ By default, a pointer to the subject is remembered in the match data
+ block so that, after a successful match, it can be referenced by the
+ substring extraction functions. This means that the subject's memory
+ must not be freed until all such operations are complete. For some
+ applications where the lifetime of the subject string is not guaran-
+ teed, it may be necessary to make a copy of the subject string, but it
+ is wasteful to do this unless the match is successful. After a success-
+ ful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied
+ and the new pointer is remembered in the match data block instead of
+ the original subject pointer. The memory allocator that was used for
+ the match block itself is used. The copy is automatically freed when
+ pcre2_match_data_free() is called to free the match data block. It is
+ also automatically freed if the match data block is re-used for another
+ match operation.
+
PCRE2_ENDANCHORED
If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
@@ -2881,7 +2903,8 @@
used to remember them. This error is given when the memory allocation
function (default or custom) fails. Note that a different error,
PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
- the heap limit.
+ the heap limit. PCRE2_ERROR_NOMEMORY is also returned if
+ PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
PCRE2_ERROR_NULL
@@ -2889,12 +2912,12 @@
PCRE2_ERROR_RECURSELOOP
- This error is returned when pcre2_match() detects a recursion loop
- within the pattern. Specifically, it means that either the whole pat-
+ This error is returned when pcre2_match() detects a recursion loop
+ within the pattern. Specifically, it means that either the whole pat-
tern or a subpattern has been called recursively for the second time at
- the same position in the subject string. Some simple patterns that
- might do this are detected and faulted at compile time, but more com-
- plicated cases, in particular mutual recursions between two different
+ the same position in the subject string. Some simple patterns that
+ might do this are detected and faulted at compile time, but more com-
+ plicated cases, in particular mutual recursions between two different
subpatterns, cannot be detected until matching is attempted.
@@ -2903,20 +2926,20 @@
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
PCRE2_SIZE bufflen);
- A text message for an error code from any PCRE2 function (compile,
- match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
- sage(). The code is passed as the first argument, with the remaining
- two arguments specifying a code unit buffer and its length in code
- units, into which the text message is placed. The message is returned
- in code units of the appropriate width for the library that is being
+ A text message for an error code from any PCRE2 function (compile,
+ match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
+ sage(). The code is passed as the first argument, with the remaining
+ two arguments specifying a code unit buffer and its length in code
+ units, into which the text message is placed. The message is returned
+ in code units of the appropriate width for the library that is being
used.
- The returned message is terminated with a trailing zero, and the func-
- tion returns the number of code units used, excluding the trailing
+ The returned message is terminated with a trailing zero, and the func-
+ tion returns the number of code units used, excluding the trailing
zero. If the error number is unknown, the negative error code
- PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
- sage is truncated (but still with a trailing zero), and the negative
- error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
+ PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
+ sage is truncated (but still with a trailing zero), and the negative
+ error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
very long; a buffer size of 120 code units is ample.
@@ -2935,39 +2958,39 @@
void pcre2_substring_free(PCRE2_UCHAR *buffer);
- Captured substrings can be accessed directly by using the ovector as
+ Captured substrings can be accessed directly by using the ovector as
described above. For convenience, auxiliary functions are provided for
- extracting captured substrings as new, separate, zero-terminated
+ extracting captured substrings as new, separate, zero-terminated
strings. A substring that contains a binary zero is correctly extracted
- and has a further zero added on the end, but the result is not, of
+ and has a further zero added on the end, but the result is not, of
course, a C string.
The functions in this section identify substrings by number. The number
zero refers to the entire matched substring, with higher numbers refer-
- ring to substrings captured by parenthesized groups. After a partial
- match, only substring zero is available. An attempt to extract any
- other substring gives the error PCRE2_ERROR_PARTIAL. The next section
+ ring to substrings captured by parenthesized groups. After a partial
+ match, only substring zero is available. An attempt to extract any
+ other substring gives the error PCRE2_ERROR_PARTIAL. The next section
describes similar functions for extracting captured substrings by name.
- If a pattern uses the \K escape sequence within a positive assertion,
+ If a pattern uses the \K escape sequence within a positive assertion,
the reported start of a successful match can be greater than the end of
- the match. For example, if the pattern (?=ab\K) is matched against
- "ab", the start and end offset values for the match are 2 and 0. In
- this situation, calling these functions with a zero substring number
+ the match. For example, if the pattern (?=ab\K) is matched against
+ "ab", the start and end offset values for the match are 2 and 0. In
+ this situation, calling these functions with a zero substring number
extracts a zero-length empty string.
- You can find the length in code units of a captured substring without
- extracting it by calling pcre2_substring_length_bynumber(). The first
- argument is a pointer to the match data block, the second is the group
- number, and the third is a pointer to a variable into which the length
- is placed. If you just want to know whether or not the substring has
+ You can find the length in code units of a captured substring without
+ extracting it by calling pcre2_substring_length_bynumber(). The first
+ argument is a pointer to the match data block, the second is the group
+ number, and the third is a pointer to a variable into which the length
+ is placed. If you just want to know whether or not the substring has
been captured, you can pass the third argument as NULL.
- The pcre2_substring_copy_bynumber() function copies a captured sub-
- string into a supplied buffer, whereas pcre2_substring_get_bynumber()
- copies it into new memory, obtained using the same memory allocation
- function that was used for the match data block. The first two argu-
- ments of these functions are a pointer to the match data block and a
+ The pcre2_substring_copy_bynumber() function copies a captured sub-
+ string into a supplied buffer, whereas pcre2_substring_get_bynumber()
+ copies it into new memory, obtained using the same memory allocation
+ function that was used for the match data block. The first two argu-
+ ments of these functions are a pointer to the match data block and a
capturing group number.
The final arguments of pcre2_substring_copy_bynumber() are a pointer to
@@ -2976,25 +2999,25 @@
for the extracted substring, excluding the terminating zero.
For pcre2_substring_get_bynumber() the third and fourth arguments point
- to variables that are updated with a pointer to the new memory and the
- number of code units that comprise the substring, again excluding the
- terminating zero. When the substring is no longer needed, the memory
+ to variables that are updated with a pointer to the new memory and the
+ number of code units that comprise the substring, again excluding the
+ terminating zero. When the substring is no longer needed, the memory
should be freed by calling pcre2_substring_free().
- The return value from all these functions is zero for success, or a
- negative error code. If the pattern match failed, the match failure
- code is returned. If a substring number greater than zero is used
- after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
+ The return value from all these functions is zero for success, or a
+ negative error code. If the pattern match failed, the match failure
+ code is returned. If a substring number greater than zero is used
+ after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
error codes are:
PCRE2_ERROR_NOMEMORY
- The buffer was too small for pcre2_substring_copy_bynumber(), or the
+ The buffer was too small for pcre2_substring_copy_bynumber(), or the
attempt to get memory failed for pcre2_substring_get_bynumber().
PCRE2_ERROR_NOSUBSTRING
- There is no substring with that number in the pattern, that is, the
+ There is no substring with that number in the pattern, that is, the
number is greater than the number of capturing parentheses.
PCRE2_ERROR_UNAVAILABLE
@@ -3005,8 +3028,8 @@
PCRE2_ERROR_UNSET
- The substring did not participate in the match. For example, if the
- pattern is (abc)|(def) and the subject is "def", and the ovector con-
+ The substring did not participate in the match. For example, if the
+ pattern is (abc)|(def) and the subject is "def", and the ovector con-
tains at least two capturing slots, substring number 1 is unset.
@@ -3017,32 +3040,32 @@
void pcre2_substring_list_free(PCRE2_SPTR *list);
- The pcre2_substring_list_get() function extracts all available sub-
- strings and builds a list of pointers to them. It also (optionally)
- builds a second list that contains their lengths (in code units),
+ The pcre2_substring_list_get() function extracts all available sub-
+ strings and builds a list of pointers to them. It also (optionally)
+ builds a second list that contains their lengths (in code units),
excluding a terminating zero that is added to each of them. All this is
done in a single block of memory that is obtained using the same memory
allocation function that was used to get the match data block.
- This function must be called only after a successful match. If called
+ This function must be called only after a successful match. If called
after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
- The address of the memory block is returned via listptr, which is also
+ The address of the memory block is returned via listptr, which is also
the start of the list of string pointers. The end of the list is marked
- by a NULL pointer. The address of the list of lengths is returned via
- lengthsptr. If your strings do not contain binary zeros and you do not
+ by a NULL pointer. The address of the list of lengths is returned via
+ lengthsptr. If your strings do not contain binary zeros and you do not
therefore need the lengths, you may supply NULL as the lengthsptr argu-
- ment to disable the creation of a list of lengths. The yield of the
- function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
- ory block could not be obtained. When the list is no longer needed, it
+ ment to disable the creation of a list of lengths. The yield of the
+ function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
+ ory block could not be obtained. When the list is no longer needed, it
should be freed by calling pcre2_substring_list_free().
If this function encounters a substring that is unset, which can happen
- when capturing subpattern number n+1 matches some part of the subject,
- but subpattern n has not been used at all, it returns an empty string.
- This can be distinguished from a genuine zero-length substring by
+ when capturing subpattern number n+1 matches some part of the subject,
+ but subpattern n has not been used at all, it returns an empty string.
+ This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in the ovector, which contains
- PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
+ PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
string_length_bynumber().
@@ -3062,39 +3085,39 @@
void pcre2_substring_free(PCRE2_UCHAR *buffer);
- To extract a substring by name, you first have to find the associated num-
+ To extract a substring by name, you first have to find the associated num-
ber. For example, for this pattern:
(a+)b(?<xxx>\d+)...
the number of the subpattern called "xxx" is 2. If the name is known to
- be unique (PCRE2_DUPNAMES was not set), you can find the number from
+ be unique (PCRE2_DUPNAMES was not set), you can find the number from
the name by calling pcre2_substring_number_from_name(). The first argu-
- ment is the compiled pattern, and the second is the name. The yield of
+ ment is the compiled pattern, and the second is the name. The yield of
the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
- is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
- there is more than one subpattern of that name. Given the number, you
- can extract the substring directly from the ovector, or use one of the
+ is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
+ there is more than one subpattern of that name. Given the number, you
+ can extract the substring directly from the ovector, or use one of the
"bynumber" functions described above.
- For convenience, there are also "byname" functions that correspond to
- the "bynumber" functions, the only difference being that the second
- argument is a name instead of a number. If PCRE2_DUPNAMES is set and
+ For convenience, there are also "byname" functions that correspond to
+ the "bynumber" functions, the only difference being that the second
+ argument is a name instead of a number. If PCRE2_DUPNAMES is set and
there are duplicate names, these functions scan all the groups with the
given name, and return the first named string that is set.
- If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
- returned. If all groups with the name have numbers that are greater
- than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
- returned. If there is at least one group with a slot in the ovector,
+ If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
+ returned. If all groups with the name have numbers that are greater
+ than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
+ returned. If there is at least one group with a slot in the ovector,
but no group is found to be set, PCRE2_ERROR_UNSET is returned.
Warning: If the pattern uses the (?| feature to set up multiple subpat-
- terns with the same number, as described in the section on duplicate
- subpattern numbers in the pcre2pattern page, you cannot use names to
- distinguish the different subpatterns, because names are not included
- in the compiled code. The matching process uses only numbers. For this
- reason, the use of different names for subpatterns of the same number
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcre2pattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
causes an error at compile time.
@@ -3107,54 +3130,54 @@
PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
PCRE2_SIZE *outlengthptr);
- This function calls pcre2_match() and then makes a copy of the subject
- string in outputbuffer, replacing one or more parts that were matched
+ This function calls pcre2_match() and then makes a copy of the subject
+ string in outputbuffer, replacing one or more parts that were matched
with the replacement string, whose length is supplied in rlength. This
- can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
- The default is to perform just one replacement, but there is an option
- that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below
+ can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
+ The default is to perform just one replacement, but there is an option
+ that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below
for details).
- Matches in which a \K item in a lookahead in the pattern causes the
- match to end before it starts are not supported, and give rise to an
+ Matches in which a \K item in a lookahead in the pattern causes the
+ match to end before it starts are not supported, and give rise to an
error return. For global replacements, matches in which \K in a lookbe-
- hind causes the match to start earlier than the point that was reached
+ hind causes the match to start earlier than the point that was reached
in the previous iteration are also not supported.
- The first seven arguments of pcre2_substitute() are the same as for
+ The first seven arguments of pcre2_substitute() are the same as for
pcre2_match(), except that the partial matching options are not permit-
- ted, and match_data may be passed as NULL, in which case a match data
- block is obtained and freed within this function, using memory manage-
- ment functions from the match context, if provided, or else those that
+ ted, and match_data may be passed as NULL, in which case a match data
+ block is obtained and freed within this function, using memory manage-
+ ment functions from the match context, if provided, or else those that
were used to allocate memory for the compiled code.
- If an external match_data block is provided, its contents afterwards
- are those set by the final call to pcre2_match(). For global changes,
- this will have ended in a matching error. The contents of the ovector
+ If an external match_data block is provided, its contents afterwards
+ are those set by the final call to pcre2_match(). For global changes,
+ this will have ended in a matching error. The contents of the ovector
within the match data block may or may not have been changed.
- The outlengthptr argument must point to a variable that contains the
- length, in code units, of the output buffer. If the function is suc-
- cessful, the value is updated to contain the length of the new string,
+ The outlengthptr argument must point to a variable that contains the
+ length, in code units, of the output buffer. If the function is suc-
+ cessful, the value is updated to contain the length of the new string,
excluding the trailing zero that is automatically added.
- If the function is not successful, the value set via outlengthptr
- depends on the type of error. For syntax errors in the replacement
- string, the value is the offset in the replacement string where the
- error was detected. For other errors, the value is PCRE2_UNSET by
- default. This includes the case of the output buffer being too small,
- unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
- case the value is the minimum length needed, including space for the
- trailing zero. Note that in order to compute the required length,
- pcre2_substitute() has to simulate all the matching and copying,
+ If the function is not successful, the value set via outlengthptr
+ depends on the type of error. For syntax errors in the replacement
+ string, the value is the offset in the replacement string where the
+ error was detected. For other errors, the value is PCRE2_UNSET by
+ default. This includes the case of the output buffer being too small,
+ unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
+ case the value is the minimum length needed, including space for the
+ trailing zero. Note that in order to compute the required length,
+ pcre2_substitute() has to simulate all the matching and copying,
instead of giving an error return as soon as the buffer overflows. Note
also that the length is in code units, not bytes.
- In the replacement string, which is interpreted as a UTF string in UTF
- mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
+ In the replacement string, which is interpreted as a UTF string in UTF
+ mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
option is set, a dollar character is an escape character that can spec-
- ify the insertion of characters from capturing groups or names from
- (*MARK) or other control verbs in the pattern. The following forms are
+ ify the insertion of characters from capturing groups or names from
+ (*MARK) or other control verbs in the pattern. The following forms are
always recognized:
$$ insert a dollar character
@@ -3161,19 +3184,19 @@
$<n> or ${<n>} insert the contents of group <n>
$*MARK or ${*MARK} insert a control verb name
- Either a group number or a group name can be given for <n>. Curly
- brackets are required only if the following character would be inter-
+ Either a group number or a group name can be given for <n>. Curly
+ brackets are required only if the following character would be inter-
preted as part of the number or name. The number may be zero to include
- the entire matched string. For example, if the pattern a(b)c is
- matched with "=abc=" and the replacement string "+$1$0$1+", the result
+ the entire matched string. For example, if the pattern a(b)c is
+ matched with "=abc=" and the replacement string "+$1$0$1+", the result
is "=+babcb+=".
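The simple insertion forms can be modelled without the library. The following is a minimal sketch, not PCRE2's implementation: the function name `expand_simple_replacement` and its error convention are assumptions made for illustration, and only `$$`, `$<digits>` and `${<digits>}` are handled (group names and `$*MARK` are left out).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. groups[0] is the whole match,
   groups[1..ngroups-1] are the captured substrings.
   Returns the output length, or -1 on a syntax error, an unknown group, or
   an output buffer that is too small. */
long expand_simple_replacement(const char *rep, const char *const *groups,
                               size_t ngroups, char *out, size_t outsize)
{
    size_t n = 0;

    assert(rep != NULL && groups != NULL && out != NULL);

    while (*rep != '\0') {
        const char *piece;
        size_t piece_len;

        if (*rep != '$') {                  /* ordinary character */
            piece = rep++;
            piece_len = 1;
        } else if (rep[1] == '$') {         /* $$ inserts one dollar */
            piece = rep;
            piece_len = 1;
            rep += 2;
        } else {                            /* $<n> or ${<n>} */
            int braced = (rep[1] == '{');
            const char *p = rep + (braced ? 2 : 1);
            size_t num = 0;

            if (!isdigit((unsigned char)*p)) return -1;
            while (isdigit((unsigned char)*p)) {
                num = num * 10 + (size_t)(*p++ - '0');
                if (num >= ngroups) return -1;      /* unknown group */
            }
            if (braced && *p++ != '}') return -1;   /* missing brace */

            piece = groups[num];
            piece_len = strlen(piece);
            rep = p;
        }

        if (n + piece_len >= outsize) return -1;    /* keep room for '\0' */
        memcpy(out + n, piece, piece_len);
        n += piece_len;
    }
    if (n >= outsize) return -1;
    out[n] = '\0';
    return (long)n;
}
```

With groups `{"abc", "b"}` (the match of a(b)c against "=abc=") and the replacement "+$1$0$1+", the sketch produces "+babcb+", which is the text that replaces "abc" in the result quoted above.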
$*MARK inserts the name from the last encountered (*ACCEPT), (*COMMIT),
- (*MARK), (*PRUNE), or (*THEN) on the matching path that has a name.
- (*MARK) must always include a name, but the other verbs need not. For
+ (*MARK), (*PRUNE), or (*THEN) on the matching path that has a name.
+ (*MARK) must always include a name, but the other verbs need not. For
example, in the case of (*MARK:A)(*PRUNE) the name inserted is "A", but
- for (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
- used to perform simple simultaneous substitutions, as this pcre2test
+ for (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
+ used to perform simple simultaneous substitutions, as this pcre2test
example shows:
/(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
@@ -3180,19 +3203,19 @@
apple lemon
2: pear orange
- As well as the usual options for pcre2_match(), a number of additional
+ As well as the usual options for pcre2_match(), a number of additional
options can be set in the options argument of pcre2_substitute().
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
- string, replacing every matching substring. If this option is not set,
- only the first matching substring is replaced. The search for matches
- takes place in the original subject string (that is, previous replace-
- ments do not affect it). Iteration is implemented by advancing the
- startoffset value for each search, which is always passed the entire
+ string, replacing every matching substring. If this option is not set,
+ only the first matching substring is replaced. The search for matches
+ takes place in the original subject string (that is, previous replace-
+ ments do not affect it). Iteration is implemented by advancing the
+ startoffset value for each search, which is always passed the entire
subject string. If an offset limit is set in the match context, search-
ing stops when that limit is reached.
- You can restrict the effect of a global substitution to a portion of
+ You can restrict the effect of a global substitution to a portion of
the subject string by setting either or both of startoffset and an off-
set limit. Here is a pcre2test example:
@@ -3200,87 +3223,87 @@
ABC ABC ABC ABC\=offset=3,offset_limit=12
2: ABC A!C A!C ABC
- When continuing with global substitutions after matching a substring
+ When continuing with global substitutions after matching a substring
with zero length, an attempt to find a non-empty match at the same off-
set is performed. If this is not successful, the offset is advanced by
one character except when CRLF is a valid newline sequence and the next
- two characters are CR, LF. In this case, the offset is advanced by two
+ two characters are CR, LF. In this case, the offset is advanced by two
characters.
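The advance rule after an empty match can be written down in a few lines. This is a sketch under stated assumptions, not code taken from PCRE2: the function name is hypothetical, and it assumes 8-bit, non-UTF data so that one character is one code unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function: where to resume searching after
   an empty match at 'offset' when no non-empty match was found there.
   'crlf_is_newline' is non-zero when CRLF is a valid newline sequence. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    assert(subject != NULL && offset < length);

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;      /* step over CR LF as a unit */
    return offset + 1;          /* otherwise advance by one character */
}
```

In UTF mode a character may span several code units, so a real implementation would have to step over the whole character rather than add 1.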
- PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
+ PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
- ORY immediately. If this option is set, however, pcre2_substitute()
+ ORY immediately. If this option is set, however, pcre2_substitute()
continues to go through the motions of matching and substituting (with-
- out, of course, writing anything) in order to compute the size of buf-
- fer that is needed. This value is passed back via the outlengthptr
- variable, with the result of the function still being
+ out, of course, writing anything) in order to compute the size of buf-
+ fer that is needed. This value is passed back via the outlengthptr
+ variable, with the result of the function still being
PCRE2_ERROR_NOMEMORY.
- Passing a buffer size of zero is a permitted way of finding out how
- much memory is needed for given substitution. However, this does mean
+ Passing a buffer size of zero is a permitted way of finding out how
+ much memory is needed for a given substitution. However, this does mean
that the entire operation is carried out twice. Depending on the appli-
- cation, it may be more efficient to allocate a large buffer and free
- the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
+ cation, it may be more efficient to allocate a large buffer and free
+ the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
FLOW_LENGTH.
- PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
- that do not appear in the pattern to be treated as unset groups. This
- option should be used with care, because it means that a typo in a
- group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
+ PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
+ that do not appear in the pattern to be treated as unset groups. This
+ option should be used with care, because it means that a typo in a
+ group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
error.
- PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
+ PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
- treated as empty strings when inserted as described above. If this
- option is not set, an attempt to insert an unset group causes the
- PCRE2_ERROR_UNSET error. This option does not influence the extended
+ treated as empty strings when inserted as described above. If this
+ option is not set, an attempt to insert an unset group causes the
+ PCRE2_ERROR_UNSET error. This option does not influence the extended
substitution syntax described below.
- PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
- replacement string. Without this option, only the dollar character is
- special, and only the group insertion forms listed above are valid.
+ PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
+ replacement string. Without this option, only the dollar character is
+ special, and only the group insertion forms listed above are valid.
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
- Firstly, backslash in a replacement string is interpreted as an escape
+ Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \n or \x{ddd} can be used to specify
- particular character codes, and backslash followed by any non-alphanu-
- meric character quotes that character. Extended quoting can be coded
+ particular character codes, and backslash followed by any non-alphanu-
+ meric character quotes that character. Extended quoting can be coded
using \Q...\E, exactly as in pattern strings.
- There are also four escape sequences for forcing the case of inserted
- letters. The insertion mechanism has three states: no case forcing,
+ There are also four escape sequences for forcing the case of inserted
+ letters. The insertion mechanism has three states: no case forcing,
force upper case, and force lower case. The escape sequences change the
current state: \U and \L change to upper or lower case forcing, respec-
- tively, and \E (when not terminating a \Q quoted sequence) reverts to
- no case forcing. The sequences \u and \l force the next character (if
- it is a letter) to upper or lower case, respectively, and then the
+ tively, and \E (when not terminating a \Q quoted sequence) reverts to
+ no case forcing. The sequences \u and \l force the next character (if
+ it is a letter) to upper or lower case, respectively, and then the
state automatically reverts to no case forcing. Case forcing applies to
all inserted characters, including those from captured groups and let-
ters within \Q...\E quoted sequences.
Note that case forcing sequences such as \U...\E do not nest. For exam-
- ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
+ ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
\E has no effect.
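The three-state behaviour described above can be illustrated with a small state machine. This is a simplified model, not PCRE2's code: the names are hypothetical, only ASCII literal text is handled, and \Q...\E quoting and group insertion are deliberately omitted. Output that does not fit the buffer is silently dropped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum force_state { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Hypothetical helper, not a PCRE2 function. Applies \U \L \E \u \l to the
   literal text in 'in' and writes the zero-terminated result to 'out'.
   Returns the number of characters written, excluding the terminator. */
size_t model_case_forcing(const char *in, char *out, size_t outsize)
{
    enum force_state state = FORCE_NONE;
    int one_shot = 0;       /* set by \u or \l: revert after one character */
    size_t n = 0;

    assert(in != NULL && out != NULL && outsize > 0);

    for (; *in != '\0'; in++) {
        int c = (unsigned char)*in;

        if (c == '\\') {
            switch (in[1]) {
            case 'U': state = FORCE_UPPER; one_shot = 0; in++; continue;
            case 'L': state = FORCE_LOWER; one_shot = 0; in++; continue;
            case 'E': state = FORCE_NONE;  one_shot = 0; in++; continue;
            case 'u': state = FORCE_UPPER; one_shot = 1; in++; continue;
            case 'l': state = FORCE_LOWER; one_shot = 1; in++; continue;
            default:  break;    /* anything else is copied unchanged */
            }
        }

        if (state == FORCE_UPPER) c = toupper(c);
        else if (state == FORCE_LOWER) c = tolower(c);
        if (one_shot) { state = FORCE_NONE; one_shot = 0; }

        if (n + 1 < outsize) out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

Because there is a single state variable rather than a stack, \L inside \U...\E replaces the upper-case state instead of nesting, and the first \E already returns to no forcing. That is why the trailing \E in the example above does nothing.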
- The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
- flexibility to group substitution. The syntax is similar to that used
+ The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
+ flexibility to group substitution. The syntax is similar to that used
by Bash:
${<n>:-<string>}
${<n>:+<string1>:<string2>}
- As before, <n> may be a group number or a name. The first form speci-
- fies a default value. If group <n> is set, its value is inserted; if
- not, <string> is expanded and the result inserted. The second form
- specifies strings that are expanded and inserted when group <n> is set
- or unset, respectively. The first form is just a convenient shorthand
+ As before, <n> may be a group number or a name. The first form speci-
+ fies a default value. If group <n> is set, its value is inserted; if
+ not, <string> is expanded and the result inserted. The second form
+ specifies strings that are expanded and inserted when group <n> is set
+ or unset, respectively. The first form is just a convenient shorthand
for
${<n>:+${<n>}:<string>}
- Backslash can be used to escape colons and closing curly brackets in
- the replacement strings. A change of the case forcing state within a
- replacement string remains in force afterwards, as shown in this
+ Backslash can be used to escape colons and closing curly brackets in
+ the replacement strings. A change of the case forcing state within a
+ replacement string remains in force afterwards, as shown in this
pcre2test example:
/(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
@@ -3289,16 +3312,16 @@
somebody
1: HELLO
- The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
- substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
+ The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
+ substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
unknown groups in the extended syntax forms to be treated as unset.
- If successful, pcre2_substitute() returns the number of replacements
+ If successful, pcre2_substitute() returns the number of replacements
that were made. This may be zero if no matches were found, and is never
greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
In the event of an error, a negative error code is returned. Except for
- PCRE2_ERROR_NOMATCH (which is never returned), errors from
+ PCRE2_ERROR_NOMATCH (which is never returned), errors from
pcre2_match() are passed straight back.
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
@@ -3305,26 +3328,26 @@
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
- ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
+ ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
TUTE_UNSET_EMPTY is not set.
- PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
+ PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
- of buffer that is needed is returned via outlengthptr. Note that this
+ of buffer that is needed is returned via outlengthptr. Note that this
does not happen by default.
- PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
+ PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
the replacement string, with more particular errors being
- PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
- MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
+ PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
+ MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
TUTION (syntax error in extended group substitution), and
- PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started
- or the match started earlier than the current position in the subject,
+ PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started
+ or the match started earlier than the current position in the subject,
which can happen if \K is used in an assertion).
As for all PCRE2 errors, a text message that describes the error can be
- obtained by calling the pcre2_get_error_message() function (see
+ obtained by calling the pcre2_get_error_message() function (see
"Obtaining a textual error message" above).
Substitution callouts
@@ -3333,15 +3356,15 @@
void (*callout_function)(pcre2_substitute_callout_block *, void *),
void *callout_data);
- The pcre2_set_substitution_callout() function can be used to specify a
- callout function for pcre2_substitute(). This information is passed in
- a match context. The callout function is called after each substitu-
- tion. It is not called for simulated substitutions that happen as a
- result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. A callout func-
+ The pcre2_set_substitution_callout() function can be used to specify a
+ callout function for pcre2_substitute(). This information is passed in
+ a match context. The callout function is called after each substitu-
+ tion. It is not called for simulated substitutions that happen as a
+ result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. A callout func-
tion should not return any value.
The first argument of the callout function is a pointer to a substitute
- callout block structure, which contains the following fields, not nec-
+ callout block structure, which contains the following fields, not nec-
essarily in this order:
uint32_t version;
@@ -3348,16 +3371,16 @@
PCRE2_SIZE input_offsets[2];
PCRE2_SIZE output_offsets[2];
- The version field contains the version number of the block format. The
- current version is 0. The version number will increase in future if
- more fields are added, but the intention is never to remove any of the
+ The version field contains the version number of the block format. The
+ current version is 0. The version number will increase in future if
+ more fields are added, but the intention is never to remove any of the
existing fields.
- The input_offsets vector contains the code unit offsets in the input
+ The input_offsets vector contains the code unit offsets in the input
string of the matched substring, and the output_offsets vector contains
the offsets of the replacement in the output string.
- The second argument of the callout function is the value passed as
+ The second argument of the callout function is the value passed as
callout_data when the function was registered.
@@ -3366,56 +3389,56 @@
int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
- When a pattern is compiled with the PCRE2_DUPNAMES option, names for
- subpatterns are not required to be unique. Duplicate names are always
- allowed for subpatterns with the same number, created by using the (?|
- feature. Indeed, if such subpatterns are named, they are required to
+ When a pattern is compiled with the PCRE2_DUPNAMES option, names for
+ subpatterns are not required to be unique. Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
use the same names.
Normally, patterns with duplicate names are such that in any one match,
- only one of the named subpatterns participates. An example is shown in
+ only one of the named subpatterns participates. An example is shown in
the pcre2pattern documentation.
- When duplicates are present, pcre2_substring_copy_byname() and
- pcre2_substring_get_byname() return the first substring corresponding
- to the given name that is set. Only if none are set is
- PCRE2_ERROR_UNSET is returned. The pcre2_substring_number_from_name()
+ When duplicates are present, pcre2_substring_copy_byname() and
+ pcre2_substring_get_byname() return the first substring corresponding
+ to the given name that is set. Only if none are set is
+ PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
duplicate names.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre2_substring_nametable_scan() function. The
- first argument is the compiled pattern, and the second is the name. If
- the third and fourth arguments are NULL, the function returns a group
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre2_substring_nametable_scan() function. The
+ first argument is the compiled pattern, and the second is the name. If
+ the third and fourth arguments are NULL, the function returns a group
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
When the third and fourth arguments are not NULL, they must be pointers
- to variables that are updated by the function. After it has run, they
+ to variables that are updated by the function. After it has run, they
point to the first and last entries in the name-to-number table for the
- given name, and the function returns the length of each entry in code
- units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
+ given name, and the function returns the length of each entry in code
+ units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
no entries for the given name.
The format of the name table is described above in the section entitled
- Information about a pattern. Given all the relevant entries for the
- name, you can extract each of their numbers, and hence the captured
+ Information about a pattern. Given all the relevant entries for the
+ name, you can extract each of their numbers, and hence the captured
data.
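Extracting the numbers from the entries can be sketched as follows. The helper name is hypothetical, and the code assumes the 8-bit library, where each entry starts with the group number in two bytes, most significant byte first, followed by the zero-terminated name. The `first` and `last` pointers and the entry size are the values obtained from pcre2_substring_nametable_scan().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Walks the name table entries
   from 'first' to 'last' inclusive, 'entry_size' code units apart, and
   stores up to 'max_numbers' group numbers. Returns how many were stored. */
size_t collect_group_numbers(const unsigned char *first,
                             const unsigned char *last, size_t entry_size,
                             unsigned int *numbers, size_t max_numbers)
{
    size_t entries, i, count = 0;

    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size >= 3 && first <= last);

    entries = (size_t)(last - first) / entry_size + 1;
    for (i = 0; i < entries && count < max_numbers; i++) {
        const unsigned char *entry = first + i * entry_size;
        /* 8-bit layout: big-endian 16-bit group number, then the name */
        numbers[count++] = ((unsigned int)entry[0] << 8) | entry[1];
    }
    return count;
}
```

Each number obtained this way can then be passed to a by-number function such as pcre2_substring_get_bynumber() to fetch the captured data. The 16-bit and 32-bit libraries store the number in a single code unit, so this byte arithmetic does not apply to them.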
FINDING ALL POSSIBLE MATCHES AT ONE POSITION
- The traditional matching function uses a similar algorithm to Perl,
- which stops when it finds the first match at a given point in the sub-
+ The traditional matching function uses a similar algorithm to Perl,
+ which stops when it finds the first match at a given point in the sub-
ject. If you want to find all possible matches, or the longest possible
- match at a given position, consider using the alternative matching
- function (see below) instead. If you cannot use the alternative func-
+ match at a given position, consider using the alternative matching
+ function (see below) instead. If you cannot use the alternative func-
tion, you can kludge it up by making use of the callout facility, which
is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre2_match() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre2_match() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
@@ -3427,26 +3450,26 @@
pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount);
- The function pcre2_dfa_match() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
+ The function pcre2_dfa_match() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
subject string just once (not counting lookaround assertions), and does
- not backtrack. This has different characteristics to the normal algo-
- rithm, and is not compatible with Perl. Some of the features of PCRE2
- patterns are not supported. Nevertheless, there are times when this
- kind of matching can be useful. For a discussion of the two matching
+ not backtrack. This has different characteristics to the normal algo-
+ rithm, and is not compatible with Perl. Some of the features of PCRE2
+ patterns are not supported. Nevertheless, there are times when this
+ kind of matching can be useful. For a discussion of the two matching
algorithms, and a list of features that pcre2_dfa_match() does not sup-
port, see the pcre2matching documentation.
- The arguments for the pcre2_dfa_match() function are the same as for
+ The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other com-
- mon arguments are used in the same way as for pcre2_match(), so their
+ mon arguments are used in the same way as for pcre2_match(), so their
description is not repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace is needed for patterns and subjects where there are a lot of
+ workspace is needed for patterns and subjects where there are a lot of
potential matches.
Here is an example of a simple call to pcre2_dfa_match():
@@ -3466,13 +3489,14 @@
Option bits for pcre2_dfa_match()
- The unused bits of the options argument for pcre2_dfa_match() must be
- zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
- CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
- PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
- PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
- the last four of these are exactly the same as for pcre2_match(), so
- their description is not repeated here.
+ The unused bits of the options argument for pcre2_dfa_match() must be
+ zero. The only bits that may be set are PCRE2_ANCHORED,
+ PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,
+ PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+ PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
+ PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of
+ these are exactly the same as for pcre2_match(), so their description
+ is not repeated here.
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -3607,7 +3631,7 @@
REVISION
- Last updated: 21 September 2018
+ Last updated: 16 October 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
@@ -4924,15 +4948,16 @@
UNSUPPORTED OPTIONS AND PATTERN ITEMS
The pcre2_match() options that are supported for JIT matching are
- PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
- PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
- PCRE2_ANCHORED option is not supported at match time.
+ PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+ PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
+ PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options
+ are not supported at match time.
- If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
+ If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
use of JIT, forcing matching by the interpreter code.
- The only unsupported pattern items are \C (match a single data unit)
- when running in a UTF mode, and a callout immediately before an asser-
+ The only unsupported pattern items are \C (match a single data unit)
+ when running in a UTF mode, and a callout immediately before an asser-
tion condition in a conditional group.
@@ -4939,14 +4964,14 @@
RETURN VALUES FROM JIT MATCHING
When a pattern is matched using JIT matching, the return values are the
- same as those given by the interpretive pcre2_match() code, with the
- addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
- that the memory used for the JIT stack was insufficient. See "Control-
+ same as those given by the interpretive pcre2_match() code, with the
+ addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
+ that the memory used for the JIT stack was insufficient. See "Control-
ling the JIT stack" below for a discussion of JIT stack usage.
- The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
- searching a very large pattern tree goes on for too long, as it is in
- the same circumstance when JIT is not used, but the details of exactly
+ The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
+ searching a very large pattern tree goes on for too long, as it is in
+ the same circumstance when JIT is not used, but the details of exactly
what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
is never returned when JIT matching is used.
@@ -4954,25 +4979,25 @@
CONTROLLING THE JIT STACK
When the compiled JIT code runs, it needs a block of memory to use as a
- stack. By default, it uses 32KiB on the machine stack. However, some
- large or complicated patterns need more than this. The error
- PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
- Three functions are provided for managing blocks of memory for use as
- JIT stacks. There is further discussion about the use of JIT stacks in
+ stack. By default, it uses 32KiB on the machine stack. However, some
+ large or complicated patterns need more than this. The error
+ PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
+ Three functions are provided for managing blocks of memory for use as
+ JIT stacks. There is further discussion about the use of JIT stacks in
the section entitled "JIT stack FAQ" below.
- The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
- ments are a starting size, a maximum size, and a general context (for
- memory allocation functions, or NULL for standard memory allocation).
+ The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
+ ments are a starting size, a maximum size, and a general context (for
+ memory allocation functions, or NULL for standard memory allocation).
It returns a pointer to an opaque structure of type pcre2_jit_stack, or
- NULL if there is an error. The pcre2_jit_stack_free() function is used
+ NULL if there is an error. The pcre2_jit_stack_free() function is used
to free a stack that is no longer needed. If its argument is NULL, this
- function returns immediately, without doing anything. (For the techni-
- cally minded: the address space is allocated by mmap or VirtualAlloc.)
- A maximum stack size of 512KiB to 1MiB should be more than enough for
+ function returns immediately, without doing anything. (For the techni-
+ cally minded: the address space is allocated by mmap or VirtualAlloc.)
+ A maximum stack size of 512KiB to 1MiB should be more than enough for
any pattern.
- The pcre2_jit_stack_assign() function specifies which stack JIT code
+ The pcre2_jit_stack_assign() function specifies which stack JIT code
should use. Its arguments are as follows:
pcre2_match_context *mcontext
@@ -4982,7 +5007,7 @@
The first argument is a pointer to a match context. When this is subse-
quently passed to a matching function, its information determines which
JIT stack is used. If this argument is NULL, the function returns imme-
- diately, without doing anything. There are three cases for the values
+ diately, without doing anything. There are three cases for the values
of the other two options:
(1) If callback is NULL and data is NULL, an internal 32KiB block
@@ -5000,34 +5025,34 @@
return value must be a valid JIT stack, the result of calling
pcre2_jit_stack_create().
- A callback function is obeyed whenever JIT code is about to be run; it
+ A callback function is obeyed whenever JIT code is about to be run; it
is not obeyed when pcre2_match() is called with options that are incom-
- patible for JIT matching. A callback function can therefore be used to
- determine whether a match operation was executed by JIT or by the
+ patible for JIT matching. A callback function can therefore be used to
+ determine whether a match operation was executed by JIT or by the
interpreter.
You may safely use the same JIT stack for more than one pattern (either
- by assigning directly or by callback), as long as the patterns are
+ by assigning directly or by callback), as long as the patterns are
matched sequentially in the same thread. Currently, the only way to set
- up non-sequential matches in one thread is to use callouts: if a call-
- out function starts another match, that match must use a different JIT
+ up non-sequential matches in one thread is to use callouts: if a call-
+ out function starts another match, that match must use a different JIT
stack to the one used for currently suspended match(es).
- In a multithread application, if you do not specify a JIT stack, or if
- you assign or pass back NULL from a callback, that is thread-safe,
- because each thread has its own machine stack. However, if you assign
- or pass back a non-NULL JIT stack, this must be a different stack for
+ In a multithread application, if you do not specify a JIT stack, or if
+ you assign or pass back NULL from a callback, that is thread-safe,
+ because each thread has its own machine stack. However, if you assign
+ or pass back a non-NULL JIT stack, this must be a different stack for
each thread so that the application is thread-safe.
- Strictly speaking, even more is allowed. You can assign the same non-
- NULL stack to a match context that is used by any number of patterns,
- as long as they are not used for matching by multiple threads at the
- same time. For example, you could use the same stack in all compiled
- patterns, with a global mutex in the callback to wait until the stack
+ Strictly speaking, even more is allowed. You can assign the same non-
+ NULL stack to a match context that is used by any number of patterns,
+ as long as they are not used for matching by multiple threads at the
+ same time. For example, you could use the same stack in all compiled
+ patterns, with a global mutex in the callback to wait until the stack
is available for use. However, this is an inefficient solution, and not
recommended.
- This is a suggestion for how a multithreaded program that needs to set
+ This is a suggestion for how a multithreaded program that needs to set
up non-default JIT stacks might operate:
During thread initialization
@@ -5039,7 +5064,7 @@
Use a one-line callback function
return thread_local_var
- All the functions described in this section do nothing if JIT is not
+ All the functions described in this section do nothing if JIT is not
available.
@@ -5048,20 +5073,20 @@
(1) Why do we need JIT stacks?
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
- where the local data of the current node is pushed before checking its
+ where the local data of the current node is pushed before checking its
child nodes. Allocating real machine stack on some platforms is diffi-
cult. For example, the stack chain needs to be updated every time if we
- extend the stack on PowerPC. Although it is possible, its updating
+ extend the stack on PowerPC. Although it is possible, its updating
time overhead decreases performance. So we do the recursion in memory.
(2) Why don't we simply allocate blocks of memory with malloc()?
- Modern operating systems have a nice feature: they can reserve an
+ Modern operating systems have a nice feature: they can reserve an
address space instead of allocating memory. We can safely allocate mem-
- ory pages inside this address space, so the stack could grow without
+ ory pages inside this address space, so the stack could grow without
moving memory data (this is important because of pointers). Thus we can
allocate 1MiB address space, and use only a single memory page (usually
- 4KiB) if that is enough. However, we can still grow up to 1MiB anytime
+ 4KiB) if that is enough. However, we can still grow up to 1MiB anytime
if needed.
(3) Who "owns" a JIT stack?
@@ -5069,8 +5094,8 @@
The owner of the stack is the user program, not the JIT studied pattern
or anything else. The user program must ensure that if a stack is being
used by pcre2_match(), (that is, it is assigned to a match context that
- is passed to the pattern currently running), that stack must not be
- used by any other threads (to avoid overwriting the same memory area).
+ is passed to the pattern currently running), that stack must not be
+ used by any other threads (to avoid overwriting the same memory area).
The best practice for multithreaded programs is to allocate a stack for
each thread, and return this stack through the JIT callback function.
@@ -5078,36 +5103,36 @@
You can free a JIT stack at any time, as long as it will not be used by
pcre2_match() again. When you assign the stack to a match context, only
- a pointer is set. There is no reference counting or any other magic.
+ a pointer is set. There is no reference counting or any other magic.
You can free compiled patterns, contexts, and stacks in any order, any-
- time. Just do not call pcre2_match() with a match context pointing to
+ time. Just do not call pcre2_match() with a match context pointing to
an already freed stack, as that will cause SEGFAULT. (Also, do not free
- a stack currently used by pcre2_match() in another thread). You can
- also replace the stack in a context at any time when it is not in use.
+ a stack currently used by pcre2_match() in another thread). You can
+ also replace the stack in a context at any time when it is not in use.
You should free the previous stack before assigning a replacement.
- (5) Should I allocate/free a stack every time before/after calling
+ (5) Should I allocate/free a stack every time before/after calling
pcre2_match()?
- No, because this is too costly in terms of resources. However, you
- could implement some clever idea which releases the stack if it is not
- used in let's say two minutes. The JIT callback can help to achieve
+ No, because this is too costly in terms of resources. However, you
+ could implement some clever idea which releases the stack if it is not
+ used in let's say two minutes. The JIT callback can help to achieve
this without keeping a list of patterns.
- (6) OK, the stack is for long term memory allocation. But what happens
- if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
+ (6) OK, the stack is for long term memory allocation. But what happens
+ if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
kept until the stack is freed?
- Especially on embedded systems, it might be a good idea to release mem-
- ory sometimes without freeing the stack. There is no API for this at
- the moment. Probably a function call which returns with the currently
- allocated memory for any stack and another which allows releasing mem-
+ Especially on embedded systems, it might be a good idea to release mem-
+ ory sometimes without freeing the stack. There is no API for this at
+ the moment. Probably a function call which returns with the currently
+ allocated memory for any stack and another which allows releasing mem-
ory (shrinking the stack) would be a good idea if someone needs this.
(7) This is too much of a headache. Isn't there any better solution for
JIT stack handling?
- No, thanks to Windows. If POSIX threads were used everywhere, we could
+ No, thanks to Windows. If POSIX threads were used everywhere, we could
throw out this complicated API.
@@ -5116,10 +5141,10 @@
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
The JIT executable allocator does not free all memory when it is possi-
- ble. It expects new allocations, and keeps some free memory around to
- improve allocation speed. However, in low memory conditions, it might
- be better to free all possible memory. You can cause this to happen by
- calling pcre2_jit_free_unused_memory(). Its argument is a general con-
+ ble. It expects new allocations, and keeps some free memory around to
+ improve allocation speed. However, in low memory conditions, it might
+ be better to free all possible memory. You can cause this to happen by
+ calling pcre2_jit_free_unused_memory(). Its argument is a general con-
text, for custom memory management, or NULL for standard memory manage-
ment.
@@ -5126,8 +5151,8 @@
EXAMPLE CODE
- This is a single-threaded example that specifies a JIT stack without
- using a callback. A real program should include error checking after
+ This is a single-threaded example that specifies a JIT stack without
+ using a callback. A real program should include error checking after
all the function calls.
int rc;
@@ -5155,29 +5180,31 @@
JIT FAST PATH API
Because the API described above falls back to interpreted matching when
- JIT is not available, it is convenient for programs that are written
+ JIT is not available, it is convenient for programs that are written
for general use in many environments. However, calling JIT via
pcre2_match() does have a performance impact. Programs that are written
- for use where JIT is known to be available, and which need the best
- possible performance, can instead use a "fast path" API to call JIT
- matching directly instead of calling pcre2_match() (obviously only for
+ for use where JIT is known to be available, and which need the best
+ possible performance, can instead use a "fast path" API to call JIT
+ matching directly instead of calling pcre2_match() (obviously only for
patterns that have been successfully processed by pcre2_jit_compile()).
- The fast path function is called pcre2_jit_match(), and it takes
- exactly the same arguments as pcre2_match(). The return values are also
- the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
- complete) is requested that was not compiled. Unsupported option bits
- (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
- option.
+ The fast path function is called pcre2_jit_match(), and it takes
+ exactly the same arguments as pcre2_match(). However, the subject
+ string must be specified with a length; PCRE2_ZERO_TERMINATED is not
+ supported. Unsupported option bits (for example, PCRE2_ANCHORED,
+ PCRE2_ENDANCHORED and PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is
+ the PCRE2_NO_JIT option. The return values are also the same as for
+ pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (par-
+ tial or complete) is requested that was not compiled.
- When you call pcre2_match(), as well as testing for invalid options, a
+ When you call pcre2_match(), as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For exam-
ple, if the subject pointer is NULL, an immediate error is given. Also,
- unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
- validity. In the interests of speed, these checks do not happen on the
+ unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
+ validity. In the interests of speed, these checks do not happen on the
JIT fast path, and if invalid data is passed, the result is undefined.
- Bypassing the sanity checks and the pcre2_match() wrapping can give
+ Bypassing the sanity checks and the pcre2_match() wrapping can give
speedups of more than 10%.
@@ -5195,7 +5222,7 @@
REVISION
- Last updated: 28 June 2018
+ Last updated: 16 October 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
Modified: code/trunk/doc/pcre2_dfa_match.3
===================================================================
--- code/trunk/doc/pcre2_dfa_match.3 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/pcre2_dfa_match.3 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1,4 +1,4 @@
-.TH PCRE2_DFA_MATCH 3 "26 April 2018" "PCRE2 10.32"
+.TH PCRE2_DFA_MATCH 3 "16 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -39,6 +39,8 @@
characters. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
+ PCRE2_COPY_MATCHED_SUBJECT
+ On success, make a private subject copy
PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
Modified: code/trunk/doc/pcre2_match.3
===================================================================
--- code/trunk/doc/pcre2_match.3 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/pcre2_match.3 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1,4 +1,4 @@
-.TH PCRE2_MATCH 3 "14 November 2017" "PCRE2 10.31"
+.TH PCRE2_MATCH 3 "16 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -43,11 +43,13 @@
Change the backtracking depth limit
Set custom memory management specifically for the match
.sp
-The \fIlength\fP and \fIstartoffset\fP values are code
-units, not characters. The length may be given as PCRE2_ZERO_TERMINATE for a
-subject that is terminated by a binary zero code unit. The options are:
+The \fIlength\fP and \fIstartoffset\fP values are code units, not characters.
+The length may be given as PCRE2_ZERO_TERMINATED for a subject that is
+terminated by a binary zero code unit. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
+ PCRE2_COPY_MATCHED_SUBJECT
+ On success, make a private subject copy
PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
Modified: code/trunk/doc/pcre2_match_data_free.3
===================================================================
--- code/trunk/doc/pcre2_match_data_free.3 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/pcre2_match_data_free.3 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1,4 +1,4 @@
-.TH PCRE2_MATCH_DATA_FREE 3 "28 June 2018" "PCRE2 10.32"
+.TH PCRE2_MATCH_DATA_FREE 3 "16 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -18,6 +18,10 @@
using the memory freeing function from the general context or compiled pattern
with which it was created, or \fBfree()\fP if that was not set.
.P
+If the PCRE2_COPY_MATCHED_SUBJECT option was used for a successful match using
+this match data block, the copy of the subject that was remembered with the
+block is also freed.
+.P
There is a complete description of the PCRE2 native API in the
.\" HREF
\fBpcre2api\fP
Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/pcre2api.3 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "21 September 2018" "PCRE2 10.33"
+.TH PCRE2API 3 "16 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@@ -1237,13 +1237,19 @@
NOTE: When one of the matching functions is called, pointers to the compiled
pattern and the subject string are set in the match data block so that they can
be referenced by the substring extraction functions. After running a match, you
-must not free a compiled pattern (or a subject string) until after all
+must not free a compiled pattern or a subject string until after all
operations on the
.\" HTML <a href="#matchdatablock">
.\" </a>
match data block
.\"
-have taken place.
+have taken place, unless, in the case of the subject string, you have used the
+PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
+"Option bits for \fBpcre2_match()\fP"
+.\" HTML <a href="#matchoptions">
+.\" </a>
+below.
+.\"
.P
The \fIoptions\fP argument for \fBpcre2_compile()\fP contains various bit
settings that affect the compilation. It should be zero if no options are
@@ -2390,7 +2396,13 @@
and the subject string are set in the match data block so that they can be
referenced by the extraction functions. After running a match, you must not
free a compiled pattern or a subject string until after all operations on the
-match data block (for that match) have taken place.
+match data block (for that match) have taken place, unless, in the case of the
+subject string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
+described in the section entitled "Option bits for \fBpcre2_match()\fP"
+.\" HTML <a href="#matchoptions">
+.\" </a>
+below.
+.\"
.P
When a match data block itself is no longer needed, it should be freed by
calling \fBpcre2_match_data_free()\fP. If this function is called with a NULL
@@ -2507,10 +2519,10 @@
.rs
.sp
The unused bits of the \fIoptions\fP argument for \fBpcre2_match()\fP must be
-zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
-Their action is described below.
+zero. The only bits that may be set are PCRE2_ANCHORED,
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
+PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,
+PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
.P
Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the
@@ -2525,6 +2537,22 @@
matching time. Note that setting the option at match time disables JIT
matching.
.sp
+ PCRE2_COPY_MATCHED_SUBJECT
+.sp
+By default, a pointer to the subject is remembered in the match data block so
+that, after a successful match, it can be referenced by the substring
+extraction functions. This means that the subject's memory must not be freed
+until all such operations are complete. For some applications where the
+lifetime of the subject string is not guaranteed, it may be necessary to make a
+copy of the subject string, but it is wasteful to do this unless the match is
+successful. After a successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the
+subject is copied and the new pointer is remembered in the match data block
+instead of the original subject pointer. The memory allocator that was used for
+the match block itself is used. The copy is automatically freed when
+\fBpcre2_match_data_free()\fP is called to free the match data block. It is also
+automatically freed if the match data block is re-used for another match
+operation.
+.sp
PCRE2_ENDANCHORED
.sp
If the PCRE2_ENDANCHORED option is set, any string that \fBpcre2_match()\fP
@@ -2961,7 +2989,8 @@
If a pattern contains many nested backtracking points, heap memory is used to
remember them. This error is given when the memory allocation function (default
or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given
-if the amount of memory needed exceeds the heap limit.
+if the amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
+also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
.sp
PCRE2_ERROR_NULL
.sp
@@ -3579,11 +3608,12 @@
.rs
.sp
The unused bits of the \fIoptions\fP argument for \fBpcre2_dfa_match()\fP must
-be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST,
-and PCRE2_DFA_RESTART. All but the last four of these are exactly the same as
-for \fBpcre2_match()\fP, so their description is not repeated here.
+be zero. The only bits that may be set are PCRE2_ANCHORED,
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
+PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
+PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last
+four of these are exactly the same as for \fBpcre2_match()\fP, so their
+description is not repeated here.
.sp
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -3737,6 +3767,6 @@
.rs
.sp
.nf
-Last updated: 21 September 2018
+Last updated: 16 October 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi
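The lifetime rules documented for PCRE2_COPY_MATCHED_SUBJECT above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: `toy_match_data`, `toy_match()`, `toy_match_data_free()` and `TOY_MD_COPIED_SUBJECT` are hypothetical names, matching is reduced to a `strstr()` search, and the standard `malloc()`/`free()` stand in for the match data block's own allocator. It does not call the PCRE2 library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MD_COPIED_SUBJECT 0x01u   /* hypothetical flag bit */

/* Hypothetical, heavily reduced match data block. */
typedef struct {
  const char *subject;   /* caller's subject, or a private copy */
  unsigned char flags;   /* TOY_MD_COPIED_SUBJECT when subject is owned */
} toy_match_data;

/* Free a private copy left behind by an earlier successful match. */
static void toy_release_copy(toy_match_data *md)
{
  if ((md->flags & TOY_MD_COPIED_SUBJECT) != 0)
    {
    free((void *)md->subject);
    md->subject = NULL;
    md->flags &= (unsigned char)~TOY_MD_COPIED_SUBJECT;
    }
}

/* Returns the match offset, -1 for no match, -2 if the copy cannot be
allocated. The subject is copied only after a successful match. */
static int toy_match(toy_match_data *md, const char *needle,
  const char *subject, int copy_subject)
{
  const char *hit;
  assert(md != NULL && needle != NULL && subject != NULL);

  toy_release_copy(md);          /* re-use frees the previous copy */
  md->subject = subject;         /* default: remember caller's pointer */

  hit = strstr(subject, needle);
  if (hit == NULL) return -1;    /* no match, so no copy is made */

  if (copy_subject)
    {
    size_t bytes = strlen(subject) + 1;
    char *copy = malloc(bytes);
    if (copy == NULL) return -2;
    memcpy(copy, subject, bytes);
    md->subject = copy;
    md->flags |= TOY_MD_COPIED_SUBJECT;
    }
  return (int)(hit - subject);
}

/* Freeing the block also frees any remembered copy. */
static void toy_match_data_free(toy_match_data *md)
{
  toy_release_copy(md);
}
```

In this model the caller may overwrite or free its own buffer after a successful match with the copy option set, because later substring extraction would read from the private copy held in the block.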
Modified: code/trunk/doc/pcre2jit.3
===================================================================
--- code/trunk/doc/pcre2jit.3 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/doc/pcre2jit.3 2018-10-17 08:33:38 UTC (rev 1028)
@@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "28 June 2018" "PCRE2 10.32"
+.TH PCRE2JIT 3 "16 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@@ -124,9 +124,10 @@
.rs
.sp
The \fBpcre2_match()\fP options that are supported for JIT matching are
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
-PCRE2_ANCHORED option is not supported at match time.
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
+PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options are not
+supported at match time.
.P
If the PCRE2_NO_JIT option is passed to \fBpcre2_match()\fP it disables the
use of JIT, forcing matching by the interpreter code.
@@ -376,10 +377,13 @@
processed by \fBpcre2_jit_compile()\fP).
.P
The fast path function is called \fBpcre2_jit_match()\fP, and it takes exactly
-the same arguments as \fBpcre2_match()\fP. The return values are also the same,
-plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
-requested that was not compiled. Unsupported option bits (for example,
-PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
+the same arguments as \fBpcre2_match()\fP. However, the subject string must be
+specified with a length; PCRE2_ZERO_TERMINATED is not supported. Unsupported
+option bits (for example, PCRE2_ANCHORED, PCRE2_ENDANCHORED and
+PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is the PCRE2_NO_JIT option. The
+return values are also the same as for \fBpcre2_match()\fP, plus
+PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
+that was not compiled.
.P
When you call \fBpcre2_match()\fP, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
@@ -412,6 +416,6 @@
.rs
.sp
.nf
-Last updated: 28 June 2018
+Last updated: 16 October 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi
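Because the fast path does not accept PCRE2_ZERO_TERMINATED, a caller that holds a zero-terminated string has to work out the length itself before calling pcre2_jit_match(). A minimal sketch of that step for 8-bit subjects, assuming a hypothetical helper `toy_explicit_length()` and a locally defined sentinel `TOY_ZERO_TERMINATED` that stands in for the library's constant:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the "zero-terminated" length sentinel (all bits set). */
#define TOY_ZERO_TERMINATED (~(size_t)0)

/* Turn a possibly-sentinel length into an explicit count of code units.
Only valid for 8-bit, zero-terminated subjects. */
static size_t toy_explicit_length(const char *subject, size_t length)
{
  assert(subject != NULL);
  return (length == TOY_ZERO_TERMINATED) ? strlen(subject) : length;
}
```

The returned value is what would then be passed as the length argument on the fast path.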
Modified: code/trunk/src/pcre2.h.in
===================================================================
--- code/trunk/src/pcre2.h.in 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2.h.in 2018-10-17 08:33:38 UTC (rev 1028)
@@ -167,37 +167,28 @@
#define PCRE2_JIT_PARTIAL_HARD 0x00000004u
#define PCRE2_JIT_INVALID_UTF 0x00000100u
-/* These are for pcre2_match(), pcre2_dfa_match(), and pcre2_jit_match(). Note
-that PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK can also be passed to these
-functions (though pcre2_jit_match() ignores the latter since it bypasses all
-sanity checks). */
+/* These are for pcre2_match(), pcre2_dfa_match(), pcre2_jit_match(), and
+pcre2_substitute(). Some are allowed only for one of the functions, and in
+these cases it is noted below. Note that PCRE2_ANCHORED, PCRE2_ENDANCHORED and
+PCRE2_NO_UTF_CHECK can also be passed to these functions (though
+pcre2_jit_match() ignores the latter since it bypasses all sanity checks). */
-#define PCRE2_NOTBOL 0x00000001u
-#define PCRE2_NOTEOL 0x00000002u
-#define PCRE2_NOTEMPTY 0x00000004u /* ) These two must be kept */
-#define PCRE2_NOTEMPTY_ATSTART 0x00000008u /* ) adjacent to each other. */
-#define PCRE2_PARTIAL_SOFT 0x00000010u
-#define PCRE2_PARTIAL_HARD 0x00000020u
+#define PCRE2_NOTBOL 0x00000001u
+#define PCRE2_NOTEOL 0x00000002u
+#define PCRE2_NOTEMPTY 0x00000004u /* ) These two must be kept */
+#define PCRE2_NOTEMPTY_ATSTART 0x00000008u /* ) adjacent to each other. */
+#define PCRE2_PARTIAL_SOFT 0x00000010u
+#define PCRE2_PARTIAL_HARD 0x00000020u
+#define PCRE2_DFA_RESTART 0x00000040u /* pcre2_dfa_match() only */
+#define PCRE2_DFA_SHORTEST 0x00000080u /* pcre2_dfa_match() only */
+#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u /* pcre2_substitute() only */
+#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u /* pcre2_substitute() only */
+#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u /* pcre2_substitute() only */
+#define PCRE2_SUBSTITUTE_UNKNOWN_UNSET 0x00000800u /* pcre2_substitute() only */
+#define PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 0x00001000u /* pcre2_substitute() only */
+#define PCRE2_NO_JIT 0x00002000u /* Not for pcre2_dfa_match() */
+#define PCRE2_COPY_MATCHED_SUBJECT 0x00004000u
-/* These are additional options for pcre2_dfa_match(). */
-
-#define PCRE2_DFA_RESTART 0x00000040u
-#define PCRE2_DFA_SHORTEST 0x00000080u
-
-/* These are additional options for pcre2_substitute(), which passes any others
-through to pcre2_match(). */
-
-#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u
-#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u
-#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u
-#define PCRE2_SUBSTITUTE_UNKNOWN_UNSET 0x00000800u
-#define PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 0x00001000u
-
-/* A further option for pcre2_match(), not allowed for pcre2_dfa_match(),
-ignored for pcre2_jit_match(). */
-
-#define PCRE2_NO_JIT 0x00002000u
-
/* Options for pcre2_pattern_convert(). */
#define PCRE2_CONVERT_UTF 0x00000001u
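Merging the match-time options into one list makes it easier to see that each option occupies its own bit. The sketch below copies the values from the hunk above and checks that property, together with the usual "no unknown bits" test that a matching function applies to its options argument. `TOY_MATCH_OPTIONS`, `options_are_valid()` and `bits_are_distinct()` are hypothetical names, and the mask is deliberately reduced to bits listed in this hunk (the real mask in pcre2_match.c also contains PCRE2_ANCHORED, PCRE2_ENDANCHORED and PCRE2_NO_UTF_CHECK).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the pcre2.h.in hunk above. */
#define PCRE2_NOTBOL               0x00000001u
#define PCRE2_NOTEOL               0x00000002u
#define PCRE2_NOTEMPTY             0x00000004u
#define PCRE2_NOTEMPTY_ATSTART     0x00000008u
#define PCRE2_PARTIAL_SOFT         0x00000010u
#define PCRE2_PARTIAL_HARD         0x00000020u
#define PCRE2_DFA_RESTART          0x00000040u
#define PCRE2_DFA_SHORTEST         0x00000080u
#define PCRE2_NO_JIT               0x00002000u
#define PCRE2_COPY_MATCHED_SUBJECT 0x00004000u

/* Hypothetical, reduced mask of bits accepted by a pcre2_match()-like call. */
#define TOY_MATCH_OPTIONS \
  (PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY|PCRE2_NOTEMPTY_ATSTART| \
   PCRE2_PARTIAL_SOFT|PCRE2_PARTIAL_HARD|PCRE2_NO_JIT| \
   PCRE2_COPY_MATCHED_SUBJECT)

/* Returns 1 when no bit outside the allowed mask is set. */
static int options_are_valid(uint32_t options, uint32_t allowed)
{
  return (options & ~allowed) == 0;
}

/* Returns 1 when every value has exactly one bit set and no bit repeats. */
static int bits_are_distinct(const uint32_t *bits, size_t n)
{
  uint32_t seen = 0;
  size_t i;
  assert(bits != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
    uint32_t b = bits[i];
    if (b == 0 || (b & (b - 1)) != 0) return 0;   /* not a single bit */
    if ((seen & b) != 0) return 0;                /* bit already used */
    seen |= b;
    }
  return 1;
}
```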
Modified: code/trunk/src/pcre2_dfa_match.c
===================================================================
--- code/trunk/src/pcre2_dfa_match.c 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2_dfa_match.c 2018-10-17 08:33:38 UTC (rev 1028)
@@ -85,7 +85,8 @@
#define PUBLIC_DFA_MATCH_OPTIONS \
(PCRE2_ANCHORED|PCRE2_ENDANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \
PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \
- PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART)
+ PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART| \
+ PCRE2_COPY_MATCHED_SUBJECT)
/*************************************************
@@ -3228,6 +3229,8 @@
pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount)
{
int rc;
+int was_zero_terminated = 0;
+
const pcre2_real_code *re = (const pcre2_real_code *)code;
PCRE2_SPTR start_match;
@@ -3267,7 +3270,11 @@
/* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated
subject string. */
-if (length == PCRE2_ZERO_TERMINATED) length = PRIV(strlen)(subject);
+if (length == PCRE2_ZERO_TERMINATED)
+ {
+ length = PRIV(strlen)(subject);
+ was_zero_terminated = 1;
+ }
/* Plausibility checks */
@@ -3520,10 +3527,21 @@
}
}
+/* If the match data block was previously used with PCRE2_COPY_MATCHED_SUBJECT,
+free the memory that was obtained. */
+
+if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
+ {
+ match_data->memctl.free((void *)match_data->subject,
+ match_data->memctl.memory_data);
+ match_data->flags &= ~PCRE2_MD_COPIED_SUBJECT;
+ }
+
/* Fill in fields that are always returned in the match data. */
match_data->code = re;
match_data->subject = subject;
+match_data->flags = 0;
match_data->mark = NULL;
match_data->matchedby = PCRE2_MATCHEDBY_DFA_INTERPRETER;
@@ -3818,6 +3836,17 @@
match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject);
match_data->startchar = (PCRE2_SIZE)(start_match - subject);
match_data->rc = rc;
+
+ if (rc >= 0 && (options & PCRE2_COPY_MATCHED_SUBJECT) != 0)
+ {
+ length = CU2BYTES(length + was_zero_terminated);
+ match_data->subject = match_data->memctl.malloc(length,
+ match_data->memctl.memory_data);
+ if (match_data->subject == NULL) return PCRE2_ERROR_NOMEMORY;
+ memcpy((void *)match_data->subject, subject, length);
+ match_data->flags |= PCRE2_MD_COPIED_SUBJECT;
+ }
+
goto EXIT;
}
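The size of the copy in the hunk above is CU2BYTES(length + was_zero_terminated): the length is counted in code units, and one extra unit is added only when the caller passed a zero-terminated subject, so that the terminator is copied too. A stand-alone sketch of that arithmetic for 16-bit code units follows. `toy_cu`, `TOY_CU2BYTES`, `toy_strlen()` and `toy_copy_subject()` are hypothetical stand-ins, and `malloc()` replaces the match data block's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint16_t toy_cu;                          /* hypothetical code unit */
#define TOY_CU2BYTES(n) ((n) * sizeof(toy_cu))    /* code units to bytes */

/* Length in code units of a zero-terminated subject. */
static size_t toy_strlen(const toy_cu *s)
{
  size_t n = 0;
  while (s[n] != 0) n++;
  return n;
}

/* Copy a subject of `length` code units, plus the terminator when
was_zero_terminated is 1. Stores the byte count in *bytes_out and returns
the copy, or NULL if allocation fails. */
static toy_cu *toy_copy_subject(const toy_cu *subject, size_t length,
  int was_zero_terminated, size_t *bytes_out)
{
  size_t bytes;
  toy_cu *copy;
  assert(subject != NULL && bytes_out != NULL);

  bytes = TOY_CU2BYTES(length + (size_t)was_zero_terminated);
  /* Request at least one byte so that an empty, unterminated subject is
  not confused with allocation failure. */
  copy = malloc(bytes != 0 ? bytes : 1);
  if (copy == NULL) return NULL;
  memcpy(copy, subject, bytes);
  *bytes_out = bytes;
  return copy;
}
```

With a three-unit zero-terminated subject this copies four code units (eight bytes), whereas an explicit length of two copies exactly two units and no terminator.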
Modified: code/trunk/src/pcre2_internal.h
===================================================================
--- code/trunk/src/pcre2_internal.h 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2_internal.h 2018-10-17 08:33:38 UTC (rev 1028)
@@ -534,7 +534,11 @@
enum { PCRE2_MATCHEDBY_INTERPRETER, /* pcre2_match() */
PCRE2_MATCHEDBY_DFA_INTERPRETER, /* pcre2_dfa_match() */
PCRE2_MATCHEDBY_JIT }; /* pcre2_jit_match() */
+
+/* Values for the flags field in a match data block. */
+#define PCRE2_MD_COPIED_SUBJECT 0x01u
+
/* Magic number to provide a small check against being handed junk. */
#define MAGIC_NUMBER 0x50435245UL /* 'PCRE' */
Modified: code/trunk/src/pcre2_intmodedep.h
===================================================================
--- code/trunk/src/pcre2_intmodedep.h 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2_intmodedep.h 2018-10-17 08:33:38 UTC (rev 1028)
@@ -658,7 +658,8 @@
PCRE2_SIZE leftchar; /* Offset to leftmost code unit */
PCRE2_SIZE rightchar; /* Offset to rightmost code unit */
PCRE2_SIZE startchar; /* Offset to starting code unit */
- uint16_t matchedby; /* Type of match (normal, JIT, DFA) */
+ uint8_t matchedby; /* Type of match (normal, JIT, DFA) */
+ uint8_t flags; /* Various flags */
uint16_t oveccount; /* Number of pairs */
int rc; /* The return code from the match */
PCRE2_SIZE ovector[131072]; /* Must be last in the structure */
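The new flags byte is obtained by splitting the former 16-bit matchedby field into two 8-bit fields. The sketch below models only the fields that surround the change, to show the flag test and clear idiom and to suggest why the split need not move the following fields. `toy_md_before`, `toy_md_after`, `toy_has_copy()` and `toy_clear_copy()` are hypothetical names, and the layout claim is an assumption that holds on common ABIs rather than something the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCRE2_MD_COPIED_SUBJECT 0x01u   /* value from pcre2_internal.h hunk */

/* Hypothetical trimmed copies of the affected fields. */
typedef struct {
  uint16_t matchedby;
  uint16_t oveccount;
  int rc;
} toy_md_before;

typedef struct {
  uint8_t matchedby;
  uint8_t flags;
  uint16_t oveccount;
  int rc;
} toy_md_after;

/* Non-zero when the block owns a private copy of the subject. */
static int toy_has_copy(const toy_md_after *md)
{
  assert(md != NULL);
  return (md->flags & PCRE2_MD_COPIED_SUBJECT) != 0;
}

/* Clear the ownership flag, leaving any other flag bits untouched. */
static void toy_clear_copy(toy_md_after *md)
{
  assert(md != NULL);
  md->flags &= (uint8_t)~PCRE2_MD_COPIED_SUBJECT;
}
```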
Modified: code/trunk/src/pcre2_jit_match.c
===================================================================
--- code/trunk/src/pcre2_jit_match.c 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2_jit_match.c 2018-10-17 08:33:38 UTC (rev 1028)
@@ -7,7 +7,7 @@
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
- New API code Copyright (c) 2016 University of Cambridge
+ New API code Copyright (c) 2016-2018 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -174,6 +174,7 @@
rc = 0;
match_data->code = re;
match_data->subject = subject;
+match_data->flags = 0;
match_data->rc = rc;
match_data->startchar = arguments.startchar_ptr - subject;
match_data->leftchar = 0;
Modified: code/trunk/src/pcre2_match.c
===================================================================
--- code/trunk/src/pcre2_match.c 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2_match.c 2018-10-17 08:33:38 UTC (rev 1028)
@@ -69,11 +69,12 @@
#define PUBLIC_MATCH_OPTIONS \
(PCRE2_ANCHORED|PCRE2_ENDANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \
PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \
- PCRE2_PARTIAL_SOFT|PCRE2_NO_JIT)
+ PCRE2_PARTIAL_SOFT|PCRE2_NO_JIT|PCRE2_COPY_MATCHED_SUBJECT)
#define PUBLIC_JIT_MATCH_OPTIONS \
(PCRE2_NO_UTF_CHECK|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY|\
- PCRE2_NOTEMPTY_ATSTART|PCRE2_PARTIAL_SOFT|PCRE2_PARTIAL_HARD)
+ PCRE2_NOTEMPTY_ATSTART|PCRE2_PARTIAL_SOFT|PCRE2_PARTIAL_HARD|\
+ PCRE2_COPY_MATCHED_SUBJECT)
/* Non-error returns from and within the match() function. Error returns are
externally defined PCRE2_ERROR_xxx codes, which are all negative. */
@@ -5014,7 +5015,7 @@
must record a backtracking point and also set up a chained frame. */
case OP_ONCE:
- case OP_SCRIPT_RUN:
+ case OP_SCRIPT_RUN:
case OP_SBRA:
Lframe_type = GF_NOCAPTURE | Fop;
@@ -5526,14 +5527,14 @@
case OP_ASSERT_NOT:
case OP_ASSERTBACK_NOT:
RRETURN(MATCH_MATCH);
-
- /* At the end of a script run, apply the script-checking rules. This code
- will never be exercised if Unicode support is not compiled, because in
+
+ /* At the end of a script run, apply the script-checking rules. This code
+ will never be exercised if Unicode support is not compiled, because in
that environment script runs cause an error at compile time. */
-
+
case OP_SCRIPT_RUN:
if (!PRIV(script_run)(P->eptr, Feptr, utf)) RRETURN(MATCH_NOMATCH);
- break;
+ break;
/* Whole-pattern recursion is coded as a recurse into group 0, so it
won't be picked up here. Instead, we catch it when the OP_END is reached.
@@ -6009,10 +6010,11 @@
pcre2_match_context *mcontext)
{
int rc;
+int was_zero_terminated = 0;
const uint8_t *start_bits = NULL;
-
const pcre2_real_code *re = (const pcre2_real_code *)code;
+
BOOL anchored;
BOOL firstline;
BOOL has_first_cu = FALSE;
@@ -6052,7 +6054,11 @@
/* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated
subject string. */
-if (length == PCRE2_ZERO_TERMINATED) length = PRIV(strlen)(subject);
+if (length == PCRE2_ZERO_TERMINATED)
+ {
+ length = PRIV(strlen)(subject);
+ was_zero_terminated = 1;
+ }
end_subject = subject + length;
/* Plausibility checks */
@@ -6166,7 +6172,17 @@
if (mcontext != NULL && mcontext->offset_limit != PCRE2_UNSET &&
(re->overall_options & PCRE2_USE_OFFSET_LIMIT) == 0)
return PCRE2_ERROR_BADOFFSETLIMIT;
+
+/* If the match data block was previously used with PCRE2_COPY_MATCHED_SUBJECT,
+free the memory that was obtained. */
+if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
+ {
+ match_data->memctl.free((void *)match_data->subject,
+ match_data->memctl.memory_data);
+ match_data->flags &= ~PCRE2_MD_COPIED_SUBJECT;
+ }
+
/* If the pattern was successfully studied with JIT support, run the JIT
executable instead of the rest of this function. Most options must be set at
compile time for the JIT code to be usable. Fallback to the normal code path if
@@ -6178,7 +6194,19 @@
{
rc = pcre2_jit_match(code, subject, length, start_offset, options,
match_data, mcontext);
- if (rc != PCRE2_ERROR_JIT_BADOPTION) return rc;
+ if (rc != PCRE2_ERROR_JIT_BADOPTION)
+ {
+ if (rc >= 0 && (options & PCRE2_COPY_MATCHED_SUBJECT) != 0)
+ {
+ length = CU2BYTES(length + was_zero_terminated);
+ match_data->subject = match_data->memctl.malloc(length,
+ match_data->memctl.memory_data);
+ if (match_data->subject == NULL) return PCRE2_ERROR_NOMEMORY;
+ memcpy((void *)match_data->subject, subject, length);
+ match_data->flags |= PCRE2_MD_COPIED_SUBJECT;
+ }
+ return rc;
+ }
}
#endif
@@ -6819,12 +6847,14 @@
match_data->code = re;
match_data->subject = subject;
+match_data->flags = 0;
match_data->mark = mb->mark;
match_data->matchedby = PCRE2_MATCHEDBY_INTERPRETER;
/* Handle a fully successful match. Set the return code to the number of
captured strings, or 0 if there were too many to fit into the ovector, and then
-set the remaining returned values before returning. */
+set the remaining returned values before returning. Make a copy of the subject
+string if requested. */
if (rc == MATCH_MATCH)
{
@@ -6834,6 +6864,17 @@
match_data->leftchar = mb->start_used_ptr - subject;
match_data->rightchar = ((mb->last_used_ptr > mb->end_match_ptr)?
mb->last_used_ptr : mb->end_match_ptr) - subject;
+
+ if ((options & PCRE2_COPY_MATCHED_SUBJECT) != 0)
+ {
+ length = CU2BYTES(length + was_zero_terminated);
+ match_data->subject = match_data->memctl.malloc(length,
+ match_data->memctl.memory_data);
+ if (match_data->subject == NULL) return PCRE2_ERROR_NOMEMORY;
+ memcpy((void *)match_data->subject, subject, length);
+ match_data->flags |= PCRE2_MD_COPIED_SUBJECT;
+ }
+
return match_data->rc;
}
Modified: code/trunk/src/pcre2_match_data.c
===================================================================
--- code/trunk/src/pcre2_match_data.c 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2_match_data.c 2018-10-17 08:33:38 UTC (rev 1028)
@@ -7,7 +7,7 @@
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
- New API code Copyright (c) 2016-2017 University of Cambridge
+ New API code Copyright (c) 2016-2018 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -63,6 +63,7 @@
(pcre2_memctl *)gcontext);
if (yield == NULL) return NULL;
yield->oveccount = oveccount;
+yield->flags = 0;
return yield;
}
@@ -93,7 +94,12 @@
pcre2_match_data_free(pcre2_match_data *match_data)
{
if (match_data != NULL)
+ {
+ if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
+ match_data->memctl.free((void *)match_data->subject,
+ match_data->memctl.memory_data);
match_data->memctl.free(match_data, match_data->memctl.memory_data);
+ }
}
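
Reviewer's note: the ownership rule that the pcre2_match.c and
pcre2_match_data.c hunks above introduce can be sketched in isolation. The
following is a minimal illustration, not PCRE2 code. The names toy_match_data,
toy_record, toy_release and MD_COPIED_SUBJECT are hypothetical stand-ins, plain
malloc()/free() replace the memctl callbacks, and 8-bit code units are assumed
so that CU2BYTES is the identity. Unlike the patch, the sketch stores the new
pointer only after the allocation has succeeded.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PCRE2_MD_COPIED_SUBJECT. */
#define MD_COPIED_SUBJECT 0x01u

/* Hypothetical, much reduced stand-in for pcre2_match_data. */
typedef struct {
  const char *subject;   /* borrowed pointer, or owned copy if flag is set */
  size_t length;         /* subject length in code units (bytes here) */
  unsigned flags;
} toy_match_data;

/* Free a copy left behind by an earlier call, then clear the flag. This
mirrors the check made before a block is reused and before it is freed. */
static void toy_release(toy_match_data *md)
{
assert(md != NULL);
if ((md->flags & MD_COPIED_SUBJECT) != 0)
  {
  free((void *)md->subject);
  md->flags &= ~MD_COPIED_SUBJECT;
  }
md->subject = NULL;
}

/* Record a subject after a successful match. If copy is non-zero, take a
private copy, including the terminating zero when the caller passed a
zero-terminated string. Returns 0 on success, -1 if allocation fails. */
static int toy_record(toy_match_data *md, const char *subject, size_t length,
  int was_zero_terminated, int copy)
{
assert(md != NULL && subject != NULL);
toy_release(md);
md->subject = subject;
md->length = length;
if (copy)
  {
  size_t bytes = length + (was_zero_terminated? 1 : 0);
  char *dup = malloc((bytes == 0)? 1 : bytes);  /* avoid malloc(0) */
  if (dup == NULL) return -1;
  memcpy(dup, subject, bytes);
  md->subject = dup;
  md->flags |= MD_COPIED_SUBJECT;
  }
return 0;
}
```

With this shape a caller can overwrite or free its own buffer after a copying
call and still read the recorded subject, while a later non-copying call on the
same block releases the old copy before borrowing the new pointer.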
Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/src/pcre2test.c 2018-10-17 08:33:38 UTC (rev 1028)
@@ -620,6 +620,7 @@
{ "convert_glob_separator", MOD_PAT, MOD_CHR, 0, PO(convert_glob_separator) },
{ "convert_length", MOD_PAT, MOD_INT, 0, PO(convert_length) },
{ "copy", MOD_DAT, MOD_NN, DO(copy_numbers), DO(copy_names) },
+ { "copy_matched_subject", MOD_DAT, MOD_OPT, PCRE2_COPY_MATCHED_SUBJECT, DO(options) },
{ "debug", MOD_PAT, MOD_CTL, CTL_DEBUG, PO(control) },
{ "depth_limit", MOD_CTM, MOD_INT, 0, MO(depth_limit) },
{ "dfa", MOD_DAT, MOD_CTL, CTL_DFA, DO(control) },
@@ -4180,7 +4181,7 @@
((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
((options & PCRE2_EXTRA_MATCH_WORD) != 0)? " match_word" : "",
((options & PCRE2_EXTRA_MATCH_LINE) != 0)? " match_line" : "",
- ((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "",
+ ((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "",
after);
}
@@ -4196,11 +4197,13 @@
static void
show_match_options(uint32_t options)
{
-fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s",
+fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s",
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
+ ((options & PCRE2_COPY_MATCHED_SUBJECT) != 0)? " copy_matched_subject" : "",
((options & PCRE2_DFA_RESTART) != 0)? " dfa_restart" : "",
((options & PCRE2_DFA_SHORTEST) != 0)? " dfa_shortest" : "",
((options & PCRE2_ENDANCHORED) != 0)? " endanchored" : "",
+ ((options & PCRE2_NO_JIT) != 0)? " no_jit" : "",
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
((options & PCRE2_NOTBOL) != 0)? " notbol" : "",
((options & PCRE2_NOTEMPTY) != 0)? " notempty" : "",
@@ -7442,6 +7445,25 @@
}
}
+ /* If PCRE2_COPY_MATCHED_SUBJECT was set, check that things are as they
+ should be, but not for fast JIT, where it isn't supported. */
+
+ if ((dat_datctl.options & PCRE2_COPY_MATCHED_SUBJECT) != 0 &&
+ (pat_patctl.control & CTL_JITFAST) == 0)
+ {
+ if ((FLD(match_data, flags) & PCRE2_MD_COPIED_SUBJECT) == 0)
+ fprintf(outfile,
+ "** PCRE2 error: flag not set after copy_matched_subject\n");
+
+ if (CASTFLD(void *, match_data, subject) == pp)
+ fprintf(outfile,
+ "** PCRE2 error: copy_matched_subject has not copied\n");
+
+ if (memcmp(CASTFLD(void *, match_data, subject), pp, ulen) != 0)
+ fprintf(outfile,
+ "** PCRE2 error: copy_matched_subject mismatch\n");
+ }
+
/* If this is not the first time round a global loop, check that the
returned string has changed. If it has not, check for an empty string match
at different starting offset from the previous match. This is a failed test
Modified: code/trunk/testdata/testinput17
===================================================================
--- code/trunk/testdata/testinput17 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/testdata/testinput17 2018-10-17 08:33:38 UTC (rev 1028)
@@ -299,9 +299,9 @@
# ----
/[aC]/mg,firstline,newline=lf
-match\nmatch
+ match\nmatch
/[aCz]/mg,firstline,newline=lf
-match\nmatch
+ match\nmatch
# End of testinput17
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/testdata/testinput2 2018-10-17 08:33:38 UTC (rev 1028)
@@ -5531,4 +5531,11 @@
/(?(*script_run:xxx)zzz)/
+/foobar/
+ the foobar thing\=copy_matched_subject
+ the foobar thing\=copy_matched_subject,zero_terminate
+
+/foobar/g
+ the foobar thing foobar again\=copy_matched_subject
+
# End of testinput2
Modified: code/trunk/testdata/testinput6
===================================================================
--- code/trunk/testdata/testinput6 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/testdata/testinput6 2018-10-17 08:33:38 UTC (rev 1028)
@@ -4955,4 +4955,11 @@
\= Expect no match
\na
+/foobar/
+ the foobar thing\=copy_matched_subject
+ the foobar thing\=copy_matched_subject,zero_terminate
+
+/foobar/g
+ the foobar thing foobar again\=copy_matched_subject
+
# End of testinput6
Modified: code/trunk/testdata/testoutput17
===================================================================
--- code/trunk/testdata/testoutput17 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/testdata/testoutput17 2018-10-17 08:33:38 UTC (rev 1028)
@@ -543,11 +543,11 @@
# ----
/[aC]/mg,firstline,newline=lf
-match\nmatch
+ match\nmatch
0: a (JIT)
/[aCz]/mg,firstline,newline=lf
-match\nmatch
+ match\nmatch
0: a (JIT)
# End of testinput17
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/testdata/testoutput2 2018-10-17 08:33:38 UTC (rev 1028)
@@ -16821,6 +16821,17 @@
/(?(*script_run:xxx)zzz)/
Failed: error 128 at offset 14: assertion expected after (?( or (?(?C)
+/foobar/
+ the foobar thing\=copy_matched_subject
+ 0: foobar
+ the foobar thing\=copy_matched_subject,zero_terminate
+ 0: foobar
+
+/foobar/g
+ the foobar thing foobar again\=copy_matched_subject
+ 0: foobar
+ 0: foobar
+
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data
Modified: code/trunk/testdata/testoutput6
===================================================================
--- code/trunk/testdata/testoutput6 2018-10-15 11:01:24 UTC (rev 1027)
+++ code/trunk/testdata/testoutput6 2018-10-17 08:33:38 UTC (rev 1028)
@@ -7783,4 +7783,15 @@
\na
No match
+/foobar/
+ the foobar thing\=copy_matched_subject
+ 0: foobar
+ the foobar thing\=copy_matched_subject,zero_terminate
+ 0: foobar
+
+/foobar/g
+ the foobar thing foobar again\=copy_matched_subject
+ 0: foobar
+ 0: foobar
+
# End of testinput6