Revision: 381
http://www.exim.org/viewvc/pcre2?view=rev&revision=381
Author: ph10
Date: 2015-10-07 18:32:48 +0100 (Wed, 07 Oct 2015)
Log Message:
-----------
Implement PCRE2_SUBSTITUTE_EXTENDED.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre2_substitute.3
code/trunk/doc/pcre2api.3
code/trunk/src/pcre2.h
code/trunk/src/pcre2.h.in
code/trunk/src/pcre2_compile.c
code/trunk/src/pcre2_error.c
code/trunk/src/pcre2_internal.h
code/trunk/src/pcre2_substitute.c
code/trunk/src/pcre2test.c
code/trunk/testdata/testinput18
code/trunk/testdata/testinput2
code/trunk/testdata/testinput5
code/trunk/testdata/testoutput18
code/trunk/testdata/testoutput2
code/trunk/testdata/testoutput5
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/ChangeLog 2015-10-07 17:32:48 UTC (rev 381)
@@ -192,7 +192,9 @@
54. Add the null_context modifier to pcre2test so that calling pcre2_compile()
and the matching functions with NULL contexts can be tested.
+55. Implemented PCRE2_SUBSTITUTE_EXTENDED.
+
Version 10.20 30-June-2015
--------------------------
Modified: code/trunk/doc/pcre2_substitute.3
===================================================================
--- code/trunk/doc/pcre2_substitute.3 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/doc/pcre2_substitute.3 2015-10-07 17:32:48 UTC (rev 381)
@@ -1,4 +1,4 @@
-.TH PCRE2_SUBSTITUTE 3 "11 November 2014" "PCRE2 10.00"
+.TH PCRE2_SUBSTITUTE 3 "06 October 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -47,20 +47,22 @@
\fIoutlengthptr\fP, which is updated to the actual length of the new string.
The options are:
.sp
- PCRE2_ANCHORED Match only at the first position
- PCRE2_NOTBOL Subject string is not the beginning of a line
- PCRE2_NOTEOL Subject string is not the end of a line
- PCRE2_NOTEMPTY An empty string is not a valid match
- PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
- is not a valid match
- PCRE2_NO_UTF_CHECK Do not check the subject or replacement for
- UTF validity (only relevant if PCRE2_UTF
- was set at compile time)
- PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
+ PCRE2_ANCHORED Match only at the first position
+ PCRE2_NOTBOL Subject is not the beginning of a line
+ PCRE2_NOTEOL Subject is not the end of a line
+ PCRE2_NOTEMPTY An empty string is not a valid match
+ PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
+ subject is not a valid match
+ PCRE2_NO_UTF_CHECK Do not check the subject or replacement
+ for UTF validity (only relevant if
+ PCRE2_UTF was set at compile time)
+ PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
+ PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
.sp
The function returns the number of substitutions, which may be zero if there
were no matches. The result can be greater than one only when
-PCRE2_SUBSTITUTE_GLOBAL is set.
+PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
+is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/doc/pcre2api.3 2015-10-07 17:32:48 UTC (rev 381)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "22 September 2015" "PCRE2 10.21"
+.TH PCRE2API 3 "07 October 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@@ -1170,7 +1170,7 @@
.sp
If this option is set, an unanchored pattern is required to match before or at
the first newline in the subject string, though the matched text may continue
-over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
+over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
general limiting facility.
.sp
PCRE2_MATCH_UNSET_BACKREF
@@ -1367,8 +1367,8 @@
.sp
PCRE2_USE_OFFSET_LIMIT
.sp
-This option must be set for \fBpcre2_compile()\fP if
-\fBpcre2_set_offset_limit()\fP is going to be used to set a non-default offset
+This option must be set for \fBpcre2_compile()\fP if
+\fBpcre2_set_offset_limit()\fP is going to be used to set a non-default offset
limit in a match context for matches that use this pattern. An error is
generated if an offset limit is set without this option. For more details, see
the description of \fBpcre2_set_offset_limit()\fP in the
@@ -2657,64 +2657,121 @@
.B int pcre2_substitute(const pcre2_code *\fIcode\fP, PCRE2_SPTR \fIsubject\fP,
.B " PCRE2_SIZE \fIlength\fP, PCRE2_SIZE \fIstartoffset\fP,"
.B " uint32_t \fIoptions\fP, pcre2_match_data *\fImatch_data\fP,"
-.B " pcre2_match_context *\fImcontext\fP, PCRE2_SPTR \fIreplacementzfP,"
+.B " pcre2_match_context *\fImcontext\fP, PCRE2_SPTR \fIreplacement\fP,"
.B " PCRE2_SIZE \fIrlength\fP, PCRE2_UCHAR *\fIoutputbuffer\fP,"
.B " PCRE2_SIZE *\fIoutlengthptr\fP);"
.fi
+.P
This function calls \fBpcre2_match()\fP and then makes a copy of the subject
string in \fIoutputbuffer\fP, replacing the part that was matched with the
\fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
.P
+The first seven arguments of \fBpcre2_substitute()\fP are the same as for
+\fBpcre2_match()\fP, except that the partial matching options are not
+permitted, and \fImatch_data\fP may be passed as NULL, in which case a match
+data block is obtained and freed within this function, using memory management
+functions from the match context, if provided, or else those that were used to
+allocate memory for the compiled code.
+.P
+The \fIoutlengthptr\fP argument must point to a variable that contains the
+length, in code units, of the output buffer. If the function is successful,
+the value is updated to contain the length of the new string, excluding the
+trailing zero that is automatically added. If the function is not successful,
+the value is set to PCRE2_UNSET for general errors (such as output buffer too
+small). For syntax errors in the replacement string, the value is set to the
+offset in the replacement string where the error was detected.
+.P
In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of
characters from capturing groups or (*MARK) items in the pattern. The following
-forms are recognized:
+forms are always recognized:
.sp
$$ insert a dollar character
$<n> or ${<n>} insert the contents of group <n>
- $*MARK or ${*MARK} insert the name of the last (*MARK) encountered
+ $*MARK or ${*MARK} insert the name of the last (*MARK) encountered
.sp
Either a group number or a group name can be given for <n>. Curly brackets are
required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "=abc=" and the replacement
-string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
-calling \fBpcre2_copy_byname()\fP or \fBpcre2_copy_bynumber()\fP as
-appropriate.
+string "+$1$0$1+", the result is "=+babcb+=".
.P
-The facility for inserting a (*MARK) name can be used to perform simple
+The facility for inserting a (*MARK) name can be used to perform simple
simultaneous substitutions, as this \fBpcre2test\fP example shows:
.sp
/(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
apple lemon
2: pear orange
-.P
-The first seven arguments of \fBpcre2_substitute()\fP are the same as for
-\fBpcre2_match()\fP, except that the partial matching options are not
-permitted, and \fImatch_data\fP may be passed as NULL, in which case a match
-data block is obtained and freed within this function, using memory management
-functions from the match context, if provided, or else those that were used to
-allocate memory for the compiled code.
-.P
-There is one additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
+.sp
+There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
function to iterate over the subject string, replacing every matching
substring. If this is not set, only the first matching substring is replaced.
.P
-The \fIoutlengthptr\fP argument must point to a variable that contains the
-length, in code units, of the output buffer. It is updated to contain the
-length of the new string, excluding the trailing zero that is automatically
-added.
+A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra processing
+to be applied to the replacement string. Without this option, only the dollar
+character is special, and only the group insertion forms listed above are
+valid. When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
.P
-The function returns the number of replacements that were made. This may be
-zero if no matches were found, and is never greater than 1 unless
-PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
-is returned. Except for PCRE2_ERROR_NOMATCH (which is never returned), any
-errors from \fBpcre2_match()\fP or the substring copying functions are passed
-straight back. PCRE2_ERROR_BADREPLACEMENT is returned for an invalid
-replacement string (unrecognized sequence following a dollar sign), and
-PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough.
+Firstly, backslash in a replacement string is interpreted as an escape
+character. The usual forms such as \en or \ex{ddd} can be used to specify
+particular character codes, and backslash followed by any non-alphanumeric
+character quotes that character. Extended quoting can be coded using \eQ...\eE,
+exactly as in pattern strings.
+.P
+There are also four escape sequences for forcing the case of inserted letters.
+The insertion mechanism has three states: no case forcing, force upper case,
+and force lower case. The escape sequences change the current state: \eU and
+\eL change to upper or lower case forcing, respectively, and \eE (when not
+terminating a \eQ quoted sequence) reverts to no case forcing. The sequences
+\eu and \el force the next character (if it is a letter) to upper or lower
+case, respectively, and then the state automatically reverts to no case
+forcing. Case forcing applies to all inserted characters, including those from
+captured groups and letters within \eQ...\eE quoted sequences.
+.P
+Note that case forcing sequences such as \eU...\eE do not nest. For example,
+the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
+effect.
+.P
+The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
+flexibility to group substitution. The syntax is similar to that used by Bash:
+.sp
+ ${<n>:-<string>}
+ ${<n>:+<string1>:<string2>}
+.sp
+As before, <n> may be a group number or a name. The first form specifies a
+default value. If group <n> is set, its value is inserted; if not, <string> is
+expanded and the result inserted. The second form specifies strings that are
+expanded and inserted when group <n> is set or unset, respectively. The first
+form is just a convenient shorthand for
+.sp
+ ${<n>:+${<n>}:<string>}
+.sp
+Backslash can be used to escape colons and closing curly brackets in the
+replacement strings. A change of the case forcing state within a replacement
+string remains in force afterwards, as shown in this \fBpcre2test\fP example:
+.sp
+ /(some)?(body)/substitute_extended,replace=${1:+\eU:\eL}HeLLo
+ body
+ 1: hello
+ somebody
+ 1: HELLO
+.sp
+If successful, the function returns the number of replacements that were made.
+This may be zero if no matches were found, and is never greater than 1 unless
+PCRE2_SUBSTITUTE_GLOBAL is set.
+.P
+In the event of an error, a negative error code is returned. Except for
+PCRE2_ERROR_NOMATCH (which is never returned), errors from \fBpcre2_match()\fP
+are passed straight back. PCRE2_ERROR_NOMEMORY is returned if the output buffer
+is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax
+errors in the replacement string, with more particular errors being
+PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
+PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket not found), and
+PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group substitution). As
+for all PCRE2 errors, a text message that describes the error can be obtained
+by calling \fBpcre2_get_error_message()\fP.
.
.
.SH "DUPLICATE SUBPATTERN NAMES"
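As a rough illustration of the case-forcing rules described in the pcre2api.3 text above, the following C sketch models the three-state mechanism with two variables, in the same spirit as the forcecase and forcecasereset variables that this commit adds to pcre2_substitute.c. The function name force_case_demo is hypothetical and not part of PCRE2. The sketch handles ASCII only and treats any other backslash pair as quoting the next character, which is a simplification of the real escape processing.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Applies \U, \L, \u, \l and \E to an
ASCII replacement string and writes a zero-terminated result to out. Returns
the output length, or -1 if out is too small. */

int
force_case_demo(const char *rep, char *out, size_t outsize)
{
int forcecase = 0;        /* >0 force upper, <0 force lower, 0 no forcing */
int forcecasereset = 0;   /* state to return to after one character */
size_t n = 0;

assert(rep != NULL && out != NULL);

for (; *rep != '\0'; rep++)
  {
  char ch = *rep;
  if (ch == '\\' && rep[1] != '\0')
    {
    switch (rep[1])
      {
      case 'U': forcecase = forcecasereset = 1; rep++; continue;
      case 'L': forcecase = forcecasereset = -1; rep++; continue;
      case 'u': forcecase = 1; forcecasereset = 0; rep++; continue;
      case 'l': forcecase = -1; forcecasereset = 0; rep++; continue;
      case 'E': forcecase = forcecasereset = 0; rep++; continue;
      default: ch = *(++rep); break;   /* simplification: quote next char */
      }
    }
  if (forcecase > 0) ch = (char)toupper((unsigned char)ch);
  else if (forcecase < 0) ch = (char)tolower((unsigned char)ch);
  forcecase = forcecasereset;          /* one-shot forcing ends here */
  if (n + 1 >= outsize) return -1;     /* keep room for the trailing zero */
  out[n++] = ch;
  }

if (outsize == 0) return -1;
out[n] = '\0';
return (int)n;
}
```

Under these assumptions, the documentation's input \Uaa\LBB\Ecc\E produces AAbbcc, because \L simply replaces the upper-case state rather than nesting inside it.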
@@ -3008,6 +3065,6 @@
.rs
.sp
.nf
-Last updated: 22 September 2015
+Last updated: 07 October 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
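The two extended group forms documented in the pcre2api.3 changes above, ${<n>:-<string>} and ${<n>:+<string1>:<string2>}, amount to a choice between texts depending on whether the group is set. The C sketch below models only that choice. The group_ref type and select_insertion function are hypothetical names introduced for illustration, and the sketch does not re-expand the chosen text, which the real function does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one parsed group reference; not a PCRE2 type. */

typedef struct {
  char special;        /* '-' for ${n:-s}, '+' for ${n:+s1:s2}, 0 for ${n} */
  const char *text1;   /* <string> or <string1> */
  const char *text2;   /* <string2>, or NULL when it was not given */
} group_ref;

/* Returns the text to insert. group_value is the captured text, or NULL when
the group is unset. NULL is returned for a plain reference to an unset group,
which this sketch treats as an error. */

const char *
select_insertion(const group_ref *ref, const char *group_value)
{
int is_set = (group_value != NULL);

assert(ref != NULL);

switch (ref->special)
  {
  case '-':                       /* default value when unset */
  return is_set ? group_value : ref->text1;

  case '+':                       /* one text when set, another when unset */
  if (is_set) return ref->text1;
  return (ref->text2 != NULL) ? ref->text2 : "";

  default:                        /* plain ${n} */
  return group_value;
  }
}
```

This also shows why the documentation calls the first form a shorthand: with '-' the set branch returns the group's own value, which is what ${<n>:+${<n>}:<string>} would expand to.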
Modified: code/trunk/src/pcre2.h
===================================================================
--- code/trunk/src/pcre2.h 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2.h 2015-10-07 17:32:48 UTC (rev 381)
@@ -146,9 +146,10 @@
#define PCRE2_DFA_RESTART 0x00000040u
#define PCRE2_DFA_SHORTEST 0x00000080u
-/* This is an additional option for pcre2_substitute(). */
+/* These are additional options for pcre2_substitute(). */
#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u
+#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u
/* Newline and \R settings, for use in compile contexts. The newline values
must be kept in step with values set in config.h and both sets must all be
@@ -236,6 +237,9 @@
#define PCRE2_ERROR_UNAVAILABLE (-54)
#define PCRE2_ERROR_UNSET (-55)
#define PCRE2_ERROR_BADOFFSETLIMIT (-56)
+#define PCRE2_ERROR_BADREPESCAPE (-57)
+#define PCRE2_ERROR_REPMISSINGBRACE (-58)
+#define PCRE2_ERROR_BADSUBSTITUTION (-59)
/* Request types for pcre2_pattern_info() */
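The two substitute-only option bits defined in the pcre2.h hunk above are noticed by pcre2_substitute() and removed before the remaining options are passed to pcre2_match(), as the pcre2_substitute.c changes in this commit show. A minimal sketch of that separation follows. The split_options type and split_substitute_options function are hypothetical names, and the macro values are copied from the hunk so that the example compiles without pcre2.h.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the pcre2.h hunk in this commit. */

#ifndef PCRE2_SUBSTITUTE_GLOBAL
#define PCRE2_SUBSTITUTE_GLOBAL    0x00000100u
#endif
#ifndef PCRE2_SUBSTITUTE_EXTENDED
#define PCRE2_SUBSTITUTE_EXTENDED  0x00000200u
#endif

/* Hypothetical result type; not part of PCRE2. */

typedef struct {
  uint32_t match_options;   /* what would be handed on to the matcher */
  int global;               /* non-zero if PCRE2_SUBSTITUTE_GLOBAL was set */
  int extended;             /* non-zero if PCRE2_SUBSTITUTE_EXTENDED was set */
} split_options;

split_options
split_substitute_options(uint32_t options)
{
split_options r;

/* The two flags must occupy distinct bits for the masking below to work. */
assert((PCRE2_SUBSTITUTE_GLOBAL & PCRE2_SUBSTITUTE_EXTENDED) == 0);

r.global = (options & PCRE2_SUBSTITUTE_GLOBAL) != 0;
r.extended = (options & PCRE2_SUBSTITUTE_EXTENDED) != 0;
r.match_options = options &
  ~(uint32_t)(PCRE2_SUBSTITUTE_GLOBAL | PCRE2_SUBSTITUTE_EXTENDED);
return r;
}
```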
Modified: code/trunk/src/pcre2.h.in
===================================================================
--- code/trunk/src/pcre2.h.in 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2.h.in 2015-10-07 17:32:48 UTC (rev 381)
@@ -146,9 +146,10 @@
#define PCRE2_DFA_RESTART 0x00000040u
#define PCRE2_DFA_SHORTEST 0x00000080u
-/* This is an additional option for pcre2_substitute(). */
+/* These are additional options for pcre2_substitute(). */
#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u
+#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u
/* Newline and \R settings, for use in compile contexts. The newline values
must be kept in step with values set in config.h and both sets must all be
@@ -236,6 +237,9 @@
#define PCRE2_ERROR_UNAVAILABLE (-54)
#define PCRE2_ERROR_UNSET (-55)
#define PCRE2_ERROR_BADOFFSETLIMIT (-56)
+#define PCRE2_ERROR_BADREPESCAPE (-57)
+#define PCRE2_ERROR_REPMISSINGBRACE (-58)
+#define PCRE2_ERROR_BADSUBSTITUTION (-59)
/* Request types for pcre2_pattern_info() */
Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2_compile.c 2015-10-07 17:32:48 UTC (rev 381)
@@ -1612,8 +1612,15 @@
entry, ptr is pointing at the \. On exit, it points to the final code unit of
the escape sequence.
+This function is also called from pcre2_substitute() to handle escape sequences
+in replacement strings. In this case, the cb argument is NULL, and only
+sequences that define a data character are recognised. The isclass argument is
+not relevant, but the options argument is the final value of the compiled
+pattern's options.
+
Arguments:
- ptrptr points to the pattern position pointer
+ ptrptr points to the input position pointer
+ ptrend points to the end of the input
chptr points to a returned data character
errorcodeptr points to the errorcode variable (containing zero)
options the current options bits
@@ -1626,9 +1633,9 @@
on error, errorcodeptr is set non-zero
*/
-static int
-check_escape(PCRE2_SPTR *ptrptr, uint32_t *chptr, int *errorcodeptr,
- uint32_t options, BOOL isclass, compile_block *cb)
+int
+PRIV(check_escape)(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t *chptr,
+ int *errorcodeptr, uint32_t options, BOOL isclass, compile_block *cb)
{
BOOL utf = (options & PCRE2_UTF) != 0;
PCRE2_SPTR ptr = *ptrptr + 1;
@@ -1636,19 +1643,23 @@
int escape = 0;
int i;
+/* If backslash is at the end of the pattern, it's an error. */
+
+if (ptr >= ptrend)
+ {
+ *errorcodeptr = ERR1;
+ return 0;
+ }
+
GETCHARINCTEST(c, ptr); /* Get character value, increment pointer */
ptr--; /* Set pointer back to the last code unit */
-/* If backslash is at the end of the pattern, it's an error. */
-
-if (c == CHAR_NULL && ptr >= cb->end_pattern) *errorcodeptr = ERR1;
-
/* Non-alphanumerics are literals, so we just leave the value in c. An initial
value test saves a memory lookup for code points outside the alphanumeric
range. Otherwise, do a table lookup. A non-zero result is something that can be
returned immediately. Otherwise further processing is required. */
-else if (c < ESCAPES_FIRST || c > ESCAPES_LAST) {} /* Definitely literal */
+if (c < ESCAPES_FIRST || c > ESCAPES_LAST) {} /* Definitely literal */
else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
{
@@ -1660,7 +1671,9 @@
}
}
-/* Escapes that need further processing, including those that are unknown. */
+/* Escapes that need further processing, including those that are unknown.
+When called from pcre2_substitute(), only \c, \o, and \x are recognized (and \u
+when BSUX is set). */
else
{
@@ -1667,7 +1680,16 @@
PCRE2_SPTR oldptr;
BOOL braced, negated, overflow;
unsigned int s;
+
+ /* Filter calls from pcre2_substitute(). */
+ if (cb == NULL && c != CHAR_c && c != CHAR_o && c != CHAR_x &&
+ (c != CHAR_u || (options & PCRE2_ALT_BSUX) != 0))
+ {
+ *errorcodeptr = ERR3;
+ return 0;
+ }
+
switch (c)
{
/* A number of Perl escapes are not handled by PCRE. We give an explicit
@@ -2020,7 +2042,7 @@
c = *(++ptr);
if (c >= CHAR_a && c <= CHAR_z) c = UPPER_CASE(c);
- if (c == CHAR_NULL && ptr >= cb->end_pattern)
+ if (c == CHAR_NULL && ptr >= ptrend)
{
*errorcodeptr = ERR2;
break;
@@ -2874,7 +2896,8 @@
{
int rc;
*errorcodeptr = 0;
- rc = check_escape(&ptr, &x, errorcodeptr, options, FALSE, cb);
+ rc = PRIV(check_escape)(&ptr, cb->end_pattern, &x, errorcodeptr, options,
+ FALSE, cb);
*ptrptr = ptr; /* For possible error */
if (*errorcodeptr != 0) return -1;
if (rc != 0)
@@ -3048,7 +3071,8 @@
case CHAR_BACKSLASH:
errorcode = 0;
- escape = check_escape(&ptr, &c, &errorcode, options, FALSE, cb);
+ escape = PRIV(check_escape)(&ptr, cb->end_pattern, &c, &errorcode, options,
+ FALSE, cb);
if (errorcode != 0) goto FAILED;
if (escape == ESC_Q) inescq = TRUE;
break;
@@ -3132,7 +3156,8 @@
else if (c == CHAR_BACKSLASH)
{
errorcode = 0;
- escape = check_escape(&ptr, &c, &errorcode, options, TRUE, cb);
+ escape = PRIV(check_escape)(&ptr, cb->end_pattern, &c, &errorcode,
+ options, TRUE, cb);
if (errorcode != 0) goto FAILED;
if (escape == ESC_Q) inescq = TRUE;
}
@@ -4195,7 +4220,8 @@
if (c == CHAR_BACKSLASH)
{
- escape = check_escape(&ptr, &ec, errorcodeptr, options, TRUE, cb);
+ escape = PRIV(check_escape)(&ptr, cb->end_pattern, &ec, errorcodeptr,
+ options, TRUE, cb);
if (*errorcodeptr != 0) goto FAILED;
if (escape == 0) /* Escaped single char */
{
@@ -4405,7 +4431,8 @@
if (d == CHAR_BACKSLASH)
{
int descape;
- descape = check_escape(&ptr, &d, errorcodeptr, options, TRUE, cb);
+ descape = PRIV(check_escape)(&ptr, cb->end_pattern, &d,
+ errorcodeptr, options, TRUE, cb);
if (*errorcodeptr != 0) goto FAILED;
#ifdef EBCDIC
range_is_literal = FALSE;
@@ -6862,7 +6889,8 @@
case CHAR_BACKSLASH:
tempptr = ptr;
- escape = check_escape(&ptr, &ec, errorcodeptr, options, FALSE, cb);
+ escape = PRIV(check_escape)(&ptr, cb->end_pattern, &ec, errorcodeptr,
+ options, FALSE, cb);
if (*errorcodeptr != 0) goto FAILED;
if (escape == 0) /* The escape coded a single character */
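The pcre2_compile.c changes above replace the test against cb->end_pattern with an explicit ptrend argument, which appears to be what allows the function to be called with a NULL compile block for replacement strings that have an explicit length. The following C sketch shows the same pattern in a much reduced form. The name read_simple_escape is hypothetical, and the function only handles a backslash followed by one character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified escape reader; not PCRE2 code. On entry *ptrptr
points at a backslash. On success the escaped character is stored in *chptr,
*ptrptr is left on the final character of the escape, and 0 is returned. If
the backslash is the last character before ptrend, -1 is returned and *ptrptr
is not changed. */

int
read_simple_escape(const char **ptrptr, const char *ptrend, char *chptr)
{
const char *ptr;

assert(ptrptr != NULL && *ptrptr < ptrend && **ptrptr == '\\');

ptr = *ptrptr + 1;               /* step over the backslash */
if (ptr >= ptrend) return -1;    /* backslash at the end of the input */

*chptr = *ptr;
*ptrptr = ptr;
return 0;
}
```

Because the end is passed in rather than detected from a terminating zero, the same call works on a length-delimited slice of a longer buffer.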
Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2_error.c 2015-10-07 17:32:48 UTC (rev 381)
@@ -238,9 +238,12 @@
"nested recursion at the same subject position\0"
"recursion limit exceeded\0"
"requested value is not available\0"
- /* 55 */
+ /* 55 */
"requested value is not set\0"
- "offset limit set without PCRE2_USE_OFFSET_LIMIT\0"
+ "offset limit set without PCRE2_USE_OFFSET_LIMIT\0"
+ "bad escape sequence in replacement string\0"
+ "expected closing curly bracket in replacement string\0"
+ "bad substitution in replacement string\0"
;
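The three messages added above are stored, like their neighbours, as one long string with a \0 after each message. Applications should obtain the text through pcre2_get_error_message(); the sketch below only illustrates how an entry in such a zero-separated table can be located. The names demo_texts, nth_message, and demo_message_for are hypothetical, and the table holds just the three new messages, indexed from error code -57 as defined in pcre2.h in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Demo table holding only the three messages added in this commit. The
compiler appends one more zero, so the table ends with an empty string. */

const char demo_texts[] =
  "bad escape sequence in replacement string\0"
  "expected closing curly bracket in replacement string\0"
  "bad substitution in replacement string\0"
  ;

/* Returns the n-th (zero-based) message, or NULL if n is past the end. */

const char *
nth_message(const char *table, size_t tablesize, size_t n)
{
const char *p = table;
const char *end = table + tablesize;

assert(table != NULL);

while (p < end && *p != '\0')
  {
  if (n == 0) return p;
  p += strlen(p) + 1;   /* skip this message and its terminating zero */
  n--;
  }
return NULL;
}

/* Maps the error codes -57 to -59 onto the demo table. */

const char *
demo_message_for(int errorcode)
{
if (errorcode > -57 || errorcode < -59) return NULL;
return nth_message(demo_texts, sizeof demo_texts, (size_t)(-57 - errorcode));
}
```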
Modified: code/trunk/src/pcre2_internal.h
===================================================================
--- code/trunk/src/pcre2_internal.h 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2_internal.h 2015-10-07 17:32:48 UTC (rev 381)
@@ -1886,6 +1886,7 @@
is available. */
#define _pcre2_auto_possessify PCRE2_SUFFIX(_pcre2_auto_possessify_)
+#define _pcre2_check_escape PCRE2_SUFFIX(_pcre2_check_escape_)
#define _pcre2_find_bracket PCRE2_SUFFIX(_pcre2_find_bracket_)
#define _pcre2_is_newline PCRE2_SUFFIX(_pcre2_is_newline_)
#define _pcre2_jit_free_rodata PCRE2_SUFFIX(_pcre2_jit_free_rodata_)
@@ -1907,6 +1908,8 @@
extern int _pcre2_auto_possessify(PCRE2_UCHAR *, BOOL,
const compile_block *);
+extern int _pcre2_check_escape(PCRE2_SPTR *, PCRE2_SPTR, uint32_t *,
+ int *, uint32_t, BOOL, compile_block *);
extern PCRE2_SPTR _pcre2_find_bracket(PCRE2_SPTR, BOOL, int);
extern BOOL _pcre2_is_newline(PCRE2_SPTR, uint32_t, PCRE2_SPTR,
uint32_t *, BOOL);
Modified: code/trunk/src/pcre2_substitute.c
===================================================================
--- code/trunk/src/pcre2_substitute.c 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2_substitute.c 2015-10-07 17:32:48 UTC (rev 381)
@@ -45,8 +45,117 @@
#include "pcre2_internal.h"
+#define PTR_STACK_SIZE 20
+
/*************************************************
+* Find end of substitute text *
+*************************************************/
+
+/* In extended mode, we recognize ${name:+set text:unset text} and similar
+constructions. This requires the identification of unescaped : and }
+characters. This function scans for such. It must deal with nested ${
+constructions. The pointer to the text is updated, either to the required end
+character, or to where an error was detected.
+
+Arguments:
+ code points to the compiled expression (for options)
+ ptrptr points to the pointer to the start of the text (updated)
+ ptrend end of the whole string
+ last TRUE if the last expected string (only } recognized)
+
+Returns: 0 on success
+ negative error code on failure
+*/
+
+static int
+find_text_end(const pcre2_code *code, PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend,
+ BOOL last)
+{
+int rc = 0;
+uint32_t nestlevel = 0;
+BOOL literal = FALSE;
+PCRE2_SPTR ptr = *ptrptr;
+
+for (; ptr < ptrend; ptr++)
+ {
+ if (literal)
+ {
+ if (ptr[0] == CHAR_BACKSLASH && ptr < ptrend - 1 && ptr[1] == CHAR_E)
+ {
+ literal = FALSE;
+ ptr += 1;
+ }
+ }
+
+ else if (*ptr == CHAR_RIGHT_CURLY_BRACKET)
+ {
+ if (nestlevel == 0) goto EXIT;
+ nestlevel--;
+ }
+
+ else if (*ptr == CHAR_COLON && !last && nestlevel == 0) goto EXIT;
+
+ else if (*ptr == CHAR_DOLLAR_SIGN)
+ {
+ if (ptr < ptrend - 1 && ptr[1] == CHAR_LEFT_CURLY_BRACKET)
+ {
+ nestlevel++;
+ ptr += 1;
+ }
+ }
+
+ else if (*ptr == CHAR_BACKSLASH)
+ {
+ int erc;
+ int errorcode = 0;
+ uint32_t ch;
+
+ if (ptr < ptrend - 1) switch (ptr[1])
+ {
+ case CHAR_L:
+ case CHAR_l:
+ case CHAR_U:
+ case CHAR_u:
+ ptr += 1;
+ continue;
+ }
+
+ erc = PRIV(check_escape)(&ptr, ptrend, &ch, &errorcode,
+ code->overall_options, FALSE, NULL);
+ if (errorcode != 0)
+ {
+ rc = errorcode;
+ goto EXIT;
+ }
+
+ switch(erc)
+ {
+ case 0: /* Data character */
+ case ESC_E: /* Isolated \E is ignored */
+ break;
+
+ case ESC_Q:
+ literal = TRUE;
+ break;
+
+ default:
+ rc = PCRE2_ERROR_BADREPESCAPE;
+ goto EXIT;
+ }
+ }
+ }
+
+rc = PCRE2_ERROR_REPMISSINGBRACE; /* Terminator not found */
+
+EXIT:
+*ptrptr = ptr;
+return rc;
+}
+
+
+
+/*************************************************
* Match and substitute *
*************************************************/
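The scanning job that find_text_end() performs in the hunk above can be seen more easily in a reduced form. The C sketch below is not the PCRE2 function: find_terminator is a hypothetical name, a backslash simply quotes the next character, and \Q...\E literal mode and escape validation are left out. It keeps the essential part, which is tracking the nesting of ${ so that only an unnested } or : ends the text.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical scanner; not PCRE2 code. Returns a pointer to the
first unnested '}' (or ':' when stop_at_colon is non-zero), or NULL if no
terminator is found before ptrend. */

const char *
find_terminator(const char *ptr, const char *ptrend, int stop_at_colon)
{
unsigned int nestlevel = 0;

assert(ptr != NULL && ptrend != NULL);

for (; ptr < ptrend; ptr++)
  {
  if (*ptr == '\\')
    {
    if (ptr + 1 < ptrend) ptr++;        /* skip the quoted character */
    }
  else if (*ptr == '}')
    {
    if (nestlevel == 0) return ptr;     /* end of the whole construction */
    nestlevel--;                        /* end of a nested ${...} */
    }
  else if (*ptr == ':' && stop_at_colon && nestlevel == 0) return ptr;
  else if (*ptr == '$' && ptr + 1 < ptrend && ptr[1] == '{')
    {
    nestlevel++;
    ptr++;                              /* step over the '{' */
    }
  }

return NULL;                            /* terminator not found */
}
```

The stop_at_colon argument plays the opposite role to the last argument of find_text_end(): it is non-zero while scanning the first text of a ${<n>:+...:...} form and zero for the final text, where only } is a terminator.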
@@ -80,13 +189,23 @@
{
int rc;
int subs;
+int forcecase = 0;
+int forcecasereset = 0;
uint32_t ovector_count;
uint32_t goptions = 0;
BOOL match_data_created = FALSE;
BOOL global = FALSE;
-PCRE2_SIZE buff_offset, lengthleft, fraglength;
+BOOL extended = FALSE;
+BOOL literal = FALSE;
+BOOL utf = (code->overall_options & PCRE2_UTF) != 0;
+PCRE2_SPTR ptr;
+PCRE2_SPTR repend;
+PCRE2_SIZE buff_offset, buff_length, lengthleft, fraglength;
PCRE2_SIZE *ovector;
+buff_length = *blength;
+*blength = PCRE2_UNSET;
+
/* Partial matching is not valid. */
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0)
@@ -109,8 +228,7 @@
/* Check UTF replacement string if necessary. */
#ifdef SUPPORT_UNICODE
-if ((code->overall_options & PCRE2_UTF) != 0 &&
- (options & PCRE2_NO_UTF_CHECK) == 0)
+if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
rc = PRIV(valid_utf)(replacement, rlength, &(match_data->rightchar));
if (rc != 0)
@@ -121,8 +239,8 @@
}
#endif /* SUPPORT_UNICODE */
-/* Notice the global option and remove it from the options that are passed to
-pcre2_match(). */
+/* Notice the global and extended options and remove them from the options that
+are passed to pcre2_match(). */
if ((options & PCRE2_SUBSTITUTE_GLOBAL) != 0)
{
@@ -130,17 +248,24 @@
global = TRUE;
}
-/* Find lengths of zero-terminated strings. */
+if ((options & PCRE2_SUBSTITUTE_EXTENDED) != 0)
+ {
+ options &= ~PCRE2_SUBSTITUTE_EXTENDED;
+ extended = TRUE;
+ }
+/* Find lengths of zero-terminated strings and the end of the replacement. */
+
if (length == PCRE2_ZERO_TERMINATED) length = PRIV(strlen)(subject);
if (rlength == PCRE2_ZERO_TERMINATED) rlength = PRIV(strlen)(replacement);
+repend = replacement + rlength;
/* Copy up to the start offset */
-if (start_offset > *blength) goto NOROOM;
+if (start_offset > buff_length) goto NOROOM;
memcpy(buffer, subject, start_offset * (PCRE2_CODE_UNIT_WIDTH/8));
buff_offset = start_offset;
-lengthleft = *blength - start_offset;
+lengthleft = buff_length - start_offset;
/* Loop for global substituting. */
@@ -147,7 +272,8 @@
subs = 0;
do
{
- PCRE2_SIZE i;
+ PCRE2_SPTR ptrstack[PTR_STACK_SIZE];
+ uint32_t ptrstackptr = 0;
rc = pcre2_match(code, subject, length, start_offset, options|goptions,
match_data, mcontext);
@@ -199,19 +325,56 @@
buff_offset += fraglength;
lengthleft -= fraglength;
- for (i = 0; i < rlength; i++)
+ /* Process the replacement string. Literal mode is set by \Q, but only in
+ extended mode when backslashes are being interpreted. In extended mode we
+ must handle nested substrings that are to be reprocessed. */
+
+ ptr = replacement;
+ for (;;)
{
- if (replacement[i] == CHAR_DOLLAR_SIGN)
+ uint32_t ch;
+
+ /* If at the end of a nested substring, pop the stack. */
+
+ if (ptr >= repend)
{
+ if (ptrstackptr <= 0) break;
+ repend = ptrstack[--ptrstackptr];
+ ptr = ptrstack[--ptrstackptr];
+ continue;
+ }
+
+ /* Handle the next character */
+
+ if (literal)
+ {
+ if (ptr[0] == CHAR_BACKSLASH && ptr < repend - 1 && ptr[1] == CHAR_E)
+ {
+ literal = FALSE;
+ ptr += 2;
+ continue;
+ }
+ goto LOADLITERAL;
+ }
+
+ /* Not in literal mode. */
+
+ if (*ptr == CHAR_DOLLAR_SIGN)
+ {
int group, n;
+ uint32_t special = 0;
BOOL inparens;
BOOL star;
PCRE2_SIZE sublength;
+ PCRE2_SPTR text1_start = NULL;
+ PCRE2_SPTR text1_end = NULL;
+ PCRE2_SPTR text2_start = NULL;
+ PCRE2_SPTR text2_end = NULL;
PCRE2_UCHAR next;
PCRE2_UCHAR name[33];
- if (++i == rlength) goto BAD;
- if ((next = replacement[i]) == CHAR_DOLLAR_SIGN) goto LITERAL;
+ if (++ptr >= repend) goto BAD;
+ if ((next = *ptr) == CHAR_DOLLAR_SIGN) goto LOADLITERAL;
group = -1;
n = 0;
@@ -220,15 +383,15 @@
if (next == CHAR_LEFT_CURLY_BRACKET)
{
- if (++i == rlength) goto BAD;
- next = replacement[i];
+ if (++ptr >= repend) goto BAD;
+ next = *ptr;
inparens = TRUE;
}
if (next == CHAR_ASTERISK)
{
- if (++i == rlength) goto BAD;
- next = replacement[i];
+ if (++ptr >= repend) goto BAD;
+ next = *ptr;
star = TRUE;
}
@@ -235,9 +398,9 @@
if (!star && next >= CHAR_0 && next <= CHAR_9)
{
group = next - CHAR_0;
- while (++i < rlength)
+ while (++ptr < repend)
{
- next = replacement[i];
+ next = *ptr;
if (next < CHAR_0 || next > CHAR_9) break;
group = group * 10 + next - CHAR_0;
}
@@ -249,18 +412,53 @@
{
name[n++] = next;
if (n > 32) goto BAD;
- if (i == rlength) break;
- next = replacement[++i];
+ if (ptr >= repend) break;
+ next = *(++ptr);
}
if (n == 0) goto BAD;
name[n] = 0;
}
+ /* In extended mode we recognize ${name:+set text:unset text} and
+ ${name:-default text}. */
+
if (inparens)
{
- if (i == rlength || next != CHAR_RIGHT_CURLY_BRACKET) goto BAD;
+
+ if (extended && !star && ptr < repend - 2 && next == CHAR_COLON)
+ {
+ special = *(++ptr);
+ if (special != CHAR_PLUS && special != CHAR_MINUS)
+ {
+ rc = PCRE2_ERROR_BADSUBSTITUTION;
+ goto PTREXIT;
+ }
+
+ text1_start = ++ptr;
+ rc = find_text_end(code, &ptr, repend, special == CHAR_MINUS);
+ if (rc != 0) goto PTREXIT;
+ text1_end = ptr;
+
+ if (special == CHAR_PLUS && *ptr == CHAR_COLON)
+ {
+ text2_start = ++ptr;
+ rc = find_text_end(code, &ptr, repend, TRUE);
+ if (rc != 0) goto PTREXIT;
+ text2_end = ptr;
+ }
+ }
+
+ else
+ {
+ if (ptr >= repend || *ptr != CHAR_RIGHT_CURLY_BRACKET)
+ {
+ rc = PCRE2_ERROR_REPMISSINGBRACE;
+ goto PTREXIT;
+ }
+ }
+
+ ptr++;
}
- else i--; /* Last code unit of name/number */
/* Have found a syntactically correct group number or name, or
*name. Only *MARK is currently recognized. */
@@ -282,31 +480,242 @@
else goto BAD;
}
- /* Substitute the contents of a group. */
+ /* Substitute the contents of a group. We don't use substring_copy
+ functions any more, in order to support case forcing. */
else
{
- sublength = lengthleft;
+ PCRE2_SPTR subptr, subptrend;
+
+ /* Find a number for a named group. In case there are duplicate names,
+ search for the first one that is set. */
+
if (group < 0)
- rc = pcre2_substring_copy_byname(match_data, name,
- buffer + buff_offset, &sublength);
- else
- rc = pcre2_substring_copy_bynumber(match_data, group,
- buffer + buff_offset, &sublength);
- if (rc < 0) goto EXIT;
+ {
+ PCRE2_SPTR first, last, entry;
+ rc = pcre2_substring_nametable_scan(code, name, &first, &last);
+ if (rc < 0) goto PTREXIT;
+ for (entry = first; entry <= last; entry += rc)
+ {
+ uint32_t ng = GET2(entry, 0);
+ if (ng < ovector_count)
+ {
+ if (group < 0) group = ng; /* First in ovector */
+ if (ovector[ng*2] != PCRE2_UNSET)
+ {
+ group = ng; /* First that is set */
+ break;
+ }
+ }
+ }
+
+ /* If group is still negative, it means we did not find a group that
+ is in the ovector. Just set the first group. */
+
+ if (group < 0) group = GET2(first, 0);
+ }
- buff_offset += sublength;
- lengthleft -= sublength;
+ rc = pcre2_substring_length_bynumber(match_data, group, &sublength);
+ if (rc < 0 && (special == 0 || rc != PCRE2_ERROR_UNSET)) goto PTREXIT;
+
+ /* If special is '+' we have a 'set' and possibly an 'unset' text,
+ both of which are reprocessed when used. If special is '-' we have a
+ default text for when the group is unset; it must be reprocessed. */
+
+ if (special != 0)
+ {
+ if (special == CHAR_MINUS)
+ {
+ if (rc == 0) goto LITERAL_SUBSTITUTE;
+ text2_start = text1_start;
+ text2_end = text1_end;
+ }
+
+ if (ptrstackptr >= PTR_STACK_SIZE) goto BAD;
+ ptrstack[ptrstackptr++] = ptr;
+ ptrstack[ptrstackptr++] = repend;
+
+ if (rc == 0)
+ {
+ ptr = text1_start;
+ repend = text1_end;
+ }
+ else
+ {
+ ptr = text2_start;
+ repend = text2_end;
+ }
+ continue;
+ }
+
+ /* Otherwise we have a literal substitution of a group's contents. */
+
+ LITERAL_SUBSTITUTE:
+ subptr = subject + ovector[group*2];
+ subptrend = subject + ovector[group*2 + 1];
+
+ /* Substitute a literal string, possibly forcing alphabetic case. */
+
+ while (subptr < subptrend)
+ {
+ GETCHARINCTEST(ch, subptr);
+ if (forcecase != 0)
+ {
+#ifdef SUPPORT_UNICODE
+ if (utf)
+ {
+ uint32_t type = UCD_CHARTYPE(ch);
+ if (PRIV(ucp_gentype)[type] == ucp_L &&
+ type != ((forcecase > 0)? ucp_Lu : ucp_Ll))
+ ch = UCD_OTHERCASE(ch);
+ }
+ else
+#endif
+ {
+ if (((code->tables + cbits_offset +
+ ((forcecase > 0)? cbit_upper:cbit_lower)
+ )[ch/8] & (1 << (ch%8))) == 0)
+ ch = (code->tables + fcc_offset)[ch];
+ }
+ forcecase = forcecasereset;
+ }
+
+#ifdef SUPPORT_UNICODE
+ if (utf)
+ {
+ unsigned int chlen;
+#if PCRE2_CODE_UNIT_WIDTH == 8
+ if (lengthleft < 6) goto NOROOM;
+#elif PCRE2_CODE_UNIT_WIDTH == 16
+ if (lengthleft < 2) goto NOROOM;
+#else
+ if (lengthleft < 1) goto NOROOM;
+#endif
+ chlen = PRIV(ord2utf)(ch, buffer + buff_offset);
+ buff_offset += chlen;
+ lengthleft -= chlen;
+ }
+ else
+#endif
+ {
+ if (lengthleft-- < 1) goto NOROOM;
+ buffer[buff_offset++] = ch;
+ }
+ }
}
}
- /* Handle a literal code unit */
+ /* Handle an escape sequence in extended mode. We can use check_escape()
+ to process \Q, \E, \c, \o, \x and \ followed by non-alphanumerics, but
+ the case-forcing escapes are not supported in pcre2_compile() so must be
+ recognized here. */
- else
+ else if (extended && *ptr == CHAR_BACKSLASH)
{
+ int errorcode = 0;
+
+ if (ptr < repend - 1) switch (ptr[1])
+ {
+ case CHAR_L:
+ forcecase = forcecasereset = -1;
+ ptr += 2;
+ continue;
+
+ case CHAR_l:
+ forcecase = -1;
+ forcecasereset = 0;
+ ptr += 2;
+ continue;
+
+ case CHAR_U:
+ forcecase = forcecasereset = 1;
+ ptr += 2;
+ continue;
+
+ case CHAR_u:
+ forcecase = 1;
+ forcecasereset = 0;
+ ptr += 2;
+ continue;
+
+ default:
+ break;
+ }
+
+ rc = PRIV(check_escape)(&ptr, repend, &ch, &errorcode,
+ code->overall_options, FALSE, NULL);
+ if (errorcode != 0) goto BADESCAPE;
+ ptr++;
+
+ switch(rc)
+ {
+ case ESC_E:
+ forcecase = forcecasereset = 0;
+ continue;
+
+ case ESC_Q:
+ literal = TRUE;
+ continue;
+
+ case 0: /* Data character */
+ goto LITERAL;
+
+ default:
+ goto BADESCAPE;
+ }
+ }
+
+ /* Handle a literal code unit */
+
+ else
+ {
+ LOADLITERAL:
+ GETCHARINCTEST(ch, ptr); /* Get character value, increment pointer */
+
LITERAL:
- if (lengthleft-- < 1) goto NOROOM;
- buffer[buff_offset++] = replacement[i];
+ if (forcecase != 0)
+ {
+#ifdef SUPPORT_UNICODE
+ if (utf)
+ {
+ uint32_t type = UCD_CHARTYPE(ch);
+ if (PRIV(ucp_gentype)[type] == ucp_L &&
+ type != ((forcecase > 0)? ucp_Lu : ucp_Ll))
+ ch = UCD_OTHERCASE(ch);
+ }
+ else
+#endif
+ {
+ if (((code->tables + cbits_offset +
+ ((forcecase > 0)? cbit_upper:cbit_lower)
+ )[ch/8] & (1 << (ch%8))) == 0)
+ ch = (code->tables + fcc_offset)[ch];
+ }
+
+ forcecase = forcecasereset;
+ }
+
+#ifdef SUPPORT_UNICODE
+ if (utf)
+ {
+ unsigned int chlen;
+#if PCRE2_CODE_UNIT_WIDTH == 8
+ if (lengthleft < 6) goto NOROOM;
+#elif PCRE2_CODE_UNIT_WIDTH == 16
+ if (lengthleft < 2) goto NOROOM;
+#else
+ if (lengthleft < 1) goto NOROOM;
+#endif
+ chlen = PRIV(ord2utf)(ch, buffer + buff_offset);
+ buff_offset += chlen;
+ lengthleft -= chlen;
+ }
+ else
+#endif
+ {
+ if (lengthleft-- < 1) goto NOROOM;
+ buffer[buff_offset++] = ch;
+ }
}
}
@@ -341,6 +750,13 @@
BAD:
rc = PCRE2_ERROR_BADREPLACEMENT;
+goto PTREXIT;
+
+BADESCAPE:
+rc = PCRE2_ERROR_BADREPESCAPE;
+
+PTREXIT:
+*blength = (PCRE2_SIZE)(ptr - replacement);
goto EXIT;
}
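
Note on the case-forcing state used above: the pair forcecase/forcecasereset can be read as a two-variable state machine. \U and \L set both variables, so the forcing persists. \u and \l set forcecase only and clear forcecasereset, so the forcing lapses after one character. \E clears both. The following is a minimal ASCII-only sketch of that idea, not the library code. The names sketch_state, emit_char and expand_sketch are hypothetical, only $1..$9 and the five case escapes are handled, and toupper()/tolower() stand in for the character tables and the Unicode path.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state for the sketch; not a PCRE2 structure. */
typedef struct {
  int forcecase;       /* >0 force upper, <0 force lower, 0 leave alone */
  int forcecasereset;  /* value forcecase takes after each character */
  char *out;
  size_t used;
  size_t size;
} sketch_state;

/* Append one character, applying any pending case forcing. */
static int emit_char(sketch_state *st, char ch)
{
  if (st->forcecase != 0)
  {
    unsigned char c = (unsigned char)ch;
    ch = (char)((st->forcecase > 0) ? toupper(c) : tolower(c));
    st->forcecase = st->forcecasereset;
  }
  if (st->used + 1 >= st->size) return -1;   /* keep room for the NUL */
  st->out[st->used++] = ch;
  return 0;
}

/* Expand \u \U \l \L \E and $1..$9 (ASCII only). Returns 0 on success,
   -1 on an unsupported escape, an unknown group, or lack of room. */
static int expand_sketch(const char *rep, const char *const *groups,
                         size_t ngroups, char *out, size_t outsize)
{
  sketch_state st = { 0, 0, out, 0, outsize };
  assert(rep != NULL && out != NULL);
  if (outsize == 0) return -1;

  while (*rep != '\0')
  {
    if (rep[0] == '\\')
    {
      switch (rep[1])
      {
        case 'U': st.forcecase = st.forcecasereset = 1; break;
        case 'u': st.forcecase = 1; st.forcecasereset = 0; break;
        case 'L': st.forcecase = st.forcecasereset = -1; break;
        case 'l': st.forcecase = -1; st.forcecasereset = 0; break;
        case 'E': st.forcecase = st.forcecasereset = 0; break;
        default: return -1;   /* includes a trailing backslash */
      }
      rep += 2;
    }
    else if (rep[0] == '$' && rep[1] >= '1' && rep[1] <= '9')
    {
      size_t g = (size_t)(rep[1] - '1');
      size_t i, len;
      if (g >= ngroups) return -1;
      len = strlen(groups[g]);
      for (i = 0; i < len; i++)
        if (emit_char(&st, groups[g][i]) != 0) return -1;
      rep += 2;
    }
    else if (emit_char(&st, *rep++) != 0) return -1;
  }

  out[st.used] = '\0';
  return 0;
}
```

With groups "bc" and "DE", the replacement a\u$1\U$1\E$1\l$2\L$2\Eab\Uab\LYZ\EDone should come out as aBcBCbcdEdeabAByzDone under this sketch, which agrees with the corresponding entry added to testoutput2 in this revision.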
Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/src/pcre2test.c 2015-10-07 17:32:48 UTC (rev 381)
@@ -182,13 +182,13 @@
#define LOCALESIZE 32 /* Size of locale name */
#define LOOPREPEAT 500000 /* Default loop count for timing */
#define PATSTACKSIZE 20 /* Pattern stack for save/restore testing */
-#define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */
+#define REPLACE_MODSIZE 100 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */
/* Make sure the buffer into which replacement strings are copied is big enough
to hold them as 32-bit code units. */
-#define REPLACE_BUFFSIZE (4*REPLACE_MODSIZE)
+#define REPLACE_BUFFSIZE 1024 /* This is a byte value */
/* Execution modes */
@@ -385,31 +385,32 @@
/* Control bits. Some apply to compiling, some to matching, but some can be set
either on a pattern or a data line, so they must all be distinct. */
-#define CTL_AFTERTEXT 0x00000001u
-#define CTL_ALLAFTERTEXT 0x00000002u
-#define CTL_ALLCAPTURES 0x00000004u
-#define CTL_ALLUSEDTEXT 0x00000008u
-#define CTL_ALTGLOBAL 0x00000010u
-#define CTL_BINCODE 0x00000020u
-#define CTL_CALLOUT_CAPTURE 0x00000040u
-#define CTL_CALLOUT_INFO 0x00000080u
-#define CTL_CALLOUT_NONE 0x00000100u
-#define CTL_DFA 0x00000200u
-#define CTL_FINDLIMITS 0x00000400u
-#define CTL_FULLBINCODE 0x00000800u
-#define CTL_GETALL 0x00001000u
-#define CTL_GLOBAL 0x00002000u
-#define CTL_HEXPAT 0x00004000u
-#define CTL_INFO 0x00008000u
-#define CTL_JITFAST 0x00010000u
-#define CTL_JITVERIFY 0x00020000u
-#define CTL_MARK 0x00040000u
-#define CTL_MEMORY 0x00080000u
-#define CTL_NULLCONTEXT 0x00100000u
-#define CTL_POSIX 0x00200000u
-#define CTL_PUSH 0x00400000u
-#define CTL_STARTCHAR 0x00800000u
-#define CTL_ZERO_TERMINATE 0x01000000u
+#define CTL_AFTERTEXT 0x00000001u
+#define CTL_ALLAFTERTEXT 0x00000002u
+#define CTL_ALLCAPTURES 0x00000004u
+#define CTL_ALLUSEDTEXT 0x00000008u
+#define CTL_ALTGLOBAL 0x00000010u
+#define CTL_BINCODE 0x00000020u
+#define CTL_CALLOUT_CAPTURE 0x00000040u
+#define CTL_CALLOUT_INFO 0x00000080u
+#define CTL_CALLOUT_NONE 0x00000100u
+#define CTL_DFA 0x00000200u
+#define CTL_FINDLIMITS 0x00000400u
+#define CTL_FULLBINCODE 0x00000800u
+#define CTL_GETALL 0x00001000u
+#define CTL_GLOBAL 0x00002000u
+#define CTL_HEXPAT 0x00004000u
+#define CTL_INFO 0x00008000u
+#define CTL_JITFAST 0x00010000u
+#define CTL_JITVERIFY 0x00020000u
+#define CTL_MARK 0x00040000u
+#define CTL_MEMORY 0x00080000u
+#define CTL_NULLCONTEXT 0x00100000u
+#define CTL_POSIX 0x00200000u
+#define CTL_PUSH 0x00400000u
+#define CTL_STARTCHAR 0x00800000u
+#define CTL_SUBSTITUTE_EXTENDED 0x01000000u
+#define CTL_ZERO_TERMINATE 0x02000000u
#define CTL_BSR_SET 0x80000000u /* This is informational */
#define CTL_NL_SET 0x40000000u /* This is informational */
@@ -566,6 +567,7 @@
{ "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
+ { "substitute_extended", MOD_PAT, MOD_CTL, CTL_SUBSTITUTE_EXTENDED, PO(control) },
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
{ "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) },
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
@@ -3453,7 +3455,7 @@
static void
show_controls(uint32_t controls, const char *before)
{
-fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@@ -3481,6 +3483,7 @@
((controls & CTL_POSIX) != 0)? " posix" : "",
((controls & CTL_PUSH) != 0)? " push" : "",
((controls & CTL_STARTCHAR) != 0)? " startchar" : "",
+ ((controls & CTL_SUBSTITUTE_EXTENDED) != 0)? " substitute_extended" : "",
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
}
@@ -5685,7 +5688,7 @@
uint8_t *pr;
uint8_t rbuffer[REPLACE_BUFFSIZE];
uint8_t nbuffer[REPLACE_BUFFSIZE];
- uint32_t goption;
+ uint32_t xoptions;
PCRE2_SIZE rlen, nsize, erroroffset;
BOOL badutf = FALSE;
@@ -5702,8 +5705,11 @@
if (timeitm)
fprintf(outfile, "** Timing is not supported with replace: ignored\n");
- goption = ((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
- PCRE2_SUBSTITUTE_GLOBAL;
+ xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
+ PCRE2_SUBSTITUTE_GLOBAL) |
+ (((pat_patctl.control & CTL_SUBSTITUTE_EXTENDED) == 0)? 0 :
+ PCRE2_SUBSTITUTE_EXTENDED);
+
SETCASTPTR(r, rbuffer); /* Sets r8, r16, or r32, as appropriate. */
pr = dat_datctl.replacement;
@@ -5790,12 +5796,15 @@
else
rlen = (CASTVAR(uint8_t *, r) - rbuffer)/code_unit_size;
PCRE2_SUBSTITUTE(rc, compiled_code, pp, ulen, dat_datctl.offset,
- dat_datctl.options|goption, match_data, dat_context,
+ dat_datctl.options|xoptions, match_data, dat_context,
rbuffer, rlen, nbuffer, &nsize);
if (rc < 0)
{
- fprintf(outfile, "Failed: error %d: ", rc);
+ fprintf(outfile, "Failed: error %d", rc);
+ if (nsize != PCRE2_UNSET)
+ fprintf(outfile, " at offset %ld in replacement", nsize);
+ fprintf(outfile, ": ");
PCRE2_GET_ERROR_MESSAGE(nsize, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, nsize, FALSE, outfile);
}
Modified: code/trunk/testdata/testinput18
===================================================================
--- code/trunk/testdata/testinput18 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/testdata/testinput18 2015-10-07 17:32:48 UTC (rev 381)
@@ -92,4 +92,6 @@
"(?(?C)"
+/abcd/substitute_extended
+
# End of testdata/testinput18
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/testdata/testinput2 2015-10-07 17:32:48 UTC (rev 381)
@@ -4539,4 +4539,55 @@
abcd\=null_context,find_limits
abcd\=allusedtext,startchar
+/abcd/replace=w\rx\x82y\o{333}z(\Q12\$34$$\x34\E5$$),substitute_extended
+ abcd
+
+/a(bc)(DE)/replace=a\u$1\U$1\E$1\l$2\L$2\Eab\Uab\LYZ\EDone,substitute_extended
+ abcDE
+
+/abcd/replace=xy\kz,substitute_extended
+ abcd
+
+/a(?:(b)|(c))/substitute_extended,replace=X${1:+1:-1}X${2:+2:-2}
+ ab
+ ac
+ ab\=replace=${1:+$1\:$1:$2}
+ ac\=replace=${1:+$1\:$1:$2}
+
+/a(?:(b)|(c))/substitute_extended,replace=X${1:-1:-1}X${2:-2:-2}
+ ab
+ ac
+
+/(a)/substitute_extended,replace=>${1:+\Q$1:{}$$\E+\U$1}<
+ a
+
+/X(b)Y/substitute_extended
+ XbY\=replace=x${1:+$1\U$1}y
+ XbY\=replace=\Ux${1:+$1$1}y
+
+/a/substitute_extended,replace=${*MARK:+a:b}
+ a
+
+/(abcd)/replace=${1:+xy\kz},substitute_extended
+ abcd
+
+/abcd/substitute_extended,replace=>$1<
+ abcd
+
+/abcd/substitute_extended,replace=>xxx${xyz}<<<
+ abcd
+
+/(?J)(?:(?<A>a)|(?<A>b))/replace=<$A>
+ [a]
+ [b]
+\= Expect error
+ (a)\=ovector=1
+
+/(a)|(b)/replace=<$1>
+\= Expect error
+ b
+
+/(aa)(BB)/substitute_extended,replace=\U$1\L$2\E$1..\U$1\l$2$1
+ aaBB
+
# End of testinput2
Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/testdata/testinput5 2015-10-07 17:32:48 UTC (rev 381)
@@ -1681,9 +1681,16 @@
/[\pS#moq]/
=
-# UTF tests
-
/(*:a\x{12345}b\t(d\)c)xxx/utf,alt_verbnames,mark
cxxxz
+/abcd/utf,replace=x\x{824}y\o{3333}z(\Q12\$34$$\x34\E5$$),substitute_extended
+ abcd
+
+/a(\x{e0}\x{101})(\x{c0}\x{102})/utf,replace=a\u$1\U$1\E$1\l$2\L$2\Eab\U\x{e0}\x{101}\L\x{d0}\x{160}\EDone,substitute_extended
+ a\x{e0}\x{101}\x{c0}\x{102}
+
+/((?<digit>\d)|(?<letter>\p{L}))/g,substitute_extended,replace=<${digit:+digit; :not digit; }${letter:+letter:not a letter}>
+ ab12cde
+
# End of testinput5
Modified: code/trunk/testdata/testoutput18
===================================================================
--- code/trunk/testdata/testoutput18 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/testdata/testoutput18 2015-10-07 17:32:48 UTC (rev 381)
@@ -135,9 +135,12 @@
0+ issippi
/abc/\
-Failed: POSIX code 9: bad escape sequence at offset 4
+Failed: POSIX code 9: bad escape sequence at offset 3
"(?(?C)"
Failed: POSIX code 3: pattern error at offset 2
+/abcd/substitute_extended
+** Ignored with POSIX interface: substitute_extended
+
# End of testdata/testinput18
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/testdata/testoutput2 2015-10-07 17:32:48 UTC (rev 381)
@@ -946,10 +946,10 @@
Failed: error 104 at offset 7: numbers out of order in {} quantifier
/abc/\
-Failed: error 101 at offset 4: \ at end of pattern
+Failed: error 101 at offset 3: \ at end of pattern
/abc/\i
-Failed: error 101 at offset 4: \ at end of pattern
+Failed: error 101 at offset 3: \ at end of pattern
/(a)bc(d)/I
Capturing subpattern count = 2
@@ -13546,27 +13546,27 @@
/abc/replace=a$++
123abc
-Failed: error -35: invalid replacement string
+Failed: error -35 at offset 2 in replacement: invalid replacement string
/abc/replace=a$bad
123abc
-Failed: error -49: unknown substring
+Failed: error -49 at offset 5 in replacement: unknown substring
/abc/replace=a${A234567890123456789_123456789012}z
123abc
-Failed: error -49: unknown substring
+Failed: error -49 at offset 36 in replacement: unknown substring
/abc/replace=a${A23456789012345678901234567890123}z
123abc
-Failed: error -35: invalid replacement string
+Failed: error -35 at offset 35 in replacement: invalid replacement string
/abc/replace=a${bcd
123abc
-Failed: error -35: invalid replacement string
+Failed: error -58 at offset 6 in replacement: expected closing curly bracket in replacement string
/abc/replace=a${b+d}z
123abc
-Failed: error -35: invalid replacement string
+Failed: error -58 at offset 4 in replacement: expected closing curly bracket in replacement string
/abc/replace=[10]XYZ
123abc123
@@ -13632,19 +13632,19 @@
/(*:pear)apple/g,replace=${*MARKING}
apple lemon blackberry
-Failed: error -35: invalid replacement string
+Failed: error -35 at offset 11 in replacement: invalid replacement string
/(*:pear)apple/g,replace=${*MARK-time
apple lemon blackberry
-Failed: error -35: invalid replacement string
+Failed: error -58 at offset 7 in replacement: expected closing curly bracket in replacement string
/(*:pear)apple/g,replace=${*mark}
apple lemon blackberry
-Failed: error -35: invalid replacement string
+Failed: error -35 at offset 8 in replacement: invalid replacement string
/(*:pear)apple|(*:orange)lemon|(*:strawberry)blackberry/g,replace=<$*MARKET>
apple lemon blackberry
-Failed: error -35: invalid replacement string
+Failed: error -35 at offset 9 in replacement: invalid replacement string
/(*:pear)apple|(*:orange)lemon|(*:strawberry)blackberry/g,replace=[22]${*MARK}
apple lemon blackberry
@@ -14669,4 +14669,76 @@
abcd\=allusedtext,startchar
** Not allowed together: allusedtext startchar
+/abcd/replace=w\rx\x82y\o{333}z(\Q12\$34$$\x34\E5$$),substitute_extended
+ abcd
+ 1: w\x0dx\x82y\xdbz(12\$34$$\x345$)
+
+/a(bc)(DE)/replace=a\u$1\U$1\E$1\l$2\L$2\Eab\Uab\LYZ\EDone,substitute_extended
+ abcDE
+ 1: aBcBCbcdEdeabAByzDone
+
+/abcd/replace=xy\kz,substitute_extended
+ abcd
+Failed: error -57 at offset 4 in replacement: bad escape sequence in replacement string
+
+/a(?:(b)|(c))/substitute_extended,replace=X${1:+1:-1}X${2:+2:-2}
+ ab
+ 1: X1X-2
+ ac
+ 1: X-1X2
+ ab\=replace=${1:+$1\:$1:$2}
+ 1: b:b
+ ac\=replace=${1:+$1\:$1:$2}
+ 1: c
+
+/a(?:(b)|(c))/substitute_extended,replace=X${1:-1:-1}X${2:-2:-2}
+ ab
+ 1: XbX2:-2
+ ac
+ 1: X1:-1Xc
+
+/(a)/substitute_extended,replace=>${1:+\Q$1:{}$$\E+\U$1}<
+ a
+ 1: >$1:{}$$+A<
+
+/X(b)Y/substitute_extended
+ XbY\=replace=x${1:+$1\U$1}y
+ 1: xbBY
+ XbY\=replace=\Ux${1:+$1$1}y
+ 1: XBBY
+
+/a/substitute_extended,replace=${*MARK:+a:b}
+ a
+Failed: error -58 at offset 7 in replacement: expected closing curly bracket in replacement string
+
+/(abcd)/replace=${1:+xy\kz},substitute_extended
+ abcd
+Failed: error -57 at offset 8 in replacement: bad escape sequence in replacement string
+
+/abcd/substitute_extended,replace=>$1<
+ abcd
+Failed: error -49 at offset 3 in replacement: unknown substring
+
+/abcd/substitute_extended,replace=>xxx${xyz}<<<
+ abcd
+Failed: error -49 at offset 10 in replacement: unknown substring
+
+/(?J)(?:(?<A>a)|(?<A>b))/replace=<$A>
+ [a]
+ 1: [<a>]
+ [b]
+ 1: [<b>]
+\= Expect error
+ (a)\=ovector=1
+Failed: error -54 at offset 3 in replacement: requested value is not available
+
+/(a)|(b)/replace=<$1>
+\= Expect error
+ b
+Failed: error -55 at offset 3 in replacement: requested value is not set
+
+/(aa)(BB)/substitute_extended,replace=\U$1\L$2\E$1..\U$1\l$2$1
+ aaBB
+ 1: AAbbaa..AAbBaa
+
# End of testinput2
Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5 2015-09-25 16:14:40 UTC (rev 380)
+++ code/trunk/testdata/testoutput5 2015-10-07 17:32:48 UTC (rev 381)
@@ -4029,11 +4029,21 @@
=
0: =
-# UTF tests
-
/(*:a\x{12345}b\t(d\)c)xxx/utf,alt_verbnames,mark
cxxxz
0: xxx
MK: a\x{12345}b\x{09}(d)c
+/abcd/utf,replace=x\x{824}y\o{3333}z(\Q12\$34$$\x34\E5$$),substitute_extended
+ abcd
+ 1: x\x{824}y\x{6db}z(12\$34$$\x345$)
+
+/a(\x{e0}\x{101})(\x{c0}\x{102})/utf,replace=a\u$1\U$1\E$1\l$2\L$2\Eab\U\x{e0}\x{101}\L\x{d0}\x{160}\EDone,substitute_extended
+ a\x{e0}\x{101}\x{c0}\x{102}
+ 1: a\x{c0}\x{101}\x{c0}\x{100}\x{e0}\x{101}\x{e0}\x{102}\x{e0}\x{103}ab\x{c0}\x{100}\x{f0}\x{161}Done
+
+/((?<digit>\d)|(?<letter>\p{L}))/g,substitute_extended,replace=<${digit:+digit; :not digit; }${letter:+letter:not a letter}>
+ ab12cde
+ 7: <not digit; letter><not digit; letter><digit; not a letter><digit; not a letter><not digit; letter><not digit; letter><not digit; letter>
+
# End of testinput5
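
Note on the ${n:+set:unset} and ${n:-default} tests: the selection rule they exercise can be summarised in a few lines. This is a sketch for reading the expected outputs, not the code in pcre2_substitute.c. select_text is a hypothetical helper, a NULL group_value stands for an unset group, and the empty string returned when ':+' has no "unset" part is an assumption, since this part of the diff does not cover that case. In the library the selected text is then reprocessed as replacement text (so it may itself contain $1, \U and so on), which the sketch does not attempt.

```c
#include <stddef.h>

/* Hypothetical helper: which text does ${group<special>text1:text2} pick?
   special is '+', '-' or 0 (plain ${group}); group_value is NULL when the
   group is unset. */
static const char *select_text(char special, const char *group_value,
                               const char *text1, const char *text2)
{
  if (special == '-')   /* ${n:-default}: group contents, else the default */
    return (group_value != NULL) ? group_value : text1;

  if (special == '+')   /* ${n:+set:unset}: text1 if set, else text2 */
  {
    if (group_value != NULL) return text1;
    return (text2 != NULL) ? text2 : "";   /* assumption: missing part is empty */
  }

  return group_value;   /* plain ${n}; NULL corresponds to the "not set" error */
}
```

For the subject "ab" matched by /a(?:(b)|(c))/, group 1 is "b" and group 2 is unset, so X${1:+1:-1}X${2:+2:-2} selects "1" and "-2", and X${1:-1:-1}X${2:-2:-2} selects "b" and "2:-2". That is consistent with the X1X-2 and XbX2:-2 lines in testoutput2.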