[Pcre-svn] [1219] code/trunk: Fix problems with new PCRE2_S…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1219] code/trunk: Fix problems with new PCRE2_SUBSTITUTE_MATCHED code.
Revision: 1219
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1219
Author:   ph10
Date:     2020-02-16 17:46:40 +0000 (Sun, 16 Feb 2020)
Log Message:
-----------
Fix problems with new PCRE2_SUBSTITUTE_MATCHED code.
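
Note on the pcre2_substitute.c change: when PCRE2_SUBSTITUTE_MATCHED is set, the existing match is now copied into an internally obtained match data block, so the caller's block is never written to. The following is a minimal sketch of that copy step only, not the PCRE2 source. The names demo_match_data, demo_create() and demo_copy_match() are hypothetical stand-ins (the real pcre2_match_data structure is private to the library and has more fields). The pair count is clamped in the same way as the "pairs" computation in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a match data block: a small header followed by
   the offset pairs. This is an assumption made for illustration only. */
typedef struct demo_match_data {
  int      rc;          /* result of the last match */
  uint16_t oveccount;   /* number of offset pairs the block can hold */
  size_t   ovector[];   /* 2 * oveccount entries */
} demo_match_data;

/* Allocate a zeroed block with room for oveccount offset pairs. */
static demo_match_data *demo_create(uint16_t oveccount)
{
  size_t size = offsetof(demo_match_data, ovector) +
    2 * (size_t)oveccount * sizeof(size_t);
  demo_match_data *md = calloc(1, size);
  if (md != NULL) md->oveccount = oveccount;
  return md;
}

/* Copy an existing match into a fresh internal block. Only the header and
   the pairs that can be in use are copied: the smaller of (top_bracket + 1)
   and the capacity of the external block. The external block is read only. */
static demo_match_data *demo_copy_match(const demo_match_data *external,
  uint32_t top_bracket)
{
  size_t pairs;
  demo_match_data *internal;

  assert(external != NULL);
  pairs = ((size_t)top_bracket + 1 < external->oveccount)?
    (size_t)top_bracket + 1 : external->oveccount;
  internal = demo_create(external->oveccount);
  if (internal == NULL) return NULL;
  memcpy(internal, external, offsetof(demo_match_data, ovector) +
    2 * pairs * sizeof(size_t));
  return internal;
}
```

In this sketch the caller frees the returned block with free() once substitution has finished, which corresponds to the single cleanup at the EXIT label in the diff. Any later matching would write only to the copy.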


Modified Paths:
--------------
    code/trunk/doc/pcre2api.3
    code/trunk/src/pcre2_substitute.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3    2020-02-11 16:37:08 UTC (rev 1218)
+++ code/trunk/doc/pcre2api.3    2020-02-16 17:46:40 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "22 January 2020" "PCRE2 10.35"
+.TH PCRE2API 3 "16 February 2020" "PCRE2 10.35"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -3328,12 +3328,12 @@
 option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
 replacement string(s). The default action is to perform just one replacement if
 the pattern matches, but there is an option that requests multiple replacements
-(see PCRE2_SUBSTITUTE_GLOBAL below for details).
+(see PCRE2_SUBSTITUTE_GLOBAL below).
 .P
 If successful, \fBpcre2_substitute()\fP returns the number of substitutions
 that were carried out. This may be zero if no match was found, and is never
 greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A negative value is
-returned if an error is detected (see below for details).
+returned if an error is detected.
 .P
 Matches in which a \eK item in a lookahead in the pattern causes the match to
 end before it starts are not supported, and give rise to an error return. For
@@ -3348,10 +3348,11 @@
 functions from the match context, if provided, or else those that were used to
 allocate memory for the compiled code.
 .P
-If an external \fImatch_data\fP block is provided, its contents afterwards
-are those set by the final call to \fBpcre2_match()\fP. For global changes,
-this will have ended in a no-match error. The contents of the ovector within
-the match data block may or may not have been changed.
+If \fImatch_data\fP is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
+provided block is used for all calls to \fBpcre2_match()\fP, and its contents
+afterwards are the result of the final call. For global changes, this will
+always be a no-match error. The contents of the ovector within the match data
+block may or may not have been changed.
 .P
 As well as the usual options for \fBpcre2_match()\fP, a number of additional
 options can be set in the \fIoptions\fP argument of \fBpcre2_substitute()\fP.
@@ -3363,16 +3364,22 @@
 an application to check for a match before choosing to substitute, without
 having to repeat the match.
 .P
-The \fIcode\fP argument is not used for the first substitution when
-PCRE2_SUBSTITUTE_MATCHED is set, but if PCRE2_SUBSTITUTE_GLOBAL is also set,
-\fBpcre2_match()\fP will be called after the first substitution to check for
-further matches, and the contents of the \fImatch_data\fP block will be
-changed.
+The contents of the externally supplied match data block are not changed when
+PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTITUTE_GLOBAL is also set,
+\fBpcre2_match()\fP is called after the first substitution to check for further
+matches, but this is done using an internally obtained match data block, thus
+always leaving the external block unchanged.
 .P
-The default is to return a copy of the subject string with matched substrings 
-replaced. However, if PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the 
-replacement substrings are returned. In the global case, multiple replacements 
-are concatenated in the output buffer. Substitution callouts (see
+The \fIcode\fP argument is not used for matching before the first substitution
+when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided, even when
+PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains information such as the
+UTF setting and the number of capturing parentheses in the pattern.
+.P
+The default action of \fBpcre2_substitute()\fP is to return a copy of the
+subject string with matched substrings replaced. However, if
+PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
+returned. In the global case, multiple replacements are concatenated in the
+output buffer. Substitution callouts (see
 .\" HTML <a href="#subcallouts">
 .\" </a>
 below)
@@ -3381,26 +3388,39 @@
 .P
 The \fIoutlengthptr\fP argument of \fBpcre2_substitute()\fP must point to a
 variable that contains the length, in code units, of the output buffer. If the
-function is successful, the value is updated to contain the length of the new
-string, excluding the trailing zero that is automatically added.
+function is successful, the value is updated to contain the length in code
+units of the new string, excluding the trailing zero that is automatically
+added.
 .P
 If the function is not successful, the value set via \fIoutlengthptr\fP depends
 on the type of error. For syntax errors in the replacement string, the value is
 the offset in the replacement string where the error was detected. For other
 errors, the value is PCRE2_UNSET by default. This includes the case of the
-output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
-(see below), in which case the value is the minimum length needed, including
-space for the trailing zero. Note that in order to compute the required length,
-\fBpcre2_substitute()\fP has to simulate all the matching and copying, instead
-of giving an error return as soon as the buffer overflows. Note also that the
-length is in code units, not bytes.
+output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
 .P
-The replacement string, which is interpreted as a UTF string in UTF mode,
-is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set. If the
-PCRE2_SUBSTITUTE_LITERAL option is set, it is not interpreted in any way. By
-default, however, a dollar character is an escape character that can specify
-the insertion of characters from capture groups and names from (*MARK) or other
-control verbs in the pattern. The following forms are always recognized:
+PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
+too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
+this option is set, however, \fBpcre2_substitute()\fP continues to go through
+the motions of matching and substituting (without, of course, writing anything)
+in order to compute the size of buffer that is needed. This value is passed
+back via the \fIoutlengthptr\fP variable, with the result of the function still
+being PCRE2_ERROR_NOMEMORY.
+.P
+Passing a buffer size of zero is a permitted way of finding out how much memory
+is needed for given substitution. However, this does mean that the entire
+operation is carried out twice. Depending on the application, it may be more
+efficient to allocate a large buffer and free the excess afterwards, instead of
+using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
+.P
+The replacement string, which is interpreted as a UTF string in UTF mode, is
+checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An invalid UTF
+replacement string causes an immediate return with the relevant UTF error code.
+.P
+If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not interpreted
+in any way. By default, however, a dollar character is an escape character that
+can specify the insertion of characters from capture groups and names from
+(*MARK) or other control verbs in the pattern. The following forms are always
+recognized:
 .sp
   $$                  insert a dollar character
   $<n> or ${<n>}      insert the contents of group <n>
@@ -3445,20 +3465,6 @@
 CRLF is a valid newline sequence and the next two characters are CR, LF. In
 this case, the offset is advanced by two characters.
 .P
-PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
-too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
-this option is set, however, \fBpcre2_substitute()\fP continues to go through
-the motions of matching and substituting (without, of course, writing anything)
-in order to compute the size of buffer that is needed. This value is passed
-back via the \fIoutlengthptr\fP variable, with the result of the function still
-being PCRE2_ERROR_NOMEMORY.
-.P
-Passing a buffer size of zero is a permitted way of finding out how much memory
-is needed for given substitution. However, this does mean that the entire
-operation is carried out twice. Depending on the application, it may be more
-efficient to allocate a large buffer and free the excess afterwards, instead of
-using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
-.P
 PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do
 not appear in the pattern to be treated as unset groups. This option should be
 used with care, because it means that a typo in a group name or number no
@@ -3917,6 +3923,6 @@
 .rs
 .sp
 .nf
-Last updated: 22 January 2020
+Last updated: 16 February 2020
 Copyright (c) 1997-2020 University of Cambridge.
 .fi


Modified: code/trunk/src/pcre2_substitute.c
===================================================================
--- code/trunk/src/pcre2_substitute.c    2020-02-11 16:37:08 UTC (rev 1218)
+++ code/trunk/src/pcre2_substitute.c    2020-02-16 17:46:40 UTC (rev 1219)
@@ -229,7 +229,7 @@
 uint32_t ovector_count;
 uint32_t goptions = 0;
 uint32_t suboptions;
-BOOL match_data_created = FALSE;
+pcre2_match_data *internal_match_data = NULL;
 BOOL escaped_literal = FALSE;
 BOOL overflowed = FALSE;
 BOOL use_existing_match;
@@ -265,22 +265,42 @@
 use_existing_match = ((options & PCRE2_SUBSTITUTE_MATCHED) != 0);
 replacement_only = ((options & PCRE2_SUBSTITUTE_REPLACEMENT_ONLY) != 0);


-if (use_existing_match)
+/* If starting from an existing match, there must be an externally provided
+match data block. We create an internal match_data block in two cases: (a) an
+external one is not supplied (and we are not starting from an existing match);
+(b) an existing match is to be used for the first substitution. In the latter
+case, we copy the existing match into the internal block. This ensures that no
+changes are made to the existing match data block. */
+
+if (match_data == NULL)
   {
-  if (match_data == NULL) return PCRE2_ERROR_NULL;
+  pcre2_general_context *gcontext;
+  if (use_existing_match) return PCRE2_ERROR_NULL;
+  gcontext = (mcontext == NULL)?
+    (pcre2_general_context *)code :
+    (pcre2_general_context *)mcontext;
+  match_data = internal_match_data =
+    pcre2_match_data_create_from_pattern(code, gcontext);
+  if (internal_match_data == NULL) return PCRE2_ERROR_NOMEMORY;
   }


-/* Otherwise, if no match data block is provided, create one. */
-
-else if (match_data == NULL)
+else if (use_existing_match)
   {
   pcre2_general_context *gcontext = (mcontext == NULL)?
     (pcre2_general_context *)code :
     (pcre2_general_context *)mcontext;
-  match_data = pcre2_match_data_create_from_pattern(code, gcontext);
-  if (match_data == NULL) return PCRE2_ERROR_NOMEMORY;
-  match_data_created = TRUE;
+  int pairs = (code->top_bracket + 1 < match_data->oveccount)?
+    code->top_bracket + 1 : match_data->oveccount;
+  internal_match_data = pcre2_match_data_create(match_data->oveccount,
+    gcontext);
+  if (internal_match_data == NULL) return PCRE2_ERROR_NOMEMORY;
+  memcpy(internal_match_data, match_data, offsetof(pcre2_match_data, ovector)
+    + 2*pairs*sizeof(PCRE2_SIZE));
+  match_data = internal_match_data;
   }
+
+/* Remember ovector details */
+
 ovector = pcre2_get_ovector_pointer(match_data);
 ovector_count = pcre2_get_ovector_count(match_data);


@@ -302,7 +322,7 @@
 #ifdef SUPPORT_UNICODE
 if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
   {
-  rc = PRIV(valid_utf)(replacement, rlength, &(match_data->rightchar));
+  rc = PRIV(valid_utf)(replacement, rlength, &(match_data->startchar));
   if (rc != 0)
     {
     match_data->leftchar = 0;
@@ -316,7 +336,7 @@
 suboptions = options & SUBSTITUTE_OPTIONS;
 options &= ~SUBSTITUTE_OPTIONS;


-/* Error if the start match offset it greater than the length of the subject. */
+/* Error if the start match offset is greater than the length of the subject. */

 if (start_offset > length)
   {
@@ -344,7 +364,7 @@
     use_existing_match = FALSE;
     }
   else rc = pcre2_match(code, subject, length, start_offset, options|goptions,
-      match_data, mcontext);
+    match_data, mcontext);


 #ifdef SUPPORT_UNICODE
   if (utf) options |= PCRE2_NO_UTF_CHECK;  /* Only need to check once */
@@ -898,8 +918,9 @@
     }


/* Save the details of this match. See above for how this data is used. If we
- matched an empty string, do the magic for global matches. Finally, update the
- start offset to point to the rest of the subject string. */
+ matched an empty string, do the magic for global matches. Update the start
+ offset to point to the rest of the subject string. If we re-used an existing
+ match for the first match, switch to the internal match data block. */

ovecsave[0] = ovector[0];
ovecsave[1] = ovector[1];
@@ -942,7 +963,7 @@
}

EXIT:
-if (match_data_created) pcre2_match_data_free(match_data);
+if (internal_match_data != NULL) pcre2_match_data_free(internal_match_data);
else match_data->rc = rc;
return rc;


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2020-02-11 16:37:08 UTC (rev 1218)
+++ code/trunk/testdata/testinput2    2020-02-16 17:46:40 UTC (rev 1219)
@@ -5806,6 +5806,33 @@
     12abc34xyz99abc55\=substitute_skip=1
     12abc34xyz99abc55\=substitute_skip=2


+/a(..)d/replace=>$1<,substitute_matched
+    xyzabcdxyzabcdxyz
+    xyzabcdxyzabcdxyz\=ovector=2
+\= Expect error     
+    xyzabcdxyzabcdxyz\=ovector=1
+
+/a(..)d/g,replace=>$1<,substitute_matched
+    xyzabcdxyzabcdxyz
+    xyzabcdxyzabcdxyz\=ovector=2
+\= Expect error     
+    xyzabcdxyzabcdxyz\=ovector=1
+    xyzabcdxyzabcdxyz\=ovector=1,substitute_unset_empty
+
+/55|a(..)d/g,replace=>$1<,substitute_matched
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+\= Expect error     
+    xyz55abcdxyzabcdxyz\=ovector=2
+
+/55|a(..)d/replace=>$1<,substitute_matched
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+
+/55|a(..)d/replace=>$1<
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+
+/55|a(..)d/g,replace=>$1<
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+
 # Expect non-fixed-length error


"(?<=X(?(DEFINE)(.*))(?1))."

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2020-02-11 16:37:08 UTC (rev 1218)
+++ code/trunk/testdata/testoutput2    2020-02-16 17:46:40 UTC (rev 1219)
@@ -17536,6 +17536,45 @@
  3(2) Old 12 15 "abc" New 5 10 "<abc>"
  3: <abc><abc>


+/a(..)d/replace=>$1<,substitute_matched
+    xyzabcdxyzabcdxyz
+ 1: xyz>bc<xyzabcdxyz
+    xyzabcdxyzabcdxyz\=ovector=2
+ 1: xyz>bc<xyzabcdxyz
+\= Expect error     
+    xyzabcdxyzabcdxyz\=ovector=1
+Failed: error -54 at offset 3 in replacement: requested value is not available
+
+/a(..)d/g,replace=>$1<,substitute_matched
+    xyzabcdxyzabcdxyz
+ 2: xyz>bc<xyz>bc<xyz
+    xyzabcdxyzabcdxyz\=ovector=2
+ 2: xyz>bc<xyz>bc<xyz
+\= Expect error     
+    xyzabcdxyzabcdxyz\=ovector=1
+Failed: error -54 at offset 3 in replacement: requested value is not available
+    xyzabcdxyzabcdxyz\=ovector=1,substitute_unset_empty
+Failed: error -54 at offset 3 in replacement: requested value is not available
+
+/55|a(..)d/g,replace=>$1<,substitute_matched
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+ 3: xyz><>bc<xyz>bc<xyz
+\= Expect error     
+    xyz55abcdxyzabcdxyz\=ovector=2
+Failed: error -55 at offset 3 in replacement: requested value is not set
+
+/55|a(..)d/replace=>$1<,substitute_matched
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+ 1: xyz><abcdxyzabcdxyz
+
+/55|a(..)d/replace=>$1<
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+ 1: xyz><abcdxyzabcdxyz
+
+/55|a(..)d/g,replace=>$1<
+    xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
+ 3: xyz><>bc<xyz>bc<xyz
+
 # Expect non-fixed-length error


"(?<=X(?(DEFINE)(.*))(?1))."