Revision: 144
http://www.exim.org/viewvc/pcre2?view=rev&revision=144
Author: ph10
Date: 2014-11-12 16:57:56 +0000 (Wed, 12 Nov 2014)
Log Message:
-----------
Substitution tests and documentation.
Modified Paths:
--------------
code/trunk/doc/pcre2test.1
code/trunk/src/pcre2_error.c
code/trunk/src/pcre2_valid_utf.c
code/trunk/src/pcre2test.c
code/trunk/testdata/testinput10
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput10
code/trunk/testdata/testoutput2
Modified: code/trunk/doc/pcre2test.1
===================================================================
--- code/trunk/doc/pcre2test.1 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/doc/pcre2test.1 2014-11-12 16:57:56 UTC (rev 144)
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "09 November 2014" "PCRE 10.00"
+.TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -645,6 +645,7 @@
allusedtext show all consulted text
/g global global matching
mark show mark values
+ replace=<string> specify a replacement string
startchar show starting character when relevant
.sp
These modifiers may not appear in a \fB#pattern\fP command. If you want them as
@@ -719,6 +720,7 @@
offset=<n> set starting offset
ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit
+ replace=<string> specify a replacement string
startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated
.sp
@@ -797,6 +799,29 @@
function.
.
.
+.SS "Finding all matches in a string"
+.rs
+.sp
+Searching for all possible matches within a subject can be requested by the
+\fBglobal\fP or \fBaltglobal\fP modifier. After finding a match, the matching
+function is called again to search the remainder of the subject. The difference
+between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
+\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
+to start searching at a new point within the entire string (which is what Perl
+does), whereas the latter passes over a shortened substring. This makes a
+difference to the matching process if the pattern begins with a lookbehind
+assertion (including \eb or \eB).
+.P
+If an empty string is matched, the next match is done with the
+PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
+another, non-empty, match at the same point in the subject. If this match
+fails, the start offset is advanced, and the normal match is retried. This
+imitates the way Perl handles such cases when using the \fB/g\fP modifier or
+the \fBsplit()\fP function. Normally, the start offset is advanced by one
+character, but if the newline convention recognizes CRLF as a newline, and the
+current character is CR followed by LF, an advance of two is used.
+.
+.
.SS "Testing substring extraction functions"
.rs
.sp
@@ -821,27 +846,38 @@
parentheses after each substring.
.
.
-.SS "Finding all matches in a string"
+.SS "Testing the substitution function"
.rs
.sp
-Searching for all possible matches within a subject can be requested by the
-\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
-function is called again to search the remainder of the subject. The difference
-between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
-\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
-to start searching at a new point within the entire string (which is what Perl
-does), whereas the latter passes over a shortened substring. This makes a
-difference to the matching process if the pattern begins with a lookbehind
-assertion (including \eb or \eB).
+If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is
+called instead of one of the matching functions. Unlike subject strings,
+\fBpcre2test\fP does not process replacement strings for escape sequences. In
+UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
+If so, it is correctly converted to a UTF string of the appropriate code unit
+width. If it is not a valid UTF-8 string, the individual code units are copied
+directly. This provides a means of passing an invalid UTF-8 string for testing
+purposes.
.P
-If an empty string is matched, the next match is done with the
-PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
-another, non-empty, match at the same point in the subject. If this match
-fails, the start offset is advanced, and the normal match is retried. This
-imitates the way Perl handles such cases when using the \fB/g\fP modifier or
-the \fBsplit()\fP function. Normally, the start offset is advanced by one
-character, but if the newline convention recognizes CRLF as a newline, and the
-current character is CR followed by LF, an advance of two is used.
+If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
+\fBpcre2_substitute()\fP. After a successful substitution, the modified string
+is output, preceded by the number of replacements. This may be zero if there
+were no matches. Here is a simple example of a substitution test:
+.sp
+ /abc/replace=xxx
+ =abc=abc=
+ 1: =xxx=abc=
+ =abc=abc=\=global
+ 2: =xxx=xxx=
+.sp
+Subject and replacement strings should be kept relatively short for
+substitution tests, as fixed-size buffers are used. To make it easy to test for
+buffer overflow, if the replacement string starts with a number in square
+brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
+output buffer, with the replacement string starting at the next character.
+.P
+A replacement string is ignored with POSIX and DFA matching. Specifying partial
+matching provokes an error return ("bad option value") from
+\fBpcre2_substitute()\fP.
.
.
.SS "Setting the JIT stack size"
@@ -1200,6 +1236,6 @@
.rs
.sp
.nf
-Last updated: 09 November 2014
+Last updated: 12 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
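
Note on the bracketed size prefix documented above: the following is a minimal
C sketch of how a leading "[<number>]" could be split from a replacement
string. It is not the parsing code in pcre2test.c. The name parse_size_prefix
and its signature are hypothetical, and the sketch assumes that a string
without a well-formed prefix is passed through unchanged.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not taken from pcre2test.c. If s begins with
   "[<digits>]", store the number in *size and return a pointer to the
   character after ']'. Otherwise leave *size alone and return s. */
const char *parse_size_prefix(const char *s, size_t *size)
{
  const char *p = s;
  size_t n = 0;

  if (*p != '[') return s;
  p++;
  if (!isdigit((unsigned char)*p)) return s;   /* need at least one digit */
  while (isdigit((unsigned char)*p)) n = n * 10 + (size_t)(*p++ - '0');
  if (*p != ']') return s;                     /* malformed: not a prefix */
  *size = n;
  return p + 1;
}
```

Given "[10]xyz", the sketch stores 10 in *size and returns a pointer to "xyz".
A small number can then be used to provoke an output buffer overflow on
purpose.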
Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/src/pcre2_error.c 2014-11-12 16:57:56 UTC (rev 144)
@@ -102,7 +102,7 @@
/* 30 */
"unknown POSIX class name\0"
"internal error in pcre2_study(): should not occur\0"
- "this version of PCRE does not have UTF or Unicode property support\0"
+ "this version of PCRE2 does not have Unicode support\0"
"parentheses are too deeply nested (stack check)\0"
"character code point value in \\x{} or \\o{} is too large\0"
/* 35 */
@@ -118,7 +118,7 @@
"two named subpatterns have the same name (PCRE2_DUPNAMES not set)\0"
"group name must start with a non-digit\0"
/* 45 */
- "this version of PCRE does not have support for \\P, \\p, or \\X\0"
+ "this version of PCRE2 does not have support for \\P, \\p, or \\X\0"
"malformed \\P or \\p sequence\0"
"unknown property name after \\P or \\p\0"
"subpattern name is too long (maximum " XSTRING(MAX_NAME_SIZE) " characters)\0"
Modified: code/trunk/src/pcre2_valid_utf.c
===================================================================
--- code/trunk/src/pcre2_valid_utf.c 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/src/pcre2_valid_utf.c 2014-11-12 16:57:56 UTC (rev 144)
@@ -40,14 +40,16 @@
/* This module contains an internal function for validating UTF character
-strings. */
+strings. This file is also #included by the pcre2test program, which uses
+macros to change names from _pcre2_xxx to xxx, thereby avoiding name clashes
+with the library. In this case, PCRE2_PCRE2TEST is defined. */
-
+#ifndef PCRE2_PCRE2TEST /* We're compiling the library */
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif
-
#include "pcre2_internal.h"
+#endif /* PCRE2_PCRE2TEST */
#ifndef SUPPORT_UNICODE
Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/src/pcre2test.c 2014-11-12 16:57:56 UTC (rev 144)
@@ -165,9 +165,14 @@
#define DEFAULT_OVECCOUNT 15 /* Default ovector count */
#define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */
#define LOOPREPEAT 500000 /* Default loop count for timing */
-#define REPLACE_BUFFSIZE 400 /* For replacement strings */
+#define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */
+/* Make sure the buffer into which replacement strings are copied is big enough
+to hold them as 32-bit code units. */
+
+#define REPLACE_BUFFSIZE (4*REPLACE_MODSIZE)
+
/* Execution modes */
#define PCRE8_MODE 8
@@ -258,6 +263,20 @@
#define PCRE2_SUFFIX(a) a
+/* We need to be able to check input text for UTF-8 validity, whatever code
+widths are actually available, because the input to pcre2test is always in
+8-bit code units. So we include the UTF validity checking function for 8-bit
+code units. */
+
+extern int valid_utf(PCRE2_SPTR8, PCRE2_SIZE, PCRE2_SIZE *);
+
+#define PCRE2_CODE_UNIT_WIDTH 8
+#undef PCRE2_SPTR
+#define PCRE2_SPTR PCRE2_SPTR8
+#include "pcre2_valid_utf.c"
+#undef PCRE2_CODE_UNIT_WIDTH
+#undef PCRE2_SPTR
+
/* If we have 8-bit support, default to it; if there is also 16- or 32-bit
support, it can be selected by a command-line option. If there is no 8-bit
support, there must be 16- or 32-bit support, so default to one of them. The
@@ -369,15 +388,20 @@
CTL_MARK|\
CTL_MEMORY|\
CTL_STARTCHAR)
+
+/* Structures for holding modifier information for patterns and subject strings
+(data). Fields containing modifiers that can be set either for a pattern or a
+subject must be at the start and in the same order in both cases so that the
+same offset in the big table below works for both. */
typedef struct patctl { /* Structure for pattern modifiers. */
uint32_t options; /* Must be in same position as datctl */
uint32_t control; /* Must be in same position as datctl */
+ uint8_t replacement[REPLACE_MODSIZE]; /* So must this */
uint32_t jit;
uint32_t stackguard_test;
uint32_t tables_id;
uint8_t locale[32];
- uint8_t replacement[REPLACE_BUFFSIZE];
} patctl;
#define MAXCPYGET 10
@@ -386,6 +410,7 @@
typedef struct datctl { /* Structure for data line modifiers. */
uint32_t options; /* Must be in same position as patctl */
uint32_t control; /* Must be in same position as patctl */
+ uint8_t replacement[REPLACE_MODSIZE]; /* So must this */
uint32_t cfail[2];
int32_t callout_data;
int32_t copy_numbers[MAXCPYGET];
@@ -487,7 +512,7 @@
{ "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) },
{ "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
{ "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) },
- { "replace", MOD_PAT, MOD_STR, 0, PO(replacement) },
+ { "replace", MOD_PND, MOD_STR, 0, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
@@ -4211,13 +4236,14 @@
/* Copy the default context and data control blocks to the active ones. Then
copy from the pattern the controls that can be set in either the pattern or the
-data. This allows them to be unset in the data line. We do not do this for
+data. This allows them to be overridden in the data line. We do not do this for
options because those that are common apply separately to compiling and
matching. */
DATCTXCPY(dat_context, default_dat_context);
memcpy(&dat_datctl, &def_datctl, sizeof(datctl));
dat_datctl.control |= (pat_patctl.control & CTL_ALLPD);
+strcpy((char *)dat_datctl.replacement, (char *)pat_patctl.replacement);
/* Initialize for scanning the data line. */
@@ -4715,20 +4741,28 @@
PCRE2_MATCH_DATA_FREE(match_data);
PCRE2_MATCH_DATA_CREATE(match_data, max_oveccount, NULL);
}
+
+/* Replacement processing is ignored for DFA matching. */
+if (dat_datctl.replacement[0] != 0 && (dat_datctl.control & CTL_DFA) != 0)
+ {
+ fprintf(outfile, "** Ignored for DFA matching: replace\n");
+ dat_datctl.replacement[0] = 0;
+ }
+
/* If a replacement string is provided, call pcre2_substitute() instead of one
of the matching functions. First we have to convert the replacement string to
the appropriate width. */
-if (pat_patctl.replacement[0] != 0)
+if (dat_datctl.replacement[0] != 0)
{
int rc;
uint8_t *pr;
uint8_t rbuffer[REPLACE_BUFFSIZE];
uint8_t nbuffer[REPLACE_BUFFSIZE];
uint32_t goption;
- PCRE2_SIZE rlen;
- PCRE2_SIZE nsize;
+ PCRE2_SIZE rlen, nsize, erroroffset;
+ BOOL badutf = FALSE;
#ifdef SUPPORT_PCRE2_8
uint8_t *r8 = NULL;
@@ -4740,10 +4774,13 @@
uint32_t *r32 = NULL;
#endif
- goption = ((pat_patctl.control & CTL_GLOBAL) == 0)? 0 :
+ if (timeitm)
+ fprintf(outfile, "** Timing is not supported with replace: ignored\n");
+
+ goption = ((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
PCRE2_SUBSTITUTE_GLOBAL;
SETCASTPTR(r, rbuffer); /* Sets r8, r16, or r32, as appropriate. */
- pr = pat_patctl.replacement;
+ pr = dat_datctl.replacement;
/* If the replacement starts with '[<number>]' we interpret that as a length
value for the replacement buffer. */
@@ -4767,52 +4804,58 @@
nsize = n;
}
- /* Now copy the replacement string to a buffer of the appropriate width. */
+ /* Now copy the replacement string to a buffer of the appropriate width. No
+ escape processing is done for replacements. In UTF mode, check for an invalid
+ UTF-8 input string, and if it is invalid, just copy its code units without
+ UTF interpretation. This provides a means of checking that an invalid string
+ is detected. Otherwise, UTF-8 can be used to include wide characters in a
+ replacement. */
+
+ if (utf) badutf = valid_utf(pr, strlen((const char *)pr), &erroroffset);
- while ((c = *pr++) != 0)
+ /* Not UTF or invalid UTF-8: just copy the code units. */
+
+ if (!utf || badutf)
{
- if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); }
+ while ((c = *pr++) != 0)
+ {
+#ifdef SUPPORT_PCRE2_8
+ if (test_mode == PCRE8_MODE) *r8++ = c;
+#endif
+#ifdef SUPPORT_PCRE2_16
+ if (test_mode == PCRE16_MODE) *r16++ = c;
+#endif
+#ifdef SUPPORT_PCRE2_32
+ if (test_mode == PCRE32_MODE) *r32++ = c;
+#endif
+ }
+ }
+
+ /* Valid UTF-8 replacement string */
+
+ else while ((c = *pr++) != 0)
+ {
+ if (HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); }
- /* At present no escape processing is provided for replacements. */
-
#ifdef SUPPORT_PCRE2_8
- if (test_mode == PCRE8_MODE)
- {
- if (utf)
- {
- r8 += ord2utf8(c, r8);
- }
- else
- {
- *r8++ = c;
- }
- }
+ if (test_mode == PCRE8_MODE) r8 += ord2utf8(c, r8);
#endif
+
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE)
{
- if (utf)
+ if (c >= 0x10000u)
{
- if (c >= 0x10000u)
- {
- c-= 0x10000u;
- *r16++ = 0xD800 | (c >> 10);
- *r16++ = 0xDC00 | (c & 0x3ff);
- }
- else
- *r16++ = c;
+ c-= 0x10000u;
+ *r16++ = 0xD800 | (c >> 10);
+ *r16++ = 0xDC00 | (c & 0x3ff);
}
- else
- {
- *r16++ = c;
- }
+ else *r16++ = c;
}
#endif
+
#ifdef SUPPORT_PCRE2_32
- if (test_mode == PCRE32_MODE)
- {
- *r32++ = c;
- }
+ if (test_mode == PCRE32_MODE) *r32++ = c;
#endif
}
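
Note on the 16-bit branch above: code points of 0x10000 and above are written
as a surrogate pair. The standalone C sketch below isolates that arithmetic.
The function name encode_utf16 is hypothetical, and the sketch assumes the
input is a valid code point that is not itself a surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the 16-bit branch of the diff. Writes the
   UTF-16 encoding of code point c to out and returns the number of code
   units written (1 or 2). Assumes c is a valid, non-surrogate code point. */
size_t encode_utf16(uint32_t c, uint16_t *out)
{
  if (c >= 0x10000u)
    {
    c -= 0x10000u;
    out[0] = (uint16_t)(0xD800u | (c >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00u | (c & 0x3ffu));  /* low surrogate */
    return 2;
    }
  out[0] = (uint16_t)c;
  return 1;
}
```

For example, U+1F600 becomes the two code units 0xD83D 0xDE00, while U+0041
is stored unchanged as a single code unit.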
Modified: code/trunk/testdata/testinput10
===================================================================
--- code/trunk/testdata/testinput10 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testinput10 2014-11-12 16:57:56 UTC (rev 144)
@@ -444,4 +444,7 @@
/\x{3a3}B/IBi,utf
+/abc/utf,replace=\xC3
+ abc
+
# End of testinput10
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testinput2 2014-11-12 16:57:56 UTC (rev 144)
@@ -4067,6 +4067,12 @@
/abc/replace=xyz
1abc2\=partial_hard
+/abc/replace=xyz
+ 123abc456
+ 123abc456\=replace=pqr
+ 123abc456abc789
+ 123abc456abc789\=g
+
# End of substitute tests
# End of testinput2
Modified: code/trunk/testdata/testoutput10
===================================================================
--- code/trunk/testdata/testoutput10 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testoutput10 2014-11-12 16:57:56 UTC (rev 144)
@@ -1546,4 +1546,8 @@
Last code unit = 'B' (caseless)
Subject length lower bound = 2
+/abc/utf,replace=\xC3
+ abc
+Failed: error -3: UTF-8 error: 1 byte missing at end
+
# End of testinput10
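
Note on the expected output above: assuming the replacement in this test is
the single byte 0xC3, it is a two-byte UTF-8 lead byte with no continuation
byte after it, which is why one byte is reported missing. The C sketch below
shows only that truncation check. It is a simplified illustration, not the
validity function in pcre2_valid_utf.c, and the name
utf8_bytes_missing_at_end is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified check; it is not PCRE2's validity function.
   Returns how many bytes are missing from a sequence cut off by the end of
   the string, or 0 if no sequence is truncated. Other UTF-8 errors (stray
   continuation bytes, overlong forms, surrogates) are not detected. */
int utf8_bytes_missing_at_end(const uint8_t *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
    uint8_t c = s[i++];
    size_t extra;                          /* continuation bytes expected */
    if (c >= 0xF0 && c <= 0xF7) extra = 3;
    else if (c >= 0xE0 && c <= 0xEF) extra = 2;
    else if (c >= 0xC0 && c <= 0xDF) extra = 1;
    else extra = 0;
    if (len - i < extra) return (int)(extra - (len - i));
    i += extra;
    }
  return 0;
}
```

Called on the one-byte string 0xC3 the sketch returns 1, and on the complete
sequence 0xC3 0xA9 it returns 0.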
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testoutput2 2014-11-12 16:57:56 UTC (rev 144)
@@ -13689,6 +13689,16 @@
1abc2\=partial_hard
Failed: error -34: bad option value
+/abc/replace=xyz
+ 123abc456
+ 1: 123xyz456
+ 123abc456\=replace=pqr
+ 1: 123pqr456
+ 123abc456abc789
+ 1: 123xyz456abc789
+ 123abc456abc789\=g
+ 2: 123xyz456xyz789
+
# End of substitute tests
# End of testinput2
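
Note on the counts in the new testoutput2 lines: the number before each colon
is the number of replacements made, which is 1 without the global modifier
and 2 with it. The C sketch below reproduces those counts for a literal
search string and a caller-supplied output buffer. It is a stand-in for
illustration only: it uses strstr() rather than a compiled pattern, it does
not call pcre2_substitute(), and the name substitute_literal is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for illustration only: replaces a literal, non-empty
   string rather than a regular expression, and does not call
   pcre2_substitute(). Returns the number of replacements made, or -1 if the
   result (including its terminating zero) does not fit in outsize bytes. */
int substitute_literal(const char *subject, const char *find,
  const char *replacement, int global, char *out, size_t outsize)
{
  size_t flen = strlen(find), rlen = strlen(replacement), used = 0;
  int count = 0;

  if (flen == 0) return -1;
  while (*subject != 0)
    {
    /* Without "global", stop searching after the first replacement. */
    const char *m = (global || count == 0)? strstr(subject, find) : NULL;
    size_t n = (m == NULL)? strlen(subject) : (size_t)(m - subject);
    size_t add = (m == NULL)? n : n + rlen;

    if (used + add + 1 > outsize) return -1;   /* output buffer too small */
    memcpy(out + used, subject, n);            /* text before the match */
    used += n;
    if (m == NULL) break;
    memcpy(out + used, replacement, rlen);
    used += rlen;
    subject = m + flen;
    count++;
    }
  if (used + 1 > outsize) return -1;
  out[used] = 0;
  return count;
}
```

With the subject "123abc456abc789", search string "abc" and replacement "xyz",
the sketch returns 1 and "123xyz456abc789" without the global flag, and 2 and
"123xyz456xyz789" with it. A deliberately small outsize produces the -1
overflow result.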