[Pcre-svn] [144] code/trunk: Substitution tests and document…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [144] code/trunk: Substitution tests and documentation.
Revision: 144
          http://www.exim.org/viewvc/pcre2?view=rev&revision=144
Author:   ph10
Date:     2014-11-12 16:57:56 +0000 (Wed, 12 Nov 2014)


Log Message:
-----------
Substitution tests and documentation.

Modified Paths:
--------------
    code/trunk/doc/pcre2test.1
    code/trunk/src/pcre2_error.c
    code/trunk/src/pcre2_valid_utf.c
    code/trunk/src/pcre2test.c
    code/trunk/testdata/testinput10
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput10
    code/trunk/testdata/testoutput2


Modified: code/trunk/doc/pcre2test.1
===================================================================
--- code/trunk/doc/pcre2test.1    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/doc/pcre2test.1    2014-11-12 16:57:56 UTC (rev 144)
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "09 November 2014" "PCRE 10.00"
+.TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -645,6 +645,7 @@
       allusedtext         show all consulted text
   /g  global              global matching
       mark                show mark values
+      replace=<string>    specify a replacement string 
       startchar           show starting character when relevant
 .sp
 These modifiers may not appear in a \fB#pattern\fP command. If you want them as
@@ -719,6 +720,7 @@
       offset=<n>                set starting offset
       ovector=<n>               set size of output vector
       recursion_limit=<n>       set a recursion limit
+      replace=<string>          specify a replacement string 
       startchar                 show startchar when relevant
       zero_terminate            pass the subject as zero-terminated
 .sp
@@ -797,6 +799,29 @@
 function.
 .
 .
+.SS "Finding all matches in a string"
+.rs
+.sp
+Searching for all possible matches within a subject can be requested by the
+\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
+function is called again to search the remainder of the subject. The difference
+between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
+\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
+to start searching at a new point within the entire string (which is what Perl
+does), whereas the latter passes over a shortened substring. This makes a
+difference to the matching process if the pattern begins with a lookbehind
+assertion (including \eb or \eB).
+.P
+If an empty string is matched, the next match is done with the
+PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
+another, non-empty, match at the same point in the subject. If this match
+fails, the start offset is advanced, and the normal match is retried. This
+imitates the way Perl handles such cases when using the \fB/g\fP modifier or
+the \fBsplit()\fP function. Normally, the start offset is advanced by one
+character, but if the newline convention recognizes CRLF as a newline, and the
+current character is CR followed by LF, an advance of two is used.
+.
+.
 .SS "Testing substring extraction functions"
 .rs
 .sp
@@ -821,27 +846,38 @@
 parentheses after each substring.
 .
 .
-.SS "Finding all matches in a string"
+.SS "Testing the substitution function"
 .rs
 .sp
-Searching for all possible matches within a subject can be requested by the
-\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
-function is called again to search the remainder of the subject. The difference
-between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
-\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
-to start searching at a new point within the entire string (which is what Perl
-does), whereas the latter passes over a shortened substring. This makes a
-difference to the matching process if the pattern begins with a lookbehind
-assertion (including \eb or \eB).
+If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is 
+called instead of one of the matching functions. Unlike subject strings,
+\fBpcre2test\fP does not process replacement strings for escape sequences. In
+UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
+If so, it is correctly converted to a UTF string of the appropriate code unit
+width. If it is not a valid UTF-8 string, the individual code units are copied
+directly. This provides a means of passing an invalid UTF-8 string for testing
+purposes. 
 .P
-If an empty string is matched, the next match is done with the
-PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
-another, non-empty, match at the same point in the subject. If this match
-fails, the start offset is advanced, and the normal match is retried. This
-imitates the way Perl handles such cases when using the \fB/g\fP modifier or
-the \fBsplit()\fP function. Normally, the start offset is advanced by one
-character, but if the newline convention recognizes CRLF as a newline, and the
-current character is CR followed by LF, an advance of two is used.
+If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
+\fBpcre2_substitute()\fP. After a successful substitution, the modified string
+is output, preceded by the number of replacements. This may be zero if there
+were no matches. Here is a simple example of a substitution test:
+.sp
+  /abc/replace=xxx
+      =abc=abc=
+   1: =xxx=abc=
+      =abc=abc=\=global
+   2: =xxx=xxx=
+.sp
+Subject and replacement strings should be kept relatively short for 
+substitution tests, as fixed-size buffers are used. To make it easy to test for
+buffer overflow, if the replacement string starts with a number in square 
+brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the 
+output buffer, with the replacement string starting at the next character.
+.P
+A replacement string is ignored with POSIX and DFA matching. Specifying partial 
+matching provokes an error return ("bad option value") from
+\fBpcre2_substitute()\fP.
 .
 .
 .SS "Setting the JIT stack size"
@@ -1200,6 +1236,6 @@
 .rs
 .sp
 .nf
-Last updated: 09 November 2014
+Last updated: 12 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi
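
Note on the "Finding all matches in a string" text above: the rule for
advancing the start offset after an empty match (one character, or two when
CRLF is a recognized newline and the current position is at CR LF) can be
sketched roughly as below. This is only an illustration under the assumption
of 8-bit, non-UTF subjects, where one character is one code unit. The function
name and its parameters are hypothetical and do not appear in pcre2test.c.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from pcre2test.c. Returns the offset at
   which the next normal match attempt should start after an empty match at
   `offset` could not be followed by a non-empty match at the same point.
   Assumes 8-bit, non-UTF code units, so one character is one byte. */
size_t advance_after_empty_match(const unsigned char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
  if (offset >= length) return length;        /* nothing left to scan */
  if (crlf_is_newline &&
      subject[offset] == '\r' &&
      offset + 1 < length &&
      subject[offset + 1] == '\n')
    return offset + 2;                        /* step over CR LF as a unit */
  return offset + 1;                          /* ordinary one-character step */
}
```

In UTF mode the one-character step would have to cover a whole multi-unit
character, which this sketch deliberately leaves out.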


Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/src/pcre2_error.c    2014-11-12 16:57:56 UTC (rev 144)
@@ -102,7 +102,7 @@
   /* 30 */
   "unknown POSIX class name\0"
   "internal error in pcre2_study(): should not occur\0"
-  "this version of PCRE does not have UTF or Unicode property support\0"
+  "this version of PCRE2 does not have Unicode support\0"
   "parentheses are too deeply nested (stack check)\0"
   "character code point value in \\x{} or \\o{} is too large\0"
   /* 35 */
@@ -118,7 +118,7 @@
   "two named subpatterns have the same name (PCRE2_DUPNAMES not set)\0"
   "group name must start with a non-digit\0"
   /* 45 */
-  "this version of PCRE does not have support for \\P, \\p, or \\X\0"
+  "this version of PCRE2 does not have support for \\P, \\p, or \\X\0"
   "malformed \\P or \\p sequence\0"
   "unknown property name after \\P or \\p\0"
   "subpattern name is too long (maximum " XSTRING(MAX_NAME_SIZE) " characters)\0"


Modified: code/trunk/src/pcre2_valid_utf.c
===================================================================
--- code/trunk/src/pcre2_valid_utf.c    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/src/pcre2_valid_utf.c    2014-11-12 16:57:56 UTC (rev 144)
@@ -40,14 +40,16 @@



/* This module contains an internal function for validating UTF character
-strings. */
+strings. This file is also #included by the pcre2test program, which uses
+macros to change names from _pcre2_xxx to xxxx, thereby avoiding name clashes
+with the library. In this case, PCRE2_PCRE2TEST is defined. */

-
+#ifndef PCRE2_PCRE2TEST           /* We're compiling the library */
 #ifdef HAVE_CONFIG_H
 #include "config.h"
 #endif
-
 #include "pcre2_internal.h"
+#endif /* PCRE2_PCRE2TEST */



#ifndef SUPPORT_UNICODE

Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/src/pcre2test.c    2014-11-12 16:57:56 UTC (rev 144)
@@ -165,9 +165,14 @@
 #define DEFAULT_OVECCOUNT 15    /* Default ovector count */
 #define JUNK_OFFSET 0xdeadbeef  /* For initializing ovector */
 #define LOOPREPEAT 500000       /* Default loop count for timing */
-#define REPLACE_BUFFSIZE 400    /* For replacement strings */
+#define REPLACE_MODSIZE 96      /* Field for reading 8-bit replacement */
 #define VERSION_SIZE 64         /* Size of buffer for the version strings */


+/* Make sure the buffer into which replacement strings are copied is big enough
+to hold them as 32-bit code units. */
+
+#define REPLACE_BUFFSIZE (4*REPLACE_MODSIZE)
+
/* Execution modes */

#define PCRE8_MODE 8
@@ -258,6 +263,20 @@

#define PCRE2_SUFFIX(a) a

+/* We need to be able to check input text for UTF-8 validity, whatever code 
+widths are actually available, because the input to pcre2test is always in 
+8-bit code units. So we include the UTF validity checking function for 8-bit 
+code units. */
+
+extern int valid_utf(PCRE2_SPTR8, PCRE2_SIZE, PCRE2_SIZE *);
+
+#define  PCRE2_CODE_UNIT_WIDTH 8
+#undef   PCRE2_SPTR
+#define  PCRE2_SPTR PCRE2_SPTR8
+#include "pcre2_valid_utf.c"
+#undef   PCRE2_CODE_UNIT_WIDTH
+#undef   PCRE2_SPTR
+
 /* If we have 8-bit support, default to it; if there is also 16-or 32-bit
 support, it can be selected by a command-line option. If there is no 8-bit
 support, there must be 16- or 32-bit support, so default to one of them. The
@@ -369,15 +388,20 @@
                     CTL_MARK|\
                     CTL_MEMORY|\
                     CTL_STARTCHAR)
+                    
+/* Structures for holding modifier information for patterns and subject strings 
+(data). Fields containing modifiers that can be set either for a pattern or a 
+subject must be at the start and in the same order in both cases so that the 
+same offset in the big table below works for both. */


 typedef struct patctl {    /* Structure for pattern modifiers. */
   uint32_t  options;       /* Must be in same position as datctl */
   uint32_t  control;       /* Must be in same position as datctl */
+   uint8_t  replacement[REPLACE_MODSIZE];  /* So must this */
   uint32_t  jit;
   uint32_t  stackguard_test;
   uint32_t  tables_id;
   uint8_t   locale[32];
-  uint8_t   replacement[REPLACE_BUFFSIZE];
 } patctl;


 #define MAXCPYGET 10
@@ -386,6 +410,7 @@
 typedef struct datctl {    /* Structure for data line modifiers. */
   uint32_t  options;       /* Must be in same position as patctl */
   uint32_t  control;       /* Must be in same position as patctl */
+   uint8_t  replacement[REPLACE_MODSIZE];  /* So must this */
   uint32_t  cfail[2];
    int32_t  callout_data;
    int32_t  copy_numbers[MAXCPYGET];
@@ -487,7 +512,7 @@
   { "posix",               MOD_PAT,  MOD_CTL, CTL_POSIX,                 PO(control) },
   { "ps",                  MOD_DAT,  MOD_OPT, PCRE2_PARTIAL_SOFT,        DO(options) },
   { "recursion_limit",     MOD_CTM,  MOD_INT, 0,                         MO(recursion_limit) },
-  { "replace",             MOD_PAT,  MOD_STR, 0,                         PO(replacement) },
+  { "replace",             MOD_PND,  MOD_STR, 0,                         PO(replacement) },
   { "stackguard",          MOD_PAT,  MOD_INT, 0,                         PO(stackguard_test) },
   { "startchar",           MOD_PND,  MOD_CTL, CTL_STARTCHAR,             PO(control) },
   { "tables",              MOD_PAT,  MOD_INT, 0,                         PO(tables_id) },
@@ -4211,13 +4236,14 @@


/* Copy the default context and data control blocks to the active ones. Then
copy from the pattern the controls that can be set in either the pattern or the
-data. This allows them to be unset in the data line. We do not do this for
+data. This allows them to be overridden in the data line. We do not do this for
options because those that are common apply separately to compiling and
matching. */

DATCTXCPY(dat_context, default_dat_context);
memcpy(&dat_datctl, &def_datctl, sizeof(datctl));
dat_datctl.control |= (pat_patctl.control & CTL_ALLPD);
+strcpy((char *)dat_datctl.replacement, (char *)pat_patctl.replacement);

/* Initialize for scanning the data line. */

@@ -4715,20 +4741,28 @@
PCRE2_MATCH_DATA_FREE(match_data);
PCRE2_MATCH_DATA_CREATE(match_data, max_oveccount, NULL);
}
+
+/* Replacement processing is ignored for DFA matching. */

+if (dat_datctl.replacement[0] != 0 && (dat_datctl.control & CTL_DFA) != 0)
+ {
+ fprintf(outfile, "** Ignored for DFA matching: replace\n");
+ dat_datctl.replacement[0] = 0;
+ }
+
/* If a replacement string is provided, call pcre2_substitute() instead of one
of the matching functions. First we have to convert the replacement string to
the appropriate width. */

-if (pat_patctl.replacement[0] != 0)
+if (dat_datctl.replacement[0] != 0)
{
int rc;
uint8_t *pr;
uint8_t rbuffer[REPLACE_BUFFSIZE];
uint8_t nbuffer[REPLACE_BUFFSIZE];
uint32_t goption;
- PCRE2_SIZE rlen;
- PCRE2_SIZE nsize;
+ PCRE2_SIZE rlen, nsize, erroroffset;
+ BOOL badutf = FALSE;

#ifdef SUPPORT_PCRE2_8
uint8_t *r8 = NULL;
@@ -4740,10 +4774,13 @@
uint32_t *r32 = NULL;
#endif

-  goption = ((pat_patctl.control & CTL_GLOBAL) == 0)? 0 :
+  if (timeitm)
+    fprintf(outfile, "** Timing is not supported with replace: ignored\n"); 
+
+  goption = ((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
     PCRE2_SUBSTITUTE_GLOBAL;
   SETCASTPTR(r, rbuffer);  /* Sets r8, r16, or r32, as appropriate. */
-  pr = pat_patctl.replacement;
+  pr = dat_datctl.replacement;


   /* If the replacement starts with '[<number>]' we interpret that as length
   value for the replacement buffer. */
@@ -4767,52 +4804,58 @@
     nsize = n;
     }


- /* Now copy the replacement string to a buffer of the appropriate width. */
+ /* Now copy the replacement string to a buffer of the appropriate width. No
+ escape processing is done for replacements. In UTF mode, check for an invalid
+ UTF-8 input string, and if it is invalid, just copy its code units without
+ UTF interpretation. This provides a means of checking that an invalid string
+ is detected. Otherwise, UTF-8 can be used to include wide characters in a
+ replacement. */
+
+ if (utf) badutf = valid_utf(pr, strlen((const char *)pr), &erroroffset);

-  while ((c = *pr++) != 0)
+  /* Not UTF or invalid UTF-8: just copy the code units. */
+  
+  if (!utf || badutf)
     {
-    if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); }
+    while ((c = *pr++) != 0)
+      { 
+#ifdef SUPPORT_PCRE2_8
+      if (test_mode == PCRE8_MODE) *r8++ = c;
+#endif
+#ifdef SUPPORT_PCRE2_16
+      if (test_mode == PCRE16_MODE) *r16++ = c;
+#endif
+#ifdef SUPPORT_PCRE2_32
+      if (test_mode == PCRE32_MODE) *r32++ = c;
+#endif
+      }
+    }
+    
+  /* Valid UTF-8 replacement string */
+        
+  else while ((c = *pr++) != 0)
+    {
+    if (HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); }


-    /* At present no escape processing is provided for replacements. */
-
 #ifdef SUPPORT_PCRE2_8
-    if (test_mode == PCRE8_MODE)
-      {
-      if (utf)
-        {
-        r8 += ord2utf8(c, r8);
-        }
-      else
-        {
-        *r8++ = c;
-        }
-      }
+    if (test_mode == PCRE8_MODE) r8 += ord2utf8(c, r8);
 #endif
+
 #ifdef SUPPORT_PCRE2_16
     if (test_mode == PCRE16_MODE)
       {
-      if (utf)
+      if (c >= 0x10000u)
         {
-        if (c >= 0x10000u)
-          {
-          c-= 0x10000u;
-          *r16++ = 0xD800 | (c >> 10);
-          *r16++ = 0xDC00 | (c & 0x3ff);
-          }
-        else
-          *r16++ = c;
+        c-= 0x10000u;
+        *r16++ = 0xD800 | (c >> 10);
+        *r16++ = 0xDC00 | (c & 0x3ff);
         }
-      else
-        {
-        *r16++ = c;
-        }
+      else *r16++ = c;
       }
 #endif
+
 #ifdef SUPPORT_PCRE2_32
-    if (test_mode == PCRE32_MODE)
-      {
-      *r32++ = c;
-      }
+    if (test_mode == PCRE32_MODE) *r32++ = c;
 #endif
     }
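
Note on the 16-bit branch above: for a valid UTF-8 replacement, code points
of 0x10000 and above are written as a surrogate pair. The same arithmetic is
shown in isolation below as a small stand-alone sketch. The function name is
hypothetical, and the input is assumed to be a valid code point (at most
0x10FFFF and not itself a surrogate), which the sketch does not check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of pcre2test.c. Writes code point c to out
   as UTF-16 and returns the number of 16-bit units written (1 or 2).
   Assumes c is a valid code point; no range checking is done. */
size_t put_utf16(uint32_t c, uint16_t *out)
{
  if (c >= 0x10000u)
    {
    c -= 0x10000u;                              /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800u | (c >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00u | (c & 0x3ffu)); /* low surrogate: low 10 bits */
    return 2;
    }
  out[0] = (uint16_t)c;                         /* BMP character: one unit */
  return 1;
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, while U+03A3 is stored
as the single unit 0x03A3.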



Modified: code/trunk/testdata/testinput10
===================================================================
--- code/trunk/testdata/testinput10    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testinput10    2014-11-12 16:57:56 UTC (rev 144)
@@ -444,4 +444,7 @@


/\x{3a3}B/IBi,utf

+/abc/utf,replace=\xC3
+ abc
+
# End of testinput10

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testinput2    2014-11-12 16:57:56 UTC (rev 144)
@@ -4067,6 +4067,12 @@
 /abc/replace=xyz
     1abc2\=partial_hard


+/abc/replace=xyz
+    123abc456
+    123abc456\=replace=pqr
+    123abc456abc789
+    123abc456abc789\=g
+
 # End of substitute tests 


# End of testinput2

Modified: code/trunk/testdata/testoutput10
===================================================================
--- code/trunk/testdata/testoutput10    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testoutput10    2014-11-12 16:57:56 UTC (rev 144)
@@ -1546,4 +1546,8 @@
 Last code unit = 'B' (caseless)
 Subject length lower bound = 2


+/abc/utf,replace=\xC3
+ abc
+Failed: error -3: UTF-8 error: 1 byte missing at end
+
# End of testinput10
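
Note on the expected output above: the replacement \xC3 is a two-byte UTF-8
lead byte with nothing after it, hence "1 byte missing at end". A much
simplified sketch of that one check follows. It is not the code in
pcre2_valid_utf.c: the function name is hypothetical, only truncation at the
end of the string is detected, and lead bytes of 0xF8 and above are treated
like four-byte leads for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified check; not PCRE2's validator. Scans s and returns
   how many continuation bytes are missing from a character that is cut off
   by the end of the string, or 0 if no character is truncated. Other kinds
   of malformed UTF-8 are not detected. */
int utf8_bytes_missing_at_end(const uint8_t *s, size_t length)
{
  size_t i = 0;
  while (i < length)
    {
    uint8_t c = s[i];
    size_t remaining = length - i - 1;   /* bytes available after the lead */
    size_t extra;                        /* continuation bytes announced */
    if (c < 0xC0) extra = 0;             /* ASCII or a stray continuation */
    else if (c < 0xE0) extra = 1;
    else if (c < 0xF0) extra = 2;
    else extra = 3;
    if (extra > remaining) return (int)(extra - remaining);
    i += extra + 1;
    }
  return 0;
}
```

Called on the single byte 0xC3 this returns 1, matching the count in the
message above; on the complete sequence 0xC3 0xA9 it returns 0.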

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2014-11-11 16:51:07 UTC (rev 143)
+++ code/trunk/testdata/testoutput2    2014-11-12 16:57:56 UTC (rev 144)
@@ -13689,6 +13689,16 @@
     1abc2\=partial_hard
 Failed: error -34: bad option value


+/abc/replace=xyz
+    123abc456
+ 1: 123xyz456
+    123abc456\=replace=pqr
+ 1: 123pqr456
+    123abc456abc789
+ 1: 123xyz456abc789
+    123abc456abc789\=g
+ 2: 123xyz456xyz789
+
 # End of substitute tests 


# End of testinput2