Revision: 247
http://www.exim.org/viewvc/pcre2?view=rev&revision=247
Author: ph10
Date: 2015-04-13 18:29:05 +0100 (Mon, 13 Apr 2015)
Log Message:
-----------
Implement PCRE2_NEVER_BACKSLASH_C.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre2.3
code/trunk/doc/pcre2_compile.3
code/trunk/doc/pcre2api.3
code/trunk/doc/pcre2build.3
code/trunk/doc/pcre2pattern.3
code/trunk/doc/pcre2syntax.3
code/trunk/doc/pcre2test.1
code/trunk/src/pcre2.h.in
code/trunk/src/pcre2_compile.c
code/trunk/src/pcre2_error.c
code/trunk/src/pcre2test.c
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput2
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/ChangeLog 2015-04-13 17:29:05 UTC (rev 247)
@@ -85,7 +85,9 @@
example, /((?2){73}(?2))((?1))/. A better mutual recursion detection method has
been implemented. This infelicity was discovered by the LLVM fuzzer.
+21. Implemented PCRE2_NEVER_BACKSLASH_C.
+
Version 10.10 06-March-2015
---------------------------
Modified: code/trunk/doc/pcre2.3
===================================================================
--- code/trunk/doc/pcre2.3 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2.3 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2 3 "18 November 2014" "PCRE2 10.00"
+.TH PCRE2 3 "13 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH INTRODUCTION
@@ -103,14 +103,24 @@
.P
One way of guarding against this possibility is to use the
\fBpcre2_pattern_info()\fP function to check the compiled pattern's options for
-UTF. Alternatively, you can set the PCRE2_NEVER_UTF option at compile time.
-This causes an compile time error if a pattern contains a UTF-setting sequence.
+PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
+\fBpcre2_compile()\fP. This causes a compile-time error if a pattern contains
+a UTF-setting sequence.
.P
+The use of Unicode properties for character types such as \ed can also be
+enabled from within the pattern, by specifying "(*UCP)". This feature can be
+disallowed by setting the PCRE2_NEVER_UCP option.
+.P
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
running redundant checks.
.P
+The use of the \eC escape sequence in a UTF-8 or UTF-16 pattern can lead to
+problems, because it may leave the current matching point in the middle of a
+multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
+lock out the use of \eC, causing a compile-time error if it is encountered.
+.P
Another way that performance can be hit is by running a pattern that has a very
large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
@@ -177,6 +187,6 @@
.rs
.sp
.nf
-Last updated: 18 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 13 April 2015
+Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2_compile.3
===================================================================
--- code/trunk/doc/pcre2_compile.3 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2_compile.3 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2_COMPILE 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2_COMPILE 3 "13 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -46,6 +46,7 @@
PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data
+ PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2api.3 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "23 March 2015" "PCRE2 10.20"
+.TH PCRE2API 3 "13 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@@ -1150,23 +1150,31 @@
(?m) option setting. If there are no newlines in a subject string, or no
occurrences of ^ or $ in a pattern, setting PCRE2_MULTILINE has no effect.
.sp
+ PCRE2_NEVER_BACKSLASH_C
+.sp
+This option locks out the use of \eC in the pattern that is being compiled.
+This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
+it may leave the current matching point in the middle of a multi-code-unit
+character. This option may be useful in applications that process patterns from
+external sources.
+.sp
PCRE2_NEVER_UCP
.sp
This option locks out the use of Unicode properties for handling \eB, \eb, \eD,
\ed, \eS, \es, \eW, \ew, and some of the POSIX character classes, as described
for the PCRE2_UCP option below. In particular, it prevents the creator of the
pattern from enabling this facility by starting the pattern with (*UCP). This
-may be useful in applications that process patterns from external sources. The
-option combination PCRE_UCP and PCRE_NEVER_UCP causes an error.
+option may be useful in applications that process patterns from external
+sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
.sp
PCRE2_NEVER_UTF
.sp
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
UTF-32, depending on which library is in use. In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
-pattern with (*UTF). This may be useful in applications that process patterns
-from external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF causes
-an error.
+pattern with (*UTF). This option may be useful in applications that process
+patterns from external sources. The combination of PCRE2_UTF and
+PCRE2_NEVER_UTF causes an error.
.sp
PCRE2_NO_AUTO_CAPTURE
.sp
@@ -2919,6 +2927,6 @@
.rs
.sp
.nf
-Last updated: 23 March 2015
+Last updated: 13 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2build.3
===================================================================
--- code/trunk/doc/pcre2build.3 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2build.3 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "13 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@@ -132,6 +132,11 @@
properties. The application can request that they do by setting the PCRE2_UCP
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
request this by starting with (*UCP).
+.P
+The \eC escape sequence, which matches a single code unit, even in a UTF mode,
+can cause unpredictable behaviour because it may leave the current matching
+point in the middle of a multi-code-unit character. It can be locked out by
+setting the PCRE2_NEVER_BACKSLASH_C option.
.
.
.SH "JUST-IN-TIME COMPILER SUPPORT"
@@ -494,6 +499,6 @@
.rs
.sp
.nf
-Last updated: 26 January 2015
+Last updated: 13 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2pattern.3 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "15 March 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "13 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -1200,13 +1200,16 @@
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
-but it is unclear how it can usefully be used. Because \eC breaks up characters
-into individual code units, matching one unit with \eC in a UTF mode means that
-the rest of the string may start with a malformed UTF character. This has
-undefined results, because PCRE2 assumes that it is dealing with valid UTF
-strings (and by default it checks this at the start of processing unless the
-PCRE2_NO_UTF_CHECK option is used).
+but it is unclear how it can usefully be used.
.P
+Because \eC breaks up characters into individual code units, matching one unit
+with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
+with a malformed UTF character. This has undefined results, because PCRE2
+assumes that it is matching character by character in a valid UTF string (by
+default it checks the subject string's validity at the start of processing
+unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
+use of \eC by setting the PCRE2_NEVER_BACKSLASH_C option.
+.P
PCRE2 does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
.\" </a>
@@ -3329,6 +3332,6 @@
.rs
.sp
.nf
-Last updated: 15 March 2015
+Last updated: 13 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2syntax.3 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "15 March 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "13 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -44,7 +44,7 @@
.sp
. any character except newline;
in dotall mode, any character whatsoever
- \eC one data unit, even in UTF mode (best avoided)
+ \eC one code unit, even in UTF mode (best avoided)
\ed a decimal digit
\eD a character that is not a decimal digit
\eh a horizontal white space character
@@ -61,6 +61,10 @@
\eW a "non-word" character
\eX a Unicode extended grapheme cluster
.sp
+The application can lock out the use of \eC by setting the
+PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
+current matching point in the middle of a UTF-8 or UTF-16 character.
+.P
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \es and \ew may also match characters with code points in the range
@@ -396,7 +400,9 @@
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_match(), not increase them.
+limits set by the caller of pcre2_match(), not increase them. The application
+can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
+PCRE2_NEVER_UCP options, respectively, at compile time.
.
.
.SH "NEWLINE CONVENTION"
@@ -543,6 +549,6 @@
.rs
.sp
.nf
-Last updated: 15 March 2015
+Last updated: 13 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2test.1
===================================================================
--- code/trunk/doc/pcre2test.1 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/doc/pcre2test.1 2015-04-13 17:29:05 UTC (rev 247)
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "22 March 2015" "PCRE 10.20"
+.TH PCRE2TEST 1 "13 April 2015" "PCRE 10.20"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -447,6 +447,7 @@
firstline set PCRE2_FIRSTLINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
/m multiline set PCRE2_MULTILINE
+ never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
@@ -1443,6 +1444,6 @@
.rs
.sp
.nf
-Last updated: 22 March 2015
+Last updated: 13 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/src/pcre2.h.in
===================================================================
--- code/trunk/src/pcre2.h.in 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/src/pcre2.h.in 2015-04-13 17:29:05 UTC (rev 247)
@@ -118,6 +118,7 @@
#define PCRE2_UCP 0x00020000u /* C J M D */
#define PCRE2_UNGREEDY 0x00040000u /* C */
#define PCRE2_UTF 0x00080000u /* C J M D */
+#define PCRE2_NEVER_BACKSLASH_C 0x00100000u /* C */
/* These are for pcre2_jit_compile(). */
Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/src/pcre2_compile.c 2015-04-13 17:29:05 UTC (rev 247)
@@ -556,9 +556,10 @@
(PCRE2_ANCHORED|PCRE2_ALLOW_EMPTY_CLASS|PCRE2_ALT_BSUX|PCRE2_AUTO_CALLOUT| \
PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
- PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
- PCRE2_NO_AUTO_POSSESS|PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE| \
- PCRE2_NO_UTF_CHECK|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
+ PCRE2_MULTILINE|PCRE2_NEVER_BACKSLASH_C|PCRE2_NEVER_UCP| \
+ PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE|PCRE2_NO_AUTO_POSSESS| \
+ PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK| \
+ PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
/* Compile time error code numbers. They are given names so that they can more
easily be tracked. When a new number is added, the tables called eint1 and
@@ -574,7 +575,7 @@
ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59, ERR60,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
- ERR81, ERR82 };
+ ERR81, ERR82, ERR83 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@@ -6676,6 +6677,14 @@
}
#endif
+ /* The use of \C can be locked out. */
+
+ else if (escape == ESC_C && (options & PCRE2_NEVER_BACKSLASH_C) != 0)
+ {
+ *errorcodeptr = ERR83;
+ goto FAILED;
+ }
+
/* For the rest (including \X when Unicode properties are supported), we
can obtain the OP value by negating the escape value in the default
situation when PCRE2_UCP is not set. When it *is* set, we substitute
Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/src/pcre2_error.c 2015-04-13 17:29:05 UTC (rev 247)
@@ -84,7 +84,7 @@
/* 15 */
"reference to non-existent subpattern\0"
"pattern passed as NULL\0"
- "unknown compile-time option bit(s)\0"
+ "unrecognised compile-time option bit(s)\0"
"missing ) after (?# comment\0"
"parentheses are too deeply nested\0"
/* 20 */
@@ -163,6 +163,7 @@
"internal error: unknown opcode in auto_possessify()\0"
"missing terminating delimiter for callout with string argument\0"
"unrecognized string delimiter follows (?C\0"
+ "using \\C is disabled by the application\0"
;
/* Match-time and UTF error texts are in the same format. */
Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/src/pcre2test.c 2015-04-13 17:29:05 UTC (rev 247)
@@ -525,6 +525,7 @@
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
{ "memory", MOD_PD, MOD_CTL, CTL_MEMORY, PD(control) },
{ "multiline", MOD_PATP, MOD_OPT, PCRE2_MULTILINE, PO(options) },
+ { "never_backslash_c", MOD_PAT, MOD_OPT, PCRE2_NEVER_BACKSLASH_C, PO(options) },
{ "never_ucp", MOD_PAT, MOD_OPT, PCRE2_NEVER_UCP, PO(options) },
{ "never_utf", MOD_PAT, MOD_OPT, PCRE2_NEVER_UTF, PO(options) },
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
@@ -3459,7 +3460,7 @@
show_compile_options(uint32_t options, const char *before, const char *after)
{
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
-else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
@@ -3473,6 +3474,7 @@
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
+ ((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "",
((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/testdata/testinput2 2015-04-13 17:29:05 UTC (rev 247)
@@ -4265,4 +4265,6 @@
/((?2){73}(?2))((?1))/info
+/ab\Cde/never_backslash_c
+
# End of testinput2
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2015-04-13 09:13:39 UTC (rev 246)
+++ code/trunk/testdata/testoutput2 2015-04-13 17:29:05 UTC (rev 247)
@@ -14293,4 +14293,7 @@
May match empty string
Subject length lower bound = 0
+/ab\Cde/never_backslash_c
+Failed: error 183 at offset 3: using \C is disabled by the application
+
# End of testinput2
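
Illustrative note (not part of the commit): the documentation changes above say that \C "may leave the current matching point in the middle of a multi-code-unit character" in UTF-8 or UTF-16 mode. The following is a minimal, standalone C sketch of that hazard for UTF-8 only. It does not use or reproduce any PCRE2 code; the helper names is_utf8_continuation and on_utf8_boundary are hypothetical, and the subject string is assumed to be valid UTF-8 already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b is a UTF-8 continuation byte, i.e. its
   top two bits are 10 (binary 10xxxxxx). */
static bool is_utf8_continuation(unsigned char b)
{
  return (b & 0xC0u) == 0x80u;
}

/* Hypothetical helper: true if 'offset' lies on a character boundary of the
   UTF-8 string s of 'len' bytes. The string is assumed to be valid UTF-8.
   The end of the string counts as a boundary. */
static bool on_utf8_boundary(const unsigned char *s, size_t len, size_t offset)
{
  assert(s != NULL && offset <= len);
  return offset == len || !is_utf8_continuation(s[offset]);
}
```

For the four-byte subject "a\xC3\xA9z" (the letter a, U+00E9 encoded as two bytes, then z), consuming "a" and then exactly one further code unit, as \C would, stops at offset 2. on_utf8_boundary() reports that offset as not being a boundary, which is the situation PCRE2_NEVER_BACKSLASH_C lets an application rule out at compile time.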