Revision: 255
http://www.exim.org/viewvc/pcre2?view=rev&revision=255
Author: ph10
Date: 2015-04-23 18:28:39 +0100 (Thu, 23 Apr 2015)
Log Message:
-----------
Fix compatibility issues for \8 and \9.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre2pattern.3
code/trunk/doc/pcre2syntax.3
code/trunk/src/pcre2_compile.c
code/trunk/testdata/testinput1
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput1
code/trunk/testdata/testoutput2
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/ChangeLog 2015-04-23 17:28:39 UTC (rev 255)
@@ -94,7 +94,10 @@
23. Added the PCRE2_ALT_CIRCUMFLEX option.
+24. Adjust the treatment of \8 and \9 to be the same as the current Perl
+behaviour.
+
Version 10.10 06-March-2015
---------------------------
Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/doc/pcre2pattern.3 2015-04-23 17:28:39 UTC (rev 255)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "23 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -387,11 +387,13 @@
describe the old, ambiguous syntax.
.P
The handling of a backslash followed by a digit other than 0 is complicated,
-and Perl has changed in recent releases, causing PCRE2 also to change. Outside
-a character class, PCRE2 reads the digit and any following digits as a decimal
-number. If the number is less than 8, or if there have been at least that many
-previous capturing left parentheses in the expression, the entire sequence is
-taken as a \fIback reference\fP. A description of how this works is given
+and Perl has changed over time, causing PCRE2 also to change.
+.P
+Outside a character class, PCRE2 reads the digit and any following digits as a
+decimal number. If the number is less than 10, begins with the digit 8 or 9, or
+if there are at least that many previous capturing left parentheses in the
+expression, the entire sequence is taken as a \fIback reference\fP. A
+description of how this works is given
.\" HTML <a href="#backreferences">
.\" </a>
later,
@@ -399,14 +401,14 @@
following the discussion of
.\" HTML <a href="#subpattern">
.\" </a>
-parenthesized subpatterns.
+parenthesized subpatterns.
.\"
+Otherwise, up to three octal digits are read to form a character code.
.P
-Inside a character class, or if the decimal number following \e is greater than
-7 and there have not been that many capturing subpatterns, PCRE2 handles \e8
-and \e9 as the literal characters "8" and "9", and otherwise re-reads up to
-three octal digits following the backslash, using them to generate a data
-character. Any subsequent digits stand for themselves. For example:
+Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
+"8" and "9", and otherwise reads up to three octal digits following the
+backslash, using them to generate a data character. Any subsequent digits stand
+for themselves. For example, outside a character class:
.sp
\e040 is another way of writing an ASCII space
.\" JOIN
@@ -425,8 +427,7 @@
\e377 might be a back reference, otherwise
the value 255 (decimal)
.\" JOIN
- \e81 is either a back reference, or the two
- characters "8" and "1"
+ \e81 is always a back reference
.sp
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
@@ -3337,6 +3338,6 @@
.rs
.sp
.nf
-Last updated: 22 April 2015
+Last updated: 23 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/doc/pcre2syntax.3 2015-04-23 17:28:39 UTC (rev 255)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -19,7 +19,7 @@
\eQ...\eE treat enclosed characters as literal
.
.
-.SH "CHARACTERS"
+.SH "ESCAPED CHARACTERS"
.rs
.sp
\ea alarm, that is, the BEL character (hex 07)
@@ -32,17 +32,28 @@
\e0dd character with octal code 0dd
\eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd..
- \eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+ \eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
- \exhh character with hex code hh
+ \exhh character with hex code hh
\ex{hhh..} character with hex code hhh..
.sp
-Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
-characters "8" and "9". When \ex is not followed by {, from zero to two
-hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \ex must be followed
-by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise
-it matches a literal "x". Likewise, if \eu (in ALT_BSUX mode) is not followed
-by four hexadecimal digits, it matches a literal "u".
+Note that \e0dd is always an octal code. The treatment of backslash followed by
+a non-zero digit is complicated; for details see the section
+.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+in the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+documentation.
+.P
+When \ex is not followed by {, from zero to two hexadecimal digits are read,
+but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
+be recognized as a hexadecimal escape; otherwise it matches a literal "x".
+Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
+it matches a literal "u".
.
.
.SH "CHARACTER TYPES"
@@ -329,7 +340,7 @@
\eB not a word boundary
^ start of subject
also after an internal newline in multiline mode
- (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
+ (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
\eA start of subject
$ end of subject
also before newline at end of subject
@@ -407,8 +418,8 @@
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_match(), not increase them. The application
-can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
+limits set by the caller of pcre2_match(), not increase them. The application
+can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
.
.
@@ -530,9 +541,9 @@
(?Cn) callout with numerical data n
(?C"text") callout with string data
.sp
-The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
-start and the end), and the starting delimiter { matched with the ending
-delimiter }. To encode the ending delimiter within the string, double it.
+The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
+start and the end), and the starting delimiter { matched with the ending
+delimiter }. To encode the ending delimiter within the string, double it.
.
.
.SH "SEE ALSO"
@@ -556,6 +567,6 @@
.rs
.sp
.nf
-Last updated: 22 April 2015
+Last updated: 23 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/src/pcre2_compile.c 2015-04-23 17:28:39 UTC (rev 255)
@@ -1868,9 +1868,9 @@
Outside a character class, the digits are read as a decimal number. If the
number is less than 10, or if there are that many previous extracting left
brackets, it is a back reference. Otherwise, up to three octal digits are
- read to form an escaped byte. Thus \123 is likely to be octal 123 (cf
- \0123, which is octal 012 followed by the literal 3). If the octal value is
- greater than 377, the least significant 8 bits are taken.
+ read to form an escaped character code. Thus \123 is likely to be octal 123
+ (cf \0123, which is octal 012 followed by the literal 3). If the octal
+ value is greater than 377, the least significant 8 bits are taken.
Inside a character class, \ followed by a digit is always either a literal
8 or 9 or an octal number. */
@@ -1899,18 +1899,24 @@
*errorcodeptr = ERR61;
break;
}
- if (s < 10 || s <= cb->bracount) /* Check for back reference */
+
+ /* \1 to \9 are always back references. \8x and \9x are too, unless there
+ are an awful lot of previous captures; \1x to \7x are octal escapes if
+ there are not that many previous captures. */
+
+ if (s < 10 || *oldptr >= CHAR_8 || s <= cb->bracount)
{
- escape = -s;
+ escape = -s; /* Indicates a back reference */
break;
}
ptr = oldptr; /* Put the pointer back and fall through */
}
- /* Handle a digit following \ when the number is not a back reference. If
- the first digit is 8 or 9, Perl used to generate a binary zero byte and
- then treat the digit as a following literal. At least by Perl 5.18 this
- changed so as not to insert the binary zero. */
+ /* Handle a digit following \ when the number is not a back reference, or
+ we are within a character class. If the first digit is 8 or 9, Perl used to
+ generate a binary zero byte and then treat the digit as a following
+ literal. At least by Perl 5.18 this changed so as not to insert the binary
+ zero. */
if ((c = *ptr) >= CHAR_8) break;
Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/testdata/testinput1 2015-04-23 17:28:39 UTC (rev 255)
@@ -5715,4 +5715,10 @@
"(?1)(?#?'){8}(a)"
baaaaaaaaac
+/((((((((((((x))))))))))))\12/
+ xx
+
+/A[\8]B[\9]C/
+ A8B9C
+
# End of testinput1
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/testdata/testinput2 2015-04-23 17:28:39 UTC (rev 255)
@@ -4279,4 +4279,15 @@
/^/gm,alt_circumflex
\n\n\n
+/((((((((x))))))))\81/
+ xx1
+
+/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
+ xx
+
+/\80/
+
+/A\8B\9C/
+ A8B9C
+
# End of testinput2
Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/testdata/testoutput1 2015-04-23 17:28:39 UTC (rev 255)
@@ -9427,4 +9427,24 @@
0: aaaaaaaaa
1: a
+/((((((((((((x))))))))))))\12/
+ xx
+ 0: xx
+ 1: x
+ 2: x
+ 3: x
+ 4: x
+ 5: x
+ 6: x
+ 7: x
+ 8: x
+ 9: x
+10: x
+11: x
+12: x
+
+/A[\8]B[\9]C/
+ A8B9C
+ 0: A8B9C
+
# End of testinput1
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2015-04-23 13:53:29 UTC (rev 254)
+++ code/trunk/testdata/testoutput2 2015-04-23 17:28:39 UTC (rev 255)
@@ -14318,4 +14318,34 @@
0:
0:
+/((((((((x))))))))\81/
+Failed: error 115 at offset 20: reference to non-existent subpattern
+ xx1
+
+/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
+ xx
+Matched, but too many substrings
+ 0: xx
+ 1: x
+ 2: x
+ 3: x
+ 4: x
+ 5: x
+ 6: x
+ 7: x
+ 8: x
+ 9: x
+10: x
+11: x
+12: x
+13: x
+14: x
+
+/\80/
+Failed: error 115 at offset 3: reference to non-existent subpattern
+
+/A\8B\9C/
+Failed: error 115 at offset 7: reference to non-existent subpattern
+ A8B9C
+
# End of testinput2