Revision: 1369
http://vcs.pcre.org/viewvc?view=rev&revision=1369
Author: ph10
Date: 2013-10-08 16:06:46 +0100 (Tue, 08 Oct 2013)
Log Message:
-----------
Update \8 and \9 handling to match most recent Perl.

Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcrepattern.3
code/trunk/doc/pcresyntax.3
code/trunk/pcre_compile.c
code/trunk/testdata/testinput1
code/trunk/testdata/testinput8
code/trunk/testdata/testoutput1
code/trunk/testdata/testoutput8
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/ChangeLog 2013-10-08 15:06:46 UTC (rev 1369)
@@ -105,6 +105,11 @@
(2) Add missing HAVE_STDINT_H and HAVE_INTTYPES_H to config-cmake.h.in;
without these the CMake build does not work on Solaris.
+
+21. Perl has changed its handling of \8 and \9. If there is no previously
+ encountered capturing group of those numbers, they are treated as the
+ literal characters 8 and 9 instead of a binary zero followed by the
+ literals. PCRE now does the same.
Version 8.33 28-May-2013
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/doc/pcrepattern.3 2013-10-08 15:06:46 UTC (rev 1369)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "05 October 2013" "PCRE 8.34"
+.TH PCREPATTERN 3 "08 October 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -359,9 +359,10 @@
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
.P
-The handling of a backslash followed by a digit other than 0 is complicated.
-Outside a character class, PCRE reads it and any following digits as a decimal
-number. If the number is less than 10, or if there have been at least that many
+The handling of a backslash followed by a digit other than 0 is complicated,
+and Perl has changed in recent releases, causing PCRE also to change. Outside a
+character class, PCRE reads the digit and any following digits as a decimal
+number. If the number is less than 8, or if there have been at least that many
previous capturing left parentheses in the expression, the entire sequence is
taken as a \fIback reference\fP. A description of how this works is given
.\" HTML <a href="#backreferences">
@@ -374,12 +375,13 @@
parenthesized subpatterns.
.\"
.P
-Inside a character class, or if the decimal number is greater than 9 and there
-have not been that many capturing subpatterns, PCRE re-reads up to three octal
-digits following the backslash, and uses them to generate a data character. Any
-subsequent digits stand for themselves. The value of the character is
-constrained in the same way as characters specified in hexadecimal.
-For example:
+Inside a character class, or if the decimal number following \e is greater than
+7 and there have not been that many capturing subpatterns, PCRE handles \e8 and
+\e9 as the literal characters "8" and "9", and otherwise re-reads up to three
+octal digits following the backslash, using them to generate a data character.
+Any subsequent digits stand for themselves. The value of the character is
+constrained in the same way as characters specified in hexadecimal. For
+example:
.sp
\e040 is another way of writing an ASCII space
.\" JOIN
@@ -398,8 +400,8 @@
\e377 might be a back reference, otherwise
the value 255 (decimal)
.\" JOIN
- \e81 is either a back reference, or a binary zero
- followed by the two characters "8" and "1"
+ \e81 is either a back reference, or the two
+ characters "8" and "1"
.sp
Note that octal values of 100 or greater must not be introduced by a leading
zero, because no more than three octal digits are ever read.
@@ -3156,6 +3158,6 @@
.rs
.sp
.nf
-Last updated: 05 October 2013
+Last updated: 08 October 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/doc/pcresyntax.3 2013-10-08 15:06:46 UTC (rev 1369)
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "05 October 2013" "PCRE 8.34"
+.TH PCRESYNTAX 3 "08 October 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -32,6 +32,9 @@
\eddd character with octal code ddd, or backreference
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh..
+.sp
+Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
+characters "8" and "9".
.
.
.SH "CHARACTER TYPES"
@@ -498,6 +501,6 @@
.rs
.sp
.nf
-Last updated: 05 October 2013
+Last updated: 08 October 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/pcre_compile.c 2013-10-08 15:06:46 UTC (rev 1369)
@@ -879,16 +879,15 @@
*************************************************/
/* This function is called when a \ has been encountered. It either returns a
-positive value for a simple escape such as \n, or 0 for a data character
-which will be placed in chptr. A backreference to group n is returned as
-negative n. When UTF-8 is enabled, a positive value greater than 255 may
-be returned in chptr.
-On entry,ptr is pointing at the \. On exit, it is on the final character of the
-escape sequence.
+positive value for a simple escape such as \n, or 0 for a data character which
+will be placed in chptr. A backreference to group n is returned as negative n.
+When UTF-8 is enabled, a positive value greater than 255 may be returned in
+chptr. On entry, ptr is pointing at the \. On exit, it is on the final
+character of the escape sequence.
Arguments:
ptrptr points to the pattern position pointer
- chptr points to the data character
+ chptr points to a returned data character
errorcodeptr points to the errorcode variable
bracount number of previous extracting brackets
options the options bits
@@ -1092,16 +1091,20 @@
break;
/* The handling of escape sequences consisting of a string of digits
- starting with one that is not zero is not straightforward. By experiment,
- the way Perl works seems to be as follows:
+ starting with one that is not zero is not straightforward. Perl has changed
+ over the years. Nowadays \g{} for backreferences and \o{} for octal are
+ recommended to avoid the ambiguities in the old syntax.
Outside a character class, the digits are read as a decimal number. If the
- number is less than 10, or if there are that many previous extracting
- left brackets, then it is a back reference. Otherwise, up to three octal
- digits are read to form an escaped byte. Thus \123 is likely to be octal
- 123 (cf \0123, which is octal 012 followed by the literal 3). If the octal
- value is greater than 377, the least significant 8 bits are taken. Inside a
- character class, \ followed by a digit is always an octal number. */
+ number is less than 8 (used to be 10), or if there are that many previous
+ extracting left brackets, then it is a back reference. Otherwise, up to
+ three octal digits are read to form an escaped byte. Thus \123 is likely to
+ be octal 123 (cf \0123, which is octal 012 followed by the literal 3). If
+ the octal value is greater than 377, the least significant 8 bits are
+ taken. \8 and \9 are treated as the literal characters 8 and 9.
+
+ Inside a character class, \ followed by a digit is always either a literal
+ 8 or 9 or an octal number. */
case CHAR_1: case CHAR_2: case CHAR_3: case CHAR_4: case CHAR_5:
case CHAR_6: case CHAR_7: case CHAR_8: case CHAR_9:
@@ -1128,7 +1131,7 @@
*errorcodeptr = ERR61;
break;
}
- if (s < 10 || s <= bracount)
+ if (s < 8 || s <= bracount) /* Check for back reference */
{
escape = -s;
break;
@@ -1136,16 +1139,14 @@
ptr = oldptr; /* Put the pointer back and fall through */
}
- /* Handle an octal number following \. If the first digit is 8 or 9, Perl
- generates a binary zero byte and treats the digit as a following literal.
- Thus we have to pull back the pointer by one. */
+ /* Handle a digit following \ when the number is not a back reference. If
+ the first digit is 8 or 9, Perl used to generate a binary zero byte and
+ then treat the digit as a following literal. At least by Perl 5.18 this
+ changed so as not to insert the binary zero. */
- if ((c = *ptr) >= CHAR_8)
- {
- ptr--;
- c = 0;
- break;
- }
+ if ((c = *ptr) >= CHAR_8) break;
+
+ /* Fall through with a digit less than 8 */
/* \0 always starts an octal number, but we may drop through to here with a
larger first octal digit. The original code used just to take the least
Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testinput1 2013-10-08 15:06:46 UTC (rev 1369)
@@ -1483,15 +1483,20 @@
abc\100\x30
abc\100\060
abc\100\60
+
+/^A\8B\9C$/
+ A8B9C
+ *** Failers
+ A\08B\09C
+
+/^(A)(B)(C)(D)(E)(F)(G)(H)(I)\8\9$/
+ ABCDEFGHIHI
-/abc\81/
- abc\081
- abc\0\x38\x31
+/^[A\8B\9C]+$/
+ A8B9C
+ *** Failers
+ A8B9C\x00
-/abc\91/
- abc\091
- abc\0\x39\x31
-
/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/
abcdefghijkllS
@@ -5626,5 +5631,8 @@
/^ (?:(?<A>A)|(?'B'B)(?<A>A)) (?('A')x) (?(<B>)y)$/xJ
Ax
BAxy
+
+/^A\xZ/
+ A\0Z
/-- End of testinput1 --/
Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testinput8 2013-10-08 15:06:46 UTC (rev 1369)
@@ -1923,14 +1923,16 @@
abc\100\060
abc\100\60
-/abc\81/
- abc\081
- abc\0\x38\x31
-
-/abc\91/
- abc\091
- abc\0\x39\x31
-
+/^A\8B\9C$/
+ A8B9C
+ *** Failers
+ A\08B\09C
+
+/^[A\8B\9C]+$/
+ A8B9C
+ *** Failers
+ A8B9C\x00
+
/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/
abcdefghijk\12S
Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testoutput1 2013-10-08 15:06:46 UTC (rev 1369)
@@ -2149,19 +2149,36 @@
abc\100\60
0: abc@0
1: abc
+
+/^A\8B\9C$/
+ A8B9C
+ 0: A8B9C
+ *** Failers
+No match
+ A\08B\09C
+No match
+
+/^(A)(B)(C)(D)(E)(F)(G)(H)(I)\8\9$/
+ ABCDEFGHIHI
+ 0: ABCDEFGHIHI
+ 1: A
+ 2: B
+ 3: C
+ 4: D
+ 5: E
+ 6: F
+ 7: G
+ 8: H
+ 9: I
-/abc\81/
- abc\081
- 0: abc\x0081
- abc\0\x38\x31
- 0: abc\x0081
+/^[A\8B\9C]+$/
+ A8B9C
+ 0: A8B9C
+ *** Failers
+No match
+ A8B9C\x00
+No match
-/abc\91/
- abc\091
- 0: abc\x0091
- abc\0\x39\x31
- 0: abc\x0091
-
/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/
abcdefghijkllS
0: abcdefghijkllS
@@ -9246,5 +9263,9 @@
1: <unset>
2: B
3: A
+
+/^A\xZ/
+ A\0Z
+ 0: A\x00Z
/-- End of testinput1 --/
Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8 2013-10-08 10:01:58 UTC (rev 1368)
+++ code/trunk/testdata/testoutput8 2013-10-08 15:06:46 UTC (rev 1369)
@@ -2930,18 +2930,22 @@
abc\100\60
0: abc@0
-/abc\81/
- abc\081
- 0: abc\x0081
- abc\0\x38\x31
- 0: abc\x0081
-
-/abc\91/
- abc\091
- 0: abc\x0091
- abc\0\x39\x31
- 0: abc\x0091
-
+/^A\8B\9C$/
+ A8B9C
+ 0: A8B9C
+ *** Failers
+No match
+ A\08B\09C
+No match
+
+/^[A\8B\9C]+$/
+ A8B9C
+ 0: A8B9C
+ *** Failers
+No match
+ A8B9C\x00
+No match
+
/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/
abcdefghijk\12S
0: abcdefghijk\x0aS
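
For readers who want the new decision rule in one place, the following is a minimal standalone sketch in C of how a backslash followed by a non-zero digit is classified after this change. It is not the code in pcre_compile.c: the names esc_kind, esc_result and classify_digit_escape are hypothetical, the decimal value is saturated rather than reported as an overflow error, and range checks on the resulting octal value are omitted.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical result types for this sketch; not part of PCRE. */
typedef enum { ESC_BACKREF, ESC_LITERAL, ESC_OCTAL } esc_kind;

typedef struct {
    esc_kind kind;
    int value;     /* group number, or character code */
    int consumed;  /* how many digit characters were used */
} esc_result;

/* digits points just past the backslash and must start with '1'..'9'.
   bracount is the number of capturing groups opened so far.
   in_class is non-zero inside a character class. */
esc_result classify_digit_escape(const char *digits, int bracount, int in_class)
{
    esc_result r;
    assert(digits[0] >= '1' && digits[0] <= '9');

    if (!in_class) {
        /* Outside a class, read every following digit as one decimal number.
           Assumption: saturate instead of raising an overflow error. */
        int s = 0, n = 0;
        while (isdigit((unsigned char)digits[n])) {
            if (s < 100000) s = s * 10 + (digits[n] - '0');
            n++;
        }
        if (s < 8 || s <= bracount) {   /* back reference */
            r.kind = ESC_BACKREF; r.value = s; r.consumed = n;
            return r;
        }
        /* Not a back reference: re-read from the first digit. */
    }

    if (digits[0] >= '8') {             /* \8 and \9 are plain literals */
        r.kind = ESC_LITERAL; r.value = digits[0]; r.consumed = 1;
        return r;
    }

    /* Otherwise read up to three octal digits as a character code. */
    int c = digits[0] - '0', n = 1;
    while (n < 3 && digits[n] >= '0' && digits[n] <= '7') {
        c = c * 8 + (digits[n] - '0');
        n++;
    }
    r.kind = ESC_OCTAL; r.value = c; r.consumed = n;
    return r;
}
```

Under these assumptions, classify_digit_escape("81", 0, 0) yields the literal "8" with one digit consumed, so the "1" that follows stands for itself, while classify_digit_escape("8", 9, 0) yields a back reference to group 8, in line with the new tests above.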