Revision: 1394
http://vcs.pcre.org/viewvc?view=rev&revision=1394
Author: ph10
Date: 2013-11-09 09:17:20 +0000 (Sat, 09 Nov 2013)
Log Message:
-----------
Require group names to start with a non-digit.
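
The rule this revision enforces can be stated as a small standalone check. What follows is only a minimal sketch, not the code in pcre_compile.c: the names is_valid_group_name and MAX_NAME_LEN are hypothetical, and the sketch assumes the C locale's isalnum/isdigit in place of PCRE's own ctypes table.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical constant: the documented limit of 32 characters per name. */
#define MAX_NAME_LEN 32

/* Return 1 if name is acceptable under the new rule, 0 otherwise.
   Illustration only; PCRE itself consults its own character tables. */
static int is_valid_group_name(const char *name)
{
  size_t len;

  if (name == NULL || *name == '\0') return 0;      /* empty name */
  if (isdigit((unsigned char)*name)) return 0;      /* leading digit: rejected */

  for (len = 0; name[len] != '\0'; len++)
    {
    unsigned char c = (unsigned char)name[len];
    if (len >= MAX_NAME_LEN) return 0;              /* more than 32 characters */
    if (!isalnum(c) && c != '_') return 0;          /* not a word character */
    }
  return 1;
}
```

Under these assumptions, "abc" and "_a1" are accepted, while "0abc" (leading digit) and "ab-cd" (non-word character) are rejected.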
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcrepattern.3
code/trunk/pcre_compile.c
code/trunk/pcre_internal.h
code/trunk/pcreposix.c
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput11-16
code/trunk/testdata/testoutput11-32
code/trunk/testdata/testoutput11-8
code/trunk/testdata/testoutput2
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/ChangeLog 2013-11-09 09:17:20 UTC (rev 1394)
@@ -169,6 +169,9 @@
for which PCRE does not allow quantifiers. The optimization is now disabled
when a quantifier follows (?!). I can't see any use for this, but it makes
things uniform.
+
+36. Perl no longer allows group names to start with digits, so I have made this
+ change also in PCRE. It simplifies the code a bit.
Version 8.33 28-May-2013
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/doc/pcrepattern.3 2013-11-09 09:17:20 UTC (rev 1394)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "05 November 2013" "PCRE 8.34"
+.TH PCREPATTERN 3 "08 November 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -1595,11 +1595,12 @@
.\"
can be made by name as well as by number.
.P
-Names consist of up to 32 alphanumeric characters and underscores. Named
-capturing parentheses are still allocated numbers as well as names, exactly as
-if the names were not present. The PCRE API provides function calls for
-extracting the name-to-number translation table from a compiled pattern. There
-is also a convenience function for extracting a captured substring by name.
+Names consist of up to 32 alphanumeric characters and underscores, but must
+start with a non-digit. Named capturing parentheses are still allocated numbers
+as well as names, exactly as if the names were not present. The PCRE API
+provides function calls for extracting the name-to-number translation table
+from a compiled pattern. There is also a convenience function for extracting a
+captured substring by name.
.P
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
@@ -2331,12 +2332,7 @@
.sp
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE, which had
-this facility before Perl, the syntax (?(name)...) is also recognized. However,
-there is a possible ambiguity with this syntax, because subpattern names may
-consist entirely of digits. PCRE looks first for a named subpattern; if it
-cannot find one and the name consists entirely of digits, PCRE looks for a
-subpattern of that number, which must be greater than zero. Using subpattern
-names that consist entirely of digits is not recommended.
+this facility before Perl, the syntax (?(name)...) is also recognized.
.P
Rewriting the above example to use a named subpattern gives this:
.sp
@@ -3205,6 +3201,6 @@
.rs
.sp
.nf
-Last updated: 05 November 2013
+Last updated: 08 November 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/pcre_compile.c 2013-11-09 09:17:20 UTC (rev 1394)
@@ -533,6 +533,7 @@
"missing opening brace after \\o\0"
"parentheses are too deeply nested\0"
"invalid range in character class\0"
+ "group name must start with a non-digit\0"
;
/* Table to identify digits and hex digits. This is used when compiling
@@ -6465,17 +6466,16 @@
tempptr = ptr;
/* A condition can be an assertion, a number (referring to a numbered
- group), a name (referring to a named group), or 'R', referring to
- recursion. R<digits> and R&name are also permitted for recursion tests.
+ group's having been set), a name (referring to a named group), or 'R',
+ referring to recursion. R<digits> and R&name are also permitted for
+ recursion tests.
- There are several syntaxes for testing a named group: (?(name)) is used
- by Python; Perl 5.10 onwards uses (?(<name>) or (?('name')).
+ There are ways of testing a named group: (?(name)) is used by Python;
+ Perl 5.10 onwards uses (?(<name>) or (?('name')).
- There are two unfortunate ambiguities, caused by history. (a) 'R' can
- be the recursive thing or the name 'R' (and similarly for 'R' followed
- by digits), and (b) a number could be a name that consists of digits.
- In both cases, we look for a name first; if not found, we try the other
- cases.
+ There is one unfortunate ambiguity, caused by history. 'R' can be the
+ recursive thing or the name 'R' (and similarly for 'R' followed by
+ digits). We look for a name first; if not found, we try the other case.
For compatibility with auto-callouts, we allow a callout to be
specified before a condition that is an assertion. First, check for the
@@ -6508,7 +6508,8 @@
/* Check for a test for recursion in a named group. */
- if (ptr[1] == CHAR_R && ptr[2] == CHAR_AMPERSAND)
+ ptr++;
+ if (*ptr == CHAR_R && ptr[1] == CHAR_AMPERSAND)
{
terminator = -1;
ptr += 2;
@@ -6517,15 +6518,14 @@
/* Check for a test for a named group's having been set, using the Perl
syntax (?(<name>) or (?('name'), and also allow for the original PCRE
- syntax of (?(name) or for (?(+n), (?(-n), and just (?(n). As names may
- consist entirely of digits, there is scope for ambiguity. */
+ syntax of (?(name) or for (?(+n), (?(-n), and just (?(n). */
- else if (ptr[1] == CHAR_LESS_THAN_SIGN)
+ else if (*ptr == CHAR_LESS_THAN_SIGN)
{
terminator = CHAR_GREATER_THAN_SIGN;
ptr++;
}
- else if (ptr[1] == CHAR_APOSTROPHE)
+ else if (*ptr == CHAR_APOSTROPHE)
{
terminator = CHAR_APOSTROPHE;
ptr++;
@@ -6533,43 +6533,55 @@
else
{
terminator = CHAR_NULL;
- if (ptr[1] == CHAR_MINUS || ptr[1] == CHAR_PLUS) refsign = *(++ptr);
+ if (*ptr == CHAR_MINUS || *ptr == CHAR_PLUS) refsign = *ptr++;
+ else if (IS_DIGIT(*ptr)) refsign = 0;
}
-
- /* When a name is one of a number of duplicates, a different opcode is
- used and it needs more memory. Unfortunately we cannot tell whether a
- name is a duplicate in the first pass, so we have to allow for more
- memory except when we know it is a relative numerical reference. */
-
- if (refsign < 0 && lengthptr != NULL) *lengthptr += IMM2_SIZE;
-
- /* We now expect to read a name (possibly all digits); any thing else
- is an error. In the case of all digits, also get it as a number. */
-
- if (!MAX_255(ptr[1]) || (cd->ctypes[ptr[1]] & ctype_word) == 0)
+
+ /* Handle a number */
+
+ if (refsign >= 0)
{
- ptr += 1; /* To get the right offset */
- *errorcodeptr = ERR28;
- goto FAILED;
- }
+ recno = 0;
+ while (IS_DIGIT(*ptr))
+ {
+ recno = recno * 10 + (int)(*ptr - CHAR_0);
+ ptr++;
+ }
+ }
+
+ /* Otherwise we expect to read a name; anything else is an error. When
+ a name is one of a number of duplicates, a different opcode is used and
+ it needs more memory. Unfortunately we cannot tell whether a name is a
+ duplicate in the first pass, so we have to allow for more memory. */
- recno = 0;
- name = ++ptr;
- while (MAX_255(*ptr) && (cd->ctypes[*ptr] & ctype_word) != 0)
+ else
{
- if (recno >= 0)
- recno = (IS_DIGIT(*ptr))? recno * 10 + (int)(*ptr - CHAR_0) : -1;
- ptr++;
+ if (IS_DIGIT(*ptr))
+ {
+ *errorcodeptr = ERR84;
+ goto FAILED;
+ }
+ if (!MAX_255(*ptr) || (cd->ctypes[*ptr] & ctype_word) == 0)
+ {
+ *errorcodeptr = ERR28; /* Assertion expected */
+ goto FAILED;
+ }
+ name = ptr++;
+ while (MAX_255(*ptr) && (cd->ctypes[*ptr] & ctype_word) != 0)
+ {
+ ptr++;
+ }
+ namelen = (int)(ptr - name);
+ if (lengthptr != NULL) *lengthptr += IMM2_SIZE;
}
- namelen = (int)(ptr - name);
/* Check the terminator */
if ((terminator > 0 && *ptr++ != (pcre_uchar)terminator) ||
*ptr++ != CHAR_RIGHT_PARENTHESIS)
{
- ptr--; /* Error offset */
- *errorcodeptr = ERR26;
+ ptr--; /* Error offset */
+ *errorcodeptr = ERR26; /* Malformed number or name */
goto FAILED;
}
@@ -6578,18 +6590,18 @@
if (lengthptr != NULL) break;
/* In the real compile we do the work of looking for the actual
- reference. If the string started with "+" or "-" we require the rest to
- be digits, in which case recno will be set. */
-
- if (refsign > 0)
+ reference. If refsign is not negative, it means we have a number in
+ recno. */
+
+ if (refsign >= 0)
{
if (recno <= 0)
{
- *errorcodeptr = ERR58;
+ *errorcodeptr = ERR35;
goto FAILED;
}
- recno = (refsign == CHAR_MINUS)?
- cd->bracount - recno + 1 : recno +cd->bracount;
+ if (refsign != 0) recno = (refsign == CHAR_MINUS)?
+ cd->bracount - recno + 1 : recno + cd->bracount;
if (recno <= 0 || recno > cd->final_bracount)
{
*errorcodeptr = ERR15;
@@ -6599,8 +6611,7 @@
break;
}
- /* Otherwise (did not start with "+" or "-"), start by looking for the
- name. */
+ /* Otherwise look for the name. */
slot = cd->name_table;
for (i = 0; i < cd->names_found; i++)
@@ -6641,8 +6652,8 @@
/* If terminator == CHAR_NULL it means that the name followed directly
after the opening parenthesis [e.g. (?(abc)...] and in this case there
are some further alternatives to try. For the cases where terminator !=
- 0 [things like (?(<name>... or (?('name')... or (?(R&name)... ] we have
- now checked all the possibilities, so give an error. */
+ CHAR_NULL [things like (?(<name>... or (?('name')... or (?(R&name)... ]
+ we have now checked all the possibilities, so give an error. */
else if (terminator != CHAR_NULL)
{
@@ -6679,19 +6690,11 @@
skipbytes = 1;
}
- /* Check for the "name" actually being a subpattern number. We are
- in the second pass here, so final_bracount is set. */
+ /* Reference to an unidentified subpattern. */
- else if (recno > 0 && recno <= cd->final_bracount)
- {
- PUT2(code, 2+LINK_SIZE, recno);
- }
-
- /* Either an unidentified subpattern, or a reference to (?(0) */
-
else
{
- *errorcodeptr = (recno == 0)? ERR35: ERR15;
+ *errorcodeptr = ERR15;
goto FAILED;
}
break;
@@ -6811,7 +6814,11 @@
terminator = (*ptr == CHAR_LESS_THAN_SIGN)?
CHAR_GREATER_THAN_SIGN : CHAR_APOSTROPHE;
name = ++ptr;
-
+ if (IS_DIGIT(*ptr))
+ {
+ *errorcodeptr = ERR84; /* Group name must start with non-digit */
+ goto FAILED;
+ }
while (MAX_255(*ptr) && (cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
namelen = (int)(ptr - name);
@@ -6925,6 +6932,11 @@
NAMED_REF_OR_RECURSE:
name = ++ptr;
+ if (IS_DIGIT(*ptr))
+ {
+ *errorcodeptr = ERR84; /* Group name must start with non-digit */
+ goto FAILED;
+ }
while (MAX_255(*ptr) && (cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
namelen = (int)(ptr - name);
@@ -7576,44 +7588,31 @@
if (escape == ESC_g)
{
const pcre_uchar *p;
+ pcre_uint32 cf;
+
save_hwm = cd->hwm; /* Normally this is set when '(' is read */
terminator = (*(++ptr) == CHAR_LESS_THAN_SIGN)?
CHAR_GREATER_THAN_SIGN : CHAR_APOSTROPHE;
/* These two statements stop the compiler for warning about possibly
unset variables caused by the jump to HANDLE_NUMERICAL_RECURSION. In
- fact, because we actually check for a number below, the paths that
+ fact, because we do the check for a number below, the paths that
would actually be in error are never taken. */
skipbytes = 0;
reset_bracount = FALSE;
- /* Test for a name */
+ /* If it's not a signed or unsigned number, treat it as a name. */
- if (ptr[1] != CHAR_PLUS && ptr[1] != CHAR_MINUS)
+ cf = ptr[1];
+ if (cf != CHAR_PLUS && cf != CHAR_MINUS && !IS_DIGIT(cf))
{
- BOOL is_a_number = TRUE;
- for (p = ptr + 1; *p != CHAR_NULL && *p != (pcre_uchar)terminator; p++)
- {
- if (!MAX_255(*p)) { is_a_number = FALSE; break; }
- if ((cd->ctypes[*p] & ctype_digit) == 0) is_a_number = FALSE;
- if ((cd->ctypes[*p] & ctype_word) == 0) break;
- }
- if (*p != (pcre_uchar)terminator)
- {
- *errorcodeptr = ERR57;
- break;
- }
- if (is_a_number)
- {
- ptr++;
- goto HANDLE_NUMERICAL_RECURSION;
- }
is_recurse = TRUE;
goto NAMED_REF_OR_RECURSE;
}
- /* Test a signed number in angle brackets or quotes. */
+ /* Signed or unsigned number (cf = ptr[1]) is known to be plus or minus
+ or a digit. */
p = ptr + 2;
while (IS_DIGIT(*p)) p++;
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/pcre_internal.h 2013-11-09 09:17:20 UTC (rev 1394)
@@ -2335,7 +2335,7 @@
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69,
ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79,
- ERR80, ERR81, ERR82, ERR83, ERRCOUNT };
+ ERR80, ERR81, ERR82, ERR83, ERR84, ERRCOUNT };
/* JIT compiling modes. The function list is indexed by them. */
Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/pcreposix.c 2013-11-09 09:17:20 UTC (rev 1394)
@@ -169,7 +169,8 @@
REG_BADPAT, /* non-octal character in \o{} (closing brace missing?) */
REG_BADPAT, /* missing opening brace after \o */
REG_BADPAT, /* parentheses too deeply nested */
- REG_BADPAT /* invalid range in character class */
+ REG_BADPAT, /* invalid range in character class */
+ REG_BADPAT /* group name must start with a non-digit */
};
/* Table of texts corresponding to POSIX error codes */
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/testdata/testinput2 2013-11-09 09:17:20 UTC (rev 1394)
@@ -1945,11 +1945,8 @@
/(?<A> (?'B' abc (?(R) (?(R&A)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
abcabc1Xabc2XabcXabcabc
-/(?<A> (?'B' abc (?(R) (?(R&1)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
+/(?<A> (?'B' abc (?(R) (?(R&C)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
-/(?<1> (?'B' abc (?(R) (?(R&1)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
- abcabc1Xabc2XabcXabcabc
-
/^(?(DEFINE) abc | xyz ) /x
/(?(DEFINE) abc) xyz/xI
@@ -2065,7 +2062,7 @@
/^(a)\g{3/
-/^(a)\g{4a}/
+/^(a)\g{aa}/
/^a.b/<lf>
a\rb
@@ -3997,4 +3994,40 @@
/[a-\d]+/
+/(?<0abc>xx)/
+
+/(?&1abc)xx(?<1abc>y)/
+
+/(?<ab-cd>xx)/
+
+/(?'0abc'xx)/
+
+/(?P<0abc>xx)/
+
+/\k<5ghj>/
+
+/\k'5ghj'/
+
+/\k{2fgh}/
+
+/(?P=8yuki)/
+
+/\g{4df}/
+
+/(?&1abc)xx(?<1abc>y)/
+
+/(?P>1abc)xx(?<1abc>y)/
+
+/\g'3gh'/
+
+/\g<5fg>/
+
+/(?(<4gh>)abc)/
+
+/(?('4gh')abc)/
+
+/(?(4gh)abc)/
+
+/(?(R&6yh)abc)/
+
/-- End of testinput2 --/
Modified: code/trunk/testdata/testoutput11-16
===================================================================
--- code/trunk/testdata/testoutput11-16 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/testdata/testoutput11-16 2013-11-09 09:17:20 UTC (rev 1394)
@@ -535,7 +535,7 @@
------------------------------------------------------------------
/( ( (?(1)0|) )* )/xBM
-Memory allocation (code space): 54
+Memory allocation (code space): 52
------------------------------------------------------------------
0 23 Bra
2 19 CBra 1
@@ -553,7 +553,7 @@
------------------------------------------------------------------
/( (?(1)0|)* )/xBM
-Memory allocation (code space): 44
+Memory allocation (code space): 42
------------------------------------------------------------------
0 18 Bra
2 14 CBra 1
Modified: code/trunk/testdata/testoutput11-32
===================================================================
--- code/trunk/testdata/testoutput11-32 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/testdata/testoutput11-32 2013-11-09 09:17:20 UTC (rev 1394)
@@ -535,7 +535,7 @@
------------------------------------------------------------------
/( ( (?(1)0|) )* )/xBM
-Memory allocation (code space): 108
+Memory allocation (code space): 104
------------------------------------------------------------------
0 23 Bra
2 19 CBra 1
@@ -553,7 +553,7 @@
------------------------------------------------------------------
/( (?(1)0|)* )/xBM
-Memory allocation (code space): 88
+Memory allocation (code space): 84
------------------------------------------------------------------
0 18 Bra
2 14 CBra 1
Modified: code/trunk/testdata/testoutput11-8
===================================================================
--- code/trunk/testdata/testoutput11-8 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/testdata/testoutput11-8 2013-11-09 09:17:20 UTC (rev 1394)
@@ -535,7 +535,7 @@
------------------------------------------------------------------
/( ( (?(1)0|) )* )/xBM
-Memory allocation (code space): 40
+Memory allocation (code space): 38
------------------------------------------------------------------
0 34 Bra
3 28 CBra 1
@@ -553,7 +553,7 @@
------------------------------------------------------------------
/( (?(1)0|)* )/xBM
-Memory allocation (code space): 32
+Memory allocation (code space): 30
------------------------------------------------------------------
0 26 Bra
3 20 CBra 1
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2013-11-08 16:37:21 UTC (rev 1393)
+++ code/trunk/testdata/testoutput2 2013-11-09 09:17:20 UTC (rev 1394)
@@ -549,10 +549,10 @@
Failed: conditional group contains more than two branches at offset 12
/(?(1a)/
-Failed: missing ) at offset 6
+Failed: malformed number or name after (?( at offset 4
/(?(1a))/
-Failed: reference to non-existent subpattern at offset 6
+Failed: malformed number or name after (?( at offset 4
/(?(?i))/
Failed: assertion expected after (?( at offset 3
@@ -6359,7 +6359,7 @@
1: X
/(?:(?(2y)a|b)(X))+/I
-Failed: reference to non-existent subpattern at offset 9
+Failed: malformed number or name after (?( at offset 7
/(?:(?(ZA)a|b)(?P<ZZ>X))+/I
Failed: reference to non-existent subpattern at offset 9
@@ -7866,15 +7866,9 @@
1: abcabc1Xabc2XabcX
2: abcabc1Xabc2XabcX
-/(?<A> (?'B' abc (?(R) (?(R&1)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
+/(?<A> (?'B' abc (?(R) (?(R&C)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
Failed: reference to non-existent subpattern at offset 29
-/(?<1> (?'B' abc (?(R) (?(R&1)1) (?(R&B)2) X | (?1) (?2) (?R) ))) /x
- abcabc1Xabc2XabcXabcabc
- 0: abcabc1Xabc2XabcX
- 1: abcabc1Xabc2XabcX
- 2: abcabc1Xabc2XabcX
-
/^(?(DEFINE) abc | xyz ) /x
Failed: DEFINE group contains more than one branch at offset 22
@@ -8098,7 +8092,7 @@
/^(a)\g{3/
Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 8
-/^(a)\g{4a}/
+/^(a)\g{aa}/
Failed: reference to non-existent subpattern at offset 9
/^a.b/<lf>
@@ -14032,4 +14026,58 @@
/[a-\d]+/
Failed: invalid range in character class at offset 4
+/(?<0abc>xx)/
+Failed: group name must start with a non-digit at offset 3
+
+/(?&1abc)xx(?<1abc>y)/
+Failed: group name must start with a non-digit at offset 3
+
+/(?<ab-cd>xx)/
+Failed: syntax error in subpattern name (missing terminator) at offset 5
+
+/(?'0abc'xx)/
+Failed: group name must start with a non-digit at offset 3
+
+/(?P<0abc>xx)/
+Failed: group name must start with a non-digit at offset 4
+
+/\k<5ghj>/
+Failed: group name must start with a non-digit at offset 3
+
+/\k'5ghj'/
+Failed: group name must start with a non-digit at offset 3
+
+/\k{2fgh}/
+Failed: group name must start with a non-digit at offset 3
+
+/(?P=8yuki)/
+Failed: group name must start with a non-digit at offset 4
+
+/\g{4df}/
+Failed: group name must start with a non-digit at offset 3
+
+/(?&1abc)xx(?<1abc>y)/
+Failed: group name must start with a non-digit at offset 3
+
+/(?P>1abc)xx(?<1abc>y)/
+Failed: group name must start with a non-digit at offset 4
+
+/\g'3gh'/
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 7
+
+/\g<5fg>/
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 7
+
+/(?(<4gh>)abc)/
+Failed: group name must start with a non-digit at offset 4
+
+/(?('4gh')abc)/
+Failed: group name must start with a non-digit at offset 4
+
+/(?(4gh)abc)/
+Failed: malformed number or name after (?( at offset 4
+
+/(?(R&6yh)abc)/
+Failed: group name must start with a non-digit at offset 5
+
/-- End of testinput2 --/
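
The number-versus-name decision that the pcre_compile.c hunk introduces for (?( conditions can be sketched in isolation. This is a simplified illustration under stated assumptions, not the committed code: classify_condition, is_word_char and the cond_kind values are hypothetical names, refsign is assumed to start at -1, and the earlier checks for assertions, callouts and DEFINE, as well as the later range and name-table lookups, are not modelled.

```c
#include <ctype.h>

/* Hypothetical result codes for this illustration. */
enum cond_kind {
  COND_NUMBER,            /* (?(n), (?(+n) or (?(-n) */
  COND_NAME,              /* (?(name), (?(<name>), (?('name') or (?(R&name) */
  COND_ERR_DIGIT_START,   /* "group name must start with a non-digit" */
  COND_ERR_NOT_A_NAME,    /* "assertion expected after (?(" */
  COND_ERR_MALFORMED      /* "malformed number or name after (?(" */
};

static int is_word_char(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* s points at the first character after "(?(". */
static enum cond_kind classify_condition(const char *s)
{
  int terminator = 0;     /* 0: none, -1: none (after R&), else closing char */
  int refsign = -1;       /* -1: expect a name, 0: unsigned, else '+' or '-' */

  if (s[0] == 'R' && s[1] == '&') { terminator = -1; s += 2; }
  else if (*s == '<')  { terminator = '>';  s++; }
  else if (*s == '\'') { terminator = '\''; s++; }
  else if (*s == '-' || *s == '+') refsign = *s++;
  else if (isdigit((unsigned char)*s)) refsign = 0;

  if (refsign >= 0)
    {
    /* A number: consume the digits only. */
    while (isdigit((unsigned char)*s)) s++;
    }
  else
    {
    /* A name: it may not begin with a digit. */
    if (isdigit((unsigned char)*s)) return COND_ERR_DIGIT_START;
    if (!is_word_char((unsigned char)*s)) return COND_ERR_NOT_A_NAME;
    while (is_word_char((unsigned char)*s)) s++;
    }

  /* Check the terminator, then the closing parenthesis. */
  if (terminator > 0 && *s++ != terminator) return COND_ERR_MALFORMED;
  if (*s != ')') return COND_ERR_MALFORMED;
  return (refsign >= 0) ? COND_NUMBER : COND_NAME;
}
```

Because a leading digit now always selects the number path, an input such as 4gh) stops at the "g" and is reported as malformed, whereas <4gh>) reaches the name path and is rejected for its leading digit. This agrees with the new expected output for /(?(4gh)abc)/ and /(?(<4gh>)abc)/ above.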