Revision: 333
http://vcs.pcre.org/viewvc?view=rev&revision=333
Author: ph10
Date: 2008-04-10 20:55:57 +0100 (Thu, 10 Apr 2008)
Log Message:
-----------
Add Oniguruma syntax \g<...> and \g'...' for subroutine calls.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcrepattern.3
code/trunk/doc/pcresyntax.3
code/trunk/pcre_compile.c
code/trunk/pcre_internal.h
code/trunk/testdata/testinput2
code/trunk/testdata/testoutput2
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/ChangeLog 2008-04-10 19:55:57 UTC (rev 333)
@@ -45,6 +45,11 @@
10. Applied Alan Lehotsky's patch to add REG_STARTEND support to the POSIX
matching function regexec().
+
+11. Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n',
+ which, however, unlike Perl's \g{...}, are subroutine calls, not back
+ references. PCRE supports relative numbers with this syntax (I don't think
+ Oniguruma does).
Version 7.6 28-Jan-08
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/doc/pcrepattern.3 2008-04-10 19:55:57 UTC (rev 333)
@@ -9,7 +9,12 @@
.\" HREF
\fBpcresyntax\fP
.\"
-page. Perl's regular expressions are described in its own documentation, and
+page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
+also supports some alternative regular expression syntax (which does not
+conflict with the Perl syntax) in order to provide some compatibility with
+regular expressions in Python, .NET, and Oniguruma.
+.P
+Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some of which
have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers regular expressions in great detail. This
@@ -310,6 +315,20 @@
.\"
.
.
+.SS "Absolute and relative subroutine calls"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes is an alternative
+syntax for referencing a subpattern as a "subroutine". Details are discussed
+.\" HTML <a href="#onigurumasubroutines">
+.\" </a>
+later.
+.\"
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
.SS "Generic character types"
.rs
.sp
@@ -2023,6 +2042,27 @@
processing option does not affect the called subpattern.
.
.
+.\" HTML <a name="onigurumasubroutines"></a>
+.SH "ONIGURUMA SUBROUTINE SYNTAX"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes is an alternative
+syntax for referencing a subpattern as a subroutine, possibly recursively. Here
+are two of the examples used above, rewritten using this syntax:
+.sp
+ (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
+ (sens|respons)e and \eg'1'ibility
+.sp
+PCRE supports an extension to Oniguruma: if a number is preceded by a
+plus or a minus sign it is taken as a relative reference. For example:
+.sp
+ (abc)(?i:\eg<-1>)
+.sp
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
.SH CALLOUTS
.rs
.sp
@@ -2192,6 +2232,6 @@
.rs
.sp
.nf
-Last updated: 17 September 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 10 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
.fi
Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/doc/pcresyntax.3 2008-04-10 19:55:57 UTC (rev 333)
@@ -304,6 +304,8 @@
(?<!...) negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
+.
+.
.SH "BACKREFERENCES"
.rs
.sp
@@ -327,6 +329,14 @@
(?-n) call subpattern by relative number
(?&name) call subpattern by name (Perl)
(?P>name) call subpattern by name (Python)
+ \eg<name> call subpattern by name (Oniguruma)
+ \eg'name' call subpattern by name (Oniguruma)
+ \eg<n> call subpattern by absolute number (Oniguruma)
+ \eg'n' call subpattern by absolute number (Oniguruma)
+ \eg<+n> call subpattern by relative number (PCRE extension)
+ \eg'+n' call subpattern by relative number (PCRE extension)
+ \eg<-n> call subpattern by relative number (PCRE extension)
+ \eg'-n' call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
@@ -418,6 +428,6 @@
.rs
.sp
.nf
-Last updated: 14 November 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 09 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/pcre_compile.c 2008-04-10 19:55:57 UTC (rev 333)
@@ -295,8 +295,8 @@
/* 55 */
"repeating a DEFINE group is not allowed\0"
"inconsistent NEWLINE options\0"
- "\\g is not followed by a braced name or an optionally braced non-zero number\0"
- "(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number\0"
+ "\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
+ "a numbered reference must not be zero\0"
"(*VERB) with an argument is not supported\0"
/* 60 */
"(*VERB) not recognized\0"
@@ -531,14 +531,31 @@
*errorcodeptr = ERR37;
break;
- /* \g must be followed by a number, either plain or braced. If positive, it
- is an absolute backreference. If negative, it is a relative backreference.
- This is a Perl 5.10 feature. Perl 5.10 also supports \g{name} as a
- reference to a named group. This is part of Perl's movement towards a
- unified syntax for back references. As this is synonymous with \k{name}, we
- fudge it up by pretending it really was \k. */
+ /* \g must be followed by one of a number of specific things:
+
+ (1) A number, either plain or braced. If positive, it is an absolute
+ backreference. If negative, it is a relative backreference. This is a Perl
+ 5.10 feature.
+
+ (2) Perl 5.10 also supports \g{name} as a reference to a named group. This
+ is part of Perl's movement towards a unified syntax for back references. As
+ this is synonymous with \k{name}, we fudge it up by pretending it really
+ was \k.
+
+ (3) For Oniguruma compatibility we also support \g followed by a name or a
+ number either in angle brackets or in single quotes. However, these are
+ (possibly recursive) subroutine calls, _not_ backreferences. Just return
+ the -ESC_g code (cf \k). */
case 'g':
+ if (ptr[1] == '<' || ptr[1] == '\'')
+ {
+ c = -ESC_g;
+ break;
+ }
+
+ /* Handle the Perl-compatible cases */
+
if (ptr[1] == '{')
{
const uschar *p;
@@ -565,17 +582,23 @@
while ((digitab[ptr[1]] & ctype_digit) != 0)
c = c * 10 + *(++ptr) - '0';
- if (c < 0)
+ if (c < 0) /* Integer overflow */
{
*errorcodeptr = ERR61;
break;
}
-
- if (c == 0 || (braced && *(++ptr) != '}'))
+
+ if (braced && *(++ptr) != '}')
{
*errorcodeptr = ERR57;
break;
}
+
+ if (c == 0)
+ {
+ *errorcodeptr = ERR58;
+ break;
+ }
if (negated)
{
@@ -611,7 +634,7 @@
c -= '0';
while ((digitab[ptr[1]] & ctype_digit) != 0)
c = c * 10 + *(++ptr) - '0';
- if (c < 0)
+ if (c < 0) /* Integer overflow */
{
*errorcodeptr = ERR61;
break;
@@ -4567,7 +4590,7 @@
references (?P=name) and recursion (?P>name), as well as falling
through from the Perl recursion syntax (?&name). We also come here from
the Perl \k<name> or \k'name' back reference syntax and the \k{name}
- .NET syntax. */
+ .NET syntax, and the Oniguruma \g<...> and \g'...' subroutine syntax. */
NAMED_REF_OR_RECURSE:
name = ++ptr;
@@ -4645,6 +4668,15 @@
case '5': case '6': case '7': case '8': case '9': /* subroutine */
{
const uschar *called;
+ terminator = ')';
+
+ /* Come here from the \g<...> and \g'...' code (Oniguruma
+ compatibility). However, the syntax has been checked to ensure that
+ the ... are a (signed) number, so that neither ERR63 nor ERR29 will
+ be called on this path, nor will the jump to OTHER_CHAR_AFTER_QUERY
+ ever be taken. */
+
+ HANDLE_NUMERICAL_RECURSION:
if ((refsign = *ptr) == '+')
{
@@ -4666,7 +4698,7 @@
while((digitab[*ptr] & ctype_digit) != 0)
recno = recno * 10 + *ptr++ - '0';
- if (*ptr != ')')
+ if (*ptr != terminator)
{
*errorcodeptr = ERR29;
goto FAILED;
@@ -5062,7 +5094,7 @@
back references and those types that consume a character may be repeated.
We can test for values between ESC_b and ESC_Z for the latter; this may
have to change if any new ones are ever created. */
-
+
case '\\':
tempptr = ptr;
c = check_escape(&ptr, errorcodeptr, cd->bracount, options, FALSE);
@@ -5089,6 +5121,63 @@
zerofirstbyte = firstbyte;
zeroreqbyte = reqbyte;
+
+ /* \g<name> or \g'name' is a subroutine call by name and \g<n> or \g'n'
+ is a subroutine call by number (Oniguruma syntax). In fact, the value
+ -ESC_g is returned only for these cases. So we don't need to check for <
+ or ' if the value is -ESC_g. For the Perl syntax \g{n} the value is
+ -ESC_REF+n, and for the Perl syntax \g{name} the result is -ESC_k (as
+ that is a synonym). */
+
+ if (-c == ESC_g)
+ {
+ const uschar *p;
+ terminator = (*(++ptr) == '<')? '>' : '\'';
+
+ /* These two statements stop the compiler from warning about possibly
+ unset variables caused by the jump to HANDLE_NUMERICAL_RECURSION. In
+ fact, because we actually check for a number below, the paths that
+ would actually be in error are never taken. */
+
+ skipbytes = 0;
+ reset_bracount = FALSE;
+
+ /* Test for a name */
+
+ if (ptr[1] != '+' && ptr[1] != '-')
+ {
+ BOOL isnumber = TRUE;
+ for (p = ptr + 1; *p != 0 && *p != terminator; p++)
+ {
+ if ((cd->ctypes[*p] & ctype_digit) == 0) isnumber = FALSE;
+ if ((cd->ctypes[*p] & ctype_word) == 0) break;
+ }
+ if (*p != terminator)
+ {
+ *errorcodeptr = ERR57;
+ break;
+ }
+ if (isnumber)
+ {
+ ptr++;
+ goto HANDLE_NUMERICAL_RECURSION;
+ }
+ is_recurse = TRUE;
+ goto NAMED_REF_OR_RECURSE;
+ }
+
+ /* Test a signed number in angle brackets or quotes. */
+
+ p = ptr + 2;
+ while ((digitab[*p] & ctype_digit) != 0) p++;
+ if (*p != terminator)
+ {
+ *errorcodeptr = ERR57;
+ break;
+ }
+ ptr++;
+ goto HANDLE_NUMERICAL_RECURSION;
+ }
/* \k<name> or \k'name' is a back reference by name (Perl syntax).
We also support \k{name} (.NET syntax) */
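
Illustration (not part of the commit): the dispatch on the character after \g
described in the comments above can be sketched as a small standalone
classifier. This is a simplified sketch under stated assumptions, not the code
in pcre_compile.c: the names g_kind, skip_digits, skip_word and classify_g are
hypothetical, only the surface shape of the escape is checked, and neither the
zero check nor the lookup of the referenced group is attempted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; they are not PCRE names. */
typedef enum { G_INVALID, G_BACKREF, G_SUBROUTINE } g_kind;

/* Skip a run of decimal digits and return a pointer to what follows. */
static const char *skip_digits(const char *q)
{
  while (isdigit((unsigned char)*q)) q++;
  return q;
}

/* Skip a run of word characters (letters, digits, underscore). */
static const char *skip_word(const char *q)
{
  while (isalnum((unsigned char)*q) || *q == '_') q++;
  return q;
}

/* Classify the text that follows "\g"; p points just past the 'g'. */
static g_kind classify_g(const char *p)
{
  const char *start, *end;
  assert(p != NULL);

  if (*p == '<' || *p == '\'')            /* Oniguruma forms */
    {
    char terminator = (*p == '<') ? '>' : '\'';
    start = p + 1;
    if (*start == '+' || *start == '-')   /* relative number (PCRE extension) */
      {
      start++;
      end = skip_digits(start);
      }
    else end = skip_word(start);          /* name or absolute number */
    return (end > start && *end == terminator) ? G_SUBROUTINE : G_INVALID;
    }

  if (*p == '{')                          /* Perl braced forms */
    {
    start = p + 1;
    if (*start == '-')                    /* relative back reference */
      {
      start++;
      end = skip_digits(start);
      }
    else end = skip_word(start);          /* name or absolute number */
    return (end > start && *end == '}') ? G_BACKREF : G_INVALID;
    }

  if (*p == '-') p++;                     /* Perl plain number: \g1 or \g-1 */
  return isdigit((unsigned char)*p) ? G_BACKREF : G_INVALID;
}
```

With this sketch, classify_g("<-1>") and classify_g("'name'") report a
subroutine call, classify_g("{-1}") reports a back reference, and the
mismatched delimiters in "<-1'c" are rejected.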
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/pcre_internal.h 2008-04-10 19:55:57 UTC (rev 333)
@@ -613,7 +613,7 @@
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
ESC_W, ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H, ESC_h,
- ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF };
+ ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k, ESC_REF };
/* Opcode table: Starting from 1 (i.e. after OP_END), the values up to
Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/testdata/testinput2 2008-04-10 19:55:57 UTC (rev 333)
@@ -2589,4 +2589,58 @@
/[[:a\dz:]]/
+/^(?<name>a|b\g<name>c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(?<name>a|b\g'name'c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(a|b\g<1>c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(a|b\g'1'c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(a|b\g'-1'c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/(^(a|b\g<-1>c))/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/(^(a|b\g<-1'c))/
+
+/(^(a|b\g{-1}))/
+ bacxxx
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+ XaaX
+ XAAX
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+ XaaX
+ ** Failers
+ XAAX
+
+/(?-i:\g<+1>)(?i:(a))/
+ XaaX
+ XAAX
+
/ End of testinput2 /
Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2 2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/testdata/testoutput2 2008-04-10 19:55:57 UTC (rev 333)
@@ -8074,13 +8074,13 @@
Failed: reference to non-existent subpattern at offset 7
/^(a)\g/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5
/^(a)\g{0}/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 7
+Failed: a numbered reference must not be zero at offset 8
/^(a)\g{3/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 8
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 8
/^(a)\g{4a}/
Failed: reference to non-existent subpattern at offset 9
@@ -8217,13 +8217,13 @@
No match
/x(?-0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5
/x(?-1)y/
Failed: reference to non-existent subpattern at offset 5
/x(?+0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5
/x(?+1)y/
Failed: reference to non-existent subpattern at offset 5
@@ -9385,4 +9385,124 @@
/[[:a\dz:]]/
Failed: unknown POSIX class name at offset 3
+/^(?<name>a|b\g<name>c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(?<name>a|b\g'name'c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g<1>c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'1'c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'-1'c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/(^(a|b\g<-1>c))/
+ aaaa
+ 0: a
+ 1: a
+ 2: a
+ bacxxx
+ 0: bac
+ 1: bac
+ 2: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ 2: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+ 2: bbbaccc
+
+/(^(a|b\g<-1'c))/
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 15
+
+/(^(a|b\g{-1}))/
+ bacxxx
+No match
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+ XaaX
+ 0: aa
+ 1: a
+ XAAX
+ 0: AA
+ 1: A
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+ XaaX
+ 0: aa
+ 1: a
+ ** Failers
+No match
+ XAAX
+No match
+
+/(?-i:\g<+1>)(?i:(a))/
+ XaaX
+ 0: aa
+ 1: a
+ XAAX
+ 0: AA
+ 1: A
+
/ End of testinput2 /
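
Illustration (not part of the commit): the first new tests use a group that
calls itself, /^(a|b\g<1>c)/. What such a recursive subroutine call matches can
be shown with a hand-written recursive function. This is only a sketch of the
semantics for that one pattern; match_group1 is a hypothetical name and the
code does not use PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written equivalent of group 1 in ^(a|b\g<1>c), i.e. a | b <group 1> c.
   Returns the number of characters matched at s, or -1 for no match. */
static int match_group1(const char *s)
{
  assert(s != NULL);
  if (s[0] == 'a') return 1;              /* first alternative: a */
  if (s[0] == 'b')                        /* second alternative: b \g<1> c */
    {
    int inner = match_group1(s + 1);      /* the call re-runs the whole group */
    if (inner >= 0 && s[1 + inner] == 'c') return inner + 2;
    }
  return -1;
}
```

For the subjects in the test data, match_group1 returns 1 for "aaaa", 3 for
"bacxxx", 5 for "bbaccxxx" and 7 for "bbbacccxx", which agrees with the lengths
of the matched strings listed in testoutput2. A back reference such as \g{-1}
would instead require the text previously captured by the group, which is why
/(^(a|b\g{-1}))/ does not match "bacxxx".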