[Pcre-svn] [333] code/trunk: Add Oniguruma syntax \g<...> an…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [333] code/trunk: Add Oniguruma syntax \g<...> and \g'...' for subroutine calls.
Revision: 333
          http://vcs.pcre.org/viewvc?view=rev&revision=333
Author:   ph10
Date:     2008-04-10 20:55:57 +0100 (Thu, 10 Apr 2008)


Log Message:
-----------
Add Oniguruma syntax \g<...> and \g'...' for subroutine calls.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcresyntax.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/ChangeLog    2008-04-10 19:55:57 UTC (rev 333)
@@ -45,6 +45,11 @@


 10. Applied Alan Lehotsky's patch to add REG_STARTEND support to the POSIX
     matching function regexec().
+    
+11. Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n', 
+    which, however, unlike Perl's \g{...}, are subroutine calls, not back 
+    references. PCRE supports relative numbers with this syntax (I don't think
+    Oniguruma does).



Version 7.6 28-Jan-08

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/doc/pcrepattern.3    2008-04-10 19:55:57 UTC (rev 333)
@@ -9,7 +9,12 @@
 .\" HREF
 \fBpcresyntax\fP
 .\"
-page. Perl's regular expressions are described in its own documentation, and
+page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
+also supports some alternative regular expression syntax (which does not
+conflict with the Perl syntax) in order to provide some compatibility with
+regular expressions in Python, .NET, and Oniguruma.
+.P
+Perl's regular expressions are described in its own documentation, and
 regular expressions in general are covered in a number of books, some of which
 have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
 published by O'Reilly, covers regular expressions in great detail. This
@@ -310,6 +315,20 @@
 .\"
 .
 .
+.SS "Absolute and relative subroutine calls"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or 
+a number enclosed either in angle brackets or single quotes is an alternative 
+syntax for referencing a subpattern as a "subroutine". Details are discussed 
+.\" HTML <a href="#onigurumasubroutines">
+.\" </a>
+later.
+.\"
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP 
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
 .SS "Generic character types"
 .rs
 .sp
@@ -2023,6 +2042,27 @@
 processing option does not affect the called subpattern.
 .
 .
+.\" HTML <a name="onigurumasubroutines"></a>
+.SH "ONIGURUMA SUBROUTINE SYNTAX"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or 
+a number enclosed either in angle brackets or single quotes is an alternative 
+syntax for referencing a subpattern as a subroutine, possibly recursively. Here 
+are two of the examples used above, rewritten using this syntax:
+.sp
+  (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
+  (sens|respons)e and \eg'1'ibility
+.sp
+PCRE supports an extension to Oniguruma: if a number is preceded by a 
+plus or a minus sign it is taken as a relative reference. For example:
+.sp
+  (abc)(?i:\eg<-1>)
+.sp
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP 
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
 .SH CALLOUTS
 .rs
 .sp
@@ -2192,6 +2232,6 @@
 .rs
 .sp
 .nf
-Last updated: 17 September 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 10 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/doc/pcresyntax.3    2008-04-10 19:55:57 UTC (rev 333)
@@ -304,6 +304,8 @@
   (?<!...)       negative look behind
 .sp
 Each top-level branch of a look behind must be of a fixed length.
+.
+.
 .SH "BACKREFERENCES"
 .rs
 .sp
@@ -327,6 +329,14 @@
   (?-n)          call subpattern by relative number
   (?&name)       call subpattern by name (Perl)
   (?P>name)      call subpattern by name (Python)
+  \eg<name>       call subpattern by name (Oniguruma) 
+  \eg'name'       call subpattern by name (Oniguruma) 
+  \eg<n>          call subpattern by absolute number (Oniguruma) 
+  \eg'n'          call subpattern by absolute number (Oniguruma) 
+  \eg<+n>         call subpattern by relative number (PCRE extension)
+  \eg'+n'         call subpattern by relative number (PCRE extension)
+  \eg<-n>         call subpattern by relative number (PCRE extension)
+  \eg'-n'         call subpattern by relative number (PCRE extension)
 .
 .
 .SH "CONDITIONAL PATTERNS"
@@ -418,6 +428,6 @@
 .rs
 .sp
 .nf
-Last updated: 14 November 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 09 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
 .fi
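
Editorial note (illustrative, not part of the commit): the relative forms
\g<+n> and \g<-n> listed above count capturing groups relative to the point
of the call. The sketch below shows one way such a number could be turned
into an absolute group number. The function name resolve_relative and its
parameters are hypothetical and do not exist in PCRE; the arithmetic is an
assumption chosen to agree with the test cases in this commit (for example,
\g<-1> inside the second group of /(^(a|b\g<-1>c))/ refers to group 2).

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
 *   sign   : '+' or '-' as written in the pattern
 *   n      : the unsigned number that follows the sign
 *   opened : how many capturing groups have been opened before the call
 * Returns the absolute group number, or 0 if the reference is invalid. */
int resolve_relative(char sign, int n, int opened)
{
  if (n <= 0) return 0;              /* a numbered reference must not be zero */

  if (sign == '-') {
    int target = opened - n + 1;     /* -1 is the most recently opened group */
    return target > 0 ? target : 0;  /* points before the first group */
  }

  if (sign == '+') return opened + n; /* +1 is the next group to be opened */

  return 0;                          /* not a relative reference */
}
```

A forward result such as resolve_relative('+', 1, 0) only names a group
number; whether that group exists can be known only once the whole pattern
has been read, so this sketch does not check it.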


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/pcre_compile.c    2008-04-10 19:55:57 UTC (rev 333)
@@ -295,8 +295,8 @@
   /* 55 */
   "repeating a DEFINE group is not allowed\0"
   "inconsistent NEWLINE options\0"
-  "\\g is not followed by a braced name or an optionally braced non-zero number\0"
-  "(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number\0"
+  "\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
+  "a numbered reference must not be zero\0"
   "(*VERB) with an argument is not supported\0"
   /* 60 */
   "(*VERB) not recognized\0"
@@ -531,14 +531,31 @@
     *errorcodeptr = ERR37;
     break;


-    /* \g must be followed by a number, either plain or braced. If positive, it
-    is an absolute backreference. If negative, it is a relative backreference.
-    This is a Perl 5.10 feature. Perl 5.10 also supports \g{name} as a
-    reference to a named group. This is part of Perl's movement towards a
-    unified syntax for back references. As this is synonymous with \k{name}, we
-    fudge it up by pretending it really was \k. */
+    /* \g must be followed by one of a number of specific things:
+    
+    (1) A number, either plain or braced. If positive, it is an absolute
+    backreference. If negative, it is a relative backreference. This is a Perl
+    5.10 feature.
+    
+    (2) Perl 5.10 also supports \g{name} as a reference to a named group. This
+    is part of Perl's movement towards a unified syntax for back references. As
+    this is synonymous with \k{name}, we fudge it up by pretending it really
+    was \k.
+    
+    (3) For Oniguruma compatibility we also support \g followed by a name or a 
+    number either in angle brackets or in single quotes. However, these are 
+    (possibly recursive) subroutine calls, _not_ backreferences. Just return 
+    the -ESC_g code (cf \k). */


     case 'g':
+    if (ptr[1] == '<' || ptr[1] == '\'')
+      {
+      c = -ESC_g;
+      break;  
+      }  
+
+    /* Handle the Perl-compatible cases */
+ 
     if (ptr[1] == '{')
       {
       const uschar *p;
@@ -565,17 +582,23 @@
     while ((digitab[ptr[1]] & ctype_digit) != 0)
       c = c * 10 + *(++ptr) - '0';


-    if (c < 0)
+    if (c < 0)   /* Integer overflow */
       {
       *errorcodeptr = ERR61;
       break;
       }
-
-    if (c == 0 || (braced && *(++ptr) != '}'))
+      
+    if (braced && *(++ptr) != '}')
       {
       *errorcodeptr = ERR57;
       break;
       }
+      
+    if (c == 0)
+      {
+      *errorcodeptr = ERR58;
+      break;
+      }     


     if (negated)
       {
@@ -611,7 +634,7 @@
       c -= '0';
       while ((digitab[ptr[1]] & ctype_digit) != 0)
         c = c * 10 + *(++ptr) - '0';
-      if (c < 0)
+      if (c < 0)    /* Integer overflow */
         {
         *errorcodeptr = ERR61;
         break;
@@ -4567,7 +4590,7 @@
         references (?P=name) and recursion (?P>name), as well as falling
         through from the Perl recursion syntax (?&name). We also come here from
         the Perl \k<name> or \k'name' back reference syntax and the \k{name}
-        .NET syntax. */
+        .NET syntax, and the Oniguruma \g<...> and \g'...' subroutine syntax. */


         NAMED_REF_OR_RECURSE:
         name = ++ptr;
@@ -4645,6 +4668,15 @@
         case '5': case '6': case '7': case '8': case '9':   /* subroutine */
           {
           const uschar *called;
+          terminator = ')';
+          
+          /* Come here from the \g<...> and \g'...' code (Oniguruma 
+          compatibility). However, the syntax has been checked to ensure that 
+          the ... are a (signed) number, so that neither ERR63 nor ERR29 will 
+          be called on this path, nor will the jump to OTHER_CHAR_AFTER_QUERY
+          ever be taken. */
+          
+          HANDLE_NUMERICAL_RECURSION: 


           if ((refsign = *ptr) == '+')
             {
@@ -4666,7 +4698,7 @@
           while((digitab[*ptr] & ctype_digit) != 0)
             recno = recno * 10 + *ptr++ - '0';


-          if (*ptr != ')')
+          if (*ptr != terminator)
             {
             *errorcodeptr = ERR29;
             goto FAILED;
@@ -5062,7 +5094,7 @@
     back references and those types that consume a character may be repeated.
     We can test for values between ESC_b and ESC_Z for the latter; this may
     have to change if any new ones are ever created. */
-
+    
     case '\\':
     tempptr = ptr;
     c = check_escape(&ptr, errorcodeptr, cd->bracount, options, FALSE);
@@ -5089,6 +5121,63 @@


       zerofirstbyte = firstbyte;
       zeroreqbyte = reqbyte;
+      
+      /* \g<name> or \g'name' is a subroutine call by name and \g<n> or \g'n' 
+      is a subroutine call by number (Oniguruma syntax). In fact, the value 
+      -ESC_g is returned only for these cases. So we don't need to check for <
+      or ' if the value is -ESC_g. For the Perl syntax \g{n} the value is
+      -ESC_REF+n, and for the Perl syntax \g{name} the result is -ESC_k (as
+      that is a synonym). */
+      
+      if (-c == ESC_g)
+        {
+        const uschar *p;
+        terminator = (*(++ptr) == '<')? '>' : '\'';
+        
+        /* These two statements stop the compiler from warning about possibly
+        unset variables caused by the jump to HANDLE_NUMERICAL_RECURSION. In 
+        fact, because we actually check for a number below, the paths that 
+        would actually be in error are never taken. */
+          
+        skipbytes = 0;
+        reset_bracount = FALSE; 
+        
+        /* Test for a name */
+        
+        if (ptr[1] != '+' && ptr[1] != '-')
+          { 
+          BOOL isnumber = TRUE; 
+          for (p = ptr + 1; *p != 0 && *p != terminator; p++)
+            {  
+            if ((cd->ctypes[*p] & ctype_digit) == 0) isnumber = FALSE;
+            if ((cd->ctypes[*p] & ctype_word) == 0) break;
+            }
+          if (*p != terminator)
+            {
+            *errorcodeptr = ERR57;
+            break; 
+            }    
+          if (isnumber) 
+            {
+            ptr++; 
+            goto HANDLE_NUMERICAL_RECURSION;
+            } 
+          is_recurse = TRUE;
+          goto NAMED_REF_OR_RECURSE;
+          }
+        
+        /* Test a signed number in angle brackets or quotes. */
+        
+        p = ptr + 2;
+        while ((digitab[*p] & ctype_digit) != 0) p++;
+        if (*p != terminator)
+          {
+          *errorcodeptr = ERR57;
+          break;
+          }
+        ptr++;   
+        goto HANDLE_NUMERICAL_RECURSION;
+        }    


       /* \k<name> or \k'name' is a back reference by name (Perl syntax).
       We also support \k{name} (.NET syntax) */
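
Editorial note (illustrative, not part of the commit): the comment added to
check_escape() describes three families of \g syntax. The standalone sketch
below classifies the text after "\g" along those lines. It is not the PCRE
code: the type g_kind, the function classify_g, and the rule that '+' is
accepted only in the angle-bracket and quoted forms are assumptions made for
this example, and the sketch checks neither for a zero number nor for
whether the referenced group exists.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical names for this sketch; none of them exist in PCRE. */
typedef enum {
  G_INVALID,          /* would be reported as a compile-time error */
  G_BACKREF_NUMBER,   /* \g2, \g-1, \g{2}, \g{-1} */
  G_BACKREF_NAME,     /* \g{name} */
  G_CALL_NUMBER,      /* \g<2>, \g'2', \g<-1>, \g<+1> */
  G_CALL_NAME         /* \g<name>, \g'name' */
} g_kind;

/* Classify the text that follows "\g"; s points just past the 'g'. */
g_kind classify_g(const char *s)
{
  char close = 0;               /* expected closing delimiter, 0 if none */
  int is_call = 0, has_sign = 0;
  size_t word = 0, digits = 0;
  const char *p = s;

  if (*s == '<')       { close = '>';  is_call = 1; }
  else if (*s == '\'') { close = '\''; is_call = 1; }
  else if (*s == '{')  { close = '}'; }
  if (close != 0) p++;

  /* Assumption: '-' is accepted everywhere, '+' only in the call forms. */
  if (*p == '-' || (is_call && *p == '+')) { has_sign = 1; p++; }

  if (close == 0) {
    /* Plain form: an undelimited run of digits. */
    while (isdigit((unsigned char)*p)) { p++; digits++; }
    return digits > 0 ? G_BACKREF_NUMBER : G_INVALID;
  }

  /* Delimited form: scan word characters, counting how many are digits. */
  for (; isalnum((unsigned char)*p) || *p == '_'; p++) {
    word++;
    if (isdigit((unsigned char)*p)) digits++;
  }
  if (word == 0 || *p != close) return G_INVALID;
  if (digits == word) return is_call ? G_CALL_NUMBER : G_BACKREF_NUMBER;
  if (has_sign) return G_INVALID;   /* a sign must be followed by digits only */
  return is_call ? G_CALL_NAME : G_BACKREF_NAME;
}
```

The point of the split is that the delimiter alone decides between a back
reference (braces or no delimiter) and a subroutine call (angle brackets or
single quotes); the body then decides between a number and a name.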


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/pcre_internal.h    2008-04-10 19:55:57 UTC (rev 333)
@@ -613,7 +613,7 @@


 enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
        ESC_W, ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H, ESC_h,
-       ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF };
+       ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k, ESC_REF };



/* Opcode table: Starting from 1 (i.e. after OP_END), the values up to

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/testdata/testinput2    2008-04-10 19:55:57 UTC (rev 333)
@@ -2589,4 +2589,58 @@


/[[:a\dz:]]/

+/^(?<name>a|b\g<name>c)/
+    aaaa
+    bacxxx
+    bbaccxxx 
+    bbbacccxx
+
+/^(?<name>a|b\g'name'c)/
+    aaaa
+    bacxxx
+    bbaccxxx 
+    bbbacccxx
+
+/^(a|b\g<1>c)/
+    aaaa
+    bacxxx
+    bbaccxxx 
+    bbbacccxx
+
+/^(a|b\g'1'c)/
+    aaaa
+    bacxxx
+    bbaccxxx 
+    bbbacccxx
+
+/^(a|b\g'-1'c)/
+    aaaa
+    bacxxx
+    bbaccxxx 
+    bbbacccxx
+
+/(^(a|b\g<-1>c))/
+    aaaa
+    bacxxx
+    bbaccxxx 
+    bbbacccxx
+
+/(^(a|b\g<-1'c))/
+
+/(^(a|b\g{-1}))/
+    bacxxx
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+    XaaX
+    XAAX 
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+    XaaX
+    ** Failers 
+    XAAX 
+
+/(?-i:\g<+1>)(?i:(a))/
+    XaaX
+    XAAX 
+
 / End of testinput2 /


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2008-04-05 16:11:05 UTC (rev 332)
+++ code/trunk/testdata/testoutput2    2008-04-10 19:55:57 UTC (rev 333)
@@ -8074,13 +8074,13 @@
 Failed: reference to non-existent subpattern at offset 7


/^(a)\g/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5

/^(a)\g{0}/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 7
+Failed: a numbered reference must not be zero at offset 8

/^(a)\g{3/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 8
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 8

/^(a)\g{4a}/
Failed: reference to non-existent subpattern at offset 9
@@ -8217,13 +8217,13 @@
No match

/x(?-0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5

/x(?-1)y/
Failed: reference to non-existent subpattern at offset 5

/x(?+0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5

/x(?+1)y/
Failed: reference to non-existent subpattern at offset 5
@@ -9385,4 +9385,124 @@
/[[:a\dz:]]/
Failed: unknown POSIX class name at offset 3

+/^(?<name>a|b\g<name>c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx 
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(?<name>a|b\g'name'c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx 
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g<1>c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx 
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'1'c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx 
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'-1'c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx 
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/(^(a|b\g<-1>c))/
+    aaaa
+ 0: a
+ 1: a
+ 2: a
+    bacxxx
+ 0: bac
+ 1: bac
+ 2: bac
+    bbaccxxx 
+ 0: bbacc
+ 1: bbacc
+ 2: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+ 2: bbbaccc
+
+/(^(a|b\g<-1'c))/
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 15
+
+/(^(a|b\g{-1}))/
+    bacxxx
+No match
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+    XaaX
+ 0: aa
+ 1: a
+    XAAX 
+ 0: AA
+ 1: A
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+    XaaX
+ 0: aa
+ 1: a
+    ** Failers 
+No match
+    XAAX 
+No match
+
+/(?-i:\g<+1>)(?i:(a))/
+    XaaX
+ 0: aa
+ 1: a
+    XAAX 
+ 0: AA
+ 1: A
+
 / End of testinput2 /
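
Editorial note (illustrative, not part of the commit): the new tests for
/^(a|b\g<1>c)/ show what a subroutine call means. The call re-runs the
pattern of group 1 at the current position, so the group matches "a",
"bac", "bbacc", and so on. A back reference, as in /(^(a|b\g{-1}))/, would
instead require the text already captured by the group, which is why
"bacxxx" gives "No match" there. The hand-written matcher below is a sketch
of the recursive reading only; the name match_g is hypothetical, and the
code handles just this one pattern, not regular expressions in general.

```c
#include <stddef.h>

/* Hypothetical sketch: match the start of s against the recursive rule
 *     G := 'a' | 'b' G 'c'
 * which is what group 1 of /^(a|b\g<1>c)/ describes.
 * Returns the number of characters matched, or -1 if there is no match. */
int match_g(const char *s)
{
  if (s[0] == 'a') return 1;            /* first alternative: a */

  if (s[0] == 'b') {                    /* second alternative: b G c */
    int inner = match_g(s + 1);         /* the "subroutine call" to G */
    if (inner >= 0 && s[1 + inner] == 'c') return inner + 2;
  }

  return -1;
}
```

No backtracking is needed here because the two alternatives begin with
different characters. The returned lengths for the subjects used in the
tests (1, 3, 5 and 7) correspond to the matched strings "a", "bac",
"bbacc" and "bbbaccc" shown in the expected output.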