[Pcre-svn] [457] code/trunk: Allow duplicate names for same…

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [457] code/trunk: Allow duplicate names for same-numbered groups; forbid different names.
Revision: 457
          http://vcs.pcre.org/viewvc?view=rev&revision=457
Author:   ph10
Date:     2009-10-03 17:24:08 +0100 (Sat, 03 Oct 2009)


Log Message:
-----------
Allow duplicate names for same-numbered groups; forbid different names.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrecompat.3
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcreposix.c
    code/trunk/testdata/testinput11
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput11
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/ChangeLog    2009-10-03 16:24:08 UTC (rev 457)
@@ -151,6 +151,12 @@
     pcre_dfa_exec() that meant it never actually used it. Double oops. There
     were also very few tests of studied patterns with pcre_dfa_exec().


+27. If (?| is used to create subpatterns with duplicate numbers, they are now
+    allowed to have the same name, even if PCRE_DUPNAMES is not set. However,
+    on the other side of the coin, they are no longer allowed to have different
+    names, because these cannot be distinguished in PCRE, and this has caused
+    confusion. (This is a difference from Perl.)
+    


Version 7.9 11-Apr-09
---------------------

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/doc/pcreapi.3    2009-10-03 16:24:08 UTC (rev 457)
@@ -1012,10 +1012,27 @@
 length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
 entry of the table (a pointer to \fBchar\fP). The first two bytes of each entry
 are the number of the capturing parenthesis, most significant byte first. The
-rest of the entry is the corresponding name, zero terminated. The names are in
-alphabetical order. When PCRE_DUPNAMES is set, duplicate names are in order of
-their parentheses numbers. For example, consider the following pattern (assume
-PCRE_EXTENDED is set, so white space - including newlines - is ignored):
+rest of the entry is the corresponding name, zero terminated. 
+.P
+The names are in alphabetical order. Duplicate names may appear if (?| is used
+to create multiple groups with the same number, as described in the
+.\" HTML <a href="pcrepattern.html#dupsubpatternnumber">
+.\" </a>
+section on duplicate subpattern numbers
+.\"
+in the
+.\" HREF
+\fBpcrepattern\fP
+.\"
+page. Duplicate names for subpatterns with different numbers are permitted only 
+if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the 
+table in the order in which they were found in the pattern. In the absence of 
+(?| this is the order of increasing number; when (?| is used this is not 
+necessarily the case because later subpatterns may have lower numbers.
+.P
+As a simple example of the name/number table, consider the following pattern
+(assume PCRE_EXTENDED is set, so white space - including newlines - is
+ignored):
 .sp
 .\" JOIN
   (?<date> (?<year>(\ed\ed)?\ed\ed) -
@@ -1789,10 +1806,20 @@
 appropriate. \fBNOTE:\fP If PCRE_DUPNAMES is set and there are duplicate names,
 the behaviour may not be what you want (see the next section).
 .P
-\fBWarning:\fP If the pattern uses the "(?|" feature to set up multiple
-subpatterns with the same number, you cannot use names to distinguish them,
-because names are not included in the compiled code. The matching process uses
-only numbers.
+\fBWarning:\fP If the pattern uses the (?| feature to set up multiple
+subpatterns with the same number, as described in the
+.\" HTML <a href="pcrepattern.html#dupsubpatternnumber">
+.\" </a>
+section on duplicate subpattern numbers
+.\"
+in the
+.\" HREF
+\fBpcrepattern\fP
+.\"
+page, you cannot use names to distinguish the different subpatterns, because
+names are not included in the compiled code. The matching process uses only
+numbers. For this reason, the use of different names for subpatterns of the
+same number causes an error at compile time.
 .
 .SH "DUPLICATE SUBPATTERN NAMES"
 .rs
@@ -1802,9 +1829,12 @@
 .B const char *\fIname\fP, char **\fIfirst\fP, char **\fIlast\fP);
 .PP
 When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns
-are not required to be unique. Normally, patterns with duplicate names are such
-that in any one match, only one of the named subpatterns participates. An
-example is shown in the
+are not required to be unique. (Duplicate names are always allowed for
+subpatterns with the same number, created by using the (?| feature. Indeed, if
+such subpatterns are named, they are required to use the same names.)
+.P
+Normally, patterns with duplicate names are such that in any one match, only
+one of the named subpatterns participates. An example is shown in the
 .\" HREF
 \fBpcrepattern\fP
 .\"
@@ -2045,6 +2075,6 @@
 .rs
 .sp
 .nf
-Last updated: 29 September 2009
+Last updated: 03 October 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrecompat.3
===================================================================
--- code/trunk/doc/pcrecompat.3    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/doc/pcrecompat.3    2009-10-03 16:24:08 UTC (rev 457)
@@ -102,10 +102,12 @@
 works internally just with numbers, using an external table to translate 
 between numbers and names. The following are some specific differences:
 .sp
-(a) After matching a pattern such as (?|(?<a>A)|(?<b)B) where the two capturing 
-parentheses have the same number but different names, it is not possible to 
-distinguish which parentheses matched, because both names map to capturing
-subpattern number 1.
+(a) A pattern such as (?|(?<a>A)|(?<b)B), where the two capturing 
+parentheses have the same number but different names, is not supported, and 
+causes an error at compile time. If it were allowed, it would not be possible 
+to distinguish which parentheses matched, because both names map to capturing
+subpattern number 1. To avoid this confusing situation, an error is given at 
+compile time.
 .sp
 (b) A condition test for a subpattern with a name that is duplicated gives
 unpredictable results. For example, when the pattern
@@ -170,6 +172,6 @@
 .rs
 .sp
 .nf
-Last updated: 29 September 2009
+Last updated: 03 October 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/doc/pcrepattern.3    2009-10-03 16:24:08 UTC (rev 457)
@@ -1175,7 +1175,6 @@
 .sp
   /(?|(abc)|(def))(?1)/
 .sp
-.P
 An alternative approach to using the "branch reset" feature is to use
 duplicate named subpatterns, as described in the next section.
 .
@@ -1216,11 +1215,13 @@
 is also a convenience function for extracting a captured substring by name.
 .P
 By default, a name must be unique within a pattern, but it is possible to relax
-this constraint by setting the PCRE_DUPNAMES option at compile time. This can
-be useful for patterns where only one instance of the named parentheses can
-match. Suppose you want to match the name of a weekday, either as a 3-letter
-abbreviation or as the full name, and in both cases you want to extract the
-abbreviation. This pattern (ignoring the line breaks) does the job:
+this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
+names are also always permitted for subpatterns with the same number, set up as 
+described in the previous section.) Duplicate names can be useful for patterns
+where only one instance of the named parentheses can match. Suppose you want to
+match the name of a weekday, either as a 3-letter abbreviation or as the full
+name, and in both cases you want to extract the abbreviation. This pattern
+(ignoring the line breaks) does the job:
 .sp
   (?<DN>Mon|Fri|Sun)(?:day)?|
   (?<DN>Tue)(?:sday)?|
@@ -1244,8 +1245,10 @@
 documentation.
 .P
 \fBWarning:\fP You cannot use different names to distinguish between two
-subpatterns with the same number (see the previous section) because PCRE uses
-only the numbers when matching.
+subpatterns with the same number because PCRE uses only the numbers when
+matching. For this reason, an error is given at compile time if different names
+are given to subpatterns with the same number. However, you can give the same
+name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
 .
 .
 .SH REPETITION
@@ -2388,6 +2391,6 @@
 .rs
 .sp
 .nf
-Last updated: 30 September 2009
+Last updated: 03 October 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/pcre_compile.c    2009-10-03 16:24:08 UTC (rev 457)
@@ -341,7 +341,9 @@
   "number is too big\0"
   "subpattern name expected\0"
   "digit expected after (?+\0"
-  "] is an invalid data character in JavaScript compatibility mode";
+  "] is an invalid data character in JavaScript compatibility mode\0"
+  /* 65 */
+  "different names for subpatterns of the same number are not allowed"; 



 /* Table to identify digits and hex digits. This is used when compiling
@@ -4867,11 +4869,24 @@
               }
             }


-          /* In the real compile, create the entry in the table */
+          /* In the real compile, create the entry in the table, maintaining
+          alphabetical order. Duplicate names for different numbers are
+          permitted only if PCRE_DUPNAMES is set. Duplicate names for the same
+          number are always OK. (An existing number can be re-used if (?|
+          appears in the pattern.) In either event, a duplicate name results in
+          a duplicate entry in the table, even if the number is the same. This
+          is because the number of names, and hence the table size, is computed
+          in the pre-compile, and it affects various numbers and pointers which
+          would all have to be modified, and the compiled code moved down, if
+          duplicates with the same number were omitted from the table. This 
+          doesn't seem worth the hassle. However, *different* names for the
+          same number are not permitted. */


           else
             {
+            BOOL dupname = FALSE;
             slot = cd->name_table;
+ 
             for (i = 0; i < cd->names_found; i++)
               {
               int crc = memcmp(name, slot+2, namelen);
@@ -4879,22 +4894,54 @@
                 {
                 if (slot[2+namelen] == 0)
                   {
-                  if ((options & PCRE_DUPNAMES) == 0)
+                  if (GET2(slot, 0) != cd->bracount + 1 &&
+                      (options & PCRE_DUPNAMES) == 0)
                     {
                     *errorcodeptr = ERR43;
                     goto FAILED;
                     }
+                  else dupname = TRUE;   
                   }
-                else crc = -1;      /* Current name is substring */
+                else crc = -1;      /* Current name is a substring */
                 }
+                
+              /* Make space in the table and break the loop for an earlier 
+              name. For a duplicate or later name, carry on. We do this for 
+              duplicates so that in the simple case (when ?(| is not used) they 
+              are in order of their numbers. */
+                
               if (crc < 0)
                 {
                 memmove(slot + cd->name_entry_size, slot,
                   (cd->names_found - i) * cd->name_entry_size);
                 break;
                 }
+                
+              /* Continue the loop for a later or duplicate name */
+               
               slot += cd->name_entry_size;
               }
+              
+            /* For non-duplicate names, check for a duplicate number before
+            adding the new name. */
+              
+            if (!dupname)
+              {
+              uschar *cslot = cd->name_table;
+              for (i = 0; i < cd->names_found; i++)
+                {
+                if (cslot != slot)
+                  {
+                  if (GET2(cslot, 0) == cd->bracount + 1)
+                    {
+                    *errorcodeptr = ERR65;
+                    goto FAILED;
+                    }    
+                  }
+                else i--;   
+                cslot += cd->name_entry_size;
+                }  
+              }   


             PUT2(slot, 0, cd->bracount + 1);
             memcpy(slot + 2, name, namelen);
@@ -4902,10 +4949,11 @@
             }
           }


-        /* In both cases, count the number of names we've encountered. */
+        /* In both pre-compile and compile, count the number of names we've
+        encountered. */


+        cd->names_found++;
         ptr++;                    /* Move past > or ' */
-        cd->names_found++;
         goto NUMBERED_GROUP;




Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/pcre_internal.h    2009-10-03 16:24:08 UTC (rev 457)
@@ -1476,7 +1476,7 @@
        ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
        ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
        ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
-       ERR60, ERR61, ERR62, ERR63, ERR64 };
+       ERR60, ERR61, ERR62, ERR63, ERR64, ERR65 };


 /* The real format of the start of the pcre block; the index of names and the
 code vector run on as long as necessary after the end. We store an explicit

Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/pcreposix.c    2009-10-03 16:24:08 UTC (rev 457)
@@ -141,7 +141,9 @@
   REG_BADPAT,  /* number is too big */
   REG_BADPAT,  /* subpattern name expected */
   REG_BADPAT,  /* digit expected after (?+ */
-  REG_BADPAT   /* ] is an invalid data character in JavaScript compatibility mode */
+  REG_BADPAT,  /* ] is an invalid data character in JavaScript compatibility mode */
+  /* 65 */
+  REG_BADPAT   /* different names for subpatterns of the same number are not allowed */ 
 };


 /* Table of texts corresponding to POSIX error codes */

Modified: code/trunk/testdata/testinput11
===================================================================
--- code/trunk/testdata/testinput11    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/testdata/testinput11    2009-10-03 16:24:08 UTC (rev 457)
@@ -283,4 +283,18 @@
 /(?<X>a)(?<=b(?&X))/
     baz


+/^(?|(abc)|(def))\1/
+    abcabc
+    defdef 
+    ** Failers
+    abcdef
+    defabc   
+    
+/^(?|(abc)|(def))(?1)/
+    abcabc
+    defabc
+    ** Failers
+    defdef
+    abcdef    
+
 /-- End of testinput11 --/


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/testdata/testinput2    2009-10-03 16:24:08 UTC (rev 457)
@@ -3098,4 +3098,10 @@
   (?(1)|.)                    # check that there was an empty component
   /xiIS


+/(?|(?<a>A)|(?<a>B))/I
+    AB\Ca
+    BA\Ca
+
+/(?|(?<a>A)|(?<b>B))/ 
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testoutput11
===================================================================
--- code/trunk/testdata/testoutput11    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/testdata/testoutput11    2009-10-03 16:24:08 UTC (rev 457)
@@ -600,4 +600,32 @@
  0: a
  1: a


+/^(?|(abc)|(def))\1/
+    abcabc
+ 0: abcabc
+ 1: abc
+    defdef 
+ 0: defdef
+ 1: def
+    ** Failers
+No match
+    abcdef
+No match
+    defabc   
+No match
+    
+/^(?|(abc)|(def))(?1)/
+    abcabc
+ 0: abcabc
+ 1: abc
+    defabc
+ 0: defabc
+ 1: def
+    ** Failers
+No match
+    defdef
+No match
+    abcdef    
+No match
+
 /-- End of testinput11 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2009-10-02 08:53:31 UTC (rev 456)
+++ code/trunk/testdata/testoutput2    2009-10-03 16:24:08 UTC (rev 457)
@@ -10220,4 +10220,24 @@
 Subject length lower bound = 2
 No set of starting bytes


+/(?|(?<a>A)|(?<a>B))/I
+Capturing subpattern count = 1
+Named capturing subpatterns:
+  a   1
+  a   1
+No options
+No first char
+No need char
+    AB\Ca
+ 0: A
+ 1: A
+  C A (1) a
+    BA\Ca
+ 0: B
+ 1: B
+  C B (1) a
+
+/(?|(?<a>A)|(?<b>B))/ 
+Failed: different names for subpatterns of the same number are not allowed at offset 15
+
 /-- End of testinput2 --/
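
The name-table rules introduced in the pcre_compile.c hunk above can be sketched in isolation. This is only a rough illustration, not the PCRE code itself: the fixed ENTRY_SIZE and MAX_NAMES, the name_table struct, and the helpers get2, put2 and add_name are all hypothetical names invented for the sketch, and the return values 43 and 65 merely echo ERR43 and ERR65 from the diff. Unlike the real code, the sketch checks for a clashing number before it moves any table entries.

```c
#include <assert.h>
#include <string.h>

#define ENTRY_SIZE 16  /* hypothetical: 2 number bytes + name + NUL */
#define MAX_NAMES  8   /* hypothetical fixed capacity */

enum { NAME_OK = 0, NAME_ERR_DUP = 43, NAME_ERR_DIFFERENT = 65 };

typedef struct {
  unsigned char table[MAX_NAMES * ENTRY_SIZE];
  int names_found;
} name_table;

/* Group numbers are stored most significant byte first. */
static int get2(const unsigned char *p) { return (p[0] << 8) | p[1]; }

static void put2(unsigned char *p, int n)
{
  p[0] = (unsigned char)(n >> 8);
  p[1] = (unsigned char)(n & 255);
}

/* Add (name, number) while keeping the table in alphabetical order.
   dupnames plays the role of the PCRE_DUPNAMES option. */
int add_name(name_table *t, const char *name, int number, int dupnames)
{
  size_t namelen = strlen(name);
  unsigned char *slot = t->table;
  int dupname = 0;
  int i;

  assert(namelen + 3 <= ENTRY_SIZE && t->names_found < MAX_NAMES);

  /* Find the insertion point: stop at the first entry that sorts later.
     Duplicates are skipped over, so a new duplicate lands after them. */
  for (i = 0; i < t->names_found; i++) {
    int crc = memcmp(name, slot + 2, namelen);
    if (crc == 0) {
      if (slot[2 + namelen] == 0) {
        /* Same name: always fine for the same number, otherwise
           only when duplicate names are enabled. */
        if (get2(slot) != number && !dupnames) return NAME_ERR_DUP;
        dupname = 1;
      } else {
        crc = -1;  /* new name is a prefix of the existing one */
      }
    }
    if (crc < 0) break;
    slot += ENTRY_SIZE;
  }

  /* A brand-new name must not reuse a number that already has a name. */
  if (!dupname) {
    const unsigned char *c = t->table;
    for (i = 0; i < t->names_found; i++, c += ENTRY_SIZE)
      if (get2(c) == number) return NAME_ERR_DIFFERENT;
  }

  /* Open a gap at the insertion point and write the new entry. */
  memmove(slot + ENTRY_SIZE, slot,
          (size_t)((t->table + t->names_found * ENTRY_SIZE) - slot));
  memset(slot, 0, ENTRY_SIZE);
  put2(slot, number);
  memcpy(slot + 2, name, namelen);
  t->names_found++;
  return NAME_OK;
}
```

Under these assumptions, adding "a" twice for group 1 succeeds and leaves two entries, matching the duplicated "a   1" lines in testoutput2, whereas adding "b" for group 1 afterwards returns 65, the case that now fails at compile time.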