[Pcre-svn] [1359] code/trunk: Refactor the code for creating the name/number table.

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [1359] code/trunk: Refactor the code for creating the name/number table.

Revision: 1359

          http://vcs.pcre.org/viewvc?view=rev&revision=1359
Author:   ph10
Date:     2013-09-03 11:10:59 +0100 (Tue, 03 Sep 2013)

Log Message:
-----------
Refactor the code for creating the name/number table.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcretest.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/ChangeLog    2013-09-03 10:10:59 UTC (rev 1359)
@@ -66,7 +66,16 @@

13. Added the -T and -TM options to pcretest.

+14. The code in pcre_compile.c for creating the table of named capturing groups
+    has been refactored. Instead of creating the table dynamically during the 
+    actual compiling pass, the information is remembered during the pre-compile 
+    pass (on the stack unless there are more than 20 named groups, in which 
+    case malloc() is used) and the whole table is created before the actual 
+    compile happens. This has simplified the code (it is now nearly 150 lines
+    shorter) and prepared the way for better handling of references to groups 
+    with duplicate names.

+
Version 8.33 28-May-2013
------------------------

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/doc/pcreapi.3    2013-09-03 10:10:59 UTC (rev 1359)
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "05 July 2013" "PCRE 8.34"
+.TH PCREAPI 3 "03 September 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .sp
@@ -1350,8 +1350,8 @@
 contains the parenthesis number. The rest of the entry is the corresponding
 name, zero terminated.
 .P
-The names are in alphabetical order. Duplicate names may appear if (?| is used
-to create multiple groups with the same number, as described in the
+The names are in alphabetical order. If (?| is used to create multiple groups
+with the same number, as described in the
 .\" HTML <a href="pcrepattern.html#dupsubpatternnumber">
 .\" </a>
 section on duplicate subpattern numbers
@@ -1360,11 +1360,13 @@
 .\" HREF
 \fBpcrepattern\fP
 .\"
-page. Duplicate names for subpatterns with different numbers are permitted only
-if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the
-table in the order in which they were found in the pattern. In the absence of
-(?| this is the order of increasing number; when (?| is used this is not
-necessarily the case because later subpatterns may have lower numbers.
+page, the groups may be given the same name, but there is only one entry in the
+table. Different names for groups of the same number are not permitted.
+Duplicate names for subpatterns with different numbers are permitted,
+but only if PCRE_DUPNAMES is set. They appear in the table in the order in
+which they were found in the pattern. In the absence of (?| this is the order
+of increasing number; when (?| is used this is not necessarily the case because
+later subpatterns may have lower numbers.
 .P
 As a simple example of the name/number table, consider the following pattern
 after compilation by the 8-bit library (assume PCRE_EXTENDED is set, so white
@@ -2847,6 +2849,6 @@
 .rs
 .sp
 .nf
-Last updated: 05 July 2013
+Last updated: 03 September 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi

Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/pcre_compile.c    2013-09-03 10:10:59 UTC (rev 1359)
@@ -115,6 +115,13 @@
 #define COMPILE_WORK_SIZE (2048*LINK_SIZE)
 #define COMPILE_WORK_SIZE_MAX (100*COMPILE_WORK_SIZE)

+/* This value determines the size of the initial vector that is used for
+remembering named groups during the pre-compile. It is allocated on the stack,
+but if it is too small, it is expanded using malloc(), in a similar way to the
+workspace. The value is the number of slots in the list. */
+
+#define NAMED_GROUP_LIST_SIZE 20
+
/* The overrun tests check for a slightly smaller size so that they detect the
overrun before it actually does run off the end of the data block. */

@@ -1358,306 +1365,6 @@

 /*************************************************
-*  Subroutine for finding forward reference      *
-*************************************************/
-
-/* This recursive function is called only from find_parens() below. The
-top-level call starts at the beginning of the pattern. All other calls must
-start at a parenthesis. It scans along a pattern's text looking for capturing
-subpatterns, and counting them. If it finds a named pattern that matches the
-name it is given, it returns its number. Alternatively, if the name is NULL, it
-returns when it reaches a given numbered subpattern. Recursion is used to keep
-track of subpatterns that reset the capturing group numbers - the (?| feature.
-
-This function was originally called only from the second pass, in which we know
-that if (?< or (?' or (?P< is encountered, the name will be correctly
-terminated because that is checked in the first pass. There is now one call to
-this function in the first pass, to check for a recursive back reference by
-name (so that we can make the whole group atomic). In this case, we need check
-only up to the current position in the pattern, and that is still OK because
-and previous occurrences will have been checked. To make this work, the test
-for "end of pattern" is a check against cd->end_pattern in the main loop,
-instead of looking for a binary zero. This means that the special first-pass
-call can adjust cd->end_pattern temporarily. (Checks for binary zero while
-processing items within the loop are OK, because afterwards the main loop will
-terminate.)
-
-Arguments:
-  ptrptr       address of the current character pointer (updated)
-  cd           compile background data
-  name         name to seek, or NULL if seeking a numbered subpattern
-  lorn         name length, or subpattern number if name is NULL
-  xmode        TRUE if we are in /x mode
-  utf          TRUE if we are in UTF-8 / UTF-16 / UTF-32 mode
-  count        pointer to the current capturing subpattern number (updated)
-
-Returns:       the number of the named subpattern, or -1 if not found
-*/
-
-static int
-find_parens_sub(pcre_uchar **ptrptr, compile_data *cd, const pcre_uchar *name, int lorn,
-  BOOL xmode, BOOL utf, int *count)
-{
-pcre_uchar *ptr = *ptrptr;
-int start_count = *count;
-int hwm_count = start_count;
-BOOL dup_parens = FALSE;
-
-/* If the first character is a parenthesis, check on the type of group we are
-dealing with. The very first call may not start with a parenthesis. */
-
-if (ptr[0] == CHAR_LEFT_PARENTHESIS)
-  {
-  /* Handle specials such as (*SKIP) or (*UTF8) etc. */
-
-  if (ptr[1] == CHAR_ASTERISK)
-    {
-    ptr += 2;
-    while (ptr < cd->end_pattern && *ptr != CHAR_RIGHT_PARENTHESIS) ptr++;
-    }
-
-  /* Handle a normal, unnamed capturing parenthesis. */
-
-  else if (ptr[1] != CHAR_QUESTION_MARK)
-    {
-    *count += 1;
-    if (name == NULL && *count == lorn) return *count;
-    ptr++;
-    }
-
-  /* All cases now have (? at the start. Remember when we are in a group
-  where the parenthesis numbers are duplicated. */
-
-  else if (ptr[2] == CHAR_VERTICAL_LINE)
-    {
-    ptr += 3;
-    dup_parens = TRUE;
-    }
-
-  /* Handle comments; all characters are allowed until a ket is reached. */
-
-  else if (ptr[2] == CHAR_NUMBER_SIGN)
-    {
-    for (ptr += 3; *ptr != CHAR_NULL; ptr++)
-      if (*ptr == CHAR_RIGHT_PARENTHESIS) break;
-    goto FAIL_EXIT;
-    }
-
-  /* Handle a condition. If it is an assertion, just carry on so that it
-  is processed as normal. If not, skip to the closing parenthesis of the
-  condition (there can't be any nested parens). */
-
-  else if (ptr[2] == CHAR_LEFT_PARENTHESIS)
-    {
-    ptr += 2;
-    if (ptr[1] != CHAR_QUESTION_MARK)
-      {
-      while (*ptr != CHAR_NULL && *ptr != CHAR_RIGHT_PARENTHESIS) ptr++;
-      if (*ptr != CHAR_NULL) ptr++;
-      }
-    }
-
-  /* Start with (? but not a condition. */
-
-  else
-    {
-    ptr += 2;
-    if (*ptr == CHAR_P) ptr++;                      /* Allow optional P */
-
-    /* We have to disambiguate (?<! and (?<= from (?<name> for named groups */
-
-    if ((*ptr == CHAR_LESS_THAN_SIGN && ptr[1] != CHAR_EXCLAMATION_MARK &&
-        ptr[1] != CHAR_EQUALS_SIGN) || *ptr == CHAR_APOSTROPHE)
-      {
-      pcre_uchar term;
-      const pcre_uchar *thisname;
-      *count += 1;
-      if (name == NULL && *count == lorn) return *count;
-      term = *ptr++;
-      if (term == CHAR_LESS_THAN_SIGN) term = CHAR_GREATER_THAN_SIGN;
-      thisname = ptr;
-      while (*ptr != term) ptr++;
-      if (name != NULL && lorn == (int)(ptr - thisname) &&
-          STRNCMP_UC_UC(name, thisname, (unsigned int)lorn) == 0)
-        return *count;
-      term++;
-      }
-    }
-  }
-
-/* Past any initial parenthesis handling, scan for parentheses or vertical
-bars. Stop if we get to cd->end_pattern. Note that this is important for the
-first-pass call when this value is temporarily adjusted to stop at the current
-position. So DO NOT change this to a test for binary zero. */
-
-for (; ptr < cd->end_pattern; ptr++)
-  {
-  /* Skip over backslashed characters and also entire \Q...\E */
-
-  if (*ptr == CHAR_BACKSLASH)
-    {
-    if (*(++ptr) == CHAR_NULL) goto FAIL_EXIT;
-    if (*ptr == CHAR_Q) for (;;)
-      {
-      while (*(++ptr) != CHAR_NULL && *ptr != CHAR_BACKSLASH) {};
-      if (*ptr == CHAR_NULL) goto FAIL_EXIT;
-      if (*(++ptr) == CHAR_E) break;
-      }
-    continue;
-    }
-
-  /* Skip over character classes; this logic must be similar to the way they
-  are handled for real. If the first character is '^', skip it. Also, if the
-  first few characters (either before or after ^) are \Q\E or \E we skip them
-  too. This makes for compatibility with Perl. Note the use of STR macros to
-  encode "Q\\E" so that it works in UTF-8 on EBCDIC platforms. */
-
-  if (*ptr == CHAR_LEFT_SQUARE_BRACKET)
-    {
-    BOOL negate_class = FALSE;
-    for (;;)
-      {
-      if (ptr[1] == CHAR_BACKSLASH)
-        {
-        if (ptr[2] == CHAR_E)
-          ptr+= 2;
-        else if (STRNCMP_UC_C8(ptr + 2,
-                 STR_Q STR_BACKSLASH STR_E, 3) == 0)
-          ptr += 4;
-        else
-          break;
-        }
-      else if (!negate_class && ptr[1] == CHAR_CIRCUMFLEX_ACCENT)
-        {
-        negate_class = TRUE;
-        ptr++;
-        }
-      else break;
-      }
-
-    /* If the next character is ']', it is a data character that must be
-    skipped, except in JavaScript compatibility mode. */
-
-    if (ptr[1] == CHAR_RIGHT_SQUARE_BRACKET &&
-        (cd->external_options & PCRE_JAVASCRIPT_COMPAT) == 0)
-      ptr++;
-
-    while (*(++ptr) != CHAR_RIGHT_SQUARE_BRACKET)
-      {
-      if (*ptr == CHAR_NULL) return -1;
-      if (*ptr == CHAR_BACKSLASH)
-        {
-        if (*(++ptr) == CHAR_NULL) goto FAIL_EXIT;
-        if (*ptr == CHAR_Q) for (;;)
-          {
-          while (*(++ptr) != CHAR_NULL && *ptr != CHAR_BACKSLASH) {};
-          if (*ptr == CHAR_NULL) goto FAIL_EXIT;
-          if (*(++ptr) == CHAR_E) break;
-          }
-        continue;
-        }
-      }
-    continue;
-    }
-
-  /* Skip comments in /x mode */
-
-  if (xmode && *ptr == CHAR_NUMBER_SIGN)
-    {
-    ptr++;
-    while (*ptr != CHAR_NULL)
-      {
-      if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
-      ptr++;
-#ifdef SUPPORT_UTF
-      if (utf) FORWARDCHAR(ptr);
-#endif
-      }
-    if (*ptr == CHAR_NULL) goto FAIL_EXIT;
-    continue;
-    }
-
-  /* Check for the special metacharacters */
-
-  if (*ptr == CHAR_LEFT_PARENTHESIS)
-    {
-    int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf, count);
-    if (rc > 0) return rc;
-    if (*ptr == CHAR_NULL) goto FAIL_EXIT;
-    }
-
-  else if (*ptr == CHAR_RIGHT_PARENTHESIS)
-    {
-    if (dup_parens && *count < hwm_count) *count = hwm_count;
-    goto FAIL_EXIT;
-    }
-
-  else if (*ptr == CHAR_VERTICAL_LINE && dup_parens)
-    {
-    if (*count > hwm_count) hwm_count = *count;
-    *count = start_count;
-    }
-  }
-
-FAIL_EXIT:
-*ptrptr = ptr;
-return -1;
-}
-
-
-
-
-/*************************************************
-*       Find forward referenced subpattern       *
-*************************************************/
-
-/* This function scans along a pattern's text looking for capturing
-subpatterns, and counting them. If it finds a named pattern that matches the
-name it is given, it returns its number. Alternatively, if the name is NULL, it
-returns when it reaches a given numbered subpattern. This is used for forward
-references to subpatterns. We used to be able to start this scan from the
-current compiling point, using the current count value from cd->bracount, and
-do it all in a single loop, but the addition of the possibility of duplicate
-subpattern numbers means that we have to scan from the very start, in order to
-take account of such duplicates, and to use a recursive function to keep track
-of the different types of group.
-
-Arguments:
-  cd           compile background data
-  name         name to seek, or NULL if seeking a numbered subpattern
-  lorn         name length, or subpattern number if name is NULL
-  xmode        TRUE if we are in /x mode
-  utf          TRUE if we are in UTF-8 / UTF-16 / UTF-32 mode
-
-Returns:       the number of the found subpattern, or -1 if not found
-*/
-
-static int
-find_parens(compile_data *cd, const pcre_uchar *name, int lorn, BOOL xmode,
-  BOOL utf)
-{
-pcre_uchar *ptr = (pcre_uchar *)cd->start_pattern;
-int count = 0;
-int rc;
-
-/* If the pattern does not start with an opening parenthesis, the first call
-to find_parens_sub() will scan right to the end (if necessary). However, if it
-does start with a parenthesis, find_parens_sub() will return when it hits the
-matching closing parens. That is why we have to have a loop. */
-
-for (;;)
-  {
-  rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf, &count);
-  if (rc > 0 || *ptr++ == CHAR_NULL) break;
-  }
-
-return rc;
-}
-
-
-
-
-/*************************************************
 *      Find first significant op code            *
 *************************************************/

@@ -5949,7 +5656,7 @@
           slot += cd->name_entry_size;
           }

-        /* Found a previous named subpattern */
+        /* Found the named subpattern */

         if (i < cd->names_found)
           {
@@ -5958,15 +5665,6 @@
           code[1+LINK_SIZE]++;
           }

-        /* Search the pattern for a forward reference */
-
-        else if ((i = find_parens(cd, name, namelen,
-                        (options & PCRE_EXTENDED) != 0, utf)) > 0)
-          {
-          PUT2(code, 2+LINK_SIZE, i);
-          code[1+LINK_SIZE]++;
-          }
-
         /* If terminator == CHAR_NULL it means that the name followed directly
         after the opening parenthesis [e.g. (?(abc)...] and in this case there
         are some further alternatives to try. For the cases where terminator !=
@@ -6130,124 +5828,105 @@
         /* ------------------------------------------------------------ */
         DEFINE_NAME:    /* Come here from (?< handling */
         case CHAR_APOSTROPHE:
-          {
-          terminator = (*ptr == CHAR_LESS_THAN_SIGN)?
-            CHAR_GREATER_THAN_SIGN : CHAR_APOSTROPHE;
-          name = ++ptr;
+        terminator = (*ptr == CHAR_LESS_THAN_SIGN)?
+          CHAR_GREATER_THAN_SIGN : CHAR_APOSTROPHE;
+        name = ++ptr;

-          while (MAX_255(*ptr) && (cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
-          namelen = (int)(ptr - name);
+        while (MAX_255(*ptr) && (cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
+        namelen = (int)(ptr - name);

-          /* In the pre-compile phase, just do a syntax check. */
+        /* In the pre-compile phase, do a syntax check, remember the longest
+        name, and then remember the group in a vector, expanding it if
+        necessary. Duplicates for the same number are skipped; other duplicates 
+        are checked for validity. In the actual compile, there is nothing to
+        do. */

-          if (lengthptr != NULL)
+        if (lengthptr != NULL)
+          {
+          named_group *ng; 
+          pcre_uint32 number = cd->bracount + 1;
+ 
+          if (*ptr != (pcre_uchar)terminator)
             {
-            if (*ptr != (pcre_uchar)terminator)
+            *errorcodeptr = ERR42;
+            goto FAILED;
+            }
+             
+          if (cd->names_found >= MAX_NAME_COUNT)
+            {
+            *errorcodeptr = ERR49;
+            goto FAILED;
+            }
+             
+          if (namelen + IMM2_SIZE + 1 > cd->name_entry_size)
+            {
+            cd->name_entry_size = namelen + IMM2_SIZE + 1;
+            if (namelen > MAX_NAME_SIZE)
               {
-              *errorcodeptr = ERR42;
+              *errorcodeptr = ERR48;
               goto FAILED;
               }
-            if (cd->names_found >= MAX_NAME_COUNT)
-              {
-              *errorcodeptr = ERR49;
-              goto FAILED;
-              }
-            if (namelen + IMM2_SIZE + 1 > cd->name_entry_size)
-              {
-              cd->name_entry_size = namelen + IMM2_SIZE + 1;
-              if (namelen > MAX_NAME_SIZE)
-                {
-                *errorcodeptr = ERR48;
-                goto FAILED;
-                }
-              }
             }
-
-          /* In the real compile, create the entry in the table, maintaining
-          alphabetical order. Duplicate names for different numbers are
-          permitted only if PCRE_DUPNAMES is set. Duplicate names for the same
-          number are always OK. (An existing number can be re-used if (?|
-          appears in the pattern.) In either event, a duplicate name results in
-          a duplicate entry in the table, even if the number is the same. This
-          is because the number of names, and hence the table size, is computed
-          in the pre-compile, and it affects various numbers and pointers which
-          would all have to be modified, and the compiled code moved down, if
-          duplicates with the same number were omitted from the table. This
-          doesn't seem worth the hassle. However, *different* names for the
-          same number are not permitted. */
-
-          else
+            
+          /* Scan the list to check for duplicates. For duplicate names, if the
+          number is the same, break the loop, which causes the name to be
+          discarded; otherwise, if DUPNAMES is not set, give an error.
+          If it is set, allow the name with a different number, but continue
+          scanning in case this is a duplicate with the same number. For
+          non-duplicate names, give an error if the number is duplicated. */
+          
+          ng = cd->named_groups;
+          for (i = 0; i < cd->names_found; i++, ng++)
             {
-            BOOL dupname = FALSE;
-            slot = cd->name_table;
-
-            for (i = 0; i < cd->names_found; i++)
-              {
-              int crc = memcmp(name, slot+IMM2_SIZE, IN_UCHARS(namelen));
-              if (crc == 0)
+            if (namelen == ng->length &&
+                STRNCMP_UC_UC(name, ng->name, namelen) == 0)
+              {  
+              if (ng->number == number) break;
+              if ((options & PCRE_DUPNAMES) == 0)
                 {
-                if (slot[IMM2_SIZE+namelen] == 0)
-                  {
-                  if (GET2(slot, 0) != cd->bracount + 1 &&
-                      (options & PCRE_DUPNAMES) == 0)
-                    {
-                    *errorcodeptr = ERR43;
-                    goto FAILED;
-                    }
-                  else dupname = TRUE;
-                  }
-                else crc = -1;      /* Current name is a substring */
-                }
+                *errorcodeptr = ERR43;
+                goto FAILED;  
+                }  
+              } 
+            else if (ng->number == number)
+              {
+              *errorcodeptr = ERR65;
+              goto FAILED;
+              }      
+            }

-              /* Make space in the table and break the loop for an earlier
-              name. For a duplicate or later name, carry on. We do this for
-              duplicates so that in the simple case (when ?(| is not used) they
-              are in order of their numbers. */
-
-              if (crc < 0)
-                {
-                memmove(slot + cd->name_entry_size, slot,
-                  IN_UCHARS((cd->names_found - i) * cd->name_entry_size));
-                break;
-                }
-
-              /* Continue the loop for a later or duplicate name */
-
-              slot += cd->name_entry_size;
-              }
-
-            /* For non-duplicate names, check for a duplicate number before
-            adding the new name. */
-
-            if (!dupname)
+          if (i >= cd->names_found)     /* Not a duplicate with same number */
+            { 
+            /* Increase the list size if necessary */
+            
+            if (cd->names_found >= cd->named_group_list_size)
               {
-              pcre_uchar *cslot = cd->name_table;
-              for (i = 0; i < cd->names_found; i++)
+              int newsize = cd->named_group_list_size * 2;
+              named_group *newspace = (PUBL(malloc))
+                (newsize * sizeof(named_group));
+                 
+              if (newspace == NULL) 
                 {
-                if (cslot != slot)
-                  {
-                  if (GET2(cslot, 0) == cd->bracount + 1)
-                    {
-                    *errorcodeptr = ERR65;
-                    goto FAILED;
-                    }
-                  }
-                else i--;
-                cslot += cd->name_entry_size;
-                }
-              }
-
-            PUT2(slot, 0, cd->bracount + 1);
-            memcpy(slot + IMM2_SIZE, name, IN_UCHARS(namelen));
-            slot[IMM2_SIZE + namelen] = 0;
+                *errorcodeptr = ERR21;
+                goto FAILED; 
+                } 
+                 
+              memcpy(newspace, cd->named_groups, 
+                cd->named_group_list_size * sizeof(named_group));
+              if (cd->named_group_list_size > NAMED_GROUP_LIST_SIZE)
+                (PUBL(free))((void *)cd->named_groups);
+              cd->named_groups = newspace;
+              cd->named_group_list_size = newsize;
+              }  
+            
+            cd->named_groups[cd->names_found].name = name;
+            cd->named_groups[cd->names_found].length = namelen;
+            cd->named_groups[cd->names_found].number = number;
+            cd->names_found++;
             }
           }

-        /* In both pre-compile and compile, count the number of names we've
-        encountered. */
-
-        cd->names_found++;
-        ptr++;                    /* Move past > or ' */
+        ptr++;                    /* Move past > or ' in both passes. */
         goto NUMBERED_GROUP;

@@ -6277,8 +5956,8 @@

         if (lengthptr != NULL)
           {
-          const pcre_uchar *temp;
-
+          named_group *ng;
+           
           if (namelen == 0)
             {
             *errorcodeptr = ERR62;
@@ -6295,27 +5974,26 @@
             goto FAILED;
             }

-          /* The name table does not exist in the first pass, so we cannot
-          do a simple search as in the code below. Instead, we have to scan the
-          pattern to find the number. It is important that we scan it only as
-          far as we have got because the syntax of named subpatterns has not
-          been checked for the rest of the pattern, and find_parens() assumes
-          correct syntax. In any case, it's a waste of resources to scan
-          further. We stop the scan at the current point by temporarily
-          adjusting the value of cd->endpattern. */
+          /* The name table does not exist in the first pass; instead we must
+          scan the list of names encountered so far in order to get the 
+          number. The number may be negative if it is for a name that may be 
+          duplicated. If the name is not found, set the value to 0 for a
+          forward reference. */

-          temp = cd->end_pattern;
-          cd->end_pattern = ptr;
-          recno = find_parens(cd, name, namelen,
-            (options & PCRE_EXTENDED) != 0, utf);
-          cd->end_pattern = temp;
-          if (recno < 0) recno = 0;    /* Forward ref; set dummy number */
+          ng = cd->named_groups;
+          for (i = 0; i < cd->names_found; i++, ng++)
+            {
+            if (namelen == ng->length &&
+                STRNCMP_UC_UC(name, ng->name, namelen) == 0)
+              break;
+            }       
+          recno = (i < cd->names_found)? ng->number : 0;
           }

-        /* In the real compile, seek the name in the table. We check the name
+        /* In the real compile, search the name table. We check the name
         first, and then check that we have reached the end of the name in the
-        table. That way, if the name that is longer than any in the table,
-        the comparison will fail without reading beyond the table entry. */
+        table. That way, if the name is longer than any in the table, the
+        comparison will fail without reading beyond the table entry. */

         else
           {
@@ -6328,13 +6006,11 @@
             slot += cd->name_entry_size;
             }

-          if (i < cd->names_found)         /* Back reference */
+          if (i < cd->names_found)
             {
             recno = GET2(slot, 0);
             }
-          else if ((recno =                /* Forward back reference */
-                    find_parens(cd, name, namelen,
-                      (options & PCRE_EXTENDED) != 0, utf)) <= 0)
+          else 
             {
             *errorcodeptr = ERR15;
             goto FAILED;
@@ -6444,8 +6120,7 @@

             if (called == NULL)
               {
-              if (find_parens(cd, NULL, recno,
-                    (options & PCRE_EXTENDED) != 0, utf) < 0)
+              if (recno > cd->final_bracount)
                 {
                 *errorcodeptr = ERR15;
                 goto FAILED;
@@ -7790,6 +7465,61 @@

 /*************************************************
+*     Add an entry to the name/number table      *
+*************************************************/
+
+/* This function is called between compiling passes to add an entry to the
+name/number table, maintaining alphabetical order. Checking for permitted 
+and forbidden duplicates has already been done.
+
+Arguments:
+  cd           the compile data block
+  name         the name to add
+  length       the length of the name
+  groupno      the group number   
+
+Returns:       nothing
+*/
+
+static void
+add_name(compile_data *cd, const pcre_uchar *name, int length, 
+  unsigned int groupno)
+{
+int i;
+pcre_uchar *slot = cd->name_table;
+
+for (i = 0; i < cd->names_found; i++)
+  {
+  int crc = memcmp(name, slot+IMM2_SIZE, IN_UCHARS(length));
+  if (crc == 0 && slot[IMM2_SIZE+length] != 0) 
+    crc = -1; /* Current name is a substring */
+
+  /* Make space in the table and break the loop for an earlier name. For a
+  duplicate or later name, carry on. We do this for duplicates so that in the
+  simple case (when ?(| is not used) they are in order of their numbers. In all
+  cases they are in the order in which they appear in the pattern. */
+
+  if (crc < 0)
+    {
+    memmove(slot + cd->name_entry_size, slot,
+      IN_UCHARS((cd->names_found - i) * cd->name_entry_size));
+    break;
+    }
+
+  /* Continue the loop for a later or duplicate name */
+
+  slot += cd->name_entry_size;
+  }
+
+PUT2(slot, 0, groupno);
+memcpy(slot + IMM2_SIZE, name, IN_UCHARS(length));
+slot[IMM2_SIZE + length] = 0;
+cd->names_found++;
+}
+
+
+
+/*************************************************
 *        Compile a Regular Expression            *
 *************************************************/

@@ -7876,6 +7606,11 @@

pcre_uchar cworkspace[COMPILE_WORK_SIZE];

+/* This vector is used for remembering name groups during the pre-compile. In a
+similar way to cworkspace, it can be expanded using malloc() if necessary. */
+
+named_group named_groups[NAMED_GROUP_LIST_SIZE];
+
/* Set this early so that early errors get offset 0. */

ptr = (const pcre_uchar *)pattern;
@@ -8142,6 +7877,8 @@
cd->hwm = cworkspace;
cd->start_workspace = cworkspace;
cd->workspace_size = COMPILE_WORK_SIZE;
+cd->named_groups = named_groups;
+cd->named_group_list_size = NAMED_GROUP_LIST_SIZE;
cd->start_pattern = (const pcre_uchar *)pattern;
cd->end_pattern = (const pcre_uchar *)(pattern + STRLEN_UC((const pcre_uchar *)pattern));
cd->req_varyopt = 0;
@@ -8224,7 +7961,6 @@
cd->assert_depth = 0;
cd->bracount = 0;
cd->max_lookbehind = 0;
-cd->names_found = 0;
cd->name_table = (pcre_uchar *)re + re->name_table_offset;
codestart = cd->name_table + re->name_entry_size * re->name_count;
cd->start_code = codestart;
@@ -8235,6 +7971,20 @@
cd->check_lookbehind = FALSE;
cd->open_caps = NULL;

+/* If any named groups were found, create the name/number table from the list 
+created in the first pass. */
+
+if (cd->names_found > 0)
+  {
+  int i = cd->names_found;
+  named_group *ng = cd->named_groups; 
+  cd->names_found = 0;
+  for (; i > 0; i--, ng++)
+    add_name(cd, ng->name, ng->length, ng->number); 
+  if (cd->named_group_list_size > NAMED_GROUP_LIST_SIZE)
+    (PUBL(free))((void *)cd->named_groups);
+  } 
+
 /* Set up a starting, non-extracting bracket, then compile the expression. On
 error, errorcode will be set non-zero, so we don't need to look at the result
 of the function here. */

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/pcre_internal.h    2013-09-03 10:10:59 UTC (rev 1359)
@@ -2407,6 +2407,15 @@
   pcre_uint16 flag;             /* Set TRUE if recursive back ref */
 } open_capitem;

+/* Structure for building a list of named groups during the first pass of 
+compiling. */
+
+typedef struct named_group {
+  const pcre_uchar  *name;          /* Points to the name in the pattern */
+  int                length;        /* Length of the name */
+  pcre_uint32        number;        /* Group number */
+} named_group;    
+
 /* Structure for passing "static" information around between the functions
 doing the compiling, so that they are thread-safe. */

@@ -2419,11 +2428,13 @@
   const pcre_uchar *start_code;     /* The start of the compiled code */
   const pcre_uchar *start_pattern;  /* The start of the pattern */
   const pcre_uchar *end_pattern;    /* The end of the pattern */
+  pcre_uchar *hwm;                  /* High watermark of workspace */
   open_capitem *open_caps;          /* Chain of open capture items */
-  pcre_uchar *hwm;                  /* High watermark of workspace */
+  named_group *named_groups;        /* Points to vector in pre-compile */
   pcre_uchar *name_table;           /* The name/number table */
   int  names_found;                 /* Number of entries so far */
   int  name_entry_size;             /* Size of each entry */
+  int  named_group_list_size;       /* Number of entries in the list */ 
   int  workspace_size;              /* Size of workspace */
   unsigned int bracount;            /* Count of capturing parens as we compile */
   int  final_bracount;              /* Saved value after first pass */

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/pcretest.c    2013-09-03 10:10:59 UTC (rev 1359)
@@ -1016,8 +1016,6 @@
 static int locale_set = 0;
 static int show_malloc;
 static int use_utf;
-static size_t gotten_store;
-static size_t first_gotten_store = 0;
 static const unsigned char *last_callout_mark = NULL;

 /* The buffers grow automatically if very long input lines are encountered. */
@@ -2322,8 +2320,6 @@
 static void *new_malloc(size_t size)
 {
 void *block = malloc(size);
-gotten_store = size;
-if (first_gotten_store == 0) first_gotten_store = size;
 if (show_malloc)
   fprintf(outfile, "malloc       %3d %p\n", (int)size, block);
 return block;
@@ -3409,7 +3405,7 @@
   const pcre_uint8 *tables = NULL;
   unsigned long int get_options;
   unsigned long int true_size, true_study_size = 0;
-  size_t size, regex_gotten_store;
+  size_t size;
   int do_allcaps = 0;
   int do_mark = 0;
   int do_study = 0;
@@ -3464,8 +3460,6 @@
       fprintf(outfile, "Failed to open %s: %s\n", p, strerror(errno));
       continue;
       }
-
-    first_gotten_store = 0;
     if (fread(sbuf, 1, 8, f) != 8) goto FAIL_READ;

     true_size =
@@ -3481,8 +3475,6 @@
       yield = 1;
       goto EXIT;
       }
-    regex_gotten_store = first_gotten_store;
-
     if (fread(re, 1, true_size, f) != true_size) goto FAIL_READ;

     magic = REAL_PCRE_MAGIC(re);
@@ -3796,7 +3788,6 @@
     if ((options & PCRE_UCP) != 0) cflags |= REG_UCP;
     if ((options & PCRE_UNGREEDY) != 0) cflags |= REG_UNGREEDY;

-    first_gotten_store = 0;
     rc = regcomp(&preg, (char *)p, cflags);

     /* Compilation failed; go back for another re, skipping to blank line
@@ -3889,7 +3880,6 @@
           (double)CLOCKS_PER_SEC);
       }

-    first_gotten_store = 0;
     PCRE_COMPILE(re, p, options, &error, &erroroffset, tables);

     /* Compilation failed; go back for another re, skipping to blank line
@@ -3929,7 +3919,6 @@
     and remember the store that was got. */

     true_size = REAL_PCRE_SIZE(re);
-    regex_gotten_store = first_gotten_store;

     /* Output code size information if requested */

@@ -3952,8 +3941,9 @@
       if (REAL_PCRE_FLAGS(re) & PCRE_MODE32)
         real_pcre_size = sizeof(real_pcre32);
 #endif
+      new_info(re, NULL, PCRE_INFO_SIZE, &size);
       fprintf(outfile, "Memory allocation (code space): %d\n",
-        (int)(first_gotten_store - real_pcre_size - name_count * name_entry_size));
+        (int)(size - real_pcre_size - name_count * name_entry_size));
       }

     /* If -s or /S was present, study the regex to generate additional info to
@@ -4032,8 +4022,7 @@
       int nameentrysize, namecount;
       const pcre_uint8 *nametable;

-      if (new_info(re, NULL, PCRE_INFO_SIZE, &size) +
-          new_info(re, NULL, PCRE_INFO_CAPTURECOUNT, &count) +
+      if (new_info(re, NULL, PCRE_INFO_CAPTURECOUNT, &count) +
           new_info(re, NULL, PCRE_INFO_BACKREFMAX, &backrefmax) +
           new_info(re, NULL, PCRE_INFO_FIRSTCHARACTER, &first_char) +
           new_info(re, NULL, PCRE_INFO_FIRSTCHARACTERFLAGS, &first_char_set) +
@@ -4050,10 +4039,6 @@
           != 0)
         goto SKIP_DATA;

-      if (size != regex_gotten_store) fprintf(outfile,
-        "Size disagreement: pcre_fullinfo=%d call to malloc for %d\n",
-        (int)size, (int)regex_gotten_store);
-
       fprintf(outfile, "Capturing subpattern count = %d\n", count);

       if (backrefmax > 0)

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/testdata/testinput2    2013-09-03 10:10:59 UTC (rev 1359)
@@ -1495,6 +1495,8 @@
     a2b\CA
     ** Failers
     a1b\CZ\CA
+    
+/(?|(?<a>)(?<b>)(?<a>)|(?<a>)(?<b>)(?<a>))/IJ

 /^(?P<A>a)(?P<A>b)/IJ
     ab\CA

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2013-08-29 13:40:47 UTC (rev 1358)
+++ code/trunk/testdata/testoutput2    2013-09-03 10:10:59 UTC (rev 1359)
@@ -6131,6 +6131,17 @@
  2: a1
 copy substring Z failed -7
   C a1 (2) A
+    
+/(?|(?<a>)(?<b>)(?<a>)|(?<a>)(?<b>)(?<a>))/IJ
+Capturing subpattern count = 3
+Named capturing subpatterns:
+  a   1
+  a   3
+  b   2
+May match empty string
+Options: dupnames
+No first char
+No need char

/^(?P<A>a)(?P<A>b)/IJ
Capturing subpattern count = 2
@@ -10200,7 +10211,6 @@
Capturing subpattern count = 1
Named capturing subpatterns:
a 1
- a 1
No options
No first char
No need char

This message is part of the following thread:
	the complete thread tree sorted by date

[Pcre-svn] [1359] code/trunk: Refactor the code for creating…