[Pcre-svn] [455] code/trunk: Added lower bound length-findin…

Página Inicial
Delete this message
Autor: Subversion repository
Data:  
Para: pcre-svn
Assunto: [Pcre-svn] [455] code/trunk: Added lower bound length-finding to pcre_study() and use it when matching; make
Revision: 455
          http://vcs.pcre.org/viewvc?view=rev&revision=455
Author:   ph10
Date:     2009-09-26 20:12:32 +0100 (Sat, 26 Sep 2009)


Log Message:
-----------
Added lower bound length-finding to pcre_study() and use it when matching; make
the value available via pcre_fullinfo(); also fixed bugs connected with
pcre_study() in pcre_dfa_exec().

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre_fullinfo.3
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcretest.1
    code/trunk/maint/README
    code/trunk/pcre.h.in
    code/trunk/pcre_compile.c
    code/trunk/pcre_dfa_exec.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_fullinfo.c
    code/trunk/pcre_internal.h
    code/trunk/pcre_study.c
    code/trunk/pcre_try_flipped.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput7
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput3
    code/trunk/testdata/testoutput5
    code/trunk/testdata/testoutput7


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/ChangeLog    2009-09-26 19:12:32 UTC (rev 455)
@@ -135,6 +135,22 @@
     pattern matches a fixed length string. PCRE did not allow this; now it 
     does. Neither allows recursion. 


+25. I finally figured out how to implement a request to provide the minimum 
+    length of subject string that was needed in order to match a given pattern. 
+    (It was back references and recursion that I had previously got hung up 
+    on.) This code has now been added to pcre_study(); it finds a lower bound 
+    to the length of subject needed. It is not necessarily the greatest lower
+    bound, but using it to avoid searching strings that are too short does give
+    some useful speed-ups. The value is available to calling programs via
+    pcre_fullinfo().
+    
+26. While implementing 25, I discovered to my embarrassment that pcretest had
+    not been passing the result of pcre_study() to pcre_dfa_exec(), so the
+    study optimizations had never been tested with that matching function.
+    Oops. What is worse, even when it was passed study data, there was a bug in
+    pcre_dfa_exec() that meant it never actually used it. Double oops. There
+    were also very few tests of studied patterns with pcre_dfa_exec().
+    


Version 7.9 11-Apr-09
---------------------

Modified: code/trunk/doc/pcre_fullinfo.3
===================================================================
--- code/trunk/doc/pcre_fullinfo.3    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/doc/pcre_fullinfo.3    2009-09-26 19:12:32 UTC (rev 455)
@@ -33,6 +33,7 @@
   PCRE_INFO_FIRSTTABLE      Table of first bytes (after studying)
   PCRE_INFO_JCHANGED        Return 1 if (?J) or (?-J) was used
   PCRE_INFO_LASTLITERAL     Literal last byte required
+  PCRE_INFO_MINLENGTH       Lower bound length of matching strings 
   PCRE_INFO_NAMECOUNT       Number of named subpatterns
   PCRE_INFO_NAMEENTRYSIZE   Size of name table entry
   PCRE_INFO_NAMETABLE       Pointer to name table


Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/doc/pcreapi.3    2009-09-26 19:12:32 UTC (rev 455)
@@ -772,19 +772,19 @@
 results of the study.
 .P
 The returned value from \fBpcre_study()\fP can be passed directly to
-\fBpcre_exec()\fP. However, a \fBpcre_extra\fP block also contains other
-fields that can be set by the caller before the block is passed; these are
-described
+\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. However, a \fBpcre_extra\fP block
+also contains other fields that can be set by the caller before the block is
+passed; these are described
 .\" HTML <a href="#extradata">
 .\" </a>
 below
 .\"
 in the section on matching a pattern.
 .P
-If studying the pattern does not produce any additional information
+If studying the pattern does not produce any useful information,
 \fBpcre_study()\fP returns NULL. In that circumstance, if the calling program
-wants to pass any of the other fields to \fBpcre_exec()\fP, it must set up its
-own \fBpcre_extra\fP block.
+wants to pass any of the other fields to \fBpcre_exec()\fP or 
+\fBpcre_dfa_exec()\fP, it must set up its own \fBpcre_extra\fP block.
 .P
 The second argument of \fBpcre_study()\fP contains option bits. At present, no
 options are defined, and this argument should always be zero.
@@ -804,9 +804,18 @@
     0,              /* no options exist */
     &error);        /* set to NULL or points to a message */
 .sp
-At present, studying a pattern is useful only for non-anchored patterns that do
-not have a single fixed starting character. A bitmap of possible starting
-bytes is created.
+Studying a pattern does two things: first, a lower bound for the length of
+subject string that is needed to match the pattern is computed. This does not 
+mean that there are any strings of that length that match, but it does 
+guarantee that no shorter strings match. The value is used by 
+\fBpcre_exec()\fP and \fBpcre_dfa_exec()\fP to avoid wasting time by trying to 
+match strings that are shorter than the lower bound. You can find out the value 
+in a calling program via the \fBpcre_fullinfo()\fP function.
+.P
+Studying a pattern is also useful for non-anchored patterns that do not have a
+single fixed starting character. A bitmap of possible starting bytes is
+created. This speeds up finding a position in the subject at which to start 
+matching.
 .
 .
 .\" HTML <a name="localesupport"></a>
@@ -971,6 +980,16 @@
 /^a\ed+z\ed+/ the returned value is "z", but for /^a\edz\ed/ the returned value
 is -1.
 .sp
+  PCRE_INFO_MINLENGTH
+.sp
+If the pattern was studied and a minimum length for matching subject strings
+was computed, its value is returned. Otherwise the returned value is -1. The
+value is a number of characters, not bytes (there may be a difference in UTF-8
+mode). The fourth argument should point to an \fBint\fP variable. A
+non-negative value is a lower bound to the length of any matching string. There
+may not be any strings of that length that do actually match, but every string
+that does match is at least that long.
+.sp
   PCRE_INFO_NAMECOUNT
   PCRE_INFO_NAMEENTRYSIZE
   PCRE_INFO_NAMETABLE
@@ -1059,7 +1078,8 @@
 Return the size of the data block pointed to by the \fIstudy_data\fP field in
 a \fBpcre_extra\fP block. That is, it is the value that was passed to
 \fBpcre_malloc()\fP when PCRE was getting memory into which to place the data
-created by \fBpcre_study()\fP. The fourth argument should point to a
+created by \fBpcre_study()\fP. If \fBpcre_extra\fP is NULL, or there is no
+study data, zero is returned. The fourth argument should point to a
 \fBsize_t\fP variable.
 .
 .
@@ -1121,7 +1141,7 @@
 .P
 The function \fBpcre_exec()\fP is called to match a subject string against a
 compiled pattern, which is passed in the \fIcode\fP argument. If the
-pattern has been studied, the result of the study should be passed in the
+pattern was studied, the result of the study should be passed in the
 \fIextra\fP argument. This function is the main matching facility of the
 library, and it operates in a Perl-like manner. For specialist use there is
 also an alternative matching function, which is described
@@ -2023,6 +2043,6 @@
 .rs
 .sp
 .nf
-Last updated: 22 September 2009
+Last updated: 26 September 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/doc/pcretest.1    2009-09-26 19:12:32 UTC (rev 455)
@@ -371,6 +371,9 @@
   \eR         pass the PCRE_DFA_RESTART option to \fBpcre_dfa_exec()\fP
   \eS         output details of memory get/free calls during matching
 .\" JOIN
+  \eY         pass the PCRE_NO_START_OPTIMIZE option to \fBpcre_exec()\fP
+               or \fBpcre_dfa_exec()\fP
+.\" JOIN
   \eZ         pass the PCRE_NOTEOL option to \fBpcre_exec()\fP
                or \fBpcre_dfa_exec()\fP
 .\" JOIN
@@ -728,6 +731,6 @@
 .rs
 .sp
 .nf
-Last updated: 11 September 2009
+Last updated: 26 September 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi


Modified: code/trunk/maint/README
===================================================================
--- code/trunk/maint/README    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/maint/README    2009-09-26 19:12:32 UTC (rev 455)
@@ -178,15 +178,8 @@
     o A required byte from alternatives - not just the last char, but an
       earlier one if common to all alternatives.


-    o Minimum length of subject needed (see also next . bullet).
-
     o Friedl contains other ideas.


-. There was a request for a way of finding the minimum subject length that can
- match a given pattern. (If this were available, it could be usefully added
- to study() - see above.) This is easy for simple cases, but I haven't figured
- out how to handle recursion.
-
. If Perl gets to a consistent state over the settings of capturing sub-
patterns inside repeats, see if we can match it. One example of the
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
@@ -308,4 +301,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 September 2009
+Last updated: 26 September 2009

Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre.h.in    2009-09-26 19:12:32 UTC (rev 455)
@@ -177,6 +177,7 @@
 #define PCRE_INFO_OKPARTIAL         12
 #define PCRE_INFO_JCHANGED          13
 #define PCRE_INFO_HASCRORLF         14
+#define PCRE_INFO_MINLENGTH         15


/* Request types for pcre_config(). Do not re-arrange, in order to remain
compatible. */

Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_compile.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -1550,7 +1550,9 @@


/* This little function scans through a compiled pattern until it finds a
capturing bracket with the given number, or, if the number is negative, an
-instance of OP_REVERSE for a lookbehind.
+instance of OP_REVERSE for a lookbehind. The function is global in the C sense
+so that it can be called from pcre_study() when finding the minimum matching
+length.

 Arguments:
   code        points to start of expression
@@ -1560,8 +1562,8 @@
 Returns:      pointer to the opcode for the bracket, or NULL if not found
 */


-static const uschar *
-find_bracket(const uschar *code, BOOL utf8, int number)
+const uschar *
+_pcre_find_bracket(const uschar *code, BOOL utf8, int number)
 {
 for (;;)
   {
@@ -5072,7 +5074,8 @@
           if (lengthptr == NULL)
             {
             *code = OP_END;
-            if (recno != 0) called = find_bracket(cd->start_code, utf8, recno);
+            if (recno != 0) 
+              called = _pcre_find_bracket(cd->start_code, utf8, recno);


             /* Forward reference */


@@ -6582,7 +6585,7 @@
   cd->hwm -= LINK_SIZE;
   offset = GET(cd->hwm, 0);
   recno = GET(codestart, offset);
-  groupptr = find_bracket(codestart, utf8, recno);
+  groupptr = _pcre_find_bracket(codestart, utf8, recno);
   if (groupptr == NULL) errorcode = ERR53;
     else PUT(((uschar *)codestart), offset, groupptr - codestart);
   }
@@ -6609,9 +6612,9 @@
   of zero, but that is a pathological case, and it does no harm.) When we find 
   one, we temporarily terminate the branch it is in while we scan it. */


-  for (cc = (uschar *)find_bracket(codestart, utf8, -1);
+  for (cc = (uschar *)_pcre_find_bracket(codestart, utf8, -1);
        cc != NULL;
-       cc = (uschar *)find_bracket(cc, utf8, -1))
+       cc = (uschar *)_pcre_find_bracket(cc, utf8, -1))
     { 
     if (GET(cc, 1) == 0)
       { 


Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_dfa_exec.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -2651,7 +2651,7 @@
   if ((flags & PCRE_EXTRA_TABLES) != 0)
     md->tables = extra_data->tables;
   }
-
+  
 /* Check that the first field in the block is the magic number. If it is not,
 test for a regex that was compiled on a host of opposite endianness. If this is
 the case, flipped values are put in internal_re and internal_study if there was
@@ -2790,8 +2790,8 @@
     }
   else
     {
-    if (startline && study != NULL &&
-         (study->options & PCRE_STUDY_MAPPED) != 0)
+    if (!startline && study != NULL &&
+         (study->flags & PCRE_STUDY_MAPPED) != 0)
       start_bits = study->start_bits;
     }
   }
@@ -2842,13 +2842,11 @@
       }


     /* There are some optimizations that avoid running the match if a known
-    starting point is not found, or if a known later character is not present.
-    However, there is an option that disables these, for testing and for
-    ensuring that all callouts do actually occur. */
+    starting point is not found. However, there is an option that disables
+    these, for testing and for ensuring that all callouts do actually occur. */


     if ((options & PCRE_NO_START_OPTIMIZE) == 0)
       {
-
       /* Advance to a known first byte. */


       if (first_byte >= 0)
@@ -2914,66 +2912,75 @@
     /* Restore fudged end_subject */


     end_subject = save_end_subject;
-    }


-  /* If req_byte is set, we know that that character must appear in the subject
-  for the match to succeed. If the first character is set, req_byte must be
-  later in the subject; otherwise the test starts at the match point. This
-  optimization can save a huge amount of work in patterns with nested unlimited
-  repeats that aren't going to match. Writing separate code for cased/caseless
-  versions makes it go faster, as does using an autoincrement and backing off
-  on a match.
+    /* The following two optimizations are disabled for partial matching or if 
+    disabling is explicitly requested (and of course, by the test above, this 
+    code is not obeyed when restarting after a partial match). */
+    
+    if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
+        (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
+      {   
+      /* If the pattern was studied, a minimum subject length may be set. This
+      is a lower bound; no actual string of that length may actually match the
+      pattern. Although the value is, strictly, in characters, we treat it as
+      bytes to avoid spending too much time in this optimization. */


-  HOWEVER: when the subject string is very, very long, searching to its end can
-  take a long time, and give bad performance on quite ordinary patterns. This
-  showed up when somebody was matching /^C/ on a 32-megabyte string... so we
-  don't do this when the string is sufficiently long.
-
-  ALSO: this processing is disabled when partial matching is requested, and can
-  also be explicitly deactivated. Furthermore, we have to disable when
-  restarting after a partial match, because the required character may have
-  already been matched. */
-
-  if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
-      req_byte >= 0 &&
-      end_subject - current_subject < REQ_BYTE_MAX &&
-      (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT|PCRE_DFA_RESTART)) == 0)
-    {
-    register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);
-
-    /* We don't need to repeat the search if we haven't yet reached the
-    place we found it at last time. */
-
-    if (p > req_byte_ptr)
-      {
-      if (req_byte_caseless)
+      if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
+          end_subject - current_subject < study->minlength)
+        return PCRE_ERROR_NOMATCH;
+    
+      /* If req_byte is set, we know that that character must appear in the
+      subject for the match to succeed. If the first character is set, req_byte
+      must be later in the subject; otherwise the test starts at the match
+      point. This optimization can save a huge amount of work in patterns with
+      nested unlimited repeats that aren't going to match. Writing separate
+      code for cased/caseless versions makes it go faster, as does using an
+      autoincrement and backing off on a match.
+      
+      HOWEVER: when the subject string is very, very long, searching to its end
+      can take a long time, and give bad performance on quite ordinary
+      patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
+      string... so we don't do this when the string is sufficiently long. */
+      
+      if (req_byte >= 0 && end_subject - current_subject < REQ_BYTE_MAX)
         {
-        while (p < end_subject)
+        register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);
+      
+        /* We don't need to repeat the search if we haven't yet reached the
+        place we found it at last time. */
+      
+        if (p > req_byte_ptr)
           {
-          register int pp = *p++;
-          if (pp == req_byte || pp == req_byte2) { p--; break; }
+          if (req_byte_caseless)
+            {
+            while (p < end_subject)
+              {
+              register int pp = *p++;
+              if (pp == req_byte || pp == req_byte2) { p--; break; }
+              }
+            }
+          else
+            {
+            while (p < end_subject)
+              {
+              if (*p++ == req_byte) { p--; break; }
+              }
+            }
+      
+          /* If we can't find the required character, break the matching loop,
+          which will cause a return or PCRE_ERROR_NOMATCH. */
+      
+          if (p >= end_subject) break;
+      
+          /* If we have found the required character, save the point where we
+          found it, so that we don't search again next time round the loop if
+          the start hasn't passed this character yet. */
+      
+          req_byte_ptr = p;
           }
-        }
-      else
-        {
-        while (p < end_subject)
-          {
-          if (*p++ == req_byte) { p--; break; }
-          }
-        }
-
-      /* If we can't find the required character, break the matching loop,
-      which will cause a return or PCRE_ERROR_NOMATCH. */
-
-      if (p >= end_subject) break;
-
-      /* If we have found the required character, save the point where we
-      found it, so that we don't search again next time round the loop if
-      the start hasn't passed this character yet. */
-
-      req_byte_ptr = p;
+        }   
       }
-    }
+    }   /* End of optimizations that are done when not restarting */


/* OK, now we can do the business */


Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_exec.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -5121,7 +5121,7 @@
     }
   else
     if (!startline && study != NULL &&
-      (study->options & PCRE_STUDY_MAPPED) != 0)
+      (study->flags & PCRE_STUDY_MAPPED) != 0)
         start_bits = study->start_bits;
   }


@@ -5247,75 +5247,87 @@
/* Restore fudged end_subject */

   end_subject = save_end_subject;
-
-#ifdef DEBUG  /* Sigh. Some compilers never learn. */
-  printf(">>>> Match against: ");
-  pchars(start_match, end_subject - start_match, TRUE, md);
-  printf("\n");
-#endif
-
-  /* If req_byte is set, we know that that character must appear in the
-  subject for the match to succeed. If the first character is set, req_byte
-  must be later in the subject; otherwise the test starts at the match point.
-  This optimization can save a huge amount of backtracking in patterns with
-  nested unlimited repeats that aren't going to match. Writing separate code
-  for cased/caseless versions makes it go faster, as does using an
-  autoincrement and backing off on a match.
-
-  HOWEVER: when the subject string is very, very long, searching to its end
-  can take a long time, and give bad performance on quite ordinary patterns.
-  This showed up when somebody was matching something like /^\d+C/ on a
-  32-megabyte string... so we don't do this when the string is sufficiently
-  long.
-
-  ALSO: this processing is disabled when partial matching is requested, or if
+  
+  /* The following two optimizations are disabled for partial matching or if 
   disabling is explicitly requested. */
-
-  if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
-      req_byte >= 0 &&
-      end_subject - start_match < REQ_BYTE_MAX &&
-      !md->partial)
-    {
-    register USPTR p = start_match + ((first_byte >= 0)? 1 : 0);
-
-    /* We don't need to repeat the search if we haven't yet reached the
-    place we found it at last time. */
-
-    if (p > req_byte_ptr)
+  
+  if ((options & PCRE_NO_START_OPTIMIZE) == 0 && !md->partial) 
+    { 
+    /* If the pattern was studied, a minimum subject length may be set. This is
+    a lower bound; no actual string of that length may actually match the
+    pattern. Although the value is, strictly, in characters, we treat it as
+    bytes to avoid spending too much time in this optimization. */
+    
+    if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
+        end_subject - start_match < study->minlength)
       {
-      if (req_byte_caseless)
+      rc = MATCH_NOMATCH;
+      break;     
+      }
+ 
+    /* If req_byte is set, we know that that character must appear in the
+    subject for the match to succeed. If the first character is set, req_byte
+    must be later in the subject; otherwise the test starts at the match point.
+    This optimization can save a huge amount of backtracking in patterns with
+    nested unlimited repeats that aren't going to match. Writing separate code
+    for cased/caseless versions makes it go faster, as does using an
+    autoincrement and backing off on a match.
+    
+    HOWEVER: when the subject string is very, very long, searching to its end
+    can take a long time, and give bad performance on quite ordinary patterns.
+    This showed up when somebody was matching something like /^\d+C/ on a
+    32-megabyte string... so we don't do this when the string is sufficiently
+    long. */
+    
+    if (req_byte >= 0 && end_subject - start_match < REQ_BYTE_MAX)
+      {
+      register USPTR p = start_match + ((first_byte >= 0)? 1 : 0);
+    
+      /* We don't need to repeat the search if we haven't yet reached the
+      place we found it at last time. */
+    
+      if (p > req_byte_ptr)
         {
-        while (p < end_subject)
+        if (req_byte_caseless)
           {
-          register int pp = *p++;
-          if (pp == req_byte || pp == req_byte2) { p--; break; }
+          while (p < end_subject)
+            {
+            register int pp = *p++;
+            if (pp == req_byte || pp == req_byte2) { p--; break; }
+            }
           }
-        }
-      else
-        {
-        while (p < end_subject)
+        else
           {
-          if (*p++ == req_byte) { p--; break; }
+          while (p < end_subject)
+            {
+            if (*p++ == req_byte) { p--; break; }
+            }
           }
+    
+        /* If we can't find the required character, break the matching loop,
+        forcing a match failure. */
+    
+        if (p >= end_subject)
+          {
+          rc = MATCH_NOMATCH;
+          break;
+          }
+    
+        /* If we have found the required character, save the point where we
+        found it, so that we don't search again next time round the loop if
+        the start hasn't passed this character yet. */
+    
+        req_byte_ptr = p;
         }
+      }
+    }   


-      /* If we can't find the required character, break the matching loop,
-      forcing a match failure. */
+#ifdef DEBUG  /* Sigh. Some compilers never learn. */
+  printf(">>>> Match against: ");
+  pchars(start_match, end_subject - start_match, TRUE, md);
+  printf("\n");
+#endif


-      if (p >= end_subject)
-        {
-        rc = MATCH_NOMATCH;
-        break;
-        }
-
-      /* If we have found the required character, save the point where we
-      found it, so that we don't search again next time round the loop if
-      the start hasn't passed this character yet. */
-
-      req_byte_ptr = p;
-      }
-    }
-
   /* OK, we can now run the match. If "hitend" is set afterwards, remember the
   first starting point for which a partial match was found. */



Modified: code/trunk/pcre_fullinfo.c
===================================================================
--- code/trunk/pcre_fullinfo.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_fullinfo.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -119,9 +119,15 @@


   case PCRE_INFO_FIRSTTABLE:
   *((const uschar **)where) =
-    (study != NULL && (study->options & PCRE_STUDY_MAPPED) != 0)?
+    (study != NULL && (study->flags & PCRE_STUDY_MAPPED) != 0)?
       ((const pcre_study_data *)extra_data->study_data)->start_bits : NULL;
   break;
+  
+  case PCRE_INFO_MINLENGTH:
+  *((int *)where) =
+    (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0)?
+      study->minlength : -1;
+  break;         


case PCRE_INFO_LASTLITERAL:
*((int *)where) =

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_internal.h    2009-09-26 19:12:32 UTC (rev 455)
@@ -549,6 +549,7 @@
 /* Options for the "extra" block produced by pcre_study(). */


 #define PCRE_STUDY_MAPPED   0x01     /* a map of starting chars exists */
+#define PCRE_STUDY_MINLEN   0x02     /* a minimum length field exists */


/* Masks for identifying the public options that are permitted at compile
time, run time, or study time, respectively. */
@@ -1491,7 +1492,7 @@
structure should be made at the end, and something earlier (e.g. a new
flag in the options or one of the dummy fields) should indicate that the new
fields are present. Currently PCRE always sets the dummy fields to zero.
-NOTE NOTE NOTE:
+NOTE NOTE NOTE
*/

typedef struct real_pcre {
@@ -1518,8 +1519,9 @@

 typedef struct pcre_study_data {
   pcre_uint32 size;               /* Total that was malloced */
-  pcre_uint32 options;
-  uschar start_bits[32];
+  pcre_uint32 flags;              /* Private flags */
+  uschar start_bits[32];          /* Starting char bits */
+  pcre_uint32 minlength;          /* Minimum subject length */ 
 } pcre_study_data;


/* Structure for building a chain of open capturing subpatterns during
@@ -1722,15 +1724,16 @@
one of the exported public functions. They have to be "external" in the C
sense, but are not part of the PCRE public API. */

-extern BOOL         _pcre_is_newline(const uschar *, int, const uschar *,
-                      int *, BOOL);
-extern int          _pcre_ord2utf8(int, uschar *);
-extern real_pcre   *_pcre_try_flipped(const real_pcre *, real_pcre *,
-                      const pcre_study_data *, pcre_study_data *);
-extern int          _pcre_valid_utf8(const uschar *, int);
-extern BOOL         _pcre_was_newline(const uschar *, int, const uschar *,
-                      int *, BOOL);
-extern BOOL         _pcre_xclass(int, const uschar *);
+extern const uschar *_pcre_find_bracket(const uschar *, BOOL, int);
+extern BOOL          _pcre_is_newline(const uschar *, int, const uschar *,
+                       int *, BOOL);
+extern int           _pcre_ord2utf8(int, uschar *);
+extern real_pcre    *_pcre_try_flipped(const real_pcre *, real_pcre *,
+                       const pcre_study_data *, pcre_study_data *);
+extern int           _pcre_valid_utf8(const uschar *, int);
+extern BOOL          _pcre_was_newline(const uschar *, int, const uschar *,
+                       int *, BOOL);
+extern BOOL          _pcre_xclass(int, const uschar *);



/* Unicode character database (UCD) */

Modified: code/trunk/pcre_study.c
===================================================================
--- code/trunk/pcre_study.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_study.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2008 University of Cambridge
+           Copyright (c) 1997-2009 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -54,7 +54,355 @@
enum { SSB_FAIL, SSB_DONE, SSB_CONTINUE };


+
 /*************************************************
+*   Find the minimum subject length for a group  *
+*************************************************/
+
+/* Scan a parenthesized group and compute the minimum length of subject that
+is needed to match it. This is a lower bound; it does not mean there is a 
+string of that length that matches. In UTF8 mode, the result is in characters
+rather than bytes.
+
+Arguments:
+  code       pointer to start of group (the bracket)
+  startcode  pointer to start of the whole pattern
+  options    the compiling options 
+
+Returns:   the minimum length
+           -1 if \C was encountered
+           -2 internal error (missing capturing bracket) 
+*/
+
+static int
+find_minlength(const uschar *code, const uschar *startcode, int options)
+{
+int length = -1;
+BOOL utf8 = (options & PCRE_UTF8) != 0;
+BOOL had_recurse = FALSE;
+register int branchlength = 0;
+register uschar *cc = (uschar *)code + 1 + LINK_SIZE;
+
+if (*code == OP_CBRA || *code == OP_SCBRA) cc += 2;
+
+/* Scan along the opcodes for this branch. If we get to the end of the
+branch, check the length against that of the other branches. */
+
+for (;;)
+  {
+  int d, min;
+  uschar *cs, *ce; 
+  register int op = *cc;
+  
+  switch (op)
+    {
+    case OP_CBRA:
+    case OP_SCBRA: 
+    case OP_BRA:
+    case OP_SBRA: 
+    case OP_ONCE:
+    case OP_COND:
+    case OP_SCOND: 
+    d = find_minlength(cc, startcode, options);
+    if (d < 0) return d;
+    branchlength += d;
+    do cc += GET(cc, 1); while (*cc == OP_ALT);
+    cc += 1 + LINK_SIZE;
+    break;
+
+    /* Reached end of a branch; if it's a ket it is the end of a nested
+    call. If it's ALT it is an alternation in a nested call. If it is
+    END it's the end of the outer call. All can be handled by the same code. */
+
+    case OP_ALT:
+    case OP_KET:
+    case OP_KETRMAX:
+    case OP_KETRMIN:
+    case OP_END:
+    if (length < 0 || (!had_recurse && branchlength < length)) 
+      length = branchlength;
+    if (*cc != OP_ALT) return length;
+    cc += 1 + LINK_SIZE;
+    branchlength = 0;
+    had_recurse = FALSE;   
+    break;
+
+    /* Skip over assertive subpatterns */
+
+    case OP_ASSERT:
+    case OP_ASSERT_NOT:
+    case OP_ASSERTBACK:
+    case OP_ASSERTBACK_NOT:
+    do cc += GET(cc, 1); while (*cc == OP_ALT);
+    /* Fall through */
+
+    /* Skip over things that don't match chars */
+
+    case OP_REVERSE:
+    case OP_CREF:
+    case OP_RREF:
+    case OP_DEF:
+    case OP_OPT:
+    case OP_CALLOUT:
+    case OP_SOD:
+    case OP_SOM:
+    case OP_EOD:
+    case OP_EODN:
+    case OP_CIRC:
+    case OP_DOLL:
+    case OP_NOT_WORD_BOUNDARY:
+    case OP_WORD_BOUNDARY:
+    cc += _pcre_OP_lengths[*cc];
+    break;
+    
+    /* Skip over a subpattern that has a {0} or {0,x} quantifier */
+
+    case OP_BRAZERO:
+    case OP_BRAMINZERO: 
+    case OP_SKIPZERO:
+    cc += _pcre_OP_lengths[*cc];
+    do cc += GET(cc, 1); while (*cc == OP_ALT);
+    cc += 1 + LINK_SIZE;
+    break;
+
+    /* Handle literal characters and + repetitions */
+
+    case OP_CHAR:
+    case OP_CHARNC:
+    case OP_NOT:
+    case OP_PLUS:
+    case OP_MINPLUS:
+    case OP_POSPLUS:
+    case OP_NOTPLUS:
+    case OP_NOTMINPLUS:
+    case OP_NOTPOSPLUS:
+    branchlength++;
+    cc += 2;
+#ifdef SUPPORT_UTF8
+    if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
+#endif
+    break;
+    
+    case OP_TYPEPLUS:
+    case OP_TYPEMINPLUS:
+    case OP_TYPEPOSPLUS:      
+    branchlength++;
+    cc += (cc[1] == OP_PROP || cc[1] == OP_NOTPROP)? 4 : 2;
+    break;
+
+    /* Handle exact repetitions. The count is already in characters, but we
+    need to skip over a multibyte character in UTF8 mode.  */
+
+    case OP_EXACT:
+    case OP_NOTEXACT: 
+    branchlength += GET2(cc,1);
+    cc += 4;
+#ifdef SUPPORT_UTF8
+    if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
+#endif
+    break;
+
+    case OP_TYPEEXACT:
+    branchlength += GET2(cc,1);
+    cc += (cc[3] == OP_PROP || cc[3] == OP_NOTPROP)? 6 : 4;
+    break;
+
+    /* Handle single-char non-literal matchers */
+
+    case OP_PROP:
+    case OP_NOTPROP:
+    cc += 2;
+    /* Fall through */
+
+    case OP_NOT_DIGIT:
+    case OP_DIGIT:
+    case OP_NOT_WHITESPACE:
+    case OP_WHITESPACE:
+    case OP_NOT_WORDCHAR:
+    case OP_WORDCHAR:
+    case OP_ANY:
+    case OP_ALLANY:
+    case OP_EXTUNI:
+    case OP_HSPACE:  
+    case OP_NOT_HSPACE:
+    case OP_VSPACE:
+    case OP_NOT_VSPACE:   
+    branchlength++;
+    cc++;
+    break;
+    
+    /* "Any newline" might match two characters */
+    
+    case OP_ANYNL:
+    branchlength += 2;
+    cc++;
+    break;     
+
+    /* The single-byte matcher means we can't proceed in UTF-8 mode */
+
+    case OP_ANYBYTE:
+#ifdef SUPPORT_UTF8
+    if (utf8) return -1;
+#endif
+    branchlength++;
+    cc++;
+    break;  
+
+    /* For repeated character types, we have to test for \p and \P, which have
+    an extra two bytes of parameters. */
+
+    case OP_TYPESTAR:
+    case OP_TYPEMINSTAR:
+    case OP_TYPEQUERY:
+    case OP_TYPEMINQUERY:
+    case OP_TYPEPOSSTAR:
+    case OP_TYPEPOSQUERY:
+    if (cc[1] == OP_PROP || cc[1] == OP_NOTPROP) cc += 2;
+    cc += _pcre_OP_lengths[op];
+    break;
+
+    case OP_TYPEUPTO:
+    case OP_TYPEMINUPTO:
+    case OP_TYPEPOSUPTO:
+    if (cc[3] == OP_PROP || cc[3] == OP_NOTPROP) cc += 2;
+    cc += _pcre_OP_lengths[op];
+    break;
+
+    /* Check a class for variable quantification */
+
+#ifdef SUPPORT_UTF8
+    case OP_XCLASS:
+    cc += GET(cc, 1) - 33;
+    /* Fall through */
+#endif
+
+    case OP_CLASS:
+    case OP_NCLASS:
+    cc += 33;
+
+    switch (*cc)
+      {
+      case OP_CRPLUS:
+      case OP_CRMINPLUS:
+      branchlength++;
+      /* Fall through */ 
+
+      case OP_CRSTAR:
+      case OP_CRMINSTAR:
+      case OP_CRQUERY:
+      case OP_CRMINQUERY:
+      cc++; 
+      break;
+      
+      case OP_CRRANGE:
+      case OP_CRMINRANGE:
+      branchlength += GET2(cc,1);
+      cc += 5;
+      break;
+      
+      default:
+      branchlength++;
+      break;   
+      }
+    break;
+    
+    /* Backreferences and subroutine calls are treated in the same way: we find 
+    the minimum length for the subpattern. A recursion, however, causes an 
+    a flag to be set that causes the length of this branch to be ignored. The
+    logic is that a recursion can only make sense if there is another
+    alternation that stops the recursing. That will provide the minimum length
+    (when no recursion happens). A backreference within the group that it is
+    referencing behaves in the same way. */
+    
+    case OP_REF:
+    ce = cs = (uschar *)_pcre_find_bracket(startcode, utf8, GET2(cc, 1));
+    if (cs == NULL) return -2;
+    do ce += GET(ce, 1); while (*ce == OP_ALT);
+    if (cc > cs && cc < ce)
+      {
+      d = 0;
+      had_recurse = TRUE; 
+      }  
+    else d = find_minlength(cs, startcode, options);
+    cc += 3; 
+
+    /* Handle repeated back references */
+     
+    switch (*cc)
+      {
+      case OP_CRSTAR:
+      case OP_CRMINSTAR:
+      case OP_CRQUERY:
+      case OP_CRMINQUERY:
+      min = 0;
+      cc++;
+      break;
+       
+      case OP_CRRANGE:
+      case OP_CRMINRANGE:
+      min = GET2(cc, 1);
+      cc += 5;
+      break;
+      
+      default:
+      min = 1;
+      break;
+      }
+
+    branchlength += min * d;
+    break; 
+
+    case OP_RECURSE:  
+    cs = ce = (uschar *)startcode + GET(cc, 1);
+    if (cs == NULL) return -2;
+    do ce += GET(ce, 1); while (*ce == OP_ALT);
+    if (cc > cs && cc < ce)
+      had_recurse = TRUE; 
+    else 
+      branchlength += find_minlength(cs, startcode, options);
+    cc += 1 + LINK_SIZE;
+    break;
+
+    /* Anything else does not or need not match a character. We can get the
+    item's length from the table, but for those that can match zero occurrences 
+    of a character, we must take special action for UTF-8 characters. */ 
+     
+    case OP_UPTO:
+    case OP_NOTUPTO: 
+    case OP_MINUPTO:
+    case OP_NOTMINUPTO: 
+    case OP_POSUPTO:
+    case OP_STAR:
+    case OP_MINSTAR:
+    case OP_NOTMINSTAR: 
+    case OP_POSSTAR:
+    case OP_NOTPOSSTAR: 
+    case OP_QUERY:
+    case OP_MINQUERY:
+    case OP_NOTMINQUERY:
+    case OP_POSQUERY:
+    case OP_NOTPOSQUERY: 
+    cc += _pcre_OP_lengths[op];
+#ifdef SUPPORT_UTF8     
+    if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
+#endif 
+    break;
+
+    /* For the record, these are the opcodes that are matched by "default":
+    OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
+    OP_THEN. */
+     
+    default:
+    cc += _pcre_OP_lengths[op];
+    break;
+    }
+  }
+/* Control never gets here */
+}
+
+
+
+/*************************************************
 *      Set a bit and maybe its alternate case    *
 *************************************************/


@@ -500,13 +848,15 @@
             set NULL unless error


 Returns:    pointer to a pcre_extra block, with study_data filled in and the
-              appropriate flag set;
+              appropriate flags set;
             NULL on error or if no optimization possible
 */


PCRE_EXP_DEFN pcre_extra * PCRE_CALL_CONVENTION
pcre_study(const pcre *external_re, int options, const char **errorptr)
{
+int min;
+BOOL bits_set = FALSE;
uschar start_bits[32];
pcre_extra *extra;
pcre_study_data *study;
@@ -533,31 +883,40 @@
(re->name_count * re->name_entry_size);

/* For an anchored pattern, or an unanchored pattern that has a first char, or
-a multiline pattern that matches only at "line starts", no further processing
-at present. */
+a multiline pattern that matches only at "line starts", there is no point in
+seeking a list of starting bytes. */

-if ((re->options & PCRE_ANCHORED) != 0 ||
-    (re->flags & (PCRE_FIRSTSET|PCRE_STARTLINE)) != 0)
-  return NULL;
+if ((re->options & PCRE_ANCHORED) == 0 &&
+    (re->flags & (PCRE_FIRSTSET|PCRE_STARTLINE)) == 0)
+  {
+  /* Set the character tables in the block that is passed around */
+  
+  tables = re->tables;
+  if (tables == NULL)
+    (void)pcre_fullinfo(external_re, NULL, PCRE_INFO_DEFAULT_TABLES,
+    (void *)(&tables));
+  
+  compile_block.lcc = tables + lcc_offset;
+  compile_block.fcc = tables + fcc_offset;
+  compile_block.cbits = tables + cbits_offset;
+  compile_block.ctypes = tables + ctypes_offset;
+  
+  /* See if we can find a fixed set of initial characters for the pattern. */
+  
+  memset(start_bits, 0, 32 * sizeof(uschar));
+  bits_set = set_start_bits(code, start_bits, 
+    (re->options & PCRE_CASELESS) != 0, (re->options & PCRE_UTF8) != 0, 
+    &compile_block) == SSB_DONE;
+  }
+   
+/* Find the minimum length of subject string. */


-/* Set the character tables in the block that is passed around */
+min = find_minlength(code, code, re->options);

-tables = re->tables;
-if (tables == NULL)
- (void)pcre_fullinfo(external_re, NULL, PCRE_INFO_DEFAULT_TABLES,
- (void *)(&tables));
+/* Return NULL if no optimization is possible. */

-compile_block.lcc = tables + lcc_offset;
-compile_block.fcc = tables + fcc_offset;
-compile_block.cbits = tables + cbits_offset;
-compile_block.ctypes = tables + ctypes_offset;
+if (!bits_set && min < 0) return NULL;

-/* See if we can find a fixed set of initial characters for the pattern. */
-
-memset(start_bits, 0, 32 * sizeof(uschar));
-if (set_start_bits(code, start_bits, (re->options & PCRE_CASELESS) != 0,
- (re->options & PCRE_UTF8) != 0, &compile_block) != SSB_DONE) return NULL;
-
/* Get a pcre_extra block and a pcre_study_data block. The study data is put in
the latter, which is pointed to by the former, which may also get additional
data set later by the calling program. At the moment, the size of
@@ -579,9 +938,20 @@
extra->study_data = study;

study->size = sizeof(pcre_study_data);
-study->options = PCRE_STUDY_MAPPED;
-memcpy(study->start_bits, start_bits, sizeof(start_bits));
+study->flags = 0;

+if (bits_set)
+  {
+  study->flags |= PCRE_STUDY_MAPPED;
+  memcpy(study->start_bits, start_bits, sizeof(start_bits));
+  }
+  
+if (min >= 0)
+  {
+  study->flags |= PCRE_STUDY_MINLEN;
+  study->minlength = min;
+  }     
+
 return extra;
 }



Modified: code/trunk/pcre_try_flipped.c
===================================================================
--- code/trunk/pcre_try_flipped.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcre_try_flipped.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2008 University of Cambridge
+           Copyright (c) 1997-2009 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -128,7 +128,9 @@
   {
   *internal_study = *study;   /* To copy other fields */
   internal_study->size = byteflip(study->size, sizeof(study->size));
-  internal_study->options = byteflip(study->options, sizeof(study->options));
+  internal_study->flags = byteflip(study->flags, sizeof(study->flags));
+  internal_study->minlength = byteflip(study->minlength, 
+    sizeof(study->minlength));
   }


return internal_re;

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/pcretest.c    2009-09-26 19:12:32 UTC (rev 455)
@@ -1451,7 +1451,8 @@
         {
         pcre_study_data *rsd = (pcre_study_data *)(extra->study_data);
         rsd->size = byteflip(rsd->size, sizeof(rsd->size));
-        rsd->options = byteflip(rsd->options, sizeof(rsd->options));
+        rsd->flags = byteflip(rsd->flags, sizeof(rsd->flags));
+        rsd->minlength = byteflip(rsd->minlength, sizeof(rsd->minlength));
         }
       }


@@ -1628,10 +1629,14 @@
         else
           {
           uschar *start_bits = NULL;
+          int minlength;
+           
+          new_info(re, extra, PCRE_INFO_MINLENGTH, &minlength);
+          fprintf(outfile, "Subject length lower bound = %d\n", minlength);  
+ 
           new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits);
-
           if (start_bits == NULL)
-            fprintf(outfile, "No starting byte set\n");
+            fprintf(outfile, "No set of starting bytes\n");
           else
             {
             int i;
@@ -2150,7 +2155,7 @@
           {
           int workspace[1000];
           for (i = 0; i < timeitm; i++)
-            count = pcre_dfa_exec(re, NULL, (char *)bptr, len, start_offset,
+            count = pcre_dfa_exec(re, extra, (char *)bptr, len, start_offset,
               options | g_notempty, use_offsets, use_size_offsets, workspace,
               sizeof(workspace)/sizeof(int));
           }
@@ -2213,7 +2218,7 @@
       else if (all_use_dfa || use_dfa)
         {
         int workspace[1000];
-        count = pcre_dfa_exec(re, NULL, (char *)bptr, len, start_offset,
+        count = pcre_dfa_exec(re, extra, (char *)bptr, len, start_offset,
           options | g_notempty, use_offsets, use_size_offsets, workspace,
           sizeof(workspace)/sizeof(int));
         if (count == 0)


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/testdata/testinput2    2009-09-26 19:12:32 UTC (rev 455)
@@ -2861,4 +2861,241 @@


/(?<=b(?1))xyz(b+)pqrstuvew/

+/(a|bc)\1/SI
+
+/(a|bc)\1{2,3}/SI
+
+/(a|bc)(?1)/SI
+
+/(a|b\1)(a|b\1)/SI
+
+/(a|b\1){2}/SI
+
+/(a|bbbb\1)(a|bbbb\1)/SI
+
+/(a|bbbb\1){2}/SI
+
+/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/SI
+
+/  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*                          # optional leading comment
+(?:    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)                    # initial word
+(?:  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)  )* # further okay, if led by a period
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+# address
+|                     #  or
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)             # one word, optionally followed by....
+(?:
+[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]  |  # atom and space parts, or...
+\(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)       |  # comments, or...
+
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+# quoted strings
+)*
+<  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*                     # leading <
+(?:  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+
+(?:  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  ,  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+)* # further okay, if led by comma
+:                                # closing colon
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  )? #       optional route
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)                    # initial word
+(?:  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)  )* # further okay, if led by a period
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+#       address spec
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  > #                  trailing >
+# name and address
+)  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*                       # optional trailing comment
+/xSI
+
+/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/isIS
+
+"(?>.*/)foo"SI
+
+/(?(?=[^a-z]+[a-z])  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} ) /xSI
+
+/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/iSI
+
+/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/SI
+
+/<a[\s]+href[\s]*=[\s]*          # find <a href=
+ ([\"\'])?                       # find single or double quote
+ (?(1) (.*?)\1 | ([^\s]+))       # if quote found, match up to next matching
+                                 # quote, otherwise match up to next space
+/isxSI
+
+/^(?!:)                       # colon disallowed at start
+  (?:                         # start of item
+    (?: [0-9a-f]{1,4} |       # 1-4 hex digits or
+    (?(1)0 | () ) )           # if null previously matched, fail; else null
+    :                         # followed by colon
+  ){1,7}                      # end item; 1-7 of them required               
+  [0-9a-f]{1,4} $             # final hex number at end of string
+  (?(1)|.)                    # check that there was an empty component
+  /xiIS
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testinput7
===================================================================
--- code/trunk/testdata/testinput7    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/testdata/testinput7    2009-09-26 19:12:32 UTC (rev 455)
@@ -4489,4 +4489,22 @@
 /(?=C)/g+
     ABCDECBA


+/(abc|def|xyz)/I
+    terhjk;abcdaadsfe
+    the quick xyz brown fox 
+    \Yterhjk;abcdaadsfe
+    \Ythe quick xyz brown fox 
+    ** Failers
+    thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+    \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+
+/(abc|def|xyz)/SI
+    terhjk;abcdaadsfe
+    the quick xyz brown fox 
+    \Yterhjk;abcdaadsfe
+    \Ythe quick xyz brown fox 
+    ** Failers
+    thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+    \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+
 /-- End of testinput7 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/testdata/testoutput2    2009-09-26 19:12:32 UTC (rev 455)
@@ -145,6 +145,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 3
 Starting byte set: c d e 
     this sentence eventually mentions a cat
  0: cat
@@ -156,6 +157,7 @@
 Options: caseless
 No first char
 No need char
+Subject length lower bound = 3
 Starting byte set: C D E c d e 
     this sentence eventually mentions a CAT cat
  0: CAT
@@ -167,6 +169,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: a b c d 


/(a|[^\dZ])/IS
@@ -174,6 +177,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = >
@@ -194,6 +198,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x09 \x0a \x0c \x0d \x20 a b

/(ab\2)/
@@ -527,6 +532,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d

/(?i)[abcd]/IS
@@ -534,6 +540,7 @@
Options: caseless
No first char
No need char
+Subject length lower bound = 1
Starting byte set: A B C D a b c d

/(?m)[xy]|(b|c)/IS
@@ -541,6 +548,7 @@
Options: multiline
No first char
No need char
+Subject length lower bound = 1
Starting byte set: b c x y

/(^a|^b)/Im
@@ -605,13 +613,15 @@
No options
First char = 'b' (caseless)
No need char
-Study returned NULL
+Subject length lower bound = 1
+No set of starting bytes

/(a*b|(?i:c*(?-i)d))/IS
Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: C a b c d

/a$/I
@@ -676,6 +686,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b

/(?<!foo)(alpha|omega)/IS
@@ -683,6 +694,7 @@
No options
No first char
Need char = 'a'
+Subject length lower bound = 5
Starting byte set: a o

/(?!alphabet)[ab]/IS
@@ -690,6 +702,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b

/(?<=foo\n)^bar/Im
@@ -1658,7 +1671,8 @@
Options: anchored
No first char
Need char = 'd'
-Study returned NULL
+Subject length lower bound = 4
+No set of starting bytes

 /\(             # ( at start
   (?:           # Non-capturing bracket
@@ -1890,6 +1904,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
   _ a b c d e f g h i j k l m n o p q r s t u v w x y z 


@@ -1951,6 +1966,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20

/^[[:cntrl:]]/DZ
@@ -3421,6 +3437,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b

/[^a]/I
@@ -3440,6 +3457,7 @@
No options
No first char
Need char = '6'
+Subject length lower bound = 4
Starting byte set: 0 1 2 3 4 5 6 7 8 9

/a^b/I
@@ -3473,6 +3491,7 @@
Options: caseless
No first char
No need char
+Subject length lower bound = 1
Starting byte set: A B a b

/[ab](?i)cd/IS
@@ -3480,6 +3499,7 @@
No options
No first char
Need char = 'd' (caseless)
+Subject length lower bound = 3
Starting byte set: a b

/abc(?C)def/I
@@ -3830,6 +3850,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b

 /(?R)/I
@@ -4614,7 +4635,8 @@
 Options: caseless
 No first char
 Need char = 'g' (caseless)
-Study returned NULL
+Subject length lower bound = 8
+No set of starting bytes
      Baby Bjorn Active Carrier - With free SHIPPING!!
  0: Baby Bjorn Active Carrier - With free SHIPPING!!
  1: Baby Bjorn Active Carrier - With free SHIPPING!!
@@ -4632,7 +4654,8 @@
 No options
 No first char
 Need char = 'b'
-Study returned NULL
+Subject length lower bound = 1
+No set of starting bytes


/(a|b)*.?c/ISDZ
------------------------------------------------------------------
@@ -4652,7 +4675,8 @@
No options
No first char
Need char = 'c'
-Study returned NULL
+Subject length lower bound = 1
+No set of starting bytes

 /abc(?C255)de(?C)f/DZ
 ------------------------------------------------------------------
@@ -5435,6 +5459,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: a b 
 Compiled regex written to testsavedregex
 Study data written to testsavedregex
@@ -5455,6 +5480,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: a b 
 Compiled regex written to testsavedregex
 Study data written to testsavedregex
@@ -6167,6 +6193,7 @@
 No options
 No first char
 Need char = ','
+Subject length lower bound = 1
 Starting byte set: \x09 \x0a \x0c \x0d \x20 , 
     \x0b,\x0b
  0: ,
@@ -6471,6 +6498,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: C a b c d 


/()[ab]xyz/IS
@@ -6478,6 +6506,7 @@
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b

/(|)[ab]xyz/IS
@@ -6485,6 +6514,7 @@
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b

/(|c)[ab]xyz/IS
@@ -6492,6 +6522,7 @@
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c

/(|c?)[ab]xyz/IS
@@ -6499,6 +6530,7 @@
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c

/(d?|c?)[ab]xyz/IS
@@ -6506,6 +6538,7 @@
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c d

/(d?|c)[ab]xyz/IS
@@ -6513,6 +6546,7 @@
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c d

/^a*b\d/DZ
@@ -6605,6 +6639,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d

/(a+|b*)[cd]/IS
@@ -6612,6 +6647,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d

/(a*|b+)[cd]/IS
@@ -6619,6 +6655,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d

/(a+|b+)[cd]/IS
@@ -6626,6 +6663,7 @@
No options
No first char
No need char
+Subject length lower bound = 2
Starting byte set: a b

/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
@@ -9069,6 +9107,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: x y z

/(?(?=.*b)b|^)/CI
@@ -9838,4 +9877,347 @@
/(?<=b(?1))xyz(b+)pqrstuvew/
Failed: lookbehind assertion is not fixed length at offset 26

+/(a|bc)\1/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b 
+
+/(a|bc)\1{2,3}/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 3
+Starting byte set: a b 
+
+/(a|bc)(?1)/SI
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b 
+
+/(a|b\1)(a|b\1)/SI
+Capturing subpattern count = 2
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b 
+
+/(a|b\1){2}/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b 
+
+/(a|bbbb\1)(a|bbbb\1)/SI
+Capturing subpattern count = 2
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b 
+
+/(a|bbbb\1){2}/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b 
+
+/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/SI
+Capturing subpattern count = 1
+Options: anchored
+No first char
+Need char = ':'
+Subject length lower bound = 22
+No set of starting bytes
+
+/  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*                          # optional leading comment
+(?:    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)                    # initial word
+(?:  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)  )* # further okay, if led by a period
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+# address
+|                     #  or
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)             # one word, optionally followed by....
+(?:
+[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]  |  # atom and space parts, or...
+\(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)       |  # comments, or...
+
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+# quoted strings
+)*
+<  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*                     # leading <
+(?:  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+
+(?:  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  ,  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+)* # further okay, if led by comma
+:                                # closing colon
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  )? #       optional route
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)                    # initial word
+(?:  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?:                      # opening quote...
+[^\\\x80-\xff\n\015"]                #   Anything except backslash and quote
+|                     #    or
+\\ [^\x80-\xff]           #   Escaped something (something != CR)
+)* "  # closing quote
+)  )* # further okay, if led by a period
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  @  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*    (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                           # initial subdomain
+(?:                                  #
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  \.                        # if led by a period...
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*   (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+    # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|   \[                         # [
+(?: [^\\\x80-\xff\n\015\[\]] |  \\ [^\x80-\xff]  )*    #    stuff
+\]                        #           ]
+)                     #   ...further okay
+)*
+#       address spec
+(?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*  > #                  trailing >
+# name and address
+)  (?: [\040\t] |  \(
+(?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
+\)  )*                       # optional trailing comment
+/xSI
+Capturing subpattern count = 0
+Contains explicit CR or LF match
+Options: extended
+No first char
+No need char
+Subject length lower bound = 3
+Starting byte set: \x09 \x20 ! " # $ % & ' ( * + - / 0 1 2 3 4 5 6 7 8 
+  9 = ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ ` a b c d e 
+  f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f 
+
+/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/isIS
+Capturing subpattern count = 11
+Options: caseless dotall
+First char = '<'
+Need char = '>'
+Subject length lower bound = 47
+No set of starting bytes
+
+"(?>.*/)foo"SI
+Capturing subpattern count = 0
+No options
+First char at start or follows newline
+Need char = 'o'
+Subject length lower bound = 4
+No set of starting bytes
+
+/(?(?=[^a-z]+[a-z])  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} ) /xSI
+Capturing subpattern count = 0
+Options: extended
+No first char
+Need char = '-'
+Subject length lower bound = 8
+No set of starting bytes
+
+/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/iSI
+Capturing subpattern count = 1
+Options: caseless
+No first char
+No need char
+Subject length lower bound = 1
+Starting byte set: A B C a b c 
+
+/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/SI
+Capturing subpattern count = 0
+No options
+No first char
+Need char = 'b'
+Subject length lower bound = 41
+Starting byte set: c d 
+
+/<a[\s]+href[\s]*=[\s]*          # find <a href=
+ ([\"\'])?                       # find single or double quote
+ (?(1) (.*?)\1 | ([^\s]+))       # if quote found, match up to next matching
+                                 # quote, otherwise match up to next space
+/isxSI
+Capturing subpattern count = 3
+Max back reference = 1
+Options: caseless extended dotall
+First char = '<'
+Need char = '='
+Subject length lower bound = 9
+No set of starting bytes
+
+/^(?!:)                       # colon disallowed at start
+  (?:                         # start of item
+    (?: [0-9a-f]{1,4} |       # 1-4 hex digits or
+    (?(1)0 | () ) )           # if null previously matched, fail; else null
+    :                         # followed by colon
+  ){1,7}                      # end item; 1-7 of them required               
+  [0-9a-f]{1,4} $             # final hex number at end of string
+  (?(1)|.)                    # check that there was an empty component
+  /xiIS
+Capturing subpattern count = 1
+Options: anchored caseless extended
+No first char
+Need char = ':'
+Subject length lower bound = 2
+No set of starting bytes
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testoutput3
===================================================================
--- code/trunk/testdata/testoutput3    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/testdata/testoutput3    2009-09-26 19:12:32 UTC (rev 455)
@@ -87,6 +87,7 @@
 No options
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P 
   Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z 


@@ -95,6 +96,7 @@
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
\xAA \xB5 \xBA \xC0 \xC1 \xC2 \xC3 \xC4 \xC5 \xC6 \xC7 \xC8 \xC9 \xCA \xCB \xCC \xCD \xCE \xCF \xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD8 \xD9 \xDA \xDB \xDC \xDD \xDE \xDF \xE0 \xE1 \xE2

Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/testdata/testoutput5    2009-09-26 19:12:32 UTC (rev 455)
@@ -351,6 +351,7 @@
 Options: utf8
 No first char
 No need char
+Subject length lower bound = 1
 Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a 
   \x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 
   \x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 
@@ -388,7 +389,8 @@
 Options: utf8
 First char = 196
 Need char = 128
-Study returned NULL
+Subject length lower bound = 3
+No set of starting bytes
   \x{100}\x{100}\x{100}\x{100\x{100}
  0: \x{100}\x{100}\x{100}


@@ -407,6 +409,7 @@
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: x \xc4

/(\x{100}*a|x)/8SDZ
@@ -425,6 +428,7 @@
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a x \xc4

/(\x{100}{0,2}a|x)/8SDZ
@@ -443,6 +447,7 @@
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a x \xc4

/(\x{100}{1,2}a|x)/8SDZ
@@ -462,6 +467,7 @@
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: x \xc4

/\x{100}*(\d+|"(?1)")/8

Modified: code/trunk/testdata/testoutput7
===================================================================
--- code/trunk/testdata/testoutput7    2009-09-22 09:42:11 UTC (rev 454)
+++ code/trunk/testdata/testoutput7    2009-09-26 19:12:32 UTC (rev 455)
@@ -7472,4 +7472,46 @@
  0: 
  0+ CBA


+/(abc|def|xyz)/I
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+    terhjk;abcdaadsfe
+ 0: abc
+    the quick xyz brown fox 
+ 0: xyz
+    \Yterhjk;abcdaadsfe
+ 0: abc
+    \Ythe quick xyz brown fox 
+ 0: xyz
+    ** Failers
+No match
+    thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+    \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+
+/(abc|def|xyz)/SI
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+Subject length lower bound = 3
+Starting byte set: a d x 
+    terhjk;abcdaadsfe
+ 0: abc
+    the quick xyz brown fox 
+ 0: xyz
+    \Yterhjk;abcdaadsfe
+ 0: abc
+    \Ythe quick xyz brown fox 
+ 0: xyz
+    ** Failers
+No match
+    thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+    \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+
 /-- End of testinput7 --/