[Pcre-svn] [932] code/trunk: Add support for PCRE_INFO_MAXLO…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [932] code/trunk: Add support for PCRE_INFO_MAXLOOKBEHIND.
Revision: 932
          http://vcs.pcre.org/viewvc?view=rev&revision=932
Author:   ph10
Date:     2012-02-24 18:54:43 +0000 (Fri, 24 Feb 2012)


Log Message:
-----------
Add support for PCRE_INFO_MAXLOOKBEHIND.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrepartial.3
    code/trunk/pcre.h.in
    code/trunk/pcre_compile.c
    code/trunk/pcre_fullinfo.c
    code/trunk/pcre_internal.h
    code/trunk/pcretest.c
    code/trunk/testdata/testinput5
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput5


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/ChangeLog    2012-02-24 18:54:43 UTC (rev 932)
@@ -46,6 +46,8 @@


 10. The command "./RunTest list" lists the available tests without actually
     running any of them. (Because I keep forgetting what they all are.)
+    
+11. Add PCRE_INFO_MAXLOOKBEHIND. 



Version 8.30 04-February-2012

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/doc/pcreapi.3    2012-02-24 18:54:43 UTC (rev 932)
@@ -1235,6 +1235,13 @@
 /^a\ed+z\ed+/ the returned value is "z", but for /^a\edz\ed/ the returned value
 is -1.
 .sp
+  PCRE_INFO_MAXLOOKBEHIND
+.sp
+Return the number of characters (NB not bytes) in the longest lookbehind
+assertion in the pattern. Note that the simple assertions \eb and \eB require a
+one-character lookbehind. This information is useful when doing multi-segment 
+matching using the partial matching facilities.
+.sp
   PCRE_INFO_MINLENGTH
 .sp
 If the pattern was studied and a minimum length for matching subject strings
@@ -2646,6 +2653,6 @@
 .rs
 .sp
 .nf
-Last updated: 22 February 2012
+Last updated: 24 February 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/doc/pcrepartial.3    2012-02-24 18:54:43 UTC (rev 932)
@@ -302,14 +302,16 @@
 .sp
 At this stage, an application could discard the text preceding "23ja", add on
 text from the next segment, and call the matching function again. Unlike the
-DFA matching functions the entire matching string must always be available, and
-the complete matching process occurs for each call, so more memory and more
+DFA matching functions, the entire matching string must always be available,
+and the complete matching process occurs for each call, so more memory and more
 processing time is needed.
 .P
 \fBNote:\fP If the pattern contains lookbehind assertions, or \eK, or starts
 with \eb or \eB, the string that is returned for a partial match includes
 characters that precede the partially matched string itself, because these must
 be retained when adding on more characters for a subsequent matching attempt.
+However, in some cases you may need to retain even earlier characters, as
+discussed in the next section.
 .
 .
 .SH "ISSUES WITH MULTI-SEGMENT MATCHING"
@@ -324,14 +326,31 @@
 doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
 includes the effect of PCRE_NOTEOL.
 .P
-2. Lookbehind assertions at the start of a pattern are catered for in the
-offsets that are returned for a partial match. However, in theory, a lookbehind
-assertion later in the pattern could require even earlier characters to be
-inspected, and it might not have been reached when a partial match occurs. This
-is probably an extremely unlikely case; you could guard against it to a certain
-extent by always including extra characters at the start.
+2. Lookbehind assertions that have already been obeyed are catered for in the
+offsets that are returned for a partial match. However a lookbehind assertion
+later in the pattern could require even earlier characters to be inspected. You 
+can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the 
+\fBpcre_fullinfo()\fP or \fBpcre16_fullinfo()\fP functions to obtain the length
+of the largest lookbehind in the pattern. This length is given in characters,
+not bytes. If you always retain at least that many characters before the
+partially matched string, all should be well. (Of course, near the start of the
+subject, fewer characters may be present; in that case all characters should be
+retained.)
 .P
-3. Matching a subject string that is split into multiple segments may not
+3. Because a partial match must always contain at least one character, what
+might be considered a partial match of an empty string actually gives a "no
+match" result. For example:
+.sp
+    re> /c(?<=abc)x/
+  data> ab\eP
+  No match
+.sp
+If the next segment begins "cx", a match should be found, but this will only 
+happen if characters from the previous segment are retained. For this reason, a
+"no match" result should be interpreted as "partial match of an empty string"
+when the pattern contains lookbehinds.
+.P
+4. Matching a subject string that is split into multiple segments may not
 always produce exactly the same result as matching over one single long string,
 especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
 Word Boundaries" above describes an issue that arises if the pattern ends with
@@ -372,7 +391,7 @@
   data> gsb\eR\eP\eP\eD
   Partial match: gsb
 .sp
-4. Patterns that contain alternatives at the top level which do not all start
+5. Patterns that contain alternatives at the top level which do not all start
 with the same pattern item may not work as expected when PCRE_DFA_RESTART is
 used. For example, consider this pattern:
 .sp
@@ -421,6 +440,6 @@
 .rs
 .sp
 .nf
-Last updated: 18 February 2012
+Last updated: 24 February 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi


Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre.h.in    2012-02-24 18:54:43 UTC (rev 932)
@@ -234,6 +234,7 @@
 #define PCRE_INFO_MINLENGTH         15
 #define PCRE_INFO_JIT               16
 #define PCRE_INFO_JITSIZE           17
+#define PCRE_INFO_MAXLOOKBEHIND     18


 /* Request types for pcre_config(). Do not re-arrange, in order to remain
 compatible. */

Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre_compile.c    2012-02-24 18:54:43 UTC (rev 932)
@@ -6852,10 +6852,13 @@
       /* For the rest (including \X when Unicode properties are supported), we
       can obtain the OP value by negating the escape value in the default
       situation when PCRE_UCP is not set. When it *is* set, we substitute
-      Unicode property tests. */
+      Unicode property tests. Note that \b and \B do a one-character 
+      lookbehind. */


       else
         {
+        if ((-c == ESC_b || -c == ESC_B) && cd->max_lookbehind == 0)
+          cd->max_lookbehind = 1; 
 #ifdef SUPPORT_UCP
         if (-c >= ESC_DU && -c <= ESC_wu)
           {
@@ -7163,7 +7166,12 @@
         *ptrptr = ptr;
         return FALSE;
         }
-      else { PUT(reverse_count, 0, fixed_length); }
+      else 
+        { 
+        if (fixed_length > cd->max_lookbehind) 
+          cd->max_lookbehind = fixed_length; 
+        PUT(reverse_count, 0, fixed_length); 
+        }
       }
     }


@@ -7833,6 +7841,7 @@
 cd->end_pattern = (const pcre_uchar *)(pattern + STRLEN_UC((const pcre_uchar *)pattern));
 cd->req_varyopt = 0;
 cd->assert_depth = 0;
+cd->max_lookbehind = 0;
 cd->external_options = options;
 cd->external_flags = 0;
 cd->open_caps = NULL;
@@ -7883,7 +7892,6 @@
 re->size = (int)size;
 re->options = cd->external_options;
 re->flags = cd->external_flags;
-re->dummy1 = 0;
 re->first_char = 0;
 re->req_char = 0;
 re->name_table_offset = sizeof(REAL_PCRE) / sizeof(pcre_uchar);
@@ -7903,6 +7911,7 @@
 cd->final_bracount = cd->bracount; /* Save for checking forward references */
 cd->assert_depth = 0;
 cd->bracount = 0;
+cd->max_lookbehind = 0;
 cd->names_found = 0;
 cd->name_table = (pcre_uchar *)re + re->name_table_offset;
 codestart = cd->name_table + re->name_entry_size * re->name_count;
@@ -7924,6 +7933,7 @@
&firstchar, &reqchar, NULL, cd, NULL);
 re->top_bracket = cd->bracount;
 re->top_backref = cd->top_backref;
+re->max_lookbehind = cd->max_lookbehind;
 re->flags = cd->external_flags | PCRE_MODE;

 if (cd->had_accept) reqchar = REQ_NONE;   /* Must disable after (*ACCEPT) */
@@ -8011,6 +8021,7 @@
                     (fixed_length == -4)? ERR70 : ERR25;
         break;
         }
+      if (fixed_length > cd->max_lookbehind) cd->max_lookbehind = fixed_length;
       PUT(cc, 1, fixed_length);
       }
     cc += 1 + LINK_SIZE;


Modified: code/trunk/pcre_fullinfo.c
===================================================================
--- code/trunk/pcre_fullinfo.c    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre_fullinfo.c    2012-02-24 18:54:43 UTC (rev 932)
@@ -192,6 +192,10 @@
   case PCRE_INFO_HASCRORLF:
   *((int *)where) = (re->flags & PCRE_HASCRORLF) != 0;
   break;
+  
+  case PCRE_INFO_MAXLOOKBEHIND: 
+  *((int *)where) = re->max_lookbehind;
+  break;  


   default: return PCRE_ERROR_BADOPTION;
   }

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcre_internal.h    2012-02-24 18:54:43 UTC (rev 932)
@@ -1974,16 +1974,15 @@
   pcre_uint32 size;               /* Total that was malloced */
   pcre_uint32 options;            /* Public options */
   pcre_uint16 flags;              /* Private flags */
-  pcre_uint16 dummy1;             /* For future use */
-  pcre_uint16 top_bracket;
-  pcre_uint16 top_backref;
+  pcre_uint16 max_lookbehind;     /* Longest lookbehind (characters) */
+  pcre_uint16 top_bracket;        /* Highest numbered group */
+  pcre_uint16 top_backref;        /* Highest numbered back reference */
   pcre_uint16 first_char;         /* Starting character */
   pcre_uint16 req_char;           /* This character must be seen */
   pcre_uint16 name_table_offset;  /* Offset to name table that follows */
   pcre_uint16 name_entry_size;    /* Size of any name items */
   pcre_uint16 name_count;         /* Number of name items */
   pcre_uint16 ref_count;          /* Reference count */
-
   const pcre_uint8 *tables;       /* Pointer to tables or NULL for std */
   const pcre_uint8 *nullpad;      /* NULL padding */
 } REAL_PCRE;
@@ -2029,6 +2028,7 @@
   int  workspace_size;              /* Size of workspace */
   int  bracount;                    /* Count of capturing parens as we compile */
   int  final_bracount;              /* Saved value after first pass */
+  int  max_lookbehind;              /* Maximum lookbehind (characters) */
   int  top_backref;                 /* Maximum back reference */
   unsigned int backref_map;         /* Bitmap of low back refs */
   int  assert_depth;                /* Depth of nested assertions */


Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/pcretest.c    2012-02-24 18:54:43 UTC (rev 932)
@@ -3099,7 +3099,7 @@
       {
       unsigned long int all_options;
       int count, backrefmax, first_char, need_char, okpartial, jchanged,
-        hascrorlf;
+        hascrorlf, maxlookbehind;
       int nameentrysize, namecount;
       const pcre_uint8 *nametable;


@@ -3113,7 +3113,8 @@
           new_info(re, NULL, PCRE_INFO_NAMETABLE, (void *)&nametable) +
           new_info(re, NULL, PCRE_INFO_OKPARTIAL, &okpartial) +
           new_info(re, NULL, PCRE_INFO_JCHANGED, &jchanged) +
-          new_info(re, NULL, PCRE_INFO_HASCRORLF, &hascrorlf)
+          new_info(re, NULL, PCRE_INFO_HASCRORLF, &hascrorlf) +
+          new_info(re, NULL, PCRE_INFO_MAXLOOKBEHIND, &maxlookbehind) 
           != 0)
         goto SKIP_DATA;


@@ -3252,6 +3253,9 @@
           fprintf(outfile, "%s\n", caseless);
           }
         }
+        
+      if (maxlookbehind > 0) 
+        fprintf(outfile, "Max lookbehind = %d\n", maxlookbehind); 


       /* Don't output study size; at present it is in any case a fixed
       value, but it varies, depending on the computer architecture, and


Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/testdata/testinput5    2012-02-24 18:54:43 UTC (rev 932)
@@ -793,4 +793,6 @@


 /[^\x{100}]*[^\x{10000}]+[^\x{10ffff}]??[^\x{8000}]{4,}[^\x{7fff}]{2,9}?[^\x{fffff}]{5,6}+/8BZi

+/(?<=\x{1234}\x{1234})\bxy/I8
+
 /-- End of testinput5 --/

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/testdata/testoutput2    2012-02-24 18:54:43 UTC (rev 932)
@@ -451,6 +451,7 @@
 No options
 First char = 'f'
 Need char = 'o'
+Max lookbehind = 6
     foo
  0: foo
     catfoo
@@ -658,6 +659,7 @@
 No options
 No first char
 No need char
+Max lookbehind = 3
 Subject length lower bound = 1
 Starting byte set: a b 


@@ -666,6 +668,7 @@
 No options
 No first char
 Need char = 'a'
+Max lookbehind = 3
 Subject length lower bound = 5
 Starting byte set: a o

@@ -683,6 +686,7 @@
 Options: multiline
 No first char
 Need char = 'r'
+Max lookbehind = 4
     foo\nbarbar
  0: bar
     ***Failers
@@ -700,6 +704,7 @@
 Options: multiline
 First char at start or follows newline
 Need char = 'r'
+Max lookbehind = 4
     foo\nbarbar
  0: bar
     ***Failers
@@ -741,6 +746,7 @@
 No options
 First char = '-'
 Need char = 't'
+Max lookbehind = 7
     the bullock-cart
  0: -cart
     a donkey-cart race
@@ -757,12 +763,14 @@
 No options
 No first char
 No need char
+Max lookbehind = 3


 /(?>.*)(?<=(abcd)|(xyz))/I
 Capturing subpattern count = 2
 No options
 First char at start or follows newline
 No need char
+Max lookbehind = 4
     alphabetabcd
  0: alphabetabcd
  1: abcd
@@ -776,6 +784,7 @@
 No options
 First char = 'Z'
 Need char = 'Z'
+Max lookbehind = 4
     abxyZZ
  0: ZZ
     abXyZZ
@@ -804,6 +813,7 @@
 No options
 First char = 'b'
 Need char = 'r'
+Max lookbehind = 4
     bar
  0: bar
     foobbar
@@ -1205,6 +1215,7 @@
 No options
 First char = 'i'
 Need char = 's'
+Max lookbehind = 1
     Mississippi
  0: iss
  0+ issippi
@@ -1225,6 +1236,7 @@
 No options
 First char = 'i'
 Need char = 's'
+Max lookbehind = 1
     Mississippi
  0: iss
  0+ issippi
@@ -1234,6 +1246,7 @@
 No options
 First char = 'i'
 Need char = 's'
+Max lookbehind = 1
     Mississippi
  0: iss
  0+ issippi
@@ -1249,6 +1262,7 @@
 No options
 First char = 'i'
 Need char = 's'
+Max lookbehind = 1
     Mississippi
  0: iss
  0+ issippi
@@ -1260,6 +1274,7 @@
 No options
 First char = 'i'
 Need char = 's'
+Max lookbehind = 1
     Mississippi
  0: iss
  0+ issippi
@@ -1440,6 +1455,7 @@
 No options
 No first char
 No need char
+Max lookbehind = 3


 /abc(?!pqr)/I
 Capturing subpattern count = 0
@@ -3220,6 +3236,7 @@
 No options
 First char = '8'
 Need char = 'X'
+Max lookbehind = 1

|\$\<\.X\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\<EjmhUZ\?\.akp2dF\>qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|IDZ
------------------------------------------------------------------
@@ -3233,6 +3250,7 @@
 No options
 First char = '$'
 Need char = 'X'
+Max lookbehind = 1

 /(.*)\d+\1/I
 Capturing subpattern count = 1
@@ -3748,6 +3766,7 @@
 No options
 First char = 'x'
 Need char = 'z'
+Max lookbehind = 3
    abcxyz\C+
 Callout 0: last capture = 1
  0: <unset>
@@ -5395,6 +5414,7 @@
 No options
 No first char
 No need char
+Max lookbehind = 1
   ab cd\>1
  0:  cd


@@ -5403,6 +5423,7 @@
 Options: dotall
 No first char
 No need char
+Max lookbehind = 1
ab cd\>1
0: cd

@@ -11596,6 +11617,7 @@
 No options
 First char = 't'
 Need char = 't'
+Max lookbehind = 1
 Subject length lower bound = 18
 No set of starting bytes

@@ -11604,6 +11626,7 @@
 No options
 No first char
 No need char
+Max lookbehind = 1
 Subject length lower bound = 8
 Starting byte set: < o t u


Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2012-02-24 13:21:02 UTC (rev 931)
+++ code/trunk/testdata/testoutput5    2012-02-24 18:54:43 UTC (rev 932)
@@ -1873,4 +1873,11 @@
         End
 ------------------------------------------------------------------


+/(?<=\x{1234}\x{1234})\bxy/I8
+Capturing subpattern count = 0
+Options: utf
+First char = 'x'
+Need char = 'y'
+Max lookbehind = 2
+
 /-- End of testinput5 --/
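
The pcrepartial.3 change in this revision says the lookbehind length is given in characters, not bytes. In UTF-8 mode an application that retains text before a partial match therefore has to step back over whole characters. The helper below is a minimal sketch of that step and is not part of PCRE: the name retain_from and its parameters are hypothetical, and it assumes the buffer holds valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Given a UTF-8 buffer, the byte
   offset at which a partial match starts, and the maximum lookbehind in
   characters, return the byte offset from which text should be retained.
   Near the start of the buffer fewer characters may be present; in that
   case the result is 0, so that everything is retained. */
size_t retain_from(const unsigned char *buf, size_t match_start,
                   int max_lookbehind)
{
  size_t pos = match_start;
  assert(buf != NULL);

  while (max_lookbehind-- > 0 && pos > 0)
    {
    pos--;                      /* step onto the previous byte */
    /* UTF-8 continuation bytes have the form 10xxxxxx; skip back over
       them to reach the first byte of the character. */
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
      pos--;
    }

  return pos;
}
```

In an application, max_lookbehind would come from pcre_fullinfo() with PCRE_INFO_MAXLOOKBEHIND, and match_start from the first offset returned for a partial match. After a "no match" result, passing the segment length as match_start keeps the trailing characters that a later lookbehind might need.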