[Pcre-svn] [1103] code/trunk: Minor improvement to minimum l…

Top Page
Delete this message
Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [1103] code/trunk: Minor improvement to minimum length calculation.
Revision: 1103
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1103
Author:   ph10
Date:     2019-06-13 17:00:11 +0100 (Thu, 13 Jun 2019)
Log Message:
-----------
Minor improvement to minimum length calculation.


Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre2api.3
    code/trunk/doc/pcre2test.1
    code/trunk/src/pcre2_compile.c
    code/trunk/src/pcre2_study.c
    code/trunk/src/pcre2test.c
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput22-16
    code/trunk/testdata/testoutput22-8
    code/trunk/testdata/testoutput5
    code/trunk/testdata/testoutput6


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/ChangeLog    2019-06-13 16:00:11 UTC (rev 1103)
@@ -28,7 +28,15 @@
 8. Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero 
 minimum is potentially useful.


+9. Some changes to the way the minimum subject length is handled:

+   * When PCRE2_NO_START_OPTIMIZE is set, no minimum length is computed; 
+     pcre2test no longer shows a value (of zero).
+     
+   * When no minimum length is set by the normal scan, but a first and/or last 
+     code unit is recorded, set the minimum to 1 or 2 as appropriate.
+
+
 Version 10.33 16-April-2019
 ---------------------------



Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/doc/pcre2api.3    2019-06-13 16:00:11 UTC (rev 1103)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "30 May 2019" "PCRE2 10.34"
+.TH PCRE2API 3 "11 June 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -2229,12 +2229,12 @@
   PCRE2_INFO_MINLENGTH
 .sp
 If a minimum length for matching subject strings was computed, its value is
-returned. Otherwise the returned value is 0. The value is a number of
-characters, which in UTF mode may be different from the number of code units.
-The third argument should point to an \fBuint32_t\fP variable. The value is a
-lower bound to the length of any matching string. There may not be any strings
-of that length that do actually match, but every string that does match is at
-least that long.
+returned. Otherwise the returned value is 0. This value is not computed when
+PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in
+UTF mode may be different from the number of code units. The third argument
+should point to an \fBuint32_t\fP variable. The value is a lower bound to the
+length of any matching string. There may not be any strings of that length that
+do actually match, but every string that does match is at least that long.
 .sp
   PCRE2_INFO_NAMECOUNT
   PCRE2_INFO_NAMEENTRYSIZE
@@ -3848,6 +3848,6 @@
 .rs
 .sp
 .nf
-Last updated: 30 May 2019
+Last updated: 11 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2test.1
===================================================================
--- code/trunk/doc/pcre2test.1    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/doc/pcre2test.1    2019-06-13 16:00:11 UTC (rev 1103)
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "23 May 2019" "PCRE 10.34"
+.TH PCRE2TEST 1 "11 June 2019" "PCRE 10.34"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -695,7 +695,9 @@
 if there is more than one they are listed as "starting code units". "Last code
 unit" is the last literal code unit that must be present in any match. This is
 not necessarily the last character. These lines are omitted if no starting or
-ending code units are recorded.
+ending code units are recorded. The subject length line is omitted when 
+\fBno_start_optimize\fP is set because the minimum length is not calculated 
+when it can never be used.
 .P
 The \fBframesize\fP modifier shows the size, in bytes, of the storage frames
 used by \fBpcre2_match()\fP for handling backtracking. The size depends on the
@@ -2060,6 +2062,6 @@
 .rs
 .sp
 .nf
-Last updated: 23 May 2019
+Last updated: 11 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi


Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/src/pcre2_compile.c    2019-06-13 16:00:11 UTC (rev 1103)
@@ -10151,6 +10151,8 @@


 if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
   {
+  int minminlength = 0;  /* For minimal minlength from first/required CU */
+
   /* If we do not have a first code unit, see if there is one that is asserted
   (these are not saved during the compile because they can cause conflicts with
   actual literals that follow). */
@@ -10158,12 +10160,14 @@
   if (firstcuflags < 0)
     firstcu = find_firstassertedcu(codestart, &firstcuflags, 0);


- /* Save the data for a first code unit. */
+ /* Save the data for a first code unit. The existence of one means the
+ minimum length must be at least 1. */

   if (firstcuflags >= 0)
     {
     re->first_codeunit = firstcu;
     re->flags |= PCRE2_FIRSTSET;
+    minminlength++;


     /* Handle caseless first code units. */


@@ -10197,33 +10201,54 @@
            is_startline(codestart, 0, &cb, 0, FALSE))
     re->flags |= PCRE2_STARTLINE;


- /* Handle the "required code unit", if one is set. In the case of an anchored
- pattern, do this only if it follows a variable length item in the pattern. */
+ /* Handle the "required code unit", if one is set. We can increment the
+ minimum minimum length only if we are sure this really is a different
+ character, because the count is in characters, not code units. */

-  if (reqcuflags >= 0 &&
-       ((re->overall_options & PCRE2_ANCHORED) == 0 ||
-        (reqcuflags & REQ_VARY) != 0))
+  if (reqcuflags >= 0)
     {
-    re->last_codeunit = reqcu;
-    re->flags |= PCRE2_LASTSET;
+#if PCRE2_CODE_UNIT_WIDTH == 16
+    if ((re->overall_options & PCRE2_UTF) == 0 ||   /* Not UTF */
+        firstcuflags < 0 ||                         /* First not set */
+        (firstcu & 0xf800) != 0xd800 ||             /* First not surrogate */
+        (reqcu & 0xfc00) != 0xdc00)                 /* Req not low surrogate */
+#elif PCRE2_CODE_UNIT_WIDTH == 8
+    if ((re->overall_options & PCRE2_UTF) == 0 ||   /* Not UTF */
+        firstcuflags < 0 ||                         /* First not set */
+        (firstcu & 0x80) == 0 ||                    /* First is ASCII */
+        (reqcu & 0x80) == 0)                        /* Req is ASCII */
+#endif
+      {
+      minminlength++;
+      }


-    /* Handle caseless required code units as for first code units (above). */
+    /* In the case of an anchored pattern, set up the value only if it follows
+    a variable length item in the pattern. */


-    if ((reqcuflags & REQ_CASELESS) != 0)
+    if ((re->overall_options & PCRE2_ANCHORED) == 0 ||
+        (reqcuflags & REQ_VARY) != 0)
       {
-      if (reqcu < 128 || (!utf && reqcu < 255))
+      re->last_codeunit = reqcu;
+      re->flags |= PCRE2_LASTSET;
+
+      /* Handle caseless required code units as for first code units (above). */
+
+      if ((reqcuflags & REQ_CASELESS) != 0)
         {
-        if (cb.fcc[reqcu] != reqcu) re->flags |= PCRE2_LASTCASELESS;
-        }
+        if (reqcu < 128 || (!utf && reqcu < 255))
+          {
+          if (cb.fcc[reqcu] != reqcu) re->flags |= PCRE2_LASTCASELESS;
+          }
 #if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH != 8
-      else if (reqcu <= MAX_UTF_CODE_POINT && UCD_OTHERCASE(reqcu) != reqcu)
-        re->flags |= PCRE2_LASTCASELESS;
+        else if (reqcu <= MAX_UTF_CODE_POINT && UCD_OTHERCASE(reqcu) != reqcu)
+          re->flags |= PCRE2_LASTCASELESS;
 #endif
+        }
       }
     }


- /* Finally, study the compiled pattern to set up information such as a bitmap
- of starting code units and a minimum matching length. */
+ /* Study the compiled pattern to set up information such as a bitmap of
+ starting code units and a minimum matching length. */

   if (PRIV(study)(re) != 0)
     {
@@ -10230,6 +10255,11 @@
     errorcode = ERR31;
     goto HAD_CB_ERROR;
     }
+
+  /* If the minimum length set (or not set) by study() is less than the minimum
+  implied by required code units, override it. */
+
+  if (re->minlength < minminlength) re->minlength = minminlength;
   }   /* End of start-of-match optimizations. */


/* Control ends up here in all cases. When running under valgrind, make a

Modified: code/trunk/src/pcre2_study.c
===================================================================
--- code/trunk/src/pcre2_study.c    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/src/pcre2_study.c    2019-06-13 16:00:11 UTC (rev 1103)
@@ -214,9 +214,7 @@


     /* Reached end of a branch; if it's a ket it is the end of a nested
     call. If it's ALT it is an alternation in a nested call. If it is END it's
-    the end of the outer call. All can be handled by the same code. If an
-    ACCEPT was previously encountered, use the length that was in force at that
-    time, and pass back the shortest ACCEPT length. */
+    the end of the outer call. All can be handled by the same code. */


     case OP_ALT:
     case OP_KET:


Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/src/pcre2test.c    2019-06-13 16:00:11 UTC (rev 1103)
@@ -4704,7 +4704,8 @@
       }
     }


-  fprintf(outfile, "Subject length lower bound = %d\n", minlength);
+  if ((FLD(compiled_code, overall_options) & PCRE2_NO_START_OPTIMIZE) == 0)
+    fprintf(outfile, "Subject length lower bound = %d\n", minlength);


   if (pat_patctl.jit != 0 && (pat_patctl.control & CTL_JITVERIFY) != 0)
     {


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/testdata/testoutput2    2019-06-13 16:00:11 UTC (rev 1103)
@@ -1480,7 +1480,7 @@
 Capture group count = 0
 First code unit = 'a'
 Last code unit = 'a'
-Subject length lower bound = 1
+Subject length lower bound = 2


/(?=a)a.*/I
Capture group count = 0
@@ -3406,7 +3406,7 @@
Capture group count = 0
May match empty string
First code unit = 'a'
-Subject length lower bound = 0
+Subject length lower bound = 1

/(?=abc).xyz/Ii
Capture group count = 0
@@ -3425,7 +3425,7 @@
Capture group count = 0
May match empty string
First code unit = 'a'
-Subject length lower bound = 0
+Subject length lower bound = 1

/(?=.)a/I
Capture group count = 0
@@ -3436,7 +3436,7 @@
Capture group count = 1
First code unit = 'a'
Last code unit = 'a'
-Subject length lower bound = 1
+Subject length lower bound = 2

/((?=abcda)ab)/I
Capture group count = 1
@@ -10780,7 +10780,7 @@
Options: caseless
First code unit = 'a' (caseless)
Last code unit = 'a' (caseless)
-Subject length lower bound = 1
+Subject length lower bound = 2

/(abc)\1+/

@@ -11254,7 +11254,7 @@
 /z(*ACCEPT)a/I,aftertext
 Capture group count = 0
 First code unit = 'z'
-Subject length lower bound = 0
+Subject length lower bound = 1
     baxzbx
  0: z
  0+ bx
@@ -13572,7 +13572,6 @@
 /abcd/I,no_start_optimize
 Capture group count = 0
 Options: no_start_optimize
-Subject length lower bound = 0


 /(|ab)*?d/I
 Capture group count = 1
@@ -13588,7 +13587,6 @@
 /(|ab)*?d/I,no_start_optimize
 Capture group count = 1
 Options: no_start_optimize
-Subject length lower bound = 0
    abd
  0: abd
  1: ab
@@ -14638,7 +14636,7 @@
 Options: dupnames
 Starting code units: a b 
 Last code unit = 'c'
-Subject length lower bound = 0
+Subject length lower bound = 1


 /ab{3cd/
     ab{3cd
@@ -16436,7 +16434,7 @@
 Max back reference = 1
 First code unit = 'a'
 Last code unit = 'b'
-Subject length lower bound = 1
+Subject length lower bound = 2
     ab
  0: ab
  1: a


Modified: code/trunk/testdata/testoutput22-16
===================================================================
--- code/trunk/testdata/testoutput22-16    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/testdata/testoutput22-16    2019-06-13 16:00:11 UTC (rev 1103)
@@ -9,7 +9,7 @@
 Options: utf
 First code unit = 'a'
 Last code unit = 'e'
-Subject length lower bound = 0
+Subject length lower bound = 2
     abXde
  0: abXde



Modified: code/trunk/testdata/testoutput22-8
===================================================================
--- code/trunk/testdata/testoutput22-8    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/testdata/testoutput22-8    2019-06-13 16:00:11 UTC (rev 1103)
@@ -9,7 +9,7 @@
 Options: utf
 First code unit = 'a'
 Last code unit = 'e'
-Subject length lower bound = 0
+Subject length lower bound = 2
     abXde
  0: abXde



Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/testdata/testoutput5    2019-06-13 16:00:11 UTC (rev 1103)
@@ -474,7 +474,6 @@
 Capture group count = 0
 Compile options: no_start_optimize utf
 Overall options: anchored no_start_optimize utf
-Subject length lower bound = 0


/()()()()()()()()()()
()()()()()()()()()()

Modified: code/trunk/testdata/testoutput6
===================================================================
--- code/trunk/testdata/testoutput6    2019-06-11 07:37:29 UTC (rev 1102)
+++ code/trunk/testdata/testoutput6    2019-06-13 16:00:11 UTC (rev 1103)
@@ -6845,7 +6845,6 @@
 /(abc|def|xyz)/I,no_start_optimize
 Capture group count = 1
 Options: no_start_optimize
-Subject length lower bound = 0
     terhjk;abcdaadsfe
  0: abc
     the quick xyz brown fox