[Pcre-svn] [899] code/trunk: Fix PCRE2_FIRSTLINE bug when a pattern match starts with the first code unit of

Author: Subversion repository
To: pcre-svn

Revision: 899

          http://www.exim.org/viewvc/pcre2?view=rev&revision=899
Author:   ph10
Date:     2018-01-01 14:12:35 +0000 (Mon, 01 Jan 2018)
Log Message:
-----------
Fix PCRE2_FIRSTLINE bug when a pattern match starts with the first code unit of 
a newline sequence.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/src/pcre2_dfa_match.c
    code/trunk/src/pcre2_match.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput6
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput6

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/ChangeLog    2018-01-01 14:12:35 UTC (rev 899)
@@ -100,12 +100,18 @@
 end at the original start point. Also arranged for it to detect when \K causes
 the end of a match to be before its start.

-24. Similar to 23 above, strange things (including loops) could happen in
-pcre2grep when \K was used in an assertion when --colour was used or in
-multiline mode. The "end at original start point" bug is fixed, and if the end
+24. Similar to 23 above, strange things (including loops) could happen in
+pcre2grep when \K was used in an assertion when --colour was used or in
+multiline mode. The "end at original start point" bug is fixed, and if the end
 point is found to be before the start point, they are swapped.

+25. When PCRE2_FIRSTLINE without PCRE2_NO_START_OPTIMIZE was used in non-JIT
+matching (both pcre2_match() and pcre2_dfa_match()) and the matched string
+started with the first code unit of a newline sequence, matching failed because
+the search for the first code unit stopped before rather than after the first
+code unit of a newline in the subject string.

+
 Version 10.30 14-August-2017
 ----------------------------

Modified: code/trunk/src/pcre2_dfa_match.c
===================================================================
--- code/trunk/src/pcre2_dfa_match.c    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/src/pcre2_dfa_match.c    2018-01-01 14:12:35 UTC (rev 899)
@@ -7,7 +7,7 @@

                        Written by Philip Hazel
      Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2017 University of Cambridge
+          New API code Copyright (c) 2016-2018 University of Cambridge

 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -3367,9 +3367,11 @@

     /* If firstline is TRUE, the start of the match is constrained to the first
     line of a multiline string. That is, the match must be before or at the
-    first newline. Implement this by temporarily adjusting end_subject so that
-    we stop the optimization scans for a first code unit at a newline. If the
-    match fails at the newline, later code breaks this loop. */
+    first newline following the start of matching. Temporarily adjust
+    end_subject so that we stop the optimization scans for a first code unit
+    immediately after the first character of a newline (the first code unit can
+    legitimately be a newline). If the match fails at the newline, later code
+    breaks this loop. */

     if (firstline)
       {
@@ -3377,7 +3379,7 @@
 #ifdef SUPPORT_UNICODE
       if (utf)
         {
-        while (t < mb->end_subject && !IS_NEWLINE(t))
+        while (t < end_subject && !IS_NEWLINE(t))
           {
           t++;
           ACROSSCHAR(t < end_subject, *t, t++);
@@ -3385,7 +3387,14 @@
         }
       else
 #endif
-      while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
+      while (t < end_subject && !IS_NEWLINE(t)) t++;
+
+      /* Note that we only need to advance by one code unit if we found a
+      newline. If the newline is CRLF, a first code unit of LF should not
+      match, because it is not at or before the newline. Similarly, only the
+      first code unit of a Unicode newline might be relevant. */
+
+      if (t < end_subject) t++;
       end_subject = t;
       }

Modified: code/trunk/src/pcre2_match.c
===================================================================
--- code/trunk/src/pcre2_match.c    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/src/pcre2_match.c    2018-01-01 14:12:35 UTC (rev 899)
@@ -6367,9 +6367,11 @@

     /* If firstline is TRUE, the start of the match is constrained to the first
     line of a multiline string. That is, the match must be before or at the
-    first newline. Implement this by temporarily adjusting end_subject so that
-    we stop the optimization scans for a first code unit at a newline. If the
-    match fails at the newline, later code breaks this loop. */
+    first newline following the start of matching. Temporarily adjust
+    end_subject so that we stop the optimization scans for a first code unit
+    immediately after the first character of a newline (the first code unit can
+    legitimately be a newline). If the match fails at the newline, later code
+    breaks this loop. */

     if (firstline)
       {
@@ -6377,7 +6379,7 @@
 #ifdef SUPPORT_UNICODE
       if (utf)
         {
-        while (t < mb->end_subject && !IS_NEWLINE(t))
+        while (t < end_subject && !IS_NEWLINE(t))
           {
           t++;
           ACROSSCHAR(t < end_subject, *t, t++);
@@ -6385,7 +6387,14 @@
         }
       else
 #endif
-      while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
+      while (t < end_subject && !IS_NEWLINE(t)) t++;
+
+      /* Note that we only need to advance by one code unit if we found a
+      newline. If the newline is CRLF, a first code unit of LF should not
+      match, because it is not at or before the newline. Similarly, only the
+      first code unit of a Unicode newline might be relevant. */
+
+      if (t < end_subject) t++;
       end_subject = t;
       }

@@ -6648,7 +6657,7 @@

   cb.start_match = (PCRE2_SIZE)(start_match - subject);
   cb.callout_flags |= PCRE2_CALLOUT_STARTMATCH;
-    
+
   mb->start_used_ptr = start_match;
   mb->last_used_ptr = start_match;
   mb->match_call_count = 0;

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/testdata/testinput2    2018-01-01 14:12:35 UTC (rev 899)
@@ -5395,4 +5395,14 @@
 \= Expect no match
     aac\=callout_extra

+/\n/firstline
+    xyz\nabc
+
+/\nabc/firstline
+    xyz\nabc
+
+/\x{0a}abc/firstline,newline=crlf
+\= Expect no match
+    xyz\r\nabc
+
 # End of testinput2

Modified: code/trunk/testdata/testinput6
===================================================================
--- code/trunk/testdata/testinput6    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/testdata/testinput6    2018-01-01 14:12:35 UTC (rev 899)
@@ -4932,4 +4932,14 @@
 /(*LIMIT_MATCH=100).*(?![|H]?.*(?![|H]?););.*(?![|H]?.*(?![|H]?););?\x00\x00\x00\x00\x00\x00\x00(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?![|);)?.*(![|H]?);)?.*(?![|H]?);)?.*(?![|H]?);)?.*(?![|H]););![|H]?););[|H]?);|H]?);)\x00\x00\x00?\x00\x00\x00?H]?););?![|H]?);)?.*(?![|H]?););[||H]?);)?.*(?![|H]?););[|H]?);(?![|H]?););![|H]?););[|H]?);|H]?);)?.*(?![|H]?););;[?\x00\x00\x00\x00\x00\x00\x00![|H]?););![|H]?););[|H]?);|H]?);)?.*(?![|H]?););/no_dotstar_anchor
 .*(?![|H]?.*(?![|H]?););.*(?![|H]?.*(?![|H]?););?\x00\x00\x00\x00\x00\x00\x00(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?![|);)?.*(![|H]?);)?.*(?![|H]?);)?.*(?![|H]?);)?.*(?![|H]););![|H]?););[|H]?);|H]?);)\x00\x00\x00?\x00\x00\x00?H]?););?![|H]?);)?.*(?![|H]?););[||H]?);)?.*(?![|H]?););[|H]?);(?![|H]?););![|H]?););[|H]?);|H]?);)?.*(?![|H]?););;[?\x00\x00\x00\x00\x00\x00\x00![|H]?););![|H]?););[|H]?);|H]?);)?.*(?![|H]?););

+/\n/firstline
+    xyz\nabc
+
+/\nabc/firstline
+    xyz\nabc
+
+/\x{0a}abc/firstline,newline=crlf
+\= Expect no match
+    xyz\r\nabc
+
 # End of testinput6

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/testdata/testoutput2    2018-01-01 14:12:35 UTC (rev 899)
@@ -16440,6 +16440,19 @@
      ^^     b
 No match

+/\n/firstline
+    xyz\nabc
+ 0: \x0a
+
+/\nabc/firstline
+    xyz\nabc
+ 0: \x0aabc
+
+/\x{0a}abc/firstline,newline=crlf
+\= Expect no match
+    xyz\r\nabc
+No match
+
 # End of testinput2
 Error -65: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data

Modified: code/trunk/testdata/testoutput6
===================================================================
--- code/trunk/testdata/testoutput6    2017-12-31 17:44:12 UTC (rev 898)
+++ code/trunk/testdata/testoutput6    2018-01-01 14:12:35 UTC (rev 899)
@@ -7753,4 +7753,17 @@
 .*(?![|H]?.*(?![|H]?););.*(?![|H]?.*(?![|H]?););?\x00\x00\x00\x00\x00\x00\x00(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?!(?![|);)?.*(![|H]?);)?.*(?![|H]?);)?.*(?![|H]?);)?.*(?![|H]););![|H]?););[|H]?);|H]?);)\x00\x00\x00?\x00\x00\x00?H]?););?![|H]?);)?.*(?![|H]?););[||H]?);)?.*(?![|H]?););[|H]?);(?![|H]?););![|H]?););[|H]?);|H]?);)?.*(?![|H]?););;[?\x00\x00\x00\x00\x00\x00\x00![|H]?););![|H]?););[|H]?);|H]?);)?.*(?![|H]?););
 Failed: error -47: match limit exceeded

+/\n/firstline
+    xyz\nabc
+ 0: \x0a
+
+/\nabc/firstline
+    xyz\nabc
+ 0: \x0aabc
+
+/\x{0a}abc/firstline,newline=crlf
+\= Expect no match
+    xyz\r\nabc
+No match
+
 # End of testinput6
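
The effect of the one-code-unit adjustment described in ChangeLog item 25 can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the PCRE2 implementation: the names nl_type, is_newline(), scan_limit() and first_unit_found() are hypothetical, only LF and CRLF newline conventions are modelled, and UTF handling is left out. It computes where the first-code-unit scan would stop, before and after the change, and checks whether a given first code unit can be found within that limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical newline conventions for this sketch (not PCRE2's own types). */
typedef enum { NL_LF, NL_CRLF } nl_type;

/* Return nonzero if a newline sequence starts at p, where p < end. */
int is_newline(const char *p, const char *end, nl_type nl)
{
  if (nl == NL_LF) return *p == '\n';
  return *p == '\r' && p + 1 < end && p[1] == '\n';
}

/* Compute the temporary scan limit (as an offset) used when "firstline" is
   set. With include_first_unit == 0 the limit stops at the newline, which
   models the behaviour before this revision. With include_first_unit != 0
   the limit is moved one code unit past the start of the newline, which
   models the behaviour after it. */
size_t scan_limit(const char *subject, size_t length, nl_type nl,
                  int include_first_unit)
{
  const char *t = subject;
  const char *end = subject + length;

  assert(subject != NULL);
  while (t < end && !is_newline(t, end, nl)) t++;

  /* Advance by one code unit only: for CRLF this admits CR but not LF. */
  if (include_first_unit && t < end) t++;
  return (size_t)(t - subject);
}

/* Return 1 if first_unit occurs in subject[0..limit), otherwise 0. */
int first_unit_found(const char *subject, size_t limit, char first_unit)
{
  return memchr(subject, first_unit, limit) != NULL;
}
```

For the subject "xyz\nabc" with an LF newline, the old limit is offset 3, so a search for a first code unit of "\n" finds nothing; the new limit is offset 4, so the search succeeds, in line with the first two added tests. For "xyz\r\nabc" with a CRLF newline, the new limit is also offset 4, which covers the CR but not the LF, in line with the third test expecting no match.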
