[Pcre-svn] [1226] code/trunk: Fix bug in UTF-16 checker retu…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [1226] code/trunk: Fix bug in UTF-16 checker returning wrong offset for missing low surrogate.
Revision: 1226
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1226
Author:   ph10
Date:     2020-02-24 15:39:56 +0000 (Mon, 24 Feb 2020)
Log Message:
-----------
Fix bug in UTF-16 checker returning wrong offset for missing low surrogate.
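
To illustrate the offset change, here is a minimal standalone sketch of a UTF-16 surrogate check. It is not the code in pcre2_valid_utf.c: the function name sketch_valid_utf16 and the SKETCH_* codes are hypothetical, invented for this example. Only the final branch mirrors the one-line change in this revision. By the time the low surrogate is tested, p has already advanced past the high surrogate, so the reported offset has to step back by one code unit to point at the start of the bad pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; these are NOT the PCRE2
   error constants. */
enum {
  SKETCH_OK = 0,
  SKETCH_ERR_MISSING_LOW = -1,   /* subject ends right after a high surrogate */
  SKETCH_ERR_INVALID_LOW = -2,   /* high surrogate not followed by a low one */
  SKETCH_ERR_ISOLATED_LOW = -3   /* low surrogate with no preceding high one */
};

/* Check a buffer of UTF-16 code units. On failure, *erroroffset is set to
   the code-unit offset where the offending sequence starts. */
static int sketch_valid_utf16(const uint16_t *string, size_t length,
                              size_t *erroroffset)
{
  const uint16_t *p;

  assert(string != NULL && erroroffset != NULL);

  for (p = string; length > 0; p++)
    {
    uint16_t c = *p;
    length--;

    if ((c & 0xf800) != 0xd800) continue;   /* not a surrogate */

    if ((c & 0x0400) != 0)
      {
      /* A low surrogate came first; p still points at it. */
      *erroroffset = (size_t)(p - string);
      return SKETCH_ERR_ISOLATED_LOW;
      }

    if (length == 0)
      {
      /* High surrogate is the last unit; p still points at it. */
      *erroroffset = (size_t)(p - string);
      return SKETCH_ERR_MISSING_LOW;
      }

    p++;        /* step to the unit that should be the low surrogate */
    length--;

    if ((*p & 0xfc00) != 0xdc00)
      {
      /* p now points one unit past the high surrogate, so subtract 1 to
         report where the invalid sequence begins. */
      *erroroffset = (size_t)(p - string) - 1;
      return SKETCH_ERR_INVALID_LOW;
      }
    }

  return SKETCH_OK;
}
```

With this sketch, the four code units X, X, 0xd800, 0x1234 give an error offset of 2, and 0xd800, a, a gives an offset of 0. With the offset one unit too high, a caller that skips past the reported position could treat the code unit following the high surrogate as the invalid one, which fits the ChangeLog's example of a match that starts immediately after the invalid high surrogate.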


Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/src/pcre2_valid_utf.c
    code/trunk/testdata/testinput12
    code/trunk/testdata/testoutput12-16
    code/trunk/testdata/testoutput12-32
    code/trunk/testdata/testoutput14-16


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2020-02-24 05:26:15 UTC (rev 1225)
+++ code/trunk/ChangeLog    2020-02-24 15:39:56 UTC (rev 1226)
@@ -71,7 +71,13 @@
 upper/lower case computations on characters whose code points are greater than 
 127. Documentation is not yet updated. JIT is not yet updated.


+19. The function for checking UTF-16 validity was returning an incorrect offset
+for the start of the error when a high surrogate was not followed by a valid
+low surrogate. This caused incorrect behaviour, for example when
+PCRE2_MATCH_INVALID_UTF was set and a match started immediately following the
+invalid high surrogate, such as /aa/ matching "\x{d800}aa".

+
Version 10.34 21-November-2019
------------------------------


Modified: code/trunk/src/pcre2_valid_utf.c
===================================================================
--- code/trunk/src/pcre2_valid_utf.c    2020-02-24 05:26:15 UTC (rev 1225)
+++ code/trunk/src/pcre2_valid_utf.c    2020-02-24 15:39:56 UTC (rev 1226)
@@ -7,7 +7,7 @@


                        Written by Philip Hazel
      Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2017 University of Cambridge
+          New API code Copyright (c) 2016-2020 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -347,7 +347,7 @@
     length--;
     if ((*p & 0xfc00) != 0xdc00)
       {
-      *erroroffset = p - string;
+      *erroroffset = p - string - 1;
       return PCRE2_ERROR_UTF16_ERR2;
       }
     }


Modified: code/trunk/testdata/testinput12
===================================================================
--- code/trunk/testdata/testinput12    2020-02-24 05:26:15 UTC (rev 1225)
+++ code/trunk/testdata/testinput12    2020-02-24 15:39:56 UTC (rev 1226)
@@ -444,7 +444,13 @@
 \= Expect no match
     A\x{d800}B
     A\x{110000}B 
+    
+/aa/utf,ucp,match_invalid_utf,global
+    aa\x{d800}aa


+/aa/utf,ucp,match_invalid_utf,global
+    \x{d800}aa
+
 # ---------------------------------------------------- 


/(*UTF)(?=\x{123})/I

Modified: code/trunk/testdata/testoutput12-16
===================================================================
--- code/trunk/testdata/testoutput12-16    2020-02-24 05:26:15 UTC (rev 1225)
+++ code/trunk/testdata/testoutput12-16    2020-02-24 15:39:56 UTC (rev 1226)
@@ -533,7 +533,7 @@
     XX\x{110000}
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
     XX\x{d800}\x{1234}
-Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
+Failed: error -25: UTF-16 error: invalid low surrogate at offset 2
 \= Expect no match
     XX\x{d800}\=offset=3
 No match
@@ -1576,7 +1576,16 @@
 No match
     A\x{110000}B 
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
+    
+/aa/utf,ucp,match_invalid_utf,global
+    aa\x{d800}aa
+ 0: aa
+ 0: aa


+/aa/utf,ucp,match_invalid_utf,global
+    \x{d800}aa
+ 0: aa
+
 # ---------------------------------------------------- 


/(*UTF)(?=\x{123})/I

Modified: code/trunk/testdata/testoutput12-32
===================================================================
--- code/trunk/testdata/testoutput12-32    2020-02-24 05:26:15 UTC (rev 1225)
+++ code/trunk/testdata/testoutput12-32    2020-02-24 15:39:56 UTC (rev 1226)
@@ -1574,7 +1574,16 @@
 No match
     A\x{110000}B 
 No match
+    
+/aa/utf,ucp,match_invalid_utf,global
+    aa\x{d800}aa
+ 0: aa
+ 0: aa


+/aa/utf,ucp,match_invalid_utf,global
+    \x{d800}aa
+ 0: aa
+
 # ---------------------------------------------------- 


/(*UTF)(?=\x{123})/I

Modified: code/trunk/testdata/testoutput14-16
===================================================================
--- code/trunk/testdata/testoutput14-16    2020-02-24 05:26:15 UTC (rev 1225)
+++ code/trunk/testdata/testoutput14-16    2020-02-24 15:39:56 UTC (rev 1226)
@@ -33,7 +33,7 @@
     XX\x{110000}
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
     XX\x{d800}\x{1234}
-Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
+Failed: error -25: UTF-16 error: invalid low surrogate at offset 2


 /badutf/utf
     X\xdf