[Pcre-svn] [569] code/trunk: Add PCRE_ERROR_SHORTUTF8 to PCR…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [569] code/trunk: Add PCRE_ERROR_SHORTUTF8 to PCRE_PARTIAL_HARD processing.
Revision: 569
          http://vcs.pcre.org/viewvc?view=rev&revision=569
Author:   ph10
Date:     2010-11-07 16:14:50 +0000 (Sun, 07 Nov 2010)


Log Message:
-----------
Add PCRE_ERROR_SHORTUTF8 to PCRE_PARTIAL_HARD processing.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrepartial.3
    code/trunk/pcre.h.in
    code/trunk/pcre_dfa_exec.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_valid_utf8.c
    code/trunk/testdata/testinput5
    code/trunk/testdata/testinput8
    code/trunk/testdata/testoutput5
    code/trunk/testdata/testoutput8


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/ChangeLog    2010-11-07 16:14:50 UTC (rev 569)
@@ -92,6 +92,9 @@
 15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the 
     starting offset points to the beginning of a UTF-8 character was 
     unnecessarily clumsy. I tidied it up.  
+    
+16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
+    bad UTF-8 sequence and one that is incomplete. 



Version 8.10 25-Jun-2010

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/doc/pcreapi.3    2010-11-07 16:14:50 UTC (rev 569)
@@ -435,13 +435,17 @@
 Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns
 NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual
 error message. This is a static string that is part of the library. You must
-not try to free it. The byte offset from the start of the pattern to the
-character that was being processed when the error was discovered is placed in
-the variable pointed to by \fIerroffset\fP, which must not be NULL. If it is,
-an immediate error is given. Some errors are not detected until checks are
-carried out when the whole pattern has been scanned; in this case the offset is
-set to the end of the pattern.
+not try to free it. The offset from the start of the pattern to the byte that
+was being processed when the error was discovered is placed in the variable
+pointed to by \fIerroffset\fP, which must not be NULL. If it is, an immediate
+error is given. Some errors are not detected until checks are carried out when
+the whole pattern has been scanned; in this case the offset is set to the end
+of the pattern. 
 .P
+Note that the offset is in bytes, not characters, even in UTF-8 mode. It may 
+point into the middle of a UTF-8 character (for example, when
+PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
+.P
 If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the
 \fIerrorcodeptr\fP argument is not NULL, a non-zero error code number is
 returned via this argument in the event of an error. This is in addition to the
@@ -1515,9 +1519,11 @@
 \fBpcre\fP
 .\"
 page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns
-the error PCRE_ERROR_BADUTF8. If \fIstartoffset\fP contains a value that does 
-not point to the start of a UTF-8 character (or to the end of the subject),
-PCRE_ERROR_BADUTF8_OFFSET is returned.
+the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is
+a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If
+\fIstartoffset\fP contains a value that does not point to the start of a UTF-8
+character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
+returned.
 .P
 If you already know that your subject is valid, and you want to skip these
 checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when
@@ -1756,11 +1762,14 @@
   PCRE_ERROR_BADUTF8        (-10)
 .sp
 A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 
+character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead.
 .sp
   PCRE_ERROR_BADUTF8_OFFSET (-11)
 .sp
 The UTF-8 byte sequence that was passed as a subject was valid, but the value
-of \fIstartoffset\fP did not point to the beginning of a UTF-8 character.
+of \fIstartoffset\fP did not point to the beginning of a UTF-8 character or the 
+end of the subject.
 .sp
   PCRE_ERROR_PARTIAL        (-12)
 .sp
@@ -1800,6 +1809,12 @@
 .sp
 The value of \fIstartoffset\fP was negative or greater than the length of the 
 subject, that is, the value in \fIlength\fP.
+.sp
+  PCRE_ERROR_SHORTUTF8      (-25)
+.sp
+The subject string ended with an incomplete (truncated) UTF-8 character, and 
+the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8 
+is returned in this situation.
 .P
 Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP.
 .


Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/doc/pcrepartial.3    2010-11-07 16:14:50 UTC (rev 569)
@@ -113,6 +113,12 @@
 assumption is made that the end of the supplied subject string may not be the
 true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
 encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+.P
+Setting PCRE_PARTIAL_HARD also affects the way \fBpcre_exec()\fP checks UTF-8 
+subject strings for validity. Normally, an invalid UTF-8 sequence causes the
+error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8 
+character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when 
+PCRE_PARTIAL_HARD is set.
 .
 .
 .SS "Comparing hard and soft partial matching"
@@ -406,6 +412,6 @@
 .rs
 .sp
 .nf
-Last updated: 22 October 2010
+Last updated: 07 November 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi
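
As a reading aid for the documentation change above, here is a minimal sketch of how a caller that feeds a subject in segments might branch on the result code. It does not call PCRE. The enum names and sketch_next_action() are hypothetical, introduced only for illustration; the values -10, -12 and -25 are the ones shown in this commit, and -1 is the value of PCRE_ERROR_NOMATCH.

```c
#include <assert.h>

/* Hypothetical mirror of the result codes; the values match pcre.h. */
enum {
  SKETCH_ERROR_NOMATCH   = -1,
  SKETCH_ERROR_BADUTF8   = -10,
  SKETCH_ERROR_PARTIAL   = -12,
  SKETCH_ERROR_SHORTUTF8 = -25
};

/* Hypothetical actions for a caller that matches segment by segment. */
typedef enum {
  SKETCH_MATCHED,
  SKETCH_NO_MATCH,
  SKETCH_NEED_MORE_DATA,
  SKETCH_REJECT_INPUT,
  SKETCH_OTHER_ERROR
} sketch_action;

/* Map a pcre_exec()-style return code to the caller's next step. */
static sketch_action sketch_next_action(int rc)
{
  if (rc >= 0) return SKETCH_MATCHED;
  switch (rc)
    {
    case SKETCH_ERROR_NOMATCH:   return SKETCH_NO_MATCH;
    case SKETCH_ERROR_PARTIAL:   /* a partial match was found */
    case SKETCH_ERROR_SHORTUTF8: /* final character is merely truncated */
      return SKETCH_NEED_MORE_DATA;
    case SKETCH_ERROR_BADUTF8:   return SKETCH_REJECT_INPUT;
    default:                     return SKETCH_OTHER_ERROR;
    }
}

/* Small self-check that doubles as a usage example. */
static void sketch_self_check(void)
{
  assert(sketch_next_action(-25) == SKETCH_NEED_MORE_DATA);
  assert(sketch_next_action(-10) == SKETCH_REJECT_INPUT);
}
```

Treating -25 like -12 is a design assumption of this sketch: in both cases more input may resolve the situation, whereas -10 indicates data that no further input can repair.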


Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre.h.in    2010-11-07 16:14:50 UTC (rev 569)
@@ -162,6 +162,7 @@
 #define PCRE_ERROR_NULLWSLIMIT    (-22)  /* No longer actually used */
 #define PCRE_ERROR_BADNEWLINE     (-23)
 #define PCRE_ERROR_BADOFFSET      (-24)
+#define PCRE_ERROR_SHORTUTF8      (-25)


/* Request types for pcre_fullinfo() */


Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre_dfa_exec.c    2010-11-07 16:14:50 UTC (rev 569)
@@ -2963,11 +2963,13 @@
 #ifdef SUPPORT_UTF8
 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
   {
-  if (_pcre_valid_utf8((uschar *)subject, length) >= 0)
-    return PCRE_ERROR_BADUTF8;
+  int tb; 
+  if ((tb = _pcre_valid_utf8((uschar *)subject, length)) >= 0)
+    return (tb == length && (options & PCRE_PARTIAL_HARD) != 0)? 
+      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
   if (start_offset > 0 && start_offset < length)
     {
-    int tb = ((USPTR)subject)[start_offset] & 0xc0;
+    tb = ((USPTR)subject)[start_offset] & 0xc0;
     if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
     }
   }


Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre_exec.c    2010-11-07 16:14:50 UTC (rev 569)
@@ -5801,11 +5801,13 @@
 #ifdef SUPPORT_UTF8
 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
   {
-  if (_pcre_valid_utf8((USPTR)subject, length) >= 0)
-    return PCRE_ERROR_BADUTF8;
+  int tb; 
+  if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
+    return (tb == length && md->partial > 1)? 
+      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
   if (start_offset > 0 && start_offset < length)
     {
-    int tb = ((USPTR)subject)[start_offset] & 0xc0;
+    tb = ((USPTR)subject)[start_offset] & 0xc0;
     if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
     }
   }


Modified: code/trunk/pcre_valid_utf8.c
===================================================================
--- code/trunk/pcre_valid_utf8.c    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre_valid_utf8.c    2010-11-07 16:14:50 UTC (rev 569)
@@ -72,6 +72,20 @@


 Returns:       < 0    if the string is a valid UTF-8 string
                >= 0   otherwise; the value is the offset of the bad byte
+               
+Bad bytes can be:
+
+  . An isolated byte whose most significant bits are 0x80, because this
+    can only correctly appear within a UTF-8 character;
+      
+  . A byte whose most significant bits are 0xc0, but whose other bits indicate
+    that there are more than 3 additional bytes (i.e. an RFC 2279 starting 
+    byte, which is no longer valid under RFC 3629);
+    
+  .
+  
+The returned offset may also be equal to the length of the string; this means 
+that one or more bytes is missing from the final UTF-8 character.
 */


 int
@@ -93,7 +107,8 @@
   if (c < 128) continue;
   if (c < 0xc0) return p - string;
   ab = _pcre_utf8_table4[c & 0x3f];     /* Number of additional bytes */
-  if (length < ab || ab > 3) return p - string;
+  if (ab > 3) return p - string;        /* Too many for RFC 3629 */
+  if (length < ab) return p + 1 + length - string;   /* Missing bytes */
   length -= ab;


/* Check top bits in the second byte */
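
The return convention described in the comment above, and the way the two exec functions use it, can be illustrated with a simplified, standalone sketch. This is not the code of _pcre_valid_utf8(): it derives the number of additional bytes from byte ranges instead of _pcre_utf8_table4, and it omits other validity checks such as overlong encodings. All names beginning with sketch_ or SKETCH_ are hypothetical; only the values -10 and -25 come from the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; -10 and -25 are the values in pcre.h.in. */
enum { SKETCH_OK = 0, SKETCH_BADUTF8 = -10, SKETCH_SHORTUTF8 = -25 };

/* Returns < 0 if the string looks valid, otherwise the offset of the bad
byte. An offset equal to len means the final character is truncated. */
static int sketch_valid_utf8(const unsigned char *s, int len)
{
  int i = 0;
  assert(s != NULL && len >= 0);
  while (i < len)
    {
    unsigned char c = s[i];
    int ab, k;
    if (c < 0x80) { i++; continue; }    /* ASCII */
    if (c < 0xc0) return i;             /* Isolated continuation byte */
    if (c >= 0xf8) return i;            /* Too many for RFC 3629 */
    ab = (c >= 0xf0)? 3 : (c >= 0xe0)? 2 : 1;   /* Additional bytes */
    if (len - i - 1 < ab) return len;   /* Missing bytes */
    for (k = 1; k <= ab; k++)
      if ((s[i + k] & 0xc0) != 0x80) return i + k;
    i += ab + 1;
    }
  return -1;
}

/* Mirrors the decision added to the exec functions: SHORTUTF8 only when
the bad offset equals the length and hard partial matching is requested. */
static int sketch_classify(const unsigned char *s, int len, int partial_hard)
{
  int tb = sketch_valid_utf8(s, len);
  if (tb < 0) return SKETCH_OK;
  return (tb == len && partial_hard)? SKETCH_SHORTUTF8 : SKETCH_BADUTF8;
}
```

Under these assumptions, the two-byte subject 0xe1 0x88 yields an offset of 2 (its length), so sketch_classify() reports -25 only when partial_hard is non-zero and -10 otherwise.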

Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testinput5    2010-11-07 16:14:50 UTC (rev 569)
@@ -205,6 +205,9 @@
     \xC3
     \xC3\xC3\xC3
     \xC3\xC3\xC3\?
+    \xe1\x88 
+    \P\xe1\x88 
+    \P\P\xe1\x88 


 /anything/8
     \xc0\x80


Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testinput8    2010-11-07 16:14:50 UTC (rev 569)
@@ -63,6 +63,9 @@
     \xC3
     \xC3\xC3\xC3
     \xC3\xC3\xC3\?
+    \xe1\x88 
+    \P\xe1\x88 
+    \P\P\xe1\x88 


 /a.b/8
     acb


Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testoutput5    2010-11-07 16:14:50 UTC (rev 569)
@@ -797,7 +797,7 @@
 Failed: invalid UTF-8 string at offset 2


 /\xC3/8
-Failed: invalid UTF-8 string at offset 0
+Failed: invalid UTF-8 string at offset 1

 /\xC3\xC3\xC3xxx/8
 Failed: invalid UTF-8 string at offset 1
@@ -823,6 +823,12 @@
 Error -10
     \xC3\xC3\xC3\?
 No match
+    \xe1\x88 
+Error -10
+    \P\xe1\x88 
+Error -10
+    \P\P\xe1\x88 
+Error -25


 /anything/8
     \xc0\x80


Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8    2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testoutput8    2010-11-07 16:14:50 UTC (rev 569)
@@ -105,6 +105,12 @@
 Error -10
     \xC3\xC3\xC3\?
 No match
+    \xe1\x88 
+Error -10
+    \P\xe1\x88 
+Error -10
+    \P\P\xe1\x88 
+Error -25


 /a.b/8
     acb
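
The new test lines show error -25 for the truncated subject \xe1\x88 only in the \P\P case. One way an application that matches data in segments might respond to that code is to hold back the incomplete trailing bytes and prepend them to the next segment. The helper below is a hypothetical sketch of that idea and is not part of PCRE; it assumes the rest of the buffer is otherwise valid UTF-8 and only inspects the last three bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of bytes at the end of s that form a truncated
UTF-8 character, or 0 if the buffer ends on a character boundary. */
static int sketch_incomplete_tail(const unsigned char *s, int len)
{
  int back;
  assert(s != NULL && len >= 0);
  for (back = 1; back <= 3 && back <= len; back++)
    {
    unsigned char c = s[len - back];
    if (c < 0x80) return 0;             /* ASCII byte: nothing pending */
    if (c >= 0xc0)                      /* Found a leading byte */
      {
      int ab = (c >= 0xf0)? 3 : (c >= 0xe0)? 2 : 1;
      return (back - 1 < ab)? back : 0; /* Fewer followers than needed? */
      }
    /* Otherwise a continuation byte: keep looking backwards */
    }
  return 0;
}

/* Small self-check that doubles as a usage example. */
static void sketch_tail_self_check(void)
{
  /* "ab" followed by the first two bytes of a three-byte character */
  assert(sketch_incomplete_tail((const unsigned char *)"ab\xe1\x88", 4) == 2);
  /* A complete two-byte character leaves nothing pending */
  assert(sketch_incomplete_tail((const unsigned char *)"\xc3\xa9", 2) == 0);
}
```

A caller could then match on the first len minus tail bytes and carry the tail over, though whether that is preferable to simply retrying with a longer buffer depends on the application.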