Revision: 569
http://vcs.pcre.org/viewvc?view=rev&revision=569
Author: ph10
Date: 2010-11-07 16:14:50 +0000 (Sun, 07 Nov 2010)
Log Message:
-----------
Add PCRE_ERROR_SHORTUTF8 to PCRE_PARTIAL_HARD processing.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcreapi.3
code/trunk/doc/pcrepartial.3
code/trunk/pcre.h.in
code/trunk/pcre_dfa_exec.c
code/trunk/pcre_exec.c
code/trunk/pcre_valid_utf8.c
code/trunk/testdata/testinput5
code/trunk/testdata/testinput8
code/trunk/testdata/testoutput5
code/trunk/testdata/testoutput8
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/ChangeLog 2010-11-07 16:14:50 UTC (rev 569)
@@ -92,6 +92,9 @@
15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the
starting offset points to the beginning of a UTF-8 character was
unnecessarily clumsy. I tidied it up.
+
+16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
+ bad UTF-8 sequence and one that is incomplete.
Version 8.10 25-Jun-2010
Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/doc/pcreapi.3 2010-11-07 16:14:50 UTC (rev 569)
@@ -435,13 +435,17 @@
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns
NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual
error message. This is a static string that is part of the library. You must
-not try to free it. The byte offset from the start of the pattern to the
-character that was being processed when the error was discovered is placed in
-the variable pointed to by \fIerroffset\fP, which must not be NULL. If it is,
-an immediate error is given. Some errors are not detected until checks are
-carried out when the whole pattern has been scanned; in this case the offset is
-set to the end of the pattern.
+not try to free it. The offset from the start of the pattern to the byte that
+was being processed when the error was discovered is placed in the variable
+pointed to by \fIerroffset\fP, which must not be NULL. If it is, an immediate
+error is given. Some errors are not detected until checks are carried out when
+the whole pattern has been scanned; in this case the offset is set to the end
+of the pattern.
.P
+Note that the offset is in bytes, not characters, even in UTF-8 mode. It may
+point into the middle of a UTF-8 character (for example, when
+PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
+.P
If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the
\fIerrorcodeptr\fP argument is not NULL, a non-zero error code number is
returned via this argument in the event of an error. This is in addition to the
@@ -1515,9 +1519,11 @@
\fBpcre\fP
.\"
page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns
-the error PCRE_ERROR_BADUTF8. If \fIstartoffset\fP contains a value that does
-not point to the start of a UTF-8 character (or to the end of the subject),
-PCRE_ERROR_BADUTF8_OFFSET is returned.
+the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is
+a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If
+\fIstartoffset\fP contains a value that does not point to the start of a UTF-8
+character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
+returned.
.P
If you already know that your subject is valid, and you want to skip these
checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when
@@ -1756,11 +1762,14 @@
PCRE_ERROR_BADUTF8 (-10)
.sp
A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8
+character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead.
.sp
PCRE_ERROR_BADUTF8_OFFSET (-11)
.sp
The UTF-8 byte sequence that was passed as a subject was valid, but the value
-of \fIstartoffset\fP did not point to the beginning of a UTF-8 character.
+of \fIstartoffset\fP did not point to the beginning of a UTF-8 character or the
+end of the subject.
.sp
PCRE_ERROR_PARTIAL (-12)
.sp
@@ -1800,6 +1809,12 @@
.sp
The value of \fIstartoffset\fP was negative or greater than the length of the
subject, that is, the value in \fIlength\fP.
+.sp
+ PCRE_ERROR_SHORTUTF8 (-25)
+.sp
+The subject string ended with an incomplete (truncated) UTF-8 character, and
+the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8
+is returned in this situation.
.P
Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP.
.
Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/doc/pcrepartial.3 2010-11-07 16:14:50 UTC (rev 569)
@@ -113,6 +113,12 @@
assumption is made that the end of the supplied subject string may not be the
true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+.P
+Setting PCRE_PARTIAL_HARD also affects the way \fBpcre_exec()\fP checks UTF-8
+subject strings for validity. Normally, an invalid UTF-8 sequence causes the
+error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8
+character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when
+PCRE_PARTIAL_HARD is set.
.
.
.SS "Comparing hard and soft partial matching"
@@ -406,6 +412,6 @@
.rs
.sp
.nf
-Last updated: 22 October 2010
+Last updated: 07 November 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre.h.in 2010-11-07 16:14:50 UTC (rev 569)
@@ -162,6 +162,7 @@
#define PCRE_ERROR_NULLWSLIMIT (-22) /* No longer actually used */
#define PCRE_ERROR_BADNEWLINE (-23)
#define PCRE_ERROR_BADOFFSET (-24)
+#define PCRE_ERROR_SHORTUTF8 (-25)
/* Request types for pcre_fullinfo() */
Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre_dfa_exec.c 2010-11-07 16:14:50 UTC (rev 569)
@@ -2963,11 +2963,13 @@
#ifdef SUPPORT_UTF8
if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
{
- if (_pcre_valid_utf8((uschar *)subject, length) >= 0)
- return PCRE_ERROR_BADUTF8;
+ int tb;
+ if ((tb = _pcre_valid_utf8((uschar *)subject, length)) >= 0)
+ return (tb == length && (options & PCRE_PARTIAL_HARD) != 0)?
+ PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
if (start_offset > 0 && start_offset < length)
{
- int tb = ((USPTR)subject)[start_offset] & 0xc0;
+ tb = ((USPTR)subject)[start_offset] & 0xc0;
if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
}
}
Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre_exec.c 2010-11-07 16:14:50 UTC (rev 569)
@@ -5801,11 +5801,13 @@
#ifdef SUPPORT_UTF8
if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
{
- if (_pcre_valid_utf8((USPTR)subject, length) >= 0)
- return PCRE_ERROR_BADUTF8;
+ int tb;
+ if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
+ return (tb == length && md->partial > 1)?
+ PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
if (start_offset > 0 && start_offset < length)
{
- int tb = ((USPTR)subject)[start_offset] & 0xc0;
+ tb = ((USPTR)subject)[start_offset] & 0xc0;
if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
}
}
Modified: code/trunk/pcre_valid_utf8.c
===================================================================
--- code/trunk/pcre_valid_utf8.c 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/pcre_valid_utf8.c 2010-11-07 16:14:50 UTC (rev 569)
@@ -72,6 +72,20 @@
Returns: < 0 if the string is a valid UTF-8 string
>= 0 otherwise; the value is the offset of the bad byte
+
+Bad bytes can be:
+
+ . An isolated byte whose most significant bits are 0x80, because this
+ can only correctly appear within a UTF-8 character;
+
+ . A byte whose most significant bits are 0xc0, but whose other bits indicate
+ that there are more than 3 additional bytes (i.e. an RFC 2279 starting
+ byte, which is no longer valid under RFC 3629);
+
+ .
+
+The returned offset may also be equal to the length of the string; this means
+that one or more bytes is missing from the final UTF-8 character.
*/
int
@@ -93,7 +107,8 @@
if (c < 128) continue;
if (c < 0xc0) return p - string;
ab = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */
- if (length < ab || ab > 3) return p - string;
+ if (ab > 3) return p - string; /* Too many for RFC 3629 */
+ if (length < ab) return p + 1 + length - string; /* Missing bytes */
length -= ab;
/* Check top bits in the second byte */
Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testinput5 2010-11-07 16:14:50 UTC (rev 569)
@@ -205,6 +205,9 @@
\xC3
\xC3\xC3\xC3
\xC3\xC3\xC3\?
+ \xe1\x88
+ \P\xe1\x88
+ \P\P\xe1\x88
/anything/8
\xc0\x80
Modified: code/trunk/testdata/testinput8
===================================================================
--- code/trunk/testdata/testinput8 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testinput8 2010-11-07 16:14:50 UTC (rev 569)
@@ -63,6 +63,9 @@
\xC3
\xC3\xC3\xC3
\xC3\xC3\xC3\?
+ \xe1\x88
+ \P\xe1\x88
+ \P\P\xe1\x88
/a.b/8
acb
Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testoutput5 2010-11-07 16:14:50 UTC (rev 569)
@@ -797,7 +797,7 @@
Failed: invalid UTF-8 string at offset 2
/\xC3/8
-Failed: invalid UTF-8 string at offset 0
+Failed: invalid UTF-8 string at offset 1
/\xC3\xC3\xC3xxx/8
Failed: invalid UTF-8 string at offset 1
@@ -823,6 +823,12 @@
Error -10
\xC3\xC3\xC3\?
No match
+ \xe1\x88
+Error -10
+ \P\xe1\x88
+Error -10
+ \P\P\xe1\x88
+Error -25
/anything/8
\xc0\x80
Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8 2010-11-06 17:36:26 UTC (rev 568)
+++ code/trunk/testdata/testoutput8 2010-11-07 16:14:50 UTC (rev 569)
@@ -105,6 +105,12 @@
Error -10
\xC3\xC3\xC3\?
No match
+ \xe1\x88
+Error -10
+ \P\xe1\x88
+Error -10
+ \P\P\xe1\x88
+Error -25
/a.b/8
acb