[Pcre-svn] [598] code/trunk: Pass back detailed info when UT…

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [598] code/trunk: Pass back detailed info when UTF-8 check fails at runtime.
Revision: 598
          http://vcs.pcre.org/viewvc?view=rev&revision=598
Author:   ph10
Date:     2011-05-07 16:37:31 +0100 (Sat, 07 May 2011)


Log Message:
-----------
Pass back detailed info when UTF-8 check fails at runtime.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre.3
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcretest.1
    code/trunk/pcre.h.in
    code/trunk/pcre_compile.c
    code/trunk/pcre_dfa_exec.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_internal.h
    code/trunk/pcre_valid_utf8.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput5
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput5
    code/trunk/testdata/testoutput7
    code/trunk/testdata/testoutput8


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/ChangeLog    2011-05-07 15:37:31 UTC (rev 598)
@@ -20,6 +20,22 @@
     code. (b) A reference to 2 copies of a 3-byte code would not match 2 of a
     2-byte code at the end of the subject (it thought there wasn't enough data
     left).
+    
+5.  Comprehensive information about what went wrong is now returned by 
+    pcre_exec() and pcre_dfa_exec() when the UTF-8 string check fails, as long 
+    as the output vector has at least 2 elements. The offset of the start of 
+    the failing character and a reason code are placed in the vector.
+    
+6.  When the UTF-8 string check fails for pcre_compile(), the offset that is 
+    now returned is for the first byte of the failing character, instead of the 
+    last byte inspected. This is an incompatible change, but I hope it is small 
+    enough not to be a problem. It makes the returned offset consistent with
+    pcre_exec() and pcre_dfa_exec().
+    
+7.  pcretest now gives a text phrase as well as the error number when
+    pcre_exec() or pcre_dfa_exec() fails; if the error is a UTF-8 check
+    failure, the offset and reason code are output.
+    



Version 8.12 15-Jan-2011
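
As an illustration of ChangeLog item 5, a caller could report the extra information along the following lines. This is only a minimal sketch: the two constants are local stand-ins carrying the values that pcre.h gives PCRE_ERROR_BADUTF8 (-10) and PCRE_ERROR_SHORTUTF8 (-25), and report_utf8_failure() is a hypothetical helper, not part of PCRE.

```c
#include <stdio.h>
#include <stddef.h>

/* Local stand-ins for the values defined in pcre.h. */
enum { ERR_BADUTF8 = -10, ERR_SHORTUTF8 = -25 };

/* Hypothetical helper: formats what a match call passed back.
     rc        return value of pcre_exec() or pcre_dfa_exec()
     ovector   the output vector that was given to that call
     ovecsize  the size of that vector, in ints
   Returns 1 if buf now holds a UTF-8 failure report, 0 otherwise. */
static int report_utf8_failure(int rc, const int *ovector, int ovecsize,
                               char *buf, size_t buflen)
{
  if (rc != ERR_BADUTF8 && rc != ERR_SHORTUTF8) return 0;
  if (ovecsize >= 2)
    snprintf(buf, buflen, "invalid UTF-8 at byte offset %d, reason code %d",
             ovector[0], ovector[1]);
  else
    snprintf(buf, buflen, "invalid UTF-8 (output vector too small for details)");
  return 1;
}
```

The check on ovecsize matters because the offset and reason code are only written when the output vector has at least 2 elements.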

Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/doc/pcre.3    2011-05-07 15:37:31 UTC (rev 598)
@@ -14,7 +14,7 @@
 The current implementation of PCRE corresponds approximately with Perl 5.12,
 including support for UTF-8 encoded strings and Unicode general category
 properties. However, UTF-8 and Unicode support has to be explicitly enabled; it
-is not the default. The Unicode tables correspond to Unicode release 5.2.0.
+is not the default. The Unicode tables correspond to Unicode release 6.0.0.
 .P
 In addition to the Perl-compatible matching function, PCRE contains an
 alternative function that matches the same compiled patterns in a different
@@ -208,14 +208,18 @@
 the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up
 UTF-8.)
 .P
-If an invalid UTF-8 string is passed to PCRE, an error return
-(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know that
-your strings are valid, and therefore want to skip these checks in order to
-improve performance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or
-at run time, PCRE assumes that the pattern or subject it is given
-(respectively) contains only valid UTF-8 codes. In this case, it does not
-diagnose an invalid UTF-8 string.
+If an invalid UTF-8 string is passed to PCRE, an error return is given. At 
+compile time, the only additional information is the offset to the first byte 
+of the failing character. The runtime functions (\fBpcre_exec()\fP and
+\fBpcre_dfa_exec()\fP) pass back this information as well as a more detailed
+reason code if the caller has provided memory in which to do this.
 .P
+In some situations, you may already know that your strings are valid, and
+therefore want to skip these checks in order to improve performance. If you set
+the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that
+the pattern or subject it is given (respectively) contains only valid UTF-8
+codes. In this case, it does not diagnose an invalid UTF-8 string.
+.P
 If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what
 happens depends on why the string is invalid. If the string conforms to the
 "old" definition of UTF-8 (RFC 2279), it is processed as a string of characters
@@ -304,6 +308,6 @@
 .rs
 .sp
 .nf
-Last updated: 13 November 2010
-Copyright (c) 1997-2010 University of Cambridge.
+Last updated: 07 May 2011
+Copyright (c) 1997-2011 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/doc/pcreapi.3    2011-05-07 15:37:31 UTC (rev 598)
@@ -436,16 +436,16 @@
 Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns
 NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual
 error message. This is a static string that is part of the library. You must
-not try to free it. The offset from the start of the pattern to the byte that
-was being processed when the error was discovered is placed in the variable
-pointed to by \fIerroffset\fP, which must not be NULL. If it is, an immediate
-error is given. Some errors are not detected until checks are carried out when
-the whole pattern has been scanned; in this case the offset is set to the end
-of the pattern.
+not try to free it. Normally, the offset from the start of the pattern to the
+byte that was being processed when the error was discovered is placed in the
+variable pointed to by \fIerroffset\fP, which must not be NULL (if it is, an
+immediate error is given). However, for an invalid UTF-8 string, the offset is
+that of the first byte of the failing character. Also, some errors are not
+detected until checks are carried out when the whole pattern has been scanned;
+in these cases the offset passed back is the length of the pattern.
 .P
 Note that the offset is in bytes, not characters, even in UTF-8 mode. It may
-point into the middle of a UTF-8 character (for example, when
-PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
+sometimes point into the middle of a UTF-8 character.
 .P
 If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the
 \fIerrorcodeptr\fP argument is not NULL, a non-zero error code number is
@@ -552,9 +552,9 @@
 ignored. This is equivalent to Perl's /x option, and it can be changed within a
 pattern by a (?x) option setting.
 .P
-Which characters are interpreted as newlines
-is controlled by the options passed to \fBpcre_compile()\fP or by a special
-sequence at the start of the pattern, as described in the section entitled
+Which characters are interpreted as newlines is controlled by the options
+passed to \fBpcre_compile()\fP or by a special sequence at the start of the
+pattern, as described in the section entitled
 .\" HTML <a href="pcrepattern.html#newlines">
 .\" </a>
 "Newline conventions"
@@ -946,6 +946,7 @@
 below in the section on matching a pattern.
 .
 .
+.\" HTML <a name="infoaboutpattern"></a>
 .SH "INFORMATION ABOUT A PATTERN"
 .rs
 .sp
@@ -1547,9 +1548,16 @@
 .\"
 page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns
 the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is
-a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If
-\fIstartoffset\fP contains a value that does not point to the start of a UTF-8
-character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
+a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. In 
+both cases, information about the precise nature of the error may also be
+returned (see the descriptions of these errors in the section entitled \fIError
+return values from\fP \fBpcre_exec()\fP
+.\" HTML <a href="#errorlist">
+.\" </a>
+below).
+.\"
+If \fIstartoffset\fP contains a value that does not point to the start of a
+UTF-8 character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
 returned.
 .P
 If you already know that your subject is valid, and you want to skip these
@@ -1717,6 +1725,7 @@
 Some convenience functions are provided for extracting the captured substrings
 as separate strings. These are described below.
 .
+.
 .\" HTML <a name="errorlist"></a>
 .SS "Error return values from \fBpcre_exec()\fP"
 .rs
@@ -1786,14 +1795,24 @@
 .sp
   PCRE_ERROR_BADUTF8        (-10)
 .sp
-A string that contains an invalid UTF-8 byte sequence was passed as a subject.
-However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8
-character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead.
+A string that contains an invalid UTF-8 byte sequence was passed as a subject,
+and the PCRE_NO_UTF8_CHECK option was not set. If the size of the output vector
+(\fIovecsize\fP) is at least 2, the byte offset to the start of the invalid
+UTF-8 character is placed in the first element, and a reason code is placed in
+the second element. The reason codes are listed in the
+.\" HTML <a href="#badutf8reasons">
+.\" </a>
+following section.
+.\"
+For backward compatibility, if PCRE_PARTIAL_HARD is set and the problem is a
+truncated UTF-8 character at the end of the subject (reason codes 1 to 5),
+PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
 .sp
   PCRE_ERROR_BADUTF8_OFFSET (-11)
 .sp
-The UTF-8 byte sequence that was passed as a subject was valid, but the value
-of \fIstartoffset\fP did not point to the beginning of a UTF-8 character or the
+The UTF-8 byte sequence that was passed as a subject was checked and found to 
+be valid (the PCRE_NO_UTF8_CHECK option was not set), but the value of
+\fIstartoffset\fP did not point to the beginning of a UTF-8 character or the
 end of the subject.
 .sp
   PCRE_ERROR_PARTIAL        (-12)
@@ -1837,13 +1856,90 @@
 .sp
   PCRE_ERROR_SHORTUTF8      (-25)
 .sp
-The subject string ended with an incomplete (truncated) UTF-8 character, and
-the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8
-is returned in this situation.
+This error is returned instead of PCRE_ERROR_BADUTF8 when the subject string
+ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD option is set.
+Information about the failure is returned as for PCRE_ERROR_BADUTF8. That
+information is in fact sufficient to detect this case, but this special error code for
+PCRE_PARTIAL_HARD precedes the implementation of returned information; it is
+retained for backwards compatibility.
 .P
 Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP.
 .
 .
+.\" HTML <a name="badutf8reasons"></a>
+.SS "Reason codes for invalid UTF-8 strings"
+.rs
+.sp
+When \fBpcre_exec()\fP returns either PCRE_ERROR_BADUTF8 or 
+PCRE_ERROR_SHORTUTF8, and the size of the output vector (\fIovecsize\fP) is at 
+least 2, the offset of the start of the invalid UTF-8 character is placed in 
+the first output vector element (\fIovector[0]\fP) and a reason code is placed 
+in the second element (\fIovector[1]\fP). The reason codes are given names in
+the \fBpcre.h\fP header file:
+.sp
+  PCRE_UTF8_ERR1
+  PCRE_UTF8_ERR2
+  PCRE_UTF8_ERR3
+  PCRE_UTF8_ERR4
+  PCRE_UTF8_ERR5
+.sp
+The string ends with a truncated UTF-8 character; the code specifies how many 
+bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
+no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
+allows for up to 6 bytes, and this is checked first; hence the possibility of 
+4 or 5 missing bytes.
+.sp
+  PCRE_UTF8_ERR6
+  PCRE_UTF8_ERR7
+  PCRE_UTF8_ERR8
+  PCRE_UTF8_ERR9
+  PCRE_UTF8_ERR10
+.sp
+The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the 
+character do not have the binary value 0b10 (that is, either the most
+significant bit is 0, or the next bit is 1).
+.sp 
+  PCRE_UTF8_ERR11
+  PCRE_UTF8_ERR12
+.sp
+A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long; 
+these code points are excluded by RFC 3629.  
+.sp 
+  PCRE_UTF8_ERR13
+.sp
+A 4-byte character has a value greater than 0x10ffff; these code points are 
+excluded by RFC 3629.
+.sp 
+  PCRE_UTF8_ERR14
+.sp
+A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
+code points is reserved by RFC 3629 for use with UTF-16, and so is excluded 
+from UTF-8.
+.sp 
+  PCRE_UTF8_ERR15
+  PCRE_UTF8_ERR16
+  PCRE_UTF8_ERR17
+  PCRE_UTF8_ERR18
+  PCRE_UTF8_ERR19
+.sp
+A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a 
+value that can be represented by fewer bytes, which is invalid. For example, 
+the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
+one byte.
+.sp
+  PCRE_UTF8_ERR20
+.sp
+The two most significant bits of the first byte of a character have the binary 
+value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a 
+byte can only validly occur as the second or subsequent byte of a multi-byte
+character.
+.sp
+  PCRE_UTF8_ERR21
+.sp
+The first byte of a character has the value 0xfe or 0xff. These values can
+never occur in a valid UTF-8 string.
+.
+.
 .SH "EXTRACTING CAPTURED SUBSTRINGS BY NUMBER"
 .rs
 .sp
@@ -2039,7 +2135,11 @@
 has run, they point to the first and last entries in the name-to-number table
 for the given name. The function itself returns the length of each entry, or
 PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is
-described above in the section entitled \fIInformation about a pattern\fP.
+described in the section entitled \fIInformation about a pattern\fP
+.\" HTML <a href="#infoaboutpattern">
+.\" </a>
+above.
+.\"
 Given all the relevant entries for the name, you can extract each of their
 numbers, and hence the captured data, if any.
 .
@@ -2169,6 +2269,7 @@
 .\"
 documentation.
 .
+.
 .SS "Successful returns from \fBpcre_dfa_exec()\fP"
 .rs
 .sp
@@ -2202,6 +2303,7 @@
 \fIovector\fP, the yield of the function is zero, and the vector is filled with
 the longest matches.
 .
+.
 .SS "Error returns from \fBpcre_dfa_exec()\fP"
 .rs
 .sp
@@ -2267,6 +2369,6 @@
 .rs
 .sp
 .nf
-Last updated: 21 November 2010
-Copyright (c) 1997-2010 University of Cambridge.
+Last updated: 07 May 2011
+Copyright (c) 1997-2011 University of Cambridge.
 .fi
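
To make the "overlong" example in the new reason-code section concrete, the arithmetic behind the 0xc0, 0xae case can be sketched as below. The helper names decode2() and is_overlong2() are hypothetical and exist only for this illustration; the mask in is_overlong2() is the same one this commit applies to the first byte of a 2-byte character.

```c
/* Decode a 2-byte UTF-8 sequence: 110xxxxx 10yyyyyy gives xxxxxyyyyyy. */
static unsigned decode2(unsigned char b1, unsigned char b2)
{
  return ((unsigned)(b1 & 0x1f) << 6) | (unsigned)(b2 & 0x3f);
}

/* A 2-byte sequence is overlong when its value would fit in 7 bits, which
   is the case when the top four payload bits of the first byte are zero
   (first byte 0xc0 or 0xc1). */
static int is_overlong2(unsigned char b1)
{
  return (b1 & 0x3e) == 0;
}
```

With these definitions, decode2(0xc0, 0xae) yields 0x2e, and is_overlong2(0xc0) is true, so the sequence is rejected even though it decodes to an ordinary ASCII value.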


Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/doc/pcretest.1    2011-05-07 15:37:31 UTC (rev 598)
@@ -507,18 +507,22 @@
 This section describes the output when the normal matching function,
 \fBpcre_exec()\fP, is being used.
 .P
-When a match succeeds, pcretest outputs the list of captured substrings that
-\fBpcre_exec()\fP returns, starting with number 0 for the string that matched
-the whole pattern. Otherwise, it outputs "No match" when the return is
+When a match succeeds, \fBpcretest\fP outputs the list of captured substrings
+that \fBpcre_exec()\fP returns, starting with number 0 for the string that
+matched the whole pattern. Otherwise, it outputs "No match" when the return is
 PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
 substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. (Note that this is
 the entire substring that was inspected during the partial match; it may
 include characters before the actual match start if a lookbehind assertion,
-\eK, \eb, or \eB was involved.) For any other returns, it outputs the PCRE
-negative error number. Here is an example of an interactive \fBpcretest\fP run.
+\eK, \eb, or \eB was involved.) For any other return, \fBpcretest\fP outputs
+the PCRE negative error number and a short descriptive phrase. If the error is
+a failed UTF-8 string check, the byte offset of the start of the failing
+character and the reason code are also output, provided that the size of the 
+output vector is at least two. Here is an example of an interactive
+\fBpcretest\fP run.
 .sp
   $ pcretest
-  PCRE version 7.0 30-Nov-2006
+  PCRE version 8.13 2011-04-30
 .sp
     re> /^abc(\ed+)/
   data> abc123
@@ -527,11 +531,11 @@
   data> xyz
   No match
 .sp
-Note that unset capturing substrings that are not followed by one that is set
-are not returned by \fBpcre_exec()\fP, and are not shown by \fBpcretest\fP. In
-the following example, there are two capturing substrings, but when the first
-data line is matched, the second, unset substring is not shown. An "internal"
-unset substring is shown as "<unset>", as for the second data line.
+Unset capturing substrings that are not followed by one that is set are not
+returned by \fBpcre_exec()\fP, and are not shown by \fBpcretest\fP. In the
+following example, there are two capturing substrings, but when the first data
+line is matched, the second, unset substring is not shown. An "internal" unset
+substring is shown as "<unset>", as for the second data line.
 .sp
     re> /(a)|(b)/
   data> a
@@ -565,7 +569,13 @@
    0: ipp
    1: pp
 .sp
-"No match" is output only if the first match attempt fails.
+"No match" is output only if the first match attempt fails. Here is an example 
+of a failure message (the offset 4 that is specified by \e>4 is past the end of 
+the subject string):
+.sp
+    re> /xyz/
+  data> xyz\>4
+  Error -24 (bad offset value)      
 .P
 If any of the sequences \fB\eC\fP, \fB\eG\fP, or \fB\eL\fP are present in a
 data line that is successfully matched, the substrings extracted by the
@@ -779,6 +789,6 @@
 .rs
 .sp
 .nf
-Last updated: 21 November 2010
-Copyright (c) 1997-2010 University of Cambridge.
+Last updated: 06 May 2011
+Copyright (c) 1997-2011 University of Cambridge.
 .fi


Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcre.h.in    2011-05-07 15:37:31 UTC (rev 598)
@@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, to be #included by
 applications that call the PCRE functions.


-           Copyright (c) 1997-2010 University of Cambridge
+           Copyright (c) 1997-2011 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -164,6 +164,31 @@
 #define PCRE_ERROR_BADOFFSET      (-24)
 #define PCRE_ERROR_SHORTUTF8      (-25)


+/* Specific error codes for UTF-8 validity checks */
+
+#define PCRE_UTF8_ERR0               0
+#define PCRE_UTF8_ERR1               1
+#define PCRE_UTF8_ERR2               2
+#define PCRE_UTF8_ERR3               3
+#define PCRE_UTF8_ERR4               4
+#define PCRE_UTF8_ERR5               5
+#define PCRE_UTF8_ERR6               6
+#define PCRE_UTF8_ERR7               7
+#define PCRE_UTF8_ERR8               8
+#define PCRE_UTF8_ERR9               9
+#define PCRE_UTF8_ERR10             10
+#define PCRE_UTF8_ERR11             11
+#define PCRE_UTF8_ERR12             12
+#define PCRE_UTF8_ERR13             13
+#define PCRE_UTF8_ERR14             14
+#define PCRE_UTF8_ERR15             15
+#define PCRE_UTF8_ERR16             16
+#define PCRE_UTF8_ERR17             17
+#define PCRE_UTF8_ERR18             18
+#define PCRE_UTF8_ERR19             19
+#define PCRE_UTF8_ERR20             20
+#define PCRE_UTF8_ERR21             21
+
 /* Request types for pcre_fullinfo() */


 #define PCRE_INFO_OPTIONS            0
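
An application that wants to print something friendlier than the bare reason code could map the numbers 0 to 21 to text. The following is a sketch under assumptions: utf8_reason() and its table are hypothetical, not part of PCRE, and the wording is paraphrased from the descriptions in this commit.

```c
#include <stddef.h>

/* Index = reason code (the numeric value of PCRE_UTF8_ERRn). */
static const char *const utf8_reason_text[] = {
  "no error",
  "missing 1 byte at the end of the string",
  "missing 2 bytes at the end of the string",
  "missing 3 bytes at the end of the string",
  "missing 4 bytes at the end of the string",
  "missing 5 bytes at the end of the string",
  "top two bits of 2nd byte are not binary 10",
  "top two bits of 3rd byte are not binary 10",
  "top two bits of 4th byte are not binary 10",
  "top two bits of 5th byte are not binary 10",
  "top two bits of 6th byte are not binary 10",
  "5-byte character is not permitted by RFC 3629",
  "6-byte character is not permitted by RFC 3629",
  "4-byte character with value > 0x10ffff",
  "3-byte character with value 0xd800-0xdfff",
  "overlong 2-byte sequence",
  "overlong 3-byte sequence",
  "overlong 4-byte sequence",
  "overlong 5-byte sequence",
  "overlong 6-byte sequence",
  "isolated continuation byte (top bits binary 10)",
  "byte with the illegal value 0xfe or 0xff"
};

/* Hypothetical helper: text for a reason code, or a fallback string. */
static const char *utf8_reason(int code)
{
  size_t n = sizeof utf8_reason_text / sizeof utf8_reason_text[0];
  if (code < 0 || (size_t)code >= n) return "unknown reason code";
  return utf8_reason_text[code];
}
```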


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcre_compile.c    2011-05-07 15:37:31 UTC (rev 598)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2010 University of Cambridge
+           Copyright (c) 1997-2011 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -6921,11 +6921,15 @@

utf8 = (options & PCRE_UTF8) != 0;

-/* Can't support UTF8 unless PCRE has been compiled to include the code. */
+/* Can't support UTF8 unless PCRE has been compiled to include the code. The
+return of an error code from _pcre_valid_utf8() is a new feature, introduced in
+release 8.13. The only use we make of it here is to adjust the offset value to
+the end of the string for a short string error, for compatibility with previous
+versions. */

 #ifdef SUPPORT_UTF8
 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
-     (*erroroffset = _pcre_valid_utf8((USPTR)pattern, -1)) >= 0)
+     (*erroroffset = _pcre_valid_utf8((USPTR)pattern, -1, &errorcode)) >= 0)
   {
   errorcode = ERR44;
   goto PCRE_EARLY_ERROR_RETURN2;


Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcre_dfa_exec.c    2011-05-07 15:37:31 UTC (rev 598)
@@ -7,7 +7,7 @@
 below for why this module is different).


                        Written by Philip Hazel
-           Copyright (c) 1997-2010 University of Cambridge
+           Copyright (c) 1997-2011 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -2963,10 +2963,18 @@
 #ifdef SUPPORT_UTF8
 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
   {
-  int tb;
-  if ((tb = _pcre_valid_utf8((uschar *)subject, length)) >= 0)
-    return (tb == length && (options & PCRE_PARTIAL_HARD) != 0)?
+  int errorcode; 
+  int tb = _pcre_valid_utf8((uschar *)subject, length, &errorcode);
+  if (tb >= 0)
+    {
+    if (offsetcount >= 2)
+      {
+      offsets[0] = tb;
+      offsets[1] = errorcode;
+      }    
+    return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
       PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
+    }  
   if (start_offset > 0 && start_offset < length)
     {
     tb = ((USPTR)subject)[start_offset] & 0xc0;


Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcre_exec.c    2011-05-07 15:37:31 UTC (rev 598)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2010 University of Cambridge
+           Copyright (c) 1997-2011 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -5812,16 +5812,24 @@
if (md->partial && (re->flags & PCRE_NOPARTIAL) != 0)
return PCRE_ERROR_BADPARTIAL;

-/* Check a UTF-8 string if required. Unfortunately there's no way of passing
-back the character offset. */
+/* Check a UTF-8 string if required. Pass back the character offset and error
+code if a results vector is available. */

 #ifdef SUPPORT_UTF8
 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
   {
-  int tb;
-  if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
-    return (tb == length && md->partial > 1)?
+  int errorcode; 
+  int tb = _pcre_valid_utf8((USPTR)subject, length, &errorcode);
+  if (tb >= 0)
+    {
+    if (offsetcount >= 2)
+      {
+      offsets[0] = tb;
+      offsets[1] = errorcode;
+      }    
+    return (errorcode <= PCRE_UTF8_ERR5 && md->partial > 1)?
       PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
+    }   
   if (start_offset > 0 && start_offset < length)
     {
     tb = ((USPTR)subject)[start_offset] & 0xc0;
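
The choice between the two return values in the hunk above can be isolated as a small function. This is a sketch only: the constants are local stand-ins for PCRE_ERROR_BADUTF8 (-10), PCRE_ERROR_SHORTUTF8 (-25) and PCRE_UTF8_ERR5 (5), and utf8_failure_return() is a hypothetical name.

```c
/* Local stand-ins for the values defined in pcre.h. */
enum { ERR_BADUTF8 = -10, ERR_SHORTUTF8 = -25, UTF8_ERR5 = 5 };

/* reason        non-zero reason code from the UTF-8 check
   partial_hard  non-zero if PCRE_PARTIAL_HARD was requested
   Reason codes 1 to 5 mean "truncated character at the end of the subject";
   only those are reported as SHORTUTF8, and only for hard partial matching. */
static int utf8_failure_return(int reason, int partial_hard)
{
  return (reason <= UTF8_ERR5 && partial_hard) ? ERR_SHORTUTF8 : ERR_BADUTF8;
}
```

Compared with the old test (tb == length), the decision no longer depends on the returned offset, which now points at the start of the failing character instead of the end of the subject.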


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcre_internal.h    2011-05-07 15:37:31 UTC (rev 598)
@@ -1831,7 +1831,7 @@
 extern int           _pcre_ord2utf8(int, uschar *);
 extern real_pcre    *_pcre_try_flipped(const real_pcre *, real_pcre *,
                        const pcre_study_data *, pcre_study_data *);
-extern int           _pcre_valid_utf8(USPTR, int);
+extern int           _pcre_valid_utf8(USPTR, int, int *);
 extern BOOL          _pcre_was_newline(USPTR, int, USPTR, int *, BOOL);
 extern BOOL          _pcre_xclass(int, const uschar *);
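
For readers who want to experiment with the new three-argument shape without building PCRE, here is a deliberately simplified stand-in. It is not the PCRE implementation: check_utf8_sketch() is a hypothetical name, it covers only reason codes 1 to 10, 20 and 21 (truncation, bad continuation bytes, isolated continuation byte, 0xfe/0xff), and it performs no overlong or range checks.

```c
/* Returns -1 if no error is found, otherwise the byte offset of the start
   of the failing character, with the reason code stored in *reason. */
static int check_utf8_sketch(const unsigned char *s, int length, int *reason)
{
  int i = 0;
  *reason = 0;
  while (i < length)
    {
    int c = s[i];
    int ab, k, remaining;

    if (c < 0x80) { i++; continue; }              /* ASCII */
    if (c < 0xc0) { *reason = 20; return i; }     /* isolated 10xx xxxx */
    if (c >= 0xfe) { *reason = 21; return i; }    /* 0xfe or 0xff */

    /* Number of additional bytes announced by the first byte. */
    ab = (c < 0xe0)? 1 : (c < 0xf0)? 2 : (c < 0xf8)? 3 : (c < 0xfc)? 4 : 5;

    remaining = length - i - 1;
    if (remaining < ab)                           /* codes 1 to 5 */
      { *reason = ab - remaining; return i; }

    for (k = 1; k <= ab; k++)                     /* codes 6 to 10 */
      if ((s[i + k] & 0xc0) != 0x80) { *reason = 5 + k; return i; }

    i += ab + 1;
    }
  return -1;
}
```

In every failing case the returned offset is that of the first byte of the character, which is the convention this commit adopts.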



Modified: code/trunk/pcre_valid_utf8.c
===================================================================
--- code/trunk/pcre_valid_utf8.c    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcre_valid_utf8.c    2011-05-07 15:37:31 UTC (rev 598)
@@ -54,42 +54,56 @@
 *************************************************/


/* This function is called (optionally) at the start of compile or match, to
-validate that a supposed UTF-8 string is actually valid. The early check means
+check that a supposed UTF-8 string is actually valid. The early check means
that subsequent code can assume it is dealing with a valid string. The check
-can be turned off for maximum performance, but the consequences of supplying
-an invalid string are then undefined.
+can be turned off for maximum performance, but the consequences of supplying an
+invalid string are then undefined.

Originally, this function checked according to RFC 2279, allowing for values in
the range 0 to 0x7fffffff, up to 6 bytes long, but ensuring that they were in
the canonical format. Once somebody had pointed out RFC 3629 to me (it
obsoletes 2279), additional restrictions were applied. The values are now
limited to be between 0 and 0x0010ffff, no more than 4 bytes long, and the
-subrange 0xd000 to 0xdfff is excluded.
+subrange 0xd800 to 0xdfff is excluded. However, the format of 5-byte and 6-byte
+characters is still checked.

+From release 8.13, more information about the details of the error is passed
+back in the error code:
+
+PCRE_UTF8_ERR0   No error
+PCRE_UTF8_ERR1   Missing 1 byte at the end of the string
+PCRE_UTF8_ERR2   Missing 2 bytes at the end of the string
+PCRE_UTF8_ERR3   Missing 3 bytes at the end of the string
+PCRE_UTF8_ERR4   Missing 4 bytes at the end of the string
+PCRE_UTF8_ERR5   Missing 5 bytes at the end of the string
+PCRE_UTF8_ERR6   2nd-byte's two top bits are not 0x80
+PCRE_UTF8_ERR7   3rd-byte's two top bits are not 0x80
+PCRE_UTF8_ERR8   4th-byte's two top bits are not 0x80
+PCRE_UTF8_ERR9   5th-byte's two top bits are not 0x80
+PCRE_UTF8_ERR10  6th-byte's two top bits are not 0x80
+PCRE_UTF8_ERR11  5-byte character is not permitted by RFC 3629
+PCRE_UTF8_ERR12  6-byte character is not permitted by RFC 3629
+PCRE_UTF8_ERR13  4-byte character with value > 0x10ffff is not permitted
+PCRE_UTF8_ERR14  3-byte character with value 0xd800-0xdfff is not permitted
+PCRE_UTF8_ERR15  Overlong 2-byte sequence
+PCRE_UTF8_ERR16  Overlong 3-byte sequence
+PCRE_UTF8_ERR17  Overlong 4-byte sequence
+PCRE_UTF8_ERR18  Overlong 5-byte sequence (won't ever occur)
+PCRE_UTF8_ERR19  Overlong 6-byte sequence (won't ever occur)
+PCRE_UTF8_ERR20  Isolated 0x80 byte (not within UTF-8 character)
+PCRE_UTF8_ERR21  Byte with the illegal value 0xfe or 0xff
+
 Arguments:
   string       points to the string
   length       length of string, or -1 if the string is zero-terminated
+  errorcode    pointer to an error code variable


 Returns:       < 0    if the string is a valid UTF-8 string
-               >= 0   otherwise; the value is the offset of the bad byte
-
-Bad bytes can be:
-
-  . An isolated byte whose most significant bits are 0x80, because this
-    can only correctly appear within a UTF-8 character;
-
-  . A byte whose most significant bits are 0xc0, but whose other bits indicate
-    that there are more than 3 additional bytes (i.e. an RFC 2279 starting
-    byte, which is no longer valid under RFC 3629);
-
-  .
-
-The returned offset may also be equal to the length of the string; this means
-that one or more bytes is missing from the final UTF-8 character.
+               >= 0   otherwise; the value is the offset of the bad character
 */


int
-_pcre_valid_utf8(USPTR string, int length)
+_pcre_valid_utf8(USPTR string, int length, int *errorcode)
{
#ifdef SUPPORT_UTF8
register USPTR p;
@@ -100,81 +114,188 @@
length = p - string;
}

+*errorcode = PCRE_UTF8_ERR0;
+
 for (p = string; length-- > 0; p++)
   {
-  register int ab;
-  register int c = *p;
-  if (c < 128) continue;
-  if (c < 0xc0) return p - string;
+  register int ab, c, d;
+  
+  c = *p;
+  if (c < 128) continue;                /* ASCII character */
+   
+  if (c < 0xc0)                         /* Isolated 10xx xxxx byte */
+    {
+    *errorcode = PCRE_UTF8_ERR20; 
+    return p - string;
+    } 
+
+  if (c >= 0xfe)                        /* Invalid 0xfe or 0xff bytes */
+    {
+    *errorcode = PCRE_UTF8_ERR21; 
+    return p - string;
+    } 
+ 
   ab = _pcre_utf8_table4[c & 0x3f];     /* Number of additional bytes */
-  if (ab > 3) return p - string;        /* Too many for RFC 3629 */
-  if (length < ab) return p + 1 + length - string;   /* Missing bytes */
-  length -= ab;
+  if (length < ab) 
+    {
+    *errorcode = ab - length;           /* Codes ERR1 to ERR5 */
+    return p - string;                  /* Missing bytes */
+    } 
+  length -= ab;                         /* Length remaining */


   /* Check top bits in the second byte */
-  if ((*(++p) & 0xc0) != 0x80) return p - string;
+   
+  if (((d = *(++p)) & 0xc0) != 0x80) 
+    {
+    *errorcode = PCRE_UTF8_ERR6; 
+    return p - string - 1;
+    } 


- /* Check for overlong sequences for each different length, and for the
- excluded range 0xd000 to 0xdfff. */
+ /* For each length, check that the remaining bytes start with the 0x80 bit
+ set and not the 0x40 bit. Then check for an overlong sequence, and for the
+ excluded range 0xd800 to 0xdfff. */

   switch (ab)
     {
-    /* Check for xx00 000x (overlong sequence) */
+    /* 2-byte character. No further bytes to check for 0x80. Check first byte
+    for xx00 000x (overlong sequence). */
+     
+    case 1: if ((c & 0x3e) == 0)  
+      {
+      *errorcode = PCRE_UTF8_ERR15;   
+      return p - string - 1;  
+      } 
+    break; 


-    case 1:
-    if ((c & 0x3e) == 0) return p - string;
-    continue;   /* We know there aren't any more bytes to check */
-
-    /* Check for 1110 0000, xx0x xxxx (overlong sequence) or
-                 1110 1101, 1010 xxxx (0xd000 - 0xdfff) */
-
+    /* 3-byte character. Check third byte for 0x80. Then check first 2 bytes 
+      for 1110 0000, xx0x xxxx (overlong sequence) or
+          1110 1101, 1010 xxxx (0xd800 - 0xdfff) */
+           
     case 2:
-    if ((c == 0xe0 && (*p & 0x20) == 0) ||
-        (c == 0xed && *p >= 0xa0))
-      return p - string;
+    if ((*(++p) & 0xc0) != 0x80)     /* Third byte */
+      {
+      *errorcode = PCRE_UTF8_ERR7;
+      return p - string - 2;   
+      } 
+    if (c == 0xe0 && (d & 0x20) == 0)
+      {
+      *errorcode = PCRE_UTF8_ERR16; 
+      return p - string - 2;
+      } 
+    if (c == 0xed && d >= 0xa0)
+      {
+      *errorcode = PCRE_UTF8_ERR14;  
+      return p - string - 2;
+      } 
     break;


-    /* Check for 1111 0000, xx00 xxxx (overlong sequence) or
-       greater than 0x0010ffff (f4 8f bf bf) */
-
+    /* 4-byte character. Check 3rd and 4th bytes for 0x80. Then check first 2
+       bytes for 1111 0000, xx00 xxxx (overlong sequence), then check for a
+       character greater than 0x0010ffff (f4 8f bf bf) */
+        
     case 3:
-    if ((c == 0xf0 && (*p & 0x30) == 0) ||
-        (c > 0xf4 ) ||
-        (c == 0xf4 && *p > 0x8f))
-      return p - string;
+    if ((*(++p) & 0xc0) != 0x80)     /* Third byte */
+      {
+      *errorcode = PCRE_UTF8_ERR7;
+      return p - string - 2;   
+      } 
+    if ((*(++p) & 0xc0) != 0x80)     /* Fourth byte */
+      {
+      *errorcode = PCRE_UTF8_ERR8;
+      return p - string - 3;   
+      } 
+    if (c == 0xf0 && (d & 0x30) == 0)
+      {
+      *errorcode = PCRE_UTF8_ERR17;  
+      return p - string - 3;
+      } 
+    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
+      {
+      *errorcode = PCRE_UTF8_ERR13;  
+      return p - string - 3;
+      }
     break;


-#if 0
-    /* These cases can no longer occur, as we restrict to a maximum of four
-    bytes nowadays. Leave the code here in case we ever want to add an option
-    for longer sequences. */
+    /* 5-byte and 6-byte characters are not allowed by RFC 3629, and will be
+    rejected by the length test below. However, we do the appropriate tests 
+    here so that overlong sequences get diagnosed, and also in case there is
+    ever an option for handling these larger code points. */  


-    /* Check for 1111 1000, xx00 0xxx */
-    case 4:
-    if (c == 0xf8 && (*p & 0x38) == 0) return p - string;
+    /* 5-byte character. Check 3rd, 4th, and 5th bytes for 0x80. Then check for
+    1111 1000, xx00 0xxx */ 
+    
+    case 4: 
+    if ((*(++p) & 0xc0) != 0x80)     /* Third byte */
+      {
+      *errorcode = PCRE_UTF8_ERR7;
+      return p - string - 2;   
+      } 
+    if ((*(++p) & 0xc0) != 0x80)     /* Fourth byte */
+      {
+      *errorcode = PCRE_UTF8_ERR8;
+      return p - string - 3;   
+      } 
+    if ((*(++p) & 0xc0) != 0x80)     /* Fifth byte */
+      {
+      *errorcode = PCRE_UTF8_ERR9;
+      return p - string - 4;   
+      } 
+    if (c == 0xf8 && (d & 0x38) == 0) 
+      {
+      *errorcode = PCRE_UTF8_ERR18; 
+      return p - string - 4;
+      } 
     break;


-    /* Check for leading 0xfe or 0xff, and then for 1111 1100, xx00 00xx */
+    /* 6-byte character. Check 3rd-6th bytes for 0x80. Then check for
+    1111 1100, xx00 00xx. */
+
     case 5:
-    if (c == 0xfe || c == 0xff ||
-       (c == 0xfc && (*p & 0x3c) == 0)) return p - string;
+    if ((*(++p) & 0xc0) != 0x80)     /* Third byte */
+      {
+      *errorcode = PCRE_UTF8_ERR7;
+      return p - string - 2;   
+      } 
+    if ((*(++p) & 0xc0) != 0x80)     /* Fourth byte */
+      {
+      *errorcode = PCRE_UTF8_ERR8;
+      return p - string - 3;   
+      } 
+    if ((*(++p) & 0xc0) != 0x80)     /* Fifth byte */
+      {
+      *errorcode = PCRE_UTF8_ERR9;
+      return p - string - 4;   
+      } 
+    if ((*(++p) & 0xc0) != 0x80)     /* Sixth byte */
+      {
+      *errorcode = PCRE_UTF8_ERR10;
+      return p - string - 5;   
+      } 
+    if (c == 0xfc && (d & 0x3c) == 0) 
+      {
+      *errorcode = PCRE_UTF8_ERR19; 
+      return p - string - 5;
+      } 
     break;
-#endif
-
     }
+    
+  /* Character is valid under RFC 2279, but 5-byte and 6-byte characters are
+  excluded by RFC 3629. The pointer p is currently at the last byte of the
+  character. */


-  /* Check for valid bytes after the 2nd, if any; all must start 10 */
-  while (--ab > 0)
+  if (ab > 3) 
     {
-    if ((*(++p) & 0xc0) != 0x80) return p - string;
-    }
+    *errorcode = (ab == 4)? PCRE_UTF8_ERR11 : PCRE_UTF8_ERR12; 
+    return p - string - ab;
+    } 
   }
-#else
+   
+#else  /* SUPPORT_UTF8 */
 (void)(string);  /* Keep picky compilers happy */
 (void)(length);
 #endif


-return -1;
+return -1; /* This indicates success */
}

/* End of pcre_valid_utf8.c */
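
The idea behind the hunk above (return the offset of the first byte of the failing character, plus a reason code) can be illustrated with a much smaller standalone checker. This is only a sketch, not the PCRE function: the name check_utf8 and the utf8_reason enum are hypothetical, the reason values are not the PCRE_UTF8_ERRnn constants, and only the 1- to 4-byte forms allowed by RFC 3629 are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reason codes for this sketch; they are NOT the
   PCRE_UTF8_ERRnn values from pcre.h. */
enum utf8_reason {
  UTF8_OK = 0,
  UTF8_TRUNCATED,   /* string ends inside a character */
  UTF8_BAD_CONT,    /* a trailing byte is not of the form 10xxxxxx */
  UTF8_OVERLONG,    /* value would fit in a shorter sequence */
  UTF8_SURROGATE,   /* 0xd800 - 0xdfff */
  UTF8_TOO_BIG,     /* greater than 0x10ffff */
  UTF8_BAD_LEAD     /* stray 10xxxxxx byte, or lead byte 0xf8 - 0xff */
};

/* Returns -1 if the string is valid. Otherwise returns the offset of the
   first byte of the failing character and stores a reason in *reason. */
static int check_utf8(const unsigned char *s, size_t len,
                      enum utf8_reason *reason)
{
  size_t i = 0;

  assert(reason != NULL);
  *reason = UTF8_OK;

  while (i < len)
    {
    unsigned char c = s[i];
    unsigned char d;
    size_t ab, k;                       /* ab = number of additional bytes */

    if (c < 0x80) { i++; continue; }    /* ASCII */
    if (c < 0xc0 || c > 0xf7) { *reason = UTF8_BAD_LEAD; return (int)i; }

    ab = (c < 0xe0)? 1 : (c < 0xf0)? 2 : 3;
    if (len - i <= ab) { *reason = UTF8_TRUNCATED; return (int)i; }

    /* Every additional byte must have 0x80 set and 0x40 clear. */
    for (k = 1; k <= ab; k++)
      if ((s[i + k] & 0xc0) != 0x80)
        { *reason = UTF8_BAD_CONT; return (int)i; }

    /* Same bit tests as the diff above, applied to the first two bytes. */
    d = s[i + 1];
    if ((ab == 1 && (c & 0x3e) == 0) ||
        (ab == 2 && c == 0xe0 && (d & 0x20) == 0) ||
        (ab == 3 && c == 0xf0 && (d & 0x30) == 0))
      { *reason = UTF8_OVERLONG; return (int)i; }
    if (c == 0xed && d >= 0xa0) { *reason = UTF8_SURROGATE; return (int)i; }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      { *reason = UTF8_TOO_BIG; return (int)i; }

    i += ab + 1;
    }

  return -1;   /* This indicates success */
}
```

Because the index i always sits on the first byte of the current character, every failure path can return it directly. The real function advances a pointer through the character instead, which is why its return statements subtract 1, 2, 3 and so on.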

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/pcretest.c    2011-05-07 15:37:31 UTC (rev 598)
@@ -197,7 +197,38 @@
 static uschar *dbuffer = NULL;
 static uschar *pbuffer = NULL;


+/* Textual explanations for runtime error codes */

+static const char *errtexts[] = {
+  NULL,  /* 0 is no error */
+  NULL,  /* NOMATCH is handled specially */
+  "NULL argument passed",
+  "bad option value",
+  "magic number missing",
+  "unknown opcode - pattern overwritten?",
+  "no more memory",
+  NULL,  /* never returned by pcre_exec() or pcre_dfa_exec() */       
+  "match limit exceeded",
+  "callout error code",
+  NULL,  /* BADUTF8 is handled specially */
+  "bad UTF-8 offset",
+  NULL,  /* PARTIAL is handled specially */
+  "not used - internal error",
+  "internal error - pattern overwritten?",
+  "bad count value",
+  "item unsupported for DFA matching",
+  "backreference condition or recursion test not supported for DFA matching",
+  "match limit not supported for DFA matching",
+  "workspace size exceeded in DFA matching",
+  "too much recursion for DFA matching",     
+  "recursion limit exceeded",
+  "not used - internal error",
+  "invalid combination of newline options",
+  "bad offset value",
+  NULL  /* SHORTUTF8 is handled specially */
+};
+          
+
 /*************************************************
 *         Alternate character tables             *
 *************************************************/
@@ -2858,15 +2889,31 @@
           }
         else
           {
-          if (count == PCRE_ERROR_NOMATCH)
-            {
+          switch(count)
+            {  
+            case PCRE_ERROR_NOMATCH:
             if (gmatched == 0)
               {
               if (markptr == NULL) fprintf(outfile, "No match\n");
                 else fprintf(outfile, "No match, mark = %s\n", markptr);
               }
+            break;
+            
+            case PCRE_ERROR_BADUTF8:
+            case PCRE_ERROR_SHORTUTF8:
+            fprintf(outfile, "Error %d (%s UTF-8 string)", count,
+              (count == PCRE_ERROR_BADUTF8)? "bad" : "short");
+            if (use_size_offsets >= 2)
+              fprintf(outfile, " offset=%d reason=%d", use_offsets[0], 
+                use_offsets[1]);
+            fprintf(outfile, "\n");         
+            break;     
+              
+            default:
+            fprintf(outfile, "Error %d (%s)\n", count, errtexts[-count]);
+            break;
             }
-          else fprintf(outfile, "Error %d\n", count);
+               
           break;  /* Out of the /g loop */
           }
         }
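
The default case above indexes errtexts[] with the negated return code and passes the result straight to "%s". That is fine as long as every code that reaches it has a non-NULL entry in the table. A caller that cannot guarantee this might range-check first. The sketch below shows one way to do so; the shortened table and the name demo_error_text are hypothetical and are not part of pcretest.c.

```c
#include <assert.h>
#include <stddef.h>

/* Shortened stand-in for the errtexts[] table in the diff above. The index
   is the negated error code. */
static const char *const demo_errtexts[] = {
  NULL,                     /*  0 is no error */
  NULL,                     /* -1 is handled specially */
  "NULL argument passed",   /* -2 */
  "bad option value"        /* -3 */
};

#define DEMO_COUNT ((int)(sizeof demo_errtexts / sizeof demo_errtexts[0]))

/* Hypothetical helper: never indexes outside the table and never returns
   NULL, so the result is always safe to print with "%s". */
static const char *demo_error_text(int code)
{
  const char *text;

  if (code > 0 || code <= -DEMO_COUNT) return "unknown error";

  /* Invariant established by the range check above. */
  assert(-code >= 0 && -code < DEMO_COUNT);

  text = demo_errtexts[-code];
  return (text != NULL)? text : "handled specially";
}
```

With this helper, a call such as fprintf(outfile, "Error %d (%s)\n", count, demo_error_text(count)) stays well defined even if a new error code is added to the library before the table is updated.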


Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/testdata/testinput5    2011-05-07 15:37:31 UTC (rev 598)
@@ -208,6 +208,19 @@
     \xe1\x88 
     \P\xe1\x88 
     \P\P\xe1\x88 
+    XX\xea
+    \O0XX\xea
+    \O1XX\xea
+    \O2XX\xea
+    XX\xf1
+    XX\xf8  
+    XX\xfc
+    ZZ\xea\xaf\x20YY
+    ZZ\xfd\xbf\xbf\x2f\xbf\xbfYY  
+    ZZ\xfd\xbf\xbf\xbf\x2f\xbfYY  
+    ZZ\xfd\xbf\xbf\xbf\xbf\x2fYY  
+    ZZ\xffYY
+    ZZ\xfeYY  


 /anything/8
     \xc0\x80


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/testdata/testoutput2    2011-05-07 15:37:31 UTC (rev 598)
@@ -11251,9 +11251,9 @@
     abc\>3
 No match
     abc\>4
-Error -24
+Error -24 (bad offset value)
     abc\>-4 
-Error -24
+Error -24 (bad offset value)


/^\cģ/
Failed: \c must be followed by an ASCII character at offset 3

Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/testdata/testoutput5    2011-05-07 15:37:31 UTC (rev 598)
@@ -794,13 +794,13 @@
  0: \x{d6}


/[\xC3]/8
-Failed: invalid UTF-8 string at offset 2
+Failed: invalid UTF-8 string at offset 1

/\xC3/8
-Failed: invalid UTF-8 string at offset 1
+Failed: invalid UTF-8 string at offset 0

/\xC3\xC3\xC3xxx/8
-Failed: invalid UTF-8 string at offset 1
+Failed: invalid UTF-8 string at offset 0

/\xC3\xC3\xC3xxx/8?DZ
------------------------------------------------------------------
@@ -816,37 +816,63 @@

 /abc/8
     \xC3]
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=6
     \xC3
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=1
     \xC3\xC3\xC3
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=6
     \xC3\xC3\xC3\?
 No match
     \xe1\x88 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=1
     \P\xe1\x88 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=1
     \P\P\xe1\x88 
-Error -25
+Error -25 (short UTF-8 string) offset=0 reason=1
+    XX\xea
+Error -10 (bad UTF-8 string) offset=2 reason=2
+    \O0XX\xea
+Error -10 (bad UTF-8 string)
+    \O1XX\xea
+Error -10 (bad UTF-8 string)
+    \O2XX\xea
+Error -10 (bad UTF-8 string) offset=2 reason=2
+    XX\xf1
+Error -10 (bad UTF-8 string) offset=2 reason=3
+    XX\xf8  
+Error -10 (bad UTF-8 string) offset=2 reason=4
+    XX\xfc
+Error -10 (bad UTF-8 string) offset=2 reason=5
+    ZZ\xea\xaf\x20YY
+Error -10 (bad UTF-8 string) offset=2 reason=7
+    ZZ\xfd\xbf\xbf\x2f\xbf\xbfYY  
+Error -10 (bad UTF-8 string) offset=2 reason=8
+    ZZ\xfd\xbf\xbf\xbf\x2f\xbfYY  
+Error -10 (bad UTF-8 string) offset=2 reason=9
+    ZZ\xfd\xbf\xbf\xbf\xbf\x2fYY  
+Error -10 (bad UTF-8 string) offset=2 reason=10
+    ZZ\xffYY
+Error -10 (bad UTF-8 string) offset=2 reason=21
+    ZZ\xfeYY  
+Error -10 (bad UTF-8 string) offset=2 reason=21


 /anything/8
     \xc0\x80
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=15
     \xc1\x8f 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=15
     \xe0\x9f\x80
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=16
     \xf0\x8f\x80\x80 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=17
     \xf8\x87\x80\x80\x80  
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=18
     \xfc\x83\x80\x80\x80\x80
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=19
     \xfe\x80\x80\x80\x80\x80  
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=21
     \xff\x80\x80\x80\x80\x80  
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=21
     \xc3\x8f
 No match
     \xe0\xaf\x80
@@ -858,13 +884,13 @@
     \xf1\x8f\x80\x80 
 No match
     \xf8\x88\x80\x80\x80  
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=11
     \xf9\x87\x80\x80\x80  
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=11
     \xfc\x84\x80\x80\x80\x80
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=12
     \xfd\x83\x80\x80\x80\x80
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=12
     \?\xf8\x88\x80\x80\x80  
 No match
     \?\xf9\x87\x80\x80\x80  
@@ -1453,27 +1479,27 @@
     \x{0}\x{d7ff}\x{e000}\x{10ffff}
 No match
     \x{d800}
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=14
     \x{d800}\?
 No match
     \x{da00}
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=14
     \x{da00}\?
 No match
     \x{dfff}
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=14
     \x{dfff}\?
 No match
     \x{110000}    
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=13
     \x{110000}\?    
 No match
     \x{2000000} 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=11
     \x{2000000}\? 
 No match
     \x{7fffffff} 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=12
     \x{7fffffff}\? 
 No match


@@ -2304,7 +2330,7 @@
     a\x{123}aa\>1
  0: aa
     a\x{123}aa\>2
-Error -11
+Error -11 (bad UTF-8 offset)
     a\x{123}aa\>3
  0: aa
     a\x{123}aa\>4
@@ -2312,7 +2338,7 @@
     a\x{123}aa\>5
 No match
     a\x{123}aa\>6
-Error -24
+Error -24 (bad offset value)


/^\cģ/8
Failed: \c must be followed by an ASCII character at offset 3
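
For reading the "reason=" values in the expected output above, a lookup table can help. The sketch below is hypothetical and is not part of PCRE or pcretest. The texts for reasons 7 to 19 follow the code in this commit; those for 1 to 6 and 21 are inferred from the test cases above; reason 20 does not appear in this excerpt, and its text is taken from the pcreapi documentation as best understood.

```c
#include <assert.h>
#include <stddef.h>

/* Index is the reason value printed by pcretest; entry 0 is unused. */
static const char *const reason_texts[] = {
  NULL,
  "string ends with 1 byte missing",             /*  1 */
  "string ends with 2 bytes missing",            /*  2 */
  "string ends with 3 bytes missing",            /*  3 */
  "string ends with 4 bytes missing",            /*  4 */
  "string ends with 5 bytes missing",            /*  5 */
  "2nd byte does not start with bits 10",        /*  6 */
  "3rd byte does not start with bits 10",        /*  7 */
  "4th byte does not start with bits 10",        /*  8 */
  "5th byte does not start with bits 10",        /*  9 */
  "6th byte does not start with bits 10",        /* 10 */
  "5-byte character excluded by RFC 3629",       /* 11 */
  "6-byte character excluded by RFC 3629",       /* 12 */
  "code point greater than 0x10ffff",            /* 13 */
  "code point in the range 0xd800 - 0xdfff",     /* 14 */
  "overlong 2-byte sequence",                    /* 15 */
  "overlong 3-byte sequence",                    /* 16 */
  "overlong 4-byte sequence",                    /* 17 */
  "overlong 5-byte sequence",                    /* 18 */
  "overlong 6-byte sequence",                    /* 19 */
  "isolated byte with the 0x80 bit set",         /* 20 (not shown above) */
  "byte 0xfe or 0xff"                            /* 21 */
};

#define REASON_COUNT ((int)(sizeof reason_texts / sizeof reason_texts[0]))

/* Hypothetical helper: maps a reason value to a description, with a
   fallback for values outside the table. */
static const char *describe_utf8_reason(int reason)
{
  if (reason < 1 || reason >= REASON_COUNT) return "unknown reason";
  assert(reason_texts[reason] != NULL);
  return reason_texts[reason];
}
```

For example, the line "Error -10 (bad UTF-8 string) offset=2 reason=7" for the subject ZZ\xea\xaf\x20YY says that the character starting at offset 2 has a third byte (\x20) that is not a valid continuation byte.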

Modified: code/trunk/testdata/testoutput7
===================================================================
--- code/trunk/testdata/testoutput7    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/testdata/testoutput7    2011-05-07 15:37:31 UTC (rev 598)
@@ -6431,7 +6431,7 @@


 /^(?(2)a|(1)(2))+$/
     123a
-Error -17
+Error -17 (backreference condition or recursion test not supported for DFA matching)


 /(?<=a|bbbb)c/
     ac
@@ -7459,7 +7459,7 @@


 /abc\K123/
     xyzabc123pqr
-Error -16
+Error -16 (item unsupported for DFA matching)


 /(?<=abc)123/
     xyzabc123pqr 
@@ -7598,29 +7598,29 @@


 /^(?!a(*SKIP)b)/
     ac
-Error -16
+Error -16 (item unsupported for DFA matching)


 /^(?=a(*SKIP)b|ac)/
     ** Failers
 No match
     ac
-Error -16
+Error -16 (item unsupported for DFA matching)


 /^(?=a(*THEN)b|ac)/
     ac
-Error -16
+Error -16 (item unsupported for DFA matching)


 /^(?=a(*PRUNE)b)/
     ab  
-Error -16
+Error -16 (item unsupported for DFA matching)
     ** Failers 
 No match
     ac
-Error -16
+Error -16 (item unsupported for DFA matching)


 /^(?(?!a(*SKIP)b))/
     ac
-Error -16
+Error -16 (item unsupported for DFA matching)


 /(?<=abc)def/
     abc\P\P
@@ -7693,8 +7693,8 @@
     abc\>3
 No match
     abc\>4
-Error -24
+Error -24 (bad offset value)
     abc\>-4 
-Error -24
+Error -24 (bad offset value)


/-- End of testinput7 --/

Modified: code/trunk/testdata/testoutput8
===================================================================
--- code/trunk/testdata/testoutput8    2011-05-02 17:08:52 UTC (rev 597)
+++ code/trunk/testdata/testoutput8    2011-05-07 15:37:31 UTC (rev 598)
@@ -98,19 +98,19 @@


 /abc/8
     \xC3]
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=6
     \xC3
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=1
     \xC3\xC3\xC3
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=6
     \xC3\xC3\xC3\?
 No match
     \xe1\x88 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=1
     \P\xe1\x88 
-Error -10
+Error -10 (bad UTF-8 string) offset=0 reason=1
     \P\P\xe1\x88 
-Error -25
+Error -25 (short UTF-8 string) offset=0 reason=1


 /a.b/8
     acb
@@ -1337,7 +1337,7 @@
  0: aa
  1: a
     a\x{123}aa\>2
-Error -11
+Error -11 (bad UTF-8 offset)
     a\x{123}aa\>3
  0: aa
  1: a
@@ -1346,6 +1346,6 @@
     a\x{123}aa\>5
 No match
     a\x{123}aa\>6
-Error -24
+Error -24 (bad offset value)


/-- End of testinput8 --/