[Pcre-svn] [1261] code/trunk: Correct Unicode string checkin…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1261] code/trunk: Correct Unicode string checking in the light of corrigendum #9.
Revision: 1261
          http://vcs.pcre.org/viewvc?view=rev&revision=1261
Author:   ph10
Date:     2013-02-27 16:27:01 +0000 (Wed, 27 Feb 2013)


Log Message:
-----------
Correct Unicode string checking in the light of corrigendum #9.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcreunicode.3
    code/trunk/pcre.h.in
    code/trunk/pcre16_valid_utf16.c
    code/trunk/pcre32_valid_utf32.c
    code/trunk/pcre_valid_utf8.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput15
    code/trunk/testdata/testinput18
    code/trunk/testdata/testinput24
    code/trunk/testdata/testinput26
    code/trunk/testdata/testoutput15
    code/trunk/testdata/testoutput18-16
    code/trunk/testdata/testoutput18-32
    code/trunk/testdata/testoutput24
    code/trunk/testdata/testoutput26


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/ChangeLog    2013-02-27 16:27:01 UTC (rev 1261)
@@ -79,6 +79,11 @@
 20. Added the PCRE-specific property \p{Xuc} for matching characters that can
     be expressed in certain programming languages using Universal Character 
     Names. 
+    
+21. Unicode validation has been updated in the light of Unicode Corrigendum #9,
+    which points out that "non characters" are not "characters that may not 
+    appear in Unicode strings" but rather "characters that are reserved for 
+    internal use and have only local meaning".



Version 8.32 30-November-2012
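
For readers unfamiliar with the term, the code points that ChangeLog item 21 calls "non characters" can be picked out with the same bit test that this revision removes from the validators. The following is only an illustrative sketch; the function name is_noncharacter is hypothetical and is not part of PCRE.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: returns 1 if c is one of the 66
   Unicode noncharacters, 0 otherwise. The two tests are the ones that the
   validators used before this revision. */
static int is_noncharacter(uint32_t c)
{
  if (c > 0x10ffffu) return 0;              /* Outside the Unicode range */
  if ((c & 0xfffeu) == 0xfffeu) return 1;   /* Last two code points of a plane */
  return c >= 0xfdd0u && c <= 0xfdefu;      /* U+FDD0 to U+FDEF */
}
```

After this revision such code points are accepted in subject and pattern strings, so a caller that still wants to reject them would have to apply a test of this kind itself.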

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/doc/pcreapi.3    2013-02-27 16:27:01 UTC (rev 1261)
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "22 February 2013" "PCRE 8.33"
+.TH PCREAPI 3 "27 February 2013" "PCRE 8.33"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .sp
@@ -2353,8 +2353,10 @@
 .sp
   PCRE_UTF8_ERR2
 .sp
-Non-character. These are the last two characters in each plane (0xfffe, 0xffff,
-0x1fffe, 0x1ffff .. 0x10fffe, 0x10ffff), and the characters 0xfdd0..0xfdef.
+This error code was formerly used when the presence of a so-called
+"non-character" caused an error. Unicode corrigendum #9 makes it clear that 
+such characters should not cause a string to be rejected, and so this code is 
+no longer in use and is never returned.
 .
 .
 .SH "EXTRACTING CAPTURED SUBSTRINGS BY NUMBER"
@@ -2823,6 +2825,6 @@
 .rs
 .sp
 .nf
-Last updated: 22 February 2013
+Last updated: 27 February 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcreunicode.3
===================================================================
--- code/trunk/doc/pcreunicode.3    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/doc/pcreunicode.3    2013-02-27 16:27:01 UTC (rev 1261)
@@ -1,4 +1,4 @@
-.TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"
+.TH PCREUNICODE 3 "27 February 2013" "PCRE 8.33"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
@@ -84,7 +84,9 @@
 which are themselves derived from the Unicode specification. Earlier releases
 of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit
 values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0
-to U+10FFFF, excluding the surrogate area and the non-characters.
+to U+10FFFF, excluding the surrogate area. (From release 8.33 the so-called 
+"non-character" code points are no longer excluded because Unicode corrigendum 
+#9 makes it clear that they should not be.)
 .P
 Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
 where they are used in pairs to encode codepoints with values greater than
@@ -93,9 +95,6 @@
 surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
 UTF-32.)
 .P
-Also excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF
-and the last two code points in each plane, U+??FFFE and U+??FFFF.
-.P
 If an invalid UTF-8 string is passed to PCRE, an error return is given. At
 compile time, the only additional information is the offset to the first byte
 of the failing character. The run-time functions \fBpcre_exec()\fP and
@@ -128,9 +127,6 @@
 U+D800 to U+DFFF are independent code points. Values in the surrogate range
 must be used in pairs in the correct manner.
 .P
-Excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF
-and the last two code points in each plane, U+??FFFE and U+??FFFF.
-.P
 If an invalid UTF-16 string is passed to PCRE, an error return is given. At
 compile time, the only additional information is the offset to the first data
 unit of the failing character. The run-time functions \fBpcre16_exec()\fP and
@@ -152,9 +148,7 @@
 When you set the PCRE_UTF32 flag, the strings of 32-bit data units that are
 passed as patterns and subjects are (by default) checked for validity on entry
 to the relevant functions.  This check allows only values in the range U+0
-to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the
-"Non-Character" code points, which are U+FDD0 to U+FDEF and the last two
-characters in each plane, U+??FFFE and U+??FFFF.
+to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF.
 .P
 If an invalid UTF-32 string is passed to PCRE, an error return is given. At
 compile time, the only additional information is the offset to the first data
@@ -250,6 +244,6 @@
 .rs
 .sp
 .nf
-Last updated: 11 November 2012
-Copyright (c) 1997-2012 University of Cambridge.
+Last updated: 27 February 2013
+Copyright (c) 1997-2013 University of Cambridge.
 .fi
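
The UTF-32 rule as the revised pcreunicode.3 text states it (U+0 to U+10FFFF, excluding U+D800 to U+DFFF) can be sketched as a stand-alone check. This is a minimal sketch and not the library source: the function name and the SKETCH_ constants are hypothetical, and only their numeric values follow PCRE_UTF32_ERR0, ERR1 and ERR3 from the pcre.h.in diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes for this sketch; the values mirror
   PCRE_UTF32_ERR0, PCRE_UTF32_ERR1 and PCRE_UTF32_ERR3. */
enum {
  SKETCH_UTF32_OK        = 0,
  SKETCH_UTF32_SURROGATE = 1,
  SKETCH_UTF32_TOO_BIG   = 3
};

/* Returns 0 if every data unit is acceptable; otherwise returns a reason
   code and stores the index of the failing unit in *erroroffset. */
static int sketch_valid_utf32(const uint32_t *s, size_t length,
  size_t *erroroffset)
{
  size_t i;
  for (i = 0; i < length; i++)
    {
    uint32_t c = s[i];
    if ((c & 0xfffff800u) == 0xd800u)      /* U+D800 to U+DFFF */
      {
      *erroroffset = i;
      return SKETCH_UTF32_SURROGATE;
      }
    if (c > 0x10ffffu)                     /* Beyond the Unicode range */
      {
      *erroroffset = i;
      return SKETCH_UTF32_TOO_BIG;
      }
    }
  return SKETCH_UTF32_OK;
}
```

Under this rule a string containing U+FFFE or U+FDD0 passes, while U+D800 and 0x110000 are still rejected.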


Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/pcre.h.in    2013-02-27 16:27:01 UTC (rev 1261)
@@ -224,7 +224,7 @@
 #define PCRE_UTF8_ERR19             19
 #define PCRE_UTF8_ERR20             20
 #define PCRE_UTF8_ERR21             21
-#define PCRE_UTF8_ERR22             22
+#define PCRE_UTF8_ERR22             22  /* Unused (was non-character) */


/* Specific error codes for UTF-16 validity checks */

@@ -232,13 +232,13 @@
 #define PCRE_UTF16_ERR1              1
 #define PCRE_UTF16_ERR2              2
 #define PCRE_UTF16_ERR3              3
-#define PCRE_UTF16_ERR4              4
+#define PCRE_UTF16_ERR4              4  /* Unused (was non-character) */


/* Specific error codes for UTF-32 validity checks */

 #define PCRE_UTF32_ERR0              0
 #define PCRE_UTF32_ERR1              1
-#define PCRE_UTF32_ERR2              2
+#define PCRE_UTF32_ERR2              2  /* Unused (was non-character) */
 #define PCRE_UTF32_ERR3              3


/* Request types for pcre_fullinfo() */

Modified: code/trunk/pcre16_valid_utf16.c
===================================================================
--- code/trunk/pcre16_valid_utf16.c    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/pcre16_valid_utf16.c    2013-02-27 16:27:01 UTC (rev 1261)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2012 University of Cambridge
+           Copyright (c) 1997-2013 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -69,7 +69,7 @@
PCRE_UTF16_ERR1 Missing low surrogate at the end of the string
PCRE_UTF16_ERR2 Invalid low surrogate
PCRE_UTF16_ERR3 Isolated low surrogate
-PCRE_UTF16_ERR4 Non-character
+PCRE_UTF16_ERR4 Unused (was non-character)

 Arguments:
   string       points to the string
@@ -100,19 +100,10 @@
   if ((c & 0xf800) != 0xd800)
     {
     /* Normal UTF-16 code point. Neither high nor low surrogate. */
-
-    /* Check for non-characters */
-    if ((c & 0xfffeu) == 0xfffeu || (c >= 0xfdd0u && c <= 0xfdefu))
-      {
-      *erroroffset = p - string;
-      return PCRE_UTF16_ERR4;
-      }
     }
   else if ((c & 0x0400) == 0)
     {
-    /* High surrogate. */
-
-    /* Must be a followed by a low surrogate. */
+    /* High surrogate. Must be a followed by a low surrogate. */
     if (length == 0)
       {
       *erroroffset = p - string;
@@ -125,16 +116,6 @@
       *erroroffset = p - string;
       return PCRE_UTF16_ERR2;
       }
-    else
-      {
-      /* Valid surrogate, but check for non-characters */
-      c = (((c & 0x3ffu) << 10) | (*p & 0x3ffu)) + 0x10000u;
-      if ((c & 0xfffeu) == 0xfffeu)
-        {
-        *erroroffset = p - string;
-        return PCRE_UTF16_ERR4;
-        }
-      }
     }
   else
     {
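
With the non-character tests gone, what remains of the UTF-16 check is surrogate pairing. The sketch below restates that logic in stand-alone form under stated assumptions: the function name and SKETCH_ constants are hypothetical, their values follow PCRE_UTF16_ERR0 to ERR3, and the error offsets are chosen to agree with the expected test output in this commit (offset 0 for a lone \x{d800}, offset 1 for \x{d800}\x{1234}). It is not the library's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes; values mirror PCRE_UTF16_ERR0..ERR3. */
enum {
  SKETCH_UTF16_OK           = 0,
  SKETCH_UTF16_MISSING_LOW  = 1,   /* High surrogate at end of string */
  SKETCH_UTF16_INVALID_LOW  = 2,   /* High surrogate not followed by low */
  SKETCH_UTF16_ISOLATED_LOW = 3    /* Low surrogate with no high before it */
};

static int sketch_valid_utf16(const uint16_t *s, size_t length,
  size_t *erroroffset)
{
  size_t i = 0;
  while (i < length)
    {
    uint16_t c = s[i];
    if ((c & 0xf800u) != 0xd800u)          /* Not a surrogate at all */
      {
      i++;
      continue;
      }
    if ((c & 0x0400u) != 0)                /* Low surrogate seen first */
      {
      *erroroffset = i;
      return SKETCH_UTF16_ISOLATED_LOW;
      }
    if (i + 1 >= length)                   /* High surrogate, nothing after */
      {
      *erroroffset = i;
      return SKETCH_UTF16_MISSING_LOW;
      }
    if ((s[i + 1] & 0xfc00u) != 0xdc00u)   /* Next unit is not a low surrogate */
      {
      *erroroffset = i + 1;
      return SKETCH_UTF16_INVALID_LOW;
      }
    i += 2;                                /* Valid pair */
    }
  return SKETCH_UTF16_OK;
}
```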


Modified: code/trunk/pcre32_valid_utf32.c
===================================================================
--- code/trunk/pcre32_valid_utf32.c    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/pcre32_valid_utf32.c    2013-02-27 16:27:01 UTC (rev 1261)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2012 University of Cambridge
+           Copyright (c) 1997-2013 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -66,7 +66,7 @@

PCRE_UTF32_ERR0 No error
PCRE_UTF32_ERR1 Surrogate character
-PCRE_UTF32_ERR2 Non-character
+PCRE_UTF32_ERR2 Unused (was non-character)
PCRE_UTF32_ERR3 Character > 0x10ffff

 Arguments:
@@ -98,16 +98,9 @@
   if ((c & 0xfffff800u) != 0xd800u)
     {
     /* Normal UTF-32 code point. Neither high nor low surrogate. */
-
-    /* Check for non-characters */
-    if ((c & 0xfffeu) == 0xfffeu || (c >= 0xfdd0u && c <= 0xfdefu))
+    if (c > 0x10ffffu)
       {
       *erroroffset = p - string;
-      return PCRE_UTF32_ERR2;
-      }
-    else if (c > 0x10ffffu)
-      {
-      *erroroffset = p - string;
       return PCRE_UTF32_ERR3;
       }
     }


Modified: code/trunk/pcre_valid_utf8.c
===================================================================
--- code/trunk/pcre_valid_utf8.c    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/pcre_valid_utf8.c    2013-02-27 16:27:01 UTC (rev 1261)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2012 University of Cambridge
+           Copyright (c) 1997-2013 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -92,7 +92,7 @@
PCRE_UTF8_ERR19 Overlong 6-byte sequence (won't ever occur)
PCRE_UTF8_ERR20 Isolated 0x80 byte (not within UTF-8 character)
PCRE_UTF8_ERR21 Byte with the illegal value 0xfe or 0xff
-PCRE_UTF8_ERR22 Non-character
+PCRE_UTF8_ERR22 Unused (was non-character)

 Arguments:
   string       points to the string
@@ -118,7 +118,6 @@
 for (p = string; length-- > 0; p++)
   {
   register pcre_uchar ab, c, d;
-  pcre_uint32 v = 0;


   c = *p;
   if (c < 128) continue;                /* ASCII character */
@@ -187,7 +186,6 @@
       *erroroffset = (int)(p - string) - 2;
       return PCRE_UTF8_ERR14;
       }
-    v = ((c & 0x0f) << 12) | ((d & 0x3f) << 6) | (*p & 0x3f);
     break;


     /* 4-byte character. Check 3rd and 4th bytes for 0x80. Then check first 2
@@ -215,7 +213,6 @@
       *erroroffset = (int)(p - string) - 3;
       return PCRE_UTF8_ERR13;
       }
-    v = ((c & 0x07) << 18) | ((d & 0x3f) << 12) | ((p[-1] & 0x3f) << 6) | (*p & 0x3f);
     break;


     /* 5-byte and 6-byte characters are not allowed by RFC 3629, and will be
@@ -290,14 +287,6 @@
     *erroroffset = (int)(p - string) - ab;
     return (ab == 4)? PCRE_UTF8_ERR11 : PCRE_UTF8_ERR12;
     }
-
-  /* Reject non-characters. The pointer p is currently at the last byte of the
-  character. */
-  if ((v & 0xfffeu) == 0xfffeu || (v >= 0xfdd0 && v <= 0xfdef))
-    {
-    *erroroffset = (int)(p - string) - ab;
-    return PCRE_UTF8_ERR22;
-    }
   }


#else /* Not SUPPORT_UTF */
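
The variable v that this diff deletes held the decoded code point of a 3-byte or 4-byte sequence, which was needed only for the non-character test. As a worked illustration of that arithmetic, the hypothetical helpers below (not part of PCRE) apply the same masks and shifts to already-validated byte sequences.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of PCRE. Both assume p points at a
   well-formed sequence of the stated length. */

/* Decode a 3-byte UTF-8 sequence: 1110xxxx 10xxxxxx 10xxxxxx */
static uint32_t sketch_decode_utf8_3(const unsigned char *p)
{
  return ((uint32_t)(p[0] & 0x0f) << 12) |
         ((uint32_t)(p[1] & 0x3f) << 6)  |
          (uint32_t)(p[2] & 0x3f);
}

/* Decode a 4-byte UTF-8 sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
static uint32_t sketch_decode_utf8_4(const unsigned char *p)
{
  return ((uint32_t)(p[0] & 0x07) << 18) |
         ((uint32_t)(p[1] & 0x3f) << 12) |
         ((uint32_t)(p[2] & 0x3f) << 6)  |
          (uint32_t)(p[3] & 0x3f);
}
```

For the subject \xef\xb7\x90 that is dropped from testinput15, the 3-byte formula gives 0xf000 | 0x0dc0 | 0x10 = 0xfdd0, the first code point of the U+FDD0 to U+FDEF range, which is why it used to be rejected with reason 22.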

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/pcretest.c    2013-02-27 16:27:01 UTC (rev 1261)
@@ -1796,8 +1796,7 @@
                FALSE otherwise
 */


-#ifdef NEVER
-
+#ifdef NEVER   /* Not used */
 #ifdef SUPPORT_UTF
 static BOOL
 valid_utf32(pcre_uint32 *string, int length)
@@ -1808,28 +1807,17 @@
 for (p = string; length-- > 0; p++)
   {
   c = *p;
-
-  if (c > 0x10ffffu)
-    return FALSE;
-
-  /* A surrogate */
-  if ((c & 0xfffff800u) == 0xd800u)
-    return FALSE;
-
-  /* Non-character */
-  if ((c & 0xfffeu) == 0xfffeu || (c >= 0xfdd0u && c <= 0xfdefu))
-    return FALSE;
+  if (c > 0x10ffffu) return FALSE;                 /* Too big */
+  if ((c & 0xfffff800u) == 0xd800u) return FALSE;  /* Surrogate */
   }


return TRUE;
}
#endif /* SUPPORT_UTF */
-
#endif /* NEVER */
+#endif /* SUPPORT_PCRE32 */


-#endif
-
 /*************************************************
 *        Read or extend an input line            *
 *************************************************/


Modified: code/trunk/testdata/testinput15
===================================================================
--- code/trunk/testdata/testinput15    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testinput15    2013-02-27 16:27:01 UTC (rev 1261)
@@ -89,7 +89,6 @@
     \x80
     \xfe
     \xff
-    \xef\xb7\x90


 /badutf/8
     \xfb\x80\x80\x80\x80
@@ -136,7 +135,7 @@
     \?\xfc\x84\x80\x80\x80\x80
     \?\xfd\x83\x80\x80\x80\x80


-/noncharacter/8
+/./8
     \x{fffe}
     \x{ffff}
     \x{1fffe}
@@ -310,7 +309,6 @@
 /-- This tests the stricter UTF-8 check according to RFC 3629. --/ 


 /X/8
-    \x{0}\x{d7ff}\x{e000}\x{10ffff}
     \x{d800}
     \x{d800}\?
     \x{da00}


Modified: code/trunk/testdata/testinput18
===================================================================
--- code/trunk/testdata/testinput18    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testinput18    2013-02-27 16:27:01 UTC (rev 1261)
@@ -156,7 +156,6 @@
 /^[\QĀ\E-\QŐ\E/BZ8


 /X/8
-    \x{0}\x{d7ff}\x{e000}\x{10ffff}
     \x{d800}
     \x{d800}\?
     \x{da00}
@@ -169,7 +168,6 @@
     \x{dfff}\?
     \x{110000}
     \x{d800}\x{1234}
-    \x{fffe}


/(*UTF16)\x{11234}/
abcd\x{11234}pqr

Modified: code/trunk/testdata/testinput24
===================================================================
--- code/trunk/testdata/testinput24    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testinput24    2013-02-27 16:27:01 UTC (rev 1261)
@@ -1,6 +1,6 @@
 /-- Tests for the 16-bit library with UTF-16 support only */


-/noncharacter/8
+/./8
     \x{fffe}
     \x{ffff}
     \x{1fffe}


Modified: code/trunk/testdata/testinput26
===================================================================
--- code/trunk/testdata/testinput26    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testinput26    2013-02-27 16:27:01 UTC (rev 1261)
@@ -7,9 +7,9 @@
 /\C/8
     \x{110000}


-/-- Invalid UTF-32 --/
+/-- Noncharacters --/

-/noncharacter/8
+/./8
     \x{fffe}
     \x{ffff}
     \x{1fffe}


Modified: code/trunk/testdata/testoutput15
===================================================================
--- code/trunk/testdata/testoutput15    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testoutput15    2013-02-27 16:27:01 UTC (rev 1261)
@@ -163,8 +163,6 @@
 Error -10 (bad UTF-8 string) offset=0 reason=21
     \xff
 Error -10 (bad UTF-8 string) offset=0 reason=21
-    \xef\xb7\x90
-Error -10 (bad UTF-8 string) offset=0 reason=22


 /badutf/8
     \xfb\x80\x80\x80\x80
@@ -250,139 +248,139 @@
     \?\xfd\x83\x80\x80\x80\x80
 No match


-/noncharacter/8
+/./8
     \x{fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fffe}
     \x{ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{ffff}
     \x{1fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{1fffe}
     \x{1ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{1ffff}
     \x{2fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{2fffe}
     \x{2ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{2ffff}
     \x{3fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{3fffe}
     \x{3ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{3ffff}
     \x{4fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{4fffe}
     \x{4ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{4ffff}
     \x{5fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{5fffe}
     \x{5ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{5ffff}
     \x{6fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{6fffe}
     \x{6ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{6ffff}
     \x{7fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{7fffe}
     \x{7ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{7ffff}
     \x{8fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{8fffe}
     \x{8ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{8ffff}
     \x{9fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{9fffe}
     \x{9ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{9ffff}
     \x{afffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{afffe}
     \x{affff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{affff}
     \x{bfffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{bfffe}
     \x{bffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{bffff}
     \x{cfffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{cfffe}
     \x{cffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{cffff}
     \x{dfffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{dfffe}
     \x{dffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{dffff}
     \x{efffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{efffe}
     \x{effff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{effff}
     \x{ffffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{ffffe}
     \x{fffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fffff}
     \x{10fffe}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{10fffe}
     \x{10ffff}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{10ffff}
     \x{fdd0}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd0}
     \x{fdd1}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd1}
     \x{fdd2}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd2}
     \x{fdd3}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd3}
     \x{fdd4}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd4}
     \x{fdd5}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd5}
     \x{fdd6}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd6}
     \x{fdd7}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd7}
     \x{fdd8}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd8}
     \x{fdd9}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdd9}
     \x{fdda}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdda}
     \x{fddb}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fddb}
     \x{fddc}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fddc}
     \x{fddd}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fddd}
     \x{fdde}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdde}
     \x{fddf}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fddf}
     \x{fde0}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde0}
     \x{fde1}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde1}
     \x{fde2}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde2}
     \x{fde3}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde3}
     \x{fde4}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde4}
     \x{fde5}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde5}
     \x{fde6}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde6}
     \x{fde7}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde7}
     \x{fde8}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde8}
     \x{fde9}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fde9}
     \x{fdea}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdea}
     \x{fdeb}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdeb}
     \x{fdec}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdec}
     \x{fded}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fded}
     \x{fdee}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdee}
     \x{fdef}
-Error -10 (bad UTF-8 string) offset=0 reason=22
+ 0: \x{fdef}


/\x{100}/8DZ
------------------------------------------------------------------
@@ -891,8 +889,6 @@
/-- This tests the stricter UTF-8 check according to RFC 3629. --/

 /X/8
-    \x{0}\x{d7ff}\x{e000}\x{10ffff}
-Error -10 (bad UTF-8 string) offset=7 reason=22
     \x{d800}
 Error -10 (bad UTF-8 string) offset=0 reason=14
     \x{d800}\?


Modified: code/trunk/testdata/testoutput18-16
===================================================================
--- code/trunk/testdata/testoutput18-16    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testoutput18-16    2013-02-27 16:27:01 UTC (rev 1261)
@@ -608,8 +608,6 @@
 Failed: missing terminating ] for character class at offset 13


 /X/8
-    \x{0}\x{d7ff}\x{e000}\x{10ffff}
-Error -10 (bad UTF-16 string) offset=4 reason=4
     \x{d800}
 Error -10 (bad UTF-16 string) offset=0 reason=1
     \x{d800}\?
@@ -634,8 +632,6 @@
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
     \x{d800}\x{1234}
 Error -10 (bad UTF-16 string) offset=1 reason=2
-    \x{fffe}
-Error -10 (bad UTF-16 string) offset=0 reason=4


/(*UTF16)\x{11234}/
abcd\x{11234}pqr

Modified: code/trunk/testdata/testoutput18-32
===================================================================
--- code/trunk/testdata/testoutput18-32    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testoutput18-32    2013-02-27 16:27:01 UTC (rev 1261)
@@ -606,8 +606,6 @@
 Failed: missing terminating ] for character class at offset 13


 /X/8
-    \x{0}\x{d7ff}\x{e000}\x{10ffff}
-Error -10 (bad UTF-32 string) offset=3 reason=2
     \x{d800}
 Error -10 (bad UTF-32 string) offset=0 reason=1
     \x{d800}\?
@@ -632,8 +630,6 @@
 Error -10 (bad UTF-32 string) offset=0 reason=3
     \x{d800}\x{1234}
 Error -10 (bad UTF-32 string) offset=0 reason=1
-    \x{fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2


/(*UTF16)\x{11234}/
Failed: (*VERB) not recognized at offset 5

Modified: code/trunk/testdata/testoutput24
===================================================================
--- code/trunk/testdata/testoutput24    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testoutput24    2013-02-27 16:27:01 UTC (rev 1261)
@@ -1,138 +1,138 @@
 /-- Tests for the 16-bit library with UTF-16 support only */


-/noncharacter/8
+/./8
     \x{fffe}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fffe}
     \x{ffff}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{ffff}
     \x{1fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{1fffe}
     \x{1ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d83f}\x{dfff}
     \x{2fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{2fffe}
     \x{2ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d87f}\x{dfff}
     \x{3fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{3fffe}
     \x{3ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d8bf}\x{dfff}
     \x{4fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{4fffe}
     \x{4ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d8ff}\x{dfff}
     \x{5fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{5fffe}
     \x{5ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d93f}\x{dfff}
     \x{6fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{6fffe}
     \x{6ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d97f}\x{dfff}
     \x{7fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{7fffe}
     \x{7ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d9bf}\x{dfff}
     \x{8fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{8fffe}
     \x{8ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{d9ff}\x{dfff}
     \x{9fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{9fffe}
     \x{9ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{da3f}\x{dfff}
     \x{afffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{afffe}
     \x{affff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{da7f}\x{dfff}
     \x{bfffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{bfffe}
     \x{bffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{dabf}\x{dfff}
     \x{cfffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{cfffe}
     \x{cffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{daff}\x{dfff}
     \x{dfffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{dfffe}
     \x{dffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{db3f}\x{dfff}
     \x{efffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{efffe}
     \x{effff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{db7f}\x{dfff}
     \x{ffffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{ffffe}
     \x{fffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{dbbf}\x{dfff}
     \x{10fffe}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{10fffe}
     \x{10ffff}
-Error -10 (bad UTF-16 string) offset=1 reason=4
+ 0: \x{dbff}\x{dfff}
     \x{fdd0}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd0}
     \x{fdd1}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd1}
     \x{fdd2}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd2}
     \x{fdd3}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd3}
     \x{fdd4}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd4}
     \x{fdd5}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd5}
     \x{fdd6}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd6}
     \x{fdd7}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd7}
     \x{fdd8}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd8}
     \x{fdd9}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdd9}
     \x{fdda}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdda}
     \x{fddb}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fddb}
     \x{fddc}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fddc}
     \x{fddd}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fddd}
     \x{fdde}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdde}
     \x{fddf}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fddf}
     \x{fde0}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde0}
     \x{fde1}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde1}
     \x{fde2}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde2}
     \x{fde3}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde3}
     \x{fde4}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde4}
     \x{fde5}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde5}
     \x{fde6}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde6}
     \x{fde7}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde7}
     \x{fde8}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde8}
     \x{fde9}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fde9}
     \x{fdea}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdea}
     \x{fdeb}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdeb}
     \x{fdec}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdec}
     \x{fded}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fded}
     \x{fdee}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdee}
     \x{fdef}
-Error -10 (bad UTF-16 string) offset=0 reason=4
+ 0: \x{fdef}


 /bad/8
     \x{d800}
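
Some of the new expected results above are printed as two 16-bit units, for example \x{d83f}\x{dfff} for the subject \x{1ffff}. Those pairs agree with the standard UTF-16 surrogate encoding, which can be checked with the following sketch. The helper name is hypothetical and the code is not taken from pcretest.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: split a code point in the range
   U+10000 to U+10FFFF into its UTF-16 high and low surrogates. */
static void sketch_encode_surrogates(uint32_t c, uint16_t *high, uint16_t *low)
{
  uint32_t v = c - 0x10000u;                  /* 20-bit value */
  *high = (uint16_t)(0xd800u | (v >> 10));    /* Top 10 bits */
  *low  = (uint16_t)(0xdc00u | (v & 0x3ffu)); /* Bottom 10 bits */
}
```

For U+1FFFF, v is 0xffff, so the high surrogate is 0xd800 | 0x3f = 0xd83f and the low surrogate is 0xdc00 | 0x3ff = 0xdfff.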


Modified: code/trunk/testdata/testoutput26
===================================================================
--- code/trunk/testdata/testoutput26    2013-02-27 15:41:22 UTC (rev 1260)
+++ code/trunk/testdata/testoutput26    2013-02-27 16:27:01 UTC (rev 1261)
@@ -9,140 +9,140 @@
     \x{110000}
 Error -10 (bad UTF-32 string) offset=0 reason=3


-/-- Invalid UTF-32 --/
+/-- Noncharacters --/

-/noncharacter/8
+/./8
     \x{fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fffe}
     \x{ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{ffff}
     \x{1fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{1fffe}
     \x{1ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{1ffff}
     \x{2fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{2fffe}
     \x{2ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{2ffff}
     \x{3fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{3fffe}
     \x{3ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{3ffff}
     \x{4fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{4fffe}
     \x{4ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{4ffff}
     \x{5fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{5fffe}
     \x{5ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{5ffff}
     \x{6fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{6fffe}
     \x{6ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{6ffff}
     \x{7fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{7fffe}
     \x{7ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{7ffff}
     \x{8fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{8fffe}
     \x{8ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{8ffff}
     \x{9fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{9fffe}
     \x{9ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{9ffff}
     \x{afffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{afffe}
     \x{affff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{affff}
     \x{bfffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{bfffe}
     \x{bffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{bffff}
     \x{cfffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{cfffe}
     \x{cffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{cffff}
     \x{dfffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{dfffe}
     \x{dffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{dffff}
     \x{efffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{efffe}
     \x{effff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{effff}
     \x{ffffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{ffffe}
     \x{fffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fffff}
     \x{10fffe}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{10fffe}
     \x{10ffff}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{10ffff}
     \x{fdd0}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd0}
     \x{fdd1}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd1}
     \x{fdd2}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd2}
     \x{fdd3}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd3}
     \x{fdd4}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd4}
     \x{fdd5}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd5}
     \x{fdd6}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd6}
     \x{fdd7}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd7}
     \x{fdd8}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd8}
     \x{fdd9}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdd9}
     \x{fdda}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdda}
     \x{fddb}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fddb}
     \x{fddc}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fddc}
     \x{fddd}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fddd}
     \x{fdde}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdde}
     \x{fddf}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fddf}
     \x{fde0}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde0}
     \x{fde1}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde1}
     \x{fde2}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde2}
     \x{fde3}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde3}
     \x{fde4}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde4}
     \x{fde5}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde5}
     \x{fde6}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde6}
     \x{fde7}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde7}
     \x{fde8}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde8}
     \x{fde9}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fde9}
     \x{fdea}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdea}
     \x{fdeb}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdeb}
     \x{fdec}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdec}
     \x{fded}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fded}
     \x{fdee}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdee}
     \x{fdef}
-Error -10 (bad UTF-32 string) offset=0 reason=2
+ 0: \x{fdef}


/-- End of testinput26 --/