[Pcre-svn] [1107] code/trunk/doc/pcreunicode.3: pcre32: docs…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [1107] code/trunk/doc/pcreunicode.3: pcre32: docs: Document that non-character codepoints are excluded
Revision: 1107
          http://vcs.pcre.org/viewvc?view=rev&revision=1107
Author:   chpe
Date:     2012-10-16 16:56:51 +0100 (Tue, 16 Oct 2012)


Log Message:
-----------
pcre32: docs: Document that non-character codepoints are excluded

Document that UTF-8/16/32 validity checks now exclude the non-character
code points, which are U+FDD0..U+FDEF and the last two code points in each
plane, U+??FFFE and U+??FFFF.
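
As an illustration of the ranges named in the log message, the test for a
non-character code point can be sketched in C as below. This is only a
minimal sketch under stated assumptions: is_noncharacter() is a hypothetical
helper name chosen here, not a PCRE function, and it does not reproduce how
PCRE's own validity checks are written.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the PCRE API): returns true if cp is one
 * of the Unicode non-character code points described in the log message.
 */
static bool is_noncharacter(uint32_t cp)
{
    /* Values above U+10FFFF are outside Unicode; they are invalid for a
     * different reason, so they are not classified as non-characters here. */
    if (cp > 0x10FFFF)
        return false;

    /* The contiguous block U+FDD0 to U+FDEF. */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;

    /* The last two code points of each plane: low 16 bits are 0xFFFE or
     * 0xFFFF. Masking off the lowest bit covers both with one comparison. */
    return (cp & 0xFFFE) == 0xFFFE;
}
```

Under these assumptions, a validity check for a UTF-32 data unit would reject
a value when it is above U+10FFFF, lies in the surrogate area U+D800 to
U+DFFF, or makes is_noncharacter() return true.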

Modified Paths:
--------------
    code/trunk/doc/pcreunicode.3


Modified: code/trunk/doc/pcreunicode.3
===================================================================
--- code/trunk/doc/pcreunicode.3    2012-10-16 15:56:48 UTC (rev 1106)
+++ code/trunk/doc/pcreunicode.3    2012-10-16 15:56:51 UTC (rev 1107)
@@ -93,14 +93,17 @@
 which are themselves derived from the Unicode specification. Earlier releases
 of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit
 values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0
-to U+10FFFF, excluding U+D800 to U+DFFF.
+to U+10FFFF, excluding the surrogate area, and the non-characters.
 .P
-The excluded code points are the "Surrogate Area" of Unicode. They are reserved
+Excluded code points are the "Surrogate Area" of Unicode. They are reserved
 for use by UTF-16, where they are used in pairs to encode codepoints with
 values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
 are available independently in the UTF-8 encoding. (In other words, the whole
 surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
 .P
+Also excluded are the "Non-Characters" code points, which are U+FDD0 to U+FDEF
+and the last two code points in each plane, U+??FFFE and U+??FFFF.
+.P
 If an invalid UTF-8 string is passed to PCRE, an error return is given. At
 compile time, the only additional information is the offset to the first byte
 of the failing character. The run-time functions \fBpcre_exec()\fP and
@@ -143,6 +146,9 @@
 U+D800 to U+DFFF are independent code points. Values in the surrogate range
 must be used in pairs in the correct manner.
 .P
+Excluded are the "Non-Characters" code points, which are U+FDD0 to U+FDEF
+and the last two code points in each plane, U+??FFFE and U+??FFFF.
+.P
 If an invalid UTF-16 string is passed to PCRE, an error return is given. At
 compile time, the only additional information is the offset to the first data
 unit of the failing character. The run-time functions \fBpcre16_exec()\fP and
@@ -163,7 +169,9 @@
 When you set the PCRE_UTF32 flag, the strings of 32-bit data units that are
 passed as patterns and subjects are (by default) checked for validity on entry
 to the relevant functions.  This check allows only values in the range U+0
-to U+10FFFF, excluding the surrogate are U+D800 to U+DFFF, and U+FFEF.
+to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the
+"Non-Characters" code points, which are U+FDD0 to U+FDEF and the last two
+characters in each plane, U+??FFFE and U+??FFFF.
 .P
 If an invalid UTF-32 string is passed to PCRE, an error return is given. At
 compile time, the only additional information is the offset to the first data