Revision: 1107
http://vcs.pcre.org/viewvc?view=rev&revision=1107
Author: chpe
Date: 2012-10-16 16:56:51 +0100 (Tue, 16 Oct 2012)
Log Message:
-----------
pcre32: docs: Document that non-character codepoints are excluded
Document that UTF-8/16/32 validity checks now exclude the non-character
code points, which are U+FDD0..U+FDEF and the last two code points in each
plane, U+??FFFE and U+??FFFF.
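The excluded set described above can be illustrated with a small predicate. This is only a sketch of the stated ranges, not the code PCRE uses; the function names is_noncharacter() and passes_validity_check() are hypothetical and chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: true if c is one of the
   non-character code points named in the log message. */
static bool is_noncharacter(uint32_t c)
{
  /* The contiguous block U+FDD0..U+FDEF. */
  if (c >= 0xFDD0 && c <= 0xFDEF) return true;
  /* The last two code points of each plane: low 16 bits are FFFE or FFFF. */
  return (c & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: true if c would be accepted under the rules the
   log message describes (range, surrogates, non-characters). */
static bool passes_validity_check(uint32_t c)
{
  if (c > 0x10FFFF) return false;                /* beyond the Unicode range */
  if (c >= 0xD800 && c <= 0xDFFF) return false;  /* surrogate area */
  return !is_noncharacter(c);
}
```

The mask test treats U+??FFFE and U+??FFFF uniformly across all 17 planes, so no per-plane table is needed.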
Modified Paths:
--------------
code/trunk/doc/pcreunicode.3
Modified: code/trunk/doc/pcreunicode.3
===================================================================
--- code/trunk/doc/pcreunicode.3 2012-10-16 15:56:48 UTC (rev 1106)
+++ code/trunk/doc/pcreunicode.3 2012-10-16 15:56:51 UTC (rev 1107)
@@ -93,14 +93,17 @@
which are themselves derived from the Unicode specification. Earlier releases
of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit
values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0
-to U+10FFFF, excluding U+D800 to U+DFFF.
+to U+10FFFF, excluding the surrogate area, and the non-characters.
.P
-The excluded code points are the "Surrogate Area" of Unicode. They are reserved
+Excluded code points are the "Surrogate Area" of Unicode. They are reserved
for use by UTF-16, where they are used in pairs to encode codepoints with
values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 encoding. (In other words, the whole
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
.P
+Also excluded are the "Non-Characters" code points, which are U+FDD0 to U+FDEF
+and the last two code points in each plane, U+??FFFE and U+??FFFF.
+.P
If an invalid UTF-8 string is passed to PCRE, an error return is given. At
compile time, the only additional information is the offset to the first byte
of the failing character. The run-time functions \fBpcre_exec()\fP and
@@ -143,6 +146,9 @@
U+D800 to U+DFFF are independent code points. Values in the surrogate range
must be used in pairs in the correct manner.
.P
+Excluded are the "Non-Characters" code points, which are U+FDD0 to U+FDEF
+and the last two code points in each plane, U+??FFFE and U+??FFFF.
+.P
If an invalid UTF-16 string is passed to PCRE, an error return is given. At
compile time, the only additional information is the offset to the first data
unit of the failing character. The run-time functions \fBpcre16_exec()\fP and
@@ -163,7 +169,9 @@
When you set the PCRE_UTF32 flag, the strings of 32-bit data units that are
passed as patterns and subjects are (by default) checked for validity on entry
to the relevant functions. This check allows only values in the range U+0
-to U+10FFFF, excluding the surrogate are U+D800 to U+DFFF, and U+FFEF.
+to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the
+"Non-Characters" code points, which are U+FDD0 to U+FDEF and the last two
+characters in each plane, U+??FFFE and U+??FFFF.
.P
If an invalid UTF-32 string is passed to PCRE, an error return is given. At
compile time, the only additional information is the offset to the first data