[Pcre-svn] [737] code/trunk/doc: Add more about \C to docume…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [737] code/trunk/doc: Add more about \C to documentation.
Revision: 737
          http://vcs.pcre.org/viewvc?view=rev&revision=737
Author:   ph10
Date:     2011-10-19 18:37:29 +0100 (Wed, 19 Oct 2011)


Log Message:
-----------
Add more about \C to documentation.

Modified Paths:
--------------
    code/trunk/doc/pcrejit.3
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcreunicode.3


Modified: code/trunk/doc/pcrejit.3
===================================================================
--- code/trunk/doc/pcrejit.3    2011-10-16 15:48:03 UTC (rev 736)
+++ code/trunk/doc/pcrejit.3    2011-10-19 17:37:29 UTC (rev 737)
@@ -95,7 +95,7 @@
 .P
 The unsupported pattern items are:
 .sp
-  \eC            match a single byte, even in UTF-8 mode
+  \eC            match a single byte; not supported in UTF-8 mode
   (?Cn)          callouts
   (?(<name>)...  conditional test on setting of a named subpattern
   (?(R)...       conditional test on whole pattern recursion
@@ -267,6 +267,6 @@
 .rs
 .sp
 .nf
-Last updated: 05 October 2011
+Last updated: 19 October 2011
 Copyright (c) 1997-2011 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2011-10-16 15:48:03 UTC (rev 736)
+++ code/trunk/doc/pcrepattern.3    2011-10-19 17:37:29 UTC (rev 737)
@@ -971,11 +971,14 @@
 .rs
 .sp
 Outside a character class, the escape sequence \eC matches any one byte, both
-in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
+in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
 characters. The feature is provided in Perl in order to match individual bytes
-in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
-rest of the string may start with a malformed UTF-8 character. For this reason,
-the \eC escape sequence is best avoided.
+in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
+breaks up characters into individual bytes, matching one byte with \eC in UTF-8
+mode means that the rest of the string may start with a malformed UTF-8
+character. This has undefined results, because PCRE assumes that it is dealing
+with valid UTF-8 strings (and by default it checks this at the start of
+processing unless the PCRE_NO_UTF8_CHECK option is used).
 .P
 PCRE does not allow \eC to appear in lookbehind assertions
 .\" HTML <a href="#lookbehind">
@@ -984,6 +987,27 @@
 .\"
 because in UTF-8 mode this would make it impossible to calculate the length of
 the lookbehind.
+.P
+In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
+way of using it that avoids the problem of malformed UTF-8 characters is to 
+use a lookahead to check the length of the next character, as in this pattern 
+(ignore white space and line breaks):
+.sp
+  (?| (?=[\ex00-\ex7f])(\eC) |
+      (?=[\ex80-\ex{7ff}])(\eC)(\eC) |  
+      (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |  
+      (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
+.sp
+A group that starts with (?| resets the capturing parentheses numbers in each 
+alternative (see 
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+"Duplicate Subpattern Numbers"
+.\"
+below). The assertions at the start of each branch check the next UTF-8 
+character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The 
+character's individual bytes are then captured by the appropriate number of
+groups.
 .
 .
 .\" HTML <a name="characterclass"></a>
@@ -2830,6 +2854,6 @@
 .rs
 .sp
 .nf
-Last updated: 09 October 2011
+Last updated: 19 October 2011
 Copyright (c) 1997-2011 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcreunicode.3
===================================================================
--- code/trunk/doc/pcreunicode.3    2011-10-16 15:48:03 UTC (rev 736)
+++ code/trunk/doc/pcreunicode.3    2011-10-19 17:37:29 UTC (rev 737)
@@ -100,11 +100,16 @@
 4. The dot metacharacter matches one UTF-8 character instead of a single byte.
 .P
 5. The escape sequence \eC can be used to match a single byte in UTF-8 mode,
-but its use can lead to some strange effects. This facility is not available in
-the alternative matching function, \fBpcre_dfa_exec()\fP, nor is it supported
-by the JIT optimization of \fBpcre_exec()\fP. If JIT optimization is requested
-for a pattern that contains \eC, it will not succeed, and so the matching will
-be carried out by the normal interpretive function.
+but its use can lead to some strange effects because it breaks up multibyte
+characters (see the description of \eC in the
+.\" HREF
+\fBpcrepattern\fP
+.\"
+documentation). The use of \eC is not supported in the alternative matching
+function \fBpcre_dfa_exec()\fP, nor is it supported in UTF-8 mode by the JIT
+optimization of \fBpcre_exec()\fP. If JIT optimization is requested for a UTF-8
+pattern that contains \eC, it will not succeed, and so the matching will be
+carried out by the normal interpretive function.
 .P
 6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly
 test characters of any code value, but, by default, the characters that PCRE
@@ -158,6 +163,6 @@
 .rs
 .sp
 .nf
-Last updated: 06 September 2011
+Last updated: 19 October 2011
 Copyright (c) 1997-2011 University of Cambridge.
 .fi
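
Illustrative note (not part of the commit): the four lookahead ranges in the pcrepattern.3 example above correspond to the code point ranges that UTF-8 encodes in 1, 2, 3, and 4 bytes. The sketch below reproduces only that classification in Python, whose standard re module does not support \C or (?|, so it is not a PCRE call. The function names utf8_length and split_first_char are hypothetical and chosen for this example.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for code_point.

    The boundaries mirror the four lookahead classes in the pattern:
    [\\x00-\\x7f], [\\x80-\\x{7ff}], [\\x{800}-\\x{ffff}], [\\x{10000}-\\x{1fffff}].
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point is outside the 4-byte UTF-8 range")


def split_first_char(text):
    """Return the bytes of the first character of text, one per list item.

    This imitates what the capture groups in the pattern hold after a
    match: one group per byte of the character. Assumption: text is
    non-empty and its first character is encodable (not a lone surrogate).
    """
    first = text[0]
    encoded = first.encode("utf-8")
    count = utf8_length(ord(first))
    return [encoded[i:i + 1] for i in range(count)]


if __name__ == "__main__":
    for sample in ("a", "\u00e9", "\u20ac", "\U0001f600"):
        print(hex(ord(sample)), split_first_char(sample))
```

Because the lookahead fixes the number of \C items before any byte is consumed, a match always ends on a character boundary, which is the property the pattern relies on.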