Revision: 1401
http://vcs.pcre.org/viewvc?view=rev&revision=1401
Author: ph10
Date: 2013-11-12 17:44:07 +0000 (Tue, 12 Nov 2013)
Log Message:
-----------
Clarify documentation for \s and \w when locales are in use - they may match
characters whose code points are greater than 127.
Modified Paths:
--------------
code/trunk/doc/pcrepattern.3
code/trunk/doc/pcresyntax.3
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2013-11-12 17:05:55 UTC (rev 1400)
+++ code/trunk/doc/pcrepattern.3 2013-11-12 17:44:07 UTC (rev 1401)
@@ -533,8 +533,11 @@
.P
For compatibility with Perl, \es did not used to match the VT character (code
11), which made it different from the the POSIX "space" class. However, Perl
-added VT at release 5.18, and PCRE followed suit at release 8.34. The \es
-characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).
+added VT at release 5.18, and PCRE followed suit at release 8.34. The default
+\es characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
+(32), which are defined as white space in the "C" locale. This list may vary if
+locale-specific matching is taking place; in particular, in some locales the
+"non-breaking space" character (\exA0) is recognized as white space.
.P
A "word" character is an underscore or any character that is a letter or digit.
By default, the definition of letters and digits is controlled by PCRE's
@@ -549,16 +552,18 @@
\fBpcreapi\fP
.\"
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
-or "french" in Windows, some character codes greater than 128 are used for
+or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \ew. The use of locales with
Unicode is discouraged.
.P
-By default, in a UTF mode, characters with values greater than 128 never match
-\ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain
-their original meanings from before UTF support was available, mainly for
-efficiency reasons. However, if PCRE is compiled with Unicode property support,
-and the PCRE_UCP option is set, the behaviour is changed so that Unicode
-properties are used to determine character types, as follows:
+By default, characters whose code points are greater than 127 never match \ed,
+\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
+characters in the range 128-255 when locale-specific matching is happening.
+These escape sequences retain their original meanings from before Unicode
+support was available, mainly for efficiency reasons. If PCRE is compiled with
+Unicode property support, and the PCRE_UCP option is set, the behaviour is
+changed so that Unicode properties are used to determine character types, as
+follows:
.sp
\ed any character that matches \ep{Nd} (decimal digit)
\es any character that matches \ep{Z} or \eh or \ev
@@ -572,7 +577,7 @@
.P
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at
release 5.10. In contrast to the other sequences, which match only ASCII
-characters by default, these always match certain high-valued codepoints,
+characters by default, these always match certain high-valued code points,
whether or not PCRE_UCP is set. The horizontal space characters are:
.sp
U+0009 Horizontal tab (HT)
@@ -1339,10 +1344,12 @@
word "word" characters (same as \ew)
xdigit hexadecimal digits
.sp
-The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
-space (32). "Space" used to be different to \es, which did not include VT, for
-Perl compatibility. However, Perl changed at release 5.18, and PCRE followed at
-release 8.34. "Space" and \es now match the same set of characters.
+The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+and space (32). If locale-specific matching is taking place, there may be
+additional space characters. "Space" used to be different to \es, which did not
+include VT, for Perl compatibility. However, Perl changed at release 5.18, and
+PCRE followed at release 8.34. "Space" and \es now match the same set of
+characters.
.P
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
5.8. Another Perl extension is negation, which is indicated by a ^ character
@@ -1354,11 +1361,11 @@
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
.P
-By default, in UTF modes, characters with values greater than 128 do not match
-any of the POSIX character classes. However, if the PCRE_UCP option is passed
-to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 128 do not match any of the
+POSIX character classes. However, if the PCRE_UCP option is passed to
+\fBpcre_compile()\fP, some of the classes are changed so that Unicode character
+properties are used. This is achieved by replacing certain POSIX classes by
+other sequences, as follows:
.sp
[:alnum:] becomes \ep{Xan}
[:alpha:] becomes \ep{L}
Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3 2013-11-12 17:05:55 UTC (rev 1400)
+++ code/trunk/doc/pcresyntax.3 2013-11-12 17:44:07 UTC (rev 1401)
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "05 November 2013" "PCRE 8.34"
+.TH PCRESYNTAX 3 "12 November 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -61,9 +61,11 @@
\eW a "non-word" character
\eX a Unicode extended grapheme cluster
.sp
-In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
-characters, even in a UTF mode. However, this can be changed by setting the
-PCRE_UCP option.
+By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
+or in the 16- bit and 32-bit libraries. However, if locale-specific matching is
+happening, \es and \ew may also match characters with code points in the range
+128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
+is changed to use Unicode properties and they match many more characters.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
@@ -506,6 +508,6 @@
.rs
.sp
.nf
-Last updated: 05 November 2013
+Last updated: 12 November 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
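
Note on the "C" locale list in the pcrepattern.3 hunk above: the six default
\s characters can be checked against the standard C library, independently of
PCRE. The following is a minimal sketch, not PCRE code. The helper name
collect_space_codes is hypothetical, and it assumes that the locale-specific
behaviour described above ultimately reflects what isspace() reports for the
locale in effect.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PCRE): store in out[] every code in the
 * range 0-255 that isspace() reports as white space in the current locale,
 * in ascending order, and return how many were found.
 */
size_t collect_space_codes(int out[256])
{
    size_t n = 0;
    for (int c = 0; c < 256; c++) {
        if (isspace(c)) {
            out[n++] = c;
        }
    }
    return n;
}
```

Calling setlocale(LC_CTYPE, "C") before collect_space_codes() should yield the
six codes listed in the documentation: 9, 10, 11, 12, 13, and 32. After
switching to another single-byte locale, the list may grow, for example with
0xA0, depending on the platform's locale data.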