Revision: 186
http://www.exim.org/viewvc/pcre2?view=rev&revision=186
Author: ph10
Date: 2015-01-26 14:21:45 +0000 (Mon, 26 Jan 2015)
Log Message:
-----------
Documentation clarifications.
Modified Paths:
--------------
code/trunk/README
code/trunk/doc/html/README.txt
code/trunk/doc/html/pcre2build.html
code/trunk/doc/html/pcre2pattern.html
code/trunk/doc/pcre2.txt
code/trunk/doc/pcre2build.3
code/trunk/doc/pcre2pattern.3
Modified: code/trunk/README
===================================================================
--- code/trunk/README 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/README 2015-01-26 14:21:45 UTC (rev 186)
@@ -179,20 +179,24 @@
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
- library, and UTF-32 Unicode character strings in the 32-bit library, you can
+ library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
When Unicode support is available, the use of a UTF encoding still has to be
- enabled by an option at run time. When PCRE2 is compiled with Unicode
- support, its input can only either be ASCII or UTF-8/16/32, even when running
- on EBCDIC platforms. It is not possible to use both --enable-unicode and
- --enable-ebcdic at the same time.
+ enabled by setting the PCRE2_UTF option at run time or starting a pattern
+ with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+ either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+ not possible to use both --enable-unicode and --enable-ebcdic at the same
+ time.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only the basic two-letter properties such as Lu are supported.
+ Escape sequences such as \d and \w in patterns do not by default make use of
+ Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+ or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015
Modified: code/trunk/doc/html/README.txt
===================================================================
--- code/trunk/doc/html/README.txt 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/doc/html/README.txt 2015-01-26 14:21:45 UTC (rev 186)
@@ -179,20 +179,24 @@
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
- library, and UTF-32 Unicode character strings in the 32-bit library, you can
+ library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
When Unicode support is available, the use of a UTF encoding still has to be
- enabled by an option at run time. When PCRE2 is compiled with Unicode
- support, its input can only either be ASCII or UTF-8/16/32, even when running
- on EBCDIC platforms. It is not possible to use both --enable-unicode and
- --enable-ebcdic at the same time.
+ enabled by setting the PCRE2_UTF option at run time or starting a pattern
+ with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+ either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+ not possible to use both --enable-unicode and --enable-ebcdic at the same
+ time.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only the basic two-letter properties such as Lu are supported.
+ Escape sequences such as \d and \w in patterns do not by default make use of
+ Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+ or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015
Modified: code/trunk/doc/html/pcre2build.html
===================================================================
--- code/trunk/doc/html/pcre2build.html 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/doc/html/pcre2build.html 2015-01-26 14:21:45 UTC (rev 186)
@@ -127,8 +127,10 @@
</P>
<P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call <b>pcre2_compile()</b> to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
</P>
<P>
UTF support allows the libraries to process character code points up to
@@ -139,6 +141,12 @@
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
+<P>
+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
+</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
@@ -471,9 +479,9 @@
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 November 2014
+Last updated: 26 January 2015
<br>
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
Modified: code/trunk/doc/html/pcre2pattern.html
===================================================================
--- code/trunk/doc/html/pcre2pattern.html 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/doc/html/pcre2pattern.html 2015-01-26 14:21:45 UTC (rev 186)
@@ -110,7 +110,7 @@
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
table.
</P>
<P>
@@ -572,8 +572,8 @@
</P>
<P>
By default, characters whose code points are greater than 127 never match \d,
-\s, or \w, and always match \D, \S, and \W, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\s, or \w, and always match \D, \S, and \W, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@
supported, and an error is given if they are encountered.
</P>
<P>
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-<b>pcre2_compile()</b>, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to <b>pcre2_compile()</b>, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
<pre>
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
@@ -1408,12 +1409,12 @@
<P>
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
property.
</P>
<P>
The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
</P>
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
@@ -3248,7 +3249,7 @@
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 02 January 2015
+Last updated: 26 January 2015
<br>
Copyright © 1997-2015 University of Cambridge.
<br>
Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/doc/pcre2.txt 2015-01-26 14:21:45 UTC (rev 186)
@@ -2874,18 +2874,24 @@
another without, in the same configuration.
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
- UTF-16 or UTF-32. To do that, applications that use the library have to
- set the PCRE2_UTF option when they call pcre2_compile() to compile a
- pattern.
+ UTF-16 or UTF-32. To do that, applications that use the library can set
+ the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
+ tern. Alternatively, patterns may be started with (*UTF) unless the
+ application has locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to
- 0x10ffff in the strings that they handle. It also provides support for
- accessing the Unicode properties of such characters, using pattern
- escapes such as \P, \p, and \X. Only the general category properties
- such as Lu and Nd are supported. Details are given in the pcre2pattern
+ 0x10ffff in the strings that they handle. It also provides support for
+ accessing the Unicode properties of such characters, using pattern
+ escapes such as \P, \p, and \X. Only the general category properties
+ such as Lu and Nd are supported. Details are given in the pcre2pattern
documentation.
+ Pattern escapes such as \d and \w do not by default make use of Unicode
+ properties. The application can request that they do by setting the
+ PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
+ pattern may also request this by starting with (*UCP).
+
JUST-IN-TIME COMPILER SUPPORT
Just-in-time compiler support is included in the build by specifying
@@ -3226,8 +3232,8 @@
REVISION
- Last updated: 23 November 2014
- Copyright (c) 1997-2014 University of Cambridge.
+ Last updated: 26 January 2015
+ Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------
Modified: code/trunk/doc/pcre2build.3
===================================================================
--- code/trunk/doc/pcre2build.3 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/doc/pcre2build.3 2015-01-26 14:21:45 UTC (rev 186)
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@@ -113,8 +113,10 @@
in the same configuration.
.P
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call \fBpcre2_compile()\fP to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
.P
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
@@ -125,6 +127,11 @@
\fBpcre2pattern\fP
.\"
documentation.
+.P
+Pattern escapes such as \ed and \ew do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
.
.
.SH "JUST-IN-TIME COMPILER SUPPORT"
@@ -487,6 +494,6 @@
.rs
.sp
.nf
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3 2015-01-23 16:51:47 UTC (rev 185)
+++ code/trunk/doc/pcre2pattern.3 2015-01-26 14:21:45 UTC (rev 186)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PATTERN 3 "26 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -73,7 +73,7 @@
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
table.
.P
Some applications that allow their users to supply patterns may wish to
@@ -575,8 +575,8 @@
Unicode is discouraged.
.P
By default, characters whose code points are greater than 127 never match \ed,
-\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
.P
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-\fBpcre2_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
.sp
[:alnum:] becomes \ep{Xan}
[:alpha:] becomes \ep{L}
@@ -1404,11 +1405,11 @@
.TP 10
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
property.
.P
The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
.
.
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
@@ -3292,6 +3293,6 @@
.rs
.sp
.nf
-Last updated: 02 January 2015
+Last updated: 26 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi