[Pcre-svn] [535] code/trunk: Prepare for release candidate.

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [535] code/trunk: Prepare for release candidate.
Revision: 535
          http://vcs.pcre.org/viewvc?view=rev&revision=535
Author:   ph10
Date:     2010-06-03 20:18:24 +0100 (Thu, 03 Jun 2010)


Log Message:
-----------
Prepare for release candidate.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/NEWS
    code/trunk/NON-UNIX-USE
    code/trunk/configure.ac
    code/trunk/doc/html/index.html
    code/trunk/doc/html/pcre.html
    code/trunk/doc/html/pcre_compile.html
    code/trunk/doc/html/pcre_compile2.html
    code/trunk/doc/html/pcreapi.html
    code/trunk/doc/html/pcrecompat.html
    code/trunk/doc/html/pcregrep.html
    code/trunk/doc/html/pcrepattern.html
    code/trunk/doc/html/pcreperform.html
    code/trunk/doc/html/pcreposix.html
    code/trunk/doc/html/pcresample.html
    code/trunk/doc/html/pcresyntax.html
    code/trunk/doc/html/pcretest.html
    code/trunk/doc/pcre.3
    code/trunk/doc/pcre.txt
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcregrep.1
    code/trunk/doc/pcregrep.txt
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcreperform.3
    code/trunk/doc/pcresyntax.3
    code/trunk/doc/pcretest.1
    code/trunk/doc/pcretest.txt
    code/trunk/maint/README
    code/trunk/pcre_compile.c
    code/trunk/pcre_dfa_exec.c
    code/trunk/pcre_internal.h
    code/trunk/pcre_study.c
    code/trunk/pcre_xclass.c
    code/trunk/pcregrep.c
    code/trunk/pcretest.c


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/ChangeLog    2010-06-03 19:18:24 UTC (rev 535)
@@ -1,7 +1,7 @@
 ChangeLog for PCRE
 ------------------


-Version 8.10 03 May-2010
+Version 8.10 03-Jun-2010
------------------------

1. Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and
@@ -9,70 +9,69 @@

2. (*ACCEPT) was not working when inside an atomic group.

-3.  Inside a character class, \B is treated as a literal by default, but 
-    faulted if PCRE_EXTRA is set. This mimics Perl's behaviour (the -w option 
+3.  Inside a character class, \B is treated as a literal by default, but
+    faulted if PCRE_EXTRA is set. This mimics Perl's behaviour (the -w option
     causes the error). The code is unchanged, but I tidied the documentation.
-    
-4.  Inside a character class, PCRE always treated \R and \X as literals, 
+
+4.  Inside a character class, PCRE always treated \R and \X as literals,
     whereas Perl faults them if its -w option is set. I have changed PCRE so
     that it faults them when PCRE_EXTRA is set.
-    
+
 5.  Added support for \N, which always matches any character other than
     newline. (It is the same as "." when PCRE_DOTALL is not set.)
-    
+
 6.  When compiling pcregrep with newer versions of gcc which may have
-    FORTIFY_SOURCE set, several warnings "ignoring return value of 'fwrite', 
+    FORTIFY_SOURCE set, several warnings "ignoring return value of 'fwrite',
     declared with attribute warn_unused_result" were given. Just casting the
-    result to (void) does not stop the warnings; a more elaborate fudge is 
-    needed. I've used a macro to implement this.  
-    
-7.  Minor change to pcretest.c to avoid a compiler warning. 
+    result to (void) does not stop the warnings; a more elaborate fudge is
+    needed. I've used a macro to implement this.


+7.  Minor change to pcretest.c to avoid a compiler warning.
+
 8.  Added four artifical Unicode properties to help with an option to make
     \s etc use properties (see next item). The new properties are: Xan
     (alphanumeric), Xsp (Perl space), Xps (POSIX space), and Xwd (word).
-    
+
 9.  Added PCRE_UCP to make \b, \d, \s, \w, and certain POSIX character classes
-    use Unicode properties. (*UCP) at the start of a pattern can be used to set 
+    use Unicode properties. (*UCP) at the start of a pattern can be used to set
     this option. Modified pcretest to add /W to test this facility. Added
     REG_UCP to make it available via the POSIX interface.
-    
-10. Added --line-buffered to pcregrep. 


-11. In UTF-8 mode, if a pattern that was compiled with PCRE_CASELESS was 
-    studied, and the match started with a letter with a code point greater than 
+10. Added --line-buffered to pcregrep.
+
+11. In UTF-8 mode, if a pattern that was compiled with PCRE_CASELESS was
+    studied, and the match started with a letter with a code point greater than
     127 whose first byte was different to the first byte of the other case of
     the letter, the other case of this starting letter was not recognized.
-    
+
 12. pcreposix.c included pcre.h before including pcre_internal.h. This caused a
-    conflict in the definition of PCRE_EXP_DECL. I have removed the include of 
+    conflict in the definition of PCRE_EXP_DECL. I have removed the include of
     pcre.h as pcre_internal.h includes pcre.h itself. (This may be a bit of
     historical tidying that never got done.)
-    
+
 13. If a pattern that was studied started with a repeated Unicode property
     test, for example, \p{Nd}+, there was the theoretical possibility of
-    setting up an incorrect bitmap of starting bytes, but fortunately it could 
-    not have actually happened in practice until change 8 above was made (it 
+    setting up an incorrect bitmap of starting bytes, but fortunately it could
+    not have actually happened in practice until change 8 above was made (it
     added property types that matched character-matching opcodes).
-    
-14. pcre_study() now recognizes \h, \v, and \R when constructing a bit map of 
-    possible starting bytes for non-anchored patterns. 
-    
+
+14. pcre_study() now recognizes \h, \v, and \R when constructing a bit map of
+    possible starting bytes for non-anchored patterns.
+
 15. Extended the "auto-possessify" feature of pcre_compile(). It now recognizes
-    \R, and also a number of cases that involve Unicode properties, both 
+    \R, and also a number of cases that involve Unicode properties, both
     explicit and implicit when PCRE_UCP is set.


 16. If a repeated Unicode property match (e.g. \p{Lu}*) was used with non-UTF-8
-    input, it could crash or give wrong results if characters with values 
-    greater than 0xc0 were present in the subject string. (Detail: it assumed 
+    input, it could crash or give wrong results if characters with values
+    greater than 0xc0 were present in the subject string. (Detail: it assumed
     UTF-8 input when processing these items.)
-    
+
 17. Added a lot of (int) casts to avoid compiler warnings in systems where
     size_t is 64-bit.
-    
+
 18. Added a check for running out of memory when PCRE is compiled with
-    --disable-stack-for-recursion. 
-    
+    --disable-stack-for-recursion.



Version 8.02 19-Mar-2010
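
Item 6 in the ChangeLog above says that a (void) cast does not silence the
warn_unused_result warning for fwrite() and that a macro was used instead.
The macro itself is not shown in this part of the diff. A minimal sketch of
one way such a fudge can look, assuming the value is consumed in an empty
`if`; the names CHECKED_FWRITE and write_text are hypothetical and are not
taken from pcregrep.c:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical macro (not copied from pcregrep.c): the result of fwrite()
   is consumed by an `if` with an empty body, which counts as "using" the
   value. The do/while wrapper keeps the macro safe in unbraced if/else. */
#define CHECKED_FWRITE(ptr, size, count, stream) \
  do { if (fwrite((ptr), (size), (count), (stream))) {} } while (0)

/* Hypothetical helper: writes a NUL-terminated string to `out` and returns
   the number of bytes it asked fwrite() to write. */
static size_t write_text(FILE *out, const char *text)
{
  size_t len = strlen(text);
  CHECKED_FWRITE(text, 1, len, out);
  return len;
}
```

A call such as write_text(stdout, "match\n") then compiles without the
unused-result warning. The sketch deliberately ignores write errors, as the
ChangeLog entry implies the original calls did.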

Modified: code/trunk/NEWS
===================================================================
--- code/trunk/NEWS    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/NEWS    2010-06-03 19:18:24 UTC (rev 535)
@@ -1,6 +1,17 @@
 News about PCRE releases
 ------------------------


+Release 8.10 03-Jun-2010
+------------------------
+
+There are two major additions: support for (*MARK) and friends, and the option
+PCRE_UCP, which changes the behaviour of \b, \d, \s, and \w (and their
+opposites) so that they make use of Unicode properties. There are also a number
+of lesser new features, and several bugs have been fixed. A new option,
+--line-buffered, has been added to pcregrep, for use when it is connected to
+pipes.
+
+
Release 8.02 19-Mar-2010
------------------------


Modified: code/trunk/NON-UNIX-USE
===================================================================
--- code/trunk/NON-UNIX-USE    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/NON-UNIX-USE    2010-06-03 19:18:24 UTC (rev 535)
@@ -188,7 +188,7 @@
 LINKING PROGRAMS IN WINDOWS ENVIRONMENTS


If you want to statically link a program against a PCRE library in the form of
-a non-dll .a file, you must define PCRE_STATIC before including pcre.h or
+a non-dll .a file, you must define PCRE_STATIC before including pcre.h or
pcrecpp.h, otherwise the pcre_malloc() and pcre_free() exported functions will
be declared __declspec(dllimport), with unwanted results.


Modified: code/trunk/configure.ac
===================================================================
--- code/trunk/configure.ac    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/configure.ac    2010-06-03 19:18:24 UTC (rev 535)
@@ -11,7 +11,7 @@
 m4_define(pcre_major, [8])
 m4_define(pcre_minor, [10])
 m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2010-05-03])
+m4_define(pcre_date, [2010-06-03])


# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])

Modified: code/trunk/doc/html/index.html
===================================================================
--- code/trunk/doc/html/index.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/index.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -1,10 +1,10 @@
 <html>
-<!-- This is a manually maintained file that is the root of the HTML version of 
-     the PCRE documentation. When the HTML documents are built from the man 
-     page versions, the entire doc/html directory is emptied, this file is then 
-     copied into doc/html/index.html, and the remaining files therein are 
+<!-- This is a manually maintained file that is the root of the HTML version of
+     the PCRE documentation. When the HTML documents are built from the man
+     page versions, the entire doc/html directory is emptied, this file is then
+     copied into doc/html/index.html, and the remaining files therein are
      created by the 132html script.
--->      
+-->
 <head>
 <title>PCRE specification</title>
 </head>
@@ -74,11 +74,11 @@
 </table>


<p>
-There are also individual pages that summarize the interface for each function
+There are also individual pages that summarize the interface for each function
in the library:
</p>

-<table>    
+<table>


 <tr><td><a href="pcre_compile.html">pcre_compile</a></td>
     <td>&nbsp;&nbsp;Compile a regular expression</td></tr>
@@ -129,7 +129,7 @@


 <tr><td><a href="pcre_maketables.html">pcre_maketables</a></td>
     <td>&nbsp;&nbsp;Build character tables in current locale</td></tr>
-    
+
 <tr><td><a href="pcre_refcount.html">pcre_refcount</a></td>
     <td>&nbsp;&nbsp;Maintain reference count in compiled pattern</td></tr>



Modified: code/trunk/doc/html/pcre.html
===================================================================
--- code/trunk/doc/html/pcre.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcre.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -269,7 +269,7 @@
 <a href="pcrepattern.html#genericchartypes">generic character types</a>
 in the
 <a href="pcrepattern.html"><b>pcrepattern</b></a>
-documentation. 
+documentation.
 </P>
 <P>
 7. Similarly, characters that match the POSIX named character classes are all
@@ -277,7 +277,7 @@
 </P>
 <P>
 8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes
-(\h, \H, \v, and \V) do match all the appropriate Unicode characters, 
+(\h, \H, \v, and \V) do match all the appropriate Unicode characters,
 whether or not PCRE_UCP is set.
 </P>
 <P>


Modified: code/trunk/doc/html/pcre_compile.html
===================================================================
--- code/trunk/doc/html/pcre_compile.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcre_compile.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -66,11 +66,12 @@
   PCRE_NO_UTF8_CHECK      Do not check the pattern for UTF-8
                             validity (only relevant if
                             PCRE_UTF8 is set)
+  PCRE_UCP                Use Unicode properties for \d, \w, etc.
   PCRE_UNGREEDY           Invert greediness of quantifiers
   PCRE_UTF8               Run in UTF-8 mode
 </pre>
 PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
-PCRE_NO_UTF8_CHECK.
+PCRE_NO_UTF8_CHECK, and with UCP support if PCRE_UCP is used.
 </P>
 <P>
 The yield of the function is a pointer to a private data structure that


Modified: code/trunk/doc/html/pcre_compile2.html
===================================================================
--- code/trunk/doc/html/pcre_compile2.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcre_compile2.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -70,11 +70,12 @@
   PCRE_NO_UTF8_CHECK      Do not check the pattern for UTF-8
                             validity (only relevant if
                             PCRE_UTF8 is set)
+  PCRE_UCP                Use Unicode properties for \d, \w, etc.
   PCRE_UNGREEDY           Invert greediness of quantifiers
   PCRE_UTF8               Run in UTF-8 mode
 </pre>
 PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
-PCRE_NO_UTF8_CHECK.
+PCRE_NO_UTF8_CHECK, and with UCP support if PCRE_UCP is used.
 </P>
 <P>
 The yield of the function is a pointer to a private data structure that


Modified: code/trunk/doc/html/pcreapi.html
===================================================================
--- code/trunk/doc/html/pcreapi.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcreapi.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -161,6 +161,13 @@
 Applications can use these to include support for different releases of PCRE.
 </P>
 <P>
+In a Windows environment, if you want to statically link an application program
+against a non-dll <b>pcre.a</b> file, you must define PCRE_STATIC before
+including <b>pcre.h</b> or <b>pcrecpp.h</b>, because otherwise the
+<b>pcre_malloc()</b> and <b>pcre_free()</b> exported functions will be declared
+<b>__declspec(dllimport)</b>, with unwanted results.
+</P>
+<P>
 The functions <b>pcre_compile()</b>, <b>pcre_compile2()</b>, <b>pcre_study()</b>,
 and <b>pcre_exec()</b> are used for compiling and matching regular expressions
 in a Perl-compatible manner. A sample program that demonstrates the simplest
@@ -649,6 +656,19 @@
 they acquire numbers in the usual way). There is no equivalent of this option
 in Perl.
 <pre>
+  PCRE_UCP
+</pre>
+This option changes the way PCRE processes \b, \d, \s, \w, and some of the
+POSIX character classes. By default, only ASCII characters are recognized, but
+if PCRE_UCP is set, Unicode properties are used instead to classify characters.
+More details are given in the section on
+<a href="pcre.html#genericchartypes">generic character types</a>
+in the
+<a href="pcrepattern.html"><b>pcrepattern</b></a>
+page. If you set PCRE_UCP, matching one of the items it affects takes much
+longer. The option is available only if PCRE has been compiled with Unicode
+property support.
+<pre>
   PCRE_UNGREEDY
 </pre>
 This option inverts the "greediness" of the quantifiers so that they are not
@@ -757,6 +777,7 @@
   64  ] is an invalid data character in JavaScript compatibility mode
   65  different names for subpatterns of the same number are not allowed
   66  (*MARK) must have an argument
+  67  this version of PCRE is not compiled with PCRE_UCP support
 </pre>
 The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
 be used if the limits were changed when PCRE was built.
@@ -829,11 +850,13 @@
 PCRE handles caseless matching, and determines whether characters are letters,
 digits, or whatever, by reference to a set of tables, indexed by character
 value. When running in UTF-8 mode, this applies only to characters with codes
-less than 128. Higher-valued codes never match escapes such as \w or \d, but
-can be tested with \p if PCRE is built with Unicode character property
-support. The use of locales with Unicode is discouraged. If you are handling
-characters with codes greater than 128, you should either use UTF-8 and
-Unicode, or use locales, but not try to mix the two.
+less than 128. By default, higher-valued codes never match escapes such as \w
+or \d, but they can be tested with \p if PCRE is built with Unicode character
+property support. Alternatively, the PCRE_UCP option can be set at compile
+time; this causes \w and friends to use Unicode property support instead of
+built-in tables. The use of locales with Unicode is discouraged. If you are
+handling characters with codes greater than 128, you should either use UTF-8
+and Unicode, or use locales, but not try to mix the two.
 </P>
 <P>
 PCRE contains an internal set of tables that are used when the final argument
@@ -1623,6 +1646,11 @@
 gets a block of memory at the start of matching to use for this purpose. If the
 call via <b>pcre_malloc()</b> fails, this error is given. The memory is
 automatically freed at the end of matching.
+</P>
+<P>
+This error is also given if <b>pcre_stack_malloc()</b> fails in
+<b>pcre_exec()</b>. This can happen only when PCRE has been compiled with
+<b>--disable-stack-for-recursion</b>.
 <pre>
   PCRE_ERROR_NOSUBSTRING    (-7)
 </pre>
@@ -2087,7 +2115,7 @@
 </P>
 <br><a name="SEC22" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 03 May 2010
+Last updated: 01 June 2010
 <br>
 Copyright &copy; 1997-2010 University of Cambridge.
 <br>
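
The locale-support text changed in pcreapi.html above describes classifying
characters "by reference to a set of tables, indexed by character value",
with codes of 128 and above never matching \w by default in UTF-8 mode. A
rough sketch of that idea, assuming a simple 256-entry table; the names
build_word_table and is_word_byte are hypothetical, and this is not PCRE's
real table layout:

```c
#include <ctype.h>
#include <stddef.h>

#define WORD_TABLE_SIZE 256

/* Hypothetical table builder: entry c is 1 if c is an underscore or is a
   letter or digit according to the current C locale, else 0. */
static void build_word_table(unsigned char table[WORD_TABLE_SIZE])
{
  for (int c = 0; c < WORD_TABLE_SIZE; c++)
    table[c] = (unsigned char)(c == '_' || isalnum(c));
}

/* Hypothetical lookup: in UTF-8 mode, values of 128 and above never count
   as "word" characters; otherwise the table decides. Returns 0 or 1. */
static int is_word_byte(const unsigned char table[WORD_TABLE_SIZE],
                        unsigned int c, int utf8_mode)
{
  if (utf8_mode && c >= 128) return 0;
  return c < WORD_TABLE_SIZE && table[c];
}
```

With PCRE_UCP set, the documentation says this kind of table lookup is
replaced by a Unicode property test, which is why matching becomes slower.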


Modified: code/trunk/doc/html/pcrecompat.html
===================================================================
--- code/trunk/doc/html/pcrecompat.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcrecompat.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -18,7 +18,7 @@
 <P>
 This document describes the differences in the ways that PCRE and Perl handle
 regular expressions. The differences described here are with respect to Perl
-5.10.
+5.10/5.11.
 </P>
 <P>
 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
@@ -102,12 +102,7 @@
 the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
 </P>
 <P>
-11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), (*FAIL), (*F),
-(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
-argument. PCRE does not support (*MARK).
-</P>
-<P>
-12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
+11. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
 names is not as general as Perl's. This is a consequence of the fact the PCRE
 works internally just with numbers, using an external table to translate
 between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b)B),
@@ -118,7 +113,7 @@
 an error is given at compile time.
 </P>
 <P>
-13. PCRE provides some extensions to the Perl regular expression facilities.
+12. PCRE provides some extensions to the Perl regular expression facilities.
 Perl 5.10 includes new features that are not in earlier versions of Perl, some
 of which (such as named parentheses) have been in PCRE for some time. This list
 is with respect to Perl 5.10:
@@ -187,9 +182,9 @@
 REVISION
 </b><br>
 <P>
-Last updated: 04 October 2009
+Last updated: 12 May 2010
 <br>
-Copyright &copy; 1997-2009 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE index page</a>.


Modified: code/trunk/doc/html/pcregrep.html
===================================================================
--- code/trunk/doc/html/pcregrep.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcregrep.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -178,8 +178,8 @@
 connected to a terminal. More resources are used when colouring is enabled,
 because <b>pcregrep</b> has to search for all possible matches in a line, not
 just one, in order to colour them all.
-</P>
-<P>
+<br>
+<br>
 The colour that is used can be specified by setting the environment variable
 PCREGREP_COLOUR or PCREGREP_COLOR. The value of this variable should be a
 string of two numbers, separated by a semicolon. They are copied directly into
@@ -338,6 +338,17 @@
 short form for this option.
 </P>
 <P>
+<b>--line-buffered</b>
+When this option is given, input is read and processed line by line, and the
+output is flushed after each write. By default, input is read in large chunks,
+unless <b>pcregrep</b> can determine that it is reading from a terminal (which
+is currently possible only in Unix environments). Output to terminal is
+normally automatically flushed by the operating system. This option can be
+useful when the input or output is attached to a pipe and you do not want
+<b>pcregrep</b> to buffer up large amounts of data. However, its use will affect
+performance, and the <b>-M</b> (multiline) option ceases to work.
+</P>
+<P>
 <b>--line-offsets</b>
 Instead of showing lines or parts of lines that match, show each match as a
 line number, the offset from the start of the line, and a length. The line
@@ -365,7 +376,8 @@
 <b>pcregrep</b> ensures that at least 8K characters or the rest of the document
 (whichever is the shorter) are available for forward matching, and similarly
 the previous 8K characters (or all the previous characters, if fewer than 8K)
-are guaranteed to be available for lookbehind assertions.
+are guaranteed to be available for lookbehind assertions. This option does not
+work when input is read line by line (see \fP--line-buffered\fP.)
 </P>
 <P>
 <b>-N</b> <i>newline-type</i>, <b>--newline=</b><i>newline-type</i>
@@ -538,9 +550,9 @@
 </P>
 <br><a name="SEC13" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 13 September 2009
+Last updated: 21 May 2010
 <br>
-Copyright &copy; 1997-2009 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE index page</a>.
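
The --line-buffered text added to pcregrep.html above describes reading input
line by line and flushing output after each write. A minimal sketch of that
behaviour in standard C, assuming strstr() as a stand-in for the regular
expression match; copy_matching_lines is a hypothetical name and this is not
pcregrep's own code:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: copy every input line that contains `needle` to
   `out`, flushing after each write so that a reader on the other end of a
   pipe sees the line immediately. Returns the number of pieces written, or
   -1 on a write error. Lines longer than the buffer are handled in pieces,
   so such a line may be counted more than once. */
static int copy_matching_lines(FILE *in, FILE *out, const char *needle)
{
  char line[8192];
  int matched = 0;

  while (fgets(line, sizeof line, in) != NULL)
    {
    if (strstr(line, needle) == NULL) continue;
    if (fputs(line, out) == EOF) return -1;
    if (fflush(out) == EOF) return -1;   /* the "line-buffered" part */
    matched++;
    }
  return matched;
}
```

Because only one line is in memory at a time, a match cannot span lines,
which is consistent with the note that -M ceases to work in this mode.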


Modified: code/trunk/doc/html/pcrepattern.html
===================================================================
--- code/trunk/doc/html/pcrepattern.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcrepattern.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -78,6 +78,17 @@
 page.
 </P>
 <P>
+Another special sequence that may appear at the start of a pattern or in
+combination with (*UTF8) is:
+<pre>
+  (*UCP)
+</pre>
+This has the same effect as setting the PCRE_UCP option: it causes sequences
+such as \d and \w to use Unicode properties to determine character types,
+instead of recognizing only characters with codes less than 128 via a lookup
+table.
+</P>
+<P>
 The remainder of this document discusses the patterns that are supported by
 PCRE when its main matching function, <b>pcre_exec()</b>, is used.
 From release 6.0, PCRE offers a second matching function,
@@ -357,20 +368,17 @@
   \w     any "word" character
   \W     any "non-word" character
 </pre>
-There is also the single sequence \N, which matches a non-newline character. 
-This is the same as 
+There is also the single sequence \N, which matches a non-newline character.
+This is the same as
 <a href="#fullstopdot">the "." metacharacter</a>
 when PCRE_DOTALL is not set.
 </P>
 <P>
 Each pair of lower and upper case escape sequences partitions the complete set
 of characters into two disjoint sets. Any given character matches one, and only
-one, of each pair.
-</P>
-<P>
-These character type sequences can appear both inside and outside character
+one, of each pair. The sequences can appear both inside and outside character
 classes. They each match one character of the appropriate type. If the current
-matching point is at the end of the subject string, all of them fail, since
+matching point is at the end of the subject string, all of them fail, because
 there is no character to match.
 </P>
 <P>
@@ -381,17 +389,41 @@
 does.
 </P>
 <P>
-In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
-\w, and always match \D, \S, and \W. This is true even when Unicode
-character property support is available. These sequences retain their original
-meanings from before UTF-8 support was available, mainly for efficiency
-reasons. Note that this also affects \b, because it is defined in terms of \w
-and \W.
+A "word" character is an underscore or any character that is a letter or digit.
+By default, the definition of letters and digits is controlled by PCRE's
+low-valued character tables, and may vary if locale-specific matching is taking
+place (see
+<a href="pcreapi.html#localesupport">"Locale support"</a>
+in the
+<a href="pcreapi.html"><b>pcreapi</b></a>
+page). For example, in a French locale such as "fr_FR" in Unix-like systems,
+or "french" in Windows, some character codes greater than 128 are used for
+accented letters, and these are then matched by \w. The use of locales with
+Unicode is discouraged.
 </P>
 <P>
+By default, in UTF-8 mode, characters with values greater than 128 never match
+\d, \s, or \w, and always match \D, \S, and \W. These sequences retain
+their original meanings from before UTF-8 support was available, mainly for
+efficiency reasons. However, if PCRE is compiled with Unicode property support,
+and the PCRE_UCP option is set, the behaviour is changed so that Unicode
+properties are used to determine character types, as follows:
+<pre>
+  \d  any character that \p{Nd} matches (decimal digit)
+  \s  any character that \p{Z} matches, plus HT, LF, FF, CR
+  \w  any character that \p{L} or \p{N} matches, plus underscore
+</pre>
+The upper case escapes match the inverse sets of characters. Note that \d
+matches only decimal digits, whereas \w matches any Unicode digit, as well as
+any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and
+\B because they are defined in terms of \w and \W. Matching these sequences
+is noticeably slower when PCRE_UCP is set.
+</P>
+<P>
 The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
-other sequences, these do match certain high-valued codepoints in UTF-8 mode.
-The horizontal space characters are:
+other sequences, which match only ASCII characters by default, these always
+match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is
+set. The horizontal space characters are:
 <pre>
   U+0009     Horizontal tab
   U+0020     Space
@@ -422,21 +454,8 @@
   U+0085     Next line
   U+2028     Line separator
   U+2029     Paragraph separator
-</PRE>
+<a name="newlineseq"></a></PRE>
 </P>
-<P>
-A "word" character is an underscore or any character less than 256 that is a
-letter or digit. The definition of letters and digits is controlled by PCRE's
-low-valued character tables, and may vary if locale-specific matching is taking
-place (see
-<a href="pcreapi.html#localesupport">"Locale support"</a>
-in the
-<a href="pcreapi.html"><b>pcreapi</b></a>
-page). For example, in a French locale such as "fr_FR" in Unix-like systems,
-or "french" in Windows, some character codes greater than 128 are used for
-accented letters, and these are matched by \w. The use of locales with Unicode
-is discouraged.
-<a name="newlineseq"></a></P>
 <br><b>
 Newline sequences
 </b><br>
@@ -479,13 +498,13 @@
 which are not Perl-compatible, are recognized only at the very start of a
 pattern, and that they must be in upper case. If more than one of them is
 present, the last one is used. They can be combined with a change of newline
-convention, for example, a pattern can start with:
+convention; for example, a pattern can start with:
 <pre>
   (*ANY)(*BSR_ANYCRLF)
 </pre>
-Inside a character class, \R is treated as an unrecognized escape sequence, 
-and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is
-set.
+They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside
+a character class, \R is treated as an unrecognized escape sequence, and so
+matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.
 <a name="uniextseq"></a></P>
 <br><b>
 Unicode character properties
@@ -504,7 +523,7 @@
 The property names represented by <i>xx</i> above are limited to the Unicode
 script names, the general category properties, "Any", which matches any
 character (including newline), and some special PCRE properties (described
-in the 
+in the
 <a href="#extraprops">next section).</a>
 Other Perl properties such as "InMusicalSymbols" are not currently supported by
 PCRE. Note that \P{Any} does not match any characters, so always causes a
@@ -720,26 +739,29 @@
 Matching characters by Unicode property is not fast, because PCRE has to search
 a structure that contains data for over fifteen thousand characters. That is
 why the traditional escape sequences such as \d and \w do not use Unicode
-properties in PCRE.
+properties in PCRE by default, though you can make them do so by setting the
+PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with
+(*UCP).
 <a name="extraprops"></a></P>
 <br><b>
 PCRE's additional properties
 </b><br>
 <P>
-As well as the standard Unicode properties described in the previous 
-section, PCRE supports four more that make it possible to convert traditional 
+As well as the standard Unicode properties described in the previous
+section, PCRE supports four more that make it possible to convert traditional
 escape sequences such as \w and \s and POSIX character classes to use Unicode
-properties. These are:
+properties. PCRE uses these non-standard, non-Perl properties internally when
+PCRE_UCP is set. They are:
 <pre>
   Xan   Any alphanumeric character
   Xps   Any POSIX space character
   Xsp   Any Perl space character
   Xwd   Any Perl "word" character
 </pre>
-Xan matches characters that have either the L (letter) or the N (number) 
-property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or 
+Xan matches characters that have either the L (letter) or the N (number)
+property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
 carriage return, and any other character that has the Z (separator) property.
-Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the 
+Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
 same characters as Xan, plus underscore.
 <a name="resetmatchstart"></a></P>
 <br><b>
@@ -790,7 +812,7 @@
   \G     matches at the first matching position in the subject
 </pre>
 Inside a character class, \b has a different meaning; it matches the backspace
-character. If any other of these assertions appears in a character class, by 
+character. If any other of these assertions appears in a character class, by
 default it matches the corresponding literal character (for example, \B
 matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
 escape sequence" error is generated instead.
@@ -799,10 +821,12 @@
 A word boundary is a position in the subject string where the current character
 and the previous character do not both match \w or \W (i.e. one matches
 \w and the other matches \W), or the start or end of the string if the
-first or last character matches \w, respectively. Neither PCRE nor Perl has a
-separte "start of word" or "end of word" metasequence. However, whatever
-follows \b normally determines which it is. For example, the fragment
-\ba matches "a" at the start of a word.
+first or last character matches \w, respectively. In UTF-8 mode, the meanings
+of \w and \W can be changed by setting the PCRE_UCP option. When this is
+done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start
+of word" or "end of word" metasequence. However, whatever follows \b normally
+determines which it is. For example, the fragment \ba matches "a" at the start
+of a word.
 </P>
 <P>
 The \A, \Z, and \z assertions differ from the traditional circumflex and
@@ -916,8 +940,8 @@
 special meaning in a character class.
 </P>
 <P>
-The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not 
-set. In other words, it matches any one character except one that signifies the 
+The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not
+set. In other words, it matches any one character except one that signifies the
 end of a line.
 </P>
 <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
@@ -1016,12 +1040,12 @@
 property support.
 </P>
 <P>
-The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
-in a character class, and add the characters that they match to the class. For
-example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
-conveniently be used with the upper case character types to specify a more
-restricted set of characters than the matching lower case type. For example,
-the class [^\W_] matches any letter or digit, but not underscore.
+The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and
+\W may also appear in a character class, and add the characters that they
+match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A
+circumflex can conveniently be used with the upper case character types to
+specify a more restricted set of characters than the matching lower case type.
+For example, the class [^\W_] matches any letter or digit, but not underscore.
 </P>
 <P>
 The only metacharacters that are recognized in character classes are backslash,
@@ -1040,7 +1064,7 @@
   [01[:alpha:]%]
 </pre>
 matches "0", "1", any alphabetic character, or "%". The supported class names
-are
+are:
 <pre>
   alnum    letters and digits
   alpha    letters
@@ -1051,7 +1075,7 @@
   graph    printing characters, excluding space
   lower    lower case letters
   print    printing characters, including space
-  punct    printing characters, excluding letters and digits
+  punct    printing characters, excluding letters, digits, and space
   space    white space (not quite the same as \s)
   upper    upper case letters
   word     "word" characters (same as \w)
@@ -1074,8 +1098,24 @@
 supported, and an error is given if they are encountered.
 </P>
 <P>
-In UTF-8 mode, characters with values greater than 128 do not match any of
-the POSIX character classes.
+By default, in UTF-8 mode, characters with values greater than 127 do not match
+any of the POSIX character classes. However, if the PCRE_UCP option is passed
+to <b>pcre_compile()</b>, some of the classes are changed so that Unicode
+character properties are used. This is achieved by replacing the POSIX classes
+by other sequences, as follows:
+<pre>
+  [:alnum:]  becomes  \p{Xan}
+  [:alpha:]  becomes  \p{L}
+  [:blank:]  becomes  \h
+  [:digit:]  becomes  \p{Nd}
+  [:lower:]  becomes  \p{Ll}
+  [:space:]  becomes  \p{Xps}
+  [:upper:]  becomes  \p{Lu}
+  [:word:]   becomes  \p{Xwd}
+</pre>
+Negated versions, such as [:^alpha:], use \P instead of \p. The other POSIX
+classes are unchanged, and match only characters with code points less than
+128.
 </P>
 <br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br>
 <P>
@@ -1148,8 +1188,9 @@
 the application has set or what has been defaulted. Details are given in the
 section entitled
 <a href="#newlineseq">"Newline sequences"</a>
-above. There is also the (*UTF8) leading sequence that can be used to set UTF-8
-mode; this is equivalent to setting the PCRE_UTF8 option.
+above. There are also the (*UTF8) and (*UCP) leading sequences that can be used
+to set UTF-8 and Unicode property modes; they are equivalent to setting the
+PCRE_UTF8 and the PCRE_UCP options, respectively.
 <a name="subpattern"></a></P>
 <br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>
 <P>
@@ -2589,7 +2630,7 @@
 </P>
 <br><a name="SEC28" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 05 May 2010
+Last updated: 18 May 2010
 <br>
 Copyright &copy; 1997-2010 University of Cambridge.
 <br>
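
The POSIX class substitutions added to pcrepattern.html above can be pictured
as a small lookup table. The sketch below is a hypothetical illustration in C,
not PCRE's internal code: the struct ucp_class_map and the function
ucp_replacement() are names invented here, and only the eight class names and
their replacement sequences are taken from the documentation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table mirroring the substitutions listed in the pcrepattern
   documentation. It is an illustration, not PCRE's own data structure. */
struct ucp_class_map {
    const char *posix_name;   /* the name written between [: and :] */
    const char *replacement;  /* sequence used when PCRE_UCP is set */
};

static const struct ucp_class_map ucp_map[] = {
    { "alnum", "\\p{Xan}" },
    { "alpha", "\\p{L}"   },
    { "blank", "\\h"      },
    { "digit", "\\p{Nd}"  },
    { "lower", "\\p{Ll}"  },
    { "space", "\\p{Xps}" },
    { "upper", "\\p{Lu}"  },
    { "word",  "\\p{Xwd}" },
};

/* Return the replacement sequence for a POSIX class name, or NULL when the
   class is one of those the documentation describes as unchanged. */
static const char *ucp_replacement(const char *posix_name)
{
    size_t i;
    for (i = 0; i < sizeof ucp_map / sizeof ucp_map[0]; i++) {
        if (strcmp(ucp_map[i].posix_name, posix_name) == 0)
            return ucp_map[i].replacement;
    }
    return NULL;
}
```

Under these assumptions, ucp_replacement("digit") yields the string \p{Nd},
while a class such as punct, which is documented as unchanged, yields NULL.
A negated class such as [:^alpha:] would use \P in place of \p.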


Modified: code/trunk/doc/html/pcreperform.html
===================================================================
--- code/trunk/doc/html/pcreperform.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcreperform.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -105,6 +105,15 @@
 be faster.
 </P>
 <P>
+By default, the escape sequences \b, \d, \s, and \w, and the POSIX
+character classes such as [:alpha:] do not use Unicode properties, partly for
+backwards compatibility, and partly for performance reasons. However, you can
+set PCRE_UCP if you want Unicode character properties to be used. This can
+double the matching time for items such as \d, when matched with
+<b>pcre_exec()</b>; the performance loss is less with <b>pcre_dfa_exec()</b>, and
+in both cases there is not much difference for \b.
+</P>
+<P>
 When a pattern begins with .* not in parentheses, or in parentheses that are
 not the subject of a backreference, and the PCRE_DOTALL option is set, the
 pattern is implicitly anchored by PCRE, since it can match only at the start of
@@ -177,7 +186,7 @@
 REVISION
 </b><br>
 <P>
-Last updated: 07 March 2010
+Last updated: 16 May 2010
 <br>
 Copyright &copy; 1997-2010 University of Cambridge.
 <br>


Modified: code/trunk/doc/html/pcreposix.html
===================================================================
--- code/trunk/doc/html/pcreposix.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcreposix.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -87,8 +87,6 @@
 constants whose names start with "REG_"; these are used for setting options and
 identifying error codes.
 </P>
-<P>
-</P>
 <br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
 <P>
 The function <b>regcomp()</b> is called to compile a pattern into an
@@ -126,6 +124,13 @@
 <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
 are returned.
 <pre>
+  REG_UCP
+</pre>
+The PCRE_UCP option is set when the regular expression is passed for
+compilation to the native function. This causes PCRE to use Unicode properties
+when matching \d, \w, etc., instead of just recognizing ASCII values. Note
+that REG_UCP is not part of the POSIX standard.
+<pre>
   REG_UNGREEDY
 </pre>
 The PCRE_UNGREEDY option is set when the regular expression is passed for
@@ -277,9 +282,9 @@
 </P>
 <br><a name="SEC9" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 September 2009
+Last updated: 16 May 2010
 <br>
-Copyright &copy; 1997-2009 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE index page</a>.


Modified: code/trunk/doc/html/pcresample.html
===================================================================
--- code/trunk/doc/html/pcresample.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcresample.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -50,8 +50,15 @@
 <pre>
   gcc -o pcredemo -I/usr/local/include pcredemo.c -L/usr/local/lib -lpcre
 </pre>
-Once you have compiled the demonstration program, you can run simple tests like
-this:
+In a Windows environment, if you want to statically link the program against a
+non-dll <b>pcre.a</b> file, you must uncomment the line that defines PCRE_STATIC
+before including <b>pcre.h</b>, because otherwise the <b>pcre_malloc()</b> and
+<b>pcre_free()</b> exported functions will be declared
+<b>__declspec(dllimport)</b>, with unwanted results.
+</P>
+<P>
+Once you have compiled and linked the demonstration program, you can run simple
+tests like this:
 <pre>
   ./pcredemo 'cat|dog' 'the cat sat on the mat'
   ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
@@ -93,9 +100,9 @@
 REVISION
 </b><br>
 <P>
-Last updated: 30 September 2009
+Last updated: 26 May 2010
 <br>
-Copyright &copy; 1997-2009 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE index page</a>.


Modified: code/trunk/doc/html/pcresyntax.html
===================================================================
--- code/trunk/doc/html/pcresyntax.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcresyntax.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -81,7 +81,7 @@
   \D         a character that is not a decimal digit
   \h         a horizontal whitespace character
   \H         a character that is not a horizontal whitespace character
-  \N         a character that is not a newline 
+  \N         a character that is not a newline
   \p{<i>xx</i>}     a character with the <i>xx</i> property
   \P{<i>xx</i>}     a character without the <i>xx</i> property
   \R         a newline sequence
@@ -93,7 +93,9 @@
   \W         a "non-word" character
   \X         an extended Unicode sequence
 </pre>
-In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
+In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
+characters, even in UTF-8 mode. However, this can be changed by setting the
+PCRE_UCP option.
 </P>
 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
 <P>
@@ -150,7 +152,7 @@
   Xan        Alphanumeric: union of properties L and N
   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
   Xsp        Perl space: property Z or tab, NL, FF, CR
-  Xwd        Perl word: property Xan or underscore 
+  Xwd        Perl word: property Xan or underscore
 </PRE>
 </P>
 <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
@@ -272,7 +274,8 @@
   word        same as \w
   xdigit      hexadecimal digit
 </pre>
-In PCRE, POSIX character set names recognize only ASCII characters. You can use
+In PCRE, POSIX character set names recognize only ASCII characters by default,
+but some of them use Unicode properties if PCRE_UCP is set. You can use
 \Q...\E inside a character class.
 </P>
 <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
@@ -299,7 +302,7 @@
 <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
 <P>
 <pre>
-  \b          word boundary (only ASCII letters recognized)
+  \b          word boundary
   \B          not a word boundary
   ^           start of subject
                also after internal newline in multiline mode
@@ -360,10 +363,11 @@
   (?x)            extended (ignore white space)
   (?-...)         unset option(s)
 </pre>
-The following is recognized only at the start of a pattern or after one of the
+The following are recognized only at the start of a pattern or after one of the
 newline-setting options with similar syntax:
 <pre>
-  (*UTF8)         set UTF-8 mode
+  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
+  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
 </PRE>
 </P>
 <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
@@ -449,7 +453,7 @@
 <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
 <P>
 These are recognized only at the very start of the pattern or after a
-(*BSR_...) or (*UTF8) option.
+(*BSR_...), (*UTF8), or (*UCP) option.
 <pre>
   (*CR)           carriage return only
   (*LF)           linefeed only
@@ -461,7 +465,7 @@
 <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
 <P>
 These are recognized only at the very start of the pattern or after a
-(*...) option that sets the newline convention or UTF-8 mode.
+(*...) option that sets the newline convention, UTF-8 mode, or UCP mode.
 <pre>
   (*BSR_ANYCRLF)  CR, LF, or CRLF
   (*BSR_UNICODE)  any Unicode newline sequence
@@ -490,7 +494,7 @@
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 05 May 2010
+Last updated: 12 May 2010
 <br>
 Copyright &copy; 1997-2010 University of Cambridge.
 <br>
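
The \b entry in pcresyntax.html loses its "only ASCII letters recognized"
qualifier because PCRE_UCP can now change what \w matches. As a rough
illustration of the default rule only, the sketch below tests for a word
boundary using an ASCII-only stand-in for \w. The function names
is_word_char() and is_word_boundary() are invented for this example and are
not part of the PCRE API; with PCRE_UCP set, the \w test would instead
consult Unicode properties.

```c
#include <stddef.h>

/* ASCII-only stand-in for the default meaning of \w:
   a letter, a digit, or an underscore. */
static int is_word_char(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '_';
}

/* Return nonzero if offset pos (0 <= pos <= len) is a word boundary:
   exactly one of the previous and current characters is a word character.
   Before the start and after the end there is no character, so a boundary
   exists there only if the first or last character is a word character. */
static int is_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int cur  = pos < len && is_word_char((unsigned char)subject[pos]);
    return prev != cur;
}
```

For the subject "a cat", offsets 0, 1, 2, and 5 are boundaries under this
rule, and offsets 3 and 4 (inside "cat") are not.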


Modified: code/trunk/doc/html/pcretest.html
===================================================================
--- code/trunk/doc/html/pcretest.html    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/html/pcretest.html    2010-06-03 19:18:24 UTC (rev 535)
@@ -201,11 +201,11 @@
 <pre>
   /caseless/i
 </pre>
-The following table shows additional modifiers for setting PCRE options that do
-not correspond to anything in Perl:
+The following table shows additional modifiers for setting PCRE compile-time
+options that do not correspond to anything in Perl:
 <pre>
   <b>/8</b>              PCRE_UTF8
-  <b>/?</b>              PCRE_NO_UTF8_CHECK 
+  <b>/?</b>              PCRE_NO_UTF8_CHECK
   <b>/A</b>              PCRE_ANCHORED
   <b>/C</b>              PCRE_AUTO_CALLOUT
   <b>/E</b>              PCRE_DOLLAR_ENDONLY
@@ -213,7 +213,7 @@
   <b>/J</b>              PCRE_DUPNAMES
   <b>/N</b>              PCRE_NO_AUTO_CAPTURE
   <b>/U</b>              PCRE_UNGREEDY
-  <b>/W</b>              PCRE_UCP 
+  <b>/W</b>              PCRE_UCP
   <b>/X</b>              PCRE_EXTRA
   <b>/&#60;JS&#62;</b>           PCRE_JAVASCRIPT_COMPAT
   <b>/&#60;cr&#62;</b>           PCRE_NEWLINE_CR
@@ -235,7 +235,7 @@
 \x{hh...} notation if they are valid UTF-8 sequences. Full details of the PCRE
 options are given in the
 <a href="pcreapi.html"><b>pcreapi</b></a>
-documentation. 
+documentation.
 </P>
 <br><b>
 Finding all matches in a string
@@ -326,17 +326,29 @@
 pattern to be output.
 </P>
 <P>
-The <b>/P</b> modifier causes <b>pcretest</b> to call PCRE via the POSIX wrapper
-API rather than its native API. When this is done, all other modifiers except
-<b>/i</b>, <b>/m</b>, and <b>/+</b> are ignored. REG_ICASE is set if <b>/i</b> is
-present, and REG_NEWLINE is set if <b>/m</b> is present. The wrapper functions
-force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
-</P>
-<P>
 The <b>/S</b> modifier causes <b>pcre_study()</b> to be called after the
 expression has been compiled, and the results used when the expression is
 matched.
 </P>
+<br><b>
+Using the POSIX wrapper API
+</b><br>
+<P>
+The <b>/P</b> modifier causes <b>pcretest</b> to call PCRE via the POSIX wrapper
+API rather than its native API. When <b>/P</b> is set, the following modifiers
+set options for the <b>regcomp()</b> function:
+<pre>
+  /i    REG_ICASE
+  /m    REG_NEWLINE
+  /N    REG_NOSUB
+  /s    REG_DOTALL     )
+  /U    REG_UNGREEDY   ) These options are not part of
+  /W    REG_UCP        )   the POSIX standard
+  /8    REG_UTF8       )
+</pre>
+The <b>/+</b> modifier works as described above. All other modifiers are
+ignored.
+</P>
 <br><a name="SEC5" href="#TOC1">DATA LINES</a><br>
 <P>
 Before each data line is passed to <b>pcre_exec()</b>, leading and trailing
@@ -423,9 +435,9 @@
 </P>
 <P>
 If the <b>/P</b> modifier was present on the pattern, causing the POSIX wrapper
-API to be used, the only option-setting sequences that have any effect are \B
-and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively, to be passed to
-<b>regexec()</b>.
+API to be used, the only option-setting sequences that have any effect are \B,
+\N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively,
+to be passed to <b>regexec()</b>.
 </P>
 <P>
 The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
@@ -714,7 +726,7 @@
 </P>
 <br><a name="SEC15" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 12 May 2010
+Last updated: 16 May 2010
 <br>
 Copyright &copy; 1997-2010 University of Cambridge.
 <br>
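
The new "Using the POSIX wrapper API" section in pcretest.html lists which
modifiers are honoured when /P is set. A minimal sketch of that table in C
follows. The function posix_option_for_modifier() is a hypothetical helper
written for this note, not code from pcretest.c, and it returns option names
as strings so that it does not depend on pcreposix.h.

```c
#include <stddef.h>

/* Hypothetical helper: given a single-character pcretest modifier, return
   the name of the regcomp() option it sets when /P is in effect, following
   the table in the pcretest documentation. NULL means the modifier sets no
   regcomp() option. */
static const char *posix_option_for_modifier(char modifier)
{
    switch (modifier) {
    case 'i': return "REG_ICASE";
    case 'm': return "REG_NEWLINE";
    case 'N': return "REG_NOSUB";
    /* The remaining four are not part of the POSIX standard. */
    case 's': return "REG_DOTALL";
    case 'U': return "REG_UNGREEDY";
    case 'W': return "REG_UCP";
    case '8': return "REG_UTF8";
    default:  return NULL;
    }
}
```

With this mapping, the modifiers on a pattern such as /abc/PiW would select
REG_ICASE and REG_UCP, and a modifier such as x would yield NULL because it
is ignored under /P.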


Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcre.3    2010-06-03 19:18:24 UTC (rev 535)
@@ -268,13 +268,13 @@
 .\" HREF
 \fBpcrepattern\fP
 .\"
-documentation. 
+documentation.
 .P
 7. Similarly, characters that match the POSIX named character classes are all
 low-valued characters, unless the PCRE_UCP option is set.
 .P
 8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes
-(\eh, \eH, \ev, and \eV) do match all the appropriate Unicode characters, 
+(\eh, \eH, \ev, and \eV) do match all the appropriate Unicode characters,
 whether or not PCRE_UCP is set.
 .P
 9. Case-insensitive matching applies only to characters whose values are less


Modified: code/trunk/doc/pcre.txt
===================================================================
--- code/trunk/doc/pcre.txt    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcre.txt    2010-06-03 19:18:24 UTC (rev 535)
@@ -2,7 +2,7 @@
 This file contains a concatenation of the PCRE man pages, converted to plain
 text format for ease of searching with a text editor, or for use on systems
 that do not have a man page processor. The small individual files that give
-synopses of each function in the library have not been included. Neither has 
+synopses of each function in the library have not been included. Neither has
 the pcredemo program. There are separate text files for the pcregrep and
 pcretest commands.
 -----------------------------------------------------------------------------
@@ -270,8 +270,8 @@
        Last updated: 12 May 2010
        Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREBUILD(3)                                                      PCREBUILD(3)



@@ -601,8 +601,8 @@
        Last updated: 29 September 2009
        Copyright (c) 1997-2009 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREMATCHING(3)                                                PCREMATCHING(3)



@@ -801,8 +801,8 @@
        Last updated: 29 September 2009
        Copyright (c) 1997-2009 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREAPI(3)                                                          PCREAPI(3)



@@ -906,6 +906,12 @@
        bers for the library.  Applications can use these  to  include  support
        for different releases of PCRE.


+       In a Windows environment, if you want to statically link an application
+       program against a non-dll pcre.a  file,  you  must  define  PCRE_STATIC
+       before  including  pcre.h or pcrecpp.h, because otherwise the pcre_mal-
+       loc()   and   pcre_free()   exported   functions   will   be   declared
+       __declspec(dllimport), with unwanted results.
+
        The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
        pcre_exec() are used for compiling and matching regular expressions  in
        a  Perl-compatible  manner. A sample program that demonstrates the sim-
@@ -1376,6 +1382,16 @@
        be  used  for  capturing  (and  they acquire numbers in the usual way).
        There is no equivalent of this option in Perl.


+         PCRE_UCP
+
+       This option changes the way PCRE processes \b, \d, \s, \w, and some  of
+       the POSIX character classes. By default, only ASCII characters are rec-
+       ognized, but if PCRE_UCP is set, Unicode properties are used instead to
+       classify  characters.  More details are given in the section on generic
+       character types in the pcrepattern page. If you set PCRE_UCP,  matching
+       one  of the items it affects takes much longer. The option is available
+       only if PCRE has been compiled with Unicode property support.
+
          PCRE_UNGREEDY


        This option inverts the "greediness" of the quantifiers  so  that  they
@@ -1483,6 +1499,7 @@
          65   different  names  for  subpatterns  of  the  same number are not
        allowed
          66  (*MARK) must have an argument
+         67  this version of PCRE is not compiled with PCRE_UCP support


        The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
        values may be used if the limits were changed when PCRE was built.
@@ -1548,12 +1565,14 @@
        PCRE handles caseless matching, and determines whether  characters  are
        letters,  digits, or whatever, by reference to a set of tables, indexed
        by character value. When running in UTF-8 mode, this  applies  only  to
-       characters  with  codes  less than 128. Higher-valued codes never match
-       escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
-       with  Unicode  character property support. The use of locales with Uni-
-       code is discouraged. If you are handling characters with codes  greater
-       than  128, you should either use UTF-8 and Unicode, or use locales, but
-       not try to mix the two.
+       characters  with  codes  less than 128. By default, higher-valued codes
+       never match escapes such as \w or \d, but they can be tested with \p if
+       PCRE  is  built with Unicode character property support. Alternatively,
+       the PCRE_UCP option can be set at compile  time;  this  causes  \w  and
+       friends to use Unicode property support instead of built-in tables. The
+       use of locales with Unicode is discouraged. If you are handling charac-
+       ters  with codes greater than 128, you should either use UTF-8 and Uni-
+       code, or use locales, but not try to mix the two.


        PCRE contains an internal set of tables that are used  when  the  final
        argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
@@ -2295,6 +2314,10 @@
        purpose.  If the call via pcre_malloc() fails, this error is given. The
        memory is automatically freed at the end of matching.


+       This error is also given if pcre_stack_malloc() fails  in  pcre_exec().
+       This  can happen only when PCRE has been compiled with --disable-stack-
+       for-recursion.
+
          PCRE_ERROR_NOSUBSTRING    (-7)


        This error is used by the pcre_copy_substring(),  pcre_get_substring(),
@@ -2730,11 +2753,11 @@


REVISION

-       Last updated: 03 May 2010
+       Last updated: 01 June 2010
        Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCRECALLOUT(3)                                                  PCRECALLOUT(3)



@@ -2914,8 +2937,8 @@
        Last updated: 29 September 2009
        Copyright (c) 1997-2009 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCRECOMPAT(3)                                                    PCRECOMPAT(3)



@@ -2927,7 +2950,7 @@

        This  document describes the differences in the ways that PCRE and Perl
        handle regular expressions. The differences  described  here  are  with
-       respect to Perl 5.10.
+       respect to Perl 5.10/5.11.


        1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
        of what it does have are given in the section on UTF-8 support  in  the
@@ -2998,11 +3021,7 @@
        matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
        unset, but in PCRE it is set to "b".


-       11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
-       (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but  only  in
-       the forms without an argument. PCRE does not support (*MARK).
-
-       12.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
+       11.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
        pattern names is not as general as Perl's. This is a consequence of the
        fact the PCRE works internally just with numbers, using an external ta-
        ble to translate between numbers and names. In  particular,  a  pattern
@@ -3013,7 +3032,7 @@
        turing subpattern number 1. To avoid this confusing situation, an error
        is given at compile time.


-       13. PCRE provides some extensions to the Perl regular expression facil-
+       12. PCRE provides some extensions to the Perl regular expression facil-
        ities.   Perl  5.10  includes new features that are not in earlier ver-
        sions of Perl, some of which (such as named parentheses) have  been  in
        PCRE for some time. This list is with respect to Perl 5.10:
@@ -3068,11 +3087,11 @@


REVISION

-       Last updated: 04 October 2009
-       Copyright (c) 1997-2009 University of Cambridge.
+       Last updated: 12 May 2010
+       Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREPATTERN(3)                                                  PCREPATTERN(3)



@@ -3111,6 +3130,16 @@
        below.  There  is  also  a  summary of UTF-8 features in the section on
        UTF-8 support in the main pcre page.


+       Another special sequence that may appear at the start of a  pattern  or
+       in combination with (*UTF8) is:
+
+         (*UCP)
+
+       This  has  the  same  effect  as setting the PCRE_UCP option: it causes
+       sequences such as \d and \w to  use  Unicode  properties  to  determine
+       character types, instead of recognizing only characters with codes less
+       than 128 via a lookup table.
+
        The remainder of this document discusses the  patterns  that  are  sup-
        ported  by  PCRE when its main matching function, pcre_exec(), is used.
        From  release  6.0,   PCRE   offers   a   second   matching   function,
@@ -3382,29 +3411,49 @@


        Each  pair of lower and upper case escape sequences partitions the com-
        plete set of characters into two disjoint  sets.  Any  given  character
-       matches one, and only one, of each pair.
+       matches  one, and only one, of each pair. The sequences can appear both
+       inside and outside character classes. They each match one character  of
+       the  appropriate  type.  If the current matching point is at the end of
+       the subject string, all of them fail, because there is no character  to
+       match.


-       These character type sequences can appear both inside and outside char-
-       acter classes. They each match one character of the  appropriate  type.
-       If  the current matching point is at the end of the subject string, all
-       of them fail, since there is no character to match.
-
-       For compatibility with Perl, \s does not match the VT  character  (code
-       11).   This makes it different from the the POSIX "space" class. The \s
-       characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
+       For  compatibility  with Perl, \s does not match the VT character (code
+       11).  This  makes  it  different  from the POSIX "space" class. The  \s
+       characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
        "use locale;" is included in a Perl script, \s may match the VT charac-
        ter. In PCRE, it never does.


-       In UTF-8 mode, characters with values greater than 128 never match  \d,
-       \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
-       code character property support is available.  These  sequences  retain
-       their original meanings from before UTF-8 support was available, mainly
-       for efficiency reasons. Note that this also affects \b, because  it  is
-       defined in terms of \w and \W.
+       A  "word"  character is an underscore or any character that is a letter
+       or digit.  By default, the definition of letters  and  digits  is  con-
+       trolled  by PCRE's low-valued character tables, and may vary if locale-
+       specific matching is taking place (see "Locale support" in the  pcreapi
+       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
+       systems, or "french" in Windows, some character codes greater than  128
+       are  used  for  accented letters, and these are then matched by \w. The
+       use of locales with Unicode is discouraged.


+       By default, in UTF-8 mode, characters  with  values  greater  than  127
+       never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
+       sequences retain their original meanings from before UTF-8 support  was
+       available,  mainly for efficiency reasons. However, if PCRE is compiled
+       with Unicode property support, and the PCRE_UCP option is set, the  be-
+       haviour  is  changed  so  that Unicode properties are used to determine
+       character types, as follows:
+
+         \d  any character that \p{Nd} matches (decimal digit)
+         \s  any character that \p{Z} matches, plus HT, LF, FF, CR
+         \w  any character that \p{L} or \p{N} matches, plus underscore
+
+       The upper case escapes match the inverse sets of characters. Note  that
+       \d  matches  only decimal digits, whereas \w matches any Unicode digit,
+       as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
+       affects  \b  and  \B,  because  they are defined in terms of \w and \W.
+       Matching these sequences is noticeably slower when PCRE_UCP is set.
+
        The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
-       the other sequences, these do match certain high-valued  codepoints  in
-       UTF-8 mode.  The horizontal space characters are:
+       the  other  sequences,  which  match  only ASCII characters by default,
+       these always  match  certain  high-valued  codepoints  in  UTF-8  mode,
+       whether or not PCRE_UCP is set. The horizontal space characters are:


          U+0009     Horizontal tab
          U+0020     Space
@@ -3436,57 +3485,49 @@
          U+2028     Line separator
          U+2029     Paragraph separator


-       A "word" character is an underscore or any character less than 256 that
-       is a letter or digit. The definition of  letters  and  digits  is  con-
-       trolled  by PCRE's low-valued character tables, and may vary if locale-
-       specific matching is taking place (see "Locale support" in the  pcreapi
-       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
-       systems, or "french" in Windows, some character codes greater than  128
-       are  used for accented letters, and these are matched by \w. The use of
-       locales with Unicode is discouraged.
-
    Newline sequences


-       Outside a character class, by default, the escape sequence  \R  matches
+       Outside  a  character class, by default, the escape sequence \R matches
        any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
        mode \R is equivalent to the following:


          (?>\r\n|\n|\x0b|\f|\r|\x85)


-       This is an example of an "atomic group", details  of  which  are  given
+       This  is  an  example  of an "atomic group", details of which are given
        below.  This particular group matches either the two-character sequence
-       CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
+       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
        U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
        return, U+000D), or NEL (next line, U+0085). The two-character sequence
        is treated as a single unit that cannot be split.


-       In  UTF-8  mode, two additional characters whose codepoints are greater
+       In UTF-8 mode, two additional characters whose codepoints  are  greater
        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
-       rator,  U+2029).   Unicode character property support is not needed for
+       rator, U+2029).  Unicode character property support is not  needed  for
        these characters to be recognized.


        It is possible to restrict \R to match only CR, LF, or CRLF (instead of
-       the  complete  set  of  Unicode  line  endings)  by  setting the option
+       the complete set  of  Unicode  line  endings)  by  setting  the  option
        PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
        (BSR is an abbrevation for "backslash R".) This can be made the default
-       when PCRE is built; if this is the case, the  other  behaviour  can  be
-       requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to
-       specify these settings by starting a pattern string  with  one  of  the
+       when  PCRE  is  built;  if this is the case, the other behaviour can be
+       requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
+       specify  these  settings  by  starting a pattern string with one of the
        following sequences:


          (*BSR_ANYCRLF)   CR, LF, or CRLF only
          (*BSR_UNICODE)   any Unicode newline sequence


-       These  override  the default and the options given to pcre_compile() or
-       pcre_compile2(), but  they  can  be  overridden  by  options  given  to
+       These override the default and the options given to  pcre_compile()  or
+       pcre_compile2(),  but  they  can  be  overridden  by  options  given to
        pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
-       are not Perl-compatible, are recognized only at the  very  start  of  a
-       pattern,  and that they must be in upper case. If more than one of them
+       are  not  Perl-compatible,  are  recognized only at the very start of a
+       pattern, and that they must be in upper case. If more than one of  them
        is present, the last one is used. They can be combined with a change of
-       newline convention, for example, a pattern can start with:
+       newline convention; for example, a pattern can start with:


          (*ANY)(*BSR_ANYCRLF)


+       They can also be combined with the (*UTF8) or (*UCP) special sequences.
        Inside  a  character  class,  \R  is  treated as an unrecognized escape
        sequence, and so matches the letter "R" by default, but causes an error
        if PCRE_EXTRA is set.
@@ -3631,55 +3672,58 @@
        Matching characters by Unicode property is not fast, because  PCRE  has
        to  search  a  structure  that  contains data for over fifteen thousand
        characters. That is why the traditional escape sequences such as \d and
-       \w do not use Unicode properties in PCRE.
+       \w  do  not  use  Unicode properties in PCRE by default, though you can
+       make them do so by setting the PCRE_UCP option for pcre_compile() or by
+       starting the pattern with (*UCP).


    PCRE's additional properties


        As  well  as  the standard Unicode properties described in the previous
        section, PCRE supports four more that make it possible to convert  tra-
        ditional escape sequences such as \w and \s and POSIX character classes
-       to use Unicode properties. These are:
+       to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
+       erties internally when PCRE_UCP is set. They are:


          Xan   Any alphanumeric character
          Xps   Any POSIX space character
          Xsp   Any Perl space character
          Xwd   Any Perl "word" character


-       Xan matches characters that have either the L (letter) or the  N  (num-
-       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
-       formfeed, or carriage return, and any other character that  has  the  Z
+       Xan  matches  characters that have either the L (letter) or the N (num-
+       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
+       formfeed,  or  carriage  return, and any other character that has the Z
        (separator) property.  Xsp is the same as Xps, except that vertical tab
        is excluded. Xwd matches the same characters as Xan, plus underscore.


    Resetting the match start


        The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
-       ously  matched  characters  not  to  be  included  in the final matched
+       ously matched characters not  to  be  included  in  the  final  matched
        sequence. For example, the pattern:


          foo\Kbar


-       matches "foobar", but reports that it has matched "bar".  This  feature
-       is  similar  to  a lookbehind assertion (described below).  However, in
-       this case, the part of the subject before the real match does not  have
-       to  be of fixed length, as lookbehind assertions do. The use of \K does
-       not interfere with the setting of captured  substrings.   For  example,
+       matches  "foobar",  but reports that it has matched "bar". This feature
+       is similar to a lookbehind assertion (described  below).   However,  in
+       this  case, the part of the subject before the real match does not have
+       to be of fixed length, as lookbehind assertions do. The use of \K  does
+       not  interfere  with  the setting of captured substrings.  For example,
        when the pattern


          (foo)\Kbar


        matches "foobar", the first substring is still set to "foo".


-       Perl  documents  that  the  use  of  \K  within assertions is "not well
-       defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
+       Perl documents that the use  of  \K  within  assertions  is  "not  well
+       defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
        assertions, but is ignored in negative assertions.


    Simple assertions


-       The  final use of backslash is for certain simple assertions. An asser-
-       tion specifies a condition that has to be met at a particular point  in
-       a  match, without consuming any characters from the subject string. The
-       use of subpatterns for more complicated assertions is described  below.
+       The final use of backslash is for certain simple assertions. An  asser-
+       tion  specifies a condition that has to be met at a particular point in
+       a match, without consuming any characters from the subject string.  The
+       use  of subpatterns for more complicated assertions is described below.
        The backslashed assertions are:


          \b     matches at a word boundary
@@ -3690,47 +3734,49 @@
          \z     matches only at the end of the subject
          \G     matches at the first matching position in the subject


-       Inside  a  character  class, \b has a different meaning; it matches the
-       backspace character. If any other of  these  assertions  appears  in  a
-       character  class, by default it matches the corresponding literal char-
+       Inside a character class, \b has a different meaning;  it  matches  the
+       backspace  character.  If  any  other  of these assertions appears in a
+       character class, by default it matches the corresponding literal  char-
        acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
-       PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-
+       PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
        ated instead.


-       A word boundary is a position in the subject string where  the  current
-       character  and  the previous character do not both match \w or \W (i.e.
-       one matches \w and the other matches \W), or the start or  end  of  the
-       string if the first or last character matches \w, respectively. Neither
-       PCRE nor Perl has a separte "start of word" or "end  of  word"  metase-
-       quence.  However,  whatever follows \b normally determines which it is.
+       A  word  boundary is a position in the subject string where the current
+       character and the previous character do not both match \w or  \W  (i.e.
+       one  matches  \w  and the other matches \W), or the start or end of the
+       string if the first or last  character  matches  \w,  respectively.  In
+       UTF-8  mode,  the  meanings  of \w and \W can be changed by setting the
+       PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
+       PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
+       quence. However, whatever follows \b normally determines which  it  is.
        For example, the fragment \ba matches "a" at the start of a word.


-       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
+       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
        and dollar (described in the next section) in that they only ever match
-       at the very start and end of the subject string, whatever  options  are
-       set.  Thus,  they are independent of multiline mode. These three asser-
+       at  the  very start and end of the subject string, whatever options are
+       set. Thus, they are independent of multiline mode. These  three  asser-
        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
-       affect  only the behaviour of the circumflex and dollar metacharacters.
-       However, if the startoffset argument of pcre_exec() is non-zero,  indi-
+       affect only the behaviour of the circumflex and dollar  metacharacters.
+       However,  if the startoffset argument of pcre_exec() is non-zero, indi-
        cating that matching is to start at a point other than the beginning of
-       the subject, \A can never match. The difference between \Z  and  \z  is
+       the  subject,  \A  can never match. The difference between \Z and \z is
        that \Z matches before a newline at the end of the string as well as at
        the very end, whereas \z matches only at the end.


-       The \G assertion is true only when the current matching position is  at
-       the  start point of the match, as specified by the startoffset argument
-       of pcre_exec(). It differs from \A when the  value  of  startoffset  is
-       non-zero.  By calling pcre_exec() multiple times with appropriate argu-
+       The  \G assertion is true only when the current matching position is at
+       the start point of the match, as specified by the startoffset  argument
+       of  pcre_exec().  It  differs  from \A when the value of startoffset is
+       non-zero. By calling pcre_exec() multiple times with appropriate  argu-
        ments, you can mimic Perl's /g option, and it is in this kind of imple-
        mentation where \G can be useful.


-       Note,  however,  that  PCRE's interpretation of \G, as the start of the
+       Note, however, that PCRE's interpretation of \G, as the  start  of  the
        current match, is subtly different from Perl's, which defines it as the
-       end  of  the  previous  match. In Perl, these can be different when the
-       previously matched string was empty. Because PCRE does just  one  match
+       end of the previous match. In Perl, these can  be  different  when  the
+       previously  matched  string was empty. Because PCRE does just one match
        at a time, it cannot reproduce this behaviour.


-       If  all  the alternatives of a pattern begin with \G, the expression is
+       If all the alternatives of a pattern begin with \G, the  expression  is
        anchored to the starting match position, and the "anchored" flag is set
        in the compiled regular expression.


@@ -3738,94 +3784,94 @@
CIRCUMFLEX AND DOLLAR

        Outside a character class, in the default matching mode, the circumflex
-       character is an assertion that is true only  if  the  current  matching
-       point  is  at the start of the subject string. If the startoffset argu-
-       ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
-       PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
+       character  is  an  assertion  that is true only if the current matching
+       point is at the start of the subject string. If the  startoffset  argu-
+       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
+       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
        has an entirely different meaning (see below).


-       Circumflex need not be the first character of the pattern if  a  number
-       of  alternatives are involved, but it should be the first thing in each
-       alternative in which it appears if the pattern is ever  to  match  that
-       branch.  If all possible alternatives start with a circumflex, that is,
-       if the pattern is constrained to match only at the start  of  the  sub-
-       ject,  it  is  said  to be an "anchored" pattern. (There are also other
+       Circumflex  need  not be the first character of the pattern if a number
+       of alternatives are involved, but it should be the first thing in  each
+       alternative  in  which  it appears if the pattern is ever to match that
+       branch. If all possible alternatives start with a circumflex, that  is,
+       if  the  pattern  is constrained to match only at the start of the sub-
+       ject, it is said to be an "anchored" pattern.  (There  are  also  other
        constructs that can cause a pattern to be anchored.)


-       A dollar character is an assertion that is true  only  if  the  current
-       matching  point  is  at  the  end of the subject string, or immediately
+       A  dollar  character  is  an assertion that is true only if the current
+       matching point is at the end of  the  subject  string,  or  immediately
        before a newline at the end of the string (by default). Dollar need not
-       be  the  last  character of the pattern if a number of alternatives are
-       involved, but it should be the last item in  any  branch  in  which  it
+       be the last character of the pattern if a number  of  alternatives  are
+       involved,  but  it  should  be  the last item in any branch in which it
        appears. Dollar has no special meaning in a character class.


-       The  meaning  of  dollar  can be changed so that it matches only at the
-       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
+       The meaning of dollar can be changed so that it  matches  only  at  the
+       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
        compile time. This does not affect the \Z assertion.


        The meanings of the circumflex and dollar characters are changed if the
-       PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
-       matches  immediately after internal newlines as well as at the start of
-       the subject string. It does not match after a  newline  that  ends  the
-       string.  A dollar matches before any newlines in the string, as well as
-       at the very end, when PCRE_MULTILINE is set. When newline is  specified
-       as  the  two-character  sequence CRLF, isolated CR and LF characters do
+       PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
+       matches immediately after internal newlines as well as at the start  of
+       the  subject  string.  It  does not match after a newline that ends the
+       string. A dollar matches before any newlines in the string, as well  as
+       at  the very end, when PCRE_MULTILINE is set. When newline is specified
+       as the two-character sequence CRLF, isolated CR and  LF  characters  do
        not indicate newlines.


-       For example, the pattern /^abc$/ matches the subject string  "def\nabc"
-       (where  \n  represents a newline) in multiline mode, but not otherwise.
-       Consequently, patterns that are anchored in single  line  mode  because
-       all  branches  start  with  ^ are not anchored in multiline mode, and a
-       match for circumflex is  possible  when  the  startoffset  argument  of
-       pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
+       (where \n represents a newline) in multiline mode, but  not  otherwise.
+       Consequently,  patterns  that  are anchored in single line mode because
+       all branches start with ^ are not anchored in  multiline  mode,  and  a
+       match  for  circumflex  is  possible  when  the startoffset argument of
+       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
        PCRE_MULTILINE is set.


-       Note that the sequences \A, \Z, and \z can be used to match  the  start
-       and  end of the subject in both modes, and if all branches of a pattern
-       start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
+       Note  that  the sequences \A, \Z, and \z can be used to match the start
+       and end of the subject in both modes, and if all branches of a  pattern
+       start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
        set.



FULL STOP (PERIOD, DOT) AND \N

        Outside a character class, a dot in the pattern matches any one charac-
-       ter in the subject string except (by default) a character  that  signi-
-       fies  the  end  of  a line. In UTF-8 mode, the matched character may be
+       ter  in  the subject string except (by default) a character that signi-
+       fies the end of a line. In UTF-8 mode, the  matched  character  may  be
        more than one byte long.


-       When a line ending is defined as a single character, dot never  matches
-       that  character; when the two-character sequence CRLF is used, dot does
-       not match CR if it is immediately followed  by  LF,  but  otherwise  it
-       matches  all characters (including isolated CRs and LFs). When any Uni-
-       code line endings are being recognized, dot does not match CR or LF  or
+       When  a line ending is defined as a single character, dot never matches
+       that character; when the two-character sequence CRLF is used, dot  does
+       not  match  CR  if  it  is immediately followed by LF, but otherwise it
+       matches all characters (including isolated CRs and LFs). When any  Uni-
+       code  line endings are being recognized, dot does not match CR or LF or
        any of the other line ending characters.


-       The  behaviour  of  dot  with regard to newlines can be changed. If the
-       PCRE_DOTALL option is set, a dot matches  any  one  character,  without
+       The behaviour of dot with regard to newlines can  be  changed.  If  the
+       PCRE_DOTALL  option  is  set,  a dot matches any one character, without
        exception. If the two-character sequence CRLF is present in the subject
        string, it takes two dots to match it.


-       The handling of dot is entirely independent of the handling of  circum-
-       flex  and  dollar,  the  only relationship being that they both involve
+       The  handling of dot is entirely independent of the handling of circum-
+       flex and dollar, the only relationship being  that  they  both  involve
        newlines. Dot has no special meaning in a character class.


        The escape sequence \N always behaves as a dot does when PCRE_DOTALL is
-       not  set.  In other words, it matches any one character except one that
+       not set. In other words, it matches any one character except  one  that
        signifies the end of a line.



MATCHING A SINGLE BYTE

        Outside a character class, the escape sequence \C matches any one byte,
-       both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any
-       line-ending characters. The feature is provided in  Perl  in  order  to
-       match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
-       acters into individual bytes, what remains in the string may be a  mal-
-       formed  UTF-8  string.  For this reason, the \C escape sequence is best
+       both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any
+       line-ending  characters.  The  feature  is provided in Perl in order to
+       match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char-
+       acters  into individual bytes, what remains in the string may be a mal-
+       formed UTF-8 string. For this reason, the \C escape  sequence  is  best
        avoided.


-       PCRE does not allow \C to appear in  lookbehind  assertions  (described
-       below),  because  in UTF-8 mode this would make it impossible to calcu-
+       PCRE  does  not  allow \C to appear in lookbehind assertions (described
+       below), because in UTF-8 mode this would make it impossible  to  calcu-
        late the length of the lookbehind.



@@ -3835,103 +3881,103 @@
        closing square bracket. A closing square bracket on its own is not spe-
        cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
        a lone closing square bracket causes a compile-time error. If a closing
-       square bracket is required as a member of the class, it should  be  the
-       first  data  character  in  the  class (after an initial circumflex, if
+       square  bracket  is required as a member of the class, it should be the
+       first data character in the class  (after  an  initial  circumflex,  if
        present) or escaped with a backslash.


-       A character class matches a single character in the subject.  In  UTF-8
+       A  character  class matches a single character in the subject. In UTF-8
        mode, the character may be more than one byte long. A matched character
        must be in the set of characters defined by the class, unless the first
-       character  in  the  class definition is a circumflex, in which case the
-       subject character must not be in the set defined by  the  class.  If  a
-       circumflex  is actually required as a member of the class, ensure it is
+       character in the class definition is a circumflex, in  which  case  the
+       subject  character  must  not  be in the set defined by the class. If a
+       circumflex is actually required as a member of the class, ensure it  is
        not the first character, or escape it with a backslash.


-       For example, the character class [aeiou] matches any lower case  vowel,
-       while  [^aeiou]  matches  any character that is not a lower case vowel.
+       For  example, the character class [aeiou] matches any lower case vowel,
+       while [^aeiou] matches any character that is not a  lower  case  vowel.
        Note that a circumflex is just a convenient notation for specifying the
-       characters  that  are in the class by enumerating those that are not. A
-       class that starts with a circumflex is not an assertion; it still  con-
-       sumes  a  character  from the subject string, and therefore it fails if
+       characters that are in the class by enumerating those that are  not.  A
+       class  that starts with a circumflex is not an assertion; it still con-
+       sumes a character from the subject string, and therefore  it  fails  if
        the current pointer is at the end of the string.


-       In UTF-8 mode, characters with values greater than 255 can be  included
-       in  a  class as a literal string of bytes, or by using the \x{ escaping
+       In  UTF-8 mode, characters with values greater than 255 can be included
+       in a class as a literal string of bytes, or by using the  \x{  escaping
        mechanism.


-       When caseless matching is set, any letters in a  class  represent  both
-       their  upper  case  and lower case versions, so for example, a caseless
-       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
-       match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
-       understands the concept of case for characters whose  values  are  less
-       than  128, so caseless matching is always possible. For characters with
-       higher values, the concept of case is supported  if  PCRE  is  compiled
-       with  Unicode  property support, but not otherwise.  If you want to use
-       caseless matching in UTF8-mode for characters 128 and above,  you  must
-       ensure  that  PCRE is compiled with Unicode property support as well as
+       When  caseless  matching  is set, any letters in a class represent both
+       their upper case and lower case versions, so for  example,  a  caseless
+       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
+       match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
+       understands  the  concept  of case for characters whose values are less
+       than 128, so caseless matching is always possible. For characters  with
+       higher  values,  the  concept  of case is supported if PCRE is compiled
+       with Unicode property support, but not otherwise.  If you want  to  use
+       caseless  matching  in UTF8-mode for characters 128 and above, you must
+       ensure that PCRE is compiled with Unicode property support as  well  as
        with UTF-8 support.


-       Characters that might indicate line breaks are  never  treated  in  any
-       special  way  when  matching  character  classes,  whatever line-ending
-       sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
+       Characters  that  might  indicate  line breaks are never treated in any
+       special way  when  matching  character  classes,  whatever  line-ending
+       sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
        PCRE_MULTILINE options is used. A class such as [^a] always matches one
        of these characters.


-       The minus (hyphen) character can be used to specify a range of  charac-
-       ters  in  a  character  class.  For  example,  [d-m] matches any letter
-       between d and m, inclusive. If a  minus  character  is  required  in  a
-       class,  it  must  be  escaped  with a backslash or appear in a position
-       where it cannot be interpreted as indicating a range, typically as  the
+       The  minus (hyphen) character can be used to specify a range of charac-
+       ters in a character  class.  For  example,  [d-m]  matches  any  letter
+       between  d  and  m,  inclusive.  If  a minus character is required in a
+       class, it must be escaped with a backslash  or  appear  in  a  position
+       where  it cannot be interpreted as indicating a range, typically as the
        first or last character in the class.


        It is not possible to have the literal character "]" as the end charac-
-       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
-       two  characters ("W" and "-") followed by a literal string "46]", so it
-       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
-       backslash  it is interpreted as the end of range, so [W-\]46] is inter-
-       preted as a class containing a range followed by two other  characters.
-       The  octal or hexadecimal representation of "]" can also be used to end
+       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
+       two characters ("W" and "-") followed by a literal string "46]", so  it
+       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
+       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
+       preted  as a class containing a range followed by two other characters.
+       The octal or hexadecimal representation of "]" can also be used to  end
        a range.


-       Ranges operate in the collating sequence of character values. They  can
-       also   be  used  for  characters  specified  numerically,  for  example
-       [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
+       Ranges  operate in the collating sequence of character values. They can
+       also  be  used  for  characters  specified  numerically,  for   example
+       [\000-\037].  In UTF-8 mode, ranges can include characters whose values
        are greater than 255, for example [\x{100}-\x{2ff}].


        If a range that includes letters is used when caseless matching is set,
        it matches the letters in either case. For example, [W-c] is equivalent
-       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
-       character tables for a French locale are in  use,  [\xc8-\xcb]  matches
-       accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
-       concept of case for characters with values greater than 128  only  when
+       to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
+       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
+       accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
+       concept  of  case for characters with values greater than 128 only when
        it is compiled with Unicode property support.


-       The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
-       in a character class, and add the characters that  they  match  to  the
-       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
-       flex can conveniently be used with the upper case  character  types  to
-       specify  a  more  restricted  set of characters than the matching lower
-       case type. For example, the class [^\W_] matches any letter  or  digit,
-       but not underscore.
+       The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and  \W
+       may  also appear in a character class, and add the characters that they
+       match to the class. For example,  [\dABCDEF]  matches  any  hexadecimal
+       digit.  A circumflex can conveniently be used with the upper case char-
+       acter types to specify a more restricted set  of  characters  than  the
+       matching  lower  case  type.  For example, the class [^\W_] matches any
+       letter or digit, but not underscore.


-       The  only  metacharacters  that are recognized in character classes are
-       backslash, hyphen (only where it can be  interpreted  as  specifying  a
-       range),  circumflex  (only  at the start), opening square bracket (only
-       when it can be interpreted as introducing a POSIX class name - see  the
-       next  section),  and  the  terminating closing square bracket. However,
+       The only metacharacters that are recognized in  character  classes  are
+       backslash,  hyphen  (only  where  it can be interpreted as specifying a
+       range), circumflex (only at the start), opening  square  bracket  (only
+       when  it can be interpreted as introducing a POSIX class name - see the
+       next section), and the terminating  closing  square  bracket.  However,
        escaping other non-alphanumeric characters does no harm.



POSIX CHARACTER CLASSES

        Perl supports the POSIX notation for character classes. This uses names
-       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
+       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
        supports this notation. For example,


          [01[:alpha:]%]


        matches "0", "1", any alphabetic character, or "%". The supported class
-       names are
+       names are:


          alnum    letters and digits
          alpha    letters
@@ -3942,31 +3988,47 @@
          graph    printing characters, excluding space
          lower    lower case letters
          print    printing characters, including space
-         punct    printing characters, excluding letters and digits
+         punct    printing characters, excluding letters and digits and space
          space    white space (not quite the same as \s)
          upper    upper case letters
          word     "word" characters (same as \w)
          xdigit   hexadecimal digits


-       The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
-       and space (32). Notice that this list includes the VT  character  (code
+       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
+       and  space  (32). Notice that this list includes the VT character (code
        11). This makes "space" different to \s, which does not include VT (for
        Perl compatibility).


-       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
-       from  Perl  5.8. Another Perl extension is negation, which is indicated
+       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
+       from Perl 5.8. Another Perl extension is negation, which  is  indicated
        by a ^ character after the colon. For example,


          [12[:^digit:]]


-       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
+       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
        these are not supported, and an error is given if they are encountered.


-       In UTF-8 mode, characters with values greater than 128 do not match any
-       of the POSIX character classes.
+       By  default,  in UTF-8 mode, characters with values greater than 128 do
+       not match any of the POSIX character classes. However, if the  PCRE_UCP
+       option  is passed to pcre_compile(), some of the classes are changed so
+       that Unicode character properties are used. This is achieved by replac-
+       ing the POSIX classes by other sequences, as follows:


+         [:alnum:]  becomes  \p{Xan}
+         [:alpha:]  becomes  \p{L}
+         [:blank:]  becomes  \h
+         [:digit:]  becomes  \p{Nd}
+         [:lower:]  becomes  \p{Ll}
+         [:space:]  becomes  \p{Xps}
+         [:upper:]  becomes  \p{Lu}
+         [:word:]   becomes  \p{Xwd}


+       Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
+       POSIX classes are unchanged, and match only characters with code points
+       less than 128.
+
+
 VERTICAL BAR


        Vertical  bar characters are used to separate alternative patterns. For
@@ -4035,8 +4097,9 @@
        cases the pattern can contain special leading sequences such as (*CRLF)
        to  override  what  the application has set or what has been defaulted.
        Details are given in the section entitled  "Newline  sequences"  above.
-       There  is  also  the  (*UTF8)  leading sequence that can be used to set
-       UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
+       There  are  also  the  (*UTF8) and (*UCP) leading sequences that can be
+       used to set UTF-8 and Unicode property modes; they  are  equivalent  to
+       setting the PCRE_UTF8 and the PCRE_UCP options, respectively.



SUBPATTERNS
@@ -4048,18 +4111,18 @@

          cat(aract|erpillar|)


-       matches  one  of the words "cat", "cataract", or "caterpillar". Without
-       the parentheses, it would match  "cataract",  "erpillar"  or  an  empty
+       matches one of the words "cat", "cataract", or  "caterpillar".  Without
+       the  parentheses,  it  would  match  "cataract", "erpillar" or an empty
        string.


-       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
-       that, when the whole pattern  matches,  that  portion  of  the  subject
+       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
+       that,  when  the  whole  pattern  matches,  that portion of the subject
        string that matched the subpattern is passed back to the caller via the
-       ovector argument of pcre_exec(). Opening parentheses are  counted  from
-       left  to  right  (starting  from 1) to obtain numbers for the capturing
+       ovector  argument  of pcre_exec(). Opening parentheses are counted from
+       left to right (starting from 1) to obtain  numbers  for  the  capturing
        subpatterns.


-       For example, if the string "the red king" is matched against  the  pat-
+       For  example,  if the string "the red king" is matched against the pat-
        tern


          the ((red|white) (king|queen))
@@ -4067,12 +4130,12 @@
        the captured substrings are "red king", "red", and "king", and are num-
        bered 1, 2, and 3, respectively.


-       The fact that plain parentheses fulfil  two  functions  is  not  always
-       helpful.   There are often times when a grouping subpattern is required
-       without a capturing requirement. If an opening parenthesis is  followed
-       by  a question mark and a colon, the subpattern does not do any captur-
-       ing, and is not counted when computing the  number  of  any  subsequent
-       capturing  subpatterns. For example, if the string "the white queen" is
+       The  fact  that  plain  parentheses  fulfil two functions is not always
+       helpful.  There are often times when a grouping subpattern is  required
+       without  a capturing requirement. If an opening parenthesis is followed
+       by a question mark and a colon, the subpattern does not do any  captur-
+       ing,  and  is  not  counted when computing the number of any subsequent
+       capturing subpatterns. For example, if the string "the white queen"  is
        matched against the pattern


          the ((?:red|white) (king|queen))
@@ -4080,96 +4143,96 @@
        the captured substrings are "white queen" and "queen", and are numbered
        1 and 2. The maximum number of capturing subpatterns is 65535.


-       As  a  convenient shorthand, if any option settings are required at the
-       start of a non-capturing subpattern,  the  option  letters  may  appear
+       As a convenient shorthand, if any option settings are required  at  the
+       start  of  a  non-capturing  subpattern,  the option letters may appear
        between the "?" and the ":". Thus the two patterns


          (?i:saturday|sunday)
          (?:(?i)saturday|sunday)


        match exactly the same set of strings. Because alternative branches are
-       tried from left to right, and options are not reset until  the  end  of
-       the  subpattern is reached, an option setting in one branch does affect
-       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
+       tried  from  left  to right, and options are not reset until the end of
+       the subpattern is reached, an option setting in one branch does  affect
+       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
        "Saturday".



DUPLICATE SUBPATTERN NUMBERS

        Perl 5.10 introduced a feature whereby each alternative in a subpattern
-       uses the same numbers for its capturing parentheses. Such a  subpattern
-       starts  with (?| and is itself a non-capturing subpattern. For example,
+       uses  the same numbers for its capturing parentheses. Such a subpattern
+       starts with (?| and is itself a non-capturing subpattern. For  example,
        consider this pattern:


          (?|(Sat)ur|(Sun))day


-       Because the two alternatives are inside a (?| group, both sets of  cap-
-       turing  parentheses  are  numbered one. Thus, when the pattern matches,
-       you can look at captured substring number  one,  whichever  alternative
-       matched.  This  construct  is useful when you want to capture part, but
+       Because  the two alternatives are inside a (?| group, both sets of cap-
+       turing parentheses are numbered one. Thus, when  the  pattern  matches,
+       you  can  look  at captured substring number one, whichever alternative
+       matched. This construct is useful when you want to  capture  part,  but
        not all, of one of a number of alternatives. Inside a (?| group, paren-
-       theses  are  numbered as usual, but the number is reset at the start of
-       each branch. The numbers of any capturing buffers that follow the  sub-
-       pattern  start after the highest number used in any branch. The follow-
-       ing example is taken from the Perl documentation.  The  numbers  under-
+       theses are numbered as usual, but the number is reset at the  start  of
+       each  branch. The numbers of any capturing buffers that follow the sub-
+       pattern start after the highest number used in any branch. The  follow-
+       ing  example  is taken from the Perl documentation.  The numbers under-
        neath show in which buffer the captured content will be stored.


          # before  ---------------branch-reset----------- after
          / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
          # 1            2         2  3        2     3     4


-       A  back  reference  to a numbered subpattern uses the most recent value
-       that is set for that number by any subpattern.  The  following  pattern
+       A back reference to a numbered subpattern uses the  most  recent  value
+       that  is  set  for that number by any subpattern. The following pattern
        matches "abcabc" or "defdef":


          /(?|(abc)|(def))\1/


-       In  contrast, a recursive or "subroutine" call to a numbered subpattern
-       always refers to the first one in the pattern with  the  given  number.
+       In contrast, a recursive or "subroutine" call to a numbered  subpattern
+       always  refers  to  the first one in the pattern with the given number.
        The following pattern matches "abcabc" or "defabc":


          /(?|(abc)|(def))(?1)/


-       If  a condition test for a subpattern's having matched refers to a non-
-       unique number, the test is true if any of the subpatterns of that  num-
+       If a condition test for a subpattern's having matched refers to a  non-
+       unique  number, the test is true if any of the subpatterns of that num-
        ber have matched.


-       An  alternative approach to using this "branch reset" feature is to use
+       An alternative approach to using this "branch reset" feature is to  use
        duplicate named subpatterns, as described in the next section.



NAMED SUBPATTERNS

-       Identifying capturing parentheses by number is simple, but  it  can  be
-       very  hard  to keep track of the numbers in complicated regular expres-
-       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
-       change.  To help with this difficulty, PCRE supports the naming of sub-
+       Identifying  capturing  parentheses  by number is simple, but it can be
+       very hard to keep track of the numbers in complicated  regular  expres-
+       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
+       change. To help with this difficulty, PCRE supports the naming of  sub-
        patterns. This feature was not added to Perl until release 5.10. Python
-       had  the  feature earlier, and PCRE introduced it at release 4.0, using
-       the Python syntax. PCRE now supports both the Perl and the Python  syn-
-       tax.  Perl  allows  identically  numbered subpatterns to have different
+       had the feature earlier, and PCRE introduced it at release  4.0,  using
+       the  Python syntax. PCRE now supports both the Perl and the Python syn-
+       tax. Perl allows identically numbered  subpatterns  to  have  different
        names, but PCRE does not.


-       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
-       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
-       to capturing parentheses from other parts of the pattern, such as  back
-       references,  recursion,  and conditions, can be made by name as well as
+       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
+       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
+       to  capturing parentheses from other parts of the pattern, such as back
+       references, recursion, and conditions, can be made by name as  well  as
        by number.


-       Names consist of up to  32  alphanumeric  characters  and  underscores.
-       Named  capturing  parentheses  are  still  allocated numbers as well as
-       names, exactly as if the names were not present. The PCRE API  provides
+       Names  consist  of  up  to  32 alphanumeric characters and underscores.
+       Named capturing parentheses are still  allocated  numbers  as  well  as
+       names,  exactly as if the names were not present. The PCRE API provides
        function calls for extracting the name-to-number translation table from
        a compiled pattern. There is also a convenience function for extracting
        a captured substring by name.


-       By  default, a name must be unique within a pattern, but it is possible
+       By default, a name must be unique within a pattern, but it is  possible
        to relax this constraint by setting the PCRE_DUPNAMES option at compile
-       time.  (Duplicate  names are also always permitted for subpatterns with
-       the same number, set up as described in the previous  section.)  Dupli-
-       cate  names  can  be useful for patterns where only one instance of the
-       named parentheses can match. Suppose you want to match the  name  of  a
-       weekday,  either as a 3-letter abbreviation or as the full name, and in
+       time. (Duplicate names are also always permitted for  subpatterns  with
+       the  same  number, set up as described in the previous section.) Dupli-
+       cate names can be useful for patterns where only one  instance  of  the
+       named  parentheses  can  match. Suppose you want to match the name of a
+       weekday, either as a 3-letter abbreviation or as the full name, and  in
        both cases you want to extract the abbreviation. This pattern (ignoring
        the line breaks) does the job:


@@ -4179,38 +4242,38 @@
          (?<DN>Thu)(?:rsday)?|
          (?<DN>Sat)(?:urday)?


-       There  are  five capturing substrings, but only one is ever set after a
+       There are five capturing substrings, but only one is ever set  after  a
        match.  (An alternative way of solving this problem is to use a "branch
        reset" subpattern, as described in the previous section.)


-       The  convenience  function  for extracting the data by name returns the
-       substring for the first (and in this example, the only)  subpattern  of
-       that  name  that  matched.  This saves searching to find which numbered
+       The convenience function for extracting the data by  name  returns  the
+       substring  for  the first (and in this example, the only) subpattern of
+       that name that matched. This saves searching  to  find  which  numbered
        subpattern it was.


-       If you make a back reference to  a  non-unique  named  subpattern  from
-       elsewhere  in the pattern, the one that corresponds to the first occur-
+       If  you  make  a  back  reference to a non-unique named subpattern from
+       elsewhere in the pattern, the one that corresponds to the first  occur-
        rence of the name is used. In the absence of duplicate numbers (see the
-       previous  section) this is the one with the lowest number. If you use a
-       named reference in a condition test (see the section  about  conditions
-       below),  either  to check whether a subpattern has matched, or to check
-       for recursion, all subpatterns with the same name are  tested.  If  the
-       condition  is  true for any one of them, the overall condition is true.
+       previous section) this is the one with the lowest number. If you use  a
+       named  reference  in a condition test (see the section about conditions
+       below), either to check whether a subpattern has matched, or  to  check
+       for  recursion,  all  subpatterns with the same name are tested. If the
+       condition is true for any one of them, the overall condition  is  true.
        This is the same behaviour as testing by number. For further details of
        the interfaces for handling named subpatterns, see the pcreapi documen-
        tation.


        Warning: You cannot use different names to distinguish between two sub-
-       patterns  with  the same number because PCRE uses only the numbers when
+       patterns with the same number because PCRE uses only the  numbers  when
        matching. For this reason, an error is given at compile time if differ-
-       ent  names  are given to subpatterns with the same number. However, you
-       can give the same name to subpatterns with the same number,  even  when
+       ent names are given to subpatterns with the same number.  However,  you
+       can  give  the same name to subpatterns with the same number, even when
        PCRE_DUPNAMES is not set.



REPETITION

-       Repetition  is  specified  by  quantifiers, which can follow any of the
+       Repetition is specified by quantifiers, which can  follow  any  of  the
        following items:


          a literal data character
@@ -4224,17 +4287,17 @@
          a parenthesized subpattern (unless it is an assertion)
          a recursive or "subroutine" call to a subpattern


-       The general repetition quantifier specifies a minimum and maximum  num-
-       ber  of  permitted matches, by giving the two numbers in curly brackets
-       (braces), separated by a comma. The numbers must be  less  than  65536,
+       The  general repetition quantifier specifies a minimum and maximum num-
+       ber of permitted matches, by giving the two numbers in  curly  brackets
+       (braces),  separated  by  a comma. The numbers must be less than 65536,
        and the first must be less than or equal to the second. For example:


          z{2,4}


-       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
-       special character. If the second number is omitted, but  the  comma  is
-       present,  there  is  no upper limit; if the second number and the comma
-       are both omitted, the quantifier specifies an exact number of  required
+       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
+       special  character.  If  the second number is omitted, but the comma is
+       present, there is no upper limit; if the second number  and  the  comma
+       are  both omitted, the quantifier specifies an exact number of required
        matches. Thus


          [aeiou]{3,}
@@ -4243,49 +4306,49 @@


          \d{8}


-       matches  exactly  8  digits. An opening curly bracket that appears in a
-       position where a quantifier is not allowed, or one that does not  match
-       the  syntax of a quantifier, is taken as a literal character. For exam-
+       matches exactly 8 digits. An opening curly bracket that  appears  in  a
+       position  where a quantifier is not allowed, or one that does not match
+       the syntax of a quantifier, is taken as a literal character. For  exam-
        ple, {,6} is not a quantifier, but a literal string of four characters.


-       In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
+       In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
        individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
        acters, each of which is represented by a two-byte sequence. Similarly,
        when Unicode property support is available, \X{3} matches three Unicode
-       extended sequences, each of which may be several bytes long  (and  they
+       extended  sequences,  each of which may be several bytes long (and they
        may be of different lengths).


        The quantifier {0} is permitted, causing the expression to behave as if
        the previous item and the quantifier were not present. This may be use-
-       ful  for  subpatterns that are referenced as subroutines from elsewhere
+       ful for subpatterns that are referenced as subroutines  from  elsewhere
        in the pattern. Items other than subpatterns that have a {0} quantifier
        are omitted from the compiled pattern.


-       For  convenience, the three most common quantifiers have single-charac-
+       For convenience, the three most common quantifiers have  single-charac-
        ter abbreviations:


          *    is equivalent to {0,}
          +    is equivalent to {1,}
          ?    is equivalent to {0,1}


-       It is possible to construct infinite loops by  following  a  subpattern
+       It  is  possible  to construct infinite loops by following a subpattern
        that can match no characters with a quantifier that has no upper limit,
        for example:


          (a?)*


        Earlier versions of Perl and PCRE used to give an error at compile time
-       for  such  patterns. However, because there are cases where this can be
-       useful, such patterns are now accepted, but if any  repetition  of  the
-       subpattern  does in fact match no characters, the loop is forcibly bro-
+       for such patterns. However, because there are cases where this  can  be
+       useful,  such  patterns  are now accepted, but if any repetition of the
+       subpattern does in fact match no characters, the loop is forcibly  bro-
        ken.


-       By default, the quantifiers are "greedy", that is, they match  as  much
-       as  possible  (up  to  the  maximum number of permitted times), without
-       causing the rest of the pattern to fail. The classic example  of  where
+       By  default,  the quantifiers are "greedy", that is, they match as much
+       as possible (up to the maximum  number  of  permitted  times),  without
+       causing  the  rest of the pattern to fail. The classic example of where
        this gives problems is in trying to match comments in C programs. These
-       appear between /* and */ and within the comment,  individual  *  and  /
-       characters  may  appear. An attempt to match C comments by applying the
+       appear  between  /*  and  */ and within the comment, individual * and /
+       characters may appear. An attempt to match C comments by  applying  the
        pattern


          /\*.*\*/
@@ -4294,19 +4357,19 @@


          /* first comment */  not comment  /* second comment */


-       fails, because it matches the entire string owing to the greediness  of
+       fails,  because it matches the entire string owing to the greediness of
        the .*  item.


-       However,  if  a quantifier is followed by a question mark, it ceases to
+       However, if a quantifier is followed by a question mark, it  ceases  to
        be greedy, and instead matches the minimum number of times possible, so
        the pattern


          /\*.*?\*/


-       does  the  right  thing with the C comments. The meaning of the various
-       quantifiers is not otherwise changed,  just  the  preferred  number  of
-       matches.   Do  not  confuse this use of question mark with its use as a
-       quantifier in its own right. Because it has two uses, it can  sometimes
+       does the right thing with the C comments. The meaning  of  the  various
+       quantifiers  is  not  otherwise  changed,  just the preferred number of
+       matches.  Do not confuse this use of question mark with its  use  as  a
+       quantifier  in its own right. Because it has two uses, it can sometimes
        appear doubled, as in


          \d??\d
@@ -4314,36 +4377,36 @@
        which matches one digit by preference, but can match two if that is the
        only way the rest of the pattern matches.


-       If the PCRE_UNGREEDY option is set (an option that is not available  in
-       Perl),  the  quantifiers are not greedy by default, but individual ones
-       can be made greedy by following them with a  question  mark.  In  other
+       If  the PCRE_UNGREEDY option is set (an option that is not available in
+       Perl), the quantifiers are not greedy by default, but  individual  ones
+       can  be  made  greedy  by following them with a question mark. In other
        words, it inverts the default behaviour.


-       When  a  parenthesized  subpattern  is quantified with a minimum repeat
-       count that is greater than 1 or with a limited maximum, more memory  is
-       required  for  the  compiled  pattern, in proportion to the size of the
+       When a parenthesized subpattern is quantified  with  a  minimum  repeat
+       count  that is greater than 1 or with a limited maximum, more memory is
+       required for the compiled pattern, in proportion to  the  size  of  the
        minimum or maximum.


        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
-       alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
-       the pattern is implicitly anchored, because whatever  follows  will  be
-       tried  against every character position in the subject string, so there
-       is no point in retrying the overall match at  any  position  after  the
-       first.  PCRE  normally treats such a pattern as though it were preceded
+       alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
+       the  pattern  is  implicitly anchored, because whatever follows will be
+       tried against every character position in the subject string, so  there
+       is  no  point  in  retrying the overall match at any position after the
+       first. PCRE normally treats such a pattern as though it  were  preceded
        by \A.


-       In cases where it is known that the subject  string  contains  no  new-
-       lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
+       In  cases  where  it  is known that the subject string contains no new-
+       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
        mization, or alternatively using ^ to indicate anchoring explicitly.


-       However, there is one situation where the optimization cannot be  used.
+       However,  there is one situation where the optimization cannot be used.
        When .*  is inside capturing parentheses that are the subject of a back
        reference elsewhere in the pattern, a match at the start may fail where
        a later one succeeds. Consider, for example:


          (.*)abc\1


-       If  the subject is "xyz123abc123" the match point is the fourth charac-
+       If the subject is "xyz123abc123" the match point is the fourth  charac-
        ter. For this reason, such a pattern is not implicitly anchored.


        When a capturing subpattern is repeated, the value captured is the sub-
@@ -4352,8 +4415,8 @@
          (tweedle[dume]{3}\s*)+


        has matched "tweedledum tweedledee" the value of the captured substring
-       is "tweedledee". However, if there are  nested  capturing  subpatterns,
-       the  corresponding captured values may have been set in previous itera-
+       is  "tweedledee".  However,  if there are nested capturing subpatterns,
+       the corresponding captured values may have been set in previous  itera-
        tions. For example, after


          /(a|(b))+/
@@ -4363,53 +4426,53 @@


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

-       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
-       repetition,  failure  of what follows normally causes the repeated item
-       to be re-evaluated to see if a different number of repeats  allows  the
-       rest  of  the pattern to match. Sometimes it is useful to prevent this,
-       either to change the nature of the match, or to cause it  fail  earlier
-       than  it otherwise might, when the author of the pattern knows there is
+       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+       repetition, failure of what follows normally causes the  repeated  item
+       to  be  re-evaluated to see if a different number of repeats allows the
+       rest of the pattern to match. Sometimes it is useful to  prevent  this,
+       either  to  change the nature of the match, or to cause it fail earlier
+       than it otherwise might, when the author of the pattern knows there  is
        no point in carrying on.


-       Consider, for example, the pattern \d+foo when applied to  the  subject
+       Consider,  for  example, the pattern \d+foo when applied to the subject
        line


          123456bar


        After matching all 6 digits and then failing to match "foo", the normal
-       action of the matcher is to try again with only 5 digits  matching  the
-       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
-       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
-       the  means for specifying that once a subpattern has matched, it is not
+       action  of  the matcher is to try again with only 5 digits matching the
+       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
+       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
+       the means for specifying that once a subpattern has matched, it is  not
        to be re-evaluated in this way.


-       If we use atomic grouping for the previous example, the  matcher  gives
-       up  immediately  on failing to match "foo" the first time. The notation
+       If  we  use atomic grouping for the previous example, the matcher gives
+       up immediately on failing to match "foo" the first time.  The  notation
        is a kind of special parenthesis, starting with (?> as in this example:


          (?>\d+)foo


-       This kind of parenthesis "locks up" the  part of the  pattern  it  con-
-       tains  once  it  has matched, and a failure further into the pattern is
-       prevented from backtracking into it. Backtracking past it  to  previous
+       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
+       tains once it has matched, and a failure further into  the  pattern  is
+       prevented  from  backtracking into it. Backtracking past it to previous
        items, however, works as normal.


-       An  alternative  description  is that a subpattern of this type matches
-       the string of characters that an  identical  standalone  pattern  would
+       An alternative description is that a subpattern of  this  type  matches
+       the  string  of  characters  that an identical standalone pattern would
        match, if anchored at the current point in the subject string.


        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
        such as the above example can be thought of as a maximizing repeat that
-       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
-       pared to adjust the number of digits they match in order  to  make  the
+       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
+       pared  to  adjust  the number of digits they match in order to make the
        rest of the pattern match, (?>\d+) can only match an entire sequence of
        digits.


-       Atomic groups in general can of course contain arbitrarily  complicated
-       subpatterns,  and  can  be  nested. However, when the subpattern for an
+       Atomic  groups in general can of course contain arbitrarily complicated
+       subpatterns, and can be nested. However, when  the  subpattern  for  an
        atomic group is just a single repeated item, as in the example above, a
-       simpler  notation,  called  a "possessive quantifier" can be used. This
-       consists of an additional + character  following  a  quantifier.  Using
+       simpler notation, called a "possessive quantifier" can  be  used.  This
+       consists  of  an  additional  + character following a quantifier. Using
        this notation, the previous example can be rewritten as


          \d++foo
@@ -4419,45 +4482,45 @@


          (abc|xyz){2,3}+


-       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
+       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
-       simpler forms of atomic group. However, there is no difference  in  the
-       meaning  of  a  possessive  quantifier and the equivalent atomic group,
-       though there may be a performance  difference;  possessive  quantifiers
+       simpler  forms  of atomic group. However, there is no difference in the
+       meaning of a possessive quantifier and  the  equivalent  atomic  group,
+       though  there  may  be a performance difference; possessive quantifiers
        should be slightly faster.


-       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
-       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
+       The possessive quantifier syntax is an extension to the Perl  5.8  syn-
+       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
        edition of his book. Mike McCloskey liked it, so implemented it when he
-       built Sun's Java package, and PCRE copied it from there. It  ultimately
+       built  Sun's Java package, and PCRE copied it from there. It ultimately
        found its way into Perl at release 5.10.


        PCRE has an optimization that automatically "possessifies" certain sim-
-       ple pattern constructs. For example, the sequence  A+B  is  treated  as
-       A++B  because  there is no point in backtracking into a sequence of A's
+       ple  pattern  constructs.  For  example, the sequence A+B is treated as
+       A++B because there is no point in backtracking into a sequence  of  A's
        when B must follow.


-       When a pattern contains an unlimited repeat inside  a  subpattern  that
-       can  itself  be  repeated  an  unlimited number of times, the use of an
-       atomic group is the only way to avoid some  failing  matches  taking  a
+       When  a  pattern  contains an unlimited repeat inside a subpattern that
+       can itself be repeated an unlimited number of  times,  the  use  of  an
+       atomic  group  is  the  only way to avoid some failing matches taking a
        very long time indeed. The pattern


          (\D+|<\d+>)*[!?]


-       matches  an  unlimited number of substrings that either consist of non-
-       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
+       matches an unlimited number of substrings that either consist  of  non-
+       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
        matches, it runs quickly. However, if it is applied to


          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa


-       it  takes  a  long  time  before reporting failure. This is because the
-       string can be divided between the internal \D+ repeat and the  external
-       *  repeat  in  a  large  number of ways, and all have to be tried. (The
-       example uses [!?] rather than a single character at  the  end,  because
-       both  PCRE  and  Perl have an optimization that allows for fast failure
-       when a single character is used. They remember the last single  charac-
-       ter  that  is required for a match, and fail early if it is not present
-       in the string.) If the pattern is changed so that  it  uses  an  atomic
+       it takes a long time before reporting  failure.  This  is  because  the
+       string  can be divided between the internal \D+ repeat and the external
+       * repeat in a large number of ways, and all  have  to  be  tried.  (The
+       example  uses  [!?]  rather than a single character at the end, because
+       both PCRE and Perl have an optimization that allows  for  fast  failure
+       when  a single character is used. They remember the last single charac-
+       ter that is required for a match, and fail early if it is  not  present
+       in  the  string.)  If  the pattern is changed so that it uses an atomic
        group, like this:


          ((?>\D+)|<\d+>)*[!?]
@@ -4469,37 +4532,37 @@


        Outside a character class, a backslash followed by a digit greater than
        0 (and possibly further digits) is a back reference to a capturing sub-
-       pattern  earlier  (that is, to its left) in the pattern, provided there
+       pattern earlier (that is, to its left) in the pattern,  provided  there
        have been that many previous capturing left parentheses.


        However, if the decimal number following the backslash is less than 10,
-       it  is  always  taken  as a back reference, and causes an error only if
-       there are not that many capturing left parentheses in the  entire  pat-
-       tern.  In  other words, the parentheses that are referenced need not be
-       to the left of the reference for numbers less than 10. A "forward  back
-       reference"  of  this  type can make sense when a repetition is involved
-       and the subpattern to the right has participated in an  earlier  itera-
+       it is always taken as a back reference, and causes  an  error  only  if
+       there  are  not that many capturing left parentheses in the entire pat-
+       tern. In other words, the parentheses that are referenced need  not  be
+       to  the left of the reference for numbers less than 10. A "forward back
+       reference" of this type can make sense when a  repetition  is  involved
+       and  the  subpattern to the right has participated in an earlier itera-
        tion.


-       It  is  not  possible to have a numerical "forward back reference" to a
-       subpattern whose number is 10 or  more  using  this  syntax  because  a
-       sequence  such  as  \50 is interpreted as a character defined in octal.
+       It is not possible to have a numerical "forward back  reference"  to  a
+       subpattern  whose  number  is  10  or  more using this syntax because a
+       sequence such as \50 is interpreted as a character  defined  in  octal.
        See the subsection entitled "Non-printing characters" above for further
-       details  of  the  handling of digits following a backslash. There is no
-       such problem when named parentheses are used. A back reference  to  any
+       details of the handling of digits following a backslash.  There  is  no
+       such  problem  when named parentheses are used. A back reference to any
        subpattern is possible using named parentheses (see below).


-       Another  way  of  avoiding  the ambiguity inherent in the use of digits
+       Another way of avoiding the ambiguity inherent in  the  use  of  digits
        following a backslash is to use the \g escape sequence, which is a fea-
-       ture  introduced  in  Perl  5.10.  This  escape  must be followed by an
-       unsigned number or a negative number, optionally  enclosed  in  braces.
+       ture introduced in Perl 5.10.  This  escape  must  be  followed  by  an
+       unsigned  number  or  a negative number, optionally enclosed in braces.
        These examples are all identical:


          (ring), \1
          (ring), \g1
          (ring), \g{1}


-       An  unsigned number specifies an absolute reference without the ambigu-
+       An unsigned number specifies an absolute reference without the  ambigu-
        ity that is present in the older syntax. It is also useful when literal
        digits follow the reference. A negative number is a relative reference.
        Consider this example:
@@ -4507,33 +4570,33 @@
          (abc(def)ghi)\g{-1}


        The sequence \g{-1} is a reference to the most recently started captur-
-       ing  subpattern  before \g, that is, it is equivalent to \2. Similarly,
+       ing subpattern before \g, that is, it is equivalent to  \2.  Similarly,
        \g{-2} would be equivalent to \1. The use of relative references can be
-       helpful  in  long  patterns,  and  also in patterns that are created by
+       helpful in long patterns, and also in  patterns  that  are  created  by
        joining together fragments that contain references within themselves.


-       A back reference matches whatever actually matched the  capturing  sub-
-       pattern  in  the  current subject string, rather than anything matching
+       A  back  reference matches whatever actually matched the capturing sub-
+       pattern in the current subject string, rather  than  anything  matching
        the subpattern itself (see "Subpatterns as subroutines" below for a way
        of doing that). So the pattern


          (sens|respons)e and \1ibility


-       matches  "sense and sensibility" and "response and responsibility", but
-       not "sense and responsibility". If caseful matching is in force at  the
-       time  of the back reference, the case of letters is relevant. For exam-
+       matches "sense and sensibility" and "response and responsibility",  but
+       not  "sense and responsibility". If caseful matching is in force at the
+       time of the back reference, the case of letters is relevant. For  exam-
        ple,


          ((?i)rah)\s+\1


-       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
+       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
        original capturing subpattern is matched caselessly.
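The behaviour described above can be checked with a short sketch. It uses Python's standard re module as a stand-in for PCRE, which is an assumption of this example and not part of the PCRE documentation. Python does not accept a mid-pattern (?i), so the scoped form (?i:...) is substituted; the variable and function names are illustrative only.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
pattern = re.compile(r"(sens|respons)e and \1ibility")

def matches(text):
    """Return True if the whole of text matches the pattern."""
    return pattern.fullmatch(text) is not None

same_word = matches("sense and sensibility")
other_word = matches("response and responsibility")
mixed = matches("sense and responsibility")

# (?i:...) limits caseless matching to the group; \1 stays caseful,
# so it must repeat exactly the text that the group captured.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")}
```

In this sketch only the subjects whose second word repeats the captured text exactly are accepted.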


-       There  are  several  different ways of writing back references to named
-       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
-       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
+       There are several different ways of writing back  references  to  named
+       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
+       \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
        unified back reference syntax, in which \g can be used for both numeric
-       and  named  references,  is  also supported. We could rewrite the above
+       and named references, is also supported. We  could  rewrite  the  above
        example in any of the following ways:


          (?<p1>(?i)rah)\s+\k<p1>
@@ -4541,67 +4604,67 @@
          (?P<p1>(?i)rah)\s+(?P=p1)
          (?<p1>(?i)rah)\s+\g{p1}


-       A subpattern that is referenced by  name  may  appear  in  the  pattern
+       A  subpattern  that  is  referenced  by  name may appear in the pattern
        before or after the reference.


-       There  may be more than one back reference to the same subpattern. If a
-       subpattern has not actually been used in a particular match,  any  back
+       There may be more than one back reference to the same subpattern. If  a
+       subpattern  has  not actually been used in a particular match, any back
        references to it always fail by default. For example, the pattern


          (a|(bc))\2


-       always  fails  if  it starts to match "a" rather than "bc". However, if
+       always fails if it starts to match "a" rather than  "bc".  However,  if
        the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
        ence to an unset value matches an empty string.
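As a rough illustration of the default rule for unset groups, the following sketch again assumes Python's re module in place of PCRE. It has no equivalent of PCRE_JAVASCRIPT_COMPAT, so only the default behaviour is shown.

```python
import re

# Assumption: Python's re module stands in for PCRE's default behaviour.
unset_ref = re.compile(r"(a|(bc))\2")

# Group 2 never takes part when the first alternative "a" is chosen,
# so the back reference \2 has nothing to match and the attempt fails.
after_a = unset_ref.match("a") is not None
after_aa = unset_ref.match("aa") is not None

# When "bc" is chosen, group 2 is set and \2 must repeat it.
after_bcbc = unset_ref.match("bcbc") is not None
```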


-       Because  there may be many capturing parentheses in a pattern, all dig-
-       its following a backslash are taken as part of a potential back  refer-
-       ence  number.   If  the  pattern continues with a digit character, some
-       delimiter must  be  used  to  terminate  the  back  reference.  If  the
+       Because there may be many capturing parentheses in a pattern, all  dig-
+       its  following a backslash are taken as part of a potential back refer-
+       ence number.  If the pattern continues with  a  digit  character,  some
+       delimiter  must  be  used  to  terminate  the  back  reference.  If the
        PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
        syntax or an empty comment (see "Comments" below) can be used.


    Recursive back references


-       A back reference that occurs inside the parentheses to which it  refers
-       fails  when  the subpattern is first used, so, for example, (a\1) never
-       matches.  However, such references can be useful inside  repeated  sub-
+       A  back reference that occurs inside the parentheses to which it refers
+       fails when the subpattern is first used, so, for example,  (a\1)  never
+       matches.   However,  such references can be useful inside repeated sub-
        patterns. For example, the pattern


          (a|b\1)+


        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
-       ation of the subpattern,  the  back  reference  matches  the  character
-       string  corresponding  to  the previous iteration. In order for this to
-       work, the pattern must be such that the first iteration does  not  need
-       to  match the back reference. This can be done using alternation, as in
+       ation  of  the  subpattern,  the  back  reference matches the character
+       string corresponding to the previous iteration. In order  for  this  to
+       work,  the  pattern must be such that the first iteration does not need
+       to match the back reference. This can be done using alternation, as  in
        the example above, or by a quantifier with a minimum of zero.


-       Back references of this type cause the group that they reference to  be
-       treated  as  an atomic group.  Once the whole group has been matched, a
-       subsequent matching failure cannot cause backtracking into  the  middle
+       Back  references of this type cause the group that they reference to be
+       treated as an atomic group.  Once the whole group has been  matched,  a
+       subsequent  matching  failure cannot cause backtracking into the middle
        of the group.



ASSERTIONS

-       An  assertion  is  a  test on the characters following or preceding the
-       current matching point that does not actually consume  any  characters.
-       The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
+       An assertion is a test on the characters  following  or  preceding  the
+       current  matching  point that does not actually consume any characters.
+       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
        described above.


-       More complicated assertions are coded as  subpatterns.  There  are  two
-       kinds:  those  that  look  ahead of the current position in the subject
-       string, and those that look  behind  it.  An  assertion  subpattern  is
-       matched  in  the  normal way, except that it does not cause the current
+       More  complicated  assertions  are  coded as subpatterns. There are two
+       kinds: those that look ahead of the current  position  in  the  subject
+       string,  and  those  that  look  behind  it. An assertion subpattern is
+       matched in the normal way, except that it does not  cause  the  current
        matching position to be changed.


-       Assertion subpatterns are not capturing subpatterns,  and  may  not  be
-       repeated,  because  it  makes no sense to assert the same thing several
-       times. If any kind of assertion contains capturing  subpatterns  within
-       it,  these are counted for the purposes of numbering the capturing sub-
+       Assertion  subpatterns  are  not  capturing subpatterns, and may not be
+       repeated, because it makes no sense to assert the  same  thing  several
+       times.  If  any kind of assertion contains capturing subpatterns within
+       it, these are counted for the purposes of numbering the capturing  sub-
        patterns in the whole pattern.  However, substring capturing is carried
-       out  only  for  positive assertions, because it does not make sense for
+       out only for positive assertions, because it does not  make  sense  for
        negative assertions.


    Lookahead assertions
@@ -4611,38 +4674,38 @@


          \w+(?=;)


-       matches  a word followed by a semicolon, but does not include the semi-
+       matches a word followed by a semicolon, but does not include the  semi-
        colon in the match, and


          foo(?!bar)


-       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
+       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
        that the apparently similar pattern


          (?!foo)bar


-       does  not  find  an  occurrence  of "bar" that is preceded by something
-       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
+       does not find an occurrence of "bar"  that  is  preceded  by  something
+       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
        the assertion (?!foo) is always true when the next three characters are
        "bar". A lookbehind assertion is needed to achieve the other effect.

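The difference between the three patterns above may be easier to see on a concrete subject. This is a minimal sketch that assumes Python's re module behaves like PCRE for these simple assertions; the helper name spans is hypothetical.

```python
import re

def spans(pattern, text):
    """Return the (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

text = "foobar bazbar"

# The lookahead tests the characters at "bar" itself, so it never rejects.
lookahead_spans = spans(r"(?!foo)bar", text)

# The lookbehind tests the three characters before "bar".
lookbehind_spans = spans(r"(?<!foo)bar", text)

# "foo" followed by anything other than "bar".
foo_not_bar = spans(r"foo(?!bar)", "foobar foobaz")
```

Both occurrences of "bar" are found by (?!foo)bar, while (?<!foo)bar finds only the one preceded by "baz".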

        If you want to force a matching failure at some point in a pattern, the
-       most  convenient  way  to  do  it  is with (?!) because an empty string
-       always matches, so an assertion that requires there not to be an  empty
-       string  must  always  fail.   The  Perl  5.10 backtracking control verb
+       most convenient way to do it is  with  (?!)  because  an  empty  string
+       always  matches, so an assertion that requires there not to be an empty
+       string must always fail.   The  Perl  5.10  backtracking  control  verb
        (*FAIL) or (*F) is essentially a synonym for (?!).


    Lookbehind assertions


-       Lookbehind assertions start with (?<= for positive assertions and  (?<!
+       Lookbehind  assertions start with (?<= for positive assertions and (?<!
        for negative assertions. For example,


          (?<!foo)bar


-       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
-       contents of a lookbehind assertion are restricted  such  that  all  the
+       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
+       contents  of  a  lookbehind  assertion are restricted such that all the
        strings it matches must have a fixed length. However, if there are sev-
-       eral top-level alternatives, they do not all  have  to  have  the  same
+       eral  top-level  alternatives,  they  do  not all have to have the same
        fixed length. Thus


          (?<=bullock|donkey)
@@ -4651,62 +4714,62 @@


          (?<!dogs?|cats?)


-       causes  an  error at compile time. Branches that match different length
-       strings are permitted only at the top level of a lookbehind  assertion.
-       This  is an extension compared with Perl (5.8 and 5.10), which requires
+       causes an error at compile time. Branches that match  different  length
+       strings  are permitted only at the top level of a lookbehind assertion.
+       This is an extension compared with Perl (5.8 and 5.10), which  requires
        all branches to match the same length of string. An assertion such as


          (?<=ab(c|de))


-       is not permitted, because its single top-level  branch  can  match  two
+       is  not  permitted,  because  its single top-level branch can match two
        different lengths, but it is acceptable to PCRE if rewritten to use two
        top-level branches:


          (?<=abc|abde)


        In some cases, the Perl 5.10 escape sequence \K (see above) can be used
-       instead  of  a  lookbehind  assertion  to  get  round  the fixed-length
+       instead of  a  lookbehind  assertion  to  get  round  the  fixed-length
        restriction.


-       The implementation of lookbehind assertions is, for  each  alternative,
-       to  temporarily  move the current position back by the fixed length and
+       The  implementation  of lookbehind assertions is, for each alternative,
+       to temporarily move the current position back by the fixed  length  and
        then try to match. If there are insufficient characters before the cur-
        rent position, the assertion fails.


        PCRE does not allow the \C escape (which matches a single byte in UTF-8
-       mode) to appear in lookbehind assertions, because it makes it  impossi-
-       ble  to  calculate the length of the lookbehind. The \X and \R escapes,
+       mode)  to appear in lookbehind assertions, because it makes it impossi-
+       ble to calculate the length of the lookbehind. The \X and  \R  escapes,
        which can match different numbers of bytes, are also not permitted.


-       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
-       lookbehinds,  as  long as the subpattern matches a fixed-length string.
+       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
+       lookbehinds, as long as the subpattern matches a  fixed-length  string.
        Recursion, however, is not supported.


-       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
+       Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
        assertions to specify efficient matching of fixed-length strings at the
        end of subject strings. Consider a simple pattern such as


          abcd$


-       when applied to a long string that does  not  match.  Because  matching
+       when  applied  to  a  long string that does not match. Because matching
        proceeds from left to right, PCRE will look for each "a" in the subject
-       and then see if what follows matches the rest of the  pattern.  If  the
+       and  then  see  if what follows matches the rest of the pattern. If the
        pattern is specified as


          ^.*abcd$


-       the  initial .* matches the entire string at first, but when this fails
+       the initial .* matches the entire string at first, but when this  fails
        (because there is no following "a"), it backtracks to match all but the
-       last  character,  then all but the last two characters, and so on. Once
-       again the search for "a" covers the entire string, from right to  left,
+       last character, then all but the last two characters, and so  on.  Once
+       again  the search for "a" covers the entire string, from right to left,
        so we are no better off. However, if the pattern is written as


          ^.*+(?<=abcd)


-       there  can  be  no backtracking for the .*+ item; it can match only the
-       entire string. The subsequent lookbehind assertion does a  single  test
-       on  the last four characters. If it fails, the match fails immediately.
-       For long strings, this approach makes a significant difference  to  the
+       there can be no backtracking for the .*+ item; it can  match  only  the
+       entire  string.  The subsequent lookbehind assertion does a single test
+       on the last four characters. If it fails, the match fails  immediately.
+       For  long  strings, this approach makes a significant difference to the
        processing time.


    Using multiple assertions
@@ -4715,18 +4778,18 @@


          (?<=\d{3})(?<!999)foo


-       matches  "foo" preceded by three digits that are not "999". Notice that
-       each of the assertions is applied independently at the  same  point  in
-       the  subject  string.  First  there  is a check that the previous three
-       characters are all digits, and then there is  a  check  that  the  same
+       matches "foo" preceded by three digits that are not "999". Notice  that
+       each  of  the  assertions is applied independently at the same point in
+       the subject string. First there is a  check  that  the  previous  three
+       characters  are  all  digits,  and  then there is a check that the same
        three characters are not "999".  This pattern does not match "foo" pre-
-       ceded by six characters, the first of which are  digits  and  the  last
-       three  of  which  are not "999". For example, it doesn't match "123abc-
+       ceded  by  six  characters,  the first of which are digits and the last
+       three of which are not "999". For example, it  doesn't  match  "123abc-
        foo". A pattern to do that is


          (?<=\d{3}...)(?<!999)foo


-       This time the first assertion looks at the  preceding  six  characters,
+       This  time  the  first assertion looks at the preceding six characters,
        checking that the first three are digits, and then the second assertion
        checks that the preceding three characters are not "999".
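The two patterns can be compared side by side. This sketch assumes that Python's re module, which also requires fixed-length lookbehinds, treats these assertions the same way as PCRE; the subject strings are made up for illustration.

```python
import re

# Both assertions look at the same three preceding characters.
independent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

independent_hits = [s for s in subjects if independent.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]
```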


@@ -4734,96 +4797,96 @@

          (?<=(?<!foo)bar)baz


-       matches an occurrence of "baz" that is preceded by "bar" which in  turn
+       matches  an occurrence of "baz" that is preceded by "bar" which in turn
        is not preceded by "foo", while


          (?<=\d{3}(?!999)...)foo


-       is  another pattern that matches "foo" preceded by three digits and any
+       is another pattern that matches "foo" preceded by three digits and  any
        three characters that are not "999".



CONDITIONAL SUBPATTERNS

-       It is possible to cause the matching process to obey a subpattern  con-
-       ditionally  or to choose between two alternative subpatterns, depending
-       on the result of an assertion, or whether a specific capturing  subpat-
-       tern  has  already  been matched. The two possible forms of conditional
+       It  is possible to cause the matching process to obey a subpattern con-
+       ditionally or to choose between two alternative subpatterns,  depending
+       on  the result of an assertion, or whether a specific capturing subpat-
+       tern has already been matched. The two possible  forms  of  conditional
        subpattern are:


          (?(condition)yes-pattern)
          (?(condition)yes-pattern|no-pattern)


-       If the condition is satisfied, the yes-pattern is used;  otherwise  the
-       no-pattern  (if  present)  is used. If there are more than two alterna-
+       If  the  condition is satisfied, the yes-pattern is used; otherwise the
+       no-pattern (if present) is used. If there are more  than  two  alterna-
        tives in the subpattern, a compile-time error occurs.


-       There are four kinds of condition: references  to  subpatterns,  refer-
+       There  are  four  kinds of condition: references to subpatterns, refer-
        ences to recursion, a pseudo-condition called DEFINE, and assertions.


    Checking for a used subpattern by number


-       If  the  text between the parentheses consists of a sequence of digits,
+       If the text between the parentheses consists of a sequence  of  digits,
        the condition is true if a capturing subpattern of that number has pre-
-       viously  matched.  If  there is more than one capturing subpattern with
-       the same number (see the earlier  section  about  duplicate  subpattern
+       viously matched. If there is more than one  capturing  subpattern  with
+       the  same  number  (see  the earlier section about duplicate subpattern
        numbers), the condition is true if any of them have been set. An alter-
-       native notation is to precede the digits with a plus or minus sign.  In
-       this  case, the subpattern number is relative rather than absolute. The
-       most recently opened parentheses can be referenced by (?(-1), the  next
-       most  recent  by  (?(-2),  and so on. In looping constructs it can also
-       make sense to refer  to  subsequent  groups  with  constructs  such  as
+       native  notation is to precede the digits with a plus or minus sign. In
+       this case, the subpattern number is relative rather than absolute.  The
+       most  recently opened parentheses can be referenced by (?(-1), the next
+       most recent by (?(-2), and so on. In looping  constructs  it  can  also
+       make  sense  to  refer  to  subsequent  groups  with constructs such as
        (?(+2).


-       Consider  the  following  pattern, which contains non-significant white
+       Consider the following pattern, which  contains  non-significant  white
        space to make it more readable (assume the PCRE_EXTENDED option) and to
        divide it into three parts for ease of discussion:


          ( \( )?    [^()]+    (?(1) \) )


-       The  first  part  matches  an optional opening parenthesis, and if that
+       The first part matches an optional opening  parenthesis,  and  if  that
        character is present, sets it as the first captured substring. The sec-
-       ond  part  matches one or more characters that are not parentheses. The
+       ond part matches one or more characters that are not  parentheses.  The
        third part is a conditional subpattern that tests whether the first set
        of parentheses matched or not. If they did, that is, if subject started
        with an opening parenthesis, the condition is true, and so the yes-pat-
-       tern  is  executed  and  a  closing parenthesis is required. Otherwise,
-       since no-pattern is not present, the  subpattern  matches  nothing.  In
-       other  words,  this  pattern  matches  a  sequence  of non-parentheses,
+       tern is executed and a  closing  parenthesis  is  required.  Otherwise,
+       since  no-pattern  is  not  present, the subpattern matches nothing. In
+       other words,  this  pattern  matches  a  sequence  of  non-parentheses,
        optionally enclosed in parentheses.
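A small sketch of this conditional pattern follows. It assumes Python's re module, whose re.VERBOSE flag plays the role of PCRE_EXTENDED and which supports the same (?(1)...) condition syntax; the subject strings are illustrative.

```python
import re

# re.VERBOSE ignores the white space, much as PCRE_EXTENDED does.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

def accepted(text):
    """Return True if the whole of text matches the pattern."""
    return optional_parens.fullmatch(text) is not None

results = {s: accepted(s) for s in ("(abc)", "abc", "(abc", "abc)")}
```

A subject with only one of the two parentheses is rejected, because the closing parenthesis is required exactly when the opening one was captured.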


-       If you were embedding this pattern in a larger one,  you  could  use  a
+       If  you  were  embedding  this pattern in a larger one, you could use a
        relative reference:


          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...


-       This  makes  the  fragment independent of the parentheses in the larger
+       This makes the fragment independent of the parentheses  in  the  larger
        pattern.


    Checking for a used subpattern by name


-       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
-       used  subpattern  by  name.  For compatibility with earlier versions of
-       PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
-       also  recognized. However, there is a possible ambiguity with this syn-
-       tax, because subpattern names may  consist  entirely  of  digits.  PCRE
-       looks  first for a named subpattern; if it cannot find one and the name
-       consists entirely of digits, PCRE looks for a subpattern of  that  num-
-       ber,  which must be greater than zero. Using subpattern names that con-
+       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
+       used subpattern by name. For compatibility  with  earlier  versions  of
+       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
+       also recognized. However, there is a possible ambiguity with this  syn-
+       tax,  because  subpattern  names  may  consist entirely of digits. PCRE
+       looks first for a named subpattern; if it cannot find one and the  name
+       consists  entirely  of digits, PCRE looks for a subpattern of that num-
+       ber, which must be greater than zero. Using subpattern names that  con-
        sist entirely of digits is not recommended.


        Rewriting the above example to use a named subpattern gives this:


          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )


-       If the name used in a condition of this kind is a duplicate,  the  test
-       is  applied to all subpatterns of the same name, and is true if any one
+       If  the  name used in a condition of this kind is a duplicate, the test
+       is applied to all subpatterns of the same name, and is true if any  one
        of them has matched.


    Checking for pattern recursion


        If the condition is the string (R), and there is no subpattern with the
-       name  R, the condition is true if a recursive call to the whole pattern
+       name R, the condition is true if a recursive call to the whole  pattern
        or any subpattern has been made. If digits or a name preceded by amper-
        sand follow the letter R, for example:


@@ -4831,77 +4894,77 @@

        the condition is true if the most recent recursion is into a subpattern
        whose number or name is given. This condition does not check the entire
-       recursion  stack.  If  the  name  used in a condition of this kind is a
+       recursion stack. If the name used in a condition  of  this  kind  is  a
        duplicate, the test is applied to all subpatterns of the same name, and
        is true if any one of them is the most recent recursion.


-       At  "top  level",  all  these recursion test conditions are false.  The
+       At "top level", all these recursion test  conditions  are  false.   The
        syntax for recursive patterns is described below.


    Defining subpatterns for use by reference only


-       If the condition is the string (DEFINE), and  there  is  no  subpattern
-       with  the  name  DEFINE,  the  condition is always false. In this case,
-       there may be only one alternative  in  the  subpattern.  It  is  always
-       skipped  if  control  reaches  this  point  in the pattern; the idea of
-       DEFINE is that it can be used to define "subroutines" that can be  ref-
-       erenced  from elsewhere. (The use of "subroutines" is described below.)
-       For example, a pattern to match an IPv4 address could be  written  like
+       If  the  condition  is  the string (DEFINE), and there is no subpattern
+       with the name DEFINE, the condition is  always  false.  In  this  case,
+       there  may  be  only  one  alternative  in the subpattern. It is always
+       skipped if control reaches this point  in  the  pattern;  the  idea  of
+       DEFINE  is that it can be used to define "subroutines" that can be ref-
+       erenced from elsewhere. (The use of "subroutines" is described  below.)
+       For  example,  a pattern to match an IPv4 address could be written like
        this (ignore whitespace and line breaks):


          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
          \b (?&byte) (\.(?&byte)){3} \b


-       The  first part of the pattern is a DEFINE group inside which  another
-       group named "byte" is defined. This matches an individual component  of
-       an  IPv4  address  (a number less than 256). When matching takes place,
-       this part of the pattern is skipped because DEFINE acts  like  a  false
-       condition.  The  rest of the pattern uses references to the named group
-       to match the four dot-separated components of an IPv4 address,  insist-
+       The first part of the pattern is a DEFINE group inside  which  another
+       group  named "byte" is defined. This matches an individual component of
+       an IPv4 address (a number less than 256). When  matching  takes  place,
+       this  part  of  the pattern is skipped because DEFINE acts like a false
+       condition. The rest of the pattern uses references to the  named  group
+       to  match the four dot-separated components of an IPv4 address, insist-
        ing on a word boundary at each end.
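
The same "define once, reference several times" idea can be approximated in engines that lack (?(DEFINE)...). The sketch below uses Python's re module, which has neither DEFINE nor (?&name), so the "byte" fragment is kept in an ordinary string and spliced into the full pattern; the names BYTE, IPV4 and is_ipv4 are illustrative assumptions, not part of PCRE.

```python
import re

# The reusable fragment, equivalent to the body of (?<byte> ...) above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One byte, then three more, each preceded by a dot, with word
# boundaries at both ends as in the PCRE pattern.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if the whole of text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

Textual splicing gives the same language as the DEFINE version, but it does not reproduce PCRE's treatment of a subroutine call as an atomic group.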


    Assertion conditions


-       If  the  condition  is  not  in any of the above formats, it must be an
-       assertion.  This may be a positive or negative lookahead or  lookbehind
-       assertion.  Consider  this  pattern,  again  containing non-significant
+       If the condition is not in any of the above  formats,  it  must  be  an
+       assertion.   This may be a positive or negative lookahead or lookbehind
+       assertion. Consider  this  pattern,  again  containing  non-significant
        white space, and with the two alternatives on the second line:


          (?(?=[^a-z]*[a-z])
          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )


-       The condition  is  a  positive  lookahead  assertion  that  matches  an
-       optional  sequence of non-letters followed by a letter. In other words,
-       it tests for the presence of at least one letter in the subject.  If  a
-       letter  is found, the subject is matched against the first alternative;
-       otherwise it is  matched  against  the  second.  This  pattern  matches
-       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+       The  condition  is  a  positive  lookahead  assertion  that  matches an
+       optional sequence of non-letters followed by a letter. In other  words,
+       it  tests  for the presence of at least one letter in the subject. If a
+       letter is found, the subject is matched against the first  alternative;
+       otherwise  it  is  matched  against  the  second.  This pattern matches
+       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
        letters and dd are digits.



COMMENTS

-       The sequence (?# marks the start of a comment that continues up to  the
-       next  closing  parenthesis.  Nested  parentheses are not permitted. The
-       characters that make up a comment play no part in the pattern  matching
+       The  sequence (?# marks the start of a comment that continues up to the
+       next closing parenthesis. Nested parentheses  are  not  permitted.  The
+       characters  that make up a comment play no part in the pattern matching
        at all.


-       If  the PCRE_EXTENDED option is set, an unescaped # character outside a
-       character class introduces a  comment  that  continues  to  immediately
+       If the PCRE_EXTENDED option is set, an unescaped # character outside  a
+       character  class  introduces  a  comment  that continues to immediately
        after the next newline in the pattern.



RECURSIVE PATTERNS

-       Consider  the problem of matching a string in parentheses, allowing for
-       unlimited nested parentheses. Without the use of  recursion,  the  best
-       that  can  be  done  is  to use a pattern that matches up to some fixed
-       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
+       Consider the problem of matching a string in parentheses, allowing  for
+       unlimited  nested  parentheses.  Without the use of recursion, the best
+       that can be done is to use a pattern that  matches  up  to  some  fixed
+       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
        depth.


        For some time, Perl has provided a facility that allows regular expres-
-       sions to recurse (amongst other things). It does this by  interpolating
-       Perl  code in the expression at run time, and the code can refer to the
+       sions  to recurse (amongst other things). It does this by interpolating
+       Perl code in the expression at run time, and the code can refer to  the
        expression itself. A Perl pattern using code interpolation to solve the
        parentheses problem can be created like this:


@@ -4911,182 +4974,182 @@
        refers recursively to the pattern in which it appears.


        Obviously, PCRE cannot support the interpolation of Perl code. Instead,
-       it  supports  special  syntax  for recursion of the entire pattern, and
-       also for individual subpattern recursion.  After  its  introduction  in
-       PCRE  and  Python,  this  kind of recursion was subsequently introduced
+       it supports special syntax for recursion of  the  entire  pattern,  and
+       also  for  individual  subpattern  recursion. After its introduction in
+       PCRE and Python, this kind of  recursion  was  subsequently  introduced
        into Perl at release 5.10.


-       A special item that consists of (? followed by a  number  greater  than
+       A  special  item  that consists of (? followed by a number greater than
        zero and a closing parenthesis is a recursive call of the subpattern of
-       the given number, provided that it occurs inside that  subpattern.  (If
-       not,  it  is  a  "subroutine" call, which is described in the next sec-
-       tion.) The special item (?R) or (?0) is a recursive call of the  entire
+       the  given  number, provided that it occurs inside that subpattern. (If
+       not, it is a "subroutine" call, which is described  in  the  next  sec-
+       tion.)  The special item (?R) or (?0) is a recursive call of the entire
        regular expression.


-       This  PCRE  pattern  solves  the nested parentheses problem (assume the
+       This PCRE pattern solves the nested  parentheses  problem  (assume  the
        PCRE_EXTENDED option is set so that white space is ignored):


          \( ( [^()]++ | (?R) )* \)


-       First it matches an opening parenthesis. Then it matches any number  of
-       substrings  which  can  either  be  a sequence of non-parentheses, or a
-       recursive match of the pattern itself (that is, a  correctly  parenthe-
+       First  it matches an opening parenthesis. Then it matches any number of
+       substrings which can either be a  sequence  of  non-parentheses,  or  a
+       recursive  match  of the pattern itself (that is, a correctly parenthe-
        sized substring).  Finally there is a closing parenthesis. Note the use
        of a possessive quantifier to avoid backtracking into sequences of non-
        parentheses.
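
To make the structure of the recursive pattern concrete, here is a hand-written Python function that accepts the same strings as \( ( [^()]++ | (?R) )* \). It is only an illustrative sketch, not how PCRE implements recursion, and the name match_group is an assumption.

```python
def match_group(subject, pos=0):
    """Match one parenthesized group starting at pos.

    Returns the index just past the closing parenthesis, or -1 if no
    correctly nested group starts at pos.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1  # consume the opening parenthesis
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # Corresponds to (?R): match a nested group recursively.
            pos = match_group(subject, pos)
            if pos < 0:
                return -1
        else:
            # Corresponds to [^()]++: consume a non-parenthesis character.
            pos += 1
    if pos < len(subject) and subject[pos] == ")":
        return pos + 1  # consume the closing parenthesis
    return -1
```

Each call handles one nesting level, so the depth of Python calls mirrors the depth of the parentheses in the subject.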


-       If  this  were  part of a larger pattern, you would not want to recurse
+       If this were part of a larger pattern, you would not  want  to  recurse
        the entire pattern, so instead you could use this:


          ( \( ( [^()]++ | (?1) )* \) )


-       We have put the pattern into parentheses, and caused the  recursion  to
+       We  have  put the pattern into parentheses, and caused the recursion to
        refer to them instead of the whole pattern.


-       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
-       tricky. This is made easier by the use of relative references  (a  Perl
-       5.10  feature).   Instead  of  (?1)  in the pattern above you can write
+       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
+       tricky.  This  is made easier by the use of relative references (a Perl
+       5.10 feature).  Instead of (?1) in the  pattern  above  you  can  write
        (?-2) to refer to the second most recently opened parentheses preceding
-       the  recursion.  In  other  words,  a  negative number counts capturing
+       the recursion. In other  words,  a  negative  number  counts  capturing
        parentheses leftwards from the point at which it is encountered.


-       It is also possible to refer to  subsequently  opened  parentheses,  by
-       writing  references  such  as (?+2). However, these cannot be recursive
-       because the reference is not inside the  parentheses  that  are  refer-
-       enced.  They  are  always  "subroutine" calls, as described in the next
+       It  is  also  possible  to refer to subsequently opened parentheses, by
+       writing references such as (?+2). However, these  cannot  be  recursive
+       because  the  reference  is  not inside the parentheses that are refer-
+       enced. They are always "subroutine" calls, as  described  in  the  next
        section.


-       An alternative approach is to use named parentheses instead.  The  Perl
-       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
+       An  alternative  approach is to use named parentheses instead. The Perl
+       syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
        supported. We could rewrite the above example as follows:


          (?<pn> \( ( [^()]++ | (?&pn) )* \) )


-       If there is more than one subpattern with the same name,  the  earliest
+       If  there  is more than one subpattern with the same name, the earliest
        one is used.


-       This  particular  example pattern that we have been looking at contains
+       This particular example pattern that we have been looking  at  contains
        nested unlimited repeats, and so the use of a possessive quantifier for
        matching strings of non-parentheses is important when applying the pat-
-       tern to strings that do not match. For example, when  this  pattern  is
+       tern  to  strings  that do not match. For example, when this pattern is
        applied to


          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()


-       it  yields  "no  match" quickly. However, if a possessive quantifier is
-       not used, the match runs for a very long time indeed because there  are
-       so  many  different  ways the + and * repeats can carve up the subject,
+       it yields "no match" quickly. However, if a  possessive  quantifier  is
+       not  used, the match runs for a very long time indeed because there are
+       so many different ways the + and * repeats can carve  up  the  subject,
        and all have to be tested before failure can be reported.


-       At the end of a match, the values of capturing  parentheses  are  those
-       from  the outermost level. If you want to obtain intermediate values, a
-       callout function can be used (see below and the pcrecallout  documenta-
+       At  the  end  of a match, the values of capturing parentheses are those
+       from the outermost level. If you want to obtain intermediate values,  a
+       callout  function can be used (see below and the pcrecallout documenta-
        tion). If the pattern above is matched against


          (ab(cd)ef)


-       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
-       which is the last value taken on at the top level. If a capturing  sub-
+       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
+       which  is the last value taken on at the top level. If a capturing sub-
        pattern is not matched at the top level, its final value is unset, even
        if it is (temporarily) set at a deeper level.


-       If there are more than 15 capturing parentheses in a pattern, PCRE  has
-       to  obtain extra memory to store data during a recursion, which it does
+       If  there are more than 15 capturing parentheses in a pattern, PCRE has
+       to obtain extra memory to store data during a recursion, which it  does
        by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
        can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.


-       Do  not  confuse  the (?R) item with the condition (R), which tests for
-       recursion.  Consider this pattern, which matches text in  angle  brack-
-       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
-       brackets (that is, when recursing), whereas any characters are  permit-
+       Do not confuse the (?R) item with the condition (R),  which  tests  for
+       recursion.   Consider  this pattern, which matches text in angle brack-
+       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
+       brackets  (that is, when recursing), whereas any characters are permit-
        ted at the outer level.


          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >


-       In  this  pattern, (?(R) is the start of a conditional subpattern, with
-       two different alternatives for the recursive and  non-recursive  cases.
+       In this pattern, (?(R) is the start of a conditional  subpattern,  with
+       two  different  alternatives for the recursive and non-recursive cases.
        The (?R) item is the actual recursive call.


    Recursion difference from Perl


-       In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
+       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
        always treated as an atomic group. That is, once it has matched some of
        the subject string, it is never re-entered, even if it contains untried
-       alternatives and there is a subsequent matching failure.  This  can  be
-       illustrated  by the following pattern, which purports to match a palin-
-       dromic string that contains an odd number of characters  (for  example,
+       alternatives  and  there  is a subsequent matching failure. This can be
+       illustrated by the following pattern, which purports to match a  palin-
+       dromic  string  that contains an odd number of characters (for example,
        "a", "aba", "abcba", "abcdcba"):


          ^(.|(.)(?1)\2)$


        The idea is that it either matches a single character, or two identical
-       characters surrounding a sub-palindrome. In Perl, this  pattern  works;
-       in  PCRE  it  does  not if the pattern is longer than three characters.
+       characters  surrounding  a sub-palindrome. In Perl, this pattern works;
+       in PCRE it does not if the pattern is  longer  than  three  characters.
        Consider the subject string "abcba":


-       At the top level, the first character is matched, but as it is  not  at
+       At  the  top level, the first character is matched, but as it is not at
        the end of the string, the first alternative fails; the second alterna-
        tive is taken and the recursion kicks in. The recursive call to subpat-
-       tern  1  successfully  matches the next character ("b"). (Note that the
+       tern 1 successfully matches the next character ("b").  (Note  that  the
        beginning and end of line tests are not part of the recursion).


-       Back at the top level, the next character ("c") is compared  with  what
-       subpattern  2 matched, which was "a". This fails. Because the recursion
-       is treated as an atomic group, there are now  no  backtracking  points,
-       and  so  the  entire  match fails. (Perl is able, at this point, to re-
-       enter the recursion and try the second alternative.)  However,  if  the
+       Back  at  the top level, the next character ("c") is compared with what
+       subpattern 2 matched, which was "a". This fails. Because the  recursion
+       is  treated  as  an atomic group, there are now no backtracking points,
+       and so the entire match fails. (Perl is able, at  this  point,  to  re-
+       enter  the  recursion  and try the second alternative.) However, if the
        pattern is written with the alternatives in the other order, things are
        different:


          ^((.)(?1)\2|.)$


-       This time, the recursing alternative is tried first, and  continues  to
-       recurse  until  it runs out of characters, at which point the recursion
-       fails. But this time we do have  another  alternative  to  try  at  the
-       higher  level.  That  is  the  big difference: in the previous case the
+       This  time,  the recursing alternative is tried first, and continues to
+       recurse until it runs out of characters, at which point  the  recursion
+       fails.  But  this  time  we  do  have another alternative to try at the
+       higher level. That is the big difference:  in  the  previous  case  the
        remaining alternative is at a deeper recursion level, which PCRE cannot
        use.


        To change the pattern so that it matches all palindromic strings, not just
-       those with an odd number of characters, it is tempting  to  change  the
+       those  with  an  odd number of characters, it is tempting to change the
        pattern to this:


          ^((.)(?1)\2|.?)$


-       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
-       When a deeper recursion has matched a single character,  it  cannot  be
-       entered  again  in  order  to match an empty string. The solution is to
-       separate the two cases, and write out the odd and even cases as  alter-
+       Again, this works in Perl, but not in PCRE, and for  the  same  reason.
+       When  a  deeper  recursion has matched a single character, it cannot be
+       entered again in order to match an empty string.  The  solution  is  to
+       separate  the two cases, and write out the odd and even cases as alter-
        natives at the higher level:


          ^(?:((.)(?1)\2|)|((.)(?3)\4|.))


-       If  you  want  to match typical palindromic phrases, the pattern has to
+       If you want to match typical palindromic phrases, the  pattern  has  to
        ignore all non-word characters, which can be done like this:


          ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$


        If run with the PCRE_CASELESS option, this pattern matches phrases such
        as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
-       Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
-       ing  into  sequences of non-word characters. Without this, PCRE takes a
-       great deal longer (ten times or more) to  match  typical  phrases,  and
+       Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-
+       ing into sequences of non-word characters. Without this, PCRE  takes  a
+       great  deal  longer  (ten  times or more) to match typical phrases, and
        Perl takes so long that you think it has gone into a loop.


-       WARNING:  The  palindrome-matching patterns above work only if the sub-
-       ject string does not start with a palindrome that is shorter  than  the
-       entire  string.  For example, although "abcba" is correctly matched, if
-       the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
-       then  fails at top level because the end of the string does not follow.
-       Once again, it cannot jump back into the recursion to try other  alter-
+       WARNING: The palindrome-matching patterns above work only if  the  sub-
+       ject  string  does not start with a palindrome that is shorter than the
+       entire string.  For example, although "abcba" is correctly matched,  if
+       the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,
+       then fails at top level because the end of the string does not  follow.
+       Once  again, it cannot jump back into the recursion to try other alter-
        natives, so the entire match fails.



SUBPATTERNS AS SUBROUTINES

        If the syntax for a recursive subpattern reference (either by number or
-       by name) is used outside the parentheses to which it refers,  it  oper-
-       ates  like a subroutine in a programming language. The "called" subpat-
+       by  name)  is used outside the parentheses to which it refers, it oper-
+       ates like a subroutine in a programming language. The "called"  subpat-
        tern may be defined before or after the reference. A numbered reference
        can be absolute or relative, as in these examples:


@@ -5098,175 +5161,175 @@

          (sens|respons)e and \1ibility


-       matches  "sense and sensibility" and "response and responsibility", but
+       matches "sense and sensibility" and "response and responsibility",  but
        not "sense and responsibility". If instead the pattern


          (sens|respons)e and (?1)ibility


-       is used, it does match "sense and responsibility" as well as the  other
-       two  strings.  Another  example  is  given  in the discussion of DEFINE
+       is  used, it does match "sense and responsibility" as well as the other
+       two strings. Another example is  given  in  the  discussion  of  DEFINE
        above.
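
The difference between a back reference and a subroutine call can be demonstrated outside PCRE. In this sketch, Python's re module stands in for the back-reference case; because it has no (?1) syntax, the subroutine case is imitated by repeating the subpattern text, which is an assumption of the example and does not reproduce PCRE's atomic-group behaviour. The names STEM, BACKREF and SUBROUTINE_LIKE are hypothetical.

```python
import re

STEM = r"(?:sens|respons)"

# \1 must repeat the exact text that group 1 matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Repeating the subpattern re-runs the alternation, as (?1) would.
SUBROUTINE_LIKE = re.compile(rf"{STEM}e and {STEM}ibility")


def matches(pattern, text):
    """Return True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None
```

With these definitions, only the second pattern accepts "sense and responsibility".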


-       Like recursive subpatterns, a subroutine call is always treated  as  an
-       atomic  group. That is, once it has matched some of the subject string,
-       it is never re-entered, even if it contains  untried  alternatives  and
-       there  is a subsequent matching failure. Any capturing parentheses that
-       are set during the subroutine call  revert  to  their  previous  values
+       Like  recursive  subpatterns, a subroutine call is always treated as an
+       atomic group. That is, once it has matched some of the subject  string,
+       it  is  never  re-entered, even if it contains untried alternatives and
+       there is a subsequent matching failure. Any capturing parentheses  that
+       are  set  during  the  subroutine  call revert to their previous values
        afterwards.


-       When  a  subpattern is used as a subroutine, processing options such as
+       When a subpattern is used as a subroutine, processing options  such  as
        case-independence are fixed when the subpattern is defined. They cannot
        be changed for different calls. For example, consider this pattern:


          (abc)(?i:(?-1))


-       It  matches  "abcabc". It does not match "abcABC" because the change of
+       It matches "abcabc". It does not match "abcABC" because the  change  of
        processing option does not affect the called subpattern.



ONIGURUMA SUBROUTINE SYNTAX

-       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
+       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
        name or a number enclosed either in angle brackets or single quotes, is
-       an alternative syntax for referencing a  subpattern  as  a  subroutine,
-       possibly  recursively. Here are two of the examples used above, rewrit-
+       an  alternative  syntax  for  referencing a subpattern as a subroutine,
+       possibly recursively. Here are two of the examples used above,  rewrit-
        ten using this syntax:


          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
          (sens|respons)e and \g'1'ibility


-       PCRE supports an extension to Oniguruma: if a number is preceded  by  a
+       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
        plus or a minus sign it is taken as a relative reference. For example:


          (abc)(?i:\g<-1>)


-       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
-       synonymous. The former is a back reference; the latter is a  subroutine
+       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
+       synonymous.  The former is a back reference; the latter is a subroutine
        call.



CALLOUTS

        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
-       Perl code to be obeyed in the middle of matching a regular  expression.
+       Perl  code to be obeyed in the middle of matching a regular expression.
        This makes it possible, amongst other things, to extract different sub-
        strings that match the same pair of parentheses when there is a repeti-
        tion.


        PCRE provides a similar feature, but of course it cannot obey arbitrary
        Perl code. The feature is called "callout". The caller of PCRE provides
-       an  external function by putting its entry point in the global variable
-       pcre_callout.  By default, this variable contains NULL, which  disables
+       an external function by putting its entry point in the global  variable
+       pcre_callout.   By default, this variable contains NULL, which disables
        all calling out.


-       Within  a  regular  expression,  (?C) indicates the points at which the
-       external function is to be called. If you want  to  identify  different
-       callout  points, you can put a number less than 256 after the letter C.
-       The default value is zero.  For example, this pattern has  two  callout
+       Within a regular expression, (?C) indicates the  points  at  which  the
+       external  function  is  to be called. If you want to identify different
+       callout points, you can put a number less than 256 after the letter  C.
+       The  default  value is zero.  For example, this pattern has two callout
        points:


          (?C1)abc(?C2)def


        If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
-       automatically installed before each item in the pattern. They  are  all
+       automatically  installed  before each item in the pattern. They are all
        numbered 255.


        During matching, when PCRE reaches a callout point (and pcre_callout is
-       set), the external function is called. It is provided with  the  number
-       of  the callout, the position in the pattern, and, optionally, one item
-       of data originally supplied by the caller of pcre_exec().  The  callout
-       function  may cause matching to proceed, to backtrack, or to fail alto-
+       set),  the  external function is called. It is provided with the number
+       of the callout, the position in the pattern, and, optionally, one  item
+       of  data  originally supplied by the caller of pcre_exec(). The callout
+       function may cause matching to proceed, to backtrack, or to fail  alto-
        gether. A complete description of the interface to the callout function
        is given in the pcrecallout documentation.



BACKTRACKING CONTROL

-       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
+       Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
        which are described in the Perl documentation as "experimental and sub-
-       ject  to  change or removal in a future version of Perl". It goes on to
-       say: "Their usage in production code should be noted to avoid  problems
+       ject to change or removal in a future version of Perl". It goes  on  to
+       say:  "Their usage in production code should be noted to avoid problems
        during upgrades." The same remarks apply to the PCRE features described
        in this section.


-       Since these verbs are specifically related  to  backtracking,  most  of
-       them  can  be  used  only  when  the  pattern  is  to  be matched using
+       Since  these  verbs  are  specifically related to backtracking, most of
+       them can be  used  only  when  the  pattern  is  to  be  matched  using
        pcre_exec(), which uses a backtracking algorithm. With the exception of
        (*FAIL), which behaves like a failing negative assertion, they cause an
        error if encountered by pcre_dfa_exec().


        If any of these verbs are used in an assertion or subroutine subpattern
-       (including  recursive  subpatterns),  their  effect is confined to that
-       subpattern; it does not extend to the surrounding  pattern.  Note  that
-       such  subpatterns are processed as anchored at the point where they are
+       (including recursive subpatterns), their effect  is  confined  to  that
+       subpattern;  it  does  not extend to the surrounding pattern. Note that
+       such subpatterns are processed as anchored at the point where they  are
        tested.


-       The new verbs make use of what was previously invalid syntax: an  open-
+       The  new verbs make use of what was previously invalid syntax: an open-
        ing parenthesis followed by an asterisk. They are generally of the form
-       (*VERB) or (*VERB:NAME). Some may take either form, with differing  be-
+       (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-
        haviour, depending on whether or not an argument is present.  A name is
-       a sequence of letters, digits, and underscores. If the name  is  empty,
-       that  is, if the closing parenthesis immediately follows the colon, the
+       a  sequence  of letters, digits, and underscores. If the name is empty,
+       that is, if the closing parenthesis immediately follows the colon,  the
        effect is as if the colon were not there. Any number of these verbs may
        occur in a pattern.


-       PCRE  contains some optimizations that are used to speed up matching by
+       PCRE contains some optimizations that are used to speed up matching  by
        running some checks at the start of each match attempt. For example, it
-       may  know  the minimum length of matching subject, or that a particular
-       character must be present. When one of these  optimizations  suppresses
-       the  running  of  a match, any included backtracking verbs will not, of
+       may know the minimum length of matching subject, or that  a  particular
+       character  must  be present. When one of these optimizations suppresses
+       the running of a match, any included backtracking verbs  will  not,  of
        course, be processed. You can suppress the start-of-match optimizations
        by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().


    Verbs that act immediately


-       The  following  verbs act as soon as they are encountered. They may not
+       The following verbs act as soon as they are encountered. They  may  not
        be followed by a name.


           (*ACCEPT)


-       This verb causes the match to end successfully, skipping the  remainder
-       of  the pattern. When inside a recursion, only the innermost pattern is
-       ended immediately. If (*ACCEPT) is inside  capturing  parentheses,  the
-       data  so  far  is  captured. (This feature was added to PCRE at release
+       This  verb causes the match to end successfully, skipping the remainder
+       of the pattern. When inside a recursion, only the innermost pattern  is
+       ended  immediately.  If  (*ACCEPT) is inside capturing parentheses, the
+       data so far is captured. (This feature was added  to  PCRE  at  release
        8.00.) For example:


          A((?:A|B(*ACCEPT)|C)D)


-       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
+       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
        tured by the outer parentheses.


          (*FAIL) or (*F)


-       This  verb  causes the match to fail, forcing backtracking to occur. It
-       is equivalent to (?!) but easier to read. The Perl documentation  notes
-       that  it  is  probably  useful only when combined with (?{}) or (??{}).
-       Those are, of course, Perl features that are not present in  PCRE.  The
-       nearest  equivalent is the callout feature, as for example in this pat-
+       This verb causes the match to fail, forcing backtracking to  occur.  It
+       is  equivalent to (?!) but easier to read. The Perl documentation notes
+       that it is probably useful only when combined  with  (?{})  or  (??{}).
+       Those  are,  of course, Perl features that are not present in PCRE. The
+       nearest equivalent is the callout feature, as for example in this  pat-
        tern:


          a+(?C)(*FAIL)


-       A match with the string "aaaa" always fails, but the callout  is  taken
+       A  match  with the string "aaaa" always fails, but the callout is taken
        before each backtrack happens (in this example, 10 times).


    Recording which path was taken


-       There  is  one  verb  whose  main  purpose  is to track how a match was
-       arrived at, though it also has a  secondary  use  in  conjunction  with
+       There is one verb whose main purpose  is  to  track  how  a  match  was
+       arrived  at,  though  it  also  has a secondary use in conjunction with
        advancing the match starting point (see (*SKIP) below).


          (*MARK:NAME) or (*:NAME)


-       A  name  is  always  required  with  this  verb.  There  may be as many
-       instances of (*MARK) as you like in a pattern, and their names  do  not
+       A name is always  required  with  this  verb.  There  may  be  as  many
+       instances  of  (*MARK) as you like in a pattern, and their names do not
        have to be unique.


-       When  a  match  succeeds,  the  name of the last-encountered (*MARK) is
-       passed back to  the  caller  via  the  pcre_extra  data  structure,  as
+       When a match succeeds, the name  of  the  last-encountered  (*MARK)  is
+       passed  back  to  the  caller  via  the  pcre_extra  data structure, as
        described in the section on pcre_extra in the pcreapi documentation. No
-       data is returned for a partial match. Here is an  example  of  pcretest
-       output,  where the /K modifier requests the retrieval and outputting of
+       data  is  returned  for a partial match. Here is an example of pcretest
+       output, where the /K modifier requests the retrieval and outputting  of
        (*MARK) data:


          /X(*MARK:A)Y|X(*MARK:B)Z/K
@@ -5278,13 +5341,13 @@
          MK: B


        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
-       ple  it indicates which of the two alternatives matched. This is a more
-       efficient way of obtaining this information than putting each  alterna-
+       ple it indicates which of the two alternatives matched. This is a  more
+       efficient  way of obtaining this information than putting each alterna-
        tive in its own capturing parentheses.


-       A  name  may  also  be  returned after a failed match if the final path
-       through the pattern involves (*MARK). However, unless (*MARK) is used in
-       conjunction  with  (*COMMIT),  this  is unlikely to happen for an unan-
+       A name may also be returned after a failed  match  if  the  final  path
+       through the pattern involves (*MARK). However, unless (*MARK) is used in
+       conjunction with (*COMMIT), this is unlikely to  happen  for  an  unan-
        chored pattern because, as the starting point for matching is advanced,
        the final check is often with an empty string, causing a failure before
        (*MARK) is reached. For example:
@@ -5294,56 +5357,56 @@
          No match


        There are three potential starting points for this match (starting with
-       X,  starting  with  P,  and  with  an  empty string). If the pattern is
+       X, starting with P, and with  an  empty  string).  If  the  pattern  is
        anchored, the result is different:


          /^X(*MARK:A)Y|^X(*MARK:B)Z/K
          XP
          No match, mark = B


-       PCRE's start-of-match optimizations can also interfere with  this.  For
-       example,  if, as a result of a call to pcre_study(), it knows the mini-
-       mum subject length for a match, a shorter subject will not  be  scanned
+       PCRE's  start-of-match  optimizations can also interfere with this. For
+       example, if, as a result of a call to pcre_study(), it knows the  mini-
+       mum  subject  length for a match, a shorter subject will not be scanned
        at all.


        Note that similar anomalies (though different in detail) exist in Perl,
-       no doubt for the same reasons. The use of (*MARK) data after  a  failed
-       match  of an unanchored pattern is not recommended, unless (*COMMIT) is
+       no  doubt  for the same reasons. The use of (*MARK) data after a failed
+       match of an unanchored pattern is not recommended, unless (*COMMIT)  is
        involved.


    Verbs that act after backtracking


        The following verbs do nothing when they are encountered. Matching con-
-       tinues  with what follows, but if there is no subsequent match, causing
-       a backtrack to the verb, a failure is  forced.  That  is,  backtracking
-       cannot  pass  to the left of the verb. However, when one of these verbs
-       appears inside an atomic group, its effect is confined to  that  group,
-       because  once the group has been matched, there is never any backtrack-
-       ing into it. In this situation, backtracking can  "jump  back"  to  the
-       left  of the entire atomic group. (Remember also, as stated above, that
+       tinues with what follows, but if there is no subsequent match,  causing
+       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
+       cannot pass to the left of the verb. However, when one of  these  verbs
+       appears  inside  an atomic group, its effect is confined to that group,
+       because once the group has been matched, there is never any  backtrack-
+       ing  into  it.  In  this situation, backtracking can "jump back" to the
+       left of the entire atomic group. (Remember also, as stated above,  that
        this localization also applies in subroutine calls and assertions.)


-       These verbs differ in exactly what kind of failure  occurs  when  back-
+       These  verbs  differ  in exactly what kind of failure occurs when back-
        tracking reaches them.


          (*COMMIT)


-       This  verb, which may not be followed by a name, causes the whole match
+       This verb, which may not be followed by a name, causes the whole  match
        to fail outright if the rest of the pattern does not match. Even if the
        pattern is unanchored, no further attempts to find a match by advancing
        the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
-       pcre_exec()  is  committed  to  finding a match at the current starting
+       pcre_exec() is committed to finding a match  at  the  current  starting
        point, or not at all. For example:


          a+(*COMMIT)b


-       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
+       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
        of dynamic anchor, or "I've started, so I must finish." The name of the
-       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
+       most  recently passed (*MARK) in the path is passed back when (*COMMIT)
        forces a match failure.


-       Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
-       anchor, unless PCRE's start-of-match optimizations are turned  off,  as
+       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
+       anchor,  unless  PCRE's start-of-match optimizations are turned off, as
        shown in this pcretest example:


          /(*COMMIT)abc/
@@ -5352,71 +5415,71 @@
          xyzabc\Y
          No match


-       PCRE  knows  that  any  match  must start with "a", so the optimization
-       skips along the subject to "a" before running the first match  attempt,
-       which  succeeds.  When the optimization is disabled by the \Y escape in
+       PCRE knows that any match must start  with  "a",  so  the  optimization
+       skips  along the subject to "a" before running the first match attempt,
+       which succeeds. When the optimization is disabled by the \Y  escape  in
        the second subject, the match starts at "x" and so the (*COMMIT) causes
        it to fail without trying any other starting points.


          (*PRUNE) or (*PRUNE:NAME)


-       This  verb causes the match to fail at the current starting position in
-       the subject if the rest of the pattern does not match. If  the  pattern
-       is  unanchored,  the  normal  "bumpalong"  advance to the next starting
-       character then happens. Backtracking can occur as usual to the left  of
-       (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
-       (*PRUNE), but if there is no match to the  right,  backtracking  cannot
-       cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
-       native to an atomic group or possessive quantifier, but there are  some
+       This verb causes the match to fail at the current starting position  in
+       the  subject  if the rest of the pattern does not match. If the pattern
+       is unanchored, the normal "bumpalong"  advance  to  the  next  starting
+       character  then happens. Backtracking can occur as usual to the left of
+       (*PRUNE), before it is reached,  or  when  matching  to  the  right  of
+       (*PRUNE),  but  if  there is no match to the right, backtracking cannot
+       cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter-
+       native  to an atomic group or possessive quantifier, but there are some
        uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
-       iour of (*PRUNE:NAME) is the  same  as  (*MARK:NAME)(*PRUNE)  when  the
-       match  fails  completely;  the name is passed back if this is the final
-       attempt.  (*PRUNE:NAME) does not pass back a name  if  the  match  suc-
-       ceeds.  In  an  anchored pattern (*PRUNE) has the same effect as (*COM-
+       iour  of  (*PRUNE:NAME)  is  the  same as (*MARK:NAME)(*PRUNE) when the
+       match fails completely; the name is passed back if this  is  the  final
+       attempt.   (*PRUNE:NAME)  does  not  pass back a name if the match suc-
+       ceeds. In an anchored pattern (*PRUNE) has the same  effect  as  (*COM-
        MIT).


          (*SKIP)


-       This verb, when given without a name, is like (*PRUNE), except that  if
-       the  pattern  is unanchored, the "bumpalong" advance is not to the next
+       This  verb, when given without a name, is like (*PRUNE), except that if
+       the pattern is unanchored, the "bumpalong" advance is not to  the  next
        character, but to the position in the subject where (*SKIP) was encoun-
-       tered.  (*SKIP)  signifies that whatever text was matched leading up to
+       tered. (*SKIP) signifies that whatever text was matched leading  up  to
        it cannot be part of a successful match. Consider:


          a+(*SKIP)b


-       If the subject is "aaaac...",  after  the  first  match  attempt  fails
-       (starting  at  the  first  character in the string), the starting point
+       If  the  subject  is  "aaaac...",  after  the first match attempt fails
+       (starting at the first character in the  string),  the  starting  point
        skips on to start the next attempt at "c". Note that a possessive quan-
-       tifier does not have the same effect as this example; although it would
-       suppress backtracking  during  the  first  match  attempt,  the  second
-       attempt  would  start at the second character instead of skipping on to
+       tifier does not have the same effect as this example; although it would
+       suppress  backtracking  during  the  first  match  attempt,  the second
+       attempt would start at the second character instead of skipping  on  to
        "c".


          (*SKIP:NAME)


-       When (*SKIP) has an associated name, its behaviour is modified. If  the
+       When  (*SKIP) has an associated name, its behaviour is modified. If the
        following pattern fails to match, the previous path through the pattern
-       is searched for the most recent (*MARK) that has the same name. If  one
-       is  found, the "bumpalong" advance is to the subject position that cor-
-       responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
-       If  no (*MARK) with a matching name is found, normal "bumpalong" of one
+       is  searched for the most recent (*MARK) that has the same name. If one
+       is found, the "bumpalong" advance is to the subject position that  cor-
+       responds  to  that (*MARK) instead of to where (*SKIP) was encountered.
+       If no (*MARK) with a matching name is found, normal "bumpalong" of  one
        character happens (the (*SKIP) is ignored).


          (*THEN) or (*THEN:NAME)


        This verb causes a skip to the next alternation if the rest of the pat-
        tern does not match. That is, it cancels pending backtracking, but only
-       within the current alternation. Its name  comes  from  the  observation
+       within  the  current  alternation.  Its name comes from the observation
        that it can be used for a pattern-based if-then-else block:


          ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...


-       If  the COND1 pattern matches, FOO is tried (and possibly further items
-       after the end of the group if FOO succeeds);  on  failure  the  matcher
-       skips  to  the second alternative and tries COND2, without backtracking
-       into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
-       (*MARK:NAME)(*THEN)  if  the  overall  match  fails.  If (*THEN) is not
+       If the COND1 pattern matches, FOO is tried (and possibly further  items
+       after  the  end  of  the group if FOO succeeds); on failure the matcher
+       skips to the second alternative and tries COND2,  without  backtracking
+       into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as
+       (*MARK:NAME)(*THEN) if the overall  match  fails.  If  (*THEN)  is  not
        directly inside an alternation, it acts like (*PRUNE).



@@ -5434,11 +5497,11 @@

REVISION

-       Last updated: 05 May 2010
+       Last updated: 18 May 2010
        Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCRESYNTAX(3)                                                    PCRESYNTAX(3)



@@ -5494,7 +5557,9 @@
          \W         a "non-word" character
          \X         an extended Unicode sequence


-       In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
+       In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
+       characters, even in UTF-8 mode. However, this can be changed by setting
+       the PCRE_UCP option.



 GENERAL CATEGORY PROPERTIES FOR \p and \P
@@ -5594,8 +5659,9 @@
          word        same as \w
          xdigit      hexadecimal digit


-       In PCRE, POSIX character set names recognize only ASCII characters. You
-       can use \Q...\E inside a character class.
+       In PCRE, POSIX character set names recognize only ASCII  characters  by
+       default,  but  some  of them use Unicode properties if PCRE_UCP is set.
+       You can use \Q...\E inside a character class.



QUANTIFIERS
@@ -5620,7 +5686,7 @@

ANCHORS AND SIMPLE ASSERTIONS

-         \b          word boundary (only ASCII letters recognized)
+         \b          word boundary
          \B          not a word boundary
          ^           start of subject
                       also after internal newline in multiline mode
@@ -5675,10 +5741,11 @@
          (?x)            extended (ignore white space)
          (?-...)         unset option(s)


-       The following is recognized only at the start of a pattern or after one
-       of the newline-setting options with similar syntax:
+       The following are recognized only at the start of a  pattern  or  after
+       one of the newline-setting options with similar syntax:


-         (*UTF8)         set UTF-8 mode
+         (*UTF8)         set UTF-8 mode (PCRE_UTF8)
+         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)



 LOOKAHEAD AND LOOKBEHIND ASSERTIONS
@@ -5747,7 +5814,7 @@
          (*ACCEPT)       force successful match
          (*FAIL)         force backtrack; synonym (*F)


-       The following act only when a subsequent match failure causes  a  back-
+       The  following  act only when a subsequent match failure causes a back-
        track to reach them. They all force a match failure, but they differ in
        what happens afterwards. Those that advance the start-of-match point do
        so only if the pattern is not anchored.
@@ -5760,8 +5827,8 @@


NEWLINE CONVENTIONS

-       These  are  recognized only at the very start of the pattern or after a
-       (*BSR_...) or (*UTF8) option.
+       These are recognized only at the very start of the pattern or  after  a
+       (*BSR_...) or (*UTF8) or (*UCP) option.


          (*CR)           carriage return only
          (*LF)           linefeed only
@@ -5772,8 +5839,8 @@


WHAT \R MATCHES

-       These are recognized only at the very start of the pattern or  after  a
-       (*...) option that sets the newline convention or UTF-8 mode.
+       These  are  recognized only at the very start of the pattern or after a
+       (*...) option that sets the newline convention or UTF-8 or UCP mode.


          (*BSR_ANYCRLF)  CR, LF, or CRLF
          (*BSR_UNICODE)  any Unicode newline sequence
@@ -5799,11 +5866,11 @@


REVISION

-       Last updated: 05 May 2010
+       Last updated: 12 May 2010
        Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREPARTIAL(3)                                                  PCREPARTIAL(3)



@@ -6184,8 +6251,8 @@
        Last updated: 19 October 2009
        Copyright (c) 1997-2009 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)



@@ -6308,8 +6375,8 @@
        Last updated: 13 June 2007
        Copyright (c) 1997-2007 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREPERFORM(3)                                                  PCREPERFORM(3)



@@ -6400,6 +6467,14 @@
        If you can find an alternative pattern  that  does  not  use  character
        properties, it will probably be faster.


+       By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX
+       character classes such as [:alpha:]  do  not  use  Unicode  properties,
+       partly for backwards compatibility, and partly for performance reasons.
+       However, you can set PCRE_UCP if you want Unicode character  properties
+       to  be  used.  This  can double the matching time for items such as \d,
+       when matched with  pcre_exec();  the  performance  loss  is  less  with
+       pcre_dfa_exec(), and in both cases there is not much difference for \b.
+
        When  a  pattern  begins  with .* not in parentheses, or in parentheses
        that are not the subject of a backreference, and the PCRE_DOTALL option
        is  set, the pattern is implicitly anchored by PCRE, since it can match
@@ -6465,11 +6540,11 @@


REVISION

-       Last updated: 07 March 2010
+       Last updated: 16 May 2010
        Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCREPOSIX(3)                                                      PCREPOSIX(3)



@@ -6571,33 +6646,40 @@
        ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
        strings are returned.


+         REG_UCP
+
+       The PCRE_UCP option is set when the regular expression  is  passed  for
+       compilation  to  the  native  function. This causes PCRE to use Unicode
+       properties when matching \d, \w,  etc.,  instead  of  just  recognizing
+       ASCII values. Note that REG_UCP is not part of the POSIX standard.
+
          REG_UNGREEDY


-       The PCRE_UNGREEDY option is set when the regular expression  is  passed
-       for  compilation  to the native function. Note that REG_UNGREEDY is not
+       The  PCRE_UNGREEDY  option is set when the regular expression is passed
+       for compilation to the native function. Note that REG_UNGREEDY  is  not
        part of the POSIX standard.


          REG_UTF8


-       The PCRE_UTF8 option is set when the regular expression is  passed  for
-       compilation  to the native function. This causes the pattern itself and
-       all data strings used for matching it to be treated as  UTF-8  strings.
+       The  PCRE_UTF8  option is set when the regular expression is passed for
+       compilation to the native function. This causes the pattern itself  and
+       all  data  strings used for matching it to be treated as UTF-8 strings.
        Note that REG_UTF8 is not part of the POSIX standard.


-       In  the  absence  of  these  flags, no options are passed to the native
-       function.  This means that the regex  is  compiled  with  PCRE  default
-       semantics.  In particular, the way it handles newline characters in the
-       subject string is the Perl way, not the POSIX way.  Note  that  setting
-       PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.
-       It does not affect the way newlines are matched by . (they are not)  or
+       In the absence of these flags, no options  are  passed  to  the  native
+       function.  This  means  that  the  regex  is compiled with PCRE default
+       semantics. In particular, the way it handles newline characters in  the
+       subject  string  is  the Perl way, not the POSIX way. Note that setting
+       PCRE_MULTILINE has only some of the effects specified for  REG_NEWLINE.
+       It  does not affect the way newlines are matched by . (they are not) or
        by a negative class such as [^a] (they are).


-       The  yield of regcomp() is zero on success, and non-zero otherwise. The
+       The yield of regcomp() is zero on success, and non-zero otherwise.  The
        preg structure is filled in on success, and one member of the structure
-       is  public: re_nsub contains the number of capturing subpatterns in the
+       is public: re_nsub contains the number of capturing subpatterns in  the
        regular expression. Various error codes are defined in the header file.


-       NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
+       NOTE:  If  the  yield of regcomp() is non-zero, you must not attempt to
        use the contents of the preg structure. If, for example, you pass it to
        regexec(), the result is undefined and your program is likely to crash.
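
The rule in the NOTE above can be sketched as follows. This is a minimal illustration, not part of the committed documentation: it assumes the system's POSIX <regex.h>, whose regcomp()/regfree() interface is the one pcreposix.h mirrors, and the helper name count_subpatterns is hypothetical. REG_EXTENDED is passed only because the system implementation needs it for grouping parentheses.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of capturing subpatterns in
   pattern, or -1 if compilation fails. */
long count_subpatterns(const char *pattern)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc != 0) {
        /* regcomp() failed: the contents of preg must not be used,
           and in particular it must not be passed to regexec(). */
        return -1;
    }

    /* On success re_nsub is the one public member of the structure. */
    long n = (long)preg.re_nsub;
    regfree(&preg);
    return n;
}
```

Calling count_subpatterns("(a)(b)") should give 2, while an unbalanced pattern such as "(a" should give -1 without preg ever being read.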


@@ -6605,9 +6687,9 @@
MATCHING NEWLINE CHARACTERS

        This area is not simple, because POSIX and Perl take different views of
-       things.   It  is  not possible to get PCRE to obey POSIX semantics, but
-       then PCRE was never intended to be a POSIX engine. The following  table
-       lists  the  different  possibilities for matching newline characters in
+       things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
+       then  PCRE was never intended to be a POSIX engine. The following table
+       lists the different possibilities for matching  newline  characters  in
        PCRE:


                                  Default   Change with
@@ -6629,19 +6711,19 @@
          ^ matches \n in middle     no     REG_NEWLINE


        PCRE's behaviour is the same as Perl's, except that there is no equiva-
-       lent  for  PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
+       lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
        no way to stop newline from matching [^a].


-       The  default  POSIX  newline  handling  can  be  obtained  by   setting
-       PCRE_DOTALL  and  PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
+       The   default  POSIX  newline  handling  can  be  obtained  by  setting
+       PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
        behave exactly as for the REG_NEWLINE action.



MATCHING A PATTERN

-       The function regexec() is called  to  match  a  compiled  pattern  preg
-       against  a  given string, which is by default terminated by a zero byte
-       (but see REG_STARTEND below), subject to the options in  eflags.  These
+       The  function  regexec()  is  called  to  match a compiled pattern preg
+       against a given string, which is by default terminated by a  zero  byte
+       (but  see  REG_STARTEND below), subject to the options in eflags. These
        can be:


          REG_NOTBOL
@@ -6663,17 +6745,17 @@


          REG_STARTEND


-       The string is considered to start at string +  pmatch[0].rm_so  and  to
-       have  a terminating NUL located at string + pmatch[0].rm_eo (there need
-       not actually be a NUL at that location), regardless  of  the  value  of
-       nmatch.  This  is a BSD extension, compatible with but not specified by
-       IEEE Standard 1003.2 (POSIX.2), and should  be  used  with  caution  in
+       The  string  is  considered to start at string + pmatch[0].rm_so and to
+       have a terminating NUL located at string + pmatch[0].rm_eo (there  need
+       not  actually  be  a  NUL at that location), regardless of the value of
+       nmatch. This is a BSD extension, compatible with but not  specified  by
+       IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
        software intended to be portable to other systems. Note that a non-zero
        rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
        of the string, not how it is matched.


-       If  the pattern was compiled with the REG_NOSUB flag, no data about any
-       matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
+       If the pattern was compiled with the REG_NOSUB flag, no data about  any
+       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
        regexec() are ignored.


        If the value of nmatch is zero, or if the value pmatch is NULL, no data
@@ -6681,34 +6763,34 @@


        Otherwise, the portion of the string that was matched, and also any cap-
        tured substrings, are returned via the pmatch argument, which points to
-       an array of nmatch structures of type regmatch_t, containing  the  mem-
-       bers  rm_so  and rm_eo. These contain the offset to the first character
-       of each substring and the offset to the first character after  the  end
-       of  each substring, respectively. The 0th element of the vector relates
-       to the entire portion of string that was matched;  subsequent  elements
-       relate  to  the capturing subpatterns of the regular expression. Unused
+       an  array  of nmatch structures of type regmatch_t, containing the mem-
+       bers rm_so and rm_eo. These contain the offset to the  first  character
+       of  each  substring and the offset to the first character after the end
+       of each substring, respectively. The 0th element of the vector  relates
+       to  the  entire portion of string that was matched; subsequent elements
+       relate to the capturing subpatterns of the regular  expression.  Unused
        entries in the array have both structure members set to -1.
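
How the rm_so and rm_eo offsets described above are read back can be sketched like this. It is an illustration under assumptions: it uses the system's POSIX <regex.h> rather than pcreposix.h (the calls have the same shape), and the helper name first_group_span is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match pattern against subject and, on success,
   store the start and end offsets of capturing subpattern 1 in *so and
   *eo. Returns 0 on a match, otherwise the regcomp()/regexec() code. */
int first_group_span(const char *pattern, const char *subject,
                     int *so, int *eo)
{
    regex_t preg;
    regmatch_t pmatch[2];   /* [0] = whole match, [1] = subpattern 1 */

    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;          /* do not use preg after a failed compile */

    rc = regexec(&preg, subject, 2, pmatch, 0);
    if (rc == 0) {
        /* rm_so is the offset of the first character of the substring,
           rm_eo the offset of the first character after its end. */
        *so = (int)pmatch[1].rm_so;
        *eo = (int)pmatch[1].rm_eo;
    }

    regfree(&preg);
    return rc;
}
```

With the pattern "c(a+)t" and the subject "the caat", the stored offsets should be 5 and 7; a subject that does not match yields REG_NOMATCH.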


-       A successful match yields  a  zero  return;  various  error  codes  are
-       defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
+       A  successful  match  yields  a  zero  return;  various error codes are
+       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
        failure code.



ERROR MESSAGES

        The regerror() function maps a non-zero errorcode from either regcomp()
-       or  regexec()  to  a  printable message. If preg is not NULL, the error
+       or regexec() to a printable message. If preg is  not  NULL,  the  error
        should have arisen from the use of that structure. A message terminated
-       by  a  binary  zero  is  placed  in  errbuf. The length of the message,
-       including the zero, is limited to errbuf_size. The yield of  the  func-
+       by a binary zero is placed  in  errbuf.  The  length  of  the  message,
+       including  the  zero, is limited to errbuf_size. The yield of the func-
        tion is the size of buffer needed to hold the whole message.
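
Because regerror() yields the buffer size needed for the whole message, one possible usage is to call it twice: once to learn the size and once to fill a buffer of that size. The sketch below assumes the system's POSIX <regex.h>, where a zero errbuf_size makes regerror() ignore errbuf; whether a given pcreposix build accepts a NULL buffer in the sizing call has not been checked here. The helper name compile_error_message is hypothetical.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd error message if pattern
   fails to compile, or NULL if it compiles (or if malloc fails).
   The caller frees the result. */
char *compile_error_message(const char *pattern)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&preg);
        return NULL;
    }

    /* First call: size of buffer needed, including the final zero. */
    size_t needed = regerror(rc, &preg, NULL, 0);

    char *buf = malloc(needed);
    if (buf != NULL)
        regerror(rc, &preg, buf, needed);   /* second call fills buf */
    return buf;
}
```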



MEMORY USAGE

-       Compiling  a regular expression causes memory to be allocated and asso-
-       ciated with the preg structure. The function regfree() frees  all  such
-       memory,  after  which  preg may no longer be used as a compiled expres-
+       Compiling a regular expression causes memory to be allocated and  asso-
+       ciated  with  the preg structure. The function regfree() frees all such
+       memory, after which preg may no longer be used as  a  compiled  expres-
        sion.



@@ -6721,11 +6803,11 @@

REVISION

-       Last updated: 02 September 2009
-       Copyright (c) 1997-2009 University of Cambridge.
+       Last updated: 16 May 2010
+       Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCRECPP(3)                                                          PCRECPP(3)



@@ -7065,8 +7147,8 @@

        Last updated: 17 March 2009
 ------------------------------------------------------------------------------
- 
- 
+
+
 PCRESAMPLE(3)                                                    PCRESAMPLE(3)



@@ -7108,9 +7190,15 @@
          gcc -o pcredemo -I/usr/local/include pcredemo.c \
              -L/usr/local/lib -lpcre


-       Once you have compiled the demonstration program, you  can  run  simple
-       tests like this:
+       In a Windows environment, if you want to statically  link  the  program
+       against a non-dll pcre.a file, you must uncomment the line that defines
+       PCRE_STATIC before including pcre.h, because  otherwise  the  pcre_mal-
+       loc()   and   pcre_free()   exported   functions   will   be   declared
+       __declspec(dllimport), with unwanted results.


+       Once you have compiled and linked the demonstration  program,  you  can
+       run simple tests like this:
+
          ./pcredemo 'cat|dog' 'the cat sat on the mat'
          ./pcredemo -g 'cat|dog' 'the dog sat on the cat'


@@ -7143,8 +7231,8 @@

REVISION

-       Last updated: 30 September 2009
-       Copyright (c) 1997-2009 University of Cambridge.
+       Last updated: 26 May 2010
+       Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
 PCRESTACK(3)                                                      PCRESTACK(3)


@@ -7295,5 +7383,5 @@
        Last updated: 03 January 2010
        Copyright (c) 1997-2010 University of Cambridge.
 ------------------------------------------------------------------------------
- 
- 
+
+


Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcreapi.3    2010-06-03 19:18:24 UTC (rev 535)
@@ -1644,7 +1644,7 @@
 call via \fBpcre_malloc()\fP fails, this error is given. The memory is
 automatically freed at the end of matching.
 .P
-This error is also given if \fBpcre_stack_malloc()\fP fails in 
+This error is also given if \fBpcre_stack_malloc()\fP fails in
 \fBpcre_exec()\fP. This can happen only when PCRE has been compiled with
 \fB--disable-stack-for-recursion\fP.
 .sp


Modified: code/trunk/doc/pcregrep.1
===================================================================
--- code/trunk/doc/pcregrep.1    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcregrep.1    2010-06-03 19:18:24 UTC (rev 535)
@@ -283,13 +283,13 @@
 short form for this option.
 .TP
 \fB--line-buffered\fP
-When this option is given, input is read and processed line by line, and the 
-output is flushed after each write. By default, input is read in large chunks, 
-unless \fBpcregrep\fP can determine that it is reading from a terminal (which 
-is currently possible only in Unix environments). Output to terminal is 
-normally automatically flushed by the operating system. This option can be 
-useful when the input or output is attached to a pipe and you do not want 
-\fBpcregrep\fP to buffer up large amounts of data. However, its use will affect 
+When this option is given, input is read and processed line by line, and the
+output is flushed after each write. By default, input is read in large chunks,
+unless \fBpcregrep\fP can determine that it is reading from a terminal (which
+is currently possible only in Unix environments). Output to terminal is
+normally automatically flushed by the operating system. This option can be
+useful when the input or output is attached to a pipe and you do not want
+\fBpcregrep\fP to buffer up large amounts of data. However, its use will affect
 performance, and the \fB-M\fP (multiline) option ceases to work.
 .TP
 \fB--line-offsets\fP
@@ -317,7 +317,7 @@
 \fBpcregrep\fP ensures that at least 8K characters or the rest of the document
 (whichever is the shorter) are available for forward matching, and similarly
 the previous 8K characters (or all the previous characters, if fewer than 8K)
-are guaranteed to be available for lookbehind assertions. This option does not 
+are guaranteed to be available for lookbehind assertions. This option does not
 work when input is read line by line (see \fB--line-buffered\fP.)
 .TP
 \fB-N\fP \fInewline-type\fP, \fB--newline=\fP\fInewline-type\fP


Modified: code/trunk/doc/pcregrep.txt
===================================================================
--- code/trunk/doc/pcregrep.txt    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcregrep.txt    2010-06-03 19:18:24 UTC (rev 535)
@@ -317,38 +317,51 @@
                  when file names are being output. If not supplied, "(standard
                  input)" is used. There is no short form for this option.


+       --line-buffered
+                 When  this  option is given, input is read and processed line
+                 by line, and the output  is  flushed  after  each  write.  By
+                 default,  input  is read in large chunks, unless pcregrep can
+                 determine that it is reading from a terminal (which  is  cur-
+                 rently  possible only in Unix environments). Output to termi-
+                 nal is normally automatically flushed by the  operating  sys-
+                 tem.  This  option  can be useful when the input or output is
+                 attached to a pipe and you do not want pcregrep to buffer  up
+                 large  amounts  of data. However, its use will affect perfor-
+                 mance, and the -M (multiline) option ceases to work.
+
        --line-offsets
-                 Instead  of  showing lines or parts of lines that match, show
+                 Instead of showing lines or parts of lines that  match,  show
                  each match as a line number, the offset from the start of the
-                 line,  and a length. The line number is terminated by a colon
-                 (as usual; see the -n option), and the offset and length  are
-                 separated  by  a  comma.  In  this mode, no context is shown.
-                 That is, the -A, -B, and -C options are ignored. If there  is
-                 more  than  one  match in a line, each of them is shown sepa-
+                 line, and a length. The line number is terminated by a  colon
+                 (as  usual; see the -n option), and the offset and length are
+                 separated by a comma. In this  mode,  no  context  is  shown.
+                 That  is, the -A, -B, and -C options are ignored. If there is
+                 more than one match in a line, each of them  is  shown  sepa-
                  rately. This option is mutually exclusive with --file-offsets
                  and --only-matching.


        --locale=locale-name
-                 This  option specifies a locale to be used for pattern match-
-                 ing. It overrides the value in the LC_ALL or  LC_CTYPE  envi-
-                 ronment  variables.  If  no  locale  is  specified,  the PCRE
-                 library's default (usually the "C" locale) is used. There  is
+                 This option specifies a locale to be used for pattern  match-
+                 ing.  It  overrides the value in the LC_ALL or LC_CTYPE envi-
+                 ronment variables.  If  no  locale  is  specified,  the  PCRE
+                 library's  default (usually the "C" locale) is used. There is
                  no short form for this option.


        -M, --multiline
-                 Allow  patterns to match more than one line. When this option
+                 Allow patterns to match more than one line. When this  option
                  is given, patterns may usefully contain literal newline char-
-                 acters  and  internal  occurrences of ^ and $ characters. The
-                 output for any one match may consist of more than  one  line.
-                 When  this option is set, the PCRE library is called in "mul-
-                 tiline" mode.  There is a limit to the number of  lines  that
-                 can  be matched, imposed by the way that pcregrep buffers the
-                 input file as it scans it. However, pcregrep ensures that  at
+                 acters and internal occurrences of ^ and  $  characters.  The
+                 output  for  any one match may consist of more than one line.
+                 When this option is set, the PCRE library is called in  "mul-
+                 tiline"  mode.   There is a limit to the number of lines that
+                 can be matched, imposed by the way that pcregrep buffers  the
+                 input  file as it scans it. However, pcregrep ensures that at
                  least 8K characters or the rest of the document (whichever is
-                 the shorter) are available for forward  matching,  and  simi-
+                 the  shorter)  are  available for forward matching, and simi-
                  larly the previous 8K characters (or all the previous charac-
-                 ters, if fewer than 8K) are guaranteed to  be  available  for
-                 lookbehind assertions.
+                 ters,  if  fewer  than 8K) are guaranteed to be available for
+                 lookbehind assertions. This option does not work  when  input
+                 is read line by line (see --line-buffered.)


        -N newline-type, --newline=newline-type
                  The  PCRE  library  supports  five  different conventions for
@@ -523,5 +536,5 @@


REVISION

-       Last updated: 13 September 2009
-       Copyright (c) 1997-2009 University of Cambridge.
+       Last updated: 21 May 2010
+       Copyright (c) 1997-2010 University of Cambridge.


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcrepattern.3    2010-06-03 19:18:24 UTC (rev 535)
@@ -42,14 +42,14 @@
 .\"
 page.
 .P
-Another special sequence that may appear at the start of a pattern or in 
+Another special sequence that may appear at the start of a pattern or in
 combination with (*UTF8) is:
 .sp
   (*UCP)
 .sp
-This has the same effect as setting the PCRE_UCP option: it causes sequences 
-such as \ed and \ew to use Unicode properties to determine character types, 
-instead of recognizing only characters with codes less than 128 via a lookup 
+This has the same effect as setting the PCRE_UCP option: it causes sequences
+such as \ed and \ew to use Unicode properties to determine character types,
+instead of recognizing only characters with codes less than 128 via a lookup
 table.
 .P
 The remainder of this document discusses the patterns that are supported by
@@ -367,11 +367,11 @@
   \ew     any "word" character
   \eW     any "non-word" character
 .sp
-There is also the single sequence \eN, which matches a non-newline character. 
-This is the same as 
+There is also the single sequence \eN, which matches a non-newline character.
+This is the same as
 .\" HTML <a href="#fullstopdot">
 .\" </a>
-the "." metacharacter 
+the "." metacharacter
 .\"
 when PCRE_DOTALL is not set.
 .P
@@ -408,7 +408,7 @@
 By default, in UTF-8 mode, characters with values greater than 127 never match
 \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain
 their original meanings from before UTF-8 support was available, mainly for
-efficiency reasons. However, if PCRE is compiled with Unicode property support, 
+efficiency reasons. However, if PCRE is compiled with Unicode property support,
 and the PCRE_UCP option is set, the behaviour is changed so that Unicode
 properties are used to determine character types, as follows:
 .sp
@@ -417,14 +417,14 @@
   \ew  any character that \ep{L} or \ep{N} matches, plus underscore
 .sp
 The upper case escapes match the inverse sets of characters. Note that \ed
-matches only decimal digits, whereas \ew matches any Unicode digit, as well as 
+matches only decimal digits, whereas \ew matches any Unicode digit, as well as
 any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb, and
 \eB because they are defined in terms of \ew and \eW. Matching these sequences
 is noticeably slower when PCRE_UCP is set.
 .P
 The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
 other sequences, which match only ASCII characters by default, these always
-match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is 
+match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is
 set. The horizontal space characters are:
 .sp
   U+0009     Horizontal tab
@@ -527,10 +527,10 @@
 The property names represented by \fIxx\fP above are limited to the Unicode
 script names, the general category properties, "Any", which matches any
 character (including newline), and some special PCRE properties (described
-in the 
+in the
 .\" HTML <a href="#extraprops">
 .\" </a>
-next section). 
+next section).
 .\"
 Other Perl properties such as "InMusicalSymbols" are not currently supported by
 PCRE. Note that \eP{Any} does not match any characters, so always causes a
@@ -741,7 +741,7 @@
 Matching characters by Unicode property is not fast, because PCRE has to search
 a structure that contains data for over fifteen thousand characters. That is
 why the traditional escape sequences such as \ed and \ew do not use Unicode
-properties in PCRE by default, though you can make them do so by setting the 
+properties in PCRE by default, though you can make them do so by setting the
 PCRE_UCP option for \fBpcre_compile()\fP or by starting the pattern with
 (*UCP).
 .
@@ -750,8 +750,8 @@
 .SS PCRE's additional properties
 .rs
 .sp
-As well as the standard Unicode properties described in the previous 
-section, PCRE supports four more that make it possible to convert traditional 
+As well as the standard Unicode properties described in the previous
+section, PCRE supports four more that make it possible to convert traditional
 escape sequences such as \ew and \es and POSIX character classes to use Unicode
 properties. PCRE uses these non-standard, non-Perl properties internally when
 PCRE_UCP is set. They are:
@@ -761,10 +761,10 @@
   Xsp   Any Perl space character
   Xwd   Any Perl "word" character
 .sp
-Xan matches characters that have either the L (letter) or the N (number) 
-property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or 
+Xan matches characters that have either the L (letter) or the N (number)
+property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
 carriage return, and any other character that has the Z (separator) property.
-Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the 
+Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
 same characters as Xan, plus underscore.
 .
 .
@@ -825,7 +825,7 @@
   \eG     matches at the first matching position in the subject
 .sp
 Inside a character class, \eb has a different meaning; it matches the backspace
-character. If any other of these assertions appears in a character class, by 
+character. If any other of these assertions appears in a character class, by
 default it matches the corresponding literal character (for example, \eB
 matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
 escape sequence" error is generated instead.
@@ -946,8 +946,8 @@
 dollar, the only relationship being that they both involve newlines. Dot has no
 special meaning in a character class.
 .P
-The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not 
-set. In other words, it matches any one character except one that signifies the 
+The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not
+set. In other words, it matches any one character except one that signifies the
 end of a line.
 .
 .
@@ -1102,16 +1102,16 @@
 .P
 By default, in UTF-8 mode, characters with values greater than 127 do not match
 any of the POSIX character classes. However, if the PCRE_UCP option is passed
-to \fBpcre_compile()\fP, some of the classes are changed so that Unicode 
-character properties are used. This is achieved by replacing the POSIX classes 
+to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
+character properties are used. This is achieved by replacing the POSIX classes
 by other sequences, as follows:
 .sp
   [:alnum:]  becomes  \ep{Xan}
   [:alpha:]  becomes  \ep{L}
-  [:blank:]  becomes  \eh 
+  [:blank:]  becomes  \eh
   [:digit:]  becomes  \ep{Nd}
   [:lower:]  becomes  \ep{Ll}
-  [:space:]  becomes  \ep{Xps} 
+  [:space:]  becomes  \ep{Xps}
   [:upper:]  becomes  \ep{Lu}
   [:word:]   becomes  \ep{Xwd}
 .sp


Modified: code/trunk/doc/pcreperform.3
===================================================================
--- code/trunk/doc/pcreperform.3    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcreperform.3    2010-06-03 19:18:24 UTC (rev 535)
@@ -96,12 +96,12 @@
 an alternative pattern that does not use character properties, it will probably
 be faster.
 .P
-By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX 
-character classes such as [:alpha:] do not use Unicode properties, partly for 
-backwards compatibility, and partly for performance reasons. However, you can 
-set PCRE_UCP if you want Unicode character properties to be used. This can 
-double the matching time for items such as \ed, when matched with 
-\fBpcre_exec()\fP; the performance loss is less with \fBpcre_dfa_exec()\fP, and 
+By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX
+character classes such as [:alpha:] do not use Unicode properties, partly for
+backwards compatibility, and partly for performance reasons. However, you can
+set PCRE_UCP if you want Unicode character properties to be used. This can
+double the matching time for items such as \ed, when matched with
+\fBpcre_exec()\fP; the performance loss is less with \fBpcre_dfa_exec()\fP, and
 in both cases there is not much difference for \eb.
 .P
 When a pattern begins with .* not in parentheses, or in parentheses that are


Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcresyntax.3    2010-06-03 19:18:24 UTC (rev 535)
@@ -45,7 +45,7 @@
   \eD         a character that is not a decimal digit
   \eh         a horizontal whitespace character
   \eH         a character that is not a horizontal whitespace character
-  \eN         a character that is not a newline 
+  \eN         a character that is not a newline
   \ep{\fIxx\fP}     a character with the \fIxx\fP property
   \eP{\fIxx\fP}     a character without the \fIxx\fP property
   \eR         a newline sequence
@@ -58,7 +58,7 @@
   \eX         an extended Unicode sequence
 .sp
 In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
-characters, even in UTF-8 mode. However, this can be changed by setting the 
+characters, even in UTF-8 mode. However, this can be changed by setting the
 PCRE_UCP option.
 .
 .
@@ -117,7 +117,7 @@
   Xan        Alphanumeric: union of properties L and N
   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
   Xsp        Perl space: property Z or tab, NL, FF, CR
-  Xwd        Perl word: property Xan or underscore 
+  Xwd        Perl word: property Xan or underscore
 .
 .
 .SH "SCRIPT NAMES FOR \ep AND \eP"
@@ -241,7 +241,7 @@
   word        same as \ew
   xdigit      hexadecimal digit
 .sp
-In PCRE, POSIX character set names recognize only ASCII characters by default, 
+In PCRE, POSIX character set names recognize only ASCII characters by default,
 but some of them use Unicode properties if PCRE_UCP is set. You can use
 \eQ...\eE inside a character class.
 .


Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcretest.1    2010-06-03 19:18:24 UTC (rev 535)
@@ -169,7 +169,7 @@
 options that do not correspond to anything in Perl:
 .sp
   \fB/8\fP              PCRE_UTF8
-  \fB/?\fP              PCRE_NO_UTF8_CHECK 
+  \fB/?\fP              PCRE_NO_UTF8_CHECK
   \fB/A\fP              PCRE_ANCHORED
   \fB/C\fP              PCRE_AUTO_CALLOUT
   \fB/E\fP              PCRE_DOLLAR_ENDONLY
@@ -177,7 +177,7 @@
   \fB/J\fP              PCRE_DUPNAMES
   \fB/N\fP              PCRE_NO_AUTO_CAPTURE
   \fB/U\fP              PCRE_UNGREEDY
-  \fB/W\fP              PCRE_UCP 
+  \fB/W\fP              PCRE_UCP
   \fB/X\fP              PCRE_EXTRA
   \fB/<JS>\fP           PCRE_JAVASCRIPT_COMPAT
   \fB/<cr>\fP           PCRE_NEWLINE_CR
@@ -201,7 +201,7 @@
 .\" HREF
 \fBpcreapi\fP
 .\"
-documentation. 
+documentation.
 .
 .
 .SS "Finding all matches in a string"
@@ -291,14 +291,14 @@
 .rs
 .sp
 The \fB/P\fP modifier causes \fBpcretest\fP to call PCRE via the POSIX wrapper
-API rather than its native API. When \fB/P\fP is set, the following modifiers 
+API rather than its native API. When \fB/P\fP is set, the following modifiers
 set options for the \fBregcomp()\fP function:
 .sp
   /i    REG_ICASE
   /m    REG_NEWLINE
   /N    REG_NOSUB
   /s    REG_DOTALL     )
-  /U    REG_UNGREEDY   ) These options are not part of   
+  /U    REG_UNGREEDY   ) These options are not part of
   /W    REG_UCP        )   the POSIX standard
   /8    REG_UTF8       )
 .sp


Modified: code/trunk/doc/pcretest.txt
===================================================================
--- code/trunk/doc/pcretest.txt    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/doc/pcretest.txt    2010-06-03 19:18:24 UTC (rev 535)
@@ -154,8 +154,8 @@


          /caseless/i


-       The following table shows additional modifiers for setting PCRE options
-       that do not correspond to anything in Perl:
+       The  following  table  shows additional modifiers for setting PCRE com-
+       pile-time options that do not correspond to anything in Perl:


          /8              PCRE_UTF8
          /?              PCRE_NO_UTF8_CHECK
@@ -267,23 +267,34 @@
        The  /M  modifier causes the size of memory block used to hold the com-
        piled pattern to be output.


-       The /P modifier causes pcretest to call PCRE via the POSIX wrapper  API
-       rather  than  its  native  API.  When this is done, all other modifiers
-       except /i, /m, and /+ are ignored. REG_ICASE is set if /i  is  present,
-       and  REG_NEWLINE  is  set if /m is present. The wrapper functions force
-       PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
-
        The /S modifier causes pcre_study() to be called after  the  expression
        has been compiled, and the results used when the expression is matched.


+ Using the POSIX wrapper API

+       The  /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+       rather than its native API. When /P is set, the following modifiers set
+       options for the regcomp() function:
+
+         /i    REG_ICASE
+         /m    REG_NEWLINE
+         /N    REG_NOSUB
+         /s    REG_DOTALL     )
+         /U    REG_UNGREEDY   ) These options are not part of
+         /W    REG_UCP        )   the POSIX standard
+         /8    REG_UTF8       )
+
+       The  /+  modifier  works  as  described  above. All other modifiers are
+       ignored.
+
+
 DATA LINES


-       Before  each  data  line is passed to pcre_exec(), leading and trailing
-       whitespace is removed, and it is then scanned for \  escapes.  Some  of
-       these  are  pretty esoteric features, intended for checking out some of
-       the more complicated features of PCRE. If you are just  testing  "ordi-
-       nary"  regular  expressions,  you probably don't need any of these. The
+       Before each data line is passed to pcre_exec(),  leading  and  trailing
+       whitespace  is  removed,  and it is then scanned for \ escapes. Some of
+       these are pretty esoteric features, intended for checking out  some  of
+       the  more  complicated features of PCRE. If you are just testing "ordi-
+       nary" regular expressions, you probably don't need any  of  these.  The
        following escapes are recognized:


          \a         alarm (BEL, \x07)
@@ -361,72 +372,72 @@
          \<any>     pass the PCRE_NEWLINE_ANY option to pcre_exec()
                       or pcre_dfa_exec()


-       The escapes that specify line ending  sequences  are  literal  strings,
+       The  escapes  that  specify  line ending sequences are literal strings,
        exactly as shown. No more than one newline setting should be present in
        any data line.


-       A backslash followed by anything else just escapes the  anything  else.
-       If  the very last character is a backslash, it is ignored. This gives a
-       way of passing an empty line as data, since a real  empty  line  termi-
+       A  backslash  followed by anything else just escapes the anything else.
+       If the very last character is a backslash, it is ignored. This gives  a
+       way  of  passing  an empty line as data, since a real empty line termi-
        nates the data input.


-       If  \M  is present, pcretest calls pcre_exec() several times, with dif-
-       ferent values in the match_limit and  match_limit_recursion  fields  of
-       the  pcre_extra  data structure, until it finds the minimum numbers for
+       If \M is present, pcretest calls pcre_exec() several times,  with  dif-
+       ferent  values  in  the match_limit and match_limit_recursion fields of
+       the pcre_extra data structure, until it finds the minimum  numbers  for
        each parameter that allow pcre_exec() to complete. The match_limit num-
-       ber  is  a  measure of the amount of backtracking that takes place, and
+       ber is a measure of the amount of backtracking that  takes  place,  and
        checking it out can be instructive. For most simple matches, the number
-       is  quite  small,  but for patterns with very large numbers of matching
-       possibilities, it can become large very quickly with increasing  length
+       is quite small, but for patterns with very large  numbers  of  matching
+       possibilities,  it can become large very quickly with increasing length
        of subject string. The match_limit_recursion number is a measure of how
-       much stack (or, if PCRE is compiled with  NO_RECURSE,  how  much  heap)
+       much  stack  (or,  if  PCRE is compiled with NO_RECURSE, how much heap)
        memory is needed to complete the match attempt.


-       When  \O  is  used, the value specified may be higher or lower than the
+       When \O is used, the value specified may be higher or  lower  than  the
        size set by the -O command line option (or defaulted to 45); \O applies
        only to the call of pcre_exec() for the line in which it appears.


-       If  the /P modifier was present on the pattern, causing the POSIX wrap-
-       per API to be used, the only option-setting  sequences  that  have  any
-       effect  are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively,
-       to be passed to regexec().
+       If the /P modifier was present on the pattern, causing the POSIX  wrap-
+       per  API  to  be  used, the only option-setting sequences that have any
+       effect are \B,  \N,  and  \Z,  causing  REG_NOTBOL,  REG_NOTEMPTY,  and
+       REG_NOTEOL, respectively, to be passed to regexec().


-       The use of \x{hh...} to represent UTF-8 characters is not dependent  on
-       the  use  of  the  /8 modifier on the pattern. It is recognized always.
-       There may be any number of hexadecimal digits inside  the  braces.  The
-       result  is  from  one  to  six bytes, encoded according to the original
-       UTF-8 rules of RFC 2279. This allows for  values  in  the  range  0  to
-       0x7FFFFFFF.  Note  that not all of those are valid Unicode code points,
-       or indeed valid UTF-8 characters according to the later  rules  in  RFC
+       The  use of \x{hh...} to represent UTF-8 characters is not dependent on
+       the use of the /8 modifier on the pattern.  It  is  recognized  always.
+       There  may  be  any number of hexadecimal digits inside the braces. The
+       result is from one to six bytes,  encoded  according  to  the  original
+       UTF-8  rules  of  RFC  2279.  This  allows for values in the range 0 to
+       0x7FFFFFFF. Note that not all of those are valid Unicode  code  points,
+       or  indeed  valid  UTF-8 characters according to the later rules in RFC
        3629.



THE ALTERNATIVE MATCHING FUNCTION

-       By   default,  pcretest  uses  the  standard  PCRE  matching  function,
+       By  default,  pcretest  uses  the  standard  PCRE  matching   function,
        pcre_exec() to match each data line. From release 6.0, PCRE supports an
-       alternative  matching  function,  pcre_dfa_exec(),  which operates in a
-       different way, and has some restrictions. The differences  between  the
+       alternative matching function, pcre_dfa_exec(),  which  operates  in  a
+       different  way,  and has some restrictions. The differences between the
        two functions are described in the pcrematching documentation.


-       If  a data line contains the \D escape sequence, or if the command line
-       contains the -dfa option, the alternative matching function is  called.
+       If a data line contains the \D escape sequence, or if the command  line
+       contains  the -dfa option, the alternative matching function is called.
        This function finds all possible matches at a given point. If, however,
-       the \F escape sequence is present in the data line, it stops after  the
+       the  \F escape sequence is present in the data line, it stops after the
        first match is found. This is always the shortest possible match.



DEFAULT OUTPUT FROM PCRETEST

-       This  section  describes  the output when the normal matching function,
+       This section describes the output when the  normal  matching  function,
        pcre_exec(), is being used.


        When a match succeeds, pcretest outputs the list of captured substrings
-       that  pcre_exec()  returns,  starting with number 0 for the string that
-       matched the whole pattern. Otherwise, it outputs "No  match"  when  the
+       that pcre_exec() returns, starting with number 0 for  the  string  that
+       matched  the  whole  pattern. Otherwise, it outputs "No match" when the
        return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the par-
-       tially matching substring when pcre_exec() returns  PCRE_ERROR_PARTIAL.
-       For  any other returns, it outputs the PCRE negative error number. Here
+       tially  matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
+       For any other returns, it outputs the PCRE negative error number.  Here
        is an example of an interactive pcretest run.


          $ pcretest
@@ -439,11 +450,11 @@
          data> xyz
          No match


-       Note that unset capturing substrings that are not followed by one  that
-       is  set are not returned by pcre_exec(), and are not shown by pcretest.
-       In the following example, there are two capturing substrings, but  when
-       the  first  data  line  is  matched, the second, unset substring is not
-       shown. An "internal" unset substring is shown as "<unset>", as for  the
+       Note  that unset capturing substrings that are not followed by one that
+       is set are not returned by pcre_exec(), and are not shown by  pcretest.
+       In  the following example, there are two capturing substrings, but when
+       the first data line is matched, the  second,  unset  substring  is  not
+       shown.  An "internal" unset substring is shown as "<unset>", as for the
        second data line.


            re> /(a)|(b)/
@@ -455,11 +466,11 @@
           1: <unset>
           2: b


-       If  the strings contain any non-printing characters, they are output as
-       \0x escapes, or as \x{...} escapes if the /8 modifier  was  present  on
-       the  pattern.  See below for the definition of non-printing characters.
-       If the pattern has the /+ modifier, the output for substring 0 is  fol-
-       lowed  by  the  rest  of  the  subject  string,  identified by "0+" like
+       If the strings contain any non-printing characters, they are output  as
+       \0x  escapes,  or  as \x{...} escapes if the /8 modifier was present on
+       the pattern. See below for the definition of  non-printing  characters.
+       If  the pattern has the /+ modifier, the output for substring 0 is fol-
+       lowed  by  the  rest  of  the subject string, identified  by  "0+"  like
        this:


            re> /cat/+
@@ -467,7 +478,7 @@
           0: cat
           0+ aract


-       If the pattern has the /g or /G modifier,  the  results  of  successive
+       If  the  pattern  has  the /g or /G modifier, the results of successive
        matching attempts are output in sequence, like this:


            re> /\Bi(\w\w)/g
@@ -481,24 +492,24 @@


        "No match" is output only if the first match attempt fails.


-       If  any  of the sequences \C, \G, or \L are present in a data line that
-       is successfully matched, the substrings extracted  by  the  convenience
+       If any of the sequences \C, \G, or \L are present in a data  line  that
+       is  successfully  matched,  the substrings extracted by the convenience
        functions are output with C, G, or L after the string number instead of
        a colon. This is in addition to the normal full list. The string length
-       (that  is,  the return from the extraction function) is given in paren-
+       (that is, the return from the extraction function) is given  in  paren-
        theses after each string for \C and \G.


        Note that whereas patterns can be continued over several lines (a plain
        ">" prompt is used for continuations), data lines may not. However new-
-       lines can be included in data by means of the \n escape (or  \r,  \r\n,
+       lines  can  be included in data by means of the \n escape (or \r, \r\n,
        etc., depending on the newline sequence setting).



OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

-       When  the  alternative  matching function, pcre_dfa_exec(), is used (by
-       means of the \D escape sequence or the -dfa command line  option),  the
-       output  consists  of  a list of all the matches that start at the first
+       When the alternative matching function, pcre_dfa_exec(),  is  used  (by
+       means  of  the \D escape sequence or the -dfa command line option), the
+       output consists of a list of all the matches that start  at  the  first
        point in the subject where there is at least one match. For example:


            re> /(tang|tangerine|tan)/
@@ -507,8 +518,8 @@
           1: tang
           2: tan


-       (Using the normal matching function on this data  finds  only  "tang".)
-       The  longest matching string is always given first (and numbered zero).
+       (Using  the  normal  matching function on this data finds only "tang".)
+       The longest matching string is always given first (and numbered  zero).
        After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
        lowed by the partially matching substring.


@@ -524,16 +535,16 @@
           1: tan
           0: tan


-       Since the matching function does not  support  substring  capture,  the
-       escape  sequences  that  are concerned with captured substrings are not
+       Since  the  matching  function  does not support substring capture, the
+       escape sequences that are concerned with captured  substrings  are  not
        relevant.



RESTARTING AFTER A PARTIAL MATCH

        When the alternative matching function has given the PCRE_ERROR_PARTIAL
-       return,  indicating that the subject partially matched the pattern, you
-       can restart the match with additional subject data by means of  the  \R
+       return, indicating that the subject partially matched the pattern,  you
+       can  restart  the match with additional subject data by means of the \R
        escape sequence. For example:


            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -542,30 +553,30 @@
          data> n05\R\D
           0: n05


-       For  further  information  about  partial matching, see the pcrepartial
+       For further information about partial  matching,  see  the  pcrepartial
        documentation.



CALLOUTS

-       If the pattern contains any callout requests, pcretest's callout  func-
-       tion  is  called  during  matching. This works with both matching func-
+       If  the pattern contains any callout requests, pcretest's callout func-
+       tion is called during matching. This works  with  both  matching  func-
        tions. By default, the called function displays the callout number, the
-       start  and  current  positions in the text at the callout time, and the
+       start and current positions in the text at the callout  time,  and  the
        next pattern item to be tested. For example, the output


          --->pqrabcdef
            0    ^  ^     \d


-       indicates that callout number 0 occurred for a match  attempt  starting
-       at  the fourth character of the subject string, when the pointer was at
-       the seventh character of the data, and when the next pattern  item  was
-       \d.  Just  one  circumflex is output if the start and current positions
+       indicates  that  callout number 0 occurred for a match attempt starting
+       at the fourth character of the subject string, when the pointer was  at
+       the  seventh  character of the data, and when the next pattern item was
+       \d. Just one circumflex is output if the start  and  current  positions
        are the same.


        Callouts numbered 255 are assumed to be automatic callouts, inserted as
-       a  result  of the /C pattern modifier. In this case, instead of showing
-       the callout number, the offset in the pattern, preceded by a  plus,  is
+       a result of the /C pattern modifier. In this case, instead  of  showing
+       the  callout  number, the offset in the pattern, preceded by a plus, is
        output. For example:


            re> /\d?[A-E]\*/C
@@ -577,86 +588,86 @@
          +10 ^ ^
           0: E*


-       The  callout  function  in pcretest returns zero (carry on matching) by
-       default, but you can use a \C item in a data line (as described  above)
+       The callout function in pcretest returns zero (carry  on  matching)  by
+       default,  but you can use a \C item in a data line (as described above)
        to change this.


-       Inserting  callouts can be helpful when using pcretest to check compli-
-       cated regular expressions. For further information about callouts,  see
+       Inserting callouts can be helpful when using pcretest to check  compli-
+       cated  regular expressions. For further information about callouts, see
        the pcrecallout documentation.



NON-PRINTING CHARACTERS

-       When  pcretest is outputting text in the compiled version of a pattern,
-       bytes other than 32-126 are always treated as  non-printing  characters
+       When pcretest is outputting text in the compiled version of a  pattern,
+       bytes  other  than 32-126 are always treated as non-printing characters
        and are therefore shown as hex escapes.


-       When  pcretest  is  outputting text that is a matched part of a subject
-       string, it behaves in the same way, unless a different locale has  been
-       set  for  the  pattern  (using  the  /L  modifier).  In  this case, the
+       When pcretest is outputting text that is a matched part  of  a  subject
+       string,  it behaves in the same way, unless a different locale has been
+       set for the  pattern  (using  the  /L  modifier).  In  this  case,  the
        isprint() function is used to distinguish printing and non-printing characters.



SAVING AND RELOADING COMPILED PATTERNS

-       The facilities described in this section are  not  available  when  the
+       The  facilities  described  in  this section are not available when the
        POSIX interface to PCRE is being used, that is, when the /P pattern mod-
        ifier is specified.


        When the POSIX interface is not in use, you can cause pcretest to write
-       a  compiled  pattern to a file, by following the modifiers with > and a
+       a compiled pattern to a file, by following the modifiers with >  and  a
        file name.  For example:


          /pattern/im >/some/file


-       See the pcreprecompile documentation for a discussion about saving  and
+       See  the pcreprecompile documentation for a discussion about saving and
        re-using compiled patterns.


-       The  data  that  is  written  is  binary. The first eight bytes are the
-       length of the compiled pattern data  followed  by  the  length  of  the
-       optional  study  data,  each  written as four bytes in big-endian order
-       (most significant byte first). If there is no study  data  (either  the
+       The data that is written is binary.  The  first  eight  bytes  are  the
+       length  of  the  compiled  pattern  data  followed by the length of the
+       optional study data, each written as four  bytes  in  big-endian  order
+       (most  significant  byte  first). If there is no study data (either the
        pattern was not studied, or studying did not return any data), the sec-
-       ond length is zero. The lengths are followed by an exact  copy  of  the
+       ond  length  is  zero. The lengths are followed by an exact copy of the
        compiled pattern. If there is additional study data, this follows imme-
-       diately after the compiled pattern. After writing  the  file,  pcretest
+       diately  after  the  compiled pattern. After writing the file, pcretest
        expects to read a new pattern.


        A saved pattern can be reloaded into pcretest by specifying < and a file
-       name instead of a pattern. The name of the file must not  contain  a  <
-       character,  as  otherwise pcretest will interpret the line as a pattern
+       name  instead  of  a pattern. The name of the file must not contain a <
+       character, as otherwise pcretest will interpret the line as  a  pattern
        delimited by < characters.  For example:


           re> </some/file
          Compiled regex loaded from /some/file
          No study data


-       When the pattern has been loaded, pcretest proceeds to read data  lines
+       When  the pattern has been loaded, pcretest proceeds to read data lines
        in the usual way.


-       You  can copy a file written by pcretest to a different host and reload
-       it there, even if the new host has opposite endianness to  the  one  on
-       which  the pattern was compiled. For example, you can compile on an i86
+       You can copy a file written by pcretest to a different host and  reload
+       it  there,  even  if the new host has opposite endianness to the one on
+       which the pattern was compiled. For example, you can compile on an  i86
        machine and run on a SPARC machine.


-       File names for saving and reloading can be absolute  or  relative,  but
-       note  that the shell facility of expanding a file name that starts with
+       File  names  for  saving and reloading can be absolute or relative, but
+       note that the shell facility of expanding a file name that starts  with
        a tilde (~) is not available.


-       The ability to save and reload files in pcretest is intended for  test-
-       ing  and experimentation. It is not intended for production use because
-       only a single pattern can be written to a file. Furthermore,  there  is
-       no  facility  for  supplying  custom  character  tables  for use with a
-       reloaded pattern. If the original  pattern  was  compiled  with  custom
-       tables,  an  attempt to match a subject string using a reloaded pattern
-       is likely to cause pcretest to crash.  Finally, if you attempt to  load
+       The  ability to save and reload files in pcretest is intended for test-
+       ing and experimentation. It is not intended for production use  because
+       only  a  single pattern can be written to a file. Furthermore, there is
+       no facility for supplying  custom  character  tables  for  use  with  a
+       reloaded  pattern.  If  the  original  pattern was compiled with custom
+       tables, an attempt to match a subject string using a  reloaded  pattern
+       is  likely to cause pcretest to crash.  Finally, if you attempt to load
        a file that is not in the correct format, the result is undefined.



SEE ALSO

-       pcre(3),  pcreapi(3),  pcrecallout(3), pcrematching(3), pcrepartial(3),
+       pcre(3), pcreapi(3), pcrecallout(3),  pcrematching(3),  pcrepartial(3),
        pcrepattern(3), pcreprecompile(3).



@@ -669,5 +680,5 @@

REVISION

-       Last updated: 12 May 2010
+       Last updated: 16 May 2010
        Copyright (c) 1997-2010 University of Cambridge.
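
The pcretest text above describes the start of a saved pattern file: two lengths, each stored as four bytes in big-endian order, with the second being zero when there is no study data. As a rough sketch of decoding that header only (the names saved_header, read_be32 and parse_saved_header are hypothetical and are not part of PCRE or pcretest), something along these lines could be used:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not defined by PCRE or pcretest. */
typedef struct {
  uint32_t pattern_len;  /* length of the compiled pattern data */
  uint32_t study_len;    /* length of the study data, zero if none */
} saved_header;

/* Decode four bytes, most significant byte first (big-endian). */
uint32_t read_be32(const unsigned char *p)
{
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the 8-byte header at the start of buf.
   Returns 0 on success, or -1 if fewer than 8 bytes are available. */
int parse_saved_header(const unsigned char *buf, size_t len, saved_header *out)
{
  assert(buf != NULL && out != NULL);
  if (len < 8) return -1;
  out->pattern_len = read_be32(buf);      /* first four bytes */
  out->study_len   = read_be32(buf + 4);  /* next four bytes */
  return 0;
}
```

Decoding byte by byte, rather than copying the bytes into an integer, gives the same result on hosts of either endianness. The sketch covers the header only; the compiled pattern that follows is opaque data and is not interpreted here.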


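The NON-PRINTING CHARACTERS text in the pcretest diff above states that bytes outside the range 32-126 are shown as hex escapes. A minimal sketch of that rule follows. The function name escape_nonprinting is hypothetical, the \xhh spelling of the escape is an assumption, and this is not pcretest's own output code (which can also consult isprint() when a locale is set):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not taken from pcretest. Copies len bytes from src
   into dst, writing every byte outside 32-126 as \xhh (assumed spelling).
   Returns the length of the resulting string, or -1 if dst is too small. */
int escape_nonprinting(const unsigned char *src, size_t len,
                       char *dst, size_t dstsize)
{
  size_t used = 0;
  size_t i;
  assert(src != NULL && dst != NULL);
  for (i = 0; i < len; i++)
    {
    if (src[i] >= 32 && src[i] <= 126)
      {
      if (used + 2 > dstsize) return -1;   /* one char plus terminator */
      dst[used++] = (char)src[i];
      }
    else
      {
      if (used + 5 > dstsize) return -1;   /* \xhh plus terminator */
      snprintf(dst + used, 5, "\\x%02x", (unsigned)src[i]);
      used += 4;
      }
    }
  if (used + 1 > dstsize) return -1;       /* room for the terminator */
  dst[used] = '\0';
  return (int)used;
}
```

The fixed 32-126 test matches the default behaviour described in the text; it deliberately ignores the locale-dependent case.
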
Modified: code/trunk/maint/README
===================================================================
--- code/trunk/maint/README    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/maint/README    2010-06-03 19:18:24 UTC (rev 535)
@@ -25,20 +25,20 @@


 GenerateUtt.py   A Python script to generate part of the pcre_tables.c file
                  that contains Unicode script names in a long string with
-                 offsets, which is tedious to maintain by hand. 
+                 offsets, which is tedious to maintain by hand.


 ManyConfigTests  A shell script that runs "configure, make, test" a number of
                  times with different configuration settings.
-                 
+
 MultiStage2.py   A Python script that generates the file pcre_ucd.c from three
                  Unicode data tables, which are themselves downloaded from the
-                 Unicode web site. Run this script in the "maint" directory. 
+                 Unicode web site. Run this script in the "maint" directory.
                  The generated file contains the tables for a 2-stage lookup
-                 of Unicode properties.  
+                 of Unicode properties.


 README           This file.


-Unicode.tables   The files in this directory, DerivedGeneralCategory.txt, 
+Unicode.tables   The files in this directory, DerivedGeneralCategory.txt,
                  Scripts.txt and UnicodeData.txt, were downloaded from the
                  Unicode web site. They contain information about Unicode
                  characters and scripts.
@@ -71,9 +71,9 @@
 run to generate the tricky tables for inclusion in pcre_tables.c.


If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
-the cause is usually a missing (or misspelt) name in the list of scripts. I
-couldn't find a straightforward list of scripts on the Unicode site, but
-there's a useful Wikipedia page that lists them, and notes the Unicode version
+the cause is usually a missing (or misspelt) name in the list of scripts. I
+couldn't find a straightforward list of scripts on the Unicode site, but
+there's a useful Wikipedia page that lists them, and notes the Unicode version
in which they were introduced:

http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
@@ -83,7 +83,7 @@
of test characters. The source file ucptest.c must be updated whenever new
Unicode script names are added.

-Note also that both the pcresyntax.3 and pcrepattern.3 man pages contain lists
+Note also that both the pcresyntax.3 and pcrepattern.3 man pages contain lists
of Unicode script names.


@@ -94,20 +94,20 @@
distribution for a new release.

. Ensure that the version number and version date are correct in configure.ac.
-
+
. If new build options have been added, ensure that they are added to the CMake
- files as well as to the autoconf files.
+ files as well as to the autoconf files.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
options. The maint/ManyConfigTests script now encapsulates this testing.

-. Run perltest.pl on the test data for tests 1, 4, 6, and 11. The first two can
- be run with Perl 5.8 or 5.10; the last two require Perl 5.10. The output
- should match the PCRE test output, apart from the version identification at
- the start of each test. The other tests are not Perl-compatible (they use
- various PCRE-specific features or options).
+. Run perltest.pl on the test data for tests 1, 4, 6, and 11. The first two can
+ be run with Perl 5.8 or >= 5.10; the last two require Perl >= 5.10. The
+ output should match the PCRE test output, apart from the version
+ identification at the start of each test. The other tests are not
+ Perl-compatible (they use various PCRE-specific features or options).

. Test with valgrind by running "RunTest valgrind". There is also "RunGrepTest
valgrind", though that takes quite a long time.
@@ -130,14 +130,14 @@
used" warnings for the modules in which there is no call to memmove(). These
can be ignored.

-. Documentation: check AUTHORS, COPYING, ChangeLog (check version and date),
+. Documentation: check AUTHORS, COPYING, ChangeLog (check version and date),
INSTALL, LICENCE, NEWS (check version and date), NON-UNIX-USE, and README.
Many of these won't need changing, but over the long term things do change.

. Man pages: Check all man pages for \ not followed by e or f or " because
- that indicates a markup error. However, there is one exception: pcredemo.3,
+ that indicates a markup error. However, there is one exception: pcredemo.3,
which is created from the pcredemo.c program. It contains three instances
- of \\n.
+ of \\n.

. When the release is built, test it on a number of different operating
systems if possible, and using different compilers as well. For example,
@@ -154,10 +154,10 @@
Double-check with "svn status", then create an SVN tagged copy:

   svn copy svn://vcs.exim.org/pcre/code/trunk \
-           svn://vcs.exim.org/pcre/code/tags/pcre-8.xx 
+           svn://vcs.exim.org/pcre/code/tags/pcre-8.xx


Don't forget to update Freshmeat when the new release is out, and to tell
-webmaster@??? and the mailing list. Also, update the list of version
+webmaster@??? and the mailing list. Also, update the list of version
numbers in Bugzilla (edit products).


@@ -186,7 +186,7 @@
     over the existing "required byte" (reqbyte) feature that just remembers one
     byte.


- * These probably need to go in study():
+ * These probably need to go in pcre_study():

     o Remember an initial string rather than just 1 char?


@@ -194,7 +194,14 @@
       earlier one if common to all alternatives.


     o Friedl contains other ideas.
-    
+
+  * pcre_study() does not set initial byte flags for Unicode property types
+    such as \p; I don't know how much benefit there would be for, for example,
+    setting the bits for 0-9 and all bytes >= xC0 when a pattern starts with
+    \p{N}.
+
+  * There is scope for more "auto-possessifying" in connection with \p and \P.
+
 . If Perl gets to a consistent state over the settings of capturing sub-
   patterns inside repeats, see if we can match it. One example of the
   difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
@@ -205,11 +212,6 @@


. Unicode

-  * Note that in Perl, \s matches \pZ and similarly for \d, \w and the POSIX
-    character classes. For the moment, I've chosen not to support this for
-    backward compatibility, for speed, and because it would be messy to
-    implement.
-
   * A different approach to Unicode might be to use a typedef to do everything
     in unsigned shorts instead of unsigned chars. Actually, we'd have to have a
     new typedef to distinguish data from bits of compiled pattern that are in
@@ -271,54 +273,54 @@


. Someone suggested --disable-callout to save code space when callouts are
never wanted. This seems rather marginal.
-
-. Check names that consist entirely of digits: PCRE allows, but do Perl and
- Python, etc?
-
-. A user suggested a parameter to limit the length of string matched, for
- example if the parameter is N, the current match should fail if the matched
- substring exceeds N. This could apply to both match functions. The value
+
+. Check names that consist entirely of digits: PCRE allows, but do Perl and
+ Python, etc?
+
+. A user suggested a parameter to limit the length of string matched, for
+ example if the parameter is N, the current match should fail if the matched
+ substring exceeds N. This could apply to both match functions. The value
could be a new field in the extra block.
-
+
. Callouts with arguments: (?Cn:ARG) for instance.

-. A user is going to supply a patch to generalize the API for user-specific
+. A user is going to supply a patch to generalize the API for user-specific
memory allocation so that it is more flexible in threaded environments. This
was promised a long time ago, and never appeared...
-
+
. Write a function that generates random matching strings for a compiled regex.

-. Write a wrapper to maintain a structure with specified runtime parameters,
- such as recurse limit, and pass these to PCRE each time it is called. Also
+. Write a wrapper to maintain a structure with specified runtime parameters,
+ such as recurse limit, and pass these to PCRE each time it is called. Also
maybe malloc and free. A user sent a prototype.
-
-. Pcregrep: an option to specify the output line separator, either as a string
- or select from a fixed list. This is not dead easy, because at the moment it
+
+. Pcregrep: an option to specify the output line separator, either as a string
+ or select from a fixed list. This is not dead easy, because at the moment it
outputs whatever is in the input file.
-
-. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
- non-thread-safe patch showed that this can help performance for patterns
- where there are many alternatives. However, a simple thread-safe
- implementation that I tried made things worse in many simple cases, so this
+
+. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
+ non-thread-safe patch showed that this can help performance for patterns
+ where there are many alternatives. However, a simple thread-safe
+ implementation that I tried made things worse in many simple cases, so this
is not an obviously good thing.
-
-. Make the longest lookbehind available via pcre_fullinfo(). This is not
- straightforward because lookbehinds can be nested inside lookbehinds. This
- case will have to be identified, and the amounts added. This should then give
- the maximum possible lookbehind length. The reason for wanting this is to
+
+. Make the longest lookbehind available via pcre_fullinfo(). This is not
+ straightforward because lookbehinds can be nested inside lookbehinds. This
+ case will have to be identified, and the amounts added. This should then give
+ the maximum possible lookbehind length. The reason for wanting this is to
help when implementing multi-segment matching using pcre_exec() with partial
matching and overlapping segments.
-
+
. PCRE cannot at present distinguish between subpatterns with different names,
- but the same number (created by the use of ?|). In order to do so, a way of
+ but the same number (created by the use of ?|). In order to do so, a way of
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
-
-. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
+ Now that (*MARK) has been implemented, it can perhaps be used as a way round
+ this problem.
+
+. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
"something" and then the #ifdef appears only in one place, in "something".
-
-. Support for (*MARK) and arguments for (*PRUNE) and friends.

Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 10 March 2010
+Last updated: 03 June 2010
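
The new wish-list item in maint/README mentions setting the starting-byte bits for 0-9 and all bytes >= xC0 when a pattern starts with \p{N}. The reasoning is presumably that, in UTF-8 mode, every non-ASCII character begins with a byte of 0xC0 or above, so any of those bytes might start a number character. A small sketch of building such a bit map follows. The function names are hypothetical, the 32-byte layout with one bit per possible first byte is an assumption for illustration, and none of this is existing pcre_study() code:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of pcre_study(). Fills a 256-bit map,
   one bit per possible first byte, with the bytes that could begin a
   match for \p{N} in UTF-8 mode: ASCII digits and all bytes >= 0xC0.
   Assumes an ASCII-compatible character set. */
void set_start_bits_for_number(unsigned char bits[32])
{
  int c;
  assert(bits != NULL);
  memset(bits, 0, 32);
  for (c = '0'; c <= '9'; c++)
    bits[c / 8] |= (unsigned char)(1 << (c & 7));
  for (c = 0xC0; c <= 0xFF; c++)
    bits[c / 8] |= (unsigned char)(1 << (c & 7));
}

/* Test whether byte value c (0-255) is marked as a possible first byte. */
int start_bit_is_set(const unsigned char bits[32], int c)
{
  assert(bits != NULL && c >= 0 && c <= 255);
  return (bits[c / 8] & (1 << (c & 7))) != 0;
}
```

With such a map, a matcher could skip subject positions whose first byte has no bit set. As the README item says, how much benefit this would give in practice is unknown.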

Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcre_compile.c    2010-06-03 19:18:24 UTC (rev 535)
@@ -261,8 +261,8 @@
   cbit_xdigit,-1,          0              /* xdigit */
 };


-/* Table of substitutes for \d etc when PCRE_UCP is set. The POSIX class
-substitutes must be in the order of the names, defined above, and there are
+/* Table of substitutes for \d etc when PCRE_UCP is set. The POSIX class
+substitutes must be in the order of the names, defined above, and there are
both positive and negative cases. NULL means no substitute. */

 #ifdef SUPPORT_UCP
@@ -272,14 +272,14 @@
   (uschar *)"\\P{Xsp}",   /* \S */       /* NOTE: Xsp is Perl space */
   (uschar *)"\\p{Xsp}",   /* \s */
   (uschar *)"\\P{Xwd}",   /* \W */
-  (uschar *)"\\p{Xwd}"    /* \w */   
+  (uschar *)"\\p{Xwd}"    /* \w */
 };
-  
+
 static const uschar *posix_substitutes[] = {
   (uschar *)"\\p{L}",     /* alpha */
-  (uschar *)"\\p{Ll}",    /* lower */ 
-  (uschar *)"\\p{Lu}",    /* upper */ 
-  (uschar *)"\\p{Xan}",   /* alnum */ 
+  (uschar *)"\\p{Ll}",    /* lower */
+  (uschar *)"\\p{Lu}",    /* upper */
+  (uschar *)"\\p{Xan}",   /* alnum */
   NULL,                   /* ascii */
   (uschar *)"\\h",        /* blank */
   NULL,                   /* cntrl */
@@ -289,12 +289,12 @@
   NULL,                   /* punct */
   (uschar *)"\\p{Xps}",   /* space */    /* NOTE: Xps is POSIX space */
   (uschar *)"\\p{Xwd}",   /* word */
-  NULL,                   /* xdigit */         
+  NULL,                   /* xdigit */
   /* Negated cases */
   (uschar *)"\\P{L}",     /* ^alpha */
-  (uschar *)"\\P{Ll}",    /* ^lower */ 
-  (uschar *)"\\P{Lu}",    /* ^upper */ 
-  (uschar *)"\\P{Xan}",   /* ^alnum */ 
+  (uschar *)"\\P{Ll}",    /* ^lower */
+  (uschar *)"\\P{Lu}",    /* ^upper */
+  (uschar *)"\\P{Xan}",   /* ^alnum */
   NULL,                   /* ^ascii */
   (uschar *)"\\H",        /* ^blank */
   NULL,                   /* ^cntrl */
@@ -304,10 +304,10 @@
   NULL,                   /* ^punct */
   (uschar *)"\\P{Xps}",   /* ^space */   /* NOTE: Xps is POSIX space */
   (uschar *)"\\P{Xwd}",   /* ^word */
-  NULL                    /* ^xdigit */         
+  NULL                    /* ^xdigit */
 };
 #define POSIX_SUBSIZE (sizeof(posix_substitutes)/sizeof(uschar *))
-#endif   
+#endif


#define STRING(a) # a
#define XSTRING(s) STRING(s)
@@ -407,7 +407,7 @@
/* 65 */
"different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0"
- "this version of PCRE is not compiled with PCRE_UCP support\0"
+ "this version of PCRE is not compiled with PCRE_UCP support\0"
;

 /* Table to identify digits and hex digits. This is used when compiling
@@ -2407,9 +2407,9 @@
   ptype        the property type
   pdata        the data for the type
   negated      TRUE if it's a negated property (\P or \p{^)
-  
+
 Returns:       TRUE if auto-possessifying is OK
-*/    
+*/


 static BOOL
 check_char_prop(int c, int ptype, int pdata, BOOL negated)
@@ -2453,7 +2453,7 @@
           _pcre_ucp_gentype[prop->chartype] == ucp_N ||
           c == CHAR_UNDERSCORE) == negated;
   }
-return FALSE;  
+return FALSE;
 }
 #endif  /* SUPPORT_UCP */


@@ -2478,7 +2478,7 @@
*/

 static BOOL
-check_auto_possessive(const uschar *previous, BOOL utf8, const uschar *ptr, 
+check_auto_possessive(const uschar *previous, BOOL utf8, const uschar *ptr,
   int options, compile_data *cd)
 {
 int c, next;
@@ -2549,23 +2549,23 @@
 if (next >= 0) switch(op_code)
   {
   case OP_CHAR:
-#ifdef SUPPORT_UTF8  
+#ifdef SUPPORT_UTF8
   GETCHARTEST(c, previous);
 #else
   c = *previous;
-#endif      
-  return c != next; 
+#endif
+  return c != next;


/* For CHARNC (caseless character) we must check the other case. If we have
Unicode property support, we can use it to test the other case of
high-valued characters. */

   case OP_CHARNC:
-#ifdef SUPPORT_UTF8  
+#ifdef SUPPORT_UTF8
   GETCHARTEST(c, previous);
 #else
   c = *previous;
-#endif      
+#endif
   if (c == next) return FALSE;
 #ifdef SUPPORT_UTF8
   if (utf8)
@@ -2603,10 +2603,10 @@
   else
 #endif  /* SUPPORT_UTF8 */
   return (c == cd->fcc[next]);  /* Non-UTF-8 mode */
-  
-  /* Note that OP_DIGIT etc. are generated only when PCRE_UCP is *not* set. 
-  When it is set, \d etc. are converted into OP_(NOT_)PROP codes. */ 


+ /* Note that OP_DIGIT etc. are generated only when PCRE_UCP is *not* set.
+ When it is set, \d etc. are converted into OP_(NOT_)PROP codes. */
+
case OP_DIGIT:
return next > 127 || (cd->ctypes[next] & ctype_digit) == 0;

@@ -2673,7 +2673,7 @@
#ifdef SUPPORT_UCP
case OP_PROP:
return check_char_prop(next, previous[0], previous[1], FALSE);
-
+
case OP_NOTPROP:
return check_char_prop(next, previous[0], previous[1], TRUE);
#endif
@@ -2683,21 +2683,21 @@
}


-/* Handle the case when the next item is \d, \s, etc. Note that when PCRE_UCP
-is set, \d turns into ESC_du rather than ESC_d, etc., so ESC_d etc. are
-generated only when PCRE_UCP is *not* set, that is, when only ASCII
-characteristics are recognized. Similarly, the opcodes OP_DIGIT etc. are
+/* Handle the case when the next item is \d, \s, etc. Note that when PCRE_UCP
+is set, \d turns into ESC_du rather than ESC_d, etc., so ESC_d etc. are
+generated only when PCRE_UCP is *not* set, that is, when only ASCII
+characteristics are recognized. Similarly, the opcodes OP_DIGIT etc. are
replaced by OP_PROP codes when PCRE_UCP is set. */

 switch(op_code)
   {
   case OP_CHAR:
   case OP_CHARNC:
-#ifdef SUPPORT_UTF8  
+#ifdef SUPPORT_UTF8
   GETCHARTEST(c, previous);
 #else
   c = *previous;
-#endif      
+#endif
   switch(-next)
     {
     case ESC_d:
@@ -2761,11 +2761,11 @@
       default:
       return -next == ESC_v;
       }
-      
-    /* When PCRE_UCP is set, these values get generated for \d etc. Find 
-    their substitutions and process them. The result will always be either 
+
+    /* When PCRE_UCP is set, these values get generated for \d etc. Find
+    their substitutions and process them. The result will always be either
     -ESC_p or -ESC_P. Then fall through to process those values. */
-  
+
 #ifdef SUPPORT_UCP
     case ESC_du:
     case ESC_DU:
@@ -2780,43 +2780,43 @@
       if (temperrorcode != 0) return FALSE;
       ptr++;    /* For compatibility */
       }
-    /* Fall through */   
+    /* Fall through */


     case ESC_p:
     case ESC_P:
       {
       int ptype, pdata, errorcodeptr;
-      BOOL negated;  
-        
+      BOOL negated;
+
       ptr--;      /* Make ptr point at the p or P */
       ptype = get_ucp(&ptr, &negated, &pdata, &errorcodeptr);
       if (ptype < 0) return FALSE;
       ptr++;      /* Point past the final curly ket */
-      
+
       /* If the property item is optional, we have to give up. (When generated
       from \d etc by PCRE_UCP, this test will have been applied much earlier,
       to the original \d etc. At this point, ptr will point to a zero byte.) */
-      
+
       if (*ptr == CHAR_ASTERISK || *ptr == CHAR_QUESTION_MARK ||
         strncmp((char *)ptr, STR_LEFT_CURLY_BRACKET STR_0 STR_COMMA, 3) == 0)
           return FALSE;
-      
+
       /* Do the property check. */
-      
+
       return check_char_prop(c, ptype, pdata, (next == -ESC_P) != negated);
-      } 
+      }
 #endif


     default:
     return FALSE;
     }
-    
-  /* In principle, support for Unicode properties should be integrated here as 
-  well. It means re-organizing the above code so as to get hold of the property 
-  values before switching on the op-code. However, I wonder how many patterns 
-  combine ASCII \d etc with Unicode properties? (Note that if PCRE_UCP is set, 
-  these op-codes are never generated.) */ 


+  /* In principle, support for Unicode properties should be integrated here as
+  well. It means re-organizing the above code so as to get hold of the property
+  values before switching on the op-code. However, I wonder how many patterns
+  combine ASCII \d etc with Unicode properties? (Note that if PCRE_UCP is set,
+  these op-codes are never generated.) */
+
   case OP_DIGIT:
   return next == -ESC_D || next == -ESC_s || next == -ESC_W ||
          next == -ESC_h || next == -ESC_v || next == -ESC_R;
@@ -2831,14 +2831,14 @@
   return next == -ESC_s || next == -ESC_h || next == -ESC_v;


   case OP_HSPACE:
-  return next == -ESC_S || next == -ESC_H || next == -ESC_d || 
+  return next == -ESC_S || next == -ESC_H || next == -ESC_d ||
          next == -ESC_w || next == -ESC_v || next == -ESC_R;


   case OP_NOT_HSPACE:
   return next == -ESC_h;

   /* Can't have \S in here because VT matches \S (Perl anomaly) */
-  case OP_ANYNL:
+  case OP_ANYNL:
   case OP_VSPACE:
   return next == -ESC_V || next == -ESC_d || next == -ESC_w;

@@ -2846,7 +2846,7 @@
   return next == -ESC_v || next == -ESC_R;

   case OP_WORDCHAR:
-  return next == -ESC_W || next == -ESC_s || next == -ESC_h || 
+  return next == -ESC_W || next == -ESC_s || next == -ESC_h ||
          next == -ESC_v || next == -ESC_R;


   case OP_NOT_WORDCHAR:
@@ -2982,7 +2982,7 @@

   c = *ptr;

-  /* If we are at the end of a nested substitution, revert to the outer level
+  /* If we are at the end of a nested substitution, revert to the outer level
   string. Nesting only happens one level deep. */

   if (c == 0 && nestptr != NULL)
@@ -3289,7 +3289,7 @@
         {                           /* Braces are required because the */
         GETCHARLEN(c, ptr, ptr);    /* macro generates multiple statements */
         }
-        
+
       /* In the pre-compile phase, accumulate the length of any UTF-8 extra
       data and reset the pointer. This is so that very large classes that
       contain a zillion UTF-8 characters no longer overwrite the work space
@@ -3358,22 +3358,22 @@


         if ((options & PCRE_CASELESS) != 0 && posix_class <= 2)
           posix_class = 0;
-          
-        /* When PCRE_UCP is set, some of the POSIX classes are converted to 
+
+        /* When PCRE_UCP is set, some of the POSIX classes are converted to
         different escape sequences that use Unicode properties. */
-        
+
 #ifdef SUPPORT_UCP
         if ((options & PCRE_UCP) != 0)
           {
           int pc = posix_class + ((local_negate)? POSIX_SUBSIZE/2 : 0);
           if (posix_substitutes[pc] != NULL)
             {
-            nestptr = tempptr + 1; 
+            nestptr = tempptr + 1;
             ptr = posix_substitutes[pc] - 1;
-            continue; 
-            } 
-          }   
-#endif           
+            continue;
+            }
+          }
+#endif
         /* In the non-UCP case, we build the bit map for the POSIX class in a
         chunk of local store because we may be adding and subtracting from it,
         and we don't want to subtract bits that may be in the main map already.
@@ -3460,7 +3460,7 @@
             case ESC_SU:
             nestptr = ptr;
             ptr = substitutes[-c - ESC_DU] - 1;  /* Just before substitute */
-            class_charcount -= 2;                /* Undo! */ 
+            class_charcount -= 2;                /* Undo! */
             continue;
 #endif
             case ESC_d:
@@ -3911,7 +3911,7 @@
     can cause firstbyte to be set. Otherwise, there can be no first char if
     this item is first, whatever repeat count may follow. In the case of
     reqbyte, save the previous value for reinstating. */
-    
+
 #ifdef SUPPORT_UTF8
     if (class_charcount == 1 && !class_utf8 &&
       (!utf8 || !negate_class || class_lastchar < 128))
@@ -3991,7 +3991,7 @@
       }
 #endif


-    /* If there are no characters > 255, or they are all to be included or 
+    /* If there are no characters > 255, or they are all to be included or
     excluded, set the opcode to OP_CLASS or OP_NCLASS, depending on whether the
     whole class was negated and whether there were negative specials such as \S
     (non-UCP) in the class. Then copy the 32-byte map into the code vector,
@@ -5795,7 +5795,7 @@


     /* ===================================================================*/
     /* Handle metasequences introduced by \. For ones like \d, the ESC_ values
-    are arranged to be the negation of the corresponding OP_values in the 
+    are arranged to be the negation of the corresponding OP_values in the
     default case when PCRE_UCP is not set. For the back references, the values
     are ESC_REF plus the reference number. Only back references and those types
     that consume a character may be repeated. We can test for values between
@@ -5973,11 +5973,11 @@
           ptr = substitutes[-c - ESC_DU] - 1;  /* Just before substitute */
           }
         else
-#endif    
-          {  
+#endif
+          {
           previous = (-c > ESC_b && -c < ESC_Z)? code : NULL;
           *code++ = -c;
-          } 
+          }
         }
       continue;
       }
@@ -6809,7 +6809,7 @@
     options = (options & ~(PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) | newbsr;
   else break;
   }
-  
+
 utf8 = (options & PCRE_UTF8) != 0;


/* Can't support UTF8 unless PCRE has been compiled to include the code. */
@@ -6835,8 +6835,8 @@
 if ((options & PCRE_UCP) != 0)
   {
   errorcode = ERR67;
-  goto PCRE_EARLY_ERROR_RETURN;
-  }
+  goto PCRE_EARLY_ERROR_RETURN;
+  }
 #endif

/* Check validity of \R options. */

Modified: code/trunk/pcre_dfa_exec.c
===================================================================
--- code/trunk/pcre_dfa_exec.c    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcre_dfa_exec.c    2010-06-03 19:18:24 UTC (rev 535)
@@ -922,36 +922,36 @@
           if (utf8) BACKCHAR(temp);
 #endif
           GETCHARTEST(d, temp);
-#ifdef SUPPORT_UCP          
+#ifdef SUPPORT_UCP
           if ((md->poptions & PCRE_UCP) != 0)
             {
             if (d == '_') left_word = TRUE; else
-              { 
+              {
               int cat = UCD_CATEGORY(d);
               left_word = (cat == ucp_L || cat == ucp_N);
-              } 
-            }  
-          else 
-#endif          
+              }
+            }
+          else
+#endif
           left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
           }
         else left_word = FALSE;


         if (clen > 0)
-          { 
-#ifdef SUPPORT_UCP          
+          {
+#ifdef SUPPORT_UCP
           if ((md->poptions & PCRE_UCP) != 0)
             {
             if (c == '_') right_word = TRUE; else
-              { 
+              {
               int cat = UCD_CATEGORY(c);
               right_word = (cat == ucp_L || cat == ucp_N);
-              } 
-            }  
-          else 
-#endif          
+              }
+            }
+          else
+#endif
           right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
-          } 
+          }
         else right_word = FALSE;


         if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
@@ -979,7 +979,7 @@
           break;


           case PT_LAMP:
-          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll || 
+          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
                prop->chartype == ucp_Lt;
           break;


@@ -994,30 +994,30 @@
           case PT_SC:
           OK = prop->script == code[2];
           break;
-          
+
           /* These are specials for combination cases. */
-          
+
           case PT_ALNUM:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N;
-          break;        
- 
+          break;
+
           case PT_SPACE:    /* Perl space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
-          break;  
- 
+          break;
+
           case PT_PXSPACE:  /* POSIX space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
                c == CHAR_FF || c == CHAR_CR;
-          break;      
- 
+          break;
+
           case PT_WORD:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N ||
                c == CHAR_UNDERSCORE;
-          break;            
+          break;


           /* Should never occur, but keep compilers from grumbling. */


@@ -1173,7 +1173,7 @@
           break;


           case PT_LAMP:
-          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll || 
+          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
             prop->chartype == ucp_Lt;
           break;


@@ -1190,28 +1190,28 @@
           break;


           /* These are specials for combination cases. */
-          
+
           case PT_ALNUM:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N;
-          break;        
- 
+          break;
+
           case PT_SPACE:    /* Perl space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
-          break;  
- 
+          break;
+
           case PT_PXSPACE:  /* POSIX space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
                c == CHAR_FF || c == CHAR_CR;
-          break;      
- 
+          break;
+
           case PT_WORD:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N ||
                c == CHAR_UNDERSCORE;
-          break;            
+          break;


           /* Should never occur, but keep compilers from grumbling. */


@@ -1420,7 +1420,7 @@
           break;


           case PT_LAMP:
-          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll || 
+          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
             prop->chartype == ucp_Lt;
           break;


@@ -1435,30 +1435,30 @@
           case PT_SC:
           OK = prop->script == code[3];
           break;
-          
+
           /* These are specials for combination cases. */
-          
+
           case PT_ALNUM:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N;
-          break;        
- 
+          break;
+
           case PT_SPACE:    /* Perl space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
-          break;  
- 
+          break;
+
           case PT_PXSPACE:  /* POSIX space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
                c == CHAR_FF || c == CHAR_CR;
-          break;      
- 
+          break;
+
           case PT_WORD:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N ||
                c == CHAR_UNDERSCORE;
-          break;            
+          break;


           /* Should never occur, but keep compilers from grumbling. */


@@ -1692,7 +1692,7 @@
           break;


           case PT_LAMP:
-          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll || 
+          OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
             prop->chartype == ucp_Lt;
           break;


@@ -1707,30 +1707,30 @@
           case PT_SC:
           OK = prop->script == code[5];
           break;
-          
+
           /* These are specials for combination cases. */
-          
+
           case PT_ALNUM:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N;
-          break;        
- 
+          break;
+
           case PT_SPACE:    /* Perl space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
-          break;  
- 
+          break;
+
           case PT_PXSPACE:  /* POSIX space */
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
                c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
                c == CHAR_FF || c == CHAR_CR;
-          break;      
- 
+          break;
+
           case PT_WORD:
           OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
                _pcre_ucp_gentype[prop->chartype] == ucp_N ||
                c == CHAR_UNDERSCORE;
-          break;            
+          break;


           /* Should never occur, but keep compilers from grumbling. */


@@ -2623,7 +2623,7 @@

           do { end_subpattern += GET(end_subpattern, 1); }
             while (*end_subpattern == OP_ALT);
-          next_state_offset = 
+          next_state_offset =
             (int)(end_subpattern - start_code + LINK_SIZE + 1);


           /* If the end of this subpattern is KETRMAX or KETRMIN, we must


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcre_internal.h    2010-06-03 19:18:24 UTC (rev 535)
@@ -1218,7 +1218,7 @@
 their negation. Also, they must appear in the same order as in the opcode
 definitions below, up to ESC_z. There's a dummy for OP_ALLANY because it
 corresponds to "." in DOTALL mode rather than an escape sequence. It is also
-used for [^] in JavaScript compatibility mode. In non-DOTALL mode, "." behaves 
+used for [^] in JavaScript compatibility mode. In non-DOTALL mode, "." behaves
 like \N.


The special values ESC_DU, ESC_du, etc. are used instead of ESC_D, ESC_d, etc.
@@ -1235,9 +1235,9 @@

 enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
        ESC_W, ESC_w, ESC_N, ESC_dum, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
-       ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, 
+       ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z,
        ESC_E, ESC_Q, ESC_g, ESC_k,
-       ESC_DU, ESC_du, ESC_SU, ESC_su, ESC_WU, ESC_wu, 
+       ESC_DU, ESC_du, ESC_SU, ESC_su, ESC_WU, ESC_wu,
        ESC_REF };


 /* Opcode table: Starting from 1 (i.e. after OP_END), the values up to
@@ -1677,7 +1677,7 @@
   BOOL   noteol;                /* NOTEOL flag */
   BOOL   utf8;                  /* UTF8 flag */
   BOOL   jscript_compat;        /* JAVASCRIPT_COMPAT flag */
-  BOOL   use_ucp;               /* PCRE_UCP flag */ 
+  BOOL   use_ucp;               /* PCRE_UCP flag */
   BOOL   endonly;               /* Dollar not before final \n */
   BOOL   notempty;              /* Empty string match not wanted */
   BOOL   notempty_atstart;      /* Empty string match at start not wanted */


Modified: code/trunk/pcre_study.c
===================================================================
--- code/trunk/pcre_study.c    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcre_study.c    2010-06-03 19:18:24 UTC (rev 535)
@@ -441,7 +441,7 @@
 *      Set a bit and maybe its alternate case    *
 *************************************************/


-/* Given a character, set its first byte's bit in the table, and also the 
+/* Given a character, set its first byte's bit in the table, and also the
 corresponding bit for the other version of a letter if we are caseless. In
 UTF-8 mode, for characters greater than 127, we can only do the caseless thing
 when Unicode property support is available.
@@ -451,7 +451,7 @@
   p             points to the character
   caseless      the caseless flag
   cd            the block with char table pointers
-  utf8          TRUE for UTF-8 mode 
+  utf8          TRUE for UTF-8 mode


 Returns:        pointer after the character
 */
@@ -471,15 +471,15 @@
 #ifdef SUPPORT_UCP
   if (caseless)
     {
-    uschar buff[8]; 
+    uschar buff[8];
     c = UCD_OTHERCASE(c);
-    (void)_pcre_ord2utf8(c, buff); 
-    SET_BIT(buff[0]); 
-    }  
-#endif 
+    (void)_pcre_ord2utf8(c, buff);
+    SET_BIT(buff[0]);
+    }
+#endif
   return p;
   }
-#endif    
+#endif


/* Not UTF-8 mode, or character is less than 127. */

@@ -666,40 +666,40 @@
       (void)set_table_bit(start_bits, tcode + 1, caseless, cd, utf8);
       try_next = FALSE;
       break;
-      
-      /* Special spacing and line-terminating items. These recognize specific 
-      lists of characters. The difference between VSPACE and ANYNL is that the 
-      latter can match the two-character CRLF sequence, but that is not 
-      relevant for finding the first character, so their code here is 
+
+      /* Special spacing and line-terminating items. These recognize specific
+      lists of characters. The difference between VSPACE and ANYNL is that the
+      latter can match the two-character CRLF sequence, but that is not
+      relevant for finding the first character, so their code here is
       identical. */
-      
+
       case OP_HSPACE:
       SET_BIT(0x09);
       SET_BIT(0x20);
       SET_BIT(0xA0);
       if (utf8)
-        {  
+        {
         SET_BIT(0xE1);  /* For U+1680, U+180E */
         SET_BIT(0xE2);  /* For U+2000 - U+200A, U+202F, U+205F */
-        SET_BIT(0xE3);  /* For U+3000 */ 
+        SET_BIT(0xE3);  /* For U+3000 */
         }
       try_next = FALSE;
-      break;         
+      break;


-      case OP_ANYNL:  
+      case OP_ANYNL:
       case OP_VSPACE:
-      SET_BIT(0x0A); 
-      SET_BIT(0x0B); 
-      SET_BIT(0x0C); 
-      SET_BIT(0x0D); 
-      SET_BIT(0x85); 
-      if (utf8) SET_BIT(0xE2);    /* For U+2028, U+2029 */ 
+      SET_BIT(0x0A);
+      SET_BIT(0x0B);
+      SET_BIT(0x0C);
+      SET_BIT(0x0D);
+      SET_BIT(0x85);
+      if (utf8) SET_BIT(0xE2);    /* For U+2028, U+2029 */
       try_next = FALSE;
-      break;  
+      break;


-      /* Single character types set the bits and stop. Note that if PCRE_UCP 
-      is set, we do not see these op codes because \d etc are converted to 
-      properties. Therefore, these apply in the case when only ASCII characters 
+      /* Single character types set the bits and stop. Note that if PCRE_UCP
+      is set, we do not see these op codes because \d etc are converted to
+      properties. Therefore, these apply in the case when only ASCII characters
       are recognized to match the types. */


       case OP_NOT_DIGIT:
@@ -757,7 +757,7 @@


       case OP_TYPEPLUS:
       case OP_TYPEMINPLUS:
-      case OP_TYPEPOSPLUS: 
+      case OP_TYPEPOSPLUS:
       tcode++;
       break;


@@ -785,29 +785,29 @@
         case OP_ANY:
         case OP_ALLANY:
         return SSB_FAIL;
-        
+
         case OP_HSPACE:
         SET_BIT(0x09);
         SET_BIT(0x20);
         SET_BIT(0xA0);
         if (utf8)
-          {  
+          {
           SET_BIT(0xE1);  /* For U+1680, U+180E */
           SET_BIT(0xE2);  /* For U+2000 - U+200A, U+202F, U+205F */
-          SET_BIT(0xE3);  /* For U+3000 */ 
+          SET_BIT(0xE3);  /* For U+3000 */
           }
-        break;         
-  
-        case OP_ANYNL:  
+        break;
+
+        case OP_ANYNL:
         case OP_VSPACE:
-        SET_BIT(0x0A); 
-        SET_BIT(0x0B); 
-        SET_BIT(0x0C); 
-        SET_BIT(0x0D); 
-        SET_BIT(0x85); 
-        if (utf8) SET_BIT(0xE2);    /* For U+2028, U+2029 */ 
-        break;  
- 
+        SET_BIT(0x0A);
+        SET_BIT(0x0B);
+        SET_BIT(0x0C);
+        SET_BIT(0x0D);
+        SET_BIT(0x85);
+        if (utf8) SET_BIT(0xE2);    /* For U+2028, U+2029 */
+        break;
+
         case OP_NOT_DIGIT:
         for (c = 0; c < 32; c++)
           start_bits[c] |= ~cd->cbits[c+cbit_digit];


Modified: code/trunk/pcre_xclass.c
===================================================================
--- code/trunk/pcre_xclass.c    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcre_xclass.c    2010-06-03 19:18:24 UTC (rev 535)
@@ -112,12 +112,12 @@
       break;


       case PT_LAMP:
-      if ((prop->chartype == ucp_Lu || prop->chartype == ucp_Ll || 
+      if ((prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
            prop->chartype == ucp_Lt) == (t == XCL_PROP)) return !negated;
       break;


       case PT_GC:
-      if ((data[1] == _pcre_ucp_gentype[prop->chartype]) == (t == XCL_PROP)) 
+      if ((data[1] == _pcre_ucp_gentype[prop->chartype]) == (t == XCL_PROP))
         return !negated;
       break;


@@ -128,33 +128,33 @@
       case PT_SC:
       if ((data[1] == prop->script) == (t == XCL_PROP)) return !negated;
       break;
-      
+
       case PT_ALNUM:
       if ((_pcre_ucp_gentype[prop->chartype] == ucp_L ||
            _pcre_ucp_gentype[prop->chartype] == ucp_N) == (t == XCL_PROP))
         return !negated;
-      break;         
-      
+      break;
+
       case PT_SPACE:    /* Perl space */
       if ((_pcre_ucp_gentype[prop->chartype] == ucp_Z ||
-           c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR) 
+           c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR)
              == (t == XCL_PROP))
         return !negated;
-      break;         
+      break;


       case PT_PXSPACE:  /* POSIX space */
       if ((_pcre_ucp_gentype[prop->chartype] == ucp_Z ||
            c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
            c == CHAR_FF || c == CHAR_CR) == (t == XCL_PROP))
         return !negated;
-      break;         
+      break;


-      case PT_WORD:    
+      case PT_WORD:
       if ((_pcre_ucp_gentype[prop->chartype] == ucp_L ||
-           _pcre_ucp_gentype[prop->chartype] == ucp_N || c == CHAR_UNDERSCORE) 
+           _pcre_ucp_gentype[prop->chartype] == ucp_N || c == CHAR_UNDERSCORE)
              == (t == XCL_PROP))
         return !negated;
-      break;         
+      break;


       /* This should never occur, but compilers may mutter if there is no
       default. */


Modified: code/trunk/pcregrep.c
===================================================================
--- code/trunk/pcregrep.c    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcregrep.c    2010-06-03 19:18:24 UTC (rev 535)
@@ -104,10 +104,10 @@


enum { EL_LF, EL_CR, EL_CRLF, EL_ANY, EL_ANYCRLF };

-/* In newer versions of gcc, with FORTIFY_SOURCE set (the default in some
-environments), a warning is issued if the value of fwrite() is ignored.
-Unfortunately, casting to (void) does not suppress the warning. To get round
-this, we use a macro that compiles a fudge. Oddly, this does not also seem to
+/* In newer versions of gcc, with FORTIFY_SOURCE set (the default in some
+environments), a warning is issued if the value of fwrite() is ignored.
+Unfortunately, casting to (void) does not suppress the warning. To get round
+this, we use a macro that compiles a fudge. Oddly, this does not also seem to
apply to fprintf(). */

 #define FWRITE(a,b,c,d) if (fwrite(a,b,c,d)) {}
@@ -550,20 +550,20 @@
 *            Read one line of input              *
 *************************************************/


-/* Normally, input is read using fread() into a large buffer, so many lines may
-be read at once. However, doing this for tty input means that no output appears
+/* Normally, input is read using fread() into a large buffer, so many lines may
+be read at once. However, doing this for tty input means that no output appears
until a lot of input has been typed. Instead, tty input is handled line by
line. We cannot use fgets() for this, because it does not stop at a binary
-zero, and therefore there is no way of telling how many characters it has read,
+zero, and therefore there is no way of telling how many characters it has read,
because there may be binary zeros embedded in the data.

 Arguments:
   buffer     the buffer to read into
   length     the maximum number of characters to read
   f          the file
-  
+
 Returns:     the number of characters read, zero at end of file
-*/   
+*/


static int
read_one_line(char *buffer, int length, FILE *f)
@@ -573,9 +573,9 @@
 while ((c = fgetc(f)) != EOF)
   {
   buffer[yield++] = c;
-  if (c == '\n' || yield >= length) break;
-  }
-return yield;
+  if (c == '\n' || yield >= length) break;
+  }
+return yield;
 }


@@ -1017,11 +1017,11 @@
   {
   in = (FILE *)handle;
   if (is_file_tty(in)) input_line_buffered = TRUE;
-  bufflength = input_line_buffered? 
+  bufflength = input_line_buffered?
     read_one_line(buffer, 3*MBUFTHIRD, in) :
     fread(buffer, 1, 3*MBUFTHIRD, in);
   }
-  
+
 endptr = buffer + bufflength;


 /* Loop while the current pointer is not at the end of the file. For large
@@ -1321,7 +1321,7 @@
           FWRITE(matchptr + offsets[0], 1, offsets[1] - offsets[0], stdout);
           fprintf(stdout, "%c[00m", 0x1b);
           }
-        FWRITE(ptr + last_offset, 1, 
+        FWRITE(ptr + last_offset, 1,
           (linelength + endlinelength) - last_offset, stdout);
         }


@@ -1367,16 +1367,16 @@
   ptr += linelength + endlinelength;
   filepos += (int)(linelength + endlinelength);
   linenumber++;
-  
-  /* If input is line buffered, and the buffer is not yet full, read another 
+
+  /* If input is line buffered, and the buffer is not yet full, read another
   line and add it into the buffer. */
-  
+
   if (input_line_buffered && bufflength < sizeof(buffer))
     {
     int add = read_one_line(ptr, sizeof(buffer) - (ptr - buffer), in);
     bufflength += add;
-    endptr += add; 
-    }   
+    endptr += add;
+    }


   /* If we haven't yet reached the end of the file (the buffer is full), and
   the current point is in the top 1/3 of the buffer, slide the buffer down by
@@ -1412,9 +1412,9 @@
     else
 #endif


-    bufflength = 2*MBUFTHIRD + 
-      (input_line_buffered? 
-       read_one_line(buffer + 2*MBUFTHIRD, MBUFTHIRD, in) : 
+    bufflength = 2*MBUFTHIRD +
+      (input_line_buffered?
+       read_one_line(buffer + 2*MBUFTHIRD, MBUFTHIRD, in) :
        fread(buffer + 2*MBUFTHIRD, 1, MBUFTHIRD, in));
     endptr = buffer + bufflength;


@@ -1766,7 +1766,7 @@
   case N_FOFFSETS: file_offsets = TRUE; break;
   case N_HELP: help(); exit(0);
   case N_LOFFSETS: line_offsets = number = TRUE; break;
-  case N_LBUFFER: line_buffered = TRUE; break; 
+  case N_LBUFFER: line_buffered = TRUE; break;
   case 'c': count_only = TRUE; break;
   case 'F': process_options |= PO_FIXED_STRINGS; break;
   case 'H': filenames = FN_FORCE; break;
@@ -2029,7 +2029,7 @@
         else                 /* Special case xxx=data */
           {
           int oplen = (int)(equals - op->long_name);
-          int arglen = (argequals == NULL)? 
+          int arglen = (argequals == NULL)?
             (int)strlen(arg) : (int)(argequals - arg);
           if (oplen == arglen && strncmp(arg, op->long_name, oplen) == 0)
             {
@@ -2289,7 +2289,7 @@
     if (cs != NULL) colour_string = cs;
     }
   }
-  
+
 /* Interpret the newline type; the default settings are Unix-like. */


if (strcmp(newline, "cr") == 0 || strcmp(newline, "CR") == 0)

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2010-06-03 18:26:05 UTC (rev 534)
+++ code/trunk/pcretest.c    2010-06-03 19:18:24 UTC (rev 535)
@@ -1237,7 +1237,7 @@


       case 'S': do_study = 1; break;
       case 'U': options |= PCRE_UNGREEDY; break;
-      case 'W': options |= PCRE_UCP; break; 
+      case 'W': options |= PCRE_UCP; break;
       case 'X': options |= PCRE_EXTRA; break;
       case 'Z': debug_lengths = 0; break;
       case '8': options |= PCRE_UTF8; use_utf8 = 1; break;