[Pcre-svn] [722] code/trunk: Documentation update

Author: Subversion repository
To: pcre-svn

Revision: 722

          http://www.exim.org/viewvc/pcre2?view=rev&revision=722
Author:   ph10
Date:     2017-03-31 17:49:33 +0100 (Fri, 31 Mar 2017)
Log Message:
-----------
Documentation update

Modified Paths:
--------------
    code/trunk/Makefile.am
    code/trunk/doc/html/index.html
    code/trunk/doc/html/pcre2compat.html
    code/trunk/doc/html/pcre2jit.html
    code/trunk/doc/html/pcre2limits.html
    code/trunk/doc/html/pcre2perform.html
    code/trunk/doc/index.html.src
    code/trunk/doc/pcre2.txt
    code/trunk/doc/pcre2perform.3

Removed Paths:
-------------
    code/trunk/doc/pcre2stack.3

Modified: code/trunk/Makefile.am
===================================================================
--- code/trunk/Makefile.am    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/Makefile.am    2017-03-31 16:49:33 UTC (rev 722)
@@ -103,7 +103,6 @@
   doc/html/pcre2posix.html \
   doc/html/pcre2sample.html \
   doc/html/pcre2serialize.html \
-  doc/html/pcre2stack.html \
   doc/html/pcre2syntax.html \
   doc/html/pcre2test.html \
   doc/html/pcre2unicode.html
@@ -187,7 +186,6 @@
   doc/pcre2posix.3 \
   doc/pcre2sample.3 \
   doc/pcre2serialize.3 \
-  doc/pcre2stack.3 \
   doc/pcre2syntax.3 \
   doc/pcre2test.1 \
   doc/pcre2unicode.3

Modified: code/trunk/doc/html/index.html
===================================================================
--- code/trunk/doc/html/index.html    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/html/index.html    2017-03-31 16:49:33 UTC (rev 722)
@@ -68,9 +68,6 @@
 <tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
     <td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>

-<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
-    <td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
-
 <tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
     <td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>

Modified: code/trunk/doc/html/pcre2compat.html
===================================================================
--- code/trunk/doc/html/pcre2compat.html    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/html/pcre2compat.html    2017-03-31 16:49:33 UTC (rev 722)
@@ -18,7 +18,8 @@
 <P>
 This document describes the differences in the ways that PCRE2 and Perl handle
 regular expressions. The differences described here are with respect to Perl
-versions 5.10 and above.
+version 5.24, but as both Perl and PCRE2 are continually changing, the
+information may sometimes be out of date.
 </P>
 <P>
 1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@@ -27,17 +28,18 @@
 page.
 </P>
 <P>
-2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
-do not mean what you might think. For example, (?!a){3} does not assert that
-the next three characters are not "a". It just asserts that the next character
-is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
-just once). Perl allows repeat quantifiers on other assertions such as \b, but
-these do not seem to have any use.
+2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
+they do not mean what you might think. For example, (?!a){3} does not assert
+that the next three characters are not "a". It just asserts that the next
+character is not "a" three times (in principle: PCRE2 optimizes this to run the
+assertion just once). Perl allows some repeat quantifiers on other assertions,
+for example, \b* (but not \b{3}), but these do not seem to have any use.
 </P>
 <P>
-3. Capturing subpatterns that occur inside negative lookahead assertions are
-counted, but their entries in the offsets vector are never set. Perl sometimes
-(but not always) sets its numerical variables from inside negative assertions.
+3. Capturing subpatterns that occur inside negative lookaround assertions are
+counted, but their entries in the offsets vector are set only if the assertion 
+is a condition. Perl has changed its behaviour in this regard from time to 
+time.
 </P>
 <P>
 4. The following Perl escape sequences are not supported: \l, \u, \L,
@@ -50,13 +52,13 @@
 </P>
 <P>
 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
-built with Unicode support. The properties that can be tested with \p and \P
-are limited to the general category properties such as Lu and Nd, script names
-such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
-the Cs (surrogate) property, which Perl does not; the Perl documentation says
-"Because Perl hides the need for the user to understand the internal
-representation of Unicode characters, there is no need to implement the
-somewhat messy concept of surrogates."
+built with Unicode support (the default). The properties that can be tested
+with \p and \P are limited to the general category properties such as Lu and
+Nd, script names such as Greek or Han, and the derived properties Any and L&.
+PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
+documentation says "Because Perl hides the need for the user to understand the
+internal representation of Unicode characters, there is no need to implement
+the somewhat messy concept of surrogates."
 </P>
 <P>
 6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
@@ -75,23 +77,15 @@
 </P>
 <P>
 7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
-constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
-feature allows an external function to be called during pattern matching. See
-the
+constructions. However, PCRE2 has a "callout" feature, which
+allows an external function to be called during pattern matching. See the
 <a href="pcre2callout.html"><b>pcre2callout</b></a>
 documentation for details.
 </P>
 <P>
-8. Subroutine calls (whether recursive or not) are treated as atomic groups.
-Atomic recursion is like Python, but unlike Perl. Captured values that are set
-outside a subroutine call can be referenced from inside in PCRE2, but not in
-Perl. There is a discussion that explains these differences in more detail in
-the
-<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
-in the
-<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
-page.
+8. Subroutine calls (whether recursive or not) were treated as atomic groups up
+to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
+into subroutine calls is now supported, as in Perl. 
 </P>
 <P>
 9. If any of the backtracking control verbs are used in a subpattern that is
@@ -147,14 +141,14 @@
 16. In PCRE2, the upper/lower case character properties Lu and Ll are not
 affected when case-independent matching is specified. For example, \p{Lu}
 always matches an upper case letter. I think Perl has changed in this respect;
-in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
+in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
 letters, regardless of case, when case independence is specified.
 </P>
 <P>
 17. PCRE2 provides some extensions to the Perl regular expression facilities.
 Perl 5.10 includes new features that are not in earlier versions of Perl, some
-of which (such as named parentheses) have been in PCRE2 for some time. This
-list is with respect to Perl 5.10:
+of which (such as named parentheses) were in PCRE2 for some time before. This
+list is with respect to Perl 5.24:
 <br>
 <br>
 (a) Although lookbehind assertions in PCRE2 must match fixed length strings,
@@ -220,9 +214,9 @@
 REVISION
 </b><br>
 <P>
-Last updated: 18 October 2016
+Last updated: 29 March 2017
 <br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
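Editorial illustration of item 2 in the pcre2compat changes above (not part of the commit). This is a minimal sketch using Python's `re` module, which is a separate engine from PCRE2 and is used here only as a stand-in, on the assumption that it treats a quantified negative lookahead the same way. The subject strings are made up for illustration.

```python
import re

# Assumption: Python's "re" stands in for PCRE2. The two engines are separate
# implementations, so this illustrates the semantics described in item 2
# without exercising PCRE2 itself.
repeated = re.compile(r"(?!a){3}")
single = re.compile(r"(?!a)")

# Hypothetical subject strings chosen for illustration.
subjects = ["baa", "bbb", "abc", ""]


def matches_at_start(pattern, subject):
    """Return True if the zero-width pattern succeeds at offset 0."""
    return pattern.match(subject) is not None


# For each subject, record (result of (?!a){3}, result of (?!a)).
results = {s: (matches_at_start(repeated, s), matches_at_start(single, s))
           for s in subjects}

for subject, (rep, one) in results.items():
    print(f"{subject!r}: (?!a){{3}} -> {rep}, (?!a) -> {one}")
```

The subject "baa" is the interesting case: its second and third characters are "a", yet the quantified assertion still succeeds, because the same position is tested each time.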

Modified: code/trunk/doc/html/pcre2jit.html
===================================================================
--- code/trunk/doc/html/pcre2jit.html    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/html/pcre2jit.html    2017-03-31 16:49:33 UTC (rev 722)
@@ -173,7 +173,7 @@
 The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
 a very large pattern tree goes on for too long, as it is in the same
 circumstance when JIT is not used, but the details of exactly what is counted
-are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
+are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
 when JIT matching is used.
 <a name="stackcontrol"></a></P>
 <br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
@@ -436,9 +436,9 @@
 </P>
 <br><a name="SEC13" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 05 June 2016
+Last updated: 30 March 2017
 <br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.

Modified: code/trunk/doc/html/pcre2limits.html
===================================================================
--- code/trunk/doc/html/pcre2limits.html    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/html/pcre2limits.html    2017-03-31 16:49:33 UTC (rev 722)
@@ -44,14 +44,6 @@
 and unset offsets.
 </P>
 <P>
-Note that when using the traditional matching function, PCRE2 uses recursion to
-handle subpatterns and indefinite repetition. This means that the available
-stack space may limit the size of a subject string that can be processed by
-certain patterns. For a discussion of stack issues, see the
-<a href="pcre2stack.html"><b>pcre2stack</b></a>
-documentation.
-</P>
-<P>
 All values in repeating quantifiers must be less than 65536.
 </P>
 <P>
@@ -94,9 +86,9 @@
 REVISION
 </b><br>
 <P>
-Last updated: 26 October 2016
+Last updated: 30 March 2017
 <br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.

Modified: code/trunk/doc/html/pcre2perform.html
===================================================================
--- code/trunk/doc/html/pcre2perform.html    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/html/pcre2perform.html    2017-03-31 16:49:33 UTC (rev 722)
@@ -15,7 +15,7 @@
 <ul>
 <li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
 <li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
-<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
+<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
 <li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
 <li><a name="TOC5" href="#SEC5">AUTHOR</a>
 <li><a name="TOC6" href="#SEC6">REVISION</a>
@@ -29,11 +29,11 @@
 <br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
 <P>
 Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
-so that most simple patterns do not use much memory. However, there is one case
-where the memory usage of a compiled pattern can be unexpectedly large. If a
-parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
-a limited maximum, the whole subpattern is repeated in the compiled code. For
-example, the pattern
+so that most simple patterns do not use much memory for storing the compiled 
+version. However, there is one case where the memory usage of a compiled
+pattern can be unexpectedly large. If a parenthesized subpattern has a
+quantifier with a minimum greater than 1 and/or a limited maximum, the whole
+subpattern is repeated in the compiled code. For example, the pattern
 <pre>
   (abc|def){2,4}
 </pre>
@@ -52,13 +52,13 @@
 <pre>
   ((ab){1,1000}c){1,3}
 </pre>
-uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
-with its default internal pointer size of two bytes, the size limit on a
-compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
-is reached with the above pattern if the outer repetition is increased from 3
-to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
-larger compiled patterns, but it is better to try to rewrite your pattern to
-use less memory if you can.
+uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
+compiled with its default internal pointer size of two bytes, the size limit on
+a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
+this is reached with the above pattern if the outer repetition is increased
+from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
+handle larger compiled patterns, but it is better to try to rewrite your
+pattern to use less memory if you can.
 </P>
 <P>
 One way of reducing the memory usage for such patterns is to make use of
@@ -68,26 +68,34 @@
 <pre>
   ((ab)(?2){0,999}c)(?1){0,2}
 </pre>
-reduces the memory requirements to 18K, and indeed it remains under 20K even
-with the outer repetition increased to 100. However, this pattern is not
-exactly equivalent, because the "subroutine" calls are treated as
-<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30 - things were different
+in earlier releases.)
 </P>
-<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
+<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
 <P>
-When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-<a href="pcre2stack.html"><b>pcre2stack</b></a>
-documentation discusses this issue in detail.
+From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 10K vector of
+frames is allocated on the system stack (enough for about 50 frames for small
+patterns), but if this is insufficient, heap memory is used. Rewriting patterns 
+to be time-efficient, as described below, may also reduce the memory 
+requirements.
 </P>
+<P>
+In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive 
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in <b>pcre2_dfa_match()</b>.
+</P>
 <br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
 <P>
 Certain items in regular expression patterns are processed more efficiently
@@ -175,8 +183,55 @@
 </P>
 <P>
 In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory 
+requirements as well. As another example, consider this pattern:
+<pre>
+  ([^&#60;]|&#60;(?!inet))+
+</pre>
+It matches from wherever it starts until it encounters "&#60;inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
+parenthesis is processed, a backtracking position is passed, so this
+formulation uses a memory frame for each matched character. For a long string,
+a lot of memory is required. Consider now this rewritten pattern, which matches
+exactly the same strings:
+<pre>
+  ([^&#60;]++|&#60;(?!inet))+
+</pre>
+This runs much faster, because sequences of characters that do not contain "&#60;"
+are "swallowed" in one item inside the parentheses, and a possessive quantifier
+is used to stop any backtracking into the runs of non-"&#60;" characters. This
+version also uses a lot less memory because entry to a new set of parentheses
+happens only when a "&#60;" character that is not followed by "inet" is encountered
+(and we assume this is relatively rare). 
 </P>
+<P>
+This example shows that one way of optimizing performance when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
+</P>
+<br><b>
+SETTING RESOURCE LIMITS
+</b><br>
+<P>
+You can set limits on the amount of processing that takes place when matching, 
+and on the amount of heap memory that is used. The default values of the limits
+are very large, and unlikely ever to operate. They can be changed when PCRE2 is
+built, and they can also be set when <b>pcre2_match()</b> or 
+<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
+<a href="pcre2build.html"><b>pcre2build</b></a>
+documentation and the section entitled
+<a href="pcre2api.html#matchcontext">"The match context"</a>
+in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+documentation.
+</P>
+<P>
+The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
+</P>
 <br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
 <P>
 Philip Hazel
@@ -188,9 +243,9 @@
 </P>
 <br><a name="SEC6" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 January 2015
+Last updated: 31 March 2017
 <br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
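Editorial illustration of the "&lt;inet" example added to pcre2perform above (not part of the commit). The sketch below checks, on one made-up subject string, that the per-character pattern and the possessive rewrite match the same text. It uses Python's `re` module as a stand-in for PCRE2, which is an assumption. Possessive quantifiers need Python 3.11 or later, so the code falls back to a plain greedy run on older versions. The fallback matches the same text here but does not prevent backtracking.

```python
import re

# The two patterns from the documentation, with "<" written literally.
PER_CHAR = r"([^<]|<(?!inet))+"
POSSESSIVE = r"([^<]++|<(?!inet))+"
# Assumption: a fallback for Python < 3.11, which lacks possessive quantifiers.
GREEDY_RUN = r"([^<]+|<(?!inet))+"


def compile_run_pattern():
    """Compile the possessive form, or the greedy fallback if unsupported."""
    try:
        return re.compile(POSSESSIVE)
    except re.error:
        return re.compile(GREEDY_RUN)


def matched_prefix(pattern, text):
    """Return the text matched at the start of the subject, or ''."""
    m = pattern.match(text)
    return m.group(0) if m else ""


per_char = re.compile(PER_CHAR)
run = compile_run_pattern()

# Hypothetical XML-like subject; matching should stop just before "<inet".
subject = "<a>text</a><b>more</b><inet>10.0.0.1</inet>"

prefix_per_char = matched_prefix(per_char, subject)
prefix_run = matched_prefix(run, subject)

print(prefix_per_char)
print(prefix_run)
```

This only checks that both forms match the same text. The memory and speed differences described in the documentation are properties of PCRE2's matcher and are not measured here.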

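Editorial illustration of the replication described under COMPILED PATTERN MEMORY USAGE (not part of the commit). The helper below, `expand_bounded`, is a hypothetical name. It builds a textual model of how a group with a bounded quantifier is repeated, following the documentation's own expansion of (abc|def){2,4}. It is only a model: PCRE2 replicates the group in its compiled code rather than in pattern text, and a textual expansion would also renumber capture groups.

```python
def expand_bounded(group: str, lo: int, hi: int) -> str:
    """Model the replication of `group{lo,hi}` as pattern text.

    The first `lo` copies are mandatory. The remaining `hi - lo` copies are
    nested optional groups, so each later copy can only be tried after the
    copy before it has matched.
    """
    if lo < 0 or hi < lo:
        raise ValueError("need 0 <= lo <= hi")
    required = group * lo
    optional = ""
    # Build the optional tail from the innermost copy outwards.
    for _ in range(hi - lo):
        optional = f"({group}{optional})?" if optional else f"{group}?"
    return required + optional


expanded = expand_bounded("(abc|def)", 2, 4)
print(expanded)

# The number of copies grows with the upper bound, which is why large or
# nested bounds inflate the compiled pattern.
inner_copies = expand_bounded("(ab)", 1, 1000).count("(ab)")
print(inner_copies)
```

For the nested pattern ((ab){1,1000}c){1,3}, the inner group is copied 1000 times and the outer group repeats that whole block three times, which is consistent with the large compiled size reported in the documentation.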
Modified: code/trunk/doc/index.html.src
===================================================================
--- code/trunk/doc/index.html.src    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/index.html.src    2017-03-31 16:49:33 UTC (rev 722)
@@ -68,9 +68,6 @@
 <tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
     <td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>

-<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
-    <td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
-
 <tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
     <td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>

Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/pcre2.txt    2017-03-31 16:49:33 UTC (rev 722)
@@ -4097,45 +4097,46 @@

        This document describes the differences in the ways that PCRE2 and Perl
        handle regular expressions. The differences  described  here  are  with
-       respect to Perl versions 5.10 and above.
+       respect  to  Perl version 5.24, but as both Perl and PCRE2 are continu-
+       ally changing, the information may sometimes be out of date.

-       1.  PCRE2  has only a subset of Perl's Unicode support. Details of what
+       1. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
        it does have are given in the pcre2unicode page.

-       2. PCRE2 allows repeat quantifiers only  on  parenthesized  assertions,
-       but  they  do not mean what you might think. For example, (?!a){3} does
-       not assert that the next three characters are not "a". It just  asserts
-       that  the  next  character  is not "a" three times (in principle: PCRE2
-       optimizes this to run the assertion  just  once).  Perl  allows  repeat
-       quantifiers  on  other  assertions such as \b, but these do not seem to
-       have any use.
+       2.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
+       tions, but they do not mean what you might think. For example, (?!a){3}
+       does  not  assert  that  the next three characters are not "a". It just
+       asserts that the next character is not "a" three times  (in  principle:
+       PCRE2  optimizes this to run the assertion just once). Perl allows some
+       repeat quantifiers on other  assertions,  for  example,  \b*  (but  not
+       \b{3}), but these do not seem to have any use.

-       3. Capturing subpatterns that occur inside  negative  lookahead  asser-
-       tions  are  counted,  but their entries in the offsets vector are never
-       set. Perl sometimes (but not always) sets its numerical variables  from
-       inside negative assertions.
+       3.  Capturing  subpatterns that occur inside negative lookaround asser-
+       tions are counted, but their entries in the offsets vector are set only
+       if the assertion is a condition. Perl has changed its behaviour in this
+       regard from time to time.

-       4.  The  following Perl escape sequences are not supported: \l, \u, \L,
-       \U, and \N when followed by a character name or Unicode value.  (\N  on
+       4. The following Perl escape sequences are not supported: \l,  \u,  \L,
+       \U,  and  \N when followed by a character name or Unicode value. (\N on
        its own, matching a non-newline character, is supported.) In fact these
-       are implemented by Perl's general string-handling and are not  part  of
-       its  pattern matching engine. If any of these are encountered by PCRE2,
+       are  implemented  by Perl's general string-handling and are not part of
+       its pattern matching engine. If any of these are encountered by  PCRE2,
        an error is generated by default. However, if the PCRE2_ALT_BSUX option
        is set, \U and \u are interpreted as ECMAScript interprets them.

        5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
-       is built with Unicode support. The properties that can be  tested  with
-       \p and \P are limited to the general category properties such as Lu and
-       Nd, script names such as Greek or Han, and the derived  properties  Any
-       and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
-       not; the Perl documentation says "Because Perl hides the need  for  the
-       user  to  understand the internal representation of Unicode characters,
-       there is no need to implement the  somewhat  messy  concept  of  surro-
-       gates."
+       is built with Unicode support (the default). The properties that can be
+       tested with \p and \P are limited to the  general  category  properties
+       such  as  Lu and Nd, script names such as Greek or Han, and the derived
+       properties Any and L&.  PCRE2 does support the Cs (surrogate) property,
+       which  Perl  does  not; the Perl documentation says "Because Perl hides
+       the need for the user to understand the internal representation of Uni-
+       code  characters, there is no need to implement the somewhat messy con-
+       cept of surrogates."

-       6.  PCRE2 does support the \Q...\E escape for quoting substrings. Char-
-       acters in between are treated as literals. This is  slightly  different
-       from  Perl  in  that  $  and  @ are also handled as literals inside the
+       6. PCRE2 does support the \Q...\E escape for quoting substrings.  Char-
+       acters  in  between are treated as literals. This is slightly different
+       from Perl in that $ and @ are  also  handled  as  literals  inside  the
        quotes. In Perl, they cause variable interpolation (but of course PCRE2
        does not have variables).  Note the following examples:

@@ -4146,22 +4147,17 @@
            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

-       The  \Q...\E  sequence  is recognized both inside and outside character
+       The \Q...\E sequence is recognized both inside  and  outside  character
        classes.

-       7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
-       (??{code})  constructions. However, there is support for recursive pat-
-       terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
-       the  PCRE2  "callout"  feature allows an external function to be called
-       during  pattern  matching.  See  the  pcre2callout  documentation   for
-       details.
+       7.   Fairly  obviously,  PCRE2  does  not  support  the  (?{code})  and
+       (??{code}) constructions.  However,  PCRE2  has  a  "callout"
+       feature,  which allows an external function to be called during pattern
+       matching. See the pcre2callout documentation for details.

-       8.  Subroutine  calls  (whether recursive or not) are treated as atomic
-       groups.  Atomic recursion is like Python,  but  unlike  Perl.  Captured
-       values  that  are  set outside a subroutine call can be referenced from
-       inside in PCRE2, but not in Perl. There is a discussion  that  explains
-       these  differences  in  more detail in the section on recursion differ-
-       ences from Perl in the pcre2pattern page.
+       8. Subroutine calls (whether recursive or not) were treated  as  atomic
+       groups  up to PCRE2 release 10.23, but from release 10.30 this changed,
+       and backtracking into subroutine calls is now supported, as in Perl.

        9. If any of the backtracking control verbs are used  in  a  subpattern
        that  is  called  as  a  subroutine (whether or not recursively), their
@@ -4211,14 +4207,14 @@
        16.  In  PCRE2, the upper/lower case character properties Lu and Ll are
        not affected when case-independent matching is specified. For  example,
        \p{Lu} always matches an upper case letter. I think Perl has changed in
-       this respect; in the release at the time of writing (5.16), \p{Lu}  and
+       this respect; in the release at the time of writing (5.24), \p{Lu}  and
        \p{Ll} match all letters, regardless of case, when case independence is
        specified.

        17. PCRE2 provides some  extensions  to  the  Perl  regular  expression
        facilities.   Perl  5.10  includes new features that are not in earlier
-       versions of Perl, some of which (such as named parentheses)  have  been
-       in PCRE2 for some time. This list is with respect to Perl 5.10:
+       versions of Perl, some of which (such as  named  parentheses)  were  in
+       PCRE2 for some time before. This list is with respect to Perl 5.24:

        (a)  Although  lookbehind  assertions  in PCRE2 must match fixed length
        strings, each alternative branch of a lookbehind assertion can match  a
@@ -4271,8 +4267,8 @@

REVISION

-       Last updated: 18 October 2016
-       Copyright (c) 1997-2016 University of Cambridge.
+       Last updated: 29 March 2017
+       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------

@@ -4420,8 +4416,8 @@
        The error code PCRE2_ERROR_MATCHLIMIT is returned by the  JIT  code  if
        searching  a  very large pattern tree goes on for too long, as it is in
        the same circumstance when JIT is not used, but the details of  exactly
-       what  is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
-       code is never returned when JIT matching is used.
+       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
+       is never returned when JIT matching is used.

CONTROLLING THE JIT STACK
@@ -4668,8 +4664,8 @@

REVISION

-       Last updated: 05 June 2016
-       Copyright (c) 1997-2016 University of Cambridge.
+       Last updated: 30 March 2017
+       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------

@@ -4706,12 +4702,6 @@
        (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
        terminated strings and unset offsets.

-       Note  that  when  using  the  traditional matching function, PCRE2 uses
-       recursion to handle subpatterns and indefinite repetition.  This  means
-       that  the  available stack space may limit the size of a subject string
-       that can be processed by certain patterns. For a  discussion  of  stack
-       issues, see the pcre2stack documentation.
-
        All values in repeating quantifiers must be less than 65536.

        The maximum length of a lookbehind assertion is 65535 characters.
@@ -4745,8 +4735,8 @@

REVISION

-       Last updated: 26 October 2016
-       Copyright (c) 1997-2016 University of Cambridge.
+       Last updated: 30 March 2017
+       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------

@@ -8485,11 +8475,12 @@
COMPILED PATTERN MEMORY USAGE

        Patterns are compiled by PCRE2 into a reasonably efficient interpretive
-       code, so that most simple patterns do not  use  much  memory.  However,
-       there  is  one case where the memory usage of a compiled pattern can be
-       unexpectedly large. If a parenthesized subpattern has a quantifier with
-       a minimum greater than 1 and/or a limited maximum, the whole subpattern
-       is repeated in the compiled code. For example, the pattern
+       code, so that most simple patterns do not use much memory  for  storing
+       the compiled version. However, there is one case where the memory usage
+       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
+       subpattern has a quantifier with a minimum greater than 1 and/or a lim-
+       ited maximum, the whole subpattern is repeated in  the  compiled  code.
+       For example, the pattern

          (abc|def){2,4}

@@ -8497,23 +8488,24 @@

          (abc|def)(abc|def)((abc|def)(abc|def)?)?

-       (Technical aside: It is done this way so that backtrack  points  within
+       (Technical  aside:  It is done this way so that backtrack points within
        each of the repetitions can be independently maintained.)

-       For  regular expressions whose quantifiers use only small numbers, this
-       is not usually a problem. However, if the numbers are large,  and  par-
-       ticularly  if  such repetitions are nested, the memory usage can become
+       For regular expressions whose quantifiers use only small numbers,  this
+       is  not  usually a problem. However, if the numbers are large, and par-
+       ticularly if such repetitions are nested, the memory usage  can  become
        an embarrassment. For example, the very simple pattern

          ((ab){1,1000}c){1,3}

-       uses 51K bytes when compiled using the 8-bit  library.  When  PCRE2  is
-       compiled  with its default internal pointer size of two bytes, the size
-       limit on a compiled pattern is 64K code units in the 8-bit  and  16-bit
-       libraries, and this is reached with the above pattern if the outer rep-
-       etition is increased from 3 to 4. PCRE2 can be compiled to  use  larger
-       internal  pointers  and thus handle larger compiled patterns, but it is
-       better to try to rewrite your pattern to use less memory if you can.
+       uses  over  50K bytes when compiled using the 8-bit library. When PCRE2
+       is compiled with its default internal pointer size of  two  bytes,  the
+       size  limit  on  a  compiled pattern is 64K code units in the 8-bit and
+       16-bit libraries, and this is reached with the  above  pattern  if  the
+       outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
+       larger internal pointers and thus handle larger compiled patterns,  but
+       it  is  better to try to rewrite your pattern to use less memory if you
+       can.

        One way of reducing the memory usage for such patterns is to  make  use
        of PCRE2's "subroutine" facility. Re-writing the above pattern as
@@ -8520,91 +8512,100 @@

          ((ab)(?2){0,999}c)(?1){0,2}

-       reduces the memory requirements to 18K, and indeed it remains under 20K
-       even with the outer repetition increased to 100. However, this  pattern
-       is  not  exactly equivalent, because the "subroutine" calls are treated
-       as atomic groups into which there can be no backtracking if there is  a
-       subsequent  matching  failure.  Therefore, PCRE2 cannot do this kind of
-       rewriting automatically.  Furthermore, there is a  noticeable  loss  of
-       speed  when executing the modified pattern. Nevertheless, if the atomic
-       grouping is not a problem and the loss of  speed  is  acceptable,  this
-       kind  of rewriting will allow you to process patterns that PCRE2 cannot
-       otherwise handle.
+       reduces  the  memory  requirements to around 16K, and indeed it remains
+       under 20K even with the outer repetition  increased  to  100.  However,
+       this kind of pattern is not always exactly equivalent, because any cap-
+       tures within subroutine calls are lost when the  subroutine  completes.
+       If  this  is  not  a  problem, this kind of rewriting will allow you to
+       process  patterns  that  PCRE2  cannot  otherwise  handle. The matching
+       performance of the two different versions of the pattern is roughly the
+       same. (This applies from release 10.30 - things were different in  ear-
+       lier releases.)

-STACK USAGE AT RUN TIME
+STACK AND HEAP USAGE AT RUN TIME

-       When pcre2_match() is used for matching, certain kinds of  pattern  can
-       cause  it  to  use large amounts of the process stack. In some environ-
-       ments the default process stack is quite small, and if it runs out  the
-       result  is  often  SIGSEGV.  Rewriting your pattern can often help. The
-       pcre2stack documentation discusses this issue in detail.
+       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
+       uses very little system stack at run time. In earlier  releases  recur-
+       sive  function  calls  could  use a great deal of stack, and this could
+       cause problems, but this usage has been eliminated. Backtracking  posi-
+       tions  are now explicitly remembered in memory frames controlled by the
+       code. An initial 10K vector of frames is allocated on the system  stack
+       (enough  for about 50 frames for small patterns), but if this is insuf-
+       ficient, heap memory is used. Rewriting patterns to be  time-efficient,
+       as described below, may also reduce the memory requirements.

+       In  contrast  to  pcre2_match(),  pcre2_dfa_match()  does use recursive
+       function calls, but  only  for  processing  atomic  groups,  lookaround
+       assertions, and recursion within the pattern. Too much nested recursion
+       may cause stack issues. The "match depth"  parameter  can  be  used  to
+       limit the depth of function recursion in pcre2_dfa_match().

+
PROCESSING TIME

-       Certain items in regular expression patterns are processed  more  effi-
+       Certain  items  in regular expression patterns are processed more effi-
        ciently than others. It is more efficient to use a character class like
-       [aeiou]  than  a  set  of   single-character   alternatives   such   as
-       (a|e|i|o|u).  In  general,  the simplest construction that provides the
+       [aeiou]   than   a   set   of  single-character  alternatives  such  as
+       (a|e|i|o|u). In general, the simplest construction  that  provides  the
        required behaviour is usually the most efficient. Jeffrey Friedl's book
-       contains  a  lot  of useful general discussion about optimizing regular
-       expressions for efficient performance. This  document  contains  a  few
+       contains a lot of useful general discussion  about  optimizing  regular
+       expressions  for  efficient  performance.  This document contains a few
        observations about PCRE2.

-       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
-       slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
-       needs  a  character's  property. If you can find an alternative pattern
+       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
+       slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
+       needs a character's property. If you can find  an  alternative  pattern
        that does not use character properties, it will probably be faster.

-       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
-       character  classes  such  as  [:alpha:]  do not use Unicode properties,
+       By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX
+       character classes such as [:alpha:]  do  not  use  Unicode  properties,
        partly for backwards compatibility, and partly for performance reasons.
-       However,  you  can  set  the PCRE2_UCP option or start the pattern with
-       (*UCP) if you want Unicode character properties to be  used.  This  can
-       double  the  matching  time  for  items  such  as \d, when matched with
-       pcre2_match(); the performance loss is less with a DFA  matching  func-
+       However, you can set the PCRE2_UCP option or  start  the  pattern  with
+       (*UCP)  if  you  want Unicode character properties to be used. This can
+       double the matching time for  items  such  as  \d,  when  matched  with
+       pcre2_match();  the  performance loss is less with a DFA matching func-
        tion, and in both cases there is not much difference for \b.

-       When  a pattern begins with .* not in atomic parentheses, nor in paren-
-       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
-       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
-       can match only at the start of a subject string.  If  the  pattern  has
+       When a pattern begins with .* not in atomic parentheses, nor in  paren-
+       theses  that  are  the subject of a backreference, and the PCRE2_DOTALL
+       option is set, the pattern is implicitly anchored by  PCRE2,  since  it
+       can  match  only  at  the start of a subject string. If the pattern has
        multiple top-level branches, they must all be anchorable. The optimiza-
-       tion can be disabled by  the  PCRE2_NO_DOTSTAR_ANCHOR  option,  and  is
+       tion  can  be  disabled  by  the PCRE2_NO_DOTSTAR_ANCHOR option, and is
        automatically disabled if the pattern contains (*PRUNE) or (*SKIP).

-       If  PCRE2_DOTALL  is  not  set,  PCRE2  cannot  make this optimization,
+       If PCRE2_DOTALL is  not  set,  PCRE2  cannot  make  this  optimization,
        because the dot metacharacter does not then match a newline, and if the
-       subject  string contains newlines, the pattern may match from the char-
+       subject string contains newlines, the pattern may match from the  char-
        acter immediately following one of them instead of from the very start.
        For example, the pattern

          .*second

-       matches  the subject "first\nand second" (where \n stands for a newline
-       character), with the match starting at the seventh character. In  order
-       to  do  this, PCRE2 has to retry the match starting after every newline
+       matches the subject "first\nand second" (where \n stands for a  newline
+       character),  with the match starting at the seventh character. In order
+       to do this, PCRE2 has to retry the match starting after  every  newline
        in the subject.

-       If you are using such a pattern with subject strings that do  not  con-
-       tain   newlines,   the   best   performance   is  obtained  by  setting
-       PCRE2_DOTALL, or starting the pattern with  ^.*  or  ^.*?  to  indicate
+       If  you  are using such a pattern with subject strings that do not con-
+       tain  newlines,  the  best   performance   is   obtained   by   setting
+       PCRE2_DOTALL,  or  starting  the  pattern  with ^.* or ^.*? to indicate
        explicit anchoring. That saves PCRE2 from having to scan along the sub-
        ject looking for a newline to restart at.

-       Beware of patterns that contain nested indefinite  repeats.  These  can
-       take  a  long time to run when applied to a string that does not match.
+       Beware  of  patterns  that contain nested indefinite repeats. These can
+       take a long time to run when applied to a string that does  not  match.
        Consider the pattern fragment

          ^(a+)*

-       This can match "aaaa" in 16 different ways, and this  number  increases
-       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
-       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
-       repeats  can  match  different numbers of times.) When the remainder of
-       the pattern is such that the entire match is going to fail,  PCRE2  has
-       in  principle  to  try  every  possible variation, and this can take an
+       This  can  match "aaaa" in 16 different ways, and this number increases
+       very rapidly as the string gets longer. (The * repeat can match  0,  1,
+       2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
+       repeats can match different numbers of times.) When  the  remainder  of
+       the  pattern  is such that the entire match is going to fail, PCRE2 has
+       in principle to try every possible variation,  and  this  can  take  an
        extremely long time, even for relatively short strings.

        An optimization catches some of the more simple cases such as
@@ -8611,22 +8612,64 @@

          (a+)*b

-       where a literal character follows. Before  embarking  on  the  standard
-       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
-       ject string, and if there is not, it fails the match immediately.  How-
-       ever,  when  there  is no following literal this optimization cannot be
+       where  a  literal  character  follows. Before embarking on the standard
+       matching procedure, PCRE2 checks that there is a "b" later in the  sub-
+       ject  string, and if there is not, it fails the match immediately. How-
+       ever, when there is no following literal this  optimization  cannot  be
        used. You can see the difference by comparing the behaviour of

          (a+)*\d

-       with the pattern above. The former gives  a  failure  almost  instantly
-       when  applied  to  a  whole  line of "a" characters, whereas the latter
+       with  the  pattern  above.  The former gives a failure almost instantly
+       when applied to a whole line of  "a"  characters,  whereas  the  latter
        takes an appreciable time with strings longer than about 20 characters.

        In many cases, the solution to this kind of performance issue is to use
-       an atomic group or a possessive quantifier.
+       an atomic group or a possessive quantifier. This can often reduce  mem-
+       ory requirements as well. As another example, consider this pattern:

+         ([^<]|<(?!inet))+

+       It  matches  from wherever it starts until it encounters "<inet" or the
+       end of the data, and is the kind of pattern that  might  be  used  when
+       processing an XML file. Each iteration of the outer parentheses matches
+       either one character that is not "<" or a "<" that is not  followed  by
+       "inet".  However,  each time a parenthesis is processed, a backtracking
+       position is passed, so this formulation uses a memory  frame  for  each
+       matched character. For a long string, a lot of memory is required. Con-
+       sider now this  rewritten  pattern,  which  matches  exactly  the  same
+       strings:
+
+         ([^<]++|<(?!inet))+
+
+       This runs much faster, because sequences of characters that do not con-
+       tain "<" are "swallowed" in one item inside the parentheses, and a pos-
+       sessive  quantifier  is  used to stop any backtracking into the runs of
+       non-"<" characters. This version also uses a lot  less  memory  because
+       entry  to  a  new  set of parentheses happens only when a "<" character
+       that is not followed by "inet" is encountered (and we  assume  this  is
+       relatively rare).
+
+       This example shows that one way of optimizing performance when matching
+       long subject strings is to write repeated parenthesized subpatterns  to
+       match more than one character whenever possible.
+
+   SETTING RESOURCE LIMITS
+
+       You  can  set  limits on the amount of processing that takes place when
+       matching, and on the amount of heap memory that is  used.  The  default
+       values of the limits are very large, and unlikely ever to operate. They
+       can be changed when PCRE2 is built, and  they  can  also  be  set  when
+       pcre2_match()  or  pcre2_dfa_match()  is  called.  For details of these
+       interfaces, see the pcre2build documentation and the  section  entitled
+       "The match context" in the pcre2api documentation.
+
+       The  pcre2test  test program has a modifier called "find_limits" which,
+       if applied to a subject line, causes it to  find  the  smallest  limits
+       that allow a pattern to match. This is done by repeatedly matching with
+       different limits.
+
+
 AUTHOR

        Philip Hazel
@@ -8636,8 +8679,8 @@

REVISION

-       Last updated: 02 January 2015
-       Copyright (c) 1997-2015 University of Cambridge.
+       Last updated: 31 March 2017
+       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------
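
A note on the COMPILED PATTERN MEMORY USAGE text in the pcre2.txt diff above: the replication of a bounded repeat can be modelled textually. The sketch below is an illustration only. The helper names expand_bounded and group_copies are hypothetical and not part of PCRE2, and the real compiler emits internal code units rather than a rewritten pattern string. It reproduces the expansion quoted in the documentation and counts how many copies of the innermost group nested repeats imply.

```python
def expand_bounded(group: str, minimum: int, maximum: int) -> str:
    """Spell out group{minimum,maximum} as explicit copies of the group.

    Mandatory copies are written one after another. Optional copies are
    nested, so each extra copy is reachable only after the previous one.
    This is a textual model, not PCRE2's actual compiled form.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    optional = maximum - minimum
    tail = ""
    if optional > 0:
        # The innermost optional copy is a bare "group?"; every further
        # level wraps one more copy around what has been built so far.
        tail = group + "?"
        for _ in range(optional - 1):
            tail = "(" + group + tail + ")?"
    return group * minimum + tail


def group_copies(maxima):
    """Copies of the innermost group when bounded repeats are nested.

    maxima lists the upper bounds from the innermost repeat outwards.
    """
    total = 1
    for upper in maxima:
        total *= upper
    return total


print(expand_bounded("(abc|def)", 2, 4))
print(group_copies([1000, 3]))  # model of ((ab){1,1000}c){1,3}
```

In this model ((ab){1,1000}c){1,3} implies 3000 copies of (ab), and raising the outer bound to 4 implies 4000, which is consistent with the documentation's remark that the size grows quickly when such repetitions are nested.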

Modified: code/trunk/doc/pcre2perform.3
===================================================================
--- code/trunk/doc/pcre2perform.3    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/pcre2perform.3    2017-03-31 16:49:33 UTC (rev 722)
@@ -1,4 +1,4 @@
-.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PERFORM 3 "31 March 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 PERFORMANCE"
@@ -12,11 +12,11 @@
 .rs
 .sp
 Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
-so that most simple patterns do not use much memory. However, there is one case
-where the memory usage of a compiled pattern can be unexpectedly large. If a
-parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
-a limited maximum, the whole subpattern is repeated in the compiled code. For
-example, the pattern
+so that most simple patterns do not use much memory for storing the compiled 
+version. However, there is one case where the memory usage of a compiled
+pattern can be unexpectedly large. If a parenthesized subpattern has a
+quantifier with a minimum greater than 1 and/or a limited maximum, the whole
+subpattern is repeated in the compiled code. For example, the pattern
 .sp
   (abc|def){2,4}
 .sp
@@ -34,13 +34,13 @@
 .sp
   ((ab){1,1000}c){1,3}
 .sp
-uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
-with its default internal pointer size of two bytes, the size limit on a
-compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
-is reached with the above pattern if the outer repetition is increased from 3
-to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
-larger compiled patterns, but it is better to try to rewrite your pattern to
-use less memory if you can.
+uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
+compiled with its default internal pointer size of two bytes, the size limit on
+a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
+this is reached with the above pattern if the outer repetition is increased
+from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
+handle larger compiled patterns, but it is better to try to rewrite your
+pattern to use less memory if you can.
 .P
 One way of reducing the memory usage for such patterns is to make use of
 PCRE2's
@@ -52,32 +52,34 @@
 .sp
   ((ab)(?2){0,999}c)(?1){0,2}
 .sp
-reduces the memory requirements to 18K, and indeed it remains under 20K even
-with the outer repetition increased to 100. However, this pattern is not
-exactly equivalent, because the "subroutine" calls are treated as
-.\" HTML <a href="pcre2pattern.html#atomicgroup">
-.\" </a>
-atomic groups
-.\"
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30 - things were different
+in earlier releases.)
 .
 .
-.SH "STACK USAGE AT RUN TIME"
+.SH "STACK AND HEAP USAGE AT RUN TIME"
 .rs
 .sp
-When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation discusses this issue in detail.
+From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 10K vector of
+frames is allocated on the system stack (enough for about 50 frames for small
+patterns), but if this is insufficient, heap memory is used. Rewriting patterns 
+to be time-efficient, as described below, may also reduce the memory 
+requirements.
+.P
+In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive 
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in \fBpcre2_dfa_match()\fP.
 .
 .
 .SH "PROCESSING TIME"
@@ -160,9 +162,61 @@
 appreciable time with strings longer than about 20 characters.
 .P
 In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory 
+requirements as well. As another example, consider this pattern:
+.sp
+  ([^<]|<(?!inet))+
+.sp
+It matches from wherever it starts until it encounters "<inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "<" or a "<" that is not followed by "inet". However, each time a
+parenthesis is processed, a backtracking position is passed, so this
+formulation uses a memory frame for each matched character. For a long string,
+a lot of memory is required. Consider now this rewritten pattern, which matches
+exactly the same strings:
+.sp
+  ([^<]++|<(?!inet))+
+.sp
+This runs much faster, because sequences of characters that do not contain "<"
+are "swallowed" in one item inside the parentheses, and a possessive quantifier
+is used to stop any backtracking into the runs of non-"<" characters. This
+version also uses a lot less memory because entry to a new set of parentheses
+happens only when a "<" character that is not followed by "inet" is encountered
+(and we assume this is relatively rare). 
+.P
+This example shows that one way of optimizing performance when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
 .
 .
+.SS "SETTING RESOURCE LIMITS"
+.rs
+.sp
+You can set limits on the amount of processing that takes place when matching, 
+and on the amount of heap memory that is used. The default values of the limits
+are very large, and unlikely ever to operate. They can be changed when PCRE2 is
+built, and they can also be set when \fBpcre2_match()\fP or 
+\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
+.\" HREF
+\fBpcre2build\fP
+.\"
+documentation and the section entitled
+.\" HTML <a href="pcre2api.html#matchcontext">
+.\" </a>
+"The match context"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.P
+The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
+.
+.
 .SH AUTHOR
 .rs
 .sp
@@ -177,6 +231,6 @@
 .rs
 .sp
 .nf
-Last updated: 02 January 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 31 March 2017
+Copyright (c) 1997-2017 University of Cambridge.
 .fi
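
A note on the "<inet" example added to pcre2perform.3 above: the difference between the two formulations can be made concrete by counting how often the repeated group is entered. The sketch below is a rough model only, written with Python's re module rather than PCRE2. The function name outer_iterations and the sample subject are assumptions made for illustration, and the count is a stand-in for PCRE2's memory frames, not a measurement of them. A plain + is used in place of ++ because possessive quantifiers are only available in Python 3.11 and later; on the successful greedy path both consume the same text.

```python
import re


def outer_iterations(inner: str, subject: str, start: int = 0):
    """Count entries to the repeated group (inner)+ on the greedy path.

    Returns (iterations, end_position). Only the forward, non-backtracking
    path is followed, which is enough to compare the two formulations.
    """
    item = re.compile(inner)
    pos = start
    count = 0
    while True:
        m = item.match(subject, pos)
        if m is None or m.end() == pos:
            break  # no alternative matches here, so the repeat stops
        pos = m.end()
        count += 1
    return count, pos


# Hypothetical subject: two long runs of text around one harmless "<".
subject = "x" * 1000 + "<a>" + "y" * 1000 + "<inet>"

per_char = outer_iterations(r"[^<]|<(?!inet)", subject)
per_run = outer_iterations(r"[^<]+|<(?!inet)", subject)
print(per_char, per_run)
```

Both versions stop at the same offset, just before "<inet", but the one-character formulation enters the group once per matched character while the run-based one enters it once per run and once per stray "<".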

Deleted: code/trunk/doc/pcre2stack.3
===================================================================
--- code/trunk/doc/pcre2stack.3    2017-03-31 15:51:41 UTC (rev 721)
+++ code/trunk/doc/pcre2stack.3    2017-03-31 16:49:33 UTC (rev 722)
@@ -1,212 +0,0 @@
-.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
-.SH NAME
-PCRE2 - Perl-compatible regular expressions (revised API)
-.SH "PCRE2 DISCUSSION OF STACK USAGE"
-.rs
-.sp
-When you call \fBpcre2_match()\fP, it makes use of an internal function called
-\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
-in order to remember the state of the match so that it can back up and try a
-different alternative after a failure. As matching proceeds deeper and deeper
-into the tree of possibilities, the recursion depth increases. The
-\fBmatch()\fP function is also called in other circumstances, for example,
-whenever a parenthesized sub-pattern is entered, and in certain cases of
-repetition.
-.P
-Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
-as a* it may be called several times at the same level, after matching
-different numbers of a's. Furthermore, in a number of cases where the result of
-the recursive call would immediately be passed back as the result of the
-current call (a "tail recursion"), the function is just restarted instead.
-.P
-Each time the internal \fBmatch()\fP function is called recursively, it uses
-memory from the process stack. For certain kinds of pattern and data, very
-large amounts of stack may be needed, despite the recognition of "tail
-recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
-of the GCC compiler, the stack requirements are greatly increased.
-.P
-The above comments apply when \fBpcre2_match()\fP is run in its normal
-interpretive manner. If the compiled pattern was processed by
-\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
-options passed to \fBpcre2_match()\fP were not incompatible, the matching
-process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
-this case, the memory requirements are handled entirely differently. See the
-.\" HREF
-\fBpcre2jit\fP
-.\"
-documentation for details.
-.P
-The \fBpcre2_dfa_match()\fP function operates in a different way to
-\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
-recursion or subroutine call in the pattern. This includes the processing of
-assertion and "once-only" subpatterns, which are handled like subroutine calls.
-Normally, these are never very deep, and the limit on the complexity of
-\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
-However, it is possible to write patterns with runaway infinite recursions;
-such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
-limit is applied (see below).
-.P
-The comments in the next three sections do not apply to
-\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
-the JIT optimization.
-.
-.
-.SS "Reducing \fBpcre2_match()\fP's stack usage"
-.rs
-.sp
-You can often reduce the amount of recursion, and therefore the
-amount of stack used, by modifying the pattern that is being matched. Consider,
-for example, this pattern:
-.sp
-  ([^<]|<(?!inet))+
-.sp
-It matches from wherever it starts until it encounters "<inet" or the end of
-the data, and is the kind of pattern that might be used when processing an XML
-file. Each iteration of the outer parentheses matches either one character that
-is not "<" or a "<" that is not followed by "inet". However, each time a
-parenthesis is processed, a recursion occurs, so this formulation uses a stack
-frame for each matched character. For a long string, a lot of stack is
-required. Consider now this rewritten pattern, which matches exactly the same
-strings:
-.sp
-  ([^<]++|<(?!inet))+
-.sp
-This uses very much less stack, because runs of characters that do not contain
-"<" are "swallowed" in one item inside the parentheses. Recursion happens only
-when a "<" character that is not followed by "inet" is encountered (and we
-assume this is relatively rare). A possessive quantifier is used to stop any
-backtracking into the runs of non-"<" characters, but that is not related to
-stack usage.
-.P
-This example shows that one way of avoiding stack problems when matching long
-subject strings is to write repeated parenthesized subpatterns to match more
-than one character whenever possible.
-.
-.
-.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
-.rs
-.sp
-In environments where stack memory is constrained, you might want to compile
-PCRE2 to use heap memory instead of stack for remembering back-up points when
-\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
-of how to do this are given in the
-.\" HREF
-\fBpcre2build\fP
-.\"
-documentation. When built in this way, instead of using the stack, PCRE2
-gets memory for remembering backup points from the heap. By default, the memory
-is obtained by calling the system \fBmalloc()\fP function, but you can arrange
-to supply your own memory management function. For details, see the section
-entitled
-.\" HTML <a href="pcre2api.html#matchcontext">
-.\" </a>
-"The match context"
-.\"
-in the
-.\" HREF
-\fBpcre2api\fP
-.\"
-documentation. Since the block sizes are always the same, it may be possible to
-implement a customized memory handler that is more efficient than the standard
-function. The memory blocks obtained for this purpose are retained and re-used
-if possible while \fBpcre2_match()\fP is running. They are all freed just
-before it exits.
-.
-.
-.SS "Limiting \fBpcre2_match()\fP's stack usage"
-.rs
-.sp
-You can set limits on the number of times the internal \fBmatch()\fP function
-is called, both in total and recursively. If a limit is exceeded,
-\fBpcre2_match()\fP returns an error code. Setting suitable limits should
-prevent it from running out of stack. The default values of the limits are very
-large, and unlikely ever to operate. They can be changed when PCRE2 is built,
-and they can also be set when \fBpcre2_match()\fP is called. For details of
-these interfaces, see the
-.\" HREF
-\fBpcre2build\fP
-.\"
-documentation and the section entitled
-.\" HTML <a href="pcre2api.html#matchcontext">
-.\" </a>
-"The match context"
-.\"
-in the
-.\" HREF
-\fBpcre2api\fP
-.\"
-documentation.
-.P
-As a very rough rule of thumb, you should reckon on about 500 bytes per
-recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
-the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
-around 128000 recursions.
-.P
-The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
-applied to a subject line, causes it to find the smallest limits that allow a
-pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
-different limits.
-.
-.
-.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
-.rs
-.sp
-The recursion limit, as described above for \fBpcre2_match()\fP, also applies
-to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
-recursions in the pattern can lead to runaway stack usage. The non-recursive
-match limit is not relevant for DFA matching, and is ignored.
-.
-.
-.SS "Changing stack size in Unix-like systems"
-.rs
-.sp
-In Unix-like environments, there is not often a problem with the stack unless
-very long strings are involved, though the default limit on stack size varies
-from system to system. Values from 8Mb to 64Mb are common. You can find your
-default limit by running the command:
-.sp
-  ulimit -s
-.sp
-Unfortunately, the effect of running out of stack is often SIGSEGV, though
-sometimes a more explicit error message is given. You can normally increase the
-limit on stack size by code such as this:
-.sp
-  struct rlimit rlim;
-  getrlimit(RLIMIT_STACK, &rlim);
-  rlim.rlim_cur = 100*1024*1024;
-  setrlimit(RLIMIT_STACK, &rlim);
-.sp
-This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
-attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
-do this before calling \fBpcre2_match()\fP.
-.
-.
-.SS "Changing stack size in Mac OS X"
-.rs
-.sp
-Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
-is also possible to set a stack size when linking a program. There is a
-discussion about stack sizes in Mac OS X at this web site:
-.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
-.\" </a>
-http://developer.apple.com/qa/qa2005/qa1419.html.
-.\"
-.
-.
-.SH AUTHOR
-.rs
-.sp
-.nf
-Philip Hazel
-University Computing Service
-Cambridge, England.
-.fi
-.
-.
-.SH REVISION
-.rs
-.sp
-.nf
-Last updated: 23 December 2016
-Copyright (c) 1997-2016 University of Cambridge.
-.fi