[Pcre-svn] [684] code/trunk: Documentation update.

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [684] code/trunk: Documentation update.
Revision: 684
          http://www.exim.org/viewvc/pcre2?view=rev&revision=684
Author:   ph10
Date:     2017-03-17 16:55:58 +0000 (Fri, 17 Mar 2017)
Log Message:
-----------
Documentation update.


Modified Paths:
--------------
    code/trunk/HACKING
    code/trunk/NON-AUTOTOOLS-BUILD
    code/trunk/README


Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING    2017-03-17 16:55:47 UTC (rev 683)
+++ code/trunk/HACKING    2017-03-17 16:55:58 UTC (rev 684)
@@ -88,10 +88,10 @@
 a "fake" mode that enables it to compute how much memory it would need, while
 in most cases only ever using a small amount of working memory, and without too
 many tests of the mode that might slow it down. So I refactored the compiling
-functions to work this way. This got rid of about 600 lines of source. It
-should make future maintenance and development easier. As this was such a major
-change, I never released 6.8, instead upping the number to 7.0 (other quite
-major changes were also present in the 7.0 release).
+functions to work this way. This got rid of about 600 lines of source and made
+further maintenance and development easier. As this was such a major change, I
+never released 6.8, instead upping the number to 7.0 (other quite major changes
+were also present in the 7.0 release).


A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there was a downside: compiling ran
@@ -122,7 +122,7 @@
that the actual compile (both the memory-computing dummy run and the real
compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
-short. In particular, parsing and skipping over [] classes is complicated.
+short. In particular, parsing and skipping over [] classes was complicated.

While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@@ -162,7 +162,7 @@
Most errors can be diagnosed during the parsing scan. For those that cannot
(for example, "lookbehind assertion is not fixed length"), the parsed code
contains offsets into the pattern so that the actual compiling code can
-identify where errors occur.
+report where errors are.


The elements of the parsed pattern vector
@@ -217,10 +217,10 @@
elements:

 META_ALT              | alternation
-META_BACKREF
-META_CAPTURE
-META_ESCAPE
-META_RECURSE
+META_BACKREF          back reference
+META_CAPTURE          start of capturing group
+META_ESCAPE           non-literal escape sequence
+META_RECURSE          recursion call


If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the length of its branch, for which OP_REVERSE must be generated.
@@ -232,8 +232,8 @@
or more. The offsets of the first occurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
-pattern elements for items such as \3. The offset is used when an error is
-given for a reference to a non-existent group.
+pattern elements for items such as \3. The offset is used when an error occurs
+because the reference is to a non-existent group.

META_RECURSE is always followed by an offset, for use in error messages.

@@ -286,7 +286,7 @@
 META_LOOKBEHIND       (?<=
 META_LOOKBEHINDNOT    (?<!


-The following are followed by two values, the minimum and maximum. Repeat
+The following are followed by two elements, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:

@@ -369,7 +369,7 @@

In this description, we assume the "normal" compilation options. Data values
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
-(most significant byte first), or one code unit in 16-bit and 32-bit modes.
+(most significant byte first), and one code unit in 16-bit and 32-bit modes.


 Opcodes with no following data
@@ -409,7 +409,7 @@
   OP_ACCEPT              ) These are Perl 5.10's "backtracking control
   OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
   OP_FAIL                ) parentheses, it may be preceded by one or more
-  OP_PRUNE               ) OP_CLOSE, each followed by a count that
+  OP_PRUNE               ) OP_CLOSE, each followed by a number that
   OP_SKIP                ) indicates which parentheses must be closed.
   OP_THEN                )


@@ -679,7 +679,7 @@

These are just like other subpatterns, but they start with the opcode OP_ONCE.
The check for matching an empty string in an unbounded repeat is handled
-entirely at runtime, so there are just this one opcode for atomic groups.
+entirely at runtime, so there is just this one opcode for atomic groups.


Assertions
@@ -742,15 +742,22 @@
Recursion either matches the current pattern, or some subexpression. The opcode
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. OP_RECURSE is also used for
-"subroutine" calls, even though they are not strictly a recursion. Repeated
-recursions are automatically wrapped inside OP_ONCE brackets, because otherwise
-some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE
-brackets, but it is nevertheless still treated as an atomic group.
+"subroutine" calls, even though they are not strictly a recursion. Up till
+release 10.30 recursions were treated as atomic groups, making them
+incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
+backtracking into recursions is supported.

+Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
+forced no backtracking, but also allowed repetition to be handled as for other
+bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
+their minimum repetitions, and then wrapped in non-capturing brackets for the
+remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
+treated as (?1)(?1)(?:(?1)){0,2}.

-Callout
--------

+Callouts
+--------
+
A callout can nowadays have either a numerical argument or a string argument.
These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
followed by two LINK_SIZE values giving the offset in the pattern string to the
@@ -787,4 +794,4 @@
correct length, in order to catch updating errors.

Philip Hazel
-March 2017
+17 March 2017
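
The HACKING change above says that, from 10.30 onwards, a repeated recursion is duplicated for its minimum repetitions and the remainder is wrapped in non-capturing brackets. The following is only a text-level sketch of that rewriting, not the code PCRE2 uses (which works on the parsed pattern, not on strings). The function name and the restriction to 1 <= min <= max are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: expands a repeated recursion such as (?1){2,4} at the
   pattern-text level, following the two examples in the HACKING text. The
   function name and the restriction to 1 <= min <= max are assumptions.
   Returns 0 on success, -1 if the output buffer is too small. */
static int expand_repeated_recursion(char *out, size_t size,
  unsigned int group, unsigned int min, unsigned int max)
{
  char item[32];
  size_t used = 0;
  int n;

  assert(min >= 1 && max >= min);
  if (size == 0) return -1;
  out[0] = '\0';

  /* Text of a single recursion call, for example "(?1)". */
  n = snprintf(item, sizeof(item), "(?%u)", group);
  if (n < 0 || (size_t)n >= sizeof(item)) return -1;

  /* Duplicate the recursion for the minimum number of repetitions. */
  for (unsigned int i = 0; i < min; i++)
    {
    if (used + (size_t)n >= size) return -1;
    memcpy(out + used, item, (size_t)n + 1);
    used += (size_t)n;
    }

  /* Wrap the remainder in non-capturing brackets with a {0,max-min} repeat. */
  if (max > min)
    {
    n = snprintf(out + used, size - used, "(?:%s){0,%u}", item, max - min);
    if (n < 0 || (size_t)n >= size - used) return -1;
    }
  return 0;
}
```

Calling it with group 1 and the bounds 3,3 should reproduce the first example from the text, and 2,4 the second.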

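The HACKING text also states that counts such as quantifier values occupy two bytes in 8-bit mode, most significant byte first. As a rough sketch of that layout, the helpers below store and fetch such a value. Their names are hypothetical and are not taken from the PCRE2 source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not PCRE2's own code: store a count as two bytes,
   most significant byte first, as described for the 8-bit library. */
static void put_count(uint8_t *code, unsigned int value)
{
  assert(value <= 65535u);            /* a count must fit in two bytes */
  code[0] = (uint8_t)(value >> 8);    /* high-order byte comes first */
  code[1] = (uint8_t)(value & 0xffu); /* low-order byte follows */
}

/* Hypothetical helper: read back a count stored by put_count(). */
static unsigned int get_count(const uint8_t *code)
{
  return ((unsigned int)code[0] << 8) | code[1];
}
```

In the 16-bit and 32-bit libraries the same value fits in a single code unit, so no byte splitting of this kind is needed there.
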
Modified: code/trunk/NON-AUTOTOOLS-BUILD
===================================================================
--- code/trunk/NON-AUTOTOOLS-BUILD    2017-03-17 16:55:47 UTC (rev 683)
+++ code/trunk/NON-AUTOTOOLS-BUILD    2017-03-17 16:55:58 UTC (rev 684)
@@ -1,10 +1,6 @@
 Building PCRE2 without using autotools
 --------------------------------------


-This document has been converted from the PCRE1 document. I have removed a
-number of sections about building in various environments, as they applied only
-to PCRE1 and are probably out of date.
-
This document contains the following sections:

General
@@ -183,23 +179,11 @@

STACK SIZE IN WINDOWS ENVIRONMENTS

-The default processor stack size of 1Mb in some Windows environments is too
-small for matching patterns that need much recursion. In particular, test 2 may
-fail because of this. Normally, running out of stack causes a crash, but there
-have been cases where the test program has just died silently. See your linker
-documentation for how to increase stack size if you experience problems. If you
-are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
-compiler, you can increase the stack size for pcre2test and pcre2grep by
-setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
-example). The Linux default of 8Mb is a reasonable choice for the stack, though
-even that can be too small for some pattern/subject combinations.
+Prior to release 10.30 the default system stack size of 1Mb in some Windows
+environments caused issues with some tests. This should no longer be the case
+for 10.30 and later releases.

-PCRE2 has a compile configuration option to disable the use of stack for
-recursion so that heap is used instead. However, pattern matching is
-significantly slower when this is done. There is more about stack usage in the
-"pcre2stack" documentation.

-
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS

If you want to statically link a program against a PCRE2 library in the form of
@@ -393,4 +377,4 @@
recommended download site.

=============================
-Last Updated: 13 October 2016
+Last Updated: 17 March 2017

Modified: code/trunk/README
===================================================================
--- code/trunk/README    2017-03-17 16:55:47 UTC (rev 683)
+++ code/trunk/README    2017-03-17 16:55:58 UTC (rev 684)
@@ -15,8 +15,8 @@


    https://lists.exim.org/mailman/listinfo/pcre-dev


-Please read the NEWS file if you are upgrading from a previous release.
-The contents of this README file are:
+Please read the NEWS file if you are upgrading from a previous release. The
+contents of this README file are:

The PCRE2 APIs
Documentation for PCRE2
@@ -44,8 +44,8 @@

The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
-man page). These can be found in a library called libpcre2-posix. Note that this
-just provides a POSIX calling interface to PCRE2; the regular expressions
+man page). These can be found in a library called libpcre2-posix. Note that
+this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.

@@ -95,10 +95,9 @@
Building PCRE2 on non-Unix-like systems
---------------------------------------

-For a non-Unix-like system, please read the comments in the file
-NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
-"make" you may be able to build PCRE2 using autotools in the same way as for
-many Unix-like systems.
+For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
+your system supports the use of "configure" and "make" you may be able to build
+PCRE2 using autotools in the same way as for many Unix-like systems.

PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@@ -174,19 +173,19 @@
architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error.

-. If you do not want to make use of the support for UTF-8 Unicode character
- strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
- library, or UTF-32 Unicode character strings in the 32-bit library, you can
- add --disable-unicode to the "configure" command. This reduces the size of
- the libraries. It is not possible to configure one library with Unicode
- support, and another without, in the same configuration.
+. If you do not want to make use of the default support for UTF-8 Unicode
+ character strings in the 8-bit library, UTF-16 Unicode character strings in
+ the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
+ library, you can add --disable-unicode to the "configure" command. This
+ reduces the size of the libraries. It is not possible to configure one
+ library with Unicode support, and another without, in the same configuration.
+ It is also not possible to use --enable-ebcdic (see below) with Unicode
+ support, so if this option is set, you must also use --disable-unicode.

When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
- either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
- not possible to use both --enable-unicode and --enable-ebcdic at the same
- time.
+ either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.

As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
@@ -232,18 +231,18 @@
--with-match-limit=500000

on the "configure" command. This is just the default; individual calls to
- pcre2_match() can supply their own value. There is more discussion on the
- pcre2api man page.
+ pcre2_match() can supply their own value. There is more discussion in the
+ pcre2api man page (search for pcre2_set_match_limit).

-. There is a separate counter that limits the depth of recursive function calls
- during a matching process. This also has a default of ten million, which is
- essentially "unlimited". You can change the default by setting, for example,
+. There is a separate counter that limits the depth of nested backtracking
+ during a matching process, which in turn limits the amount of memory that is
+ used. This also has a default of ten million, which is essentially
+ "unlimited". You can change the default by setting, for example,

- --with-match-limit-recursion=500000
+ --with-match-limit-depth=5000

- Recursive function calls use up the runtime stack; running out of stack can
- cause programs to crash in strange ways. There is a discussion about stack
- sizes in the pcre2stack man page.
+ There is more discussion in the pcre2api man page (search for
+ pcre2_set_depth_limit).

. In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the
@@ -254,20 +253,6 @@
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used.

-. You can build PCRE2 so that its internal match() function that is called from
- pcre2_match() does not call itself recursively. Instead, it uses memory
- blocks obtained from the heap to save data that would otherwise be saved on
- the stack. To build PCRE2 like this, use
-
- --disable-stack-for-recursion
-
- on the "configure" command. PCRE2 runs more slowly in this mode, but it may
- be necessary in environments with limited stack sizes. This applies only to
- the normal execution of the pcre2_match() function; if JIT support is being
- successfully used, it is not relevant. Equally, it does not apply to
- pcre2_dfa_match(), which does not use deeply nested recursion. There is a
- discussion about stack sizes in the pcre2stack man page.
-
. For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify
@@ -389,6 +374,13 @@
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.

+. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
+ which caused pcre2_match() to use individual blocks on the heap for
+ backtracking instead of recursive function calls (which use the stack). This
+ is now obsolete since pcre2_match() was refactored always to use the heap (in
+ a much more efficient way than before). This option is retained for backwards
+ compatibility, but has no effect other than to output a warning.
+
The "configure" script builds the following files for the basic C library:

 . Makefile             the makefile that builds the library
@@ -662,26 +654,33 @@
 Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
 16-bit and 32-bit modes. These are tests that generate different output in
 8-bit mode. Each pair are for general cases and Unicode support, respectively.
+
 Test 13 checks the handling of non-UTF characters greater than 255 by
 pcre2_dfa_match() in 16-bit and 32-bit modes.


-Test 14 contains a number of tests that must not be run with JIT. They check,
+Test 14 contains some special UTF and UCP tests that give different output for
+the different widths.
+
+Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.

-Test 15 is run only when JIT support is not available. It checks that an
+Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.

-Test 16 is run only when JIT support is available. It checks JIT complete and
+Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.

-Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
+Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.

-Test 19 checks the serialization functions by writing a set of compiled
+Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.

+Tests 21 and 22 test \C support when the use of \C is not locked out, without
+and with UTF support, respectively. Test 23 tests \C when it is locked out.

+
Character tables
----------------

@@ -866,4 +865,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 01 November 2016
+Last updated: 17 March 2017