Revision: 1314
http://vcs.pcre.org/viewvc?view=rev&revision=1314
Author: ph10
Date: 2013-04-26 11:44:13 +0100 (Fri, 26 Apr 2013)
Log Message:
-----------
Documentation updates.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre.3
code/trunk/doc/pcreapi.3
code/trunk/doc/pcrepattern.3
code/trunk/doc/pcresyntax.3
code/trunk/doc/pcretest.1
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2013-04-24 12:07:09 UTC (rev 1313)
+++ code/trunk/ChangeLog 2013-04-26 10:44:13 UTC (rev 1314)
@@ -141,7 +141,9 @@
37. The value of the max lookbehind was not correctly preserved if a compiled
and saved regex was reloaded on a host of different endianness.
-38. Implemented (*LIMIT_MATCH) and (*LIMIT_RECURSION).
+38. Implemented (*LIMIT_MATCH) and (*LIMIT_RECURSION). As part of the extension
+ of the compiled pattern block, expanded the flags field from 16 to 32 bits
+ because it was almost full.
Version 8.32 30-November-2012
Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3 2013-04-24 12:07:09 UTC (rev 1313)
+++ code/trunk/doc/pcre.3 2013-04-26 10:44:13 UTC (rev 1314)
@@ -1,4 +1,4 @@
-.TH PCRE 3 "11 November 2012" "PCRE 8.32"
+.TH PCRE 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH INTRODUCTION
@@ -121,8 +121,11 @@
use sufficiently many resources as to cause your application to lose
performance.
.P
-The best way of guarding against this possibility is to use the
+One way of guarding against this possibility is to use the
\fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
+Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at
+compile time. This causes a compile-time error if a pattern contains a
+UTF-setting sequence.
.P
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
@@ -197,6 +200,6 @@
.rs
.sp
.nf
-Last updated: 11 November 2012
-Copyright (c) 1997-2012 University of Cambridge.
+Last updated: 26 April 2013
+Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3 2013-04-24 12:07:09 UTC (rev 1313)
+++ code/trunk/doc/pcreapi.3 2013-04-26 10:44:13 UTC (rev 1314)
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "05 April 2013" "PCRE 8.33"
+.TH PCREAPI 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.sp
@@ -761,7 +761,7 @@
UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This may be useful in applications that process patterns
-from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
+from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
causes an error.
.sp
PCRE_NEWLINE_CR
@@ -1092,13 +1092,13 @@
.P
These two optimizations apply to both \fBpcre_exec()\fP and
\fBpcre_dfa_exec()\fP, and the information is also used by the JIT compiler.
-The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
+The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
You might want to do this if your pattern contains callouts or (*MARK) and you
want to make use of these facilities in cases where matching fails.
.P
PCRE_NO_START_OPTIMIZE can be specified at either compile time or execution
-time. However, if PCRE_NO_START_OPTIMIZE is passed to \fBpcre_exec()\fP, (that
-is, after any JIT compilation has happened) JIT execution is disabled. For JIT
+time. However, if PCRE_NO_START_OPTIMIZE is passed to \fBpcre_exec()\fP, (that
+is, after any JIT compilation has happened) JIT execution is disabled. For JIT
execution to work with PCRE_NO_START_OPTIMIZE, the option must be set at
compile time.
.P
@@ -1193,6 +1193,7 @@
PCRE_ERROR_BADENDIANNESS the pattern was compiled with different
endianness
PCRE_ERROR_BADOPTION the value of \fIwhat\fP was invalid
+ PCRE_ERROR_UNSET the requested field is not set
.sp
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. The endianness error can
@@ -1311,6 +1312,13 @@
instead the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values should
be used.
.sp
+ PCRE_INFO_MATCHLIMIT
+.sp
+If the pattern set a match limit by including an item of the form
+(*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
+.sp
PCRE_INFO_MAXLOOKBEHIND
.sp
Return the number of characters (NB not bytes) in the longest lookbehind
@@ -1319,8 +1327,8 @@
\eb and \eB require a one-character lookbehind. \eA also registers a
one-character lookbehind, though it does not actually inspect the previous
character. This is to ensure that at least one character from the old segment
-is retained when a new segment is processed. Otherwise, if there are no
-lookbehinds in the pattern, \eA might match incorrectly at the start of a new
+is retained when a new segment is processed. Otherwise, if there are no
+lookbehinds in the pattern, \eA might match incorrectly at the start of a new
segment.
.sp
PCRE_INFO_MINLENGTH
@@ -1430,6 +1438,13 @@
For such patterns, the PCRE_ANCHORED bit is set in the options returned by
\fBpcre_fullinfo()\fP.
.sp
+ PCRE_INFO_RECURSIONLIMIT
+.sp
+If the pattern set a recursion limit by including an item of the form
+(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
+argument should point to an unsigned 32-bit integer. If no such value has been
+set, the call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
+.sp
PCRE_INFO_SIZE
.sp
Return the size of the compiled pattern in bytes (for both libraries). The
@@ -1663,6 +1678,15 @@
the \fIflags\fP field. If the limit is exceeded, \fBpcre_exec()\fP returns
PCRE_ERROR_MATCHLIMIT.
.P
+A value for the match limit may also be supplied by an item at the start of a
+pattern of the form
+.sp
+ (*LIMIT_MATCH=d)
+.sp
+where d is a decimal number. However, such a setting is ignored unless d is
+less than the limit set by the caller of \fBpcre_exec()\fP or, if no such limit
+is set, less than the default.
+.P
The \fImatch_limit_recursion\fP field is similar to \fImatch_limit\fP, but
instead of limiting the total number of times that \fBmatch()\fP is called, it
limits the depth of recursion. The recursion depth is a smaller number than the
@@ -1681,6 +1705,15 @@
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the \fIflags\fP field. If the limit
is exceeded, \fBpcre_exec()\fP returns PCRE_ERROR_RECURSIONLIMIT.
.P
+A value for the recursion limit may also be supplied by an item at the start of
+a pattern of the form
+.sp
+ (*LIMIT_RECURSION=d)
+.sp
+where d is a decimal number. However, such a setting is ignored unless d is
+less than the limit set by the caller of \fBpcre_exec()\fP or, if no such limit
+is set, less than the default.
+.P
The \fIcallout_data\fP field is used in conjunction with the "callout" feature,
and is described in the
.\" HREF
@@ -2372,8 +2405,8 @@
PCRE_UTF8_ERR22
.sp
This error code was formerly used when the presence of a so-called
-"non-character" caused an error. Unicode corrigendum #9 makes it clear that
-such characters should not cause a string to be rejected, and so this code is
+"non-character" caused an error. Unicode corrigendum #9 makes it clear that
+such characters should not cause a string to be rejected, and so this code is
no longer in use and is never returned.
.
.
@@ -2843,6 +2876,6 @@
.rs
.sp
.nf
-Last updated: 05 April 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2013-04-24 12:07:09 UTC (rev 1313)
+++ code/trunk/doc/pcrepattern.3 2013-04-26 10:44:13 UTC (rev 1314)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "05 April 2013" "PCRE 8.33"
+.TH PCREPATTERN 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -20,6 +20,34 @@
published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
.P
+This document discusses the patterns that are supported by PCRE when one of its
+main matching functions, \fBpcre_exec()\fP (8-bit) or \fBpcre[16|32]_exec()\fP
+(16- or 32-bit), is used. PCRE also has alternative matching functions,
+\fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP, which match using a
+different algorithm that is not Perl-compatible. Some of the features discussed
+below are not available when DFA matching is used. The advantages and
+disadvantages of the alternative functions, and how they differ from the normal
+functions, are discussed in the
+.\" HREF
+\fBpcrematching\fP
+.\"
+page.
+.
+.
+.SH "SPECIAL START-OF-PATTERN ITEMS"
+.rs
+.sp
+A number of options that can be passed to \fBpcre_compile()\fP can also be set
+by special items at the start of a pattern. These are not Perl-compatible, but
+are provided to make these options accessible to pattern writers who are not
+able to change the program that processes the pattern. Any number of these
+items may appear, but they must all be together right at the start of the
+pattern string, and the letters must be in upper case.
+.
+.
+.SS "UTF support"
+.rs
+.sp
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 strings in the original library, an
extra library that supports 16-bit and UTF-16 character strings, and a
@@ -36,55 +64,41 @@
.sp
(*UTF) is a generic sequence that can be used with any of the libraries.
Starting a pattern with such a sequence is equivalent to setting the relevant
-option. This feature is not Perl-compatible. How setting a UTF mode affects
-pattern matching is mentioned in several places below. There is also a summary
-of features in the
+option. How setting a UTF mode affects pattern matching is mentioned in several
+places below. There is also a summary of features in the
.\" HREF
\fBpcreunicode\fP
.\"
page.
.P
-Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
+Some applications that allow their users to supply patterns may wish to
+restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF
+option is set at compile time, (*UTF) etc. are not allowed, and their
+appearance causes an error.
+.
+.
+.SS "Unicode property support"
+.rs
.sp
+Another special sequence that may appear at the start of a pattern is
+.sp
(*UCP)
.sp
This has the same effect as setting the PCRE_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup
table.
-.P
-If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
-PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are
-also some more of these special sequences that are concerned with the handling
-of newlines; they are described below.
-.P
-The remainder of this document discusses the patterns that are supported by
-PCRE when one its main matching functions, \fBpcre_exec()\fP (8-bit) or
-\fBpcre[16|32]_exec()\fP (16- or 32-bit), is used. PCRE also has alternative
-matching functions, \fBpcre_dfa_exec()\fP and \fBpcre[16|32_dfa_exec()\fP,
-which match using a different algorithm that is not Perl-compatible. Some of
-the features discussed below are not available when DFA matching is used. The
-advantages and disadvantages of the alternative functions, and how they differ
-from the normal functions, are discussed in the
-.\" HREF
-\fBpcrematching\fP
-.\"
-page.
.
.
-.SH "EBCDIC CHARACTER CODES"
+.SS "Disabling start-up optimizations"
.rs
.sp
-PCRE can be compiled to run in an environment that uses EBCDIC as its character
-code rather than ASCII or Unicode (typically a mainframe system). In the
-sections below, character code values are ASCII or Unicode; in an EBCDIC
-environment these characters may have different code values, and there are no
-code points greater than 255.
+If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
+PCRE_NO_START_OPTIMIZE option either at compile or matching time.
.
.
.\" HTML <a name="newlines"></a>
-.SH "NEWLINE CONVENTIONS"
+.SS "Newline conventions"
.rs
.sp
PCRE supports five different conventions for indicating line breaks in
@@ -117,9 +131,7 @@
(*CR)a.b
.sp
changes the convention to CR. That pattern matches "a\enb" because LF is no
-longer a newline. Note that these special settings, which are not
-Perl-compatible, are recognized only at the very start of a pattern, and that
-they must be in upper case. If more than one of them is present, the last one
+longer a newline. If more than one of these settings is present, the last one
is used.
.P
The newline convention affects where the circumflex and dollar assertions are
@@ -136,6 +148,38 @@
convention.
.
.
+.SS "Setting match and recursion limits"
+.rs
+.sp
+The caller of \fBpcre_exec()\fP can set a limit on the number of times the
+internal \fBmatch()\fP function is called and on the maximum depth of
+recursive calls. These facilities are provided to catch runaway matches that
+are provoked by patterns with huge matching trees (a typical example is a
+pattern with nested unlimited repeats) and to avoid running out of system stack
+by too much recursion. When one of these limits is reached, \fBpcre_exec()\fP
+gives an error return. The limits can also be set by items at the start of the
+pattern of the form
+.sp
+ (*LIMIT_MATCH=d)
+ (*LIMIT_RECURSION=d)
+.sp
+where d is any number of decimal digits. However, the value of the setting must
+be less than the value set by the caller of \fBpcre_exec()\fP for it to have
+any effect. In other words, the pattern writer can lower the limit set by the
+programmer, but not raise it. If there is more than one setting of one of these
+limits, the lower value is used.
+.
+.
+.SH "EBCDIC CHARACTER CODES"
+.rs
+.sp
+PCRE can be compiled to run in an environment that uses EBCDIC as its character
+code rather than ASCII or Unicode (typically a mainframe system). In the
+sections below, character code values are ASCII or Unicode; in an EBCDIC
+environment these characters may have different code values, and there are no
+code points greater than 255.
+.
+.
.SH "CHARACTERS AND METACHARACTERS"
.rs
.sp
@@ -3101,6 +3145,6 @@
.rs
.sp
.nf
-Last updated: 05 April 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3 2013-04-24 12:07:09 UTC (rev 1313)
+++ code/trunk/doc/pcresyntax.3 2013-04-26 10:44:13 UTC (rev 1314)
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "27 February 2013" "PCRE 8.33"
+.TH PCRESYNTAX 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -347,6 +347,8 @@
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
+ (*LIMIT_MATCH=d) set the match limit to d (decimal number)
+ (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
@@ -493,6 +495,6 @@
.rs
.sp
.nf
-Last updated: 27 February 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1 2013-04-24 12:07:09 UTC (rev 1313)
+++ code/trunk/doc/pcretest.1 2013-04-26 10:44:13 UTC (rev 1314)
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "05 April 2013" "PCRE 8.33"
+.TH PCRETEST 1 "26 April 2013" "PCRE 8.33"
.SH NAME
pcretest - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -40,23 +40,34 @@
but without much justification.
.
.
+.SH "INPUT DATA FORMAT"
+.rs
+.sp
+Input to \fBpcretest\fP is processed line by line, either by calling the C
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
+below). In Unix-like environments, \fBfgets()\fP treats any bytes other than
+newline as data characters. However, in some Windows environments character 26
+(hex 1A) causes an immediate end of file, and no further data is read. For
+maximum portability, therefore, it is safest to use only ASCII characters in
+\fBpcretest\fP input files.
+.
+.
.SH "PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
.rs
.sp
From release 8.30, two separate PCRE libraries can be built. The original one
supports 8-bit character strings, whereas the newer 16-bit library supports
-character strings encoded in 16-bit units. From release 8.32, a third
-library can be built, supporting character strings encoded in 32-bit units.
-The \fBpcretest\fP program can be
-used to test all three libraries. However, it is itself still an 8-bit program,
-reading 8-bit input and writing 8-bit output. When testing the 16-bit or 32-bit
-library, the patterns and data strings are converted to 16- or 32-bit format
-before being passed to the PCRE library functions. Results are converted to
-8-bit for output.
+character strings encoded in 16-bit units. From release 8.32, a third library
+can be built, supporting character strings encoded in 32-bit units. The
+\fBpcretest\fP program can be used to test all three libraries. However, it is
+itself still an 8-bit program, reading 8-bit input and writing 8-bit output.
+When testing the 16-bit or 32-bit library, the patterns and data strings are
+converted to 16- or 32-bit format before being passed to the PCRE library
+functions. Results are converted to 8-bit for output.
.P
References to functions and structures of the form \fBpcre[16|32]_xx\fP below
-mean "\fBpcre_xx\fP when using the 8-bit library or \fBpcre16_xx\fP when using
-the 16-bit library".
+mean "\fBpcre_xx\fP when using the 8-bit library, \fBpcre16_xx\fP when using
+the 16-bit library, or \fBpcre32_xx\fP when using the 32-bit library".
.
.
.SH "COMMAND LINE OPTIONS"
@@ -1083,6 +1094,6 @@
.rs
.sp
.nf
-Last updated: 05 April 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
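
The rule documented above for (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) (a pattern may lower the limit set by the caller of pcre_exec() but not raise it, and the lowest of several settings wins) can be modelled in a few lines. This is only a sketch of the documented behaviour, not PCRE's implementation: the function names, the list of recognised start-of-pattern items, and the default value of 10000000 are assumptions made for illustration.

```python
import re

# Illustrative model of the documented rule; NOT PCRE's own code.
# Assumed set of items that may appear in the run at the start of a pattern.
START_ITEMS = {
    "UTF8", "UTF16", "UTF32", "UTF", "UCP", "NO_START_OPT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY",
    "LIMIT_MATCH", "LIMIT_RECURSION",
}

# One item: (*NAME) or (*NAME=digits), upper case only.
_ITEM = re.compile(r"\(\*([A-Z_0-9]+)(?:=(\d+))?\)")


def leading_limits(pattern, name):
    """Return every value given as (*name=d) in the leading run of items."""
    values = []
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None or m.group(1) not in START_ITEMS:
            break  # the run of start-of-pattern items has ended
        if m.group(1) == name and m.group(2) is not None:
            values.append(int(m.group(2)))
        pos = m.end()
    return values


def effective_limit(pattern, name, caller_limit=None, default=10_000_000):
    """Limit that applies: the pattern can lower the base limit, never raise it.

    `default` stands in for the library's built-in limit (assumed value).
    """
    base = default if caller_limit is None else caller_limit
    return min([base] + leading_limits(pattern, name))
```

For instance, with a caller limit of 1000, a pattern beginning (*LIMIT_MATCH=500) would run under a limit of 500, while one beginning (*LIMIT_MATCH=5000) would still run under 1000.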