Revision: 858
http://vcs.pcre.org/viewvc?view=rev&revision=858
Author: ph10
Date: 2012-01-08 17:55:38 +0000 (Sun, 08 Jan 2012)
Log Message:
-----------
Documentation updates.
Modified Paths:
--------------
code/trunk/doc/pcre16.3
code/trunk/doc/pcrecallout.3
code/trunk/doc/pcrecompat.3
code/trunk/doc/pcrecpp.3
code/trunk/doc/pcrejit.3
code/trunk/doc/pcrelimits.3
code/trunk/doc/pcrematching.3
code/trunk/doc/pcrepartial.3
Modified: code/trunk/doc/pcre16.3
===================================================================
--- code/trunk/doc/pcre16.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcre16.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -219,10 +219,14 @@
other 16-bit functions expect the strings they are passed to be in host byte
order.
.P
-The \fIlength\fP argument of \fBpcre16_utf16_to_host_byte_order()\fP specifies
-the number of 16-bit data units in the input string; a negative value specifies
-a zero-terminated string.
+The \fIinput\fP and \fIoutput\fP arguments of
+\fBpcre16_utf16_to_host_byte_order()\fP may point to the same address, that is,
+conversion in place is supported. The output buffer must be at least as long as
+the input.
.P
+The \fIlength\fP argument specifies the number of 16-bit data units in the
+input string; a negative value specifies a zero-terminated string.
+.P
If \fIbyte_order\fP is NULL, it is assumed that the string starts off in host
byte order. This may be changed by byte-order marks (BOMs) anywhere in the
string (commonly as the first character).
@@ -230,9 +234,9 @@
If \fIbyte_order\fP is not NULL, a non-zero value of the integer to which it
points means that the input starts off in host byte order, otherwise the
opposite order is assumed. Again, BOMs in the string can change this. The final
-byte order is passed back at the end of processing.
+byte order is passed back at the end of processing.
.P
-If \fIkeep_boms\fP is non zero, byte-order mark characters (0xfeff) are copied
+If \fIkeep_boms\fP is not zero, byte-order mark characters (0xfeff) are copied
into the output string. Otherwise they are discarded.
.P
The result of the function is the number of 16-bit units placed into the output
@@ -370,6 +374,6 @@
.rs
.sp
.nf
-Last updated: 07 January 2012
+Last updated: 08 January 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
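
Editorial illustration (not part of the commit): the byte-order rules in the
pcre16.3 hunk above can be sketched in plain C. This is a rough sketch under
stated assumptions, not the library's code. The name sketch_utf16_to_host, the
use of uint16_t, and the choice to count the zero terminator when length is
negative are all assumptions made for the example. It shows why conversion in
place is safe: each unit is read before anything is written at or before its
index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the PCRE function: converts UTF-16 data units
   to host byte order, tracking BOMs as the man page text describes. */
static int sketch_utf16_to_host(uint16_t *output, const uint16_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order: assume the input starts off in host byte order. */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int written = 0;

    assert(output != NULL && input != NULL);

    if (length < 0) {
        /* Negative length: zero-terminated input. Counting the terminator
           as a data unit is an assumption of this sketch. */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (int i = 0; i < length; i++) {
        uint16_t c = input[i];  /* read first; written never exceeds i */
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM read as 0xfeff means host order from here on;
               0xfffe means the following units are byte-swapped. */
            host_order = (c == 0xfeff);
            if (keep_boms) output[written++] = 0xfeff;
        } else {
            output[written++] =
                host_order ? c : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    /* Pass the final byte order back, if the caller asked for it. */
    if (byte_order != NULL) *byte_order = host_order;
    return written;
}
```

Passing the same buffer as both output and input converts it in place; with
keep_boms set to zero the result can be shorter than the input, which is why
the return value, not the original length, should be used afterwards.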
Modified: code/trunk/doc/pcrecallout.3
===================================================================
--- code/trunk/doc/pcrecallout.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrecallout.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -6,11 +6,14 @@
.sp
.B int (*pcre_callout)(pcre_callout_block *);
.PP
+.B int (*pcre16_callout)(pcre16_callout_block *);
+.PP
PCRE provides a feature called "callout", which is a means of temporarily
passing control to the caller of PCRE in the middle of pattern matching. The
caller of PCRE provides an external function by putting its entry point in the
-global variable \fIpcre_callout\fP. By default, this variable contains NULL,
-which disables all calling out.
+global variable \fIpcre_callout\fP (\fIpcre16_callout\fP for the 16-bit
+library). By default, this variable contains NULL, which disables all calling
+out.
.P
Within a regular expression, (?C) indicates the points at which the external
function is to be called. Different callout points can be identified by putting
@@ -19,10 +22,9 @@
.sp
(?C1)abc(?C2)def
.sp
-If the PCRE_AUTO_CALLOUT option bit is set when \fBpcre_compile()\fP or
-\fBpcre_compile2()\fP is called, PCRE automatically inserts callouts, all with
-number 255, before each item in the pattern. For example, if PCRE_AUTO_CALLOUT
-is used with the pattern
+If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE
+automatically inserts callouts, all with number 255, before each item in the
+pattern. For example, if PCRE_AUTO_CALLOUT is used with the pattern
.sp
A(\ed{2}|--)
.sp
@@ -65,33 +67,35 @@
been scanned far enough.
.P
You can disable these optimizations by passing the PCRE_NO_START_OPTIMIZE
-option to \fBpcre_compile()\fP, \fBpcre_exec()\fP, or \fBpcre_dfa_exec()\fP,
-or by starting the pattern with (*NO_START_OPT). This slows down the matching
-process, but does ensure that callouts such as the example above are obeyed.
+option to the matching function, or by starting the pattern with
+(*NO_START_OPT). This slows down the matching process, but does ensure that
+callouts such as the example above are obeyed.
.
.
.SH "THE CALLOUT INTERFACE"
.rs
.sp
During matching, when PCRE reaches a callout point, the external function
-defined by \fIpcre_callout\fP is called (if it is set). This applies to both
-the \fBpcre_exec()\fP and the \fBpcre_dfa_exec()\fP matching functions. The
-only argument to the callout function is a pointer to a \fBpcre_callout\fP
-block. This structure contains the following fields:
+defined by \fIpcre_callout\fP or \fIpcre16_callout\fP is called (if it is set).
+This applies to both normal and DFA matching. The only argument to the callout
+function is a pointer to a \fBpcre_callout\fP or \fBpcre16_callout\fP block.
+These structures contain the following fields:
.sp
- int \fIversion\fP;
- int \fIcallout_number\fP;
- int *\fIoffset_vector\fP;
- const char *\fIsubject\fP;
- int \fIsubject_length\fP;
- int \fIstart_match\fP;
- int \fIcurrent_position\fP;
- int \fIcapture_top\fP;
- int \fIcapture_last\fP;
- void *\fIcallout_data\fP;
- int \fIpattern_position\fP;
- int \fInext_item_length\fP;
- const unsigned char *\fImark\fP;
+ int \fIversion\fP;
+ int \fIcallout_number\fP;
+ int *\fIoffset_vector\fP;
+ const char *\fIsubject\fP; (8-bit version)
+ PCRE_SPTR16 \fIsubject\fP; (16-bit version)
+ int \fIsubject_length\fP;
+ int \fIstart_match\fP;
+ int \fIcurrent_position\fP;
+ int \fIcapture_top\fP;
+ int \fIcapture_last\fP;
+ void *\fIcallout_data\fP;
+ int \fIpattern_position\fP;
+ int \fInext_item_length\fP;
+ const unsigned char *\fImark\fP; (8-bit version)
+ const PCRE_SCHAR16 *\fImark\fP; (16-bit version)
.sp
The \fIversion\fP field is an integer containing the version number of the
block format. The initial version was 0; the current version is 2. The version
@@ -103,14 +107,14 @@
automatically generated callouts).
.P
The \fIoffset_vector\fP field is a pointer to the vector of offsets that was
-passed by the caller to \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. When
-\fBpcre_exec()\fP is used, the contents can be inspected in order to extract
+passed by the caller to the matching function. When \fBpcre_exec()\fP or
+\fBpcre16_exec()\fP is used, the contents can be inspected, in order to extract
substrings that have been matched so far, in the same way as for extracting
-substrings after a match has completed. For \fBpcre_dfa_exec()\fP this field is
-not useful.
+substrings after a match has completed. For the DFA matching functions, this
+field is not useful.
.P
The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values
-that were passed to \fBpcre_exec()\fP.
+that were passed to the matching function.
.P
The \fIstart_match\fP field normally contains the offset within the subject at
which the current match attempt started. However, if the escape sequence \eK
@@ -122,48 +126,47 @@
The \fIcurrent_position\fP field contains the offset within the subject of the
current match pointer.
.P
-When the \fBpcre_exec()\fP function is used, the \fIcapture_top\fP field
-contains one more than the number of the highest numbered captured substring so
-far. If no substrings have been captured, the value of \fIcapture_top\fP is
-one. This is always the case when \fBpcre_dfa_exec()\fP is used, because it
-does not support captured substrings.
+When \fBpcre_exec()\fP or \fBpcre16_exec()\fP is used, the
+\fIcapture_top\fP field contains one more than the number of the highest
+numbered captured substring so far. If no substrings have been captured, the
+value of \fIcapture_top\fP is one. This is always the case when the DFA
+functions are used, because they do not support captured substrings.
.P
The \fIcapture_last\fP field contains the number of the most recently captured
substring. If no substrings have been captured, its value is -1. This is always
-the case when \fBpcre_dfa_exec()\fP is used.
+the case for the DFA matching functions.
.P
-The \fIcallout_data\fP field contains a value that is passed to
-\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP specifically so that it can be
-passed back in callouts. It is passed in the \fIpcre_callout\fP field of the
-\fBpcre_extra\fP data structure. If no such data was passed, the value of
-\fIcallout_data\fP in a \fBpcre_callout\fP block is NULL. There is a
-description of the \fBpcre_extra\fP structure in the
+The \fIcallout_data\fP field contains a value that is passed to a matching
+function specifically so that it can be passed back in callouts. It is passed
+in the \fIcallout_data\fP field of a \fBpcre_extra\fP or \fBpcre16_extra\fP
+data structure. If no such data was passed, the value of \fIcallout_data\fP in
+a callout block is NULL. There is a description of the \fBpcre_extra\fP
+structure in the
.\" HREF
\fBpcreapi\fP
.\"
documentation.
.P
-The \fIpattern_position\fP field is present from version 1 of the
-\fIpcre_callout\fP structure. It contains the offset to the next item to be
-matched in the pattern string.
+The \fIpattern_position\fP field is present from version 1 of the callout
+structure. It contains the offset to the next item to be matched in the pattern
+string.
.P
-The \fInext_item_length\fP field is present from version 1 of the
-\fIpcre_callout\fP structure. It contains the length of the next item to be
-matched in the pattern string. When the callout immediately precedes an
-alternation bar, a closing parenthesis, or the end of the pattern, the length
-is zero. When the callout precedes an opening parenthesis, the length is that
-of the entire subpattern.
+The \fInext_item_length\fP field is present from version 1 of the callout
+structure. It contains the length of the next item to be matched in the pattern
+string. When the callout immediately precedes an alternation bar, a closing
+parenthesis, or the end of the pattern, the length is zero. When the callout
+precedes an opening parenthesis, the length is that of the entire subpattern.
.P
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
help in distinguishing between different automatic callouts, which all have the
same callout number. However, they are set for all callouts.
.P
-The \fImark\fP field is present from version 2 of the \fIpcre_callout\fP
-structure. In callouts from \fBpcre_exec()\fP it contains a pointer to the
-zero-terminated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)
-item in the match, or NULL if no such items have been passed. Instances of
-(*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
-callouts from \fBpcre_dfa_exec()\fP this field always contains NULL.
+The \fImark\fP field is present from version 2 of the callout structure. In
+callouts from \fBpcre_exec()\fP or \fBpcre16_exec()\fP it contains a pointer to
+the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
+(*THEN) item in the match, or NULL if no such items have been passed. Instances
+of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
+callouts from the DFA matching functions this field always contains NULL.
.
.
.SH "RETURN VALUES"
@@ -173,8 +176,7 @@
matching proceeds as normal. If the value is greater than zero, matching fails
at the current point, but the testing of other matching possibilities goes
ahead, just as if a lookahead assertion had failed. If the value is less than
-zero, the match is abandoned, and \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP
-returns the negative value.
+zero, the match is abandoned, and the matching function returns the negative value.
.P
Negative values should normally be chosen from the set of PCRE_ERROR_xxx
values. In particular, PCRE_ERROR_NOMATCH forces a standard "no match" failure.
@@ -196,6 +198,6 @@
.rs
.sp
.nf
-Last updated: 30 November 2011
-Copyright (c) 1997-2011 University of Cambridge.
+Last updated: 08 January 2012
+Copyright (c) 1997-2012 University of Cambridge.
.fi
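
Editorial illustration (not part of the commit): a callout function along the
lines described in the pcrecallout.3 diff above might look like the sketch
below. To keep it self-contained it does not include pcre.h; the type
sketch_callout_block is a local stand-in that copies a subset of the 8-bit
fields listed in the diff, and both function names are hypothetical. The
convention that callout_data points to an int holding a maximum capture length
is likewise an assumption of the example. Real code would use
pcre_callout_block (or pcre16_callout_block) and store the function's address
in pcre_callout (or pcre16_callout).

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with a subset of the 8-bit callout block fields.
   Real code would include pcre.h and use pcre_callout_block instead. */
typedef struct {
    int         version;
    int         callout_number;
    int        *offset_vector;
    const char *subject;
    int         subject_length;
    int         current_position;
    int         capture_top;
    int         capture_last;
    void       *callout_data;
} sketch_callout_block;

/* Length of the most recently captured substring, or -1 if nothing has
   been captured yet (capture_last is -1 in that case). Offsets are read
   in pairs, as when extracting substrings after a completed match. */
static int sketch_last_capture_length(const sketch_callout_block *cb)
{
    if (cb->capture_last < 0) return -1;
    return cb->offset_vector[2 * cb->capture_last + 1]
         - cb->offset_vector[2 * cb->capture_last];
}

/* Hypothetical callout: callout_data is assumed to point to an int that
   holds the longest capture the caller is willing to accept. */
static int sketch_callout(sketch_callout_block *cb)
{
    assert(cb != NULL);

    /* callout_data is NULL when the caller passed no data. */
    if (cb->callout_data == NULL) return 0;   /* zero: proceed as normal */

    int max_len = *(const int *)cb->callout_data;

    /* Positive: fail at this point; other possibilities are still tried. */
    return (sketch_last_capture_length(cb) > max_len) ? 1 : 0;
}
```

Returning a positive value only rejects the current path, so the match can
still succeed through another alternative; a negative value would abandon the
match altogether. With the DFA functions capture_last is always -1, so this
particular check would never reject anything there.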
Modified: code/trunk/doc/pcrecompat.3
===================================================================
--- code/trunk/doc/pcrecompat.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrecompat.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -8,8 +8,8 @@
regular expressions. The differences described here are with respect to Perl
versions 5.10 and above.
.P
-1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
-it does have are given in the
+1. PCRE has only a subset of Perl's Unicode support. Details of what it does
+have are given in the
.\" HREF
\fBpcreunicode\fP
.\"
@@ -154,8 +154,8 @@
different hosts that have the other endianness. However, this does not apply to
optimized data created by the just-in-time compiler.
.sp
-(k) The alternative matching function (\fBpcre_dfa_exec()\fP) matches in a
-different way and is not Perl-compatible.
+(k) The alternative matching functions (\fBpcre_dfa_exec()\fP and
+\fBpcre16_dfa_exec()\fP) match in a different way and are not Perl-compatible.
.sp
(l) PCRE recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
@@ -175,6 +175,6 @@
.rs
.sp
.nf
-Last updated: 14 November 2011
-Copyright (c) 1997-2011 University of Cambridge.
+Last updated: 08 January 2012
+Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrecpp.3
===================================================================
--- code/trunk/doc/pcrecpp.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrecpp.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -12,7 +12,8 @@
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
-further details.
+further details. Note that the C++ wrapper supports only the original 8-bit
+PCRE library. There is no 16-bit support at present.
.
.
.SH "MATCHING INTERFACE"
@@ -343,6 +344,5 @@
.rs
.sp
.nf
-Last updated: 17 March 2009
-Minor typo fixed: 25 July 2011
+Last updated: 08 January 2012
.fi
Modified: code/trunk/doc/pcrejit.3
===================================================================
--- code/trunk/doc/pcrejit.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrejit.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -7,17 +7,27 @@
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is
-going to be matched many times. This does not necessarily mean many calls of
-\fPpcre_exec()\fP; if the pattern is not anchored, matching attempts may take
-place many times at various positions in the subject, even for a single call to
-\fBpcre_exec()\fP. If the subject string is very long, it may still pay to use
-JIT for one-off matches.
+going to be matched many times. This does not necessarily mean many calls of a
+matching function; if the pattern is not anchored, matching attempts may take
+place many times at various positions in the subject, even for a single call.
+Therefore, if the subject string is very long, it may still pay to use JIT for
+one-off matches.
.P
-JIT support applies only to the traditional matching function,
-\fBpcre_exec()\fP. It does not apply when \fBpcre_dfa_exec()\fP is being used.
-The code for this support was written by Zoltan Herczeg.
+JIT support applies only to the traditional Perl-compatible matching function.
+It does not apply when the DFA matching function is being used. The code for
+this support was written by Zoltan Herczeg.
.
.
+.SH "8-BIT AND 16-BIT SUPPORT"
+.rs
+.sp
+JIT support is available for both the 8-bit and 16-bit PCRE libraries. To keep
+this documentation simple, only the 8-bit interface is described in what
+follows. If you are using the 16-bit library, substitute the 16-bit functions
+and 16-bit structures (for example, \fIpcre16_jit_stack\fP instead of
+\fIpcre_jit_stack\fP).
+.
+.
.SH "AVAILABILITY OF JIT SUPPORT"
.rs
.sp
@@ -357,6 +367,6 @@
.rs
.sp
.nf
-Last updated: 26 November 2011
-Copyright (c) 1997-2011 University of Cambridge.
+Last updated: 08 January 2012
+Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrelimits.3
===================================================================
--- code/trunk/doc/pcrelimits.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrelimits.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -7,15 +7,16 @@
There are some size limitations in PCRE but it is hoped that they will never in
practice be relevant.
.P
-The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE is
-compiled with the default internal linkage size of 2. If you want to process
+The maximum length of a compiled pattern is approximately 64K data units (bytes
+for the 8-bit library, 16-bit units for the 16-bit library) if PCRE is compiled
+with the default internal linkage size of 2 bytes. If you want to process
regular expressions that are truly enormous, you can compile PCRE with an
-internal linkage size of 3 or 4 (see the \fBREADME\fP file in the source
-distribution and the
+internal linkage size of 3 or 4 (when building the 16-bit library, 3 is rounded
+up to 4). See the \fBREADME\fP file in the source distribution and the
.\" HREF
\fBpcrebuild\fP
.\"
-documentation for details). In these cases the limit is substantially larger.
+documentation for details. In these cases the limit is substantially larger.
However, the speed of execution is slower.
.P
All values in repeating quantifiers must be less than 65536.
@@ -57,6 +58,6 @@
.rs
.sp
.nf
-Last updated: 30 November 2011
-Copyright (c) 1997-2011 University of Cambridge.
+Last updated: 08 January 2012
+Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrematching.3
===================================================================
--- code/trunk/doc/pcrematching.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrematching.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -6,14 +6,19 @@
.sp
This document describes the two different algorithms that are available in PCRE
for matching a compiled regular expression against a given subject string. The
-"standard" algorithm is the one provided by the \fBpcre_exec()\fP function.
-This works in the same was as Perl's matching function, and provides a
-Perl-compatible matching operation.
+"standard" algorithm is the one provided by the \fBpcre_exec()\fP and
+\fBpcre16_exec()\fP functions. These work in the same way as Perl's matching
+function, and provide a Perl-compatible matching operation. The just-in-time
+(JIT) optimization that is described in the
+.\" HREF
+\fBpcrejit\fP
+.\"
+documentation is compatible with these functions.
.P
-An alternative algorithm is provided by the \fBpcre_dfa_exec()\fP function;
-this operates in a different way, and is not Perl-compatible. It has advantages
-and disadvantages compared with the standard algorithm, and these are described
-below.
+An alternative algorithm is provided by the \fBpcre_dfa_exec()\fP and
+\fBpcre16_dfa_exec()\fP functions; they operate in a different way, and are not
+Perl-compatible. This alternative has advantages and disadvantages compared
+with the standard algorithm, and these are described below.
.P
When there is only one possible way in which a given subject string can match a
pattern, the two algorithms give the same answer. A difference arises, however,
@@ -28,6 +33,7 @@
there are three possible answers. The standard algorithm finds only one of
them, whereas the alternative algorithm finds all three.
.
+.
.SH "REGULAR EXPRESSIONS AS TREES"
.rs
.sp
@@ -38,6 +44,7 @@
There are two ways to search a tree: depth-first and breadth-first, and these
correspond to the two matching algorithms provided by PCRE.
.
+.
.SH "THE STANDARD MATCHING ALGORITHM"
.rs
.sp
@@ -63,6 +70,7 @@
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and back references.
.
+.
.SH "THE ALTERNATIVE MATCHING ALGORITHM"
.rs
.sp
@@ -131,14 +139,15 @@
6. Callouts are supported, but the value of the \fIcapture_top\fP field is
always 1, and the value of the \fIcapture_last\fP field is always -1.
.P
-7. The \eC escape sequence, which (in the standard algorithm) matches a single
-byte, even in UTF-8 mode, is not supported in UTF-8 mode, because the
-alternative algorithm moves through the subject string one character at a time,
-for all active paths through the tree.
+7. The \eC escape sequence, which (in the standard algorithm) always matches a
+single data unit, even in UTF-8 or UTF-16 modes, is not supported in these
+modes, because the alternative algorithm moves through the subject string one
+character (not data unit) at a time, for all active paths through the tree.
.P
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
.
+.
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
.rs
.sp
@@ -150,11 +159,11 @@
callouts.
.P
2. Because the alternative algorithm scans the subject string just once, and
-never needs to backtrack, it is possible to pass very long subject strings to
-the matching function in several pieces, checking for partial matching each
-time. Although it is possible to do multi-segment matching using the standard
-algorithm (\fBpcre_exec()\fP), by retaining partially matched substrings, it is
-more complicated. The
+never needs to backtrack (except for lookbehinds), it is possible to pass very
+long subject strings to the matching function in several pieces, checking for
+partial matching each time. Although it is possible to do multi-segment
+matching using the standard algorithm by retaining partially matched
+substrings, it is more complicated. The
.\" HREF
\fBpcrepartial\fP
.\"
@@ -191,6 +200,6 @@
.rs
.sp
.nf
-Last updated: 19 November 2011
-Copyright (c) 1997-2010 University of Cambridge.
+Last updated: 08 January 2012
+Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepartial.3
===================================================================
--- code/trunk/doc/pcrepartial.3 2012-01-07 17:39:10 UTC (rev 857)
+++ code/trunk/doc/pcrepartial.3 2012-01-08 17:55:38 UTC (rev 858)
@@ -4,11 +4,11 @@
.SH "PARTIAL MATCHING IN PCRE"
.rs
.sp
-In normal use of PCRE, if the subject string that is passed to
-\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
-too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
-are circumstances where it might be helpful to distinguish this case from other
-cases in which there is no match.
+In normal use of PCRE, if the subject string that is passed to a matching
+function matches as far as it goes, but is too short to match the entire
+pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where it might
+be helpful to distinguish this case from other cases in which there is no
+match.
.P
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
@@ -25,42 +25,41 @@
long and is not all available at once.
.P
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
-PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
-\fBpcre_dfa_exec()\fP. For backwards compatibility, PCRE_PARTIAL is a synonym
-for PCRE_PARTIAL_SOFT. The essential difference between the two options is
-whether or not a partial match is preferred to an alternative complete match,
-though the details differ between the two matching functions. If both options
+PCRE_PARTIAL_HARD options, which can be set when calling any of the matching
+functions. For backwards compatibility, PCRE_PARTIAL is a synonym for
+PCRE_PARTIAL_SOFT. The essential difference between the two options is whether
+or not a partial match is preferred to an alternative complete match, though
+the details differ between the two types of matching function. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
.P
-Setting a partial matching option for \fBpcre_exec()\fP disables the use of any
-just-in-time code that was set up by calling \fBpcre_study()\fP with the
+Setting a partial matching option disables the use of any just-in-time code
+that was set up by studying the compiled pattern with the
PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
-optimizations. PCRE remembers the last literal byte in a pattern, and abandons
-matching immediately if such a byte is not present in the subject string. This
+optimizations. PCRE remembers the last literal data unit in a pattern, and
+abandons matching immediately if it is not present in the subject string. This
optimization cannot be used for a subject string that might match only
partially. If the pattern was studied, PCRE knows the minimum length of a
matching string, and does not bother to run the matching function on shorter
strings. This optimization is also disabled for partial matching.
.
.
-.SH "PARTIAL MATCHING USING pcre_exec()"
+.SH "PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()"
.rs
.sp
-A partial match occurs during a call to \fBpcre_exec()\fP when the end of the
-subject string is reached successfully, but matching cannot continue because
-more characters are needed. However, at least one character in the subject must
-have been inspected. This character need not form part of the final matched
-string; lookbehind assertions and the \eK escape sequence provide ways of
-inspecting characters before the start of a matched substring. The requirement
-for inspecting at least one character exists because an empty string can always
-be matched; without such a restriction there would always be a partial match of
-an empty string at the end of the subject.
+A partial match occurs during a call to \fBpcre_exec()\fP or
+\fBpcre16_exec()\fP when the end of the subject string is reached successfully,
+but matching cannot continue because more characters are needed. However, at
+least one character in the subject must have been inspected. This character
+need not form part of the final matched string; lookbehind assertions and the
+\eK escape sequence provide ways of inspecting characters before the start of a
+matched substring. The requirement for inspecting at least one character exists
+because an empty string can always be matched; without such a restriction there
+would always be a partial match of an empty string at the end of the subject.
.P
-If there are at least two slots in the offsets vector when \fBpcre_exec()\fP
-returns with a partial match, the first slot is set to the offset of the
-earliest character that was inspected when the partial match was found. For
-convenience, the second offset points to the end of the subject so that a
-substring can easily be identified.
+If there are at least two slots in the offsets vector when a partial match is
+returned, the first slot is set to the offset of the earliest character that
+was inspected. For convenience, the second offset points to the end of the
+subject so that a substring can easily be identified.
.P
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
@@ -78,13 +77,14 @@
partial matching options are set.
.
.
-.SS "PCRE_PARTIAL_SOFT with pcre_exec()"
+.SS "PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec()"
.rs
.sp
-If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP identifies a partial match,
-the partial match is remembered, but matching continues as normal, and other
-alternatives in the pattern are tried. If no complete match can be found,
-\fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
+If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP or \fBpcre16_exec()\fP
+identifies a partial match, the partial match is remembered, but matching
+continues as normal, and other alternatives in the pattern are tried. If no
+complete match can be found, PCRE_ERROR_PARTIAL is returned instead of
+PCRE_ERROR_NOMATCH.
.P
This option is "soft" because it prefers a complete match over a partial match.
All the various matching items in a pattern behave as if the subject string is
@@ -105,21 +105,23 @@
matches the second alternative.)
.
.
-.SS "PCRE_PARTIAL_HARD with pcre_exec()"
+.SS "PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec()"
.rs
.sp
-If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
-PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
-search for possible complete matches. This option is "hard" because it prefers
-an earlier partial match over a later complete match. For this reason, the
-assumption is made that the end of the supplied subject string may not be the
-true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
-encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP or \fBpcre16_exec()\fP,
+PCRE_ERROR_PARTIAL is returned as soon as a partial match is found, without
+continuing to search for possible complete matches. This option is "hard"
+because it prefers an earlier partial match over a later complete match. For
+this reason, the assumption is made that the end of the supplied subject string
+may not be the true end of the available data, and so, if \ez, \eZ, \eb, \eB,
+or $ are encountered at the end of the subject, the result is
+PCRE_ERROR_PARTIAL.
.P
-Setting PCRE_PARTIAL_HARD also affects the way \fBpcre_exec()\fP checks UTF-8
-subject strings for validity. Normally, an invalid UTF-8 sequence causes the
-error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8
-character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when
+Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16
+subject strings are checked for validity. Normally, an invalid sequence
+causes the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
+special case of a truncated character at the end of the subject,
+PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
PCRE_PARTIAL_HARD is set.
.
.
@@ -139,25 +141,25 @@
.sp
/dog(sbody)??/
.sp
-In this case the result is always a complete match because \fBpcre_exec()\fP
-finds that first, and it never continues after finding a match. It might be
-easier to follow this explanation by thinking of the two patterns like this:
+In this case the result is always a complete match because that is found first,
+and matching never continues after finding a complete match. It might be easier
+to follow this explanation by thinking of the two patterns like this:
.sp
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
.sp
-The second pattern will never match "dogsbody" when \fBpcre_exec()\fP is
-used, because it will always find the shorter match first.
+The second pattern will never match "dogsbody", because it will always find the
+shorter match first.
.
.
-.SH "PARTIAL MATCHING USING pcre_dfa_exec()"
+.SH "PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()"
.rs
.sp
-The \fBpcre_dfa_exec()\fP function moves along the subject string character by
-character, without backtracking, searching for all possible matches
-simultaneously. If the end of the subject is reached before the end of the
-pattern, there is the possibility of a partial match, again provided that at
-least one character has been inspected.
+The DFA functions move along the subject string character by character, without
+backtracking, searching for all possible matches simultaneously. If the end of
+the subject is reached before the end of the pattern, there is the possibility
+of a partial match, again provided that at least one character has been
+inspected.
.P
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
@@ -166,16 +168,16 @@
partial match was found is set as the first matching string, provided there are
at least two slots in the offsets vector.
.P
-Because \fBpcre_dfa_exec()\fP always searches for all possible matches, and
-there is no difference between greedy and ungreedy repetition, its behaviour is
-different from \fBpcre_exec\fP when PCRE_PARTIAL_HARD is set. Consider the
-string "dog" matched against the ungreedy pattern shown above:
+Because the DFA functions always search for all possible matches, and there is
+no difference between greedy and ungreedy repetition, their behaviour is
+different from the standard functions when PCRE_PARTIAL_HARD is set. Consider
+the string "dog" matched against the ungreedy pattern shown above:
.sp
/dog(sbody)??/
.sp
-Whereas \fBpcre_exec()\fP stops as soon as it finds the complete match for
-"dog", \fBpcre_dfa_exec()\fP also finds the partial match for "dogsbody", and
-so returns that when PCRE_PARTIAL_HARD is set.
+Whereas the standard functions stop as soon as they find the complete match for
+"dog", the DFA functions also find the partial match for "dogsbody", and so
+return that when PCRE_PARTIAL_HARD is set.
.
.
.SH "PARTIAL MATCHING AND WORD BOUNDARIES"
@@ -189,14 +191,11 @@
.sp
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However,
-\fBpcre_exec()\fP carries on with normal matching, which matches \eb at the end
-of the subject when the last character is a letter, thus finding a complete
-match. The result, therefore, is \fInot\fP PCRE_ERROR_PARTIAL. The same thing
-happens with \fBpcre_dfa_exec()\fP, because it also finds the complete match.
-.P
-Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
-then the partial match takes precedence.
+character cannot take place, so a partial match is found. However, normal
+matching carries on, and \eb matches at the end of the subject when the last
+character is a letter, so a complete match is found. The result, therefore, is
+\fInot\fP PCRE_ERROR_PARTIAL. Using PCRE_PARTIAL_HARD in this case does yield
+PCRE_ERROR_PARTIAL, because then the partial match takes precedence.
.
.
.SH "FORMERLY RESTRICTED PATTERNS"
@@ -206,7 +205,7 @@
optimizations were implemented in the \fBpcre_exec()\fP function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
all patterns. From release 8.00 onwards, the restrictions no longer apply, and
-partial matching with \fBpcre_exec()\fP can be requested for any pattern.
+partial matching can be requested for any pattern.
.P
Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
@@ -239,23 +238,22 @@
The first data string is matched completely, so \fBpcretest\fP shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. Similar output is obtained
-when \fBpcre_dfa_exec()\fP is used.
+if DFA matching is used.
.P
If the escape sequence \eP is present more than once in a \fBpcretest\fP data
line, the PCRE_PARTIAL_HARD option is set for the match.
.
.
-.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
+.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()"
.rs
.sp
-When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
-to continue the match by providing additional subject data and calling
-\fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
-time setting the PCRE_DFA_RESTART option. You must pass the same working
-space as before, because this is where details of the previous partial match
-are stored. Here is an example using \fBpcretest\fP, using the \eR escape
-sequence to set the PCRE_DFA_RESTART option (\eD specifies the use of
-\fBpcre_dfa_exec()\fP):
+When a partial match has been found using a DFA matching function, it is
+possible to continue the match by providing additional subject data and calling
+the function again with the same compiled regular expression, this time setting
+the PCRE_DFA_RESTART option. You must pass the same working space as before,
+because this is where details of the previous partial match are stored. Here is
+an example using \fBpcretest\fP, using the \eR escape sequence to set the
+PCRE_DFA_RESTART option (\eD specifies the use of the DFA matching function):
.sp
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
data> 23ja\eP\eD
@@ -271,34 +269,35 @@
.P
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments. This
-facility can be used to pass very long subject strings to
-\fBpcre_dfa_exec()\fP.
+facility can be used to pass very long subject strings to the DFA matching
+functions.
.
.
-.SH "MULTI-SEGMENT MATCHING WITH pcre_exec()"
+.SH "MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()"
.rs
.sp
-From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
-matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
-previous match with a new segment of data. Instead, new data must be added to
-the previous subject string, and the entire match re-run, starting from the
-point where the partial match occurred. Earlier data can be discarded. It is
-best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
-end of a segment as the end of the subject when matching \ez, \eZ, \eb, \eB,
-and $. Consider an unanchored pattern that matches dates:
+From release 8.00, the standard matching functions can also be used to do
+multi-segment matching. Unlike the DFA functions, it is not possible to
+restart the previous match with a new segment of data. Instead, new data must
+be added to the previous subject string, and the entire match re-run, starting
+from the point where the partial match occurred. Earlier data can be discarded.
+.P
+It is best to use PCRE_PARTIAL_HARD in this situation, because it does not
+treat the end of a segment as the end of the subject when matching \ez, \eZ,
+\eb, \eB, and $. Consider an unanchored pattern that matches dates:
.sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
data> The date is 23ja\eP\eP
Partial match: 23ja
.sp
At this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call \fBpcre_exec()\fP again. Unlike
-\fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
+text from the next segment, and call the matching function again. Unlike the
+DFA matching functions, the entire matching string must always be available, and
the complete matching process occurs for each call, so more memory and more
processing time is needed.
.P
\fBNote:\fP If the pattern contains lookbehind assertions, or \eK, or starts
-with \eb or \eB, the string that is returned for a partial match will include
+with \eb or \eB, the string that is returned for a partial match includes
characters that precede the partially matched string itself, because these must
be retained when adding on more characters for a subsequent matching attempt.
.
@@ -343,14 +342,14 @@
0: dogsbody
1: dog
.sp
-The first data line passes the string "dogsb" to \fBpcre_exec()\fP, setting the
-PCRE_PARTIAL_SOFT option. Although the string is a partial match for
-"dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
-"dog" is a complete match. Similarly, when the subject is presented to
-\fBpcre_dfa_exec()\fP in several parts ("do" and "gsb" being the first two) the
-match stops when "dog" has been found, and it is not possible to continue. On
-the other hand, if "dogsbody" is presented as a single string,
-\fBpcre_dfa_exec()\fP finds both matches.
+The first data line passes the string "dogsb" to a standard matching function,
+setting the PCRE_PARTIAL_SOFT option. Although the string is a partial match
+for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter
+string "dog" is a complete match. Similarly, when the subject is presented to
+a DFA matching function in several parts ("do" and "gsb" being the first two)
+the match stops when "dog" has been found, and it is not possible to continue.
+On the other hand, if "dogsbody" is presented as a single string, a DFA
+matching function finds both matches.
.P
Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
multi-segment data. The example above then behaves differently:
@@ -363,10 +362,9 @@
data> gsb\eR\eP\eP\eD
Partial match: gsb
.sp
-4. Patterns that contain alternatives at the top level which do not all
-start with the same pattern item may not work as expected when
-PCRE_DFA_RESTART is used with \fBpcre_dfa_exec()\fP. For example, consider this
-pattern:
+4. Patterns that contain alternatives at the top level which do not all start
+with the same pattern item may not work as expected when PCRE_DFA_RESTART is
+used. For example, consider this pattern:
.sp
1234|3789
.sp
@@ -382,8 +380,8 @@
1234|ABCD
.sp
where no string can be a partial match for both alternatives. This is not a
-problem if \fBpcre_exec()\fP is used, because the entire match has to be rerun
-each time:
+problem if a standard matching function is used, because the entire match has
+to be rerun each time:
.sp
re> /1234|3789/
data> ABC123\eP\eP
@@ -392,7 +390,7 @@
0: 3789
.sp
Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
-the entire match can also be used with \fBpcre_dfa_exec()\fP. Another
+the entire match can also be used with the DFA matching functions. Another
possibility is to work with two buffers. If a partial match at offset \fIn\fP
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
the second buffer, you can then try a new match starting at offset \fIn+1\fP in
@@ -413,6 +411,6 @@
.rs
.sp
.nf
-Last updated: 26 August 2011
-Copyright (c) 1997-2011 University of Cambridge.
+Last updated: 08 January 2012
+Copyright (c) 1997-2012 University of Cambridge.
.fi
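
Illustration (not part of the commit): the revised text on multi-segment matching with the standard functions says that, after a partial match, an application can discard the text preceding the partial match, add on text from the next segment, and call the matching function again. The buffer handling for that step might be sketched as below. The helper name carry_over_partial() and its parameters are hypothetical and do not exist in PCRE; the sketch assumes the caller already knows the offset at which the partial match started and makes no PCRE calls itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of the PCRE API.
 *
 * buf            subject buffer holding *len bytes of the previous segment(s)
 * len            in: bytes currently in buf; out: bytes after the carry-over
 * cap            total capacity of buf in bytes
 * partial_start  offset in buf where the partial match began
 * segment        newly arrived data to append
 * segment_len    number of bytes in segment
 *
 * Discards everything before partial_start, moves the partially matched
 * tail to the front of buf, and appends the new segment after it.
 * Returns 0 on success, -1 on bad arguments or insufficient capacity.
 */
int carry_over_partial(char *buf, size_t *len, size_t cap,
                       size_t partial_start,
                       const char *segment, size_t segment_len)
{
    size_t kept;

    if (buf == NULL || len == NULL || segment == NULL)
        return -1;
    if (*len > cap || partial_start > *len)
        return -1;

    kept = *len - partial_start;          /* bytes of the partial match */
    if (segment_len > cap - kept)
        return -1;                        /* new data would not fit */

    /* Regions may overlap, so memmove() rather than memcpy(). */
    memmove(buf, buf + partial_start, kept);
    memcpy(buf + kept, segment, segment_len);

    assert(kept + segment_len <= cap);
    *len = kept + segment_len;
    return 0;
}
```

With the date example from the diff, a buffer holding "The date is 23ja" and a partial match starting at offset 12 would be reduced to "23ja" before the next segment is appended. In a real application the offset would presumably be taken from the first element of the offsets vector after the matching function returns PCRE_ERROR_PARTIAL, and the whole buffer would then be passed to the matching function again. If the pattern uses lookbehind, \eK, \eb or \eB, the note in the diff about retaining preceding characters applies.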