Revision: 158
http://www.exim.org/viewvc/pcre2?view=rev&revision=158
Author: ph10
Date: 2014-11-23 18:38:38 +0000 (Sun, 23 Nov 2014)
Log Message:
-----------
More documentation and test updates.
Modified Paths:
--------------
code/trunk/doc/html/pcre2.html
code/trunk/doc/html/pcre2build.html
code/trunk/doc/html/pcre2callout.html
code/trunk/doc/html/pcre2grep.html
code/trunk/doc/html/pcre2jit.html
code/trunk/doc/html/pcre2syntax.html
code/trunk/doc/html/pcre2unicode.html
code/trunk/doc/pcre2.txt
code/trunk/doc/pcre2api.3
code/trunk/doc/pcre2build.3
code/trunk/doc/pcre2callout.3
code/trunk/doc/pcre2grep.1
code/trunk/doc/pcre2grep.txt
code/trunk/doc/pcre2jit.3
code/trunk/doc/pcre2syntax.3
code/trunk/doc/pcre2test.1
code/trunk/doc/pcre2unicode.3
code/trunk/src/pcre2_dfa_match.c
code/trunk/src/pcre2_match.c
code/trunk/src/pcre2test.c
code/trunk/testdata/testinput10
code/trunk/testdata/testinput12
code/trunk/testdata/testoutput10
code/trunk/testdata/testoutput12-16
code/trunk/testdata/testoutput12-32
Modified: code/trunk/doc/html/pcre2.html
===================================================================
--- code/trunk/doc/html/pcre2.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -148,7 +148,7 @@
pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility
- pcre2pattern syntax and semantics of supported regular expression patterns
+ pcre2pattern syntax and semantics of supported regular expression patterns
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
Modified: code/trunk/doc/html/pcre2build.html
===================================================================
--- code/trunk/doc/html/pcre2build.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2build.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -17,9 +17,9 @@
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
-<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a>
+<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
-<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a>
+<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
@@ -91,12 +91,12 @@
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the <b>configure</b> command:
<pre>
- --enable-pcre16
- --enable-pcre32
+ --enable-pcre2-16
+ --enable-pcre2-32
</pre>
If you do not want the 8-bit library, add
<pre>
- --disable-pcre8
+ --disable-pcre2-8
</pre>
as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
@@ -106,14 +106,15 @@
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
<P>
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
-and static libraries by default. You can suppress one of these by adding one of
+and static libraries by default. You can suppress an unwanted library by adding
+one of
<pre>
--disable-shared
--disable-static
</pre>
-to the <b>configure</b> command, as required.
+to the <b>configure</b> command.
</P>
-<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br>
+<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
<P>
By default, PCRE2 is built with support for Unicode and UTF character strings.
To build it without Unicode support, add
@@ -126,20 +127,15 @@
</P>
<P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
-<b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library have to set the
+PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
</P>
<P>
-It is not possible to support both EBCDIC and UTF-8 codes in the same version
-of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
-exclusive.
-</P>
-<P>
-UTF support allows the libraries to process character codepoints up to 0x10ffff
-in the strings that they handle. It also provides support for accessing the
-properties of such characters, using pattern escapes such as \P, \p, and \X.
-Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
-supported. Details are given in the
+UTF support allows the libraries to process character code points up to
+0x10ffff in the strings that they handle. It also provides support for
+accessing the Unicode properties of such characters, using pattern escapes such
+as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
+<i>Nd</i> are supported. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
@@ -150,7 +146,7 @@
--enable-jit
</pre>
This support is available only for certain hardware architectures. If this
-option is set for an unsupported architecture, a compile time error occurs.
+option is set for an unsupported architecture, a building error occurs.
See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled,
@@ -160,7 +156,7 @@
</pre>
to the "configure" command.
</P>
-<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
+<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
<P>
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can
@@ -168,12 +164,13 @@
<pre>
--enable-newline-is-cr
</pre>
-to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
+to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
-<br>
-<br>
-Alternatively, you can specify that line endings are to be indicated by the two
-character sequence CRLF. If you want this, add
+</P>
+<P>
+Alternatively, you can specify that line endings are to be indicated by the
+two-character sequence CRLF (CR immediately followed by LF). If you want this,
+add
<pre>
--enable-newline-is-crlf
</pre>
@@ -186,22 +183,26 @@
<pre>
--enable-newline-is-any
</pre>
-causes PCRE2 to recognize any Unicode newline sequence.
+causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
+sequences are the three just mentioned, plus the single characters VT (vertical
+tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
+separator, U+2028), and PS (paragraph separator, U+2029).
</P>
<P>
-Whatever line ending convention is selected when PCRE2 is built can be
-overridden when the library functions are called. At build time it is
+Whatever default line ending convention is selected when PCRE2 is built can be
+overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system.
</P>
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
By default, the sequence \R in a pattern matches any Unicode newline sequence,
-whatever has been selected as the line ending sequence. If you specify
+independently of what has been selected as the line ending sequence. If you
+specify
<pre>
--enable-bsr-anycrlf
</pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
-selected when PCRE2 is built can be overridden when the library functions are
+selected when PCRE2 is built can be overridden by applications that use the
called.
</P>
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
@@ -210,10 +211,10 @@
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
-around 64K. This is sufficient to handle all but the most gigantic patterns.
-Nevertheless, some people do want to process truly enormous patterns, so it is
-possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
-setting such as
+around 64K code units. This is sufficient to handle all but the most gigantic
+patterns. Nevertheless, some people do want to process truly enormous patterns,
+so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
+adding a setting such as
<pre>
--with-link-size=3
</pre>
@@ -294,18 +295,22 @@
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE2 assumes by default that it will run in an environment where the character
-code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
+code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
-EBCDIC environment by adding
+8-bit EBCDIC environment by adding
<pre>
--enable-ebcdic --disable-unicode
</pre>
to the <b>configure</b> command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
-an EBCDIC environment (for example, an IBM mainframe operating system). The
---enable-ebcdic option is incompatible with Unicode support.
+an EBCDIC environment (for example, an IBM mainframe operating system).
</P>
<P>
+It is not possible to support both EBCDIC and UTF-8 codes in the same version
+of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
+exclusive.
+</P>
+<P>
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
such an environment you should use
@@ -347,8 +352,8 @@
<pre>
--with-pcre2grep-bufsize=50K
</pre>
-to the <b>configure</b> command. The caller of \fPpcre2grep\fP can, however,
-override this value by specifying a run-time option.
+to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override this
+value by using --buffer-size on the command line.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
@@ -362,16 +367,16 @@
from a terminal, it reads it using the <b>readline()</b> function. This provides
line-editing and history facilities. Note that <b>libreadline</b> is
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
-way, there may be licensing issues. These can be avoided by linking with
-<b>libedit</b> (which has a BSD licence) instead.
+way, there may be licensing issues. These can be avoided by linking instead
+with <b>libedit</b>, which has a BSD licence.
</P>
<P>
-Setting this option causes the <b>-lreadline</b> option to be added to the
-<b>pcre2test</b> build. In many operating environments with a sytem-installed
-readline library this is sufficient. However, in some environments (e.g. if an
-unmodified distribution version of readline is in use), some extra
-configuration may be necessary. The INSTALL file for <b>libreadline</b> says
-this:
+Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
+added to the <b>pcre2test</b> build. In many operating environments with a
+system-installed readline library this is sufficient. However, in some
+environments (e.g. if an unmodified distribution version of readline is in
+use), some extra configuration may be necessary. The INSTALL file for
+<b>libreadline</b> says this:
<pre>
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
@@ -386,13 +391,13 @@
</P>
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
-By adding the
+If you add
<pre>
--enable-valgrind
</pre>
-option to to the <b>configure</b> command, PCRE2 will use valgrind annotations
-to mark certain memory regions as unaddressable. This allows it to detect
-invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
+to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
+certain memory regions as unaddressable. This allows it to detect invalid
+memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
@@ -466,7 +471,7 @@
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 03 November 2014
+Last updated: 23 November 2014
<br>
Copyright © 1997-2014 University of Cambridge.
<br>
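Note on the NEWLINE RECOGNITION hunk in pcre2build.html above: the set of "any Unicode newline sequence" that it enumerates (CR, LF, CRLF, VT, FF, NEL, LS, PS) can be sketched as follows. This is only an illustration and is not PCRE2 code. It uses Python's re module, and the names UNICODE_NEWLINE and split_lines are hypothetical.

```python
import re

# Newline sequences named in the hunk: CRLF, CR and LF, plus the single
# characters VT (U+000B), FF (U+000C), NEL (U+0085), LS (U+2028), PS (U+2029).
# CRLF comes first so that it is consumed as one line ending, not two.
UNICODE_NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at every newline sequence listed above (illustrative helper)."""
    return UNICODE_NEWLINE.split(text)


lines = split_lines("one\r\ntwo\x0bthree\u2028four\x85five")
print(lines)
```

Dropping everything except "\r\n", "\r" and "\n" from the pattern would give the narrower set that the hunk describes for --enable-newline-is-anycrlf.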
Modified: code/trunk/doc/html/pcre2callout.html
===================================================================
--- code/trunk/doc/html/pcre2callout.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2callout.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -85,29 +85,27 @@
<P>
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
-if it were a++[bc]. The <b>pcre2test</b> output when this pattern is anchored
-and then applied with automatic callouts to the string "aaaa" is:
+if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
+with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
+"aaaa" is:
<pre>
--->aaaa
- +0 ^ ^
- +1 ^ a+
- +3 ^ ^ [bc]
+ +0 ^ a+
+ +2 ^ ^ [bc]
No match
</pre>
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
-You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
-to <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). If
-this is done in <b>pcre2test</b> (using the /no_auto_possess qualifier), the
-output changes to this:
+You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
+<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
+case, the output changes to this:
<pre>
--->aaaa
- +0 ^ ^
- +1 ^ a+
- +3 ^ ^ [bc]
- +3 ^ ^ [bc]
- +3 ^ ^ [bc]
- +3 ^^ [bc]
+ +0 ^ a+
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^^ [bc]
No match
</pre>
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@@ -137,10 +135,10 @@
</P>
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P>
-During matching, when PCRE2 reaches a callout point, the external function that
-is set in the match context is called (if it is set). This applies to both
-normal and DFA matching. The only argument to the callout function is a pointer
-to a <b>pcre2_callout</b> block. This structure contains the following fields:
+During matching, when PCRE2 reaches a callout point, if an external function is
+set in the match context, it is called. This applies to both normal and DFA
+matching. The only argument to the callout function is a pointer to a
+<b>pcre2_callout</b> block. This structure contains the following fields:
<pre>
uint32_t <i>version</i>;
uint32_t <i>callout_number</i>;
@@ -169,7 +167,7 @@
<P>
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data
-block. When <b>pcre2_match()</b> is used, the contents can be inspected, in
+block. When <b>pcre2_match()</b> is used, the contents can be inspected in
order to extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA matching
function, this field is not useful.
@@ -261,7 +259,7 @@
</P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 October 2014
+Last updated: 23 November 2014
<br>
Copyright © 1997-2014 University of Cambridge.
<br>
Modified: code/trunk/doc/html/pcre2grep.html
===================================================================
--- code/trunk/doc/html/pcre2grep.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2grep.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -467,8 +467,8 @@
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
-strings. The <b>pcre2_exec()</b> function that is called by <b>pcre2grep</b> to do
-the matching has two parameters that can limit the resources that it uses.
+strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
+do the matching has two parameters that can limit the resources that it uses.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting resource usage
@@ -750,7 +750,7 @@
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 28 September 2014
+Last updated: 23 November 2014
<br>
Copyright © 1997-2014 University of Cambridge.
<br>
Modified: code/trunk/doc/html/pcre2jit.html
===================================================================
--- code/trunk/doc/html/pcre2jit.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2jit.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -31,11 +31,11 @@
<P>
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
-match is performed. Therefore, it is of most benefit when the same pattern is
-going to be matched many times. This does not necessarily mean many calls of a
-matching function; if the pattern is not anchored, matching attempts may take
-place many times at various positions in the subject, even for a single call.
-Therefore, if the subject string is very long, it may still pay to use JIT for
+match is performed, so it is of most benefit when the same pattern is going to
+be matched many times. This does not necessarily mean many calls of a matching
+function; if the pattern is not anchored, matching attempts may take place many
+times at various positions in the subject, even for a single call. Therefore,
+if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
</P>
@@ -103,7 +103,7 @@
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
-returns zero. This is an alternative way of testing if JIT is available.
+returns zero. This is an alternative way of testing whether JIT is available.
</P>
<P>
At present, it is not possible to free JIT compiled code except when the entire
@@ -299,7 +299,7 @@
not\fP call <b>pcre2_match()</b> with a match context pointing to an already
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
used by <b>pcre2_match()</b> in another thread). You can also replace the stack
-in a context at any time when it is not in use. You can also free the previous
+in a context at any time when it is not in use. You should free the previous
stack before assigning a replacement.
</P>
<P>
@@ -418,7 +418,7 @@
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 12 November 2014
+Last updated: 23 November 2014
<br>
Copyright © 1997-2014 University of Cambridge.
<br>
Modified: code/trunk/doc/html/pcre2syntax.html
===================================================================
--- code/trunk/doc/html/pcre2syntax.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2syntax.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -421,7 +421,7 @@
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_exec(), not increase them.
+limits set by the caller of pcre2_match(), not increase them.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
@@ -553,7 +553,7 @@
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 14 November 2014
+Last updated: 23 November 2014
<br>
Copyright © 1997-2014 University of Cambridge.
<br>
Modified: code/trunk/doc/html/pcre2unicode.html
===================================================================
--- code/trunk/doc/html/pcre2unicode.html 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2unicode.html 2014-11-23 18:38:38 UTC (rev 158)
@@ -72,7 +72,7 @@
characters (see the description of \C in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation). The use of \C is not supported in the alternative matching
-function <b>pcre2_dfa_exec()</b>, nor is it supported in UTF mode by the JIT
+function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains
\C, it will not succeed, and so the matching will be carried out by the normal
interpretive function.
@@ -141,15 +141,15 @@
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
-If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
-assumes that the pattern or subject it is given (respectively) contains only
-valid UTF code unit sequences.
+If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+PCRE2 assumes that the pattern or subject it is given (respectively) contains
+only valid UTF code unit sequences.
</P>
<P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this option to <b>pcre2_exec()</b>
-or <b>pcre2_dfa_exec()</b>.
+the check for a subject string you must pass this option to <b>pcre2_match()</b>
+or <b>pcre2_dfa_match()</b>.
</P>
<P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
@@ -261,7 +261,7 @@
REVISION
</b><br>
<P>
-Last updated: 03 November 2014
+Last updated: 23 November 2014
<br>
Copyright © 1997-2014 University of Cambridge.
<br>
Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2.txt 2014-11-23 18:38:38 UTC (rev 158)
@@ -2667,12 +2667,12 @@
or UTF-16/UTF-32 strings. To build these additional libraries, add one
or both of the following to the configure command:
- --enable-pcre16
- --enable-pcre32
+ --enable-pcre2-16
+ --enable-pcre2-32
If you do not want the 8-bit library, add
- --disable-pcre8
+ --disable-pcre2-8
as well. At least one of the three libraries must be built. Note that
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
@@ -2683,16 +2683,16 @@
BUILDING SHARED AND STATIC LIBRARIES
The Autotools PCRE2 building process uses libtool to build both shared
- and static libraries by default. You can suppress one of these by
- adding one of
+ and static libraries by default. You can suppress an unwanted library
+ by adding one of
--disable-shared
--disable-static
- to the configure command, as required.
+ to the configure command.
-Unicode and UTF SUPPORT
+UNICODE AND UTF SUPPORT
By default, PCRE2 is built with support for Unicode and UTF character
strings. To build it without Unicode support, add
@@ -2704,20 +2704,18 @@
another without, in the same configuration.
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
- UTF-16 or UTF-32. To do that you have have to set the PCRE2_UTF option
- when you call pcre2_compile() to compile a pattern.
+ UTF-16 or UTF-32. To do that, applications that use the library have to
+ set the PCRE2_UTF option when they call pcre2_compile() to compile a
+ pattern.
- It is not possible to support both EBCDIC and UTF-8 codes in the same
- version of the library. Consequently, --enable-unicode and --enable-
- ebcdic are mutually exclusive.
+ UTF support allows the libraries to process character code points up to
+ 0x10ffff in the strings that they handle. It also provides support for
+ accessing the Unicode properties of such characters, using pattern
+ escapes such as \P, \p, and \X. Only the general category properties
+ such as Lu and Nd are supported. Details are given in the pcre2pattern
+ documentation.
- UTF support allows the libraries to process character codepoints up to
- 0x10ffff in the strings that they handle. It also provides support for
- accessing the properties of such characters, using pattern escapes such
- as \P, \p, and \X. Only the general category properties such as Lu and
- Nd are supported. Details are given in the pcre2pattern documentation.
-
JUST-IN-TIME COMPILER SUPPORT
Just-in-time compiler support is included in the build by specifying
@@ -2725,17 +2723,17 @@
--enable-jit
This support is available only for certain hardware architectures. If
- this option is set for an unsupported architecture, a compile time
- error occurs. See the pcre2jit documentation for a discussion of JIT
- usage. When JIT support is enabled, pcre2grep automatically makes use
- of it, unless you add
+ this option is set for an unsupported architecture, a building error
+ occurs. See the pcre2jit documentation for a discussion of JIT usage.
+ When JIT support is enabled, pcre2grep automatically makes use of it,
+ unless you add
--disable-pcre2grep-jit
to the "configure" command.
-CODE VALUE OF NEWLINE
+NEWLINE RECOGNITION
By default, PCRE2 interprets the linefeed (LF) character as indicating
the end of a line. This is the normal newline character on Unix-like
@@ -2744,11 +2742,12 @@
--enable-newline-is-cr
- to the configure command. There is also a --enable-newline-is-lf
+ to the configure command. There is also an --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character.
Alternatively, you can specify that line endings are to be indicated by
- the two character sequence CRLF. If you want this, add
+ the two-character sequence CRLF (CR immediately followed by LF). If you
+ want this, add
--enable-newline-is-crlf
@@ -2756,41 +2755,46 @@
--enable-newline-is-anycrlf
- which causes PCRE2 to recognize any of the three sequences CR, LF, or
+ which causes PCRE2 to recognize any of the three sequences CR, LF, or
CRLF as indicating a line ending. Finally, a fifth option, specified by
--enable-newline-is-any
- causes PCRE2 to recognize any Unicode newline sequence.
+ causes PCRE2 to recognize any Unicode newline sequence. The Unicode
+ newline sequences are the three just mentioned, plus the single charac-
+ ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
+ U+0085), LS (line separator, U+2028), and PS (paragraph separator,
+ U+2029).
- Whatever line ending convention is selected when PCRE2 is built can be
- overridden when the library functions are called. At build time it is
- conventional to use the standard for your operating system.
+ Whatever default line ending convention is selected when PCRE2 is built
+ can be overridden by applications that use the library. At build time
+ it is conventional to use the standard for your operating system.
WHAT \R MATCHES
- By default, the sequence \R in a pattern matches any Unicode newline
- sequence, whatever has been selected as the line ending sequence. If
- you specify
+ By default, the sequence \R in a pattern matches any Unicode newline
+ sequence, independently of what has been selected as the line ending
+ sequence. If you specify
--enable-bsr-anycrlf
- the default is changed so that \R matches only CR, LF, or CRLF. What-
- ever is selected when PCRE2 is built can be overridden when the library
- functions are called.
+ the default is changed so that \R matches only CR, LF, or CRLF. What-
+ ever is selected when PCRE2 is built can be overridden by applications
+ that use the called.
HANDLING VERY LARGE PATTERNS
- Within a compiled pattern, offset values are used to point from one
- part to another (for example, from an opening parenthesis to an alter-
- nation metacharacter). By default, in the 8-bit and 16-bit libraries,
- two-byte values are used for these offsets, leading to a maximum size
- for a compiled pattern of around 64K. This is sufficient to handle all
- but the most gigantic patterns. Nevertheless, some people do want to
- process truly enormous patterns, so it is possible to compile PCRE2 to
- use three-byte or four-byte offsets by adding a setting such as
+ Within a compiled pattern, offset values are used to point from one
+ part to another (for example, from an opening parenthesis to an alter-
+ nation metacharacter). By default, in the 8-bit and 16-bit libraries,
+ two-byte values are used for these offsets, leading to a maximum size
+ for a compiled pattern of around 64K code units. This is sufficient to
+ handle all but the most gigantic patterns. Nevertheless, some people do
+ want to process truly enormous patterns, so it is possible to compile
+ PCRE2 to use three-byte or four-byte offsets by adding a setting such
+ as
--with-link-size=3
@@ -2876,25 +2880,28 @@
USING EBCDIC CODE
PCRE2 assumes by default that it will run in an environment where the
- character code is ASCII (or Unicode, which is a superset of ASCII).
- This is the case for most computer operating systems. PCRE2 can, how-
- ever, be compiled to run in an EBCDIC environment by adding
+ character code is ASCII or Unicode, which is a superset of ASCII. This
+ is the case for most computer operating systems. PCRE2 can, however, be
+ compiled to run in an 8-bit EBCDIC environment by adding
--enable-ebcdic --disable-unicode
to the configure command. This setting implies --enable-rebuild-charta-
bles. You should only use it if you know that you are in an EBCDIC
- environment (for example, an IBM mainframe operating system). The
- --enable-ebcdic option is incompatible with Unicode support.
+ environment (for example, an IBM mainframe operating system).
+ It is not possible to support both EBCDIC and UTF-8 codes in the same
+ version of the library. Consequently, --enable-unicode and --enable-
+ ebcdic are mutually exclusive.
+
The EBCDIC character that corresponds to an ASCII LF is assumed to have
- the value 0x15 by default. However, in some EBCDIC environments, 0x25
+ the value 0x15 by default. However, in some EBCDIC environments, 0x25
is used. In such an environment you should use
--enable-ebcdic-nl25
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
- has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
+ has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
acter (which, in Unicode, is 0x85).
@@ -2905,32 +2912,32 @@
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
- By default, pcre2grep reads all files as plain text. You can build it
- so that it recognizes files whose names end in .gz or .bz2, and reads
+ By default, pcre2grep reads all files as plain text. You can build it
+ so that it recognizes files whose names end in .gz or .bz2, and reads
them with libz or libbz2, respectively, by adding one or both of
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
to the configure command. These options naturally require that the rel-
- evant libraries are installed on your system. Configuration will fail
+ evant libraries are installed on your system. Configuration will fail
if they are not.
PCRE2GREP BUFFER SIZE
- pcre2grep uses an internal buffer to hold a "window" on the file it is
+ pcre2grep uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when
- it finds a match. The size of the buffer is controlled by a parameter
+ it finds a match. The size of the buffer is controlled by a parameter
whose default value is 20K. The buffer itself is three times this size,
but because of the way it is used for holding "before" lines, the long-
- est line that is guaranteed to be processable is the parameter size.
+ est line that is guaranteed to be processable is the parameter size.
You can change the default parameter value by adding, for example,
--with-pcre2grep-bufsize=50K
- to the configure command. The caller of pcre2grep can, however, over-
- ride this value by specifying a run-time option.
+ to the configure command. The caller of pcre2grep can override this
+ value by using --buffer-size on the command line.
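The relationship between the parameter and the buffer can be sketched in C as follows. The names are hypothetical and are not taken from the pcre2grep source, and the sketch assumes that "K" means 1024 bytes.

```c
#include <stddef.h>

/* Assumption: "K" in the documentation means 1024 bytes. */
#define DEFAULT_BUFSIZE_PARAM ((size_t)20 * 1024)

/* The buffer itself is three times the size of the parameter. */
static size_t total_buffer_size(size_t bufsize_param)
{
    return 3 * bufsize_param;
}

/* Only lines no longer than the parameter size are guaranteed to be
   processable; longer lines are outside that guarantee. */
static int line_length_is_guaranteed(size_t line_length, size_t bufsize_param)
{
    return line_length <= bufsize_param;
}
```

Under that assumption, the default parameter of 20K gives a 60K buffer, and building with --with-pcre2grep-bufsize=50K would give a 150K buffer.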
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
@@ -2940,26 +2947,26 @@
--enable-pcre2test-libreadline
--enable-pcre2test-libedit
- to the configure command, pcre2test is linked with the libreadline
+ to the configure command, pcre2test is linked with the libreadline
or libedit library, respectively, and when its input is from a terminal,
- it reads it using the readline() function. This provides line-editing
- and history facilities. Note that libreadline is GPL-licensed, so if
- you distribute a binary of pcre2test linked in this way, there may be
- licensing issues. These can be avoided by linking with libedit (which
- has a BSD licence) instead.
+ it reads it using the readline() function. This provides line-editing
+ and history facilities. Note that libreadline is GPL-licensed, so if
+ you distribute a binary of pcre2test linked in this way, there may be
+ licensing issues. These can be avoided by linking instead with libedit,
+ which has a BSD licence.
- Setting this option causes the -lreadline option to be added to the
- pcre2test build. In many operating environments with a sytem-installed
- readline library this is sufficient. However, in some environments
- (e.g. if an unmodified distribution version of readline is in use),
- some extra configuration may be necessary. The INSTALL file for
- libreadline says this:
+ Setting --enable-pcre2test-libreadline causes the -lreadline option to
+ be added to the pcre2test build. In many operating environments with a
+ system-installed readline library this is sufficient. However, in some
+ environments (e.g. if an unmodified distribution version of readline is
+ in use), some extra configuration may be necessary. The INSTALL file
+ for libreadline says this:
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
which link with readline the to choose an appropriate library."
- If your environment has not been set up so that an appropriate library
+ If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-lncurses"
@@ -2969,19 +2976,19 @@
DEBUGGING WITH VALGRIND SUPPORT
- By adding the
+ If you add
--enable-valgrind
- option to to the configure command, PCRE2 will use valgrind annotations
- to mark certain memory regions as unaddressable. This allows it to
- detect invalid memory accesses, and is mostly useful for debugging
- PCRE2 itself.
+ to the configure command, PCRE2 will use valgrind annotations to mark
+ certain memory regions as unaddressable. This allows it to detect
+ invalid memory accesses, and is mostly useful for debugging PCRE2
+ itself.
CODE COVERAGE REPORTING
- If your C compiler is gcc, you can build a version of PCRE2 that can
+ If your C compiler is gcc, you can build a version of PCRE2 that can
generate a code coverage report for its test suite. To enable this, you
must install lcov version 1.6 or above. Then specify
@@ -2990,20 +2997,20 @@
to the configure command and build PCRE2 in the usual way.
Note that using ccache (a caching C compiler) is incompatible with code
- coverage reporting. If you have configured ccache to run automatically
+ coverage reporting. If you have configured ccache to run automatically
on your system, you must set the environment variable
CCACHE_DISABLE=1
before running make to build PCRE2, so that ccache is not used.
- When --enable-coverage is used, the following addition targets are
+ When --enable-coverage is used, the following additional targets are
added to the Makefile:
make coverage
- This creates a fresh coverage report for the PCRE2 test suite. It is
- equivalent to running "make coverage-reset", "make coverage-baseline",
+ This creates a fresh coverage report for the PCRE2 test suite. It is
+ equivalent to running "make coverage-reset", "make coverage-baseline",
"make check", and then "make coverage-report".
make coverage-reset
@@ -3020,18 +3027,18 @@
make coverage-clean-report
- This removes the generated coverage report without cleaning the cover-
+ This removes the generated coverage report without cleaning the cover-
age data itself.
make coverage-clean-data
- This removes the captured coverage data without removing the coverage
+ This removes the captured coverage data without removing the coverage
files created at compile time (*.gcno).
make coverage-clean
- This cleans all coverage data including the generated coverage report.
- For more information about code coverage, see the gcov and lcov docu-
+ This cleans all coverage data including the generated coverage report.
+ For more information about code coverage, see the gcov and lcov docu-
mentation.
@@ -3049,7 +3056,7 @@
REVISION
- Last updated: 03 November 2014
+ Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@@ -3122,62 +3129,59 @@
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcre2test output when this pattern
- is anchored and then applied with automatic callouts to the string
- "aaaa" is:
+ is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
+ to the string "aaaa" is:
--->aaaa
- +0 ^ ^
- +1 ^ a+
- +3 ^ ^ [bc]
+ +0 ^ a+
+ +2 ^ ^ [bc]
No match
This indicates that when matching [bc] fails, there is no backtracking
into a+ and therefore the callouts that would be taken for the back-
tracks do not occur. You can disable the auto-possessify feature by
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
- tern with (*NO_AUTO_POSSESS). If this is done in pcre2test (using the
- /no_auto_possess qualifier), the output changes to this:
+ tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
--->aaaa
- +0 ^ ^
- +1 ^ a+
- +3 ^ ^ [bc]
- +3 ^ ^ [bc]
- +3 ^ ^ [bc]
- +3 ^^ [bc]
+ +0 ^ a+
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^^ [bc]
No match
This time, when matching [bc] fails, the matcher backtracks into a+ and
tries again, repeatedly, until a+ itself fails.
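The difference between the two traces can be reproduced with a toy model. The C sketch below is not PCRE2 code and does not call the library: it hand-simulates the anchored pattern a+[bc] and counts one automatic callout before a+ and one before each attempt at [bc]. The function name and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of automatic callouts for the anchored pattern a+[bc].
   possessive == true models the auto-possessified form a++[bc]. */
static int count_callouts(const char *subject, bool possessive)
{
    int callouts = 1;                 /* callout before a+ (pattern offset 0) */
    size_t n = 0;

    while (subject[n] == 'a') n++;    /* greedy a+ takes every leading 'a' */
    if (n == 0) return callouts;      /* a+ failed, so [bc] is never reached */

    /* Try [bc] after the longest run first, then after shorter runs. */
    for (size_t end = n; end >= 1; end--) {
        callouts++;                   /* callout before [bc] (pattern offset 2) */
        if (subject[end] == 'b' || subject[end] == 'c') break;  /* matched */
        if (possessive) break;        /* no backtracking into a++ */
    }
    return callouts;
}
```

For the subject "aaaa" this model counts 2 callouts in the possessive case and 5 otherwise, which agrees with the number of lines in the two traces shown above.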
- Other optimizations that provide fast "no match" results also affect
+ Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is
ab(?C4)cd
- PCRE2 knows that any matching string must contain the letter "d". If
- the subject string is "abyz", the lack of "d" means that matching
- doesn't ever start, and the callout is never reached. However, with
+ PCRE2 knows that any matching string must contain the letter "d". If
+ the subject string is "abyz", the lack of "d" means that matching
+ doesn't ever start, and the callout is never reached. However, with
"abyd", though the result is still no match, the callout is obeyed.
- PCRE2 also knows the minimum length of a matching string, and will
- immediately give a "no match" return without actually running a match
- if the subject is not long enough, or, for unanchored patterns, if it
+ PCRE2 also knows the minimum length of a matching string, and will
+ immediately give a "no match" return without actually running a match
+ if the subject is not long enough, or, for unanchored patterns, if it
has been scanned far enough.
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
- MIZE option to pcre2_compile(), or by starting the pattern with
- (*NO_START_OPT). This slows down the matching process, but does ensure
+ MIZE option to pcre2_compile(), or by starting the pattern with
+ (*NO_START_OPT). This slows down the matching process, but does ensure
that callouts such as the example above are obeyed.
THE CALLOUT INTERFACE
- During matching, when PCRE2 reaches a callout point, the external func-
- tion that is set in the match context is called (if it is set). This
- applies to both normal and DFA matching. The only argument to the call-
- out function is a pointer to a pcre2_callout block. This structure con-
- tains the following fields:
+ During matching, when PCRE2 reaches a callout point, if an external
+ function is set in the match context, it is called. This applies to
+ both normal and DFA matching. The only argument to the callout function
+ is a pointer to a pcre2_callout block. This structure contains the fol-
+ lowing fields:
uint32_t version;
uint32_t callout_number;
@@ -3193,69 +3197,69 @@
PCRE2_SIZE pattern_position;
PCRE2_SIZE next_item_length;
- The version field contains the version number of the block format. The
+ The version field contains the version number of the block format. The
current version is 0. The version number will change in future if addi-
- tional fields are added, but the intention is never to remove any of
+ tional fields are added, but the intention is never to remove any of
the existing fields.
- The callout_number field contains the number of the callout, as com-
- piled into the pattern (that is, the number after ?C for manual call-
+ The callout_number field contains the number of the callout, as com-
+ piled into the pattern (that is, the number after ?C for manual call-
outs, and 255 for automatically generated callouts).
The offset_vector field is a pointer to the vector of capturing offsets
- (the "ovector") that was passed to the matching function in the match
- data block. When pcre2_match() is used, the contents can be inspected,
- in order to extract substrings that have been matched so far, in the
- same way as for extracting substrings after a match has completed. For
+ (the "ovector") that was passed to the matching function in the match
+ data block. When pcre2_match() is used, the contents can be inspected
+ in order to extract substrings that have been matched so far, in the
+ same way as for extracting substrings after a match has completed. For
the DFA matching function, this field is not useful.
The subject and subject_length fields contain copies of the values that
were passed to the matching function.
- The start_match field normally contains the offset within the subject
- at which the current match attempt started. However, if the escape
- sequence \K has been encountered, this value is changed to reflect the
- modified starting point. If the pattern is not anchored, the callout
+ The start_match field normally contains the offset within the subject
+ at which the current match attempt started. However, if the escape
+ sequence \K has been encountered, this value is changed to reflect the
+ modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
- The current_position field contains the offset within the subject of
+ The current_position field contains the offset within the subject of
the current match pointer.
When pcre2_match() is used, the capture_top field contains one more
- than the number of the highest numbered captured substring so far. If
+ than the number of the highest numbered captured substring so far. If
no substrings have been captured, the value of capture_top is one. This
is always the case when the DFA functions are used, because they do not
support captured substrings.
- The capture_last field contains the number of the most recently cap-
- tured substring. However, when a recursion exits, the value reverts to
- what it was outside the recursion, as do the values of all captured
- substrings. If no substrings have been captured, the value of cap-
+ The capture_last field contains the number of the most recently cap-
+ tured substring. However, when a recursion exits, the value reverts to
+ what it was outside the recursion, as do the values of all captured
+ substrings. If no substrings have been captured, the value of cap-
ture_last is 0. This is always the case for the DFA matching functions.
- The callout_data field contains a value that is passed to a matching
- function specifically so that it can be passed back in callouts. It is
- set in the match context when the callout is set up by calling
+ The callout_data field contains a value that is passed to a matching
+ function specifically so that it can be passed back in callouts. It is
+ set in the match context when the callout is set up by calling
pcre2_set_callout() (see the pcre2api documentation).
- The pattern_position field contains the offset to the next item to be
+ The pattern_position field contains the offset to the next item to be
matched in the pattern string.
- The next_item_length field contains the length of the next item to be
+ The next_item_length field contains the length of the next item to be
matched in the pattern string. When the callout immediately precedes an
- alternation bar, a closing parenthesis, or the end of the pattern, the
- length is zero. When the callout precedes an opening parenthesis, the
+ alternation bar, a closing parenthesis, or the end of the pattern, the
+ length is zero. When the callout precedes an opening parenthesis, the
length is that of the entire subpattern.
- The pattern_position and next_item_length fields are intended to help
- in distinguishing between different automatic callouts, which all have
+ The pattern_position and next_item_length fields are intended to help
+ in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts.
In callouts from pcre2_match() the mark field contains a pointer to the
- zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
- (*THEN) item in the match, or NULL if no such items have been passed.
- Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
+ zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
+ (*THEN) item in the match, or NULL if no such items have been passed.
+ Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
previous (*MARK). In callouts from the DFA matching function this field
always contains NULL.
@@ -3263,16 +3267,16 @@
RETURN VALUES
The external callout function returns an integer to PCRE2. If the value
- is zero, matching proceeds as normal. If the value is greater than
- zero, matching fails at the current point, but the testing of other
+ is zero, matching proceeds as normal. If the value is greater than
+ zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and the
matching function returns the negative value.
- Negative values should normally be chosen from the set of
- PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
- standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
- reserved for use by callout functions; it will never be used by PCRE2
+ Negative values should normally be chosen from the set of
+ PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
+ standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
+ reserved for use by callout functions; it will never be used by PCRE2
itself.
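The three cases for the callout return value can be summarised in a few lines of C. This is only a sketch of the rule stated above; the enum and the function are hypothetical names and are not part of the PCRE2 API.

```c
/* Hypothetical names, not part of PCRE2: how the matcher is documented to
   react to the integer returned by a callout function. */
typedef enum {
    CALLOUT_PROCEED,    /* zero: matching proceeds as normal */
    CALLOUT_FAIL_HERE,  /* positive: fail at this point, try other possibilities */
    CALLOUT_ABANDON     /* negative: abandon the match and return the value */
} callout_effect;

static callout_effect classify_callout_return(int rc)
{
    if (rc == 0) return CALLOUT_PROCEED;
    if (rc > 0)  return CALLOUT_FAIL_HERE;
    return CALLOUT_ABANDON;
}
```

A real callout function would normally pick its negative values from the PCRE2_ERROR_xxx set, as described above, rather than returning an arbitrary negative number.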
@@ -3285,7 +3289,7 @@
REVISION
- Last updated: 19 October 2014
+ Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@@ -3487,12 +3491,12 @@
Just-in-time compiling is a heavyweight optimization that can greatly
speed up pattern matching. However, it comes at the cost of extra pro-
- cessing before the match is performed. Therefore, it is of most benefit
- when the same pattern is going to be matched many times. This does not
- necessarily mean many calls of a matching function; if the pattern is
- not anchored, matching attempts may take place many times at various
- positions in the subject, even for a single call. Therefore, if the
- subject string is very long, it may still pay to use JIT for one-off
+ cessing before the match is performed, so it is of most benefit when
+ the same pattern is going to be matched many times. This does not nec-
+ essarily mean many calls of a matching function; if the pattern is not
+ anchored, matching attempts may take place many times at various posi-
+ tions in the subject, even for a single call. Therefore, if the subject
+ string is very long, it may still pay to use JIT even for one-off
matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
@@ -3558,8 +3562,8 @@
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
- diately returns zero. This is an alternative way of testing if JIT is
- available.
+ diately returns zero. This is an alternative way of testing whether JIT
+ is available.
At present, it is not possible to free JIT compiled code except when
the entire compiled pattern is freed by calling pcre2_free_code().
@@ -3745,7 +3749,7 @@
an already freed stack, as that will cause SEGFAULT. (Also, do not free
a stack currently used by pcre2_match() in another thread). You can
also replace the stack in a context at any time when it is not in use.
- You can also free the previous stack before assigning a replacement.
+ You should free the previous stack before assigning a replacement.
(5) Should I allocate/free a stack every time before/after calling
pcre2_match()?
@@ -3855,7 +3859,7 @@
REVISION
- Last updated: 12 November 2014
+ Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@@ -4642,7 +4646,7 @@
UTF mode, but its use can lead to some strange effects because it
breaks up multi-unit characters (see the description of \C in the
pcre2pattern documentation). The use of \C is not supported in the
- alternative matching function pcre2_dfa_exec(), nor is it supported in
+ alternative matching function pcre2_dfa_match(), nor is it supported in
UTF mode by the JIT optimization. If JIT optimization is requested for
a UTF pattern that contains \C, it will not succeed, and so the match-
ing will be carried out by the normal interpretive function.
@@ -4701,14 +4705,14 @@
In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor-
mance, for example in the case of a long subject string that is being
- scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK flag at compile
- time or at run time, PCRE2 assumes that the pattern or subject it is
- given (respectively) contains only valid UTF code unit sequences.
+ scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
+ pile time or at match time, PCRE2 assumes that the pattern or subject
+ it is given (respectively) contains only valid UTF code unit sequences.
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
for the pattern; it does not also apply to subject strings. If you want
to disable the check for a subject string you must pass this option to
- pcre2_exec() or pcre2_dfa_exec().
+ pcre2_match() or pcre2_dfa_match().
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is undefined and your program may crash or loop indefinitely.
@@ -4807,7 +4811,7 @@
REVISION
- Last updated: 03 November 2014
+ Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2api.3 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
+.TH PCRE2API 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@@ -2090,6 +2090,13 @@
different to the value of \fIovector[0]\fP if the pattern contains the \eK
escape sequence. After a partial match, however, this value is always the same
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
+.P
+The \fBstartchar\fP field is also used to return the offset of an invalid
+UTF character when UTF checking fails. Details are given in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+page.
.
.
.\" HTML <a name="errorlist"></a>
@@ -2707,6 +2714,6 @@
.rs
.sp
.nf
-Last updated: 21 November 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2build.3
===================================================================
--- code/trunk/doc/pcre2build.3 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2build.3 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "03 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@@ -74,12 +74,12 @@
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the \fBconfigure\fP command:
.sp
- --enable-pcre16
- --enable-pcre32
+ --enable-pcre2-16
+ --enable-pcre2-32
.sp
If you do not want the 8-bit library, add
.sp
- --disable-pcre8
+ --disable-pcre2-8
.sp
as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
@@ -91,15 +91,16 @@
.rs
.sp
The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
-and static libraries by default. You can suppress one of these by adding one of
+and static libraries by default. You can suppress an unwanted library by adding
+one of
.sp
--disable-shared
--disable-static
.sp
-to the \fBconfigure\fP command, as required.
+to the \fBconfigure\fP command.
.
.
-.SH "Unicode and UTF SUPPORT"
+.SH "UNICODE AND UTF SUPPORT"
.rs
.sp
By default, PCRE2 is built with support for Unicode and UTF character strings.
@@ -112,18 +113,14 @@
in the same configuration.
.P
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
-\fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library have to set the
+PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
.P
-It is not possible to support both EBCDIC and UTF-8 codes in the same version
-of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
-exclusive.
-.P
-UTF support allows the libraries to process character codepoints up to 0x10ffff
-in the strings that they handle. It also provides support for accessing the
-properties of such characters, using pattern escapes such as \eP, \ep, and \eX.
-Only the general category properties such as \fILu\fP and \fINd\fP are
-supported. Details are given in the
+UTF support allows the libraries to process character code points up to
+0x10ffff in the strings that they handle. It also provides support for
+accessing the Unicode properties of such characters, using pattern escapes such
+as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
+\fINd\fP are supported. Details are given in the
.\" HREF
\fBpcre2pattern\fP
.\"
@@ -138,7 +135,7 @@
--enable-jit
.sp
This support is available only for certain hardware architectures. If this
-option is set for an unsupported architecture, a compile time error occurs.
+option is set for an unsupported architecture, a build error occurs.
See the
.\" HREF
\fBpcre2jit\fP
@@ -151,7 +148,7 @@
to the "configure" command.
.
.
-.SH "CODE VALUE OF NEWLINE"
+.SH "NEWLINE RECOGNITION"
.rs
.sp
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
@@ -160,12 +157,13 @@
.sp
--enable-newline-is-cr
.sp
-to the \fBconfigure\fP command. There is also a --enable-newline-is-lf option,
+to the \fBconfigure\fP command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
+.P
+Alternatively, you can specify that line endings are to be indicated by the
+two-character sequence CRLF (CR immediately followed by LF). If you want this,
+add
.sp
-Alternatively, you can specify that line endings are to be indicated by the two
-character sequence CRLF. If you want this, add
-.sp
--enable-newline-is-crlf
.sp
to the \fBconfigure\fP command. There is a fourth option, specified by
@@ -177,10 +175,13 @@
.sp
--enable-newline-is-any
.sp
-causes PCRE2 to recognize any Unicode newline sequence.
+causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
+sequences are the three just mentioned, plus the single characters VT (vertical
+tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
+separator, U+2028), and PS (paragraph separator, U+2029).
.P
-Whatever line ending convention is selected when PCRE2 is built can be
-overridden when the library functions are called. At build time it is
+Whatever default line ending convention is selected when PCRE2 is built can be
+overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system.
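To make the list of Unicode newline sequences concrete, here is a small C sketch that recognises them in an array of code points. It is an illustration only: the function names are hypothetical, and this is not how PCRE2 itself implements newline detection.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Single code points that count as a newline when "any" is selected:
   LF, VT, FF, CR, NEL, LS and PS. */
static bool is_single_newline(uint32_t cp)
{
    switch (cp) {
    case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0085: case 0x2028: case 0x2029:
        return true;
    default:
        return false;
    }
}

/* Length in code points of the newline sequence starting at s[pos]:
   2 for CRLF, 1 for a single newline character, 0 if there is none. */
static size_t unicode_newline_length(const uint32_t *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    if (s[pos] == 0x000D && pos + 1 < len && s[pos + 1] == 0x000A) return 2;
    return is_single_newline(s[pos]) ? 1 : 0;
}
```

The CRLF check comes first so that CR immediately followed by LF is reported as one two-character sequence rather than as two separate newlines.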
.
.
@@ -188,12 +189,13 @@
.rs
.sp
By default, the sequence \eR in a pattern matches any Unicode newline sequence,
-whatever has been selected as the line ending sequence. If you specify
+independently of what has been selected as the line ending sequence. If you
+specify
.sp
--enable-bsr-anycrlf
.sp
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
-selected when PCRE2 is built can be overridden when the library functions are
+selected when PCRE2 is built can be overridden by applications that use the
library.
.
.
@@ -204,10 +206,10 @@
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
-around 64K. This is sufficient to handle all but the most gigantic patterns.
-Nevertheless, some people do want to process truly enormous patterns, so it is
-possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
-setting such as
+around 64K code units. This is sufficient to handle all but the most gigantic
+patterns. Nevertheless, some people do want to process truly enormous patterns,
+so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
+adding a setting such as
.sp
--with-link-size=3
.sp
@@ -299,17 +301,20 @@
.rs
.sp
PCRE2 assumes by default that it will run in an environment where the character
-code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
+code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
-EBCDIC environment by adding
+8-bit EBCDIC environment by adding
.sp
--enable-ebcdic --disable-unicode
.sp
to the \fBconfigure\fP command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
-an EBCDIC environment (for example, an IBM mainframe operating system). The
---enable-ebcdic option is incompatible with Unicode support.
+an EBCDIC environment (for example, an IBM mainframe operating system).
.P
+It is not possible to support both EBCDIC and UTF-8 codes in the same version
+of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
+exclusive.
+.P
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
such an environment you should use
@@ -354,8 +359,8 @@
.sp
--with-pcre2grep-bufsize=50K
.sp
-to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can, however,
-override this value by specifying a run-time option.
+to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override this
+value by using --buffer-size on the command line.
.
.
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
@@ -371,15 +376,15 @@
from a terminal, it reads it using the \fBreadline()\fP function. This provides
line-editing and history facilities. Note that \fBlibreadline\fP is
GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
-way, there may be licensing issues. These can be avoided by linking with
-\fBlibedit\fP (which has a BSD licence) instead.
+way, there may be licensing issues. These can be avoided by linking instead
+with \fBlibedit\fP, which has a BSD licence.
.P
-Setting this option causes the \fB-lreadline\fP option to be added to the
-\fBpcre2test\fP build. In many operating environments with a sytem-installed
-readline library this is sufficient. However, in some environments (e.g. if an
-unmodified distribution version of readline is in use), some extra
-configuration may be necessary. The INSTALL file for \fBlibreadline\fP says
-this:
+Setting --enable-pcre2test-libreadline causes the \fB-lreadline\fP option to be
+added to the \fBpcre2test\fP build. In many operating environments with a
+system-installed readline library this is sufficient. However, in some
+environments (e.g. if an unmodified distribution version of readline is in
+use), some extra configuration may be necessary. The INSTALL file for
+\fBlibreadline\fP says this:
.sp
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
@@ -396,13 +401,13 @@
.SH "DEBUGGING WITH VALGRIND SUPPORT"
.rs
.sp
-By adding the
+If you add
.sp
--enable-valgrind
.sp
-option to to the \fBconfigure\fP command, PCRE2 will use valgrind annotations
-to mark certain memory regions as unaddressable. This allows it to detect
-invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
+to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to mark
+certain memory regions as unaddressable. This allows it to detect invalid
+memory accesses, and is mostly useful for debugging PCRE2 itself.
.
.
.SH "CODE COVERAGE REPORTING"
@@ -482,6 +487,6 @@
.rs
.sp
.nf
-Last updated: 03 November 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2callout.3
===================================================================
--- code/trunk/doc/pcre2callout.3 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2callout.3 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2CALLOUT 3 "19 October 2014" "PCRE2 10.00"
+.TH PCRE2CALLOUT 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -68,29 +68,27 @@
.P
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
-if it were a++[bc]. The \fBpcre2test\fP output when this pattern is anchored
-and then applied with automatic callouts to the string "aaaa" is:
+if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
+with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
+"aaaa" is:
.sp
--->aaaa
- +0 ^ ^
- +1 ^ a+
- +3 ^ ^ [bc]
+ +0 ^ a+
+ +2 ^ ^ [bc]
No match
.sp
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
-You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
-to \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). If
-this is done in \fBpcre2test\fP (using the /no_auto_possess qualifier), the
-output changes to this:
+You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
+\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
+case, the output changes to this:
.sp
--->aaaa
- +0 ^ ^
- +1 ^ a+
- +3 ^ ^ [bc]
- +3 ^ ^ [bc]
- +3 ^ ^ [bc]
- +3 ^^ [bc]
+ +0 ^ a+
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^^ [bc]
No match
.sp
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@@ -119,10 +117,10 @@
.SH "THE CALLOUT INTERFACE"
.rs
.sp
-During matching, when PCRE2 reaches a callout point, the external function that
-is set in the match context is called (if it is set). This applies to both
-normal and DFA matching. The only argument to the callout function is a pointer
-to a \fBpcre2_callout\fP block. This structure contains the following fields:
+During matching, when PCRE2 reaches a callout point, if an external function is
+set in the match context, it is called. This applies to both normal and DFA
+matching. The only argument to the callout function is a pointer to a
+\fBpcre2_callout\fP block. This structure contains the following fields:
.sp
uint32_t \fIversion\fP;
uint32_t \fIcallout_number\fP;
@@ -149,7 +147,7 @@
.P
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data
-block. When \fBpcre2_match()\fP is used, the contents can be inspected, in
+block. When \fBpcre2_match()\fP is used, the contents can be inspected in
order to extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA matching
function, this field is not useful.
@@ -238,6 +236,6 @@
.rs
.sp
.nf
-Last updated: 19 October 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
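
The two callout traces in the pcre2callout.3 hunk above differ only in how often [bc] is tried. The following is a minimal sketch of that effect, assuming a hand-written matcher for the anchored pattern a+[bc] rather than PCRE2 itself. The function name match_a_plus_bc and its parameters are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: applies the anchored pattern
   a+[bc] to a zero-terminated subject and counts how many times the [bc]
   item is tried. If possessive is non-zero, a+ never gives characters
   back, which is how an auto-possessified a++ behaves.
   Returns 1 on a match and 0 otherwise. */
static int match_a_plus_bc(const char *subject, int possessive, int *tries)
{
  size_t n = 0;
  *tries = 0;

  /* Greedy a+: take as many 'a' characters as possible. */
  while (subject[n] == 'a') n++;
  if (n == 0) return 0;

  for (;;)
  {
    char c = subject[n];
    (*tries)++;                          /* one attempt at [bc] */
    if (c == 'b' || c == 'c') return 1;

    /* No backtracking is allowed for a possessive repeat, and a+ must
       keep at least one character. */
    if (possessive || n == 1) return 0;
    n--;                                 /* give back one 'a' and retry */
  }
}
```

For the subject "aaaa" this sketch tries [bc] once when possessive is set and four times when it is not, which corresponds to the number of [bc] lines in the two traces.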
Modified: code/trunk/doc/pcre2grep.1
===================================================================
--- code/trunk/doc/pcre2grep.1 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2grep.1 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2GREP 1 "28 September 2014" "PCRE2 10.00"
+.TH PCRE2GREP 1 "23 November 2014" "PCRE2 10.00"
.SH NAME
pcre2grep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -403,8 +403,8 @@
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
-strings. The \fBpcre2_exec()\fP function that is called by \fBpcre2grep\fP to do
-the matching has two parameters that can limit the resources that it uses.
+strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
+do the matching has two parameters that can limit the resources that it uses.
.sp
The \fB--match-limit\fP option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very
@@ -678,6 +678,6 @@
.rs
.sp
.nf
-Last updated: 28 September 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2grep.txt
===================================================================
--- code/trunk/doc/pcre2grep.txt 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2grep.txt 2014-11-23 18:38:38 UTC (rev 158)
@@ -446,7 +446,7 @@
very large amount of memory, leading in some cases to a pro-
gram crash if not enough is available. Other patterns may
take a very long time to search for all possible matching
- strings. The pcre2_exec() function that is called by
+ strings. The pcre2_match() function that is called by
pcre2grep to do the matching has two parameters that can
limit the resources that it uses.
@@ -737,5 +737,5 @@
REVISION
- Last updated: 28 September 2014
+ Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Modified: code/trunk/doc/pcre2jit.3
===================================================================
--- code/trunk/doc/pcre2jit.3 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2jit.3 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "12 November 2014" "PCRE2 10.00"
+.TH PCRE2JIT 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@@ -6,11 +6,11 @@
.sp
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
-match is performed. Therefore, it is of most benefit when the same pattern is
-going to be matched many times. This does not necessarily mean many calls of a
-matching function; if the pattern is not anchored, matching attempts may take
-place many times at various positions in the subject, even for a single call.
-Therefore, if the subject string is very long, it may still pay to use JIT for
+match is performed, so it is of most benefit when the same pattern is going to
+be matched many times. This does not necessarily mean many calls of a matching
+function; if the pattern is not anchored, matching attempts may take place many
+times at various positions in the subject, even for a single call. Therefore,
+if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
.P
@@ -77,7 +77,7 @@
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
-returns zero. This is an alternative way of testing if JIT is available.
+returns zero. This is an alternative way of testing whether JIT is available.
.P
At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling \fBpcre2_free_code()\fP.
@@ -276,7 +276,7 @@
not\fP call \fBpcre2_match()\fP with a match context pointing to an already
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
used by \fBpcre2_match()\fP in another thread). You can also replace the stack
-in a context at any time when it is not in use. You can also free the previous
+in a context at any time when it is not in use. You should free the previous
stack before assigning a replacement.
.P
(5) Should I allocate/free a stack every time before/after calling
@@ -398,6 +398,6 @@
.rs
.sp
.nf
-Last updated: 12 November 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2syntax.3 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
+.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -394,7 +394,7 @@
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_exec(), not increase them.
+limits set by the caller of pcre2_match(), not increase them.
.
.
.SH "NEWLINE CONVENTION"
@@ -536,6 +536,6 @@
.rs
.sp
.nf
-Last updated: 14 November 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2test.1
===================================================================
--- code/trunk/doc/pcre2test.1 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2test.1 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
+.TH PCRE2TEST 1 "23 November 2014" "PCRE2 10.00"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -200,7 +200,7 @@
number of subject lines to be matched against that pattern. In between sets of
test data, command lines that begin with a hash (#) character may appear. This
file format, with some restrictions, can also be processed by the
-\fBperltest.pl\fP script that is distributed with PCRE2 as a means of checking
+\fBperltest.sh\fP script that is distributed with PCRE2 as a means of checking
that the behaviour of PCRE2 and Perl is the same.
.P
Each subject line is matched separately and independently. If you want to do
@@ -243,11 +243,11 @@
#perltest
.sp
The appearance of this line causes all subsequent modifier settings to be
-checked for compatibility with the \fBperltest.pl\fP script, which is used to
+checked for compatibility with the \fBperltest.sh\fP script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
-test files that are also processed by \fBperltest.pl\fP. The \fP#perltest\fB
+test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
command helps detect tests that are accidentally put in the wrong file.
.sp
#subject <modifier-list>
@@ -265,7 +265,7 @@
other only. Each modifier has a long name, for example "anchored", and some of
them must be followed by an equals sign and a value, for example, "offset=12".
Modifiers that do not take values may be preceded by a minus sign to turn off a
-previous default setting.
+previous setting.
.P
A few of the more common modifiers can also be specified as single letters, for
example "i" for "caseless". In documentation, following the Perl convention,
@@ -336,7 +336,7 @@
\exhh hexadecimal byte (up to 2 hex digits)
\ex{hh...} hexadecimal character (any number of hex digits)
.sp
-The use of \ex{hh...} is not dependent on the use of the utf modifier on
+The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
the pattern. It is recognized always. There may be any number of hexadecimal
digits inside the braces; invalid values provoke error messages.
.P
@@ -366,7 +366,7 @@
is converted to "abcabcabcabc". This feature does not support nesting. To
include a closing square bracket in the characters, code it as \ex5D.
.P
-A backslash followed by an equals sign marke the end of the subject string and
+A backslash followed by an equals sign marks the end of the subject string and
the start of a modifier list. For example:
.sp
abc\e=notbol,notempty
@@ -461,8 +461,8 @@
is built, with the default default being Unicode.
.P
The \fBnewline\fP modifier specifies which characters are to be interpreted as
-newlines, both in the pattern and (by default) in subject lines. The type must
-be one of CR, LF, CRLF, ANYCRLF, or ANY.
+newlines, both in the pattern and in subject lines. The type must be one of CR,
+LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
.
.
.SS "Information about a pattern"
@@ -478,8 +478,8 @@
regression tests can be used in different environments.
.P
The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
-offset values. This is used in a few special tests and is also useful for
-one-off tests.
+offset values. This is used in a few special tests that run only for specific
+code unit widths and link sizes, and is also useful for one-off tests.
.P
The \fBinfo\fP modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The
@@ -501,13 +501,14 @@
Last code unit = 'c' (caseless)
Subject length lower bound = 3
.sp
-"Compile options" are those specified to the compile function; "overall
-options" have added options that are taken or deduced from the pattern. If both
-sets of options are the same, just a single "options" line is output. "First
-code unit" is where any match must start; if there is more than one they are
-listed as "starting code units". "Last code unit" is the last literal code unit
-that must be present in any match. This is not necessarily the last character.
-These lines are omitted if no starting or ending code units are recorded.
+"Compile options" are those specified by modifiers; "overall options" have
+added options that are taken or deduced from the pattern. If both sets of
+options are the same, just a single "options" line is output; if there are no
+options, the line is omitted. "First code unit" is where any match must start;
+if there is more than one they are listed as "starting code units". "Last code
+unit" is the last literal code unit that must be present in any match. This is
+not necessarily the last character. These lines are omitted if no starting or
+ending code units are recorded.
.
.
.SS "Specifying a pattern in hex"
@@ -520,16 +521,16 @@
/ab 32 59/hex
.sp
This feature is provided as a way of creating patterns that contain binary zero
-characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
-strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
-However, for patterns specified in hexadecimal, the actual length of the
-pattern is passed.
+and other non-printing characters. By default, \fBpcre2test\fP passes patterns
+as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
+PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
+actual length of the pattern is passed.
.
.
.SS "JIT compilation"
.rs
.sp
-The \fB/jit\fP modifier may optionally be followed by and equals sign and a
+The \fB/jit\fP modifier may optionally be followed by an equals sign and a
number in the range 0 to 7:
.sp
0 disable JIT
@@ -561,7 +562,7 @@
\fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
added to the first output line after a match or non match when JIT-compiled
-code was actually used.
+code was actually used in the match.
.
.
.SS "Setting a locale"
@@ -645,8 +646,8 @@
.SS "Using alternative character tables"
.rs
.sp
-The \fB/tables\fP modifier must be followed by a single digit. It causes a
-specific set of built-in character tables to be passed to
+The value specified for the \fB/tables\fP modifier must be one of the digits 0,
+1, or 2. It causes a specific set of built-in character tables to be passed to
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
.sp
@@ -759,13 +760,13 @@
.SS "Showing more text"
.rs
.sp
-The \fBaftertext\fP modifier requests that as well as outputting the substring
-that matched the entire pattern, \fBpcre2test\fP should in addition output the
-remainder of the subject string. This is useful for tests where the subject
-contains multiple copies of the same substring. The \fBallaftertext\fP modifier
-requests the same action for captured substrings as well as the main matched
-substring. In each case the remainder is output on the following line with a
-plus character following the capture number.
+The \fBaftertext\fP modifier requests that as well as outputting the part of
+the subject string that matched the entire pattern, \fBpcre2test\fP should in
+addition output the remainder of the subject string. This is useful for tests
+where the subject contains multiple copies of the same substring. The
+\fBallaftertext\fP modifier requests the same action for captured substrings as
+well as the main matched substring. In each case the remainder is output on the
+following line with a plus character following the capture number.
.P
The \fBallusedtext\fP modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. This
@@ -782,7 +783,8 @@
<<< >>>
.sp
This shows that the matched string is "abc", with the preceding and following
-strings "pqr" and "xyz" also consulted during the match.
+strings "pqr" and "xyz" having been consulted during the match (when processing
+the assertions).
.P
The \fBstartchar\fP modifier requests that the starting character for the match
be indicated, if it is different to the start of the matched string. The only
@@ -836,7 +838,7 @@
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
to start searching at a new point within the entire string (which is what Perl
-does), whereas the latter passes over a shortened substring. This makes a
+does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a lookbehind
assertion (including \eb or \eB).
.P
@@ -847,7 +849,7 @@
imitates the way Perl handles such cases when using the \fB/g\fP modifier or
the \fBsplit()\fP function. Normally, the start offset is advanced by one
character, but if the newline convention recognizes CRLF as a newline, and the
-current character is CR followed by LF, an advance of two is used.
+current character is CR followed by LF, an advance of two characters occurs.
.
.
.SS "Testing substring extraction functions"
@@ -860,9 +862,9 @@
.sp
abcd\e=copy=1,copy=3,get=G1
.sp
-If the \fB#subject\fP command is used to set default copy and get lists, these
-can be unset by specifying a negative number for numbered groups and an empty
-name for named groups.
+If the \fB#subject\fP command is used to set default copy and/or get lists,
+these can be unset by specifying a negative number to cancel all numbered
+groups and an empty name to cancel all named groups.
.P
The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
extracts all captured substrings.
@@ -871,7 +873,8 @@
convenience functions are output with C, G, or L after the string number
instead of a colon. This is in addition to the normal full list. The string
length (that is, the return from the extraction function) is given in
-parentheses after each substring.
+parentheses after each substring, followed by the name when the extraction was
+by name.
.
.
.SS "Testing the substitution function"
@@ -1044,11 +1047,10 @@
characters before the actual match start if a lookbehind assertion, \eK, \eb,
or \eB was involved.)
.P
-For any other return, \fBpcre2test\fP outputs the PCRE2
-negative error number and a short descriptive phrase. If the error is a failed
-UTF string check, the offset of the start of the failing character and the
-reason code are also output. Here is an example of an interactive
-\fBpcre2test\fP run.
+For any other return, \fBpcre2test\fP outputs the PCRE2 negative error number
+and a short descriptive phrase. If the error is a failed UTF string check, the
+code unit offset of the start of the failing character is also output. Here is
+an example of an interactive \fBpcre2test\fP run.
.sp
$ pcre2test
PCRE2 version 9.00 2014-05-10
@@ -1061,10 +1063,10 @@
No match
.sp
Unset capturing substrings that are not followed by one that is set are not
-returned by \fBpcre2_match()\fP, and are not shown by \fBpcre2test\fP. In the
-following example, there are two capturing substrings, but when the first data
-line is matched, the second, unset substring is not shown. An "internal" unset
-substring is shown as "<unset>", as for the second data line.
+shown by \fBpcre2test\fP unless the \fBallcaptures\fP modifier is specified. In
+the following example, there are two capturing substrings, but when the first
+data line is matched, the second, unset substring is not shown. An "internal"
+unset substring is shown as "<unset>", as for the second data line.
.sp
re> /(a)|(b)/
data> a
@@ -1100,8 +1102,8 @@
1: pp
.sp
"No match" is output only if the first match attempt fails. Here is an example
-of a failure message (the offset 4 that is specified by \e>4 is past the end of
-the subject string):
+of a failure message (the offset 4 that is specified by the \fBoffset\fP
+modifier is past the end of the subject string):
.sp
re> /xyz/
data> xyz\e=offset=4
@@ -1127,12 +1129,13 @@
1: tang
2: tan
.sp
-(Using the normal matching function on this data finds only "tang".) The
+Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). After a
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
-partially matching substring. (Note that this is the entire substring that was
+partially matching substring. Note that this is the entire substring that was
inspected during the partial match; it may include characters before the actual
-match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
+match start if a lookbehind assertion, \eb, or \eB was involved. (\eK is not
+supported for DFA matching.)
.P
If global matching is requested, the search for further matches resumes
at the end of the longest match. For example:
@@ -1174,9 +1177,9 @@
.SH CALLOUTS
.rs
.sp
-If the pattern contains any callout requests, \fBpcre2test\fP's callout function
-is called during matching. This works with both matching functions. By default,
-the called function displays the callout number, the start and current
+If the pattern contains any callout requests, \fBpcre2test\fP's callout
+function is called during matching. This works with both matching functions. By
+default, the called function displays the callout number, the start and current
positions in the text at the callout time, and the next pattern item to be
tested. For example:
.sp
@@ -1271,6 +1274,6 @@
.rs
.sp
.nf
-Last updated: 14 November 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2unicode.3
===================================================================
--- code/trunk/doc/pcre2unicode.3 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2unicode.3 2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "03 November 2014" "PCRE2 10.00"
+.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@@ -64,7 +64,7 @@
\fBpcre2pattern\fP
.\"
documentation). The use of \eC is not supported in the alternative matching
-function \fBpcre2_dfa_exec()\fP, nor is it supported in UTF mode by the JIT
+function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains
\eC, it will not succeed, and so the matching will be carried out by the normal
interpretive function.
@@ -108,7 +108,10 @@
.sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions.
-If an invalid UTF string is passed, an error return is given.
+If an invalid UTF string is passed, a negative error code is returned. The
+code unit offset to the offending character can be extracted from the match
+data block by calling \fBpcre2_get_startchar()\fP, which is used for this
+purpose after a UTF error.
.P
UTF-16 and UTF-32 strings can indicate their endianness by special code known
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
@@ -130,14 +133,14 @@
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
-If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
-assumes that the pattern or subject it is given (respectively) contains only
-valid UTF code unit sequences.
+If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+PCRE2 assumes that the pattern or subject it is given (respectively) contains
+only valid UTF code unit sequences.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this option to \fBpcre2_exec()\fP
-or \fBpcre2_dfa_exec()\fP.
+the check for a subject string you must pass this option to \fBpcre2_match()\fP
+or \fBpcre2_dfa_match()\fP.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
@@ -249,6 +252,6 @@
.rs
.sp
.nf
-Last updated: 03 November 2014
+Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
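
The pcre2unicode.3 hunk above says that a failed UTF check yields a negative error code together with the code unit offset of the offending character. The following is a minimal sketch of that reporting style for UTF-8, assuming a simplified checker that is not PCRE2's own validity function. The name check_utf8 and the error codes are hypothetical, and the sketch does not detect overlong sequences, surrogates, or 5- and 6-byte forms as distinct cases.

```c
#include <stddef.h>

/* Hypothetical error codes for this sketch only; they are not PCRE2's. */
enum
{
  UTF8_OK = 0,
  UTF8_TRUNCATED = -1,          /* bytes missing at end of string */
  UTF8_BAD_CONTINUATION = -2,   /* a following byte lacks the 10xxxxxx form */
  UTF8_BAD_LEAD = -3            /* byte cannot start a 1- to 4-byte character */
};

/* Scans s[0..len-1]. On failure, stores the offset of the first byte of the
   failing character in *erroroffset and returns a negative code. */
static int check_utf8(const unsigned char *s, size_t len, size_t *erroroffset)
{
  size_t i = 0;

  while (i < len)
  {
    unsigned char c = s[i];
    size_t extra;               /* number of continuation bytes expected */

    if (c < 0x80) { i++; continue; }
    else if ((c & 0xE0) == 0xC0) extra = 1;
    else if ((c & 0xF0) == 0xE0) extra = 2;
    else if ((c & 0xF8) == 0xF0) extra = 3;
    else { *erroroffset = i; return UTF8_BAD_LEAD; }

    if (len - i - 1 < extra) { *erroroffset = i; return UTF8_TRUNCATED; }

    for (size_t k = 1; k <= extra; k++)
    {
      if ((s[i + k] & 0xC0) != 0x80)
      {
        *erroroffset = i;       /* report the start of the character */
        return UTF8_BAD_CONTINUATION;
      }
    }
    i += extra + 1;
  }
  return UTF8_OK;
}
```

Reporting the start of the failing character, rather than the byte at which the scan stopped, mirrors the "at offset N" text added to the testoutput10 expectations, where the leading X characters shift the reported offset.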
Modified: code/trunk/src/pcre2_dfa_match.c
===================================================================
--- code/trunk/src/pcre2_dfa_match.c 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/src/pcre2_dfa_match.c 2014-11-23 18:38:38 UTC (rev 158)
@@ -3224,12 +3224,8 @@
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
- match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
- if (match_data->rc != 0)
- {
- match_data->leftchar = 0;
- return match_data->rc;
- }
+ match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
+ if (match_data->rc != 0) return match_data->rc;
#if PCRE2_CODE_UNIT_WIDTH != 32
if (start_offset > 0 && start_offset < length &&
NOT_FIRSTCHAR(subject[start_offset]))
Modified: code/trunk/src/pcre2_match.c
===================================================================
--- code/trunk/src/pcre2_match.c 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/src/pcre2_match.c 2014-11-23 18:38:38 UTC (rev 158)
@@ -6459,12 +6459,8 @@
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
- match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
- if (match_data->rc != 0)
- {
- match_data->leftchar = 0;
- return match_data->rc;
- }
+ match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
+ if (match_data->rc != 0) return match_data->rc;
#if PCRE2_CODE_UNIT_WIDTH != 32
if (start_offset > 0 && start_offset < length &&
NOT_FIRSTCHAR(subject[start_offset]))
Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/src/pcre2test.c 2014-11-23 18:38:38 UTC (rev 158)
@@ -5570,6 +5570,13 @@
fprintf(outfile, "Failed: error %d: ", capcount);
PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
+ if (capcount <= PCRE2_ERROR_UTF8_ERR1 &&
+ capcount >= PCRE2_ERROR_UTF32_ERR2)
+ {
+ PCRE2_SIZE startchar;
+ PCRE2_GET_STARTCHAR(startchar, match_data);
+ fprintf(outfile, " at offset %lu", (unsigned long)startchar);
+ }
fprintf(outfile, "\n");
break;
}
Modified: code/trunk/testdata/testinput10
===================================================================
--- code/trunk/testdata/testinput10 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testinput10 2014-11-23 18:38:38 UTC (rev 158)
@@ -48,12 +48,12 @@
/\xC3\xC3\xC3xxx/utf
/badutf/utf
- \xdf
- \xef
- \xef\x80
- \xf7
- \xf7\x80
- \xf7\x80\x80
+ X\xdf
+ XX\xef
+ XXX\xef\x80
+ X\xf7
+ XX\xf7\x80
+ XXX\xf7\x80\x80
\xfb
\xfb\x80
\xfb\x80\x80
@@ -89,14 +89,14 @@
\xff
/badutf/utf
- \xfb\x80\x80\x80\x80
- \xfd\x80\x80\x80\x80\x80
- \xf7\xbf\xbf\xbf
+ XX\xfb\x80\x80\x80\x80
+ XX\xfd\x80\x80\x80\x80\x80
+ XX\xf7\xbf\xbf\xbf
/shortutf/utf
- \xdf\=ph
- \xef\=ph
- \xef\x80\=ph
+ XX\xdf\=ph
+ XX\xef\=ph
+ XX\xef\x80\=ph
\xf7\=ph
\xf7\x80\=ph
\xf7\x80\x80\=ph
@@ -111,9 +111,9 @@
\xfd\x80\x80\x80\x80\=ph
/anything/utf
- \xc0\x80
- \xc1\x8f
- \xe0\x9f\x80
+ X\xc0\x80
+ XX\xc1\x8f
+ XXX\xe0\x9f\x80
\xf0\x8f\x80\x80
\xf8\x87\x80\x80\x80
\xfc\x83\x80\x80\x80\x80
Modified: code/trunk/testdata/testinput12
===================================================================
--- code/trunk/testdata/testinput12 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testinput12 2014-11-23 18:38:38 UTC (rev 158)
@@ -157,18 +157,18 @@
/^[\QĀ\E-\QŐ\E/B,utf
/X/utf
- \x{d800}
- \x{d800}\=no_utf_check
- \x{da00}
- \x{da00}\=no_utf_check
- \x{dc00}
- \x{dc00}\=no_utf_check
- \x{de00}
- \x{de00}\=no_utf_check
- \x{dfff}
- \x{dfff}\=no_utf_check
- \x{110000}
- \x{d800}\x{1234}
+ XX\x{d800}
+ XX\x{d800}\=no_utf_check
+ XX\x{da00}
+ XX\x{da00}\=no_utf_check
+ XX\x{dc00}
+ XX\x{dc00}\=no_utf_check
+ XX\x{de00}
+ XX\x{de00}\=no_utf_check
+ XX\x{dfff}
+ XX\x{dfff}\=no_utf_check
+ XX\x{110000}
+ XX\x{d800}\x{1234}
/(*UTF16)\x{11234}/
abcd\x{11234}pqr
Modified: code/trunk/testdata/testoutput10
===================================================================
--- code/trunk/testdata/testoutput10 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testoutput10 2014-11-23 18:38:38 UTC (rev 158)
@@ -73,142 +73,142 @@
Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
/badutf/utf
- \xdf
-Failed: error -3: UTF-8 error: 1 byte missing at end
- \xef
-Failed: error -4: UTF-8 error: 2 bytes missing at end
- \xef\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
- \xf7
-Failed: error -5: UTF-8 error: 3 bytes missing at end
- \xf7\x80
-Failed: error -4: UTF-8 error: 2 bytes missing at end
- \xf7\x80\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
+ X\xdf
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 1
+ XX\xef
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
+ XXX\xef\x80
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
+ X\xf7
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 1
+ XX\xf7\x80
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
+ XXX\xf7\x80\x80
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
\xfb
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfb\x80
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfb\x80\x80
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfb\x80\x80\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfd
-Failed: error -7: UTF-8 error: 5 bytes missing at end
+Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
\xfd\x80
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfd\x80\x80
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfd\x80\x80\x80
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfd\x80\x80\x80\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xdf\x7f
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xef\x7f\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xef\x80\x7f
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xf7\x7f\x80\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xf7\x80\x7f\x80
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xf7\x80\x80\x7f
-Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
+Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfb\x7f\x80\x80\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xfb\x80\x7f\x80\x80
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xfb\x80\x80\x7f\x80
-Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
+Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfb\x80\x80\x80\x7f
-Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
+Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
\xfd\x7f\x80\x80\x80\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xfd\x80\x7f\x80\x80\x80
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xfd\x80\x80\x7f\x80\x80
-Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
+Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfd\x80\x80\x80\x7f\x80
-Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
+Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
\xfd\x80\x80\x80\x80\x7f
-Failed: error -12: UTF-8 error: byte 6 top bits not 0x80
+Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 at offset 0
\xed\xa0\x80
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\xc0\x8f
-Failed: error -17: UTF-8 error: overlong 2-byte sequence
+Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 0
\xe0\x80\x8f
-Failed: error -18: UTF-8 error: overlong 3-byte sequence
+Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 0
\xf0\x80\x80\x8f
-Failed: error -19: UTF-8 error: overlong 4-byte sequence
+Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
\xf8\x80\x80\x80\x8f
-Failed: error -20: UTF-8 error: overlong 5-byte sequence
+Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
\xfc\x80\x80\x80\x80\x8f
-Failed: error -21: UTF-8 error: overlong 6-byte sequence
+Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
\x80
-Failed: error -22: UTF-8 error: isolated 0x80 byte
+Failed: error -22: UTF-8 error: isolated 0x80 byte at offset 0
\xfe
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xff
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
/badutf/utf
- \xfb\x80\x80\x80\x80
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
- \xfd\x80\x80\x80\x80\x80
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
- \xf7\xbf\xbf\xbf
-Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
+ XX\xfb\x80\x80\x80\x80
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 2
+ XX\xfd\x80\x80\x80\x80\x80
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 2
+ XX\xf7\xbf\xbf\xbf
+Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 2
/shortutf/utf
- \xdf\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
- \xef\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
- \xef\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+ XX\xdf\=ph
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
+ XX\xef\=ph
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
+ XX\xef\x80\=ph
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
\xf7\=ph
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xf7\x80\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xf7\x80\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfb\=ph
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfb\x80\=ph
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfb\x80\x80\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfb\x80\x80\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfd\=ph
-Failed: error -7: UTF-8 error: 5 bytes missing at end
+Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
\xfd\x80\=ph
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfd\x80\x80\=ph
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfd\x80\x80\x80\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfd\x80\x80\x80\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
/anything/utf
- \xc0\x80
-Failed: error -17: UTF-8 error: overlong 2-byte sequence
- \xc1\x8f
-Failed: error -17: UTF-8 error: overlong 2-byte sequence
- \xe0\x9f\x80
-Failed: error -18: UTF-8 error: overlong 3-byte sequence
+ X\xc0\x80
+Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 1
+ XX\xc1\x8f
+Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 2
+ XXX\xe0\x9f\x80
+Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 3
\xf0\x8f\x80\x80
-Failed: error -19: UTF-8 error: overlong 4-byte sequence
+Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
\xf8\x87\x80\x80\x80
-Failed: error -20: UTF-8 error: overlong 5-byte sequence
+Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
\xfc\x83\x80\x80\x80\x80
-Failed: error -21: UTF-8 error: overlong 6-byte sequence
+Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
\xfe\x80\x80\x80\x80\x80
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xff\x80\x80\x80\x80\x80
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xc3\x8f
No match
\xe0\xaf\x80
@@ -220,13 +220,13 @@
\xf1\x8f\x80\x80
No match
\xf8\x88\x80\x80\x80
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\xf9\x87\x80\x80\x80
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\xfc\x84\x80\x80\x80\x80
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\xfd\x83\x80\x80\x80\x80
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\xf8\x88\x80\x80\x80\=no_utf_check
No match
\xf9\x87\x80\x80\x80\=no_utf_check
@@ -751,27 +751,27 @@
/X/utf
\x{d800}
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{d800}\=no_utf_check
No match
\x{da00}
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{da00}\=no_utf_check
No match
\x{dfff}
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{dfff}\=no_utf_check
No match
\x{110000}
-Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
+Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 0
\x{110000}\=no_utf_check
No match
\x{2000000}
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\x{2000000}\=no_utf_check
No match
\x{7fffffff}
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\x{7fffffff}\=no_utf_check
No match
@@ -1106,7 +1106,7 @@
\x{ff000041}
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
\x{7f000041}
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
/(*UTF8)abc/never_utf
Failed: error 174 at offset 7: using UTF is disabled by the application
Modified: code/trunk/testdata/testoutput12-16
===================================================================
--- code/trunk/testdata/testoutput12-16 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testoutput12-16 2014-11-23 18:38:38 UTC (rev 158)
@@ -607,30 +607,30 @@
Failed: error 106 at offset 13: missing terminating ] for character class
/X/utf
- \x{d800}
-Failed: error -24: UTF-16 error: missing low surrogate at end
- \x{d800}\=no_utf_check
-No match
- \x{da00}
-Failed: error -24: UTF-16 error: missing low surrogate at end
- \x{da00}\=no_utf_check
-No match
- \x{dc00}
-Failed: error -26: UTF-16 error: isolated low surrogate
- \x{dc00}\=no_utf_check
-No match
- \x{de00}
-Failed: error -26: UTF-16 error: isolated low surrogate
- \x{de00}\=no_utf_check
-No match
- \x{dfff}
-Failed: error -26: UTF-16 error: isolated low surrogate
- \x{dfff}\=no_utf_check
-No match
- \x{110000}
+ XX\x{d800}
+Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
+ XX\x{d800}\=no_utf_check
+ 0: X
+ XX\x{da00}
+Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
+ XX\x{da00}\=no_utf_check
+ 0: X
+ XX\x{dc00}
+Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
+ XX\x{dc00}\=no_utf_check
+ 0: X
+ XX\x{de00}
+Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
+ XX\x{de00}\=no_utf_check
+ 0: X
+ XX\x{dfff}
+Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
+ XX\x{dfff}\=no_utf_check
+ 0: X
+ XX\x{110000}
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
- \x{d800}\x{1234}
-Failed: error -25: UTF-16 error: invalid low surrogate
+ XX\x{d800}\x{1234}
+Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
/(*UTF16)\x{11234}/
abcd\x{11234}pqr
Modified: code/trunk/testdata/testoutput12-32
===================================================================
--- code/trunk/testdata/testoutput12-32 2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testoutput12-32 2014-11-23 18:38:38 UTC (rev 158)
@@ -600,30 +600,30 @@
Failed: error 106 at offset 13: missing terminating ] for character class
/X/utf
- \x{d800}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
- \x{d800}\=no_utf_check
-No match
- \x{da00}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
- \x{da00}\=no_utf_check
-No match
- \x{dc00}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
- \x{dc00}\=no_utf_check
-No match
- \x{de00}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
- \x{de00}\=no_utf_check
-No match
- \x{dfff}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
- \x{dfff}\=no_utf_check
-No match
- \x{110000}
-Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
- \x{d800}\x{1234}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
+ XX\x{d800}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+ XX\x{d800}\=no_utf_check
+ 0: X
+ XX\x{da00}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+ XX\x{da00}\=no_utf_check
+ 0: X
+ XX\x{dc00}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+ XX\x{dc00}\=no_utf_check
+ 0: X
+ XX\x{de00}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+ XX\x{de00}\=no_utf_check
+ 0: X
+ XX\x{dfff}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+ XX\x{dfff}\=no_utf_check
+ 0: X
+ XX\x{110000}
+Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 2
+ XX\x{d800}\x{1234}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
/(*UTF16)\x{11234}/
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
@@ -1113,7 +1113,7 @@
/\C/utf
\x{110000}
-Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
+Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
/\x{100}*A/IB,utf
------------------------------------------------------------------