[Pcre-svn] [158] code/trunk: More documentation and test upd…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [158] code/trunk: More documentation and test updates.
Revision: 158
          http://www.exim.org/viewvc/pcre2?view=rev&revision=158
Author:   ph10
Date:     2014-11-23 18:38:38 +0000 (Sun, 23 Nov 2014)


Log Message:
-----------
More documentation and test updates.

Modified Paths:
--------------
    code/trunk/doc/html/pcre2.html
    code/trunk/doc/html/pcre2build.html
    code/trunk/doc/html/pcre2callout.html
    code/trunk/doc/html/pcre2grep.html
    code/trunk/doc/html/pcre2jit.html
    code/trunk/doc/html/pcre2syntax.html
    code/trunk/doc/html/pcre2unicode.html
    code/trunk/doc/pcre2.txt
    code/trunk/doc/pcre2api.3
    code/trunk/doc/pcre2build.3
    code/trunk/doc/pcre2callout.3
    code/trunk/doc/pcre2grep.1
    code/trunk/doc/pcre2grep.txt
    code/trunk/doc/pcre2jit.3
    code/trunk/doc/pcre2syntax.3
    code/trunk/doc/pcre2test.1
    code/trunk/doc/pcre2unicode.3
    code/trunk/src/pcre2_dfa_match.c
    code/trunk/src/pcre2_match.c
    code/trunk/src/pcre2test.c
    code/trunk/testdata/testinput10
    code/trunk/testdata/testinput12
    code/trunk/testdata/testoutput10
    code/trunk/testdata/testoutput12-16
    code/trunk/testdata/testoutput12-32


Modified: code/trunk/doc/html/pcre2.html
===================================================================
--- code/trunk/doc/html/pcre2.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -148,7 +148,7 @@
   pcre2limits        details of size and other limits
   pcre2matching      discussion of the two matching algorithms
   pcre2partial       details of the partial matching facility
-  pcre2pattern       syntax and semantics of supported regular  expression patterns
+  pcre2pattern       syntax and semantics of supported regular expression patterns
   pcre2perform       discussion of performance issues
   pcre2posix         the POSIX-compatible C API for the 8-bit library
   pcre2sample        discussion of the pcre2demo program


Modified: code/trunk/doc/html/pcre2build.html
===================================================================
--- code/trunk/doc/html/pcre2build.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2build.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -17,9 +17,9 @@
 <li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
 <li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
 <li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
-<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a>
+<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
 <li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
-<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a>
+<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
 <li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
 <li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
 <li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
@@ -91,12 +91,12 @@
 UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
 the following to the <b>configure</b> command:
 <pre>
-  --enable-pcre16
-  --enable-pcre32
+  --enable-pcre2-16
+  --enable-pcre2-32
 </pre>
 If you do not want the 8-bit library, add
 <pre>
-  --disable-pcre8
+  --disable-pcre2-8
 </pre>
 as well. At least one of the three libraries must be built. Note that the POSIX
 wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
@@ -106,14 +106,15 @@
 <br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
 <P>
 The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
-and static libraries by default. You can suppress one of these by adding one of
+and static libraries by default. You can suppress an unwanted library by adding
+one of
 <pre>
   --disable-shared
   --disable-static
 </pre>
-to the <b>configure</b> command, as required.
+to the <b>configure</b> command.
 </P>
-<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br>
+<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
 <P>
 By default, PCRE2 is built with support for Unicode and UTF character strings.
 To build it without Unicode support, add
@@ -126,20 +127,15 @@
 </P>
 <P>
 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
-<b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library have to set the
+PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
 </P>
 <P>
-It is not possible to support both EBCDIC and UTF-8 codes in the same version
-of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
-exclusive.
-</P>
-<P>
-UTF support allows the libraries to process character codepoints up to 0x10ffff
-in the strings that they handle. It also provides support for accessing the
-properties of such characters, using pattern escapes such as \P, \p, and \X.
-Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
-supported. Details are given in the
+UTF support allows the libraries to process character code points up to
+0x10ffff in the strings that they handle. It also provides support for
+accessing the Unicode properties of such characters, using pattern escapes such
+as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
+<i>Nd</i> are supported. Details are given in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation.
 </P>
@@ -150,7 +146,7 @@
   --enable-jit
 </pre>
 This support is available only for certain hardware architectures. If this
-option is set for an unsupported architecture, a compile time error occurs.
+option is set for an unsupported architecture, a building error occurs.
 See the
 <a href="pcre2jit.html"><b>pcre2jit</b></a>
 documentation for a discussion of JIT usage. When JIT support is enabled,
@@ -160,7 +156,7 @@
 </pre>
 to the "configure" command.
 </P>
-<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
+<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
 <P>
 By default, PCRE2 interprets the linefeed (LF) character as indicating the end
 of a line. This is the normal newline character on Unix-like systems. You can
@@ -168,12 +164,13 @@
 <pre>
   --enable-newline-is-cr
 </pre>
-to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
+to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
 which explicitly specifies linefeed as the newline character.
-<br>
-<br>
-Alternatively, you can specify that line endings are to be indicated by the two
-character sequence CRLF. If you want this, add
+</P>
+<P>
+Alternatively, you can specify that line endings are to be indicated by the
+two-character sequence CRLF (CR immediately followed by LF). If you want this,
+add
 <pre>
   --enable-newline-is-crlf
 </pre>
@@ -186,22 +183,26 @@
 <pre>
   --enable-newline-is-any
 </pre>
-causes PCRE2 to recognize any Unicode newline sequence.
+causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
+sequences are the three just mentioned, plus the single characters VT (vertical
+tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
+separator, U+2028), and PS (paragraph separator, U+2029).
 </P>
 <P>
-Whatever line ending convention is selected when PCRE2 is built can be
-overridden when the library functions are called. At build time it is
+Whatever default line ending convention is selected when PCRE2 is built can be
+overridden by applications that use the library. At build time it is
 conventional to use the standard for your operating system.
 </P>
 <br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
 <P>
 By default, the sequence \R in a pattern matches any Unicode newline sequence,
-whatever has been selected as the line ending sequence. If you specify
+independently of what has been selected as the line ending sequence. If you
+specify
 <pre>
   --enable-bsr-anycrlf
 </pre>
 the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
-selected when PCRE2 is built can be overridden when the library functions are
+selected when PCRE2 is built can be overridden by applications that use the
 called.
 </P>
 <br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
@@ -210,10 +211,10 @@
 another (for example, from an opening parenthesis to an alternation
 metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
 are used for these offsets, leading to a maximum size for a compiled pattern of
-around 64K. This is sufficient to handle all but the most gigantic patterns.
-Nevertheless, some people do want to process truly enormous patterns, so it is
-possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
-setting such as
+around 64K code units. This is sufficient to handle all but the most gigantic
+patterns. Nevertheless, some people do want to process truly enormous patterns,
+so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
+adding a setting such as
 <pre>
   --with-link-size=3
 </pre>
@@ -294,18 +295,22 @@
 <br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
 <P>
 PCRE2 assumes by default that it will run in an environment where the character
-code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
+code is ASCII or Unicode, which is a superset of ASCII. This is the case for
 most computer operating systems. PCRE2 can, however, be compiled to run in an
-EBCDIC environment by adding
+8-bit EBCDIC environment by adding
 <pre>
   --enable-ebcdic --disable-unicode
 </pre>
 to the <b>configure</b> command. This setting implies
 --enable-rebuild-chartables. You should only use it if you know that you are in
-an EBCDIC environment (for example, an IBM mainframe operating system). The
---enable-ebcdic option is incompatible with Unicode support.
+an EBCDIC environment (for example, an IBM mainframe operating system).
 </P>
 <P>
+It is not possible to support both EBCDIC and UTF-8 codes in the same version
+of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
+exclusive.
+</P>
+<P>
 The EBCDIC character that corresponds to an ASCII LF is assumed to have the
 value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
 such an environment you should use
@@ -347,8 +352,8 @@
 <pre>
   --with-pcre2grep-bufsize=50K
 </pre>
-to the <b>configure</b> command. The caller of \fPpcre2grep\fP can, however,
-override this value by specifying a run-time option.
+to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override this
+value by using --buffer-size on the command line..
 </P>
 <br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
 <P>
@@ -362,16 +367,16 @@
 from a terminal, it reads it using the <b>readline()</b> function. This provides
 line-editing and history facilities. Note that <b>libreadline</b> is
 GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
-way, there may be licensing issues. These can be avoided by linking with
-<b>libedit</b> (which has a BSD licence) instead.
+way, there may be licensing issues. These can be avoided by linking instead
+with <b>libedit</b>, which has a BSD licence.
 </P>
 <P>
-Setting this option causes the <b>-lreadline</b> option to be added to the
-<b>pcre2test</b> build. In many operating environments with a sytem-installed
-readline library this is sufficient. However, in some environments (e.g. if an
-unmodified distribution version of readline is in use), some extra
-configuration may be necessary. The INSTALL file for <b>libreadline</b> says
-this:
+Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
+added to the <b>pcre2test</b> build. In many operating environments with a
+sytem-installed readline library this is sufficient. However, in some
+environments (e.g. if an unmodified distribution version of readline is in
+use), some extra configuration may be necessary. The INSTALL file for
+<b>libreadline</b> says this:
 <pre>
   "Readline uses the termcap functions, but does not link with
   the termcap or curses library itself, allowing applications
@@ -386,13 +391,13 @@
 </P>
 <br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
 <P>
-By adding the
+If you add
 <pre>
   --enable-valgrind
 </pre>
-option to to the <b>configure</b> command, PCRE2 will use valgrind annotations
-to mark certain memory regions as unaddressable. This allows it to detect
-invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
+to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
+certain memory regions as unaddressable. This allows it to detect invalid
+memory accesses, and is mostly useful for debugging PCRE2 itself.
 </P>
 <br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
 <P>
@@ -466,7 +471,7 @@
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 03 November 2014
+Last updated: 23 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>
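
The Unicode newline sequences enumerated in the pcre2build change above (CR, LF and CRLF, plus VT, FF, NEL, LS and PS) can be sketched as a small classifier. This is an illustration only and not PCRE2's internal code: the function name unicode_newline_len is hypothetical, and it assumes the subject has already been decoded into an array of code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API. Returns the number of code
   points (1 or 2) in the Unicode newline sequence that starts at s[0], or 0
   if s[0] does not start one. n is the number of code points available. */
size_t unicode_newline_len(const uint32_t *s, size_t n)
{
  assert(s != NULL || n == 0);
  if (n == 0) return 0;
  switch (s[0])
    {
    case 0x000D:                 /* CR, or CRLF when LF follows immediately */
      return (n > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A:                 /* LF  (linefeed) */
    case 0x000B:                 /* VT  (vertical tab) */
    case 0x000C:                 /* FF  (form feed) */
    case 0x0085:                 /* NEL (next line) */
    case 0x2028:                 /* LS  (line separator) */
    case 0x2029:                 /* PS  (paragraph separator) */
      return 1;
    default:
      return 0;
    }
}
```

Treating CRLF as one two-code-point sequence, rather than as CR followed by a separate LF, is what distinguishes the "any" and "anycrlf" conventions from counting each character separately.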


Modified: code/trunk/doc/html/pcre2callout.html
===================================================================
--- code/trunk/doc/html/pcre2callout.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2callout.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -85,29 +85,27 @@
 <P>
 At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
 what follows cannot be part of the repeat. For example, a+[bc] is compiled as
-if it were a++[bc]. The <b>pcre2test</b> output when this pattern is anchored
-and then applied with automatic callouts to the string "aaaa" is:
+if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
+with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
+"aaaa" is:
 <pre>
   ---&#62;aaaa
-   +0 ^        ^
-   +1 ^        a+
-   +3 ^   ^    [bc]
+   +0 ^        a+
+   +2 ^   ^    [bc]
   No match
 </pre>
 This indicates that when matching [bc] fails, there is no backtracking into a+
 and therefore the callouts that would be taken for the backtracks do not occur.
-You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
-to <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). If
-this is done in <b>pcre2test</b> (using the /no_auto_possess qualifier), the
-output changes to this:
+You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
+<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
+case, the output changes to this:
 <pre>
   ---&#62;aaaa
-   +0 ^        ^
-   +1 ^        a+
-   +3 ^   ^    [bc]
-   +3 ^  ^     [bc]
-   +3 ^ ^      [bc]
-   +3 ^^       [bc]
+   +0 ^        a+
+   +2 ^   ^    [bc]
+   +2 ^  ^     [bc]
+   +2 ^ ^      [bc]
+   +2 ^^       [bc]
   No match
 </pre>
 This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@@ -137,10 +135,10 @@
 </P>
 <br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
 <P>
-During matching, when PCRE2 reaches a callout point, the external function that
-is set in the match context is called (if it is set). This applies to both
-normal and DFA matching. The only argument to the callout function is a pointer
-to a <b>pcre2_callout</b> block. This structure contains the following fields:
+During matching, when PCRE2 reaches a callout point, if an external function is
+set in the match context, it is called. This applies to both normal and DFA
+matching. The only argument to the callout function is a pointer to a
+<b>pcre2_callout</b> block. This structure contains the following fields:
 <pre>
   uint32_t      <i>version</i>;
   uint32_t      <i>callout_number</i>;
@@ -169,7 +167,7 @@
 <P>
 The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
 (the "ovector") that was passed to the matching function in the match data
-block. When <b>pcre2_match()</b> is used, the contents can be inspected, in
+block. When <b>pcre2_match()</b> is used, the contents can be inspected in
 order to extract substrings that have been matched so far, in the same way as
 for extracting substrings after a match has completed. For the DFA matching
 function, this field is not useful.
@@ -261,7 +259,7 @@
 </P>
 <br><a name="SEC7" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 19 October 2014
+Last updated: 23 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>


Modified: code/trunk/doc/html/pcre2grep.html
===================================================================
--- code/trunk/doc/html/pcre2grep.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2grep.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -467,8 +467,8 @@
 Processing some regular expression patterns can require a very large amount of
 memory, leading in some cases to a program crash if not enough is available.
 Other patterns may take a very long time to search for all possible matching
-strings. The <b>pcre2_exec()</b> function that is called by <b>pcre2grep</b> to do
-the matching has two parameters that can limit the resources that it uses.
+strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
+do the matching has two parameters that can limit the resources that it uses.
 <br>
 <br>
 The <b>--match-limit</b> option provides a means of limiting resource usage
@@ -750,7 +750,7 @@
 </P>
 <br><a name="SEC14" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 September 2014
+Last updated: 23 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>


Modified: code/trunk/doc/html/pcre2jit.html
===================================================================
--- code/trunk/doc/html/pcre2jit.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2jit.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -31,11 +31,11 @@
 <P>
 Just-in-time compiling is a heavyweight optimization that can greatly speed up
 pattern matching. However, it comes at the cost of extra processing before the
-match is performed. Therefore, it is of most benefit when the same pattern is
-going to be matched many times. This does not necessarily mean many calls of a
-matching function; if the pattern is not anchored, matching attempts may take
-place many times at various positions in the subject, even for a single call.
-Therefore, if the subject string is very long, it may still pay to use JIT for
+match is performed, so it is of most benefit when the same pattern is going to
+be matched many times. This does not necessarily mean many calls of a matching
+function; if the pattern is not anchored, matching attempts may take place many
+times at various positions in the subject, even for a single call. Therefore,
+if the subject string is very long, it may still pay to use JIT even for
 one-off matches. JIT support is available for all of the 8-bit, 16-bit and
 32-bit PCRE2 libraries.
 </P>
@@ -103,7 +103,7 @@
 PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
 PCRE2_JIT_COMPLETE and just compile code for partial matching. If
 <b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
-returns zero. This is an alternative way of testing if JIT is available.
+returns zero. This is an alternative way of testing whether JIT is available.
 </P>
 <P>
 At present, it is not possible to free JIT compiled code except when the entire
@@ -299,7 +299,7 @@
 not\fP call <b>pcre2_match()</b> with a match context pointing to an already
 freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
 used by <b>pcre2_match()</b> in another thread). You can also replace the stack
-in a context at any time when it is not in use. You can also free the previous
+in a context at any time when it is not in use. You should free the previous
 stack before assigning a replacement.
 </P>
 <P>
@@ -418,7 +418,7 @@
 </P>
 <br><a name="SEC13" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 12 November 2014
+Last updated: 23 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>


Modified: code/trunk/doc/html/pcre2syntax.html
===================================================================
--- code/trunk/doc/html/pcre2syntax.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2syntax.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -421,7 +421,7 @@
   (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
 </pre>
 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_exec(), not increase them.
+limits set by the caller of pcre2_match(), not increase them.
 </P>
 <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
 <P>
@@ -553,7 +553,7 @@
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 14 November 2014
+Last updated: 23 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>


Modified: code/trunk/doc/html/pcre2unicode.html
===================================================================
--- code/trunk/doc/html/pcre2unicode.html    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/html/pcre2unicode.html    2014-11-23 18:38:38 UTC (rev 158)
@@ -72,7 +72,7 @@
 characters (see the description of \C in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation). The use of \C is not supported in the alternative matching
-function <b>pcre2_dfa_exec()</b>, nor is it supported in UTF mode by the JIT
+function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
 optimization. If JIT optimization is requested for a UTF pattern that contains
 \C, it will not succeed, and so the matching will be carried out by the normal
 interpretive function.
@@ -141,15 +141,15 @@
 In some situations, you may already know that your strings are valid, and
 therefore want to skip these checks in order to improve performance, for
 example in the case of a long subject string that is being scanned repeatedly.
-If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
-assumes that the pattern or subject it is given (respectively) contains only
-valid UTF code unit sequences.
+If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+PCRE2 assumes that the pattern or subject it is given (respectively) contains
+only valid UTF code unit sequences.
 </P>
 <P>
 Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
 the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this option to <b>pcre2_exec()</b>
-or <b>pcre2_dfa_exec()</b>.
+the check for a subject string you must pass this option to <b>pcre2_match()</b>
+or <b>pcre2_dfa_match()</b>.
 </P>
 <P>
 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
@@ -261,7 +261,7 @@
 REVISION
 </b><br>
 <P>
-Last updated: 03 November 2014
+Last updated: 23 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>
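
The pcre2unicode change above says that PCRE2_NO_UTF_CHECK makes PCRE2 assume its input is valid UTF. An application that sets the option therefore needs some other assurance that its strings are well formed. A rough sketch of such a check for the 8-bit (UTF-8) case follows. The helper utf8_is_valid is hypothetical and is not the check that PCRE2 itself performs. It rejects truncated or malformed sequences, overlong encodings, surrogates, and values above 0x10ffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API. Returns 1 if the n bytes
   at s are well-formed UTF-8, and 0 otherwise. */
int utf8_is_valid(const uint8_t *s, size_t n)
{
  size_t i = 0;
  assert(s != NULL || n == 0);
  while (i < n)
    {
    uint8_t c = s[i];
    size_t extra;                /* number of continuation bytes expected */
    uint32_t cp, min;            /* decoded value and smallest legal value */

    if (c < 0x80) { i++; continue; }            /* single-byte character */
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;               /* stray continuation or invalid lead byte */

    if (n - i <= extra) return 0;               /* sequence is truncated */
    for (size_t k = 1; k <= extra; k++)
      {
      uint8_t t = s[i + k];
      if ((t & 0xC0) != 0x80) return 0;         /* not a continuation byte */
      cp = (cp << 6) | (uint32_t)(t & 0x3F);
      }
    if (cp < min) return 0;                     /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate code point */
    i += extra + 1;
    }
  return 1;
}
```

Validating a long subject once with a check of this kind, and then passing PCRE2_NO_UTF_CHECK on the repeated matching calls, is the situation the documentation describes. Whether it is a net gain depends on how many times the subject is scanned.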


Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2.txt    2014-11-23 18:38:38 UTC (rev 158)
@@ -2667,12 +2667,12 @@
        or UTF-16/UTF-32 strings. To build these additional libraries, add  one
        or both of the following to the configure command:


-         --enable-pcre16
-         --enable-pcre32
+         --enable-pcre2-16
+         --enable-pcre2-32


        If you do not want the 8-bit library, add


-         --disable-pcre8
+         --disable-pcre2-8


        as  well.  At least one of the three libraries must be built. Note that
        the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
@@ -2683,16 +2683,16 @@
 BUILDING SHARED AND STATIC LIBRARIES


        The Autotools PCRE2 building process uses libtool to build both  shared
-       and  static  libraries  by  default.  You  can suppress one of these by
-       adding one of
+       and  static  libraries by default. You can suppress an unwanted library
+       by adding one of


          --disable-shared
          --disable-static


-       to the configure command, as required.
+       to the configure command.



-Unicode and UTF SUPPORT
+UNICODE AND UTF SUPPORT

        By default, PCRE2 is built with support for Unicode and  UTF  character
        strings.  To build it without Unicode support, add
@@ -2704,20 +2704,18 @@
        another without, in the same configuration.


        Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
-       UTF-16 or UTF-32. To do that you have have to set the PCRE2_UTF  option
-       when you call pcre2_compile() to compile a pattern.
+       UTF-16 or UTF-32. To do that, applications that use the library have to
+       set  the  PCRE2_UTF  option when they call pcre2_compile() to compile a
+       pattern.


-       It  is  not possible to support both EBCDIC and UTF-8 codes in the same
-       version of the library. Consequently,  --enable-unicode  and  --enable-
-       ebcdic are mutually exclusive.
+       UTF support allows the libraries to process character code points up to
+       0x10ffff  in the strings that they handle. It also provides support for
+       accessing the Unicode properties  of  such  characters,  using  pattern
+       escapes  such  as  \P, \p, and \X. Only the general category properties
+       such as Lu and Nd are supported. Details are given in the  pcre2pattern
+       documentation.


-       UTF  support allows the libraries to process character codepoints up to
-       0x10ffff in the strings that they handle. It also provides support  for
-       accessing the properties of such characters, using pattern escapes such
-       as \P, \p, and \X.  Only the general category properties such as Lu and
-       Nd are supported. Details are given in the pcre2pattern documentation.


-
JUST-IN-TIME COMPILER SUPPORT

        Just-in-time compiler support is included in the build by specifying
@@ -2725,17 +2723,17 @@
          --enable-jit


        This  support  is available only for certain hardware architectures. If
-       this option is set for an  unsupported  architecture,  a  compile  time
-       error  occurs.   See the pcre2jit documentation for a discussion of JIT
-       usage. When JIT support is enabled, pcre2grep automatically  makes  use
-       of it, unless you add
+       this option is set for an unsupported architecture,  a  building  error
+       occurs.   See the pcre2jit documentation for a discussion of JIT usage.
+       When JIT support is enabled, pcre2grep automatically makes use  of  it,
+       unless you add


          --disable-pcre2grep-jit


        to the "configure" command.



-CODE VALUE OF NEWLINE
+NEWLINE RECOGNITION

        By  default, PCRE2 interprets the linefeed (LF) character as indicating
        the end of a line. This is the normal newline  character  on  Unix-like
@@ -2744,11 +2742,12 @@


          --enable-newline-is-cr


-       to the  configure  command.  There  is  also  a  --enable-newline-is-lf
+       to the configure  command.  There  is  also  an  --enable-newline-is-lf
        option, which explicitly specifies linefeed as the newline character.


        Alternatively, you can specify that line endings are to be indicated by
-       the two character sequence CRLF. If you want this, add
+       the two-character sequence CRLF (CR immediately followed by LF). If you
+       want this, add


          --enable-newline-is-crlf


@@ -2756,41 +2755,46 @@

          --enable-newline-is-anycrlf


-       which causes PCRE2 to recognize any of the three sequences CR,  LF,  or
+       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
        CRLF as indicating a line ending. Finally, a fifth option, specified by


          --enable-newline-is-any


-       causes PCRE2 to recognize any Unicode newline sequence.
+       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
+       newline sequences are the three just mentioned, plus the single charac-
+       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
+       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
+       U+2029).


-       Whatever  line ending convention is selected when PCRE2 is built can be
-       overridden when the library functions are called. At build time  it  is
-       conventional to use the standard for your operating system.
+       Whatever default line ending convention is selected when PCRE2 is built
+       can  be  overridden by applications that use the library. At build time
+       it is conventional to use the standard for your operating system.



WHAT \R MATCHES

-       By  default,  the  sequence \R in a pattern matches any Unicode newline
-       sequence, whatever has been selected as the line  ending  sequence.  If
-       you specify
+       By default, the sequence \R in a pattern matches  any  Unicode  newline
+       sequence,  independently  of  what has been selected as the line ending
+       sequence. If you specify


          --enable-bsr-anycrlf


-       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
-       ever is selected when PCRE2 is built can be overridden when the library
-       functions are called.
+       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
+       ever  is selected when PCRE2 is built can be overridden by applications
+       that use the called.



HANDLING VERY LARGE PATTERNS

-       Within  a  compiled  pattern,  offset values are used to point from one
-       part to another (for example, from an opening parenthesis to an  alter-
-       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
-       two-byte values are used for these offsets, leading to a  maximum  size
-       for  a compiled pattern of around 64K. This is sufficient to handle all
-       but the most gigantic patterns.  Nevertheless, some people do  want  to
-       process  truly enormous patterns, so it is possible to compile PCRE2 to
-       use three-byte or four-byte offsets by adding a setting such as
+       Within a compiled pattern, offset values are used  to  point  from  one
+       part  to another (for example, from an opening parenthesis to an alter-
+       nation metacharacter). By default, in the 8-bit and  16-bit  libraries,
+       two-byte  values  are used for these offsets, leading to a maximum size
+       for a compiled pattern of around 64K code units. This is sufficient  to
+       handle all but the most gigantic patterns. Nevertheless, some people do
+       want to process truly enormous patterns, so it is possible  to  compile
+       PCRE2  to  use three-byte or four-byte offsets by adding a setting such
+       as


          --with-link-size=3


@@ -2876,25 +2880,28 @@
USING EBCDIC CODE

        PCRE2  assumes  by default that it will run in an environment where the
-       character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
-       This  is  the case for most computer operating systems. PCRE2 can, how-
-       ever, be compiled to run in an EBCDIC environment by adding
+       character code is ASCII or Unicode, which is a superset of ASCII.  This
+       is the case for most computer operating systems. PCRE2 can, however, be
+       compiled to run in an 8-bit EBCDIC environment by adding


          --enable-ebcdic --disable-unicode


        to the configure command. This setting implies --enable-rebuild-charta-
        bles.  You  should  only  use  it if you know that you are in an EBCDIC
-       environment (for example,  an  IBM  mainframe  operating  system).  The
-       --enable-ebcdic option is incompatible with Unicode support.
+       environment (for example, an IBM mainframe operating system).


+       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
+       version  of  the  library. Consequently, --enable-unicode and --enable-
+       ebcdic are mutually exclusive.
+
        The EBCDIC character that corresponds to an ASCII LF is assumed to have
-       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
+       the  value  0x15 by default. However, in some EBCDIC environments, 0x25
        is used. In such an environment you should use


          --enable-ebcdic-nl25


        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
-       has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
+       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
        acter (which, in Unicode, is 0x85).


@@ -2905,32 +2912,32 @@

PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT

-       By default, pcre2grep reads all files as plain text. You can  build  it
-       so  that  it recognizes files whose names end in .gz or .bz2, and reads
+       By  default,  pcre2grep reads all files as plain text. You can build it
+       so that it recognizes files whose names end in .gz or .bz2,  and  reads
        them with libz or libbz2, respectively, by adding one or both of


          --enable-pcre2grep-libz
          --enable-pcre2grep-libbz2


        to the configure command. These options naturally require that the rel-
-       evant  libraries  are installed on your system. Configuration will fail
+       evant libraries are installed on your system. Configuration  will  fail
        if they are not.



PCRE2GREP BUFFER SIZE

-       pcre2grep uses an internal buffer to hold a "window" on the file it  is
+       pcre2grep  uses an internal buffer to hold a "window" on the file it is
        scanning, in order to be able to output "before" and "after" lines when
-       it finds a match. The size of the buffer is controlled by  a  parameter
+       it  finds  a match. The size of the buffer is controlled by a parameter
        whose default value is 20K. The buffer itself is three times this size,
        but because of the way it is used for holding "before" lines, the long-
-       est  line  that  is guaranteed to be processable is the parameter size.
+       est line that is guaranteed to be processable is  the  parameter  size.
        You can change the default parameter value by adding, for example,


          --with-pcre2grep-bufsize=50K


-       to the configure command. The caller of pcre2grep can,  however,  over-
-       ride this value by specifying a run-time option.
+       to  the  configure  command.  The caller of pcre2grep can override this
+       value by using --buffer-size on the command line.



 PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
@@ -2940,26 +2947,26 @@
          --enable-pcre2test-libreadline
          --enable-pcre2test-libedit


-       to  the  configure  command,  pcre2test  is linked with the libreadline
+       to the configure command, pcre2test  is  linked  with  the  libreadline
        or libedit library, respectively, and when its input is from a terminal,
-       it  reads  it using the readline() function. This provides line-editing
-       and history facilities. Note that libreadline is  GPL-licensed,  so  if
-       you  distribute  a binary of pcre2test linked in this way, there may be
-       licensing issues. These can be avoided by linking with  libedit  (which
-       has a BSD licence) instead.
+       it reads it using the readline() function. This  provides  line-editing
+       and  history  facilities.  Note that libreadline is GPL-licensed, so if
+       you distribute a binary of pcre2test linked in this way, there  may  be
+       licensing issues. These can be avoided by linking instead with libedit,
+       which has a BSD licence.


-       Setting  this  option  causes  the -lreadline option to be added to the
-       pcre2test build. In many operating environments with a  sytem-installed
-       readline  library  this  is  sufficient.  However, in some environments
-       (e.g. if an unmodified distribution version of  readline  is  in  use),
-       some  extra  configuration  may  be  necessary.  The  INSTALL  file for
-       libreadline says this:
+       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
+       be  added to the pcre2test build. In many operating environments with a
+       system-installed readline library this is sufficient. However,  in  some
+       environments (e.g. if an unmodified distribution version of readline is
+       in use), some extra configuration may be necessary.  The  INSTALL  file
+       for libreadline says this:


          "Readline uses the termcap functions, but does not link with
          the termcap or curses library itself, allowing applications
          which link with readline the to choose an appropriate library."


-       If your environment has not been set up so that an appropriate  library
+       If  your environment has not been set up so that an appropriate library
        is automatically included, you may need to add something like


          LIBS="-lncurses"
@@ -2969,19 +2976,19 @@


DEBUGGING WITH VALGRIND SUPPORT

-       By adding the
+       If you add


          --enable-valgrind


-       option to to the configure command, PCRE2 will use valgrind annotations
-       to mark certain memory regions as  unaddressable.  This  allows  it  to
-       detect  invalid  memory  accesses,  and  is mostly useful for debugging
-       PCRE2 itself.
+       to the configure command, PCRE2 will use valgrind annotations  to  mark
+       certain  memory  regions  as  unaddressable.  This  allows it to detect
+       invalid memory accesses, and  is  mostly  useful  for  debugging  PCRE2
+       itself.



CODE COVERAGE REPORTING

-       If your C compiler is gcc, you can build a version of  PCRE2  that  can
+       If  your  C  compiler is gcc, you can build a version of PCRE2 that can
        generate a code coverage report for its test suite. To enable this, you
        must install lcov version 1.6 or above. Then specify


@@ -2990,20 +2997,20 @@
        to the configure command and build PCRE2 in the usual way.


        Note that using ccache (a caching C compiler) is incompatible with code
-       coverage  reporting. If you have configured ccache to run automatically
+       coverage reporting. If you have configured ccache to run  automatically
        on your system, you must set the environment variable


          CCACHE_DISABLE=1


        before running make to build PCRE2, so that ccache is not used.


-       When --enable-coverage is used,  the  following  addition  targets  are
+       When  --enable-coverage  is  used,  the following additional targets are
        added to the Makefile:


          make coverage


-       This  creates  a  fresh coverage report for the PCRE2 test suite. It is
-       equivalent to running "make coverage-reset", "make  coverage-baseline",
+       This creates a fresh coverage report for the PCRE2 test  suite.  It  is
+       equivalent  to running "make coverage-reset", "make coverage-baseline",
        "make check", and then "make coverage-report".


          make coverage-reset
@@ -3020,18 +3027,18 @@


          make coverage-clean-report


-       This  removes the generated coverage report without cleaning the cover-
+       This removes the generated coverage report without cleaning the  cover-
        age data itself.


          make coverage-clean-data


-       This removes the captured coverage data without removing  the  coverage
+       This  removes  the captured coverage data without removing the coverage
        files created at compile time (*.gcno).


          make coverage-clean


-       This  cleans all coverage data including the generated coverage report.
-       For more information about code coverage, see the gcov and  lcov  docu-
+       This cleans all coverage data including the generated coverage  report.
+       For  more  information about code coverage, see the gcov and lcov docu-
        mentation.



@@ -3049,7 +3056,7 @@

REVISION

-       Last updated: 03 November 2014
+       Last updated: 23 November 2014
        Copyright (c) 1997-2014 University of Cambridge.
 ------------------------------------------------------------------------------
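
Editorial note on the link-size text in the pcre2build changes above: the "around 64K code units" figure follows from the number of distinct values a two-byte offset can hold. The sketch below works through the same arithmetic for the three link sizes mentioned. The function name max_offset_span is hypothetical and is not part of PCRE2; the real limit on a compiled pattern is somewhat lower than this raw bound.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function: the number of distinct
   values an offset of link_size bytes can hold, that is 2^(8*link_size).
   Only the link sizes named in the documentation (2, 3, 4) are accepted. */
uint64_t max_offset_span(unsigned link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (uint64_t)1 << (8u * link_size);
}
```

Under these assumptions, a link size of 2 gives 65536, 3 gives 16777216, and 4 gives 4294967296, which is why --with-link-size=3 is enough for all but extremely large patterns.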


@@ -3122,62 +3129,59 @@
        At compile time, PCRE2 "auto-possessifies" repeated items when it knows
        that what follows cannot be part of the repeat. For example, a+[bc]  is
        compiled  as if it were a++[bc]. The pcre2test output when this pattern
-       is anchored and then applied with  automatic  callouts  to  the  string
-       "aaaa" is:
+       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
+       to the string "aaaa" is:


          --->aaaa
-          +0 ^        ^
-          +1 ^        a+
-          +3 ^   ^    [bc]
+          +0 ^        a+
+          +2 ^   ^    [bc]
          No match


        This  indicates that when matching [bc] fails, there is no backtracking
        into a+ and therefore the callouts that would be taken  for  the  back-
        tracks  do  not  occur.  You can disable the auto-possessify feature by
        passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the  pat-
-       tern  with  (*NO_AUTO_POSSESS). If this is done in pcre2test (using the
-       /no_auto_possess qualifier), the output changes to this:
+       tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:


          --->aaaa
-          +0 ^        ^
-          +1 ^        a+
-          +3 ^   ^    [bc]
-          +3 ^  ^     [bc]
-          +3 ^ ^      [bc]
-          +3 ^^       [bc]
+          +0 ^        a+
+          +2 ^   ^    [bc]
+          +2 ^  ^     [bc]
+          +2 ^ ^      [bc]
+          +2 ^^       [bc]
          No match


        This time, when matching [bc] fails, the matcher backtracks into a+ and
        tries again, repeatedly, until a+ itself fails.


-       Other  optimizations  that  provide fast "no match" results also affect
+       Other optimizations that provide fast "no match"  results  also  affect
        callouts.  For example, if the pattern is


          ab(?C4)cd


-       PCRE2 knows that any matching string must contain the  letter  "d".  If
-       the  subject  string  is  "abyz",  the  lack of "d" means that matching
-       doesn't ever start, and the callout is  never  reached.  However,  with
+       PCRE2  knows  that  any matching string must contain the letter "d". If
+       the subject string is "abyz", the  lack  of  "d"  means  that  matching
+       doesn't  ever  start,  and  the callout is never reached. However, with
        "abyd", though the result is still no match, the callout is obeyed.


-       PCRE2  also  knows  the  minimum  length of a matching string, and will
-       immediately give a "no match" return without actually running  a  match
-       if  the  subject is not long enough, or, for unanchored patterns, if it
+       PCRE2 also knows the minimum length of  a  matching  string,  and  will
+       immediately  give  a "no match" return without actually running a match
+       if the subject is not long enough, or, for unanchored patterns,  if  it
        has been scanned far enough.


        You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
-       MIZE  option  to  pcre2_compile(),  or  by  starting  the  pattern with
-       (*NO_START_OPT). This slows down the matching process, but does  ensure
+       MIZE option  to  pcre2_compile(),  or  by  starting  the  pattern  with
+       (*NO_START_OPT).  This slows down the matching process, but does ensure
        that callouts such as the example above are obeyed.



THE CALLOUT INTERFACE

-       During matching, when PCRE2 reaches a callout point, the external func-
-       tion that is set in the match context is called (if it  is  set).  This
-       applies to both normal and DFA matching. The only argument to the call-
-       out function is a pointer to a pcre2_callout block. This structure con-
-       tains the following fields:
+       During matching, when PCRE2 reaches a callout  point,  if  an  external
+       function  is  set  in  the match context, it is called. This applies to
+       both normal and DFA matching. The only argument to the callout function
+       is a pointer to a pcre2_callout block. This structure contains the fol-
+       lowing fields:


          uint32_t      version;
          uint32_t      callout_number;
@@ -3193,69 +3197,69 @@
          PCRE2_SIZE    pattern_position;
          PCRE2_SIZE    next_item_length;


-       The  version field contains the version number of the block format. The
+       The version field contains the version number of the block format.  The
        current version is 0. The version number will change in future if addi-
-       tional  fields  are  added, but the intention is never to remove any of
+       tional fields are added, but the intention is never to  remove  any  of
        the existing fields.


-       The callout_number field contains the number of the  callout,  as  com-
-       piled  into  the pattern (that is, the number after ?C for manual call-
+       The  callout_number  field  contains the number of the callout, as com-
+       piled into the pattern (that is, the number after ?C for  manual  call-
        outs, and 255 for automatically generated callouts).


        The offset_vector field is a pointer to the vector of capturing offsets
-       (the  "ovector")  that was passed to the matching function in the match
-       data block. When pcre2_match() is used, the contents can be  inspected,
-       in  order  to  extract substrings that have been matched so far, in the
-       same way as for extracting substrings after a match has completed.  For
+       (the "ovector") that was passed to the matching function in  the  match
+       data  block.  When pcre2_match() is used, the contents can be inspected
+       in order to extract substrings that have been matched so  far,  in  the
+       same  way as for extracting substrings after a match has completed. For
        the DFA matching function, this field is not useful.


        The subject and subject_length fields contain copies of the values that
        were passed to the matching function.


-       The start_match field normally contains the offset within  the  subject
-       at  which  the  current  match  attempt started. However, if the escape
-       sequence \K has been encountered, this value is changed to reflect  the
-       modified  starting  point.  If the pattern is not anchored, the callout
+       The  start_match  field normally contains the offset within the subject
+       at which the current match attempt  started.  However,  if  the  escape
+       sequence  \K has been encountered, this value is changed to reflect the
+       modified starting point. If the pattern is not  anchored,  the  callout
        function may be called several times from the same point in the pattern
        for different starting points in the subject.


-       The  current_position  field  contains the offset within the subject of
+       The current_position field contains the offset within  the  subject  of
        the current match pointer.


        When pcre2_match() is used, the capture_top field contains one more
-       than  the  number of the highest numbered captured substring so far. If
+       than the number of the highest numbered captured substring so  far.  If
        no substrings have been captured, the value of capture_top is one. This
        is always the case when the DFA functions are used, because they do not
        support captured substrings.


-       The capture_last field contains the number of the  most  recently  cap-
-       tured  substring. However, when a recursion exits, the value reverts to
-       what it was outside the recursion, as do the  values  of  all  captured
-       substrings.  If  no  substrings  have  been captured, the value of cap-
+       The  capture_last  field  contains the number of the most recently cap-
+       tured substring. However, when a recursion exits, the value reverts  to
+       what  it  was  outside  the recursion, as do the values of all captured
+       substrings. If no substrings have been  captured,  the  value  of  cap-
        ture_last is 0. This is always the case for the DFA matching functions.


-       The callout_data field contains a value that is passed  to  a  matching
-       function  specifically so that it can be passed back in callouts. It is
-       set in the match  context  when  the  callout  is  set  up  by  calling
+       The  callout_data  field  contains a value that is passed to a matching
+       function specifically so that it can be passed back in callouts. It  is
+       set  in  the  match  context  when  the  callout  is  set up by calling
        pcre2_set_callout() (see the pcre2api documentation).


-       The  pattern_position  field contains the offset to the next item to be
+       The pattern_position field contains the offset to the next item  to  be
        matched in the pattern string.


-       The next_item_length field contains the length of the next item  to  be
+       The  next_item_length  field contains the length of the next item to be
        matched in the pattern string. When the callout immediately precedes an
-       alternation bar, a closing parenthesis, or the end of the pattern,  the
-       length  is  zero. When the callout precedes an opening parenthesis, the
+       alternation  bar, a closing parenthesis, or the end of the pattern, the
+       length is zero. When the callout precedes an opening  parenthesis,  the
        length is that of the entire subpattern.


-       The pattern_position and next_item_length fields are intended  to  help
-       in  distinguishing between different automatic callouts, which all have
+       The  pattern_position  and next_item_length fields are intended to help
+       in distinguishing between different automatic callouts, which all  have
        the same callout number. However, they are set for all callouts.


        In callouts from pcre2_match() the mark field contains a pointer to the
-       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
-       (*THEN) item in the match, or NULL if no such items have  been  passed.
-       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
+       zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
+       (*THEN)  item  in the match, or NULL if no such items have been passed.
+       Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
        previous (*MARK). In callouts from the DFA matching function this field
        always contains NULL.


@@ -3263,16 +3267,16 @@
RETURN VALUES

        The external callout function returns an integer to PCRE2. If the value
-       is zero, matching proceeds as normal. If  the  value  is  greater  than
-       zero,  matching  fails  at  the current point, but the testing of other
+       is  zero,  matching  proceeds  as  normal. If the value is greater than
+       zero, matching fails at the current point, but  the  testing  of  other
        matching possibilities goes ahead, just as if a lookahead assertion had
        failed. If the value is less than zero, the match is abandoned, and the
        matching function returns the negative value.


-       Negative  values  should  normally  be   chosen   from   the   set   of
-       PCRE2_ERROR_xxx  values.  In  particular,  PCRE2_ERROR_NOMATCH forces a
-       standard "no match" failure. The error  number  PCRE2_ERROR_CALLOUT  is
-       reserved  for  use by callout functions; it will never be used by PCRE2
+       Negative   values   should   normally   be   chosen  from  the  set  of
+       PCRE2_ERROR_xxx values. In  particular,  PCRE2_ERROR_NOMATCH  forces  a
+       standard  "no  match"  failure. The error number PCRE2_ERROR_CALLOUT is
+       reserved for use by callout functions; it will never be used  by  PCRE2
        itself.



@@ -3285,7 +3289,7 @@

REVISION

-       Last updated: 19 October 2014
+       Last updated: 23 November 2014
        Copyright (c) 1997-2014 University of Cambridge.
 ------------------------------------------------------------------------------
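
Editorial note on the RETURN VALUES text above: the three-way convention for callout return values can be summarised in a few lines of C. This is only a model of the rule as the documentation states it, not code from PCRE2; the enum and the function name are hypothetical.

```c
#include <assert.h>

/* Hypothetical names for illustration; PCRE2 does not define this enum. */
typedef enum {
    CALLOUT_PROCEED,   /* value == 0: matching proceeds as normal */
    CALLOUT_BACKTRACK, /* value > 0: fail at this point, try other possibilities */
    CALLOUT_ABANDON    /* value < 0: abandon the match and return the value */
} callout_action;

/* Map a callout function's return value onto the documented behaviour. */
callout_action classify_callout_result(int rc)
{
    if (rc == 0)
        return CALLOUT_PROCEED;
    return (rc > 0) ? CALLOUT_BACKTRACK : CALLOUT_ABANDON;
}

/* Small self-check of the three cases. */
void classify_callout_self_check(void)
{
    assert(classify_callout_result(0) == CALLOUT_PROCEED);
    assert(classify_callout_result(7) == CALLOUT_BACKTRACK);
    assert(classify_callout_result(-7) == CALLOUT_ABANDON);
}
```

A real callout function would normally pick its negative values from the PCRE2_ERROR_xxx set, as the text above recommends.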


@@ -3487,12 +3491,12 @@

        Just-in-time  compiling  is a heavyweight optimization that can greatly
        speed up pattern matching. However, it comes at the cost of extra  pro-
-       cessing before the match is performed. Therefore, it is of most benefit
-       when the same pattern is going to be matched many times. This does  not
-       necessarily  mean  many calls of a matching function; if the pattern is
-       not anchored, matching attempts may take place many  times  at  various
-       positions  in  the  subject, even for a single call.  Therefore, if the
-       subject string is very long, it may still pay to use  JIT  for  one-off
+       cessing  before  the  match is performed, so it is of most benefit when
+       the same pattern is going to be matched many times. This does not  nec-
+       essarily  mean many calls of a matching function; if the pattern is not
+       anchored, matching attempts may take place many times at various  posi-
+       tions in the subject, even for a single call. Therefore, if the subject
+       string is very long, it may still pay  to  use  JIT  even  for  one-off
        matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
        32-bit PCRE2 libraries.


@@ -3558,8 +3562,8 @@
        again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
        ing. If pcre2_jit_compile() is called with no option bits set, it imme-
-       diately  returns  zero. This is an alternative way of testing if JIT is
-       available.
+       diately returns zero. This is an alternative way of testing whether JIT
+       is available.


        At present, it is not possible to free JIT compiled  code  except  when
        the entire compiled pattern is freed by calling pcre2_free_code().
@@ -3745,7 +3749,7 @@
        an already freed stack, as that will cause SEGFAULT. (Also, do not free
        a stack currently used by pcre2_match() in  another  thread).  You  can
        also  replace the stack in a context at any time when it is not in use.
-       You can also free the previous stack before assigning a replacement.
+       You should free the previous stack before assigning a replacement.


        (5) Should I allocate/free a  stack  every  time  before/after  calling
        pcre2_match()?
@@ -3855,7 +3859,7 @@


REVISION

-       Last updated: 12 November 2014
+       Last updated: 23 November 2014
        Copyright (c) 1997-2014 University of Cambridge.
 ------------------------------------------------------------------------------


@@ -4642,7 +4646,7 @@
        UTF mode, but its use can lead  to  some  strange  effects  because  it
        breaks  up  multi-unit  characters  (see  the  description of \C in the
        pcre2pattern documentation). The use of \C  is  not  supported  in  the
-       alternative  matching function pcre2_dfa_exec(), nor is it supported in
+       alternative matching function pcre2_dfa_match(), nor is it supported in
        UTF mode by the JIT optimization. If JIT optimization is requested  for
        a  UTF pattern that contains \C, it will not succeed, and so the match-
        ing will be carried out by the normal interpretive function.
@@ -4701,14 +4705,14 @@
        In some situations, you may already know that your strings  are  valid,
        and  therefore  want  to  skip these checks in order to improve perfor-
        mance, for example in the case of a long subject string that  is  being
-       scanned  repeatedly.  If you set the PCRE2_NO_UTF_CHECK flag at compile
-       time or at run time, PCRE2 assumes that the pattern or  subject  it  is
-       given (respectively) contains only valid UTF code unit sequences.
+       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
+       pile time or at match time, PCRE2 assumes that the pattern  or  subject
+       it is given (respectively) contains only valid UTF code unit sequences.


        Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
        for the pattern; it does not also apply to subject strings. If you want
        to  disable the check for a subject string you must pass this option to
-       pcre2_exec() or pcre2_dfa_exec().
+       pcre2_match() or pcre2_dfa_match().


        If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
        result is undefined and your program may crash or loop indefinitely.
@@ -4807,7 +4811,7 @@


REVISION

-       Last updated: 03 November 2014
+       Last updated: 23 November 2014
        Copyright (c) 1997-2014 University of Cambridge.
 ------------------------------------------------------------------------------



Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2api.3    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
+.TH PCRE2API 3 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -2090,6 +2090,13 @@
 different to the value of \fIovector[0]\fP if the pattern contains the \eK
 escape sequence. After a partial match, however, this value is always the same
 as \fIovector[0]\fP because \eK does not affect the result of a partial match.
+.P
+The \fBstartchar\fP field is also used to return the offset of an invalid 
+UTF character when UTF checking fails. Details are given in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+page. 
 .
 .
 .\" HTML <a name="errorlist"></a>
@@ -2707,6 +2714,6 @@
 .rs
 .sp
 .nf
-Last updated: 21 November 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2build.3
===================================================================
--- code/trunk/doc/pcre2build.3    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2build.3    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "03 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .
@@ -74,12 +74,12 @@
 UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
 the following to the \fBconfigure\fP command:
 .sp
-  --enable-pcre16
-  --enable-pcre32
+  --enable-pcre2-16
+  --enable-pcre2-32
 .sp
 If you do not want the 8-bit library, add
 .sp
-  --disable-pcre8
+  --disable-pcre2-8
 .sp
 as well. At least one of the three libraries must be built. Note that the POSIX
 wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
@@ -91,15 +91,16 @@
 .rs
 .sp
 The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
-and static libraries by default. You can suppress one of these by adding one of
+and static libraries by default. You can suppress an unwanted library by adding
+one of
 .sp
   --disable-shared
   --disable-static
 .sp
-to the \fBconfigure\fP command, as required.
+to the \fBconfigure\fP command.
 .
 .
-.SH "Unicode and UTF SUPPORT"
+.SH "UNICODE AND UTF SUPPORT"
 .rs
 .sp
 By default, PCRE2 is built with support for Unicode and UTF character strings.
@@ -112,18 +113,14 @@
 in the same configuration.
 .P
 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
-\fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library have to set the
+PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
 .P
-It is not possible to support both EBCDIC and UTF-8 codes in the same version
-of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
-exclusive.
-.P
-UTF support allows the libraries to process character codepoints up to 0x10ffff
-in the strings that they handle. It also provides support for accessing the
-properties of such characters, using pattern escapes such as \eP, \ep, and \eX.
-Only the general category properties such as \fILu\fP and \fINd\fP are
-supported. Details are given in the
+UTF support allows the libraries to process character code points up to
+0x10ffff in the strings that they handle. It also provides support for
+accessing the Unicode properties of such characters, using pattern escapes such
+as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
+\fINd\fP are supported. Details are given in the
 .\" HREF
 \fBpcre2pattern\fP
 .\"
@@ -138,7 +135,7 @@
   --enable-jit
 .sp
 This support is available only for certain hardware architectures. If this
-option is set for an unsupported architecture, a compile time error occurs.
+option is set for an unsupported architecture, a build error occurs.
 See the
 .\" HREF
 \fBpcre2jit\fP
@@ -151,7 +148,7 @@
 to the "configure" command.
 .
 .
-.SH "CODE VALUE OF NEWLINE"
+.SH "NEWLINE RECOGNITION"
 .rs
 .sp
 By default, PCRE2 interprets the linefeed (LF) character as indicating the end
@@ -160,12 +157,13 @@
 .sp
   --enable-newline-is-cr
 .sp
-to the \fBconfigure\fP command. There is also a --enable-newline-is-lf option,
+to the \fBconfigure\fP command. There is also an --enable-newline-is-lf option,
 which explicitly specifies linefeed as the newline character.
+.P
+Alternatively, you can specify that line endings are to be indicated by the
+two-character sequence CRLF (CR immediately followed by LF). If you want this,
+add
 .sp
-Alternatively, you can specify that line endings are to be indicated by the two
-character sequence CRLF. If you want this, add
-.sp
   --enable-newline-is-crlf
 .sp
 to the \fBconfigure\fP command. There is a fourth option, specified by
@@ -177,10 +175,13 @@
 .sp
   --enable-newline-is-any
 .sp
-causes PCRE2 to recognize any Unicode newline sequence.
+causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
+sequences are the three just mentioned, plus the single characters VT (vertical
+tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
+separator, U+2028), and PS (paragraph separator, U+2029).
 .P
-Whatever line ending convention is selected when PCRE2 is built can be
-overridden when the library functions are called. At build time it is
+Whatever default line ending convention is selected when PCRE2 is built can be
+overridden by applications that use the library. At build time it is
 conventional to use the standard for your operating system.
 .
 .
@@ -188,12 +189,13 @@
 .rs
 .sp
 By default, the sequence \eR in a pattern matches any Unicode newline sequence,
-whatever has been selected as the line ending sequence. If you specify
+independently of what has been selected as the line ending sequence. If you
+specify
 .sp
   --enable-bsr-anycrlf
 .sp
 the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
-selected when PCRE2 is built can be overridden when the library functions are
+selected when PCRE2 is built can be overridden by applications that use the
-called.
+library.
 .
 .
@@ -204,10 +206,10 @@
 another (for example, from an opening parenthesis to an alternation
 metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
 are used for these offsets, leading to a maximum size for a compiled pattern of
-around 64K. This is sufficient to handle all but the most gigantic patterns.
-Nevertheless, some people do want to process truly enormous patterns, so it is
-possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
-setting such as
+around 64K code units. This is sufficient to handle all but the most gigantic
+patterns. Nevertheless, some people do want to process truly enormous patterns,
+so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
+adding a setting such as
 .sp
   --with-link-size=3
 .sp
@@ -299,17 +301,20 @@
 .rs
 .sp
 PCRE2 assumes by default that it will run in an environment where the character
-code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
+code is ASCII or Unicode, which is a superset of ASCII. This is the case for
 most computer operating systems. PCRE2 can, however, be compiled to run in an
-EBCDIC environment by adding
+8-bit EBCDIC environment by adding
 .sp
   --enable-ebcdic --disable-unicode
 .sp
 to the \fBconfigure\fP command. This setting implies
 --enable-rebuild-chartables. You should only use it if you know that you are in
-an EBCDIC environment (for example, an IBM mainframe operating system). The
---enable-ebcdic option is incompatible with Unicode support.
+an EBCDIC environment (for example, an IBM mainframe operating system).
 .P
+It is not possible to support both EBCDIC and UTF-8 codes in the same version
+of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
+exclusive.
+.P
 The EBCDIC character that corresponds to an ASCII LF is assumed to have the
 value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
 such an environment you should use
@@ -354,8 +359,8 @@
 .sp
   --with-pcre2grep-bufsize=50K
 .sp
-to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can, however,
-override this value by specifying a run-time option.
+to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override this
+value by using --buffer-size on the command line.
 .
 .
 .SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
@@ -371,15 +376,15 @@
 from a terminal, it reads it using the \fBreadline()\fP function. This provides
 line-editing and history facilities. Note that \fBlibreadline\fP is
 GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
-way, there may be licensing issues. These can be avoided by linking with
-\fBlibedit\fP (which has a BSD licence) instead.
+way, there may be licensing issues. These can be avoided by linking instead
+with \fBlibedit\fP, which has a BSD licence.
 .P
-Setting this option causes the \fB-lreadline\fP option to be added to the
-\fBpcre2test\fP build. In many operating environments with a sytem-installed
-readline library this is sufficient. However, in some environments (e.g. if an
-unmodified distribution version of readline is in use), some extra
-configuration may be necessary. The INSTALL file for \fBlibreadline\fP says
-this:
+Setting --enable-pcre2test-libreadline causes the \fB-lreadline\fP option to be
+added to the \fBpcre2test\fP build. In many operating environments with a
+system-installed readline library this is sufficient. However, in some
+environments (e.g. if an unmodified distribution version of readline is in
+use), some extra configuration may be necessary. The INSTALL file for
+\fBlibreadline\fP says this:
 .sp
   "Readline uses the termcap functions, but does not link with
   the termcap or curses library itself, allowing applications
@@ -396,13 +401,13 @@
 .SH "DEBUGGING WITH VALGRIND SUPPORT"
 .rs
 .sp
-By adding the
+If you add
 .sp
   --enable-valgrind
 .sp
-option to to the \fBconfigure\fP command, PCRE2 will use valgrind annotations
-to mark certain memory regions as unaddressable. This allows it to detect
-invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
+to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to mark
+certain memory regions as unaddressable. This allows it to detect invalid
+memory accesses, and is mostly useful for debugging PCRE2 itself.
 .
 .
 .SH "CODE COVERAGE REPORTING"
@@ -482,6 +487,6 @@
 .rs
 .sp
 .nf
-Last updated: 03 November 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi
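
The "NEWLINE RECOGNITION" text in the pcre2build.3 changes above lists the sequences that --enable-newline-is-any accepts: CR, LF, CRLF, VT, FF, NEL, LS, and PS. As a minimal sketch of that list, assuming the subject is held as an array of Unicode code points, the check could look like the following. The function name any_newline_length is hypothetical and is not part of PCRE2; this is an illustration, not the library's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function. Returns the length, in code
   points, of the Unicode newline sequence at the start of s[0..len), or 0
   if the text does not start with one. */
size_t any_newline_length(const uint32_t *s, size_t len)
{
  if (len == 0) return 0;
  switch (s[0])
    {
    case 0x000D:  /* CR alone, or CRLF when immediately followed by LF */
      return (len > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A:  /* LF  (linefeed) */
    case 0x000B:  /* VT  (vertical tab) */
    case 0x000C:  /* FF  (form feed) */
    case 0x0085:  /* NEL (next line) */
    case 0x2028:  /* LS  (line separator) */
    case 0x2029:  /* PS  (paragraph separator) */
      return 1;
    default:
      return 0;
    }
}
```

CR is tested first so that CRLF counts as one two-character sequence, not as two separate newlines.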


Modified: code/trunk/doc/pcre2callout.3
===================================================================
--- code/trunk/doc/pcre2callout.3    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2callout.3    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2CALLOUT 3 "19 October 2014" "PCRE2 10.00"
+.TH PCRE2CALLOUT 3 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@@ -68,29 +68,27 @@
 .P
 At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
 what follows cannot be part of the repeat. For example, a+[bc] is compiled as
-if it were a++[bc]. The \fBpcre2test\fP output when this pattern is anchored
-and then applied with automatic callouts to the string "aaaa" is:
+if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
+with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
+"aaaa" is:
 .sp
   --->aaaa
-   +0 ^        ^
-   +1 ^        a+
-   +3 ^   ^    [bc]
+   +0 ^        a+
+   +2 ^   ^    [bc]
   No match
 .sp
 This indicates that when matching [bc] fails, there is no backtracking into a+
 and therefore the callouts that would be taken for the backtracks do not occur.
-You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
-to \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). If
-this is done in \fBpcre2test\fP (using the /no_auto_possess qualifier), the
-output changes to this:
+You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
+\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
+case, the output changes to this:
 .sp
   --->aaaa
-   +0 ^        ^
-   +1 ^        a+
-   +3 ^   ^    [bc]
-   +3 ^  ^     [bc]
-   +3 ^ ^      [bc]
-   +3 ^^       [bc]
+   +0 ^        a+
+   +2 ^   ^    [bc]
+   +2 ^  ^     [bc]
+   +2 ^ ^      [bc]
+   +2 ^^       [bc]
   No match
 .sp
 This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@@ -119,10 +117,10 @@
 .SH "THE CALLOUT INTERFACE"
 .rs
 .sp
-During matching, when PCRE2 reaches a callout point, the external function that
-is set in the match context is called (if it is set). This applies to both
-normal and DFA matching. The only argument to the callout function is a pointer
-to a \fBpcre2_callout\fP block. This structure contains the following fields:
+During matching, when PCRE2 reaches a callout point, if an external function is
+set in the match context, it is called. This applies to both normal and DFA
+matching. The only argument to the callout function is a pointer to a
+\fBpcre2_callout\fP block. This structure contains the following fields:
 .sp
   uint32_t      \fIversion\fP;
   uint32_t      \fIcallout_number\fP;
@@ -149,7 +147,7 @@
 .P
 The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
 (the "ovector") that was passed to the matching function in the match data
-block. When \fBpcre2_match()\fP is used, the contents can be inspected, in
+block. When \fBpcre2_match()\fP is used, the contents can be inspected in
 order to extract substrings that have been matched so far, in the same way as
 for extracting substrings after a match has completed. For the DFA matching
 function, this field is not useful.
@@ -238,6 +236,6 @@
 .rs
 .sp
 .nf
-Last updated: 19 October 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2grep.1
===================================================================
--- code/trunk/doc/pcre2grep.1    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2grep.1    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2GREP 1 "28 September 2014" "PCRE2 10.00"
+.TH PCRE2GREP 1 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 pcre2grep - a grep with Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -403,8 +403,8 @@
 Processing some regular expression patterns can require a very large amount of
 memory, leading in some cases to a program crash if not enough is available.
 Other patterns may take a very long time to search for all possible matching
-strings. The \fBpcre2_exec()\fP function that is called by \fBpcre2grep\fP to do
-the matching has two parameters that can limit the resources that it uses.
+strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
+do the matching has two parameters that can limit the resources that it uses.
 .sp
 The \fB--match-limit\fP option provides a means of limiting resource usage
 when processing patterns that are not going to match, but which have a very
@@ -678,6 +678,6 @@
 .rs
 .sp
 .nf
-Last updated: 28 September 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2grep.txt
===================================================================
--- code/trunk/doc/pcre2grep.txt    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2grep.txt    2014-11-23 18:38:38 UTC (rev 158)
@@ -446,7 +446,7 @@
                  very large amount of memory, leading in some cases to a  pro-
                  gram  crash  if  not enough is available.  Other patterns may
                  take a very long time to search  for  all  possible  matching
-                 strings.   The   pcre2_exec()  function  that  is  called  by
+                 strings.   The  pcre2_match()  function  that  is  called  by
                  pcre2grep to do the matching  has  two  parameters  that  can
                  limit the resources that it uses.


@@ -737,5 +737,5 @@

REVISION

-       Last updated: 28 September 2014
+       Last updated: 23 November 2014
        Copyright (c) 1997-2014 University of Cambridge.


Modified: code/trunk/doc/pcre2jit.3
===================================================================
--- code/trunk/doc/pcre2jit.3    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2jit.3    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "12 November 2014" "PCRE2 10.00"
+.TH PCRE2JIT 3 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@@ -6,11 +6,11 @@
 .sp
 Just-in-time compiling is a heavyweight optimization that can greatly speed up
 pattern matching. However, it comes at the cost of extra processing before the
-match is performed. Therefore, it is of most benefit when the same pattern is
-going to be matched many times. This does not necessarily mean many calls of a
-matching function; if the pattern is not anchored, matching attempts may take
-place many times at various positions in the subject, even for a single call.
-Therefore, if the subject string is very long, it may still pay to use JIT for
+match is performed, so it is of most benefit when the same pattern is going to
+be matched many times. This does not necessarily mean many calls of a matching
+function; if the pattern is not anchored, matching attempts may take place many
+times at various positions in the subject, even for a single call. Therefore,
+if the subject string is very long, it may still pay to use JIT even for
 one-off matches. JIT support is available for all of the 8-bit, 16-bit and
 32-bit PCRE2 libraries.
 .P
@@ -77,7 +77,7 @@
 PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
 PCRE2_JIT_COMPLETE and just compile code for partial matching. If
 \fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
-returns zero. This is an alternative way of testing if JIT is available.
+returns zero. This is an alternative way of testing whether JIT is available.
 .P
 At present, it is not possible to free JIT compiled code except when the entire
 compiled pattern is freed by calling \fBpcre2_free_code()\fP.
@@ -276,7 +276,7 @@
 not\fP call \fBpcre2_match()\fP with a match context pointing to an already
 freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
 used by \fBpcre2_match()\fP in another thread). You can also replace the stack
-in a context at any time when it is not in use. You can also free the previous
+in a context at any time when it is not in use. You should free the previous
 stack before assigning a replacement.
 .P
 (5) Should I allocate/free a stack every time before/after calling
@@ -398,6 +398,6 @@
 .rs
 .sp
 .nf
-Last updated: 12 November 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2syntax.3    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
+.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -394,7 +394,7 @@
   (*UCP)          set PCRE2_UCP (use Unicode properties for \ed etc)
 .sp
 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_exec(), not increase them.
+limits set by the caller of pcre2_match(), not increase them.
 .
 .
 .SH "NEWLINE CONVENTION"
@@ -536,6 +536,6 @@
 .rs
 .sp
 .nf
-Last updated: 14 November 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2test.1
===================================================================
--- code/trunk/doc/pcre2test.1    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2test.1    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
+.TH PCRE2TEST 1 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -200,7 +200,7 @@
 number of subject lines to be matched against that pattern. In between sets of
 test data, command lines that begin with a hash (#) character may appear. This
 file format, with some restrictions, can also be processed by the
-\fBperltest.pl\fP script that is distributed with PCRE2 as a means of checking
+\fBperltest.sh\fP script that is distributed with PCRE2 as a means of checking
 that the behaviour of PCRE2 and Perl is the same.
 .P
 Each subject line is matched separately and independently. If you want to do
@@ -243,11 +243,11 @@
   #perltest
 .sp
 The appearance of this line causes all subsequent modifier settings to be
-checked for compatibility with the \fBperltest.pl\fP script, which is used to
+checked for compatibility with the \fBperltest.sh\fP script, which is used to
 confirm that Perl gives the same results as PCRE2. Also, apart from comment
 lines, none of the other command lines are permitted, because they and many
 of the modifiers are specific to \fBpcre2test\fP, and should not be used in
-test files that are also processed by \fBperltest.pl\fP. The \fP#perltest\fB
+test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
 command helps detect tests that are accidentally put in the wrong file.
 .sp
   #subject <modifier-list>
@@ -265,7 +265,7 @@
 other only. Each modifier has a long name, for example "anchored", and some of
 them must be followed by an equals sign and a value, for example, "offset=12".
 Modifiers that do not take values may be preceded by a minus sign to turn off a
-previous default setting.
+previous setting.
 .P
 A few of the more common modifiers can also be specified as single letters, for
 example "i" for "caseless". In documentation, following the Perl convention,
@@ -336,7 +336,7 @@
   \exhh       hexadecimal byte (up to 2 hex digits)
   \ex{hh...}  hexadecimal character (any number of hex digits)
 .sp
-The use of \ex{hh...} is not dependent on the use of the utf modifier on
+The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
 the pattern. It is recognized always. There may be any number of hexadecimal
 digits inside the braces; invalid values provoke error messages.
 .P
@@ -366,7 +366,7 @@
 is converted to "abcabcabcabc". This feature does not support nesting. To
 include a closing square bracket in the characters, code it as \ex5D.
 .P
-A backslash followed by an equals sign marke the end of the subject string and
+A backslash followed by an equals sign marks the end of the subject string and
 the start of a modifier list. For example:
 .sp
   abc\e=notbol,notempty
@@ -461,8 +461,8 @@
 is built, with the default default being Unicode.
 .P
 The \fBnewline\fP modifier specifies which characters are to be interpreted as
-newlines, both in the pattern and (by default) in subject lines. The type must
-be one of CR, LF, CRLF, ANYCRLF, or ANY.
+newlines, both in the pattern and in subject lines. The type must be one of CR,
+LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
 .
 .
 .SS "Information about a pattern"
@@ -478,8 +478,8 @@
 regression tests can be used in different environments.
 .P
 The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
-offset values. This is used in a few special tests and is also useful for
-one-off tests.
+offset values. This is used in a few special tests that run only for specific 
+code unit widths and link sizes, and is also useful for one-off tests.
 .P
 The \fBinfo\fP modifier requests information about the compiled pattern
 (whether it is anchored, has a fixed first character, and so on). The
@@ -501,13 +501,14 @@
   Last code unit = 'c' (caseless)
   Subject length lower bound = 3
 .sp
-"Compile options" are those specified to the compile function; "overall
-options" have added options that are taken or deduced from the pattern. If both
-sets of options are the same, just a single "options" line is output. "First
-code unit" is where any match must start; if there is more than one they are
-listed as "starting code units". "Last code unit" is the last literal code unit
-that must be present in any match. This is not necessarily the last character.
-These lines are omitted if no starting or ending code units are recorded.
+"Compile options" are those specified by modifiers; "overall options" have
+added options that are taken or deduced from the pattern. If both sets of
+options are the same, just a single "options" line is output; if there are no 
+options, the line is omitted. "First code unit" is where any match must start;
+if there is more than one they are listed as "starting code units". "Last code
+unit" is the last literal code unit that must be present in any match. This is
+not necessarily the last character. These lines are omitted if no starting or
+ending code units are recorded.
 .
 .
 .SS "Specifying a pattern in hex"
@@ -520,16 +521,16 @@
   /ab 32 59/hex
 .sp
 This feature is provided as a way of creating patterns that contain binary zero
-characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
-strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
-However, for patterns specified in hexadecimal, the actual length of the
-pattern is passed.
+and other non-printing characters. By default, \fBpcre2test\fP passes patterns
+as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
+PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
+actual length of the pattern is passed.
 .
 .
 .SS "JIT compilation"
 .rs
 .sp
-The \fB/jit\fP modifier may optionally be followed by and equals sign and a
+The \fB/jit\fP modifier may optionally be followed by an equals sign and a
 number in the range 0 to 7:
 .sp
   0  disable JIT
@@ -561,7 +562,7 @@
 \fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
 compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
 added to the first output line after a match or non match when JIT-compiled
-code was actually used.
+code was actually used in the match.
 .
 .
 .SS "Setting a locale"
@@ -645,8 +646,8 @@
 .SS "Using alternative character tables"
 .rs
 .sp
-The \fB/tables\fP modifier must be followed by a single digit. It causes a
-specific set of built-in character tables to be passed to
+The value specified for the \fB/tables\fP modifier must be one of the digits 0, 
+1, or 2. It causes a specific set of built-in character tables to be passed to
 \fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
 different character tables. The digit specifies the tables as follows:
 .sp
@@ -759,13 +760,13 @@
 .SS "Showing more text"
 .rs
 .sp
-The \fBaftertext\fP modifier requests that as well as outputting the substring
-that matched the entire pattern, \fBpcre2test\fP should in addition output the
-remainder of the subject string. This is useful for tests where the subject
-contains multiple copies of the same substring. The \fBallaftertext\fP modifier
-requests the same action for captured substrings as well as the main matched
-substring. In each case the remainder is output on the following line with a
-plus character following the capture number.
+The \fBaftertext\fP modifier requests that as well as outputting the part of 
+the subject string that matched the entire pattern, \fBpcre2test\fP should in
+addition output the remainder of the subject string. This is useful for tests
+where the subject contains multiple copies of the same substring. The
+\fBallaftertext\fP modifier requests the same action for captured substrings as
+well as the main matched substring. In each case the remainder is output on the
+following line with a plus character following the capture number.
 .P
 The \fBallusedtext\fP modifier requests that all the text that was consulted
 during a successful pattern match by the interpreter should be shown. This
@@ -782,7 +783,8 @@
       <<<   >>>
 .sp
 This shows that the matched string is "abc", with the preceding and following
-strings "pqr" and "xyz" also consulted during the match.
+strings "pqr" and "xyz" having been consulted during the match (when processing 
+the assertions).
 .P
 The \fBstartchar\fP modifier requests that the starting character for the match
 be indicated, if it is different to the start of the matched string. The only
@@ -836,7 +838,7 @@
 between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
 \fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
 to start searching at a new point within the entire string (which is what Perl
-does), whereas the latter passes over a shortened substring. This makes a
+does), whereas the latter passes over a shortened subject. This makes a
 difference to the matching process if the pattern begins with a lookbehind
 assertion (including \eb or \eB).
 .P
@@ -847,7 +849,7 @@
 imitates the way Perl handles such cases when using the \fB/g\fP modifier or
 the \fBsplit()\fP function. Normally, the start offset is advanced by one
 character, but if the newline convention recognizes CRLF as a newline, and the
-current character is CR followed by LF, an advance of two is used.
+current character is CR followed by LF, an advance of two characters occurs.
 .
 .
 .SS "Testing substring extraction functions"
@@ -860,9 +862,9 @@
 .sp
    abcd\e=copy=1,copy=3,get=G1
 .sp
-If the \fB#subject\fP command is used to set default copy and get lists, these
-can be unset by specifying a negative number for numbered groups and an empty
-name for named groups.
+If the \fB#subject\fP command is used to set default copy and/or get lists,
+these can be unset by specifying a negative number to cancel all numbered
+groups and an empty name to cancel all named groups.
 .P
 The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
 extracts all captured substrings.
@@ -871,7 +873,8 @@
 convenience functions are output with C, G, or L after the string number
 instead of a colon. This is in addition to the normal full list. The string
 length (that is, the return from the extraction function) is given in
-parentheses after each substring.
+parentheses after each substring, followed by the name when the extraction was 
+by name.
 .
 .
 .SS "Testing the substitution function"
@@ -1044,11 +1047,10 @@
 characters before the actual match start if a lookbehind assertion, \eK, \eb,
 or \eB was involved.)
 .P
-For any other return, \fBpcre2test\fP outputs the PCRE2
-negative error number and a short descriptive phrase. If the error is a failed
-UTF string check, the offset of the start of the failing character and the
-reason code are also output. Here is an example of an interactive
-\fBpcre2test\fP run.
+For any other return, \fBpcre2test\fP outputs the PCRE2 negative error number
+and a short descriptive phrase. If the error is a failed UTF string check, the
+code unit offset of the start of the failing character is also output. Here is
+an example of an interactive \fBpcre2test\fP run.
 .sp
   $ pcre2test
   PCRE2 version 9.00 2014-05-10
@@ -1061,10 +1063,10 @@
   No match
 .sp
 Unset capturing substrings that are not followed by one that is set are not
-returned by \fBpcre2_match()\fP, and are not shown by \fBpcre2test\fP. In the
-following example, there are two capturing substrings, but when the first data
-line is matched, the second, unset substring is not shown. An "internal" unset
-substring is shown as "<unset>", as for the second data line.
+shown by \fBpcre2test\fP unless the \fBallcaptures\fP modifier is specified. In
+the following example, there are two capturing substrings, but when the first
+data line is matched, the second, unset substring is not shown. An "internal"
+unset substring is shown as "<unset>", as for the second data line.
 .sp
     re> /(a)|(b)/
   data> a
@@ -1100,8 +1102,8 @@
    1: pp
 .sp
 "No match" is output only if the first match attempt fails. Here is an example
-of a failure message (the offset 4 that is specified by \e>4 is past the end of
-the subject string):
+of a failure message (the offset 4 that is specified by the \fBoffset\fP 
+modifier is past the end of the subject string):
 .sp
     re> /xyz/
   data> xyz\e=offset=4
@@ -1127,12 +1129,13 @@
    1: tang
    2: tan
 .sp
-(Using the normal matching function on this data finds only "tang".) The
+Using the normal matching function on this data finds only "tang". The
 longest matching string is always given first (and numbered zero). After a
 PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
-partially matching substring. (Note that this is the entire substring that was
+partially matching substring. Note that this is the entire substring that was
 inspected during the partial match; it may include characters before the actual
-match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
+match start if a lookbehind assertion, \eb, or \eB was involved. (\eK is not 
+supported for DFA matching.)
 .P
 If global matching is requested, the search for further matches resumes
 at the end of the longest match. For example:
@@ -1174,9 +1177,9 @@
 .SH CALLOUTS
 .rs
 .sp
-If the pattern contains any callout requests, \fBpcre2test\fP's callout function
-is called during matching. This works with both matching functions. By default,
-the called function displays the callout number, the start and current
+If the pattern contains any callout requests, \fBpcre2test\fP's callout
+function is called during matching. This works with both matching functions. By
+default, the called function displays the callout number, the start and current
 positions in the text at the callout time, and the next pattern item to be
 tested. For example:
 .sp
@@ -1271,6 +1274,6 @@
 .rs
 .sp
 .nf
-Last updated: 14 November 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2unicode.3
===================================================================
--- code/trunk/doc/pcre2unicode.3    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/doc/pcre2unicode.3    2014-11-23 18:38:38 UTC (rev 158)
@@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "03 November 2014" "PCRE2 10.00"
+.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
@@ -64,7 +64,7 @@
 \fBpcre2pattern\fP
 .\"
 documentation). The use of \eC is not supported in the alternative matching
-function \fBpcre2_dfa_exec()\fP, nor is it supported in UTF mode by the JIT
+function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
 optimization. If JIT optimization is requested for a UTF pattern that contains
 \eC, it will not succeed, and so the matching will be carried out by the normal
 interpretive function.
@@ -108,7 +108,10 @@
 .sp
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
 are (by default) checked for validity on entry to the relevant functions.
-If an invalid UTF string is passed, an error return is given.
+If an invalid UTF string is passed, a negative error code is returned. The
+code unit offset to the offending character can be extracted from the match
+data block by calling \fBpcre2_get_startchar()\fP, which is used for this
+purpose after a UTF error.
 .P
 UTF-16 and UTF-32 strings can indicate their endianness by special code known
 as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
@@ -130,14 +133,14 @@
 In some situations, you may already know that your strings are valid, and
 therefore want to skip these checks in order to improve performance, for
 example in the case of a long subject string that is being scanned repeatedly.
-If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
-assumes that the pattern or subject it is given (respectively) contains only
-valid UTF code unit sequences.
+If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+PCRE2 assumes that the pattern or subject it is given (respectively) contains
+only valid UTF code unit sequences.
 .P
 Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
 the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this option to \fBpcre2_exec()\fP
-or \fBpcre2_dfa_exec()\fP.
+the check for a subject string you must pass this option to \fBpcre2_match()\fP
+or \fBpcre2_dfa_match()\fP.
 .P
 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
 is undefined and your program may crash or loop indefinitely.
@@ -249,6 +252,6 @@
 .rs
 .sp
 .nf
-Last updated: 03 November 2014
+Last updated: 23 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi
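
The pcre2unicode.3 changes above state that a failed UTF check reports the code unit offset of the offending character. To illustrate what "offset of the start of the failing character" means for UTF-8, here is a simplified sketch. The function first_invalid_utf8 is hypothetical and is not PCRE2's validator. It checks only that each lead byte is followed by the right number of continuation bytes, and it does not reject overlong forms, surrogates, or code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical structural UTF-8 check, far less strict than PCRE2's own.
   Returns the code unit offset of the start of the first malformed
   character, or length if no malformed character is found. */
size_t first_invalid_utf8(const unsigned char *s, size_t length)
{
  size_t i = 0;
  while (i < length)
    {
    unsigned char c = s[i];
    size_t extra, k;

    if (c < 0x80) extra = 0;                  /* single-byte character */
    else if ((c & 0xE0) == 0xC0) extra = 1;   /* 2-byte lead */
    else if ((c & 0xF0) == 0xE0) extra = 2;   /* 3-byte lead */
    else if ((c & 0xF8) == 0xF0) extra = 3;   /* 4-byte lead */
    else return i;                /* stray continuation or invalid lead */

    if (extra > length - i - 1) return i;     /* truncated at end of string */
    for (k = 1; k <= extra; k++)
      if ((s[i + k] & 0xC0) != 0x80) return i;  /* not a continuation byte */

    i += extra + 1;
    }
  return length;
}
```

The offset returned is always that of the lead byte, even when the fault is in a later continuation byte. This matches the "start of the failing character" wording in the documentation changes.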


Modified: code/trunk/src/pcre2_dfa_match.c
===================================================================
--- code/trunk/src/pcre2_dfa_match.c    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/src/pcre2_dfa_match.c    2014-11-23 18:38:38 UTC (rev 158)
@@ -3224,12 +3224,8 @@
 #ifdef SUPPORT_UNICODE
 if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
   {
-  match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
-  if (match_data->rc != 0)
-    {
-    match_data->leftchar = 0;
-    return match_data->rc;
-    }
+  match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
+  if (match_data->rc != 0) return match_data->rc;
 #if PCRE2_CODE_UNIT_WIDTH != 32
   if (start_offset > 0 && start_offset < length &&
       NOT_FIRSTCHAR(subject[start_offset]))


Modified: code/trunk/src/pcre2_match.c
===================================================================
--- code/trunk/src/pcre2_match.c    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/src/pcre2_match.c    2014-11-23 18:38:38 UTC (rev 158)
@@ -6459,12 +6459,8 @@
 #ifdef SUPPORT_UNICODE
 if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
   {
-  match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
-  if (match_data->rc != 0)
-    {
-    match_data->leftchar = 0;
-    return match_data->rc;
-    }
+  match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
+  if (match_data->rc != 0) return match_data->rc;
 #if PCRE2_CODE_UNIT_WIDTH != 32
   if (start_offset > 0 && start_offset < length &&
       NOT_FIRSTCHAR(subject[start_offset]))


Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/src/pcre2test.c    2014-11-23 18:38:38 UTC (rev 158)
@@ -5570,6 +5570,13 @@
       fprintf(outfile, "Failed: error %d: ", capcount);
       PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
       PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
+      if (capcount <= PCRE2_ERROR_UTF8_ERR1 && 
+          capcount >= PCRE2_ERROR_UTF32_ERR2)
+        {
+        PCRE2_SIZE startchar;
+        PCRE2_GET_STARTCHAR(startchar, match_data);
+        fprintf(outfile, " at offset %ld", startchar);
+        }    
       fprintf(outfile, "\n");
       break;
       }
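Note on the added fprintf() call: PCRE2_SIZE is defined as size_t in the released pcre2.h, so printing it with %ld relies on long and size_t having the same width. That does not hold on every platform (64-bit Windows is one example). What follows is a minimal sketch of a portable alternative, assuming a C99 library that supports %zu. format_offset() is a hypothetical helper used only for illustration and is not part of pcre2test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not in pcre2test.c): writes " at offset N" into buf
   using %zu, the C99 conversion for size_t. Returns the value returned by
   snprintf(), i.e. the number of characters that the full text needs. */
int format_offset(char *buf, size_t buflen, size_t startchar)
{
  assert(buf != NULL && buflen > 0);
  return snprintf(buf, buflen, " at offset %zu", startchar);
}
```

Where %zu is not available, casting the value to unsigned long and printing it with %lu is a common fallback.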


Modified: code/trunk/testdata/testinput10
===================================================================
--- code/trunk/testdata/testinput10    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testinput10    2014-11-23 18:38:38 UTC (rev 158)
@@ -48,12 +48,12 @@
 /\xC3\xC3\xC3xxx/utf


 /badutf/utf
-    \xdf
-    \xef
-    \xef\x80
-    \xf7
-    \xf7\x80
-    \xf7\x80\x80
+    X\xdf
+    XX\xef
+    XXX\xef\x80
+    X\xf7
+    XX\xf7\x80
+    XXX\xf7\x80\x80
     \xfb
     \xfb\x80
     \xfb\x80\x80
@@ -89,14 +89,14 @@
     \xff


 /badutf/utf
-    \xfb\x80\x80\x80\x80
-    \xfd\x80\x80\x80\x80\x80
-    \xf7\xbf\xbf\xbf
+    XX\xfb\x80\x80\x80\x80
+    XX\xfd\x80\x80\x80\x80\x80
+    XX\xf7\xbf\xbf\xbf


 /shortutf/utf
-    \xdf\=ph
-    \xef\=ph
-    \xef\x80\=ph
+    XX\xdf\=ph
+    XX\xef\=ph
+    XX\xef\x80\=ph
     \xf7\=ph
     \xf7\x80\=ph
     \xf7\x80\x80\=ph
@@ -111,9 +111,9 @@
     \xfd\x80\x80\x80\x80\=ph


 /anything/utf
-    \xc0\x80
-    \xc1\x8f
-    \xe0\x9f\x80
+    X\xc0\x80
+    XX\xc1\x8f
+    XXX\xe0\x9f\x80
     \xf0\x8f\x80\x80
     \xf8\x87\x80\x80\x80
     \xfc\x83\x80\x80\x80\x80


Modified: code/trunk/testdata/testinput12
===================================================================
--- code/trunk/testdata/testinput12    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testinput12    2014-11-23 18:38:38 UTC (rev 158)
@@ -157,18 +157,18 @@
 /^[\QĀ\E-\QŐ\E/B,utf


 /X/utf
-    \x{d800}
-    \x{d800}\=no_utf_check
-    \x{da00}
-    \x{da00}\=no_utf_check
-    \x{dc00}
-    \x{dc00}\=no_utf_check
-    \x{de00}
-    \x{de00}\=no_utf_check
-    \x{dfff}
-    \x{dfff}\=no_utf_check
-    \x{110000}
-    \x{d800}\x{1234}
+    XX\x{d800}
+    XX\x{d800}\=no_utf_check
+    XX\x{da00}
+    XX\x{da00}\=no_utf_check
+    XX\x{dc00}
+    XX\x{dc00}\=no_utf_check
+    XX\x{de00}
+    XX\x{de00}\=no_utf_check
+    XX\x{dfff}
+    XX\x{dfff}\=no_utf_check
+    XX\x{110000}
+    XX\x{d800}\x{1234}


/(*UTF16)\x{11234}/
abcd\x{11234}pqr

Modified: code/trunk/testdata/testoutput10
===================================================================
--- code/trunk/testdata/testoutput10    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testoutput10    2014-11-23 18:38:38 UTC (rev 158)
@@ -73,142 +73,142 @@
 Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80


 /badutf/utf
-    \xdf
-Failed: error -3: UTF-8 error: 1 byte missing at end
-    \xef
-Failed: error -4: UTF-8 error: 2 bytes missing at end
-    \xef\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
-    \xf7
-Failed: error -5: UTF-8 error: 3 bytes missing at end
-    \xf7\x80
-Failed: error -4: UTF-8 error: 2 bytes missing at end
-    \xf7\x80\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
+    X\xdf
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 1
+    XX\xef
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
+    XXX\xef\x80
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
+    X\xf7
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 1
+    XX\xf7\x80
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
+    XXX\xf7\x80\x80
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
     \xfb
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
     \xfb\x80
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
     \xfb\x80\x80
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
     \xfb\x80\x80\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
     \xfd
-Failed: error -7: UTF-8 error: 5 bytes missing at end
+Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
     \xfd\x80
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
     \xfd\x80\x80
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
     \xfd\x80\x80\x80
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
     \xfd\x80\x80\x80\x80
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
     \xdf\x7f
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
     \xef\x7f\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
     \xef\x80\x7f
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
     \xf7\x7f\x80\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
     \xf7\x80\x7f\x80
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
     \xf7\x80\x80\x7f
-Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
+Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
     \xfb\x7f\x80\x80\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
     \xfb\x80\x7f\x80\x80
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
     \xfb\x80\x80\x7f\x80
-Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
+Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
     \xfb\x80\x80\x80\x7f
-Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
+Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
     \xfd\x7f\x80\x80\x80\x80
-Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
+Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
     \xfd\x80\x7f\x80\x80\x80
-Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
+Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
     \xfd\x80\x80\x7f\x80\x80
-Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
+Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
     \xfd\x80\x80\x80\x7f\x80
-Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
+Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
     \xfd\x80\x80\x80\x80\x7f
-Failed: error -12: UTF-8 error: byte 6 top bits not 0x80
+Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 at offset 0
     \xed\xa0\x80
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
     \xc0\x8f
-Failed: error -17: UTF-8 error: overlong 2-byte sequence
+Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 0
     \xe0\x80\x8f
-Failed: error -18: UTF-8 error: overlong 3-byte sequence
+Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 0
     \xf0\x80\x80\x8f
-Failed: error -19: UTF-8 error: overlong 4-byte sequence
+Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
     \xf8\x80\x80\x80\x8f
-Failed: error -20: UTF-8 error: overlong 5-byte sequence
+Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
     \xfc\x80\x80\x80\x80\x8f
-Failed: error -21: UTF-8 error: overlong 6-byte sequence
+Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
     \x80
-Failed: error -22: UTF-8 error: isolated 0x80 byte
+Failed: error -22: UTF-8 error: isolated 0x80 byte at offset 0
     \xfe
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
     \xff
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0


 /badutf/utf
-    \xfb\x80\x80\x80\x80
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
-    \xfd\x80\x80\x80\x80\x80
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
-    \xf7\xbf\xbf\xbf
-Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
+    XX\xfb\x80\x80\x80\x80
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 2
+    XX\xfd\x80\x80\x80\x80\x80
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 2
+    XX\xf7\xbf\xbf\xbf
+Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 2


 /shortutf/utf
-    \xdf\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
-    \xef\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
-    \xef\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+    XX\xdf\=ph
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
+    XX\xef\=ph
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
+    XX\xef\x80\=ph
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
     \xf7\=ph
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
     \xf7\x80\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
     \xf7\x80\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
     \xfb\=ph
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
     \xfb\x80\=ph
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
     \xfb\x80\x80\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
     \xfb\x80\x80\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
     \xfd\=ph
-Failed: error -7: UTF-8 error: 5 bytes missing at end
+Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
     \xfd\x80\=ph
-Failed: error -6: UTF-8 error: 4 bytes missing at end
+Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
     \xfd\x80\x80\=ph
-Failed: error -5: UTF-8 error: 3 bytes missing at end
+Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
     \xfd\x80\x80\x80\=ph
-Failed: error -4: UTF-8 error: 2 bytes missing at end
+Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
     \xfd\x80\x80\x80\x80\=ph
-Failed: error -3: UTF-8 error: 1 byte missing at end
+Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0


 /anything/utf
-    \xc0\x80
-Failed: error -17: UTF-8 error: overlong 2-byte sequence
-    \xc1\x8f
-Failed: error -17: UTF-8 error: overlong 2-byte sequence
-    \xe0\x9f\x80
-Failed: error -18: UTF-8 error: overlong 3-byte sequence
+    X\xc0\x80
+Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 1
+    XX\xc1\x8f
+Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 2
+    XXX\xe0\x9f\x80
+Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 3
     \xf0\x8f\x80\x80
-Failed: error -19: UTF-8 error: overlong 4-byte sequence
+Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
     \xf8\x87\x80\x80\x80
-Failed: error -20: UTF-8 error: overlong 5-byte sequence
+Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
     \xfc\x83\x80\x80\x80\x80
-Failed: error -21: UTF-8 error: overlong 6-byte sequence
+Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
     \xfe\x80\x80\x80\x80\x80
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
     \xff\x80\x80\x80\x80\x80
-Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
+Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
     \xc3\x8f
 No match
     \xe0\xaf\x80
@@ -220,13 +220,13 @@
     \xf1\x8f\x80\x80
 No match
     \xf8\x88\x80\x80\x80
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
     \xf9\x87\x80\x80\x80
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
     \xfc\x84\x80\x80\x80\x80
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
     \xfd\x83\x80\x80\x80\x80
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
     \xf8\x88\x80\x80\x80\=no_utf_check
 No match
     \xf9\x87\x80\x80\x80\=no_utf_check
@@ -751,27 +751,27 @@


 /X/utf
     \x{d800}
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
     \x{d800}\=no_utf_check
 No match
     \x{da00}
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
     \x{da00}\=no_utf_check
 No match
     \x{dfff}
-Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
+Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
     \x{dfff}\=no_utf_check
 No match
     \x{110000}
-Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
+Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 0
     \x{110000}\=no_utf_check
 No match
     \x{2000000}
-Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
+Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
     \x{2000000}\=no_utf_check
 No match
     \x{7fffffff}
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
     \x{7fffffff}\=no_utf_check
 No match


@@ -1106,7 +1106,7 @@
\x{ff000041}
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
\x{7f000041}
-Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
+Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0

/(*UTF8)abc/never_utf
Failed: error 174 at offset 7: using UTF is disabled by the application
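In the UTF-8 cases above, the reported offset is the position of the first byte of the offending character (for example, offset 3 for XXX\xef\x80). The following is a minimal sketch of a validator that reports an offset in the same way. It is not PCRE2's internal valid_utf() function: check_utf8() and the UTF8_* codes are hypothetical names, their values do not correspond to PCRE2's error numbers, and the sketch treats 5- and 6-byte forms as bad lead bytes instead of reporting them separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; these are NOT PCRE2's values. */
enum {
  UTF8_OK = 0,
  UTF8_TRUNCATED = -1,        /* string ends inside a character */
  UTF8_BAD_CONTINUATION = -2, /* trailing byte is not 10xxxxxx */
  UTF8_OVERLONG = -3,         /* code point encoded in too many bytes */
  UTF8_BAD_CODEPOINT = -4,    /* surrogate or greater than 0x10ffff */
  UTF8_BAD_LEAD = -5          /* isolated 0x80-0xbf, or 0xf8-0xff */
};

/* Checks s[0..len) against RFC 3629. On failure, stores the offset of the
   first byte of the offending character in *erroroffset and returns a
   negative code. On success, *erroroffset is left unchanged. */
int check_utf8(const unsigned char *s, size_t len, size_t *erroroffset)
{
  /* Smallest code point that legitimately needs 1, 2 or 3 trailing bytes. */
  static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
  size_t i = 0;

  assert(s != NULL && erroroffset != NULL);
  while (i < len) {
    unsigned char c = s[i];
    unsigned long cp = 0;
    size_t extra = 0, k;
    int rc = UTF8_OK;

    if (c < 0x80) { i++; continue; }             /* ASCII */

    if (c < 0xc0 || c >= 0xf8) rc = UTF8_BAD_LEAD;
    else if (c < 0xe0) { extra = 1; cp = c & 0x1f; }
    else if (c < 0xf0) { extra = 2; cp = c & 0x0f; }
    else               { extra = 3; cp = c & 0x07; }

    if (rc == UTF8_OK && len - i - 1 < extra) rc = UTF8_TRUNCATED;

    for (k = 1; rc == UTF8_OK && k <= extra; k++) {
      if ((s[i + k] & 0xc0) != 0x80) rc = UTF8_BAD_CONTINUATION;
      else cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3f);
    }

    if (rc == UTF8_OK) {
      if (cp < min_cp[extra]) rc = UTF8_OVERLONG;
      else if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        rc = UTF8_BAD_CODEPOINT;
    }

    if (rc != UTF8_OK) { *erroroffset = i; return rc; }
    i += extra + 1;
  }
  return UTF8_OK;
}
```

A caller would print the stored offset only when the return value is negative, much as the pcre2test change in this revision prints " at offset" only for UTF error codes.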

Modified: code/trunk/testdata/testoutput12-16
===================================================================
--- code/trunk/testdata/testoutput12-16    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testoutput12-16    2014-11-23 18:38:38 UTC (rev 158)
@@ -607,30 +607,30 @@
 Failed: error 106 at offset 13: missing terminating ] for character class


 /X/utf
-    \x{d800}
-Failed: error -24: UTF-16 error: missing low surrogate at end
-    \x{d800}\=no_utf_check
-No match
-    \x{da00}
-Failed: error -24: UTF-16 error: missing low surrogate at end
-    \x{da00}\=no_utf_check
-No match
-    \x{dc00}
-Failed: error -26: UTF-16 error: isolated low surrogate
-    \x{dc00}\=no_utf_check
-No match
-    \x{de00}
-Failed: error -26: UTF-16 error: isolated low surrogate
-    \x{de00}\=no_utf_check
-No match
-    \x{dfff}
-Failed: error -26: UTF-16 error: isolated low surrogate
-    \x{dfff}\=no_utf_check
-No match
-    \x{110000}
+    XX\x{d800}
+Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
+    XX\x{d800}\=no_utf_check
+ 0: X
+    XX\x{da00}
+Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
+    XX\x{da00}\=no_utf_check
+ 0: X
+    XX\x{dc00}
+Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
+    XX\x{dc00}\=no_utf_check
+ 0: X
+    XX\x{de00}
+Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
+    XX\x{de00}\=no_utf_check
+ 0: X
+    XX\x{dfff}
+Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
+    XX\x{dfff}\=no_utf_check
+ 0: X
+    XX\x{110000}
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
-    \x{d800}\x{1234}
-Failed: error -25: UTF-16 error: invalid low surrogate
+    XX\x{d800}\x{1234}
+Failed: error -25: UTF-16 error: invalid low surrogate at offset 3


/(*UTF16)\x{11234}/
abcd\x{11234}pqr
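In the 16-bit expected output above, an isolated low surrogate and a high surrogate at the end of the subject are both reported at the surrogate's own offset (2), whereas a high surrogate followed by a non-surrogate is reported at offset 3, the position of the unit that should have been a low surrogate. The following is a minimal sketch that reproduces those three offsets. It is not PCRE2's implementation: check_utf16() and the UTF16_* codes are hypothetical names, and their values are not PCRE2's error numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; these are NOT PCRE2's values. */
enum {
  UTF16_OK = 0,
  UTF16_MISSING_LOW = -1,   /* high surrogate is the last code unit */
  UTF16_INVALID_LOW = -2,   /* high surrogate not followed by a low one */
  UTF16_ISOLATED_LOW = -3   /* low surrogate without a preceding high one */
};

/* Checks s[0..len) for well-formed surrogate pairs. On failure, stores an
   offset in code units in *erroroffset and returns a negative code. */
int check_utf16(const uint16_t *s, size_t len, size_t *erroroffset)
{
  size_t i;

  assert(s != NULL && erroroffset != NULL);
  for (i = 0; i < len; i++) {
    uint16_t c = s[i];

    if ((c & 0xf800) != 0xd800) continue;   /* not a surrogate */

    if ((c & 0x0400) != 0) {                /* 0xdc00-0xdfff: low first */
      *erroroffset = i;
      return UTF16_ISOLATED_LOW;
    }
    if (i + 1 >= len) {                     /* high surrogate at the end */
      *erroroffset = i;
      return UTF16_MISSING_LOW;
    }
    if ((s[i + 1] & 0xfc00) != 0xdc00) {    /* next unit is not a low one */
      *erroroffset = i + 1;
      return UTF16_INVALID_LOW;
    }
    i++;                                    /* skip the low surrogate */
  }
  return UTF16_OK;
}
```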

Modified: code/trunk/testdata/testoutput12-32
===================================================================
--- code/trunk/testdata/testoutput12-32    2014-11-21 16:45:06 UTC (rev 157)
+++ code/trunk/testdata/testoutput12-32    2014-11-23 18:38:38 UTC (rev 158)
@@ -600,30 +600,30 @@
 Failed: error 106 at offset 13: missing terminating ] for character class


 /X/utf
-    \x{d800}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
-    \x{d800}\=no_utf_check
-No match
-    \x{da00}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
-    \x{da00}\=no_utf_check
-No match
-    \x{dc00}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
-    \x{dc00}\=no_utf_check
-No match
-    \x{de00}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
-    \x{de00}\=no_utf_check
-No match
-    \x{dfff}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
-    \x{dfff}\=no_utf_check
-No match
-    \x{110000}
-Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
-    \x{d800}\x{1234}
-Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
+    XX\x{d800}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+    XX\x{d800}\=no_utf_check
+ 0: X
+    XX\x{da00}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+    XX\x{da00}\=no_utf_check
+ 0: X
+    XX\x{dc00}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+    XX\x{dc00}\=no_utf_check
+ 0: X
+    XX\x{de00}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+    XX\x{de00}\=no_utf_check
+ 0: X
+    XX\x{dfff}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
+    XX\x{dfff}\=no_utf_check
+ 0: X
+    XX\x{110000}
+Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 2
+    XX\x{d800}\x{1234}
+Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2


/(*UTF16)\x{11234}/
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
@@ -1113,7 +1113,7 @@

 /\C/utf
     \x{110000}
-Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
+Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0


/\x{100}*A/IB,utf
------------------------------------------------------------------