[Pcre-svn] [820] code/trunk: Implement REG_PEND (GNU extension) for the POSIX wrapper.

Author: Subversion repository
Date:
To: pcre-svn

Revision: 820

          http://www.exim.org/viewvc/pcre2?view=rev&revision=820
Author:   ph10
Date:     2017-06-05 19:25:47 +0100 (Mon, 05 Jun 2017)
Log Message:
-----------
Implement REG_PEND (GNU extension) for the POSIX wrapper.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/html/pcre2posix.html
    code/trunk/doc/html/pcre2test.html
    code/trunk/doc/pcre2.txt
    code/trunk/doc/pcre2posix.3
    code/trunk/doc/pcre2test.txt
    code/trunk/src/pcre2posix.c
    code/trunk/src/pcre2posix.h
    code/trunk/src/pcre2test.c
    code/trunk/testdata/testinput18
    code/trunk/testdata/testinput19
    code/trunk/testdata/testoutput18
    code/trunk/testdata/testoutput19

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/ChangeLog    2017-06-05 18:25:47 UTC (rev 820)
@@ -182,7 +182,9 @@
 38. Fix returned offsets from regexec() when REG_STARTEND is used with a 
 starting offset greater than zero.

+39. Implement REG_PEND (GNU extension) for the POSIX wrapper.

+
Version 10.23 14-February-2017
------------------------------

Modified: code/trunk/doc/html/pcre2posix.html
===================================================================
--- code/trunk/doc/html/pcre2posix.html    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/doc/html/pcre2posix.html    2017-06-05 18:25:47 UTC (rev 820)
@@ -69,7 +69,7 @@
 <P>
 There are also some options that are not defined by POSIX. These have been
 added at the request of users who want to make use of certain PCRE2-specific
-features via the POSIX calling interface.
+features via the POSIX calling interface or to add BSD or GNU functionality.
 </P>
 <P>
 When PCRE2 is called via these functions, it is only the API that is POSIX-like
@@ -91,10 +91,11 @@
 <br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
 <P>
 The function <b>regcomp()</b> is called to compile a pattern into an
-internal form. The pattern is a C string terminated by a binary zero, and
-is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
-to a <b>regex_t</b> structure that is used as a base for storing information
-about the compiled regular expression.
+internal form. By default, the pattern is a C string terminated by a binary
+zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
+<b>regex_t</b> structure that is used as a base for storing information about
+the compiled regular expression. (It is also used for input when REG_PEND is 
+set.)
 </P>
 <P>
 The argument <i>cflags</i> is either zero, or contains one or more of the bits
@@ -125,6 +126,16 @@
 to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
 because it disables the use of back references.
 <pre>
+  REG_PEND
+</pre>
+If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
+(which has the type const char *) must be set to point to the character beyond 
+the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
+now contain binary zeroes, which are treated as data characters. Without
+REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
+ignored. This is a GNU extension to the POSIX standard and should be used with
+caution in software intended to be portable to other systems.
+<pre>
   REG_UCP
 </pre>
 The PCRE2_UCP option is set when the regular expression is passed for
@@ -156,9 +167,10 @@
 </P>
 <P>
 The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
-<i>preg</i> structure is filled in on success, and one member of the structure
-is public: <i>re_nsub</i> contains the number of capturing subpatterns in
-the regular expression. Various error codes are defined in the header file.
+<i>preg</i> structure is filled in on success, and one other member of the
+structure (as well as <i>re_endp</i>) is public: <i>re_nsub</i> contains the
+number of capturing subpatterns in the regular expression. Various error codes
+are defined in the header file.
 </P>
 <P>
 NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
@@ -228,17 +240,28 @@
 <pre>
   REG_STARTEND
 </pre>
-The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
-to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
-(there need not actually be a NUL at that location), regardless of the value of
-<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
-IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
-intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
-not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
-how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
-mutually exclusive; the error REG_INVARG is returned.
+When this option is set, the subject string starts at <i>string</i> +
+<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
+should point to the first character beyond the string. There may be binary 
+zeroes within the subject string, and indeed, using REG_STARTEND is the only 
+way to pass a subject string that contains a binary zero.
 </P>
 <P>
+Whatever the value of <i>pmatch[0].rm_so</i>, the offsets of the matched string
+and any captured substrings are still given relative to the start of
+<i>string</i> itself. (Before PCRE2 release 10.30 these were given relative to
+<i>string</i> + <i>pmatch[0].rm_so</i>, but this differs from other
+implementations.)
+</P>
+<P>
+This is a BSD extension, compatible with but not specified by IEEE Standard
+1003.2 (POSIX.2), and should be used with caution in software intended to be
+portable to other systems. Note that a non-zero <i>rm_so</i> does not imply
+REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
+not how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL
+are mutually exclusive; the error REG_INVARG is returned.
+</P>
+<P>
 If the pattern was compiled with the REG_NOSUB flag, no data about any matched
 strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
 <b>regexec()</b> are ignored (except possibly as input for REG_STARTEND).
@@ -291,9 +314,9 @@
 </P>
 <br><a name="SEC9" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 31 January 2016
+Last updated: 05 June 2017
 <br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
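
The REG_PEND text above can be illustrated with a minimal sketch. It assumes a header that may or may not define REG_PEND: the system <regex.h> is used here so the fragment builds anywhere, and with PCRE2 one would include pcre2posix.h instead and link the POSIX wrapper library. The helper name compile_counted is hypothetical and not part of any library. When REG_PEND is missing, the sketch falls back to a terminated copy and rejects patterns that contain a binary zero, since they cannot be passed without the extension.

```c
#include <assert.h>
#include <regex.h>    /* assumption: with PCRE2, include "pcre2posix.h" instead */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compile a pattern given as pointer plus length. */
static int compile_counted(regex_t *preg, const char *pat, size_t len,
                           int cflags)
{
  assert(preg != NULL && pat != NULL);
#ifdef REG_PEND
  /* re_endp is an input field: it must point just past the last pattern
     character before regcomp() is called. */
  preg->re_endp = pat + len;
  return regcomp(preg, pat, cflags | REG_PEND);
#else
  /* Without REG_PEND a binary zero terminates the pattern, so a pattern
     containing one cannot be represented. */
  char *copy;
  int rc;
  if (memchr(pat, '\0', len) != NULL) return REG_BADPAT;
  copy = malloc(len + 1);
  if (copy == NULL) return REG_ESPACE;
  memcpy(copy, pat, len);
  copy[len] = '\0';
  rc = regcomp(preg, copy, cflags);
  free(copy);
  return rc;
#endif
}
```

Calling compile_counted(&re, "abcdef", 3, REG_EXTENDED) should compile only "abc" on either path. The caller remains responsible for regfree() after a successful compile.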

Modified: code/trunk/doc/html/pcre2test.html
===================================================================
--- code/trunk/doc/html/pcre2test.html    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/doc/html/pcre2test.html    2017-06-05 18:25:47 UTC (rev 820)
@@ -1078,6 +1078,19 @@
 REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
 The other modifiers are ignored, with a warning message.
 </P>
+<P>
+There is one additional modifier that can be used with the POSIX wrapper. It is 
+ignored (with a warning) if used for non-POSIX matching.
+<pre>
+      posix_startend=&#60;n&#62;[:&#60;m&#62;] 
+</pre>
+This causes the subject string to be passed to <b>regexec()</b> using the
+REG_STARTEND option, which uses offsets to restrict which part of the string is
+searched. If only one number is given, the end offset is passed as the end of
+the subject string. For more detail of REG_STARTEND, see the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+documentation. 
+</P>
 <br><b>
 Setting match controls
 </b><br>
@@ -1817,7 +1830,7 @@
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 01 June 2017
+Last updated: 03 June 2017
 <br>
 Copyright &copy; 1997-2017 University of Cambridge.
 <br>

Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/doc/pcre2.txt    2017-06-05 18:25:47 UTC (rev 820)
@@ -8986,32 +8986,34 @@

        There are also some options that are not defined by POSIX.  These  have
        been  added  at  the  request  of users who want to make use of certain
-       PCRE2-specific features via the POSIX calling interface.
+       PCRE2-specific features via the POSIX calling interface or to  add  BSD
+       or GNU functionality.

-       When PCRE2 is called via these functions, it is only the  API  that  is
-       POSIX-like  in  style.  The syntax and semantics of the regular expres-
-       sions themselves are still those of Perl, subject  to  the  setting  of
-       various  PCRE2 options, as described below. "POSIX-like in style" means
-       that the API approximates to the POSIX  definition;  it  is  not  fully
-       POSIX-compatible,  and  in  multi-unit  encoding domains it is probably
+       When  PCRE2  is  called via these functions, it is only the API that is
+       POSIX-like in style. The syntax and semantics of  the  regular  expres-
+       sions  themselves  are  still  those of Perl, subject to the setting of
+       various PCRE2 options, as described below. "POSIX-like in style"  means
+       that  the  API  approximates  to  the POSIX definition; it is not fully
+       POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
        even less compatible.

        The header for these functions is supplied as pcre2posix.h to avoid any
-       potential  clash  with  other  POSIX  libraries.  It can, of course, be
+       potential clash with other POSIX  libraries.  It  can,  of  course,  be
        renamed or aliased as regex.h, which is the "correct" name. It provides
-       two  structure  types,  regex_t  for  compiled internal forms, and reg-
-       match_t for returning captured substrings. It also  defines  some  con-
-       stants  whose  names  start  with  "REG_";  these  are used for setting
+       two structure types, regex_t for  compiled  internal  forms,  and  reg-
+       match_t  for  returning  captured substrings. It also defines some con-
+       stants whose names start  with  "REG_";  these  are  used  for  setting
        options and identifying error codes.

COMPILING A PATTERN

-       The function regcomp() is called to compile a pattern into an  internal
-       form.  The  pattern  is  a C string terminated by a binary zero, and is
-       passed in the argument pattern. The preg argument is  a  pointer  to  a
-       regex_t  structure that is used as a base for storing information about
-       the compiled regular expression.
+       The  function regcomp() is called to compile a pattern into an internal
+       form. By default, the pattern is a C string terminated by a binary zero
+       (but  see  REG_PEND below). The preg argument is a pointer to a regex_t
+       structure that is used as a base for storing information about the com-
+       piled  regular  expression. (It is also used for input when REG_PEND is
+       set.)

        The argument cflags is either zero, or contains one or more of the bits
        defined by the following macros:
@@ -9042,38 +9044,50 @@
        used to set the  PCRE2_NO_AUTO_CAPTURE  compile  option,  but  this  no
        longer happens because it disables the use of back references.

+         REG_PEND
+
+       If  this  option is set, the re_endp field in the preg structure (which
+       has the type const char *) must be set to point to the character beyond
+       the end of the pattern before calling regcomp(). The pattern itself may
+       now contain binary zeroes, which are treated as data characters.  With-
+       out  REG_PEND,  a  binary  zero  terminates the pattern and the re_endp
+       field is ignored. This is a GNU extension to  the  POSIX  standard  and
+       should  be  used  with  caution  in software intended to be portable to
+       other systems.
+
          REG_UCP

-       The  PCRE2_UCP  option is set when the regular expression is passed for
-       compilation to the native function. This causes PCRE2  to  use  Unicode
-       properties  when  matchine  \d,  \w,  etc., instead of just recognizing
+       The PCRE2_UCP option is set when the regular expression is  passed  for
+       compilation  to  the  native function. This causes PCRE2 to use Unicode
+       properties when matching \d, \w,  etc.,  instead  of  just  recognizing
        ASCII values. Note that REG_UCP is not part of the POSIX standard.

          REG_UNGREEDY

-       The PCRE2_UNGREEDY option is set when the regular expression is  passed
-       for  compilation  to the native function. Note that REG_UNGREEDY is not
+       The  PCRE2_UNGREEDY option is set when the regular expression is passed
+       for compilation to the native function. Note that REG_UNGREEDY  is  not
        part of the POSIX standard.

          REG_UTF

-       The PCRE2_UTF option is set when the regular expression is  passed  for
-       compilation  to the native function. This causes the pattern itself and
-       all data strings used for matching it to be treated as  UTF-8  strings.
+       The  PCRE2_UTF  option is set when the regular expression is passed for
+       compilation to the native function. This causes the pattern itself  and
+       all  data  strings used for matching it to be treated as UTF-8 strings.
        Note that REG_UTF is not part of the POSIX standard.

-       In  the  absence  of  these  flags, no options are passed to the native
-       function.  This means the the regex  is  compiled  with  PCRE2  default
-       semantics.  In particular, the way it handles newline characters in the
-       subject string is the Perl way, not the POSIX way.  Note  that  setting
+       In the absence of these flags, no options  are  passed  to  the  native
+       function.  This  means  that  the  regex is compiled with PCRE2 default
+       semantics. In particular, the way it handles newline characters in  the
+       subject  string  is  the Perl way, not the POSIX way. Note that setting
        PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
-       It does not affect the way newlines are matched by the dot  metacharac-
+       It  does not affect the way newlines are matched by the dot metacharac-
        ter (they are not) or by a negative class such as [^a] (they are).

-       The  yield of regcomp() is zero on success, and non-zero otherwise. The
-       preg structure is filled in on success, and one member of the structure
-       is  public: re_nsub contains the number of capturing subpatterns in the
-       regular expression. Various error codes are defined in the header file.
+       The yield of regcomp() is zero on success, and non-zero otherwise.  The
+       preg  structure  is  filled  in on success, and one other member of the
+       structure (as well as re_endp) is public: re_nsub contains  the  number
+       of capturing subpatterns in the regular expression. Various error codes
+       are defined in the header file.

        NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
        use the contents of the preg structure. If, for example, you pass it to
@@ -9146,37 +9160,46 @@

          REG_STARTEND

-       The  string  is  considered to start at string + pmatch[0].rm_so and to
-       have a terminating NUL located at string + pmatch[0].rm_eo (there  need
-       not  actually  be  a  NUL at that location), regardless of the value of
-       nmatch. This is a BSD extension, compatible with but not  specified  by
-       IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
-       software intended to be portable to other systems. Note that a non-zero
-       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
-       of the string, not how it is matched. Setting REG_STARTEND and  passing
-       pmatch  as  NULL  are  mutually  exclusive;  the  error  REG_INVARG  is
+       When  this  option  is  set,  the  subject  string  starts  at string +
+       pmatch[0].rm_so and ends at  string  +  pmatch[0].rm_eo,  which  should
+       point  to  the  first  character beyond the string. There may be binary
+       zeroes within the subject string, and indeed, using REG_STARTEND is the
+       only way to pass a subject string that contains a binary zero.
+
+       Whatever  the  value  of  pmatch[0].rm_so,  the  offsets of the matched
+       string and any captured substrings are  still  given  relative  to  the
+       start  of  string  itself. (Before PCRE2 release 10.30 these were given
+       relative to string +  pmatch[0].rm_so,  but  this  differs  from  other
+       implementations.)
+
+       This  is  a  BSD  extension,  compatible with but not specified by IEEE
+       Standard 1003.2 (POSIX.2), and should be used with caution in  software
+       intended  to  be  portable to other systems. Note that a non-zero rm_so
+       does not imply REG_NOTBOL; REG_STARTEND affects only the  location  and
+       length  of  the string, not how it is matched. Setting REG_STARTEND and
+       passing pmatch as NULL are mutually exclusive; the error REG_INVARG  is
        returned.

-       If the pattern was compiled with the REG_NOSUB flag, no data about  any
-       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
+       If  the pattern was compiled with the REG_NOSUB flag, no data about any
+       matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
        regexec() are ignored (except possibly as input for REG_STARTEND).

-       The value of nmatch may be zero, and  the  value  pmatch  may  be  NULL
-       (unless  REG_STARTEND  is  set);  in both these cases no data about any
+       The  value  of  nmatch  may  be  zero, and the value pmatch may be NULL
+       (unless REG_STARTEND is set); in both these cases  no  data  about  any
        matched strings is returned.

-       Otherwise, the portion of the string that was  matched,  and  also  any
+       Otherwise,  the  portion  of  the string that was matched, and also any
        captured substrings, are returned via the pmatch argument, which points
-       to an array of nmatch structures of  type  regmatch_t,  containing  the
-       members  rm_so  and  rm_eo.  These contain the byte offset to the first
+       to  an  array  of  nmatch structures of type regmatch_t, containing the
+       members rm_so and rm_eo. These contain the byte  offset  to  the  first
        character of each substring and the offset to the first character after
-       the  end of each substring, respectively. The 0th element of the vector
-       relates to the entire portion of string that  was  matched;  subsequent
+       the end of each substring, respectively. The 0th element of the  vector
+       relates  to  the  entire portion of string that was matched; subsequent
        elements relate to the capturing subpatterns of the regular expression.
        Unused entries in the array have both structure members set to -1.

-       A successful match yields  a  zero  return;  various  error  codes  are
-       defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
+       A  successful  match  yields  a  zero  return;  various error codes are
+       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
        failure code.

@@ -9183,20 +9206,20 @@
ERROR MESSAGES

        The regerror() function maps a non-zero errorcode from either regcomp()
-       or  regexec()  to  a  printable message. If preg is not NULL, the error
+       or regexec() to a printable message. If preg is  not  NULL,  the  error
        should have arisen from the use of that structure. A message terminated
-       by  a binary zero is placed in errbuf. If the buffer is too short, only
+       by a binary zero is placed in errbuf. If the buffer is too short,  only
        the first errbuf_size - 1 characters of the error message are used. The
-       yield  of  the  function is the size of buffer needed to hold the whole
-       message, including the terminating zero. This  value  is  greater  than
+       yield of the function is the size of buffer needed to  hold  the  whole
+       message,  including  the  terminating  zero. This value is greater than
        errbuf_size if the message was truncated.

MEMORY USAGE

-       Compiling  a regular expression causes memory to be allocated and asso-
-       ciated with the preg structure. The function regfree() frees  all  such
-       memory,  after  which  preg may no longer be used as a compiled expres-
+       Compiling a regular expression causes memory to be allocated and  asso-
+       ciated  with  the preg structure. The function regfree() frees all such
+       memory, after which preg may no longer be used as  a  compiled  expres-
        sion.

@@ -9209,8 +9232,8 @@

REVISION

-       Last updated: 31 January 2016
-       Copyright (c) 1997-2016 University of Cambridge.
+       Last updated: 05 June 2017
+       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------

Modified: code/trunk/doc/pcre2posix.3
===================================================================
--- code/trunk/doc/pcre2posix.3    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/doc/pcre2posix.3    2017-06-05 18:25:47 UTC (rev 820)
@@ -1,4 +1,4 @@
-.TH PCRE2POSIX 3 "03 June 2017" "PCRE2 10.30"
+.TH PCRE2POSIX 3 "05 June 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "SYNOPSIS"
@@ -46,7 +46,7 @@
 .P
 There are also some options that are not defined by POSIX. These have been
 added at the request of users who want to make use of certain PCRE2-specific
-features via the POSIX calling interface.
+features via the POSIX calling interface or to add BSD or GNU functionality.
 .P
 When PCRE2 is called via these functions, it is only the API that is POSIX-like
 in style. The syntax and semantics of the regular expressions themselves are
@@ -68,10 +68,11 @@
 .rs
 .sp
 The function \fBregcomp()\fP is called to compile a pattern into an
-internal form. The pattern is a C string terminated by a binary zero, and
-is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
-to a \fBregex_t\fP structure that is used as a base for storing information
-about the compiled regular expression.
+internal form. By default, the pattern is a C string terminated by a binary
+zero (but see REG_PEND below). The \fIpreg\fP argument is a pointer to a
+\fBregex_t\fP structure that is used as a base for storing information about
+the compiled regular expression. (It is also used for input when REG_PEND is 
+set.)
 .P
 The argument \fIcflags\fP is either zero, or contains one or more of the bits
 defined by the following macros:
@@ -101,6 +102,16 @@
 to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
 because it disables the use of back references.
 .sp
+  REG_PEND
+.sp
+If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
+(which has the type const char *) must be set to point to the character beyond 
+the end of the pattern before calling \fBregcomp()\fP. The pattern itself may
+now contain binary zeroes, which are treated as data characters. Without
+REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
+ignored. This is a GNU extension to the POSIX standard and should be used with
+caution in software intended to be portable to other systems.
+.sp
   REG_UCP
 .sp
 The PCRE2_UCP option is set when the regular expression is passed for
@@ -130,9 +141,10 @@
 class such as [^a] (they are).
 .P
 The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
-\fIpreg\fP structure is filled in on success, and one member of the structure
-is public: \fIre_nsub\fP contains the number of capturing subpatterns in
-the regular expression. Various error codes are defined in the header file.
+\fIpreg\fP structure is filled in on success, and one other member of the
+structure (as well as \fIre_endp\fP) is public: \fIre_nsub\fP contains the
+number of capturing subpatterns in the regular expression. Various error codes
+are defined in the header file.
 .P
 NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
 use the contents of the \fIpreg\fP structure. If, for example, you pass it to
@@ -204,12 +216,15 @@
 .sp
   REG_STARTEND
 .sp
-When this option is set, the string is considered to start at \fIstring\fP +
-\fIpmatch[0].rm_so\fP and to have a terminating NUL located at \fIstring\fP +
-\fIpmatch[0].rm_eo\fP (there need not actually be a NUL at that location),
-regardless of the value of \fInmatch\fP. However, the offsets of the matched
-string and any captured substrings are still given relative to the start of
-\fIstring\fP. (Before PCRE2 release 10.30 these were given relative to
+When this option is set, the subject string starts at \fIstring\fP +
+\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
+should point to the first character beyond the string. There may be binary 
+zeroes within the subject string, and indeed, using REG_STARTEND is the only 
+way to pass a subject string that contains a binary zero.
+.P
+Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string
+and any captured substrings are still given relative to the start of
+\fIstring\fP itself. (Before PCRE2 release 10.30 these were given relative to
 \fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other
 implementations.)
 .P
@@ -216,9 +231,9 @@
 This is a BSD extension, compatible with but not specified by IEEE Standard
 1003.2 (POSIX.2), and should be used with caution in software intended to be
 portable to other systems. Note that a non-zero \fIrm_so\fP does not imply
-REG_NOTBOL; REG_STARTEND affects only the location of the string, not how it is
-matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are mutually
-exclusive; the error REG_INVARG is returned.
+REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
+not how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL
+are mutually exclusive; the error REG_INVARG is returned.
 .P
 If the pattern was compiled with the REG_NOSUB flag, no data about any matched
 strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
@@ -277,6 +292,6 @@
 .rs
 .sp
 .nf
-Last updated: 03 June 2017
+Last updated: 05 June 2017
 Copyright (c) 1997-2017 University of Cambridge.
 .fi
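
The REG_STARTEND behaviour described in the revised man page can be sketched as follows. This is an illustration under stated assumptions, not code from the commit: it uses the system <regex.h> (glibc and the BSDs provide REG_STARTEND), whereas with PCRE2 one would include pcre2posix.h instead. The helper name search_range is hypothetical.

```c
#include <assert.h>
#include <regex.h>    /* assumption: with PCRE2, include "pcre2posix.h" instead */
#include <stddef.h>

/* Hypothetical helper: match only the bytes buf[start] .. buf[end - 1].
   The range is passed in through m[0], which regexec() then overwrites
   with the offsets of the match. */
static int search_range(const regex_t *preg, const char *buf,
                        size_t start, size_t end, regmatch_t *m)
{
  assert(preg != NULL && buf != NULL && m != NULL && start <= end);
  m->rm_so = (regoff_t)start;   /* first byte of the subject */
  m->rm_eo = (regoff_t)end;     /* first byte beyond the subject */
  return regexec(preg, buf, 1, m, REG_STARTEND);
}
```

On a successful return, m->rm_so and m->rm_eo are offsets from buf itself, not from buf + start, which matches the behaviour documented for PCRE2 10.30 and later. Because the end is given by offset, the range may contain binary zeroes.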

Modified: code/trunk/doc/pcre2test.txt
===================================================================
--- code/trunk/doc/pcre2test.txt    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/doc/pcre2test.txt    2017-06-05 18:25:47 UTC (rev 820)
@@ -965,11 +965,22 @@
        REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to  regexec().
        The other modifiers are ignored, with a warning message.

+       There  is one additional modifier that can be used with the POSIX wrap-
+       per. It is ignored (with a warning) if used for non-POSIX matching.
+
+             posix_startend=<n>[:<m>]
+
+       This causes the subject string to be  passed  to  regexec()  using  the
+       REG_STARTEND  option,  which uses offsets to restrict which part of the
+       string is searched. If only one number is  given,  the  end  offset  is
+       passed  as  the end of the subject string. For more detail of REG_STAR-
+       TEND, see the pcre2posix documentation.
+
    Setting match controls

-       The  following  modifiers  affect the matching process or request addi-
-       tional information. Some of them may also be  specified  on  a  pattern
-       line  (see  above), in which case they apply to every subject line that
+       The following modifiers affect the matching process  or  request  addi-
+       tional  information.  Some  of  them may also be specified on a pattern
+       line (see above), in which case they apply to every subject  line  that
        is matched against that pattern.

              aftertext                  show text after match
@@ -1009,29 +1020,29 @@
              zero_terminate             pass the subject as zero-terminated

        The effects of these modifiers are described in the following sections.
-       When  matching  via the POSIX wrapper API, the aftertext, allaftertext,
-       and ovector subject modifiers work as described below. All other  modi-
+       When matching via the POSIX wrapper API, the  aftertext,  allaftertext,
+       and  ovector subject modifiers work as described below. All other modi-
        fiers are either ignored, with a warning message, or cause an error.

    Showing more text

-       The  aftertext modifier requests that as well as outputting the part of
+       The aftertext modifier requests that as well as outputting the part  of
        the subject string that matched the entire pattern, pcre2test should in
        addition output the remainder of the subject string. This is useful for
        tests where the subject contains multiple copies of the same substring.
-       The  allaftertext  modifier  requests the same action for captured sub-
+       The allaftertext modifier requests the same action  for  captured  sub-
        strings as well as the main matched substring. In each case the remain-
        der is output on the following line with a plus character following the
        capture number.

-       The allusedtext modifier requests that all the text that was  consulted
-       during  a  successful pattern match by the interpreter should be shown.
-       This feature is not supported for JIT matching, and if  requested  with
-       JIT  it  is  ignored  (with  a  warning message). Setting this modifier
+       The  allusedtext modifier requests that all the text that was consulted
+       during a successful pattern match by the interpreter should  be  shown.
+       This  feature  is not supported for JIT matching, and if requested with
+       JIT it is ignored (with  a  warning  message).  Setting  this  modifier
        affects the output if there is a lookbehind at the start of a match, or
-       a  lookahead  at  the  end, or if \K is used in the pattern. Characters
-       that precede or follow the start and end of the actual match are  indi-
-       cated  in  the output by '<' or '>' characters underneath them. Here is
+       a lookahead at the end, or if \K is used  in  the  pattern.  Characters
+       that  precede or follow the start and end of the actual match are indi-
+       cated in the output by '<' or '>' characters underneath them.  Here  is
        an example:

            re> /(?<=pqr)abc(?=xyz)/
@@ -1039,16 +1050,16 @@
           0: pqrabcxyz
              <<<   >>>

-       This shows that the matched string is "abc",  with  the  preceding  and
-       following  strings  "pqr"  and  "xyz"  having been consulted during the
+       This  shows  that  the  matched string is "abc", with the preceding and
+       following strings "pqr" and "xyz"  having  been  consulted  during  the
        match (when processing the assertions).

-       The startchar modifier requests that the  starting  character  for  the
-       match  be  indicated,  if  it  is different to the start of the matched
+       The  startchar  modifier  requests  that the starting character for the
+       match be indicated, if it is different to  the  start  of  the  matched
        string. The only time when this occurs is when \K has been processed as
        part of the match. In this situation, the output for the matched string
-       is displayed from the starting character  instead  of  from  the  match
-       point,  with  circumflex  characters  under the earlier characters. For
+       is  displayed  from  the  starting  character instead of from the match
+       point, with circumflex characters under  the  earlier  characters.  For
        example:

            re> /abc\Kxyz/
@@ -1056,7 +1067,7 @@
           0: abcxyz
              ^^^

-       Unlike allusedtext, the startchar modifier can be used with JIT.   How-
+       Unlike  allusedtext, the startchar modifier can be used with JIT.  How-
        ever, these two modifiers are mutually exclusive.

    Showing the value of all capture groups
@@ -1064,98 +1075,98 @@
        The allcaptures modifier requests that the values of all potential cap-
        tured parentheses be output after a match. By default, only those up to
        the highest one actually used in the match are output (corresponding to
-       the return code from pcre2_match()). Groups that did not take  part  in
-       the  match  are  output as "<unset>". This modifier is not relevant for
-       DFA matching (which does no capturing); it is ignored, with  a  warning
+       the  return  code from pcre2_match()). Groups that did not take part in
+       the match are output as "<unset>". This modifier is  not  relevant  for
+       DFA  matching  (which does no capturing); it is ignored, with a warning
        message, if present.

    Testing callouts

-       A  callout function is supplied when pcre2test calls the library match-
-       ing functions, unless callout_none is specified. If callout_capture  is
-       set,  the current captured groups are output when a callout occurs. The
+       A callout function is supplied when pcre2test calls the library  match-
+       ing  functions, unless callout_none is specified. If callout_capture is
+       set, the current captured groups are output when a callout occurs.  The
        default return from the callout function is zero, which allows matching
        to continue.

-       The  callout_fail modifier can be given one or two numbers. If there is
-       only one number, 1 is returned instead of 0 (causing matching to  back-
-       track)  when  a  callout  of  that  number  is  reached. If two numbers
-       (<n>:<m>) are given, 1 is returned when  callout  <n>  is  reached  and
-       there  have  been  at least <m> callouts. The callout_error modifier is
-       similar, except  that  PCRE2_ERROR_CALLOUT  is  returned,  causing  the
-       entire  matching process to be aborted. If both these modifiers are set
+       The callout_fail modifier can be given one or two numbers. If there  is
+       only  one number, 1 is returned instead of 0 (causing matching to back-
+       track) when a callout  of  that  number  is  reached.  If  two  numbers
+       (<n>:<m>)  are  given,  1  is  returned when callout <n> is reached and
+       there have been at least <m> callouts. The  callout_error  modifier  is
+       similar,  except  that  PCRE2_ERROR_CALLOUT  is  returned,  causing the
+       entire matching process to be aborted. If both these modifiers are  set
        for the same callout number, callout_error takes precedence.

-       Note that callouts with string arguments are always  given  the  number
+       Note  that  callouts  with string arguments are always given the number
        zero. See "Callouts" below for a description of the output when a call-
        out is taken.

-       The callout_data modifier can be given an unsigned or a  negative  num-
-       ber.   This  is  set  as the "user data" that is passed to the matching
-       function, and passed back when the callout  function  is  invoked.  Any
-       value  other  than  zero  is  used as a return from pcre2test's callout
+       The  callout_data  modifier can be given an unsigned or a negative num-
+       ber.  This is set as the "user data" that is  passed  to  the  matching
+       function,  and  passed  back  when the callout function is invoked. Any
+       value other than zero is used as  a  return  from  pcre2test's  callout
        function.

    Finding all matches in a string

        Searching for all possible matches within a subject can be requested by
-       the  global  or altglobal modifier. After finding a match, the matching
-       function is called again to search the remainder of  the  subject.  The
-       difference  between  global  and  altglobal is that the former uses the
-       start_offset argument to pcre2_match() or  pcre2_dfa_match()  to  start
-       searching  at  a new point within the entire string (which is what Perl
+       the global or altglobal modifier. After finding a match,  the  matching
+       function  is  called  again to search the remainder of the subject. The
+       difference between global and altglobal is that  the  former  uses  the
+       start_offset  argument  to  pcre2_match() or pcre2_dfa_match() to start
+       searching at a new point within the entire string (which is  what  Perl
        does), whereas the latter passes over a shortened subject. This makes a
        difference to the matching process if the pattern begins with a lookbe-
        hind assertion (including \b or \B).

-       If an empty string  is  matched,  the  next  match  is  done  with  the
+       If  an  empty  string  is  matched,  the  next  match  is done with the
        PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
        for another, non-empty, match at the same point in the subject. If this
-       match  fails,  the  start  offset  is advanced, and the normal match is
-       retried. This imitates the way Perl handles such cases when  using  the
-       /g  modifier  or  the  split()  function. Normally, the start offset is
-       advanced by one character, but if  the  newline  convention  recognizes
-       CRLF  as  a newline, and the current character is CR followed by LF, an
+       match fails, the start offset is advanced,  and  the  normal  match  is
+       retried.  This  imitates the way Perl handles such cases when using the
+       /g modifier or the split() function.  Normally,  the  start  offset  is
+       advanced  by  one  character,  but if the newline convention recognizes
+       CRLF as a newline, and the current character is CR followed by  LF,  an
        advance of two characters occurs.

    Testing substring extraction functions

-       The copy  and  get  modifiers  can  be  used  to  test  the  pcre2_sub-
+       The  copy  and  get  modifiers  can  be  used  to  test  the pcre2_sub-
        string_copy_xxx() and pcre2_substring_get_xxx() functions.  They can be
-       given more than once, and each can specify a group name or number,  for
+       given  more than once, and each can specify a group name or number, for
        example:

           abcd\=copy=1,copy=3,get=G1

-       If  the  #subject command is used to set default copy and/or get lists,
-       these can be unset by specifying a negative number to cancel  all  num-
+       If the #subject command is used to set default copy and/or  get  lists,
+       these  can  be unset by specifying a negative number to cancel all num-
        bered groups and an empty name to cancel all named groups.

-       The  getall  modifier  tests pcre2_substring_list_get(), which extracts
+       The getall modifier tests  pcre2_substring_list_get(),  which  extracts
        all captured substrings.

-       If the subject line is successfully matched, the  substrings  extracted
-       by  the  convenience  functions  are  output  with C, G, or L after the
-       string number instead of a colon. This is in  addition  to  the  normal
-       full  list.  The string length (that is, the return from the extraction
+       If  the  subject line is successfully matched, the substrings extracted
+       by the convenience functions are output with  C,  G,  or  L  after  the
+       string  number  instead  of  a colon. This is in addition to the normal
+       full list. The string length (that is, the return from  the  extraction
        function) is given in parentheses after each substring, followed by the
        name when the extraction was by name.

    Testing the substitution function

-       If  the  replace  modifier  is  set, the pcre2_substitute() function is
-       called instead of one of the matching functions. Note that  replacement
-       strings  cannot  contain commas, because a comma signifies the end of a
+       If the replace modifier is  set,  the  pcre2_substitute()  function  is
+       called  instead of one of the matching functions. Note that replacement
+       strings cannot contain commas, because a comma signifies the end  of  a
        modifier. This is not thought to be an issue in a test program.

-       Unlike subject strings, pcre2test does not process replacement  strings
-       for  escape  sequences. In UTF mode, a replacement string is checked to
-       see if it is a valid UTF-8 string. If so, it is correctly converted  to
-       a  UTF  string of the appropriate code unit width. If it is not a valid
-       UTF-8 string, the individual code units are copied directly. This  pro-
+       Unlike  subject strings, pcre2test does not process replacement strings
+       for escape sequences. In UTF mode, a replacement string is  checked  to
+       see  if it is a valid UTF-8 string. If so, it is correctly converted to
+       a UTF string of the appropriate code unit width. If it is not  a  valid
+       UTF-8  string, the individual code units are copied directly. This pro-
        vides a means of passing an invalid UTF-8 string for testing purposes.

-       The  following modifiers set options (in addition to the normal match
+       The following modifiers set options (in addition to the normal  match
        options) for pcre2_substitute():

          global                      PCRE2_SUBSTITUTE_GLOBAL
@@ -1165,8 +1176,8 @@
          substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY

-       After a successful substitution, the modified string  is  output,  pre-
-       ceded  by the number of replacements. This may be zero if there were no
+       After  a  successful  substitution, the modified string is output, pre-
+       ceded by the number of replacements. This may be zero if there were  no
        matches. Here is a simple example of a substitution test:

          /abc/replace=xxx
@@ -1175,12 +1186,12 @@
              =abc=abc=\=global
           2: =xxx=xxx=

-       Subject and replacement strings should be kept relatively short  (fewer
-       than  256 characters) for substitution tests, as fixed-size buffers are
-       used. To make it easy to test for buffer overflow, if  the  replacement
-       string  starts  with a number in square brackets, that number is passed
-       to pcre2_substitute() as the  size  of  the  output  buffer,  with  the
-       replacement  string  starting at the next character. Here is an example
+       Subject  and replacement strings should be kept relatively short (fewer
+       than 256 characters) for substitution tests, as fixed-size buffers  are
+       used.  To  make it easy to test for buffer overflow, if the replacement
+       string starts with a number in square brackets, that number  is  passed
+       to  pcre2_substitute()  as  the  size  of  the  output buffer, with the
+       replacement string starting at the next character. Here is  an  example
        that tests the edge case:

          /abc/
@@ -1189,11 +1200,11 @@
              123abc123\=replace=[9]XYZ
          Failed: error -47: no more memory

-       The   default   action   of    pcre2_substitute()    is    to    return
-       PCRE2_ERROR_NOMEMORY  when  the output buffer is too small. However, if
-       the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using  the  sub-
-       stitute_overflow_length  modifier),  pcre2_substitute() continues to go
-       through the motions of matching and substituting, in order  to  compute
+       The    default    action    of    pcre2_substitute()   is   to   return
+       PCRE2_ERROR_NOMEMORY when the output buffer is too small.  However,  if
+       the  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  option is set (by using the sub-
+       stitute_overflow_length modifier), pcre2_substitute() continues  to  go
+       through  the  motions of matching and substituting, in order to compute
        the size of buffer that is required. When this happens, pcre2test shows
        the required buffer length (which includes space for the trailing zero)
        as part of the error message. For example:
@@ -1203,13 +1214,13 @@
          Failed: error -47: no more memory: 10 code units are needed

        A replacement string is ignored with POSIX and DFA matching. Specifying
-       partial matching provokes an error return  ("bad  option  value")  from
+       partial  matching  provokes  an  error return ("bad option value") from
        pcre2_substitute().

    Setting the JIT stack size

-       The  jitstack modifier provides a way of setting the maximum stack size
-       that is used by the just-in-time optimization code. It  is  ignored  if
+       The jitstack modifier provides a way of setting the maximum stack  size
+       that  is  used  by the just-in-time optimization code. It is ignored if
        JIT optimization is not being used. The value is a number of kilobytes.
        Providing a stack that is larger than the default 32K is necessary only
        for very complicated patterns.
@@ -1216,32 +1227,32 @@

    Setting heap, match, and depth limits

-       The  heap_limit,  match_limit, and depth_limit modifiers set the appro-
-       priate limits in the match context. These values are ignored  when  the
+       The heap_limit, match_limit, and depth_limit modifiers set  the  appro-
+       priate  limits  in the match context. These values are ignored when the
        find_limits modifier is specified.

    Finding minimum limits

-       If  the  find_limits  modifier  is present on a subject line, pcre2test
-       calls the relevant matching function several times,  setting  different
-       values    in    the    match    context   via   pcre2_set_heap_limit(),
-       pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds  the
-       minimum  values  for  each  parameter that allows the match to complete
+       If the find_limits modifier is present on  a  subject  line,  pcre2test
+       calls  the  relevant matching function several times, setting different
+       values   in   the    match    context    via    pcre2_set_heap_limit(),
+       pcre2_set_match_limit(),  or pcre2_set_depth_limit() until it finds the
+       minimum values for each parameter that allows  the  match  to  complete
        without error.

        If JIT is being used, only the match limit is relevant. If DFA matching
        is being used, only the depth limit is relevant.

-       The  match_limit number is a measure of the amount of backtracking that
-       takes place, and learning the minimum value  can  be  instructive.  For
-       most  simple  matches, the number is quite small, but for patterns with
-       very large numbers of matching possibilities, it can become large  very
+       The match_limit number is a measure of the amount of backtracking  that
+       takes  place,  and  learning  the minimum value can be instructive. For
+       most simple matches, the number is quite small, but for  patterns  with
+       very  large numbers of matching possibilities, it can become large very
        quickly with increasing length of subject string.

-       For  non-DFA  matching,  the minimum depth_limit number is a measure of
+       For non-DFA matching, the minimum depth_limit number is  a  measure  of
        how much nested backtracking happens (that is, how deeply the pattern's
-       tree  is  searched).  In the case of DFA matching, depth_limit controls
-       the depth of recursive calls of the internal function that is used  for
+       tree is searched). In the case of DFA  matching,  depth_limit  controls
+       the  depth of recursive calls of the internal function that is used for
        handling pattern recursion, lookaround assertions, and atomic groups.

    Showing MARK names
@@ -1248,50 +1259,50 @@

        The mark modifier causes the names from backtracking control verbs that
-       are returned from calls to pcre2_match() to be displayed. If a mark  is
-       returned  for a match, non-match, or partial match, pcre2test shows it.
-       For a match, it is on a line by itself, tagged with  "MK:".  Otherwise,
+       are  returned from calls to pcre2_match() to be displayed. If a mark is
+       returned for a match, non-match, or partial match, pcre2test shows  it.
+       For  a  match, it is on a line by itself, tagged with "MK:". Otherwise,
        it is added to the non-match message.

    Showing memory usage

-       The  memory modifier causes pcre2test to log the sizes of all heap mem-
-       ory  allocation  and  freeing  calls  that  occur  during  a  call   to
-       pcre2_match().  These  occur only when a match requires a bigger vector
-       than the default for remembering backtracking  points.  In  many  cases
-       there  will  be no heap memory used and therefore no additional output.
-       No heap memory is allocated during  matching  with  pcre2_dfa_match  or
-       with  JIT,  so in those cases the memory modifier never has any effect.
+       The memory modifier causes pcre2test to log the sizes of all heap  mem-
+       ory   allocation  and  freeing  calls  that  occur  during  a  call  to
+       pcre2_match(). These occur only when a match requires a  bigger  vector
+       than  the  default  for  remembering backtracking points. In many cases
+       there will be no heap memory used and therefore no  additional  output.
+       No  heap  memory  is  allocated during matching with pcre2_dfa_match or
+       with JIT, so in those cases the memory modifier never has  any  effect.
        For this modifier to work, the null_context modifier must not be set on
-       both  the  pattern  and the subject, though it can be set on one or the
+       both the pattern and the subject, though it can be set on  one  or  the
        other.

    Setting a starting offset

-       The offset modifier sets an offset  in  the  subject  string  at  which
+       The  offset  modifier  sets  an  offset  in the subject string at which
        matching starts. Its value is a number of code units, not characters.

    Setting an offset limit

-       The  offset_limit  modifier  sets  a limit for unanchored matches. If a
+       The offset_limit modifier sets a limit for  unanchored  matches.  If  a
        match cannot be found starting at or before this offset in the subject,
        a "no match" return is given. The data value is a number of code units,
-       not characters. When this modifier is used, the use_offset_limit  modi-
+       not  characters. When this modifier is used, the use_offset_limit modi-
        fier must have been set for the pattern; if not, an error is generated.

    Setting the size of the output vector

-       The  ovector  modifier  applies  only  to  the subject line in which it
-       appears, though of course it can also be used to set  a  default  in  a
-       #subject  command. It specifies the number of pairs of offsets that are
+       The ovector modifier applies only to  the  subject  line  in  which  it
+       appears,  though  of  course  it can also be used to set a default in a
+       #subject command. It specifies the number of pairs of offsets that  are
        available for storing matching information. The default is 15.

-       A value of zero is useful when testing the POSIX API because it  causes
+       A  value of zero is useful when testing the POSIX API because it causes
        regexec() to be called with a NULL capture vector. When not testing the
-       POSIX API, a value of  zero  is  used  to  cause  pcre2_match_data_cre-
-       ate_from_pattern()  to  be  called, in order to create a match block of
+       POSIX  API,  a  value  of  zero  is used to cause pcre2_match_data_cre-
+       ate_from_pattern() to be called, in order to create a  match  block  of
        exactly the right size for the pattern. (It is not possible to create a
-       match  block  with  a zero-length ovector; there is always at least one
+       match block with a zero-length ovector; there is always  at  least  one
        pair of offsets.)

    Passing the subject as zero-terminated
@@ -1298,56 +1309,56 @@

        By default, the subject string is passed to a native API matching func-
        tion with its correct length. In order to test the facility for passing
-       a zero-terminated string, the zero_terminate modifier is  provided.  It
+       a  zero-terminated  string, the zero_terminate modifier is provided. It
        causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
-       via the POSIX interface, this modifier has no effect, as  there  is  no
+       via  the  POSIX  interface, this modifier has no effect, as there is no
        facility for passing a length.)

-       When  testing  pcre2_substitute(), this modifier also has the effect of
+       When testing pcre2_substitute(), this modifier also has the  effect  of
        passing the replacement string as zero-terminated.

    Passing a NULL context

-       Normally,  pcre2test  passes  a   context   block   to   pcre2_match(),
+       Normally,   pcre2test   passes   a   context  block  to  pcre2_match(),
        pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
-       set, however, NULL is passed. This is for  testing  that  the  matching
+       set,  however,  NULL  is  passed. This is for testing that the matching
        functions behave correctly in this case (they use default values). This
-       modifier cannot be used with the find_limits modifier or  when  testing
+       modifier  cannot  be used with the find_limits modifier or when testing
        the substitution function.

THE ALTERNATIVE MATCHING FUNCTION

-       By  default,  pcre2test  uses  the  standard  PCRE2  matching function,
+       By default,  pcre2test  uses  the  standard  PCRE2  matching  function,
        pcre2_match() to match each subject line. PCRE2 also supports an alter-
-       native  matching  function, pcre2_dfa_match(), which operates in a dif-
-       ferent way, and has some restrictions. The differences between the  two
+       native matching function, pcre2_dfa_match(), which operates in  a  dif-
+       ferent  way, and has some restrictions. The differences between the two
        functions are described in the pcre2matching documentation.

-       If  the dfa modifier is set, the alternative matching function is used.
-       This function finds all possible matches at a given point in  the  sub-
-       ject.  If,  however, the dfa_shortest modifier is set, processing stops
-       after the first match is found. This is always  the  shortest  possible
+       If the dfa modifier is set, the alternative matching function is  used.
+       This  function  finds all possible matches at a given point in the sub-
+       ject. If, however, the dfa_shortest modifier is set,  processing  stops
+       after  the  first  match is found. This is always the shortest possible
        match.

DEFAULT OUTPUT FROM pcre2test

-       This  section  describes  the output when the normal matching function,
+       This section describes the output when the  normal  matching  function,
        pcre2_match(), is being used.

-       When a match succeeds, pcre2test outputs  the  list  of  captured  sub-
-       strings,  starting  with number 0 for the string that matched the whole
-       pattern.   Otherwise,  it  outputs  "No  match"  when  the  return   is
-       PCRE2_ERROR_NOMATCH,  or  "Partial  match:"  followed  by the partially
-       matching substring when the return is PCRE2_ERROR_PARTIAL.  (Note  that
-       this  is  the  entire  substring  that was inspected during the partial
-       match; it may include characters before the actual  match  start  if  a
+       When  a  match  succeeds,  pcre2test  outputs the list of captured sub-
+       strings, starting with number 0 for the string that matched  the  whole
+       pattern.    Otherwise,  it  outputs  "No  match"  when  the  return  is
+       PCRE2_ERROR_NOMATCH, or "Partial  match:"  followed  by  the  partially
+       matching  substring  when the return is PCRE2_ERROR_PARTIAL. (Note that
+       this is the entire substring that  was  inspected  during  the  partial
+       match;  it  may  include  characters before the actual match start if a
        lookbehind assertion, \K, \b, or \B was involved.)

        For any other return, pcre2test outputs the PCRE2 negative error number
-       and a short descriptive phrase. If the error is  a  failed  UTF  string
-       check,  the  code  unit offset of the start of the failing character is
+       and  a  short  descriptive  phrase. If the error is a failed UTF string
+       check, the code unit offset of the start of the  failing  character  is
        also output. Here is an example of an interactive pcre2test run.

          $ pcre2test
@@ -1363,8 +1374,8 @@
        Unset capturing substrings that are not followed by one that is set are
        not shown by pcre2test unless the allcaptures modifier is specified. In
        the following example, there are two capturing substrings, but when the
-       first  data  line is matched, the second, unset substring is not shown.
-       An "internal" unset substring is shown as "<unset>", as for the  second
+       first data line is matched, the second, unset substring is  not  shown.
+       An  "internal" unset substring is shown as "<unset>", as for the second
        data line.

            re> /(a)|(b)/
@@ -1376,11 +1387,11 @@
           1: <unset>
           2: b

-       If  the strings contain any non-printing characters, they are output as
-       \xhh escapes if the value is less than 256 and UTF  mode  is  not  set.
+       If the strings contain any non-printing characters, they are output  as
+       \xhh  escapes  if  the  value is less than 256 and UTF mode is not set.
        Otherwise they are output as \x{hh...} escapes. See below for the defi-
-       nition of non-printing characters. If the aftertext  modifier  is  set,
-       the  output  for substring 0 is followed by the rest of the subject
+       nition  of  non-printing  characters. If the aftertext modifier is set,
+       the output for substring 0 is followed by the rest of  the  subject
        string, identified by "0+" like this:

            re> /cat/aftertext
@@ -1388,7 +1399,7 @@
           0: cat
           0+ aract

-       If global matching is requested, the  results  of  successive  matching
+       If  global  matching  is  requested, the results of successive matching
        attempts are output in sequence, like this:

            re> /\Bi(\w\w)/g
@@ -1400,8 +1411,8 @@
           0: ipp
           1: pp

-       "No  match" is output only if the first match attempt fails. Here is an
-       example of a failure message (the offset 4 that  is  specified  by  the
+       "No match" is output only if the first match attempt fails. Here is  an
+       example  of  a  failure  message (the offset 4 that is specified by the
        offset modifier is past the end of the subject string):

            re> /xyz/
@@ -1409,7 +1420,7 @@
          Error -24 (bad offset value)

        Note that whereas patterns can be continued over several lines (a plain
-       ">" prompt is used for continuations), subject lines may  not.  However
+       ">"  prompt  is used for continuations), subject lines may not. However
        newlines can be included in a subject by means of the \n escape (or \r,
        \r\n, etc., depending on the newline sequence setting).

@@ -1417,7 +1428,7 @@
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

        When the alternative matching function, pcre2_dfa_match(), is used, the
-       output  consists  of  a list of all the matches that start at the first
+       output consists of a list of all the matches that start  at  the  first
        point in the subject where there is at least one match. For example:

            re> /(tang|tangerine|tan)/
@@ -1426,11 +1437,11 @@
           1: tang
           2: tan

-       Using the normal matching function on this data finds only "tang".  The
-       longest  matching  string  is  always  given first (and numbered zero).
-       After a PCRE2_ERROR_PARTIAL return, the  output  is  "Partial  match:",
-       followed  by  the  partially  matching substring. Note that this is the
-       entire substring that was inspected during the partial  match;  it  may
+       Using  the normal matching function on this data finds only "tang". The
+       longest matching string is always  given  first  (and  numbered  zero).
+       After  a  PCRE2_ERROR_PARTIAL  return,  the output is "Partial match:",
+       followed by the partially matching substring. Note  that  this  is  the
+       entire  substring  that  was inspected during the partial match; it may
        include characters before the actual match start if a lookbehind asser-
        tion, \b, or \B was involved. (\K is not supported for DFA matching.)

@@ -1446,16 +1457,16 @@
           1: tan
           0: tan

-       The  alternative  matching function does not support substring capture,
-       so the modifiers that are concerned with captured  substrings  are  not
+       The alternative matching function does not support  substring  capture,
+       so  the  modifiers  that are concerned with captured substrings are not
        relevant.

RESTARTING AFTER A PARTIAL MATCH

-       When  the  alternative matching function has given the PCRE2_ERROR_PAR-
+       When the alternative matching function has given  the  PCRE2_ERROR_PAR-
        TIAL return, indicating that the subject partially matched the pattern,
-       you  can restart the match with additional subject data by means of the
+       you can restart the match with additional subject data by means of  the
        dfa_restart modifier. For example:

            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -1464,7 +1475,7 @@
          data> n05\=dfa,dfa_restart
           0: n05

-       For further information about partial matching,  see  the  pcre2partial
+       For  further  information  about partial matching, see the pcre2partial
        documentation.

@@ -1471,38 +1482,38 @@
CALLOUTS

        If the pattern contains any callout requests, pcre2test's callout func-
-       tion is called during matching unless callout_none is specified.   This
+       tion  is called during matching unless callout_none is specified.  This
        works with both matching functions.

-       The  callout  function in pcre2test returns zero (carry on matching) by
-       default, but you can use a callout_fail modifier in a subject line  (as
+       The callout function in pcre2test returns zero (carry on  matching)  by
+       default,  but you can use a callout_fail modifier in a subject line (as
        described above) to change this and other parameters of the callout.

        Inserting callouts can be helpful when using pcre2test to check compli-
-       cated regular expressions. For further information about callouts,  see
+       cated  regular expressions. For further information about callouts, see
        the pcre2callout documentation.

-       The  output for callouts with numerical arguments and those with string
+       The output for callouts with numerical arguments and those with  string
        arguments is slightly different.

    Callouts with numerical arguments

        By default, the callout function displays the callout number, the start
-       and  current positions in the subject text at the callout time, and the
+       and current positions in the subject text at the callout time, and  the
        next pattern item to be tested. For example:

          --->pqrabcdef
            0    ^  ^     \d

-       This output indicates that  callout  number  0  occurred  for  a  match
-       attempt  starting  at  the fourth character of the subject string, when
-       the pointer was at the seventh character, and  when  the  next  pattern
-       item  was  \d.  Just  one circumflex is output if the start and current
-       positions are the same, or if the current position precedes  the  start
+       This  output  indicates  that  callout  number  0  occurred for a match
+       attempt starting at the fourth character of the  subject  string,  when
+       the  pointer  was  at  the seventh character, and when the next pattern
+       item was \d. Just one circumflex is output if  the  start  and  current
+       positions  are  the same, or if the current position precedes the start
        position, which can happen if the callout is in a lookbehind assertion.

        Callouts numbered 255 are assumed to be automatic callouts, inserted as
-       a result of the /auto_callout pattern modifier. In this  case,  instead
+       a  result  of the /auto_callout pattern modifier. In this case, instead
        of showing the callout number, the offset in the pattern, preceded by a
        plus, is output. For example:

@@ -1516,7 +1527,7 @@
           0: E*

        If a pattern contains (*MARK) items, an additional line is output when-
-       ever  a  change  of  latest mark is passed to the callout function. For
+       ever a change of latest mark is passed to  the  callout  function.  For
        example:

            re> /a(*MARK:X)bc/auto_callout
@@ -1530,17 +1541,17 @@
          +12 ^  ^
           0: abc

-       The mark changes between matching "a" and "b", but stays the  same  for
-       the  rest  of  the match, so nothing more is output. If, as a result of
-       backtracking, the mark reverts to being unset, the  text  "<unset>"  is
+       The  mark  changes between matching "a" and "b", but stays the same for
+       the rest of the match, so nothing more is output. If, as  a  result  of
+       backtracking,  the  mark  reverts to being unset, the text "<unset>" is
        output.

    Callouts with string arguments

        The output for a callout with a string argument is similar, except that
-       instead of outputting a callout number before the position  indicators,
-       the  callout  string  and  its  offset in the pattern string are output
-       before the reflection of the subject string, and the subject string  is
+       instead  of outputting a callout number before the position indicators,
+       the callout string and its offset in  the  pattern  string  are  output
+       before  the reflection of the subject string, and the subject string is
        reflected for each callout. For example:

            re> /^ab(?C'first')cd(?C"second")ef/
@@ -1557,43 +1568,43 @@
 NON-PRINTING CHARACTERS

        When pcre2test is outputting text in the compiled version of a pattern,
-       bytes other than 32-126 are always treated as  non-printing  characters
+       bytes  other  than 32-126 are always treated as non-printing characters
        and are therefore shown as hex escapes.

-       When  pcre2test  is outputting text that is a matched part of a subject
-       string, it behaves in the same way, unless a different locale has  been
-       set  for  the  pattern  (using  the locale modifier). In this case, the
-       isprint() function is used to  distinguish  printing  and  non-printing
+       When pcre2test is outputting text that is a matched part of  a  subject
+       string,  it behaves in the same way, unless a different locale has been
+       set for the pattern (using the locale  modifier).  In  this  case,  the
+       isprint()  function  is  used  to distinguish printing and non-printing
        characters.

 SAVING AND RESTORING COMPILED PATTERNS

-       It  is  possible  to  save  compiled patterns on disc or elsewhere, and
+       It is possible to save compiled patterns  on  disc  or  elsewhere,  and
        reload them later, subject to a number of restrictions. JIT data cannot
-       be  saved.  The host on which the patterns are reloaded must be running
+       be saved. The host on which the patterns are reloaded must  be  running
        the same version of PCRE2, with the same code unit width, and must also
-       have  the  same  endianness,  pointer width and PCRE2_SIZE type. Before
-       compiled patterns can be saved they must be serialized, that  is,  con-
-       verted  to a stream of bytes. A single byte stream may contain any num-
-       ber of compiled patterns, but they must  all  use  the  same  character
+       have the same endianness, pointer width  and  PCRE2_SIZE  type.  Before
+       compiled  patterns  can be saved they must be serialized, that is, con-
+       verted to a stream of bytes. A single byte stream may contain any  num-
+       ber  of  compiled  patterns,  but  they must all use the same character
        tables. A single copy of the tables is included in the byte stream (its
        size is 1088 bytes).

-       The functions whose names begin  with  pcre2_serialize_  are  used  for
-       serializing  and de-serializing. They are described in the pcre2serial-
+       The  functions  whose  names  begin  with pcre2_serialize_ are used for
+       serializing and de-serializing. They are described in the  pcre2serial-
        ize  documentation.  In  this  section  we  describe  the  features  of
        pcre2test that can be used to test these functions.

-       When  a  pattern  with  push  modifier  is successfully compiled, it is
-       pushed onto a stack of compiled patterns,  and  pcre2test  expects  the
-       next  line  to  contain a new pattern (or command) instead of a subject
-       line. By contrast, the pushcopy modifier causes a copy of the  compiled
-       pattern  to  be  stacked,  leaving the original available for immediate
-       matching. By using push and/or pushcopy, a number of  patterns  can  be
+       When a pattern with push  modifier  is  successfully  compiled,  it  is
+       pushed  onto  a  stack  of compiled patterns, and pcre2test expects the
+       next line to contain a new pattern (or command) instead  of  a  subject
+       line.  By contrast, the pushcopy modifier causes a copy of the compiled
+       pattern to be stacked, leaving the  original  available  for  immediate
+       matching.  By  using  push and/or pushcopy, a number of patterns can be
        compiled and retained. These modifiers are incompatible with posix, and
-       control modifiers that act at match time are ignored (with  a  message)
-       for  the  stacked patterns. The jitverify modifier applies only at com-
+       control  modifiers  that act at match time are ignored (with a message)
+       for the stacked patterns. The jitverify modifier applies only  at  com-
        pile time.

        The command
@@ -1601,21 +1612,21 @@
          #save <filename>

        causes all the stacked patterns to be serialized and the result written
-       to  the named file. Afterwards, all the stacked patterns are freed. The
+       to the named file. Afterwards, all the stacked patterns are freed.  The
        command

          #load <filename>

-       reads the data in the file, and then arranges for it to  be  de-serial-
-       ized,  with the resulting compiled patterns added to the pattern stack.
-       The pattern on the top of the stack can be retrieved by the  #pop  com-
-       mand,  which  must  be  followed  by  lines  of subjects that are to be
-       matched with the pattern, terminated as usual by an empty line  or  end
-       of  file.  This  command  may be followed by a modifier list containing
-       only control modifiers that act after a pattern has been  compiled.  In
+       reads  the  data in the file, and then arranges for it to be de-serial-
+       ized, with the resulting compiled patterns added to the pattern  stack.
+       The  pattern  on the top of the stack can be retrieved by the #pop com-
+       mand, which must be followed by  lines  of  subjects  that  are  to  be
+       matched  with  the pattern, terminated as usual by an empty line or end
+       of file. This command may be followed by  a  modifier  list  containing
+       only  control  modifiers that act after a pattern has been compiled. In
        particular,  hex,  posix,  posix_nosub,  push,  and  pushcopy  are  not
-       allowed, nor are any option-setting modifiers.  The JIT modifiers  are,
-       however  permitted.  Here is an example that saves and reloads two pat-
+       allowed,  nor are any option-setting modifiers.  The JIT modifiers are,
+       however permitted. Here is an example that saves and reloads  two  pat-
        terns.

          /abc/push
@@ -1628,10 +1639,10 @@
          #pop jit,bincode
          abc

-       If jitverify is used with #pop, it does not  automatically  imply  jit,
+       If  jitverify  is  used with #pop, it does not automatically imply jit,
        which is different behaviour from when it is used on a pattern.

-       The  #popcopy  command is analagous to the pushcopy modifier in that it
+       The #popcopy command is analagous to the pushcopy modifier in  that  it
        makes current a copy of the topmost stack pattern, leaving the original
        still on the stack.

@@ -1651,5 +1662,5 @@

 REVISION

-       Last updated: 01 June 2017
+       Last updated: 03 June 2017
        Copyright (c) 1997-2017 University of Cambridge.

Modified: code/trunk/src/pcre2posix.c
===================================================================
--- code/trunk/src/pcre2posix.c    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/src/pcre2posix.c    2017-06-05 18:25:47 UTC (rev 820)
@@ -231,10 +231,14 @@
 regcomp(regex_t *preg, const char *pattern, int cflags)
 {
 PCRE2_SIZE erroffset;
+PCRE2_SIZE patlen;
 int errorcode;
 int options = 0;
 int re_nsub = 0;

+patlen = ((cflags & REG_PEND) != 0)? (PCRE2_SIZE)(preg->re_endp - pattern) :
+  PCRE2_ZERO_TERMINATED;
+  
 if ((cflags & REG_ICASE) != 0)    options |= PCRE2_CASELESS;
 if ((cflags & REG_NEWLINE) != 0)  options |= PCRE2_MULTILINE;
 if ((cflags & REG_DOTALL) != 0)   options |= PCRE2_DOTALL;
@@ -243,8 +247,8 @@
 if ((cflags & REG_UNGREEDY) != 0) options |= PCRE2_UNGREEDY;

 preg->re_cflags = cflags;
-preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED,
- options, &errorcode, &erroffset, NULL);
+preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, patlen, options,
+ &errorcode, &erroffset, NULL);
 preg->re_erroffset = erroffset;

 if (preg->re_pcre2_code == NULL)

Modified: code/trunk/src/pcre2posix.h
===================================================================
--- code/trunk/src/pcre2posix.h    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/src/pcre2posix.h    2017-06-05 18:25:47 UTC (rev 820)
@@ -62,6 +62,7 @@
 #define REG_NOTEMPTY  0x0100  /* NOT defined by POSIX; maps to PCRE2_NOTEMPTY */
 #define REG_UNGREEDY  0x0200  /* NOT defined by POSIX; maps to PCRE2_UNGREEDY */
 #define REG_UCP       0x0400  /* NOT defined by POSIX; maps to PCRE2_UCP */
+#define REG_PEND      0x0800  /* GNU feature: pass end pattern by re_endp */

 /* This is not used by PCRE2, but by defining it we make it easier
 to slot PCRE2 into existing programs that make POSIX calls. */
@@ -91,11 +92,13 @@
};

-/* The structure representing a compiled regular expression. */
+/* The structure representing a compiled regular expression. It is also used
+for passing the pattern end pointer when REG_PEND is set. */

 typedef struct {
   void *re_pcre2_code;
   void *re_match_data;
+  const char *re_endp;
   size_t re_nsub;
   size_t re_erroffset;
   int re_cflags;

Modified: code/trunk/src/pcre2test.c
===================================================================
--- code/trunk/src/pcre2test.c    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/src/pcre2test.c    2017-06-05 18:25:47 UTC (rev 820)
@@ -538,7 +538,7 @@
   uint32_t  control;       /* Must be in same position as patctl */
   uint32_t  control2;      /* Must be in same position as patctl */
    uint8_t  replacement[REPLACE_MODSIZE];  /* So must this */
-  uint32_t  startend[2];  
+  uint32_t  startend[2];
   uint32_t  cerror[2];
   uint32_t  cfail[2];
    int32_t  callout_data;
@@ -699,7 +699,8 @@
 #define POSIX_SUPPORTED_COMPILE_EXTRA_OPTIONS (0)

 static uint32_t exclusive_pat_controls[] = {
-  CTL_POSIX    | CTL_HEXPAT,
   CTL_POSIX    | CTL_PUSH,
   CTL_POSIX    | CTL_PUSHCOPY,
   CTL_POSIX    | CTL_PUSHTABLESCOPY,
-  CTL_POSIX    | CTL_USE_LENGTH,
   CTL_PUSH     | CTL_PUSHCOPY,
   CTL_PUSH     | CTL_PUSHTABLESCOPY,
   CTL_PUSHCOPY | CTL_PUSHTABLESCOPY,
@@ -896,7 +895,7 @@
 static uint32_t malloclistptr = 0;

 #ifdef SUPPORT_PCRE2_8
-static regex_t preg = { NULL, NULL, 0, 0, 0 };
+static regex_t preg = { NULL, NULL, 0, 0, 0, 0 };
 #endif

 static int *dfa_workspace = NULL;
@@ -5264,6 +5263,12 @@
   if ((pat_patctl.options & PCRE2_DOTALL) != 0) cflags |= REG_DOTALL;
   if ((pat_patctl.options & PCRE2_UNGREEDY) != 0) cflags |= REG_UNGREEDY;

+  if ((pat_patctl.control & (CTL_HEXPAT|CTL_USE_LENGTH)) != 0)
+    {
+    preg.re_endp = (char *)pbuffer8 + patlen;
+    cflags |= REG_PEND;  
+    }  
+
   rc = regcomp(&preg, (char *)pbuffer8, cflags);

   /* Compiling failed */
@@ -6665,10 +6670,10 @@
   if (dat_datctl.startend[0] != CFORE_UNSET)
     {
     pmatch[0].rm_so = dat_datctl.startend[0];
-    pmatch[0].rm_eo = (dat_datctl.startend[1] != 0)? 
+    pmatch[0].rm_eo = (dat_datctl.startend[1] != 0)?
       dat_datctl.startend[1] : len;
     eflags |= REG_STARTEND;
-    }  
+    }

   if ((dat_datctl.options & PCRE2_NOTBOL) != 0) eflags |= REG_NOTBOL;
   if ((dat_datctl.options & PCRE2_NOTEOL) != 0) eflags |= REG_NOTEOL;

Modified: code/trunk/testdata/testinput18
===================================================================
--- code/trunk/testdata/testinput18    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/testdata/testinput18    2017-06-05 18:25:47 UTC (rev 820)
@@ -123,4 +123,10 @@
 /^a\x{00}b$/posix
     a\x{00}b\=posix_startend=0:3

+/"A" 00 "B"/hex
+    A\x{00}B\=posix_startend=0:3
+    
+/ABC/use_length 
+    ABC
+
 # End of testdata/testinput18

Modified: code/trunk/testdata/testinput19
===================================================================
--- code/trunk/testdata/testinput19    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/testdata/testinput19    2017-06-05 18:25:47 UTC (rev 820)
@@ -15,4 +15,7 @@
 /\w/ucp
     +++\x{c2}

+/"^AB" 00 "\x{1234}$"/hex,utf
+    AB\x{00}\x{1234}\=posix_startend=0:6 
+    
 # End of testdata/testinput19

Modified: code/trunk/testdata/testoutput18
===================================================================
--- code/trunk/testdata/testoutput18    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/testdata/testoutput18    2017-06-05 18:25:47 UTC (rev 820)
@@ -191,4 +191,12 @@
     a\x{00}b\=posix_startend=0:3
  0: a\x00b

+/"A" 00 "B"/hex
+    A\x{00}B\=posix_startend=0:3
+ 0: A\x00B
+    
+/ABC/use_length 
+    ABC
+ 0: ABC
+
 # End of testdata/testinput18

Modified: code/trunk/testdata/testoutput19
===================================================================
--- code/trunk/testdata/testoutput19    2017-06-03 17:50:03 UTC (rev 819)
+++ code/trunk/testdata/testoutput19    2017-06-05 18:25:47 UTC (rev 820)
@@ -18,4 +18,8 @@
     +++\x{c2}
  0: \xc2

+/"^AB" 00 "\x{1234}$"/hex,utf
+    AB\x{00}\x{1234}\=posix_startend=0:6 
+ 0: AB\x{00}\x{1234}
+    
 # End of testdata/testinput19
