[Pcre-svn] [840] code/trunk: 16-bit update of non-man docume…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [840] code/trunk: 16-bit update of non-man documentation files and the PrepareRelease script.
Revision: 840
          http://vcs.pcre.org/viewvc?view=rev&revision=840
Author:   ph10
Date:     2011-12-30 19:32:50 +0000 (Fri, 30 Dec 2011)


Log Message:
-----------
16-bit update of non-man documentation files and the PrepareRelease script.

Modified Paths:
--------------
    code/trunk/HACKING
    code/trunk/NEWS
    code/trunk/NON-UNIX-USE
    code/trunk/PrepareRelease
    code/trunk/README


Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING    2011-12-30 13:22:28 UTC (rev 839)
+++ code/trunk/HACKING    2011-12-30 19:32:50 UTC (rev 840)
@@ -49,6 +49,18 @@
 first pass through the pattern is helpful for other reasons. 



+Support for 16-bit data strings
+-------------------------------
+
+From release 8.30, PCRE supports 16-bit as well as 8-bit data strings, by being
+compilable in either 8-bit or 16-bit modes, or both. Thus, two different
+libraries can be created. In the description that follows, the word "short" is
+used for a 16-bit data quantity, and the word "unit" is used for a quantity
+that is a byte in 8-bit mode and a short in 16-bit mode. However, so as not to
+over-complicate the text, the names of PCRE functions are given in 8-bit form
+only.
+
+
Computing the memory requirement: how it was
--------------------------------------------

@@ -125,23 +137,25 @@
Format of compiled patterns
---------------------------

-The compiled form of a pattern is a vector of bytes, containing items of
-variable length. The first byte in an item is an opcode, and the length of the
-item is either implicit in the opcode or contained in the data bytes that
-follow it.
+The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or
+shorts in 16-bit mode), containing items of variable length. The first unit in
+an item contains an opcode, and the length of the item is either implicit in
+the opcode or contained in the data that follows it.

-In many cases below LINK_SIZE data values are specified for offsets within the
-compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be
-compiled to use 3-byte or 4-byte values for these offsets (impairing the
-performance). This is necessary only when patterns whose compiled length is
-greater than 64K are going to be processed. In this description, we assume the
-"normal" compilation options. Data values that are counts (e.g. for
-quantifiers) are always just two bytes long.
+In many cases listed below, LINK_SIZE data values are specified for offsets
+within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
+default value for LINK_SIZE is 2, but PCRE can be compiled to use 3-byte or
+4-byte values for these offsets, although this impairs the performance. (3-byte
+LINK_SIZE values are available only in 8-bit mode.) Specifying a LINK_SIZE
+larger than 2 is necessary only when patterns whose compiled length is greater
+than 64K are going to be processed. In this description, we assume the "normal"
+compilation options. Data values that are counts (e.g. for quantifiers) are
+always just two bytes long (one short in 16-bit mode).

Opcodes with no following data
------------------------------

-These items are all just one byte long
+These items are all just one unit long

   OP_END                 end of pattern
   OP_ANY                 match any one character other than newline
@@ -182,7 +196,7 @@
 -----------------------------------------------


(*THEN) without an argument generates the opcode OP_THEN and no following data.
-OP_MARK is followed by the mark name, preceded by a one-byte length, and
+OP_MARK is followed by the mark name, preceded by a one-unit length, and
followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name
following in the same format.
@@ -192,16 +206,14 @@
---------------------------

The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 mode, the
-character may be more than one byte long. (Earlier versions of PCRE used
-multi-character strings, but this was changed to allow some new features to be
-added.)
+casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
+the character may be more than one unit long.


Repeating single characters
---------------------------

-The common repeats (*, +, ?) when applied to a single character use the
+The common repeats (*, +, ?), when applied to a single character, use the
following opcodes, which come in caseful and caseless versions:

   Caseful         Caseless
@@ -215,10 +227,11 @@
   OP_MINQUERY     OP_MINQUERYI  
   OP_POSQUERY     OP_POSQUERYI  


-In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.
-Those with "MIN" in their name are the minimizing versions. Those with "POS" in
-their names are possessive versions. Each is followed by the character that is
-to be repeated. Other repeats make use of these opcodes:
+Each opcode is followed by the character that is to be repeated. In ASCII mode,
+these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable.
+Those with "MIN" in their names are the minimizing versions. Those with "POS"
+in their names are possessive versions. Other repeats make use of these
+opcodes:

   Caseful         Caseless
   OP_UPTO         OP_UPTOI    
@@ -226,10 +239,10 @@
   OP_POSUPTO      OP_POSUPTOI 
   OP_EXACT        OP_EXACTI   


-Each of these is followed by a two-byte count (most significant first) and the
-repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
-non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
-OP_UPTO (or OP_MINUPTO or OPT_POSUPTO).
+Each of these is followed by a two-byte (one short) count (most significant
+byte first in 8-bit mode) and then the repeated character. OP_UPTO matches from
+0 to the given number. A repeat with a non-zero minimum and a fixed maximum is
+coded as an OP_EXACT followed by an OP_UPTO (or OP_MINUPTO or OP_POSUPTO).


Repeating character types
@@ -237,7 +250,7 @@

Repeats of things like \d are done exactly as for single characters, except
that instead of a character, the opcode for the type is stored in the data
-byte. The opcodes are:
+unit. The opcodes are:

OP_TYPESTAR
OP_TYPEMINSTAR
@@ -259,49 +272,51 @@

OP_PROP and OP_NOTPROP are used for positive and negative matches of a
character by testing its Unicode property (the \p and \P escape sequences).
-Each is followed by two bytes that encode the desired property as a type and a
+Each is followed by two units that encode the desired property as a type and a
value.

-Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
-three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
+Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
+three units: OP_PROP or OP_NOTPROP, and then the desired property type and
value.


Character classes
-----------------

-If there is only one character, OP_CHAR or OP_CHARI is used for a positive
-class, and OP_NOT or OP_NOTI for a negative one (that is, for something like
-[^a]). However, in UTF-8 mode, the use of OP_NOT[I] applies only to characters
-with values < 128, because OP_NOT[I] is confined to single bytes.
+If there is only one character in the class, OP_CHAR or OP_CHARI is used for a
+positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
+something like [^a]). However, OP_NOT[I] can be used only with single-unit
+characters, so in UTF-8 (UTF-16) mode, the use of OP_NOT[I] applies only to
+characters whose code points are no greater than 127 (0xffff).

-Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for a
-repeated, negated, single-character class. The normal single-character opcodes
-(OP_STAR, etc.) are used for a repeated positive single-character class.
+Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for
+repeated, negated, single-character classes. The normal single-character
+opcodes (OP_STAR, etc.) are used for repeated positive single-character
+classes.

When there is more than one character in a class and all the characters are
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
-negative one. In either case, the opcode is followed by a 32-byte bit map
-containing a 1 bit for every character that is acceptable. The bits are counted
-from the least significant end of each byte. In caseless mode, bits for both
-cases are set.
+negative one. In either case, the opcode is followed by a 32-byte (16-short)
+bit map containing a 1 bit for every character that is acceptable. The bits are
+counted from the least significant end of each unit. In caseless mode, bits for
+both cases are set.

-The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
-subject characters with values greater than 256 can be handled correctly. For
+The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16 mode,
+subject characters with values greater than 255 can be handled correctly. For
OP_CLASS they do not match, whereas for OP_NCLASS they do.

-For classes containing characters with values > 255, OP_XCLASS is used. It
-optionally uses a bit map (if any characters lie within it), followed by a list
-of pairs (for a range) and single characters. In caseless mode, both cases are
-explicitly listed. There is a flag character than indicates whether it is a
-positive or a negative class.
+For classes containing characters with values greater than 255, OP_XCLASS is
+used. It optionally uses a bit map (if any characters lie within it), followed
+by a list of pairs (for a range) and single characters. In caseless mode, both
+cases are explicitly listed. There is a flag character that indicates whether
+it is a positive or a negative class.


Back references
---------------

-OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes containing the
-reference number.
+OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes (one short)
+containing the reference number.


Repeating character classes and back references
@@ -321,10 +336,10 @@
OP_CRRANGE
OP_CRMINRANGE

-All but the last two are just single-byte items. The others are followed by
-four bytes of data, comprising the minimum and maximum repeat counts. There are
-no special possessive opcodes for these repeats; a possessive repeat is
-compiled into an atomic group.
+All but the last two are just single-unit items. The others are followed by
+four bytes (two shorts) of data, comprising the minimum and maximum repeat
+counts. There are no special possessive opcodes for these repeats; a possessive
+repeat is compiled into an atomic group.


Brackets and alternation
@@ -334,7 +349,8 @@
compile time, so alternation always happens in the context of brackets.

[Note for North Americans: "bracket" to some English speakers, including
-myself, can be round, square, curly, or pointy. Hence this usage.]
+myself, can be round, square, curly, or pointy. Hence this usage rather than
+"parentheses".]

Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
capturing brackets and it used a different opcode for each one. From release
@@ -346,9 +362,9 @@
next alternative OP_ALT or, if there aren't any branches, to the matching
OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
the next one, or to the OP_KET opcode. For capturing brackets, the bracket
-number immediately follows the offset, always as a 2-byte item.
+number immediately follows the offset, always as a 2-byte (one short) item.

-OP_KET is used for subpatterns that do not repeat indefinitely, while
+OP_KET is used for subpatterns that do not repeat indefinitely, and
OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
maximally respectively (see below for possessive repetitions). All three are
followed by LINK_SIZE bytes giving (as a positive number) the offset back to
@@ -356,7 +372,7 @@

If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
-single-byte opcodes that tell the matcher that skipping the following
+single-unit opcodes that tell the matcher that skipping the following
subpattern entirely is a valid branch. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
@@ -395,11 +411,11 @@
Forward assertions are just like other subpatterns, but starting with one of
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
-is OP_REVERSE, followed by a two byte count of the number of characters to move
-back the pointer in the subject string. When operating in UTF-8 mode, the count
-is a character count rather than a byte count. A separate count is present in
-each alternative of a lookbehind assertion, allowing them to have different
-fixed lengths.
+is OP_REVERSE, followed by a two-byte (one short) count of the number of
+characters to move back the pointer in the subject string. In ASCII mode, the
+count is a number of units, but in UTF-8/16 mode each character may occupy more
+than one unit. A separate count is present in each alternative of a lookbehind
+assertion, allowing them to have different fixed lengths.


Once-only (atomic) subpatterns
@@ -416,14 +432,15 @@
These are like other subpatterns, but they start with the opcode OP_COND, or
OP_SCOND for one that might match an empty string in an unbounded repeat. If
the condition is a back reference, this is stored at the start of the
-subpattern using the opcode OP_CREF followed by two bytes containing the
-reference number. OP_NCREF is used instead if the reference was generated by
-name (so that the runtime code knows to check for duplicate names).
+subpattern using the opcode OP_CREF followed by two bytes (one short)
+containing the reference number. OP_NCREF is used instead if the reference was
+generated by name (so that the runtime code knows to check for duplicate
+names).

If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
group x" (coded as "(?(Rx)"), the group number is stored at the start of the
subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of
-zero for "the whole pattern". For a DEFINE condition, just the single byte
+zero for "the whole pattern". For a DEFINE condition, just the single unit
OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern
always starts with one of the assertions.

@@ -442,12 +459,12 @@
Callout
-------

-OP_CALLOUT is followed by one byte of data that holds a callout number in the
+OP_CALLOUT is followed by one unit of data that holds a callout number in the
range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
-cases there follows a two-byte value giving the offset in the pattern to the
-start of the following item, and another two-byte item giving the length of the
-next item.
+cases there follows a two-byte (one short) value giving the offset in the
+pattern to the start of the following item, and another two-byte (one short)
+item giving the length of the next item.


Philip Hazel
-October 2011
+December 2011
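
The 8-bit layout described in the HACKING changes above can be sketched in C. This is only an illustration under stated assumptions: the helper names (link_units, read_count8, class_has) are hypothetical and are not PCRE identifiers, and the library itself may well do this differently internally. The sketch shows how many units a LINK_SIZE offset occupies, how a two-byte count stored most significant byte first would be read, and how a 32-byte class bit map would be tested.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of PCRE. */

/* Number of code units occupied by a LINK_SIZE offset. LINK_SIZE is always
   given in bytes. unit_bytes is 1 for 8-bit mode and 2 for 16-bit mode.
   3-byte links exist only in 8-bit mode; in 16-bit mode a link size of 3
   is treated as 4, so the offset occupies two shorts. */
int link_units(int unit_bytes, int link_size)
{
  if (unit_bytes == 1) return link_size;   /* 2, 3 or 4 bytes */
  return (link_size == 2) ? 1 : 2;         /* 1 or 2 shorts */
}

/* Read a repeat count in 8-bit mode: two bytes, most significant first.
   (In 16-bit mode the count is a single short, so no assembly is needed.) */
unsigned read_count8(const uint8_t *p)
{
  return ((unsigned)p[0] << 8) | p[1];
}

/* Test a character value below 256 against a 32-byte class bit map in
   8-bit mode. Bits are counted from the least significant end of each
   byte, so character c lives in byte c/8 at bit position c%8. */
int class_has(const uint8_t map[32], unsigned c)
{
  return (map[c / 8] >> (c % 8)) & 1;
}
```

For instance, read_count8 applied to the bytes 0x01 0x2c gives 300, and link_units(2, 3) gives 2 because the 16-bit library has no 3-byte links.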

Modified: code/trunk/NEWS
===================================================================
--- code/trunk/NEWS    2011-12-30 13:22:28 UTC (rev 839)
+++ code/trunk/NEWS    2011-12-30 19:32:50 UTC (rev 840)
@@ -1,6 +1,15 @@
 News about PCRE releases
 ------------------------


+Release 8.30
+------------
+
+Release 8.30 introduces a major new feature: support for 16-bit character
+strings, compiled as a separate library. There are no new features in the 8-bit
+library, but some bugs have been mended. However, note that the pcre_info()
+function, which has been obsolete for over 10 years, has been removed.
+
+
Release 8.21 12-Dec-2011
------------------------


Modified: code/trunk/NON-UNIX-USE
===================================================================
--- code/trunk/NON-UNIX-USE    2011-12-30 13:22:28 UTC (rev 839)
+++ code/trunk/NON-UNIX-USE    2011-12-30 19:32:50 UTC (rev 840)
@@ -88,15 +88,11 @@
        pcre_internal.h
        ucp.h


- (5) Also ensure that you have the following file, which is #included as source
-     when building a debugging version of PCRE, and is also used by pcretest.
+ (5) For an 8-bit library, compile the following source files, setting
+     -DHAVE_CONFIG_H as a compiler option if you have set up config.h with your
+     configuration, or else use other -D settings to change the configuration
+     as required.


-       pcre_printint.src
-
- (6) Compile the following source files, setting -DHAVE_CONFIG_H as a compiler
-     option if you have set up config.h with your configuration, or else use
-     other -D settings to change the configuration as required.
-
        pcre_byte_order.c
        pcre_chartables.c
        pcre_compile.c
@@ -106,11 +102,11 @@
        pcre_fullinfo.c
        pcre_get.c
        pcre_globals.c
-       pcre_info.c
        pcre_maketables.c
        pcre_newline.c
        pcre_ord2utf8.c
        pcre_refcount.c
+       pcre_string_utils.c 
        pcre_study.c
        pcre_tables.c
        pcre_ucd.c
@@ -122,37 +118,67 @@
      an unusual compiler) so that all included PCRE header files are first
      sought in the current directory. Otherwise you run the risk of picking up
      a previously-installed file from somewhere else.
+     
+ (6) If you have defined SUPPORT_JIT in config.h, you must also compile


- (7) If you have defined SUPPORT_JIT in config.h, you must also compile
-
        pcre_jit_compile.c


      This file #includes sources from the sljit subdirectory, where there
      should be 16 files, all of whose names begin with "sljit".


- (8) Now link all the compiled code into an object library in whichever form
-     your system keeps such libraries. This is the basic PCRE C library. If
-     your system has static and shared libraries, you may have to do this once
-     for each type.
+ (7) Now link all the compiled code into an object library in whichever form
+     your system keeps such libraries. This is the basic PCRE C 8-bit library.
+     If your system has static and shared libraries, you may have to do this
+     once for each type.
+     
+ (8) If you want to build a 16-bit library (as well as, or instead of, the 8-bit
+     library), repeat steps 5-7 with the following files:


- (9) Similarly, if you want to build the POSIX wrapper functions, ensure that
-     you have the pcreposix.h file and then compile pcreposix.c (remembering
-     -DHAVE_CONFIG_H if necessary). Link the result (on its own) as the
-     pcreposix library.
+       pcre16_byte_order.c
+       pcre16_chartables.c
+       pcre16_compile.c
+       pcre16_config.c
+       pcre16_dfa_exec.c
+       pcre16_exec.c
+       pcre16_fullinfo.c
+       pcre16_get.c
+       pcre16_globals.c
+       pcre16_jit_compile.c (if SUPPORT_JIT is defined)
+       pcre16_maketables.c
+       pcre16_newline.c
+       pcre16_ord2utf16.c
+       pcre16_refcount.c
+       pcre16_string_utils.c 
+       pcre16_study.c
+       pcre16_tables.c
+       pcre16_ucd.c
+       pcre16_utf16_utils.c 
+       pcre16_valid_utf16.c
+       pcre16_version.c
+       pcre16_xclass.c


-(10) Compile the test program pcretest.c (again, don't forget -DHAVE_CONFIG_H).
-     This needs the functions in the PCRE library when linking. It also needs
-     the pcreposix wrapper functions unless you compile it with -DNOPOSIX. The
-     pcretest.c program also needs the pcre_printint.src source file, which it
-     #includes.
+ (9) If you want to build the POSIX wrapper functions (which apply only to the 
+     8-bit library), ensure that you have the pcreposix.h file and then compile
+     pcreposix.c (remembering -DHAVE_CONFIG_H if necessary). Link the result
+     (on its own) as the pcreposix library.


+(10) The pcretest program can be linked with either or both of the 8-bit and
+     16-bit libraries (depending on what you selected in config.h). Compile
+     pcretest.c and pcre_printint.c (again, don't forget -DHAVE_CONFIG_H) and
+     link them together with the appropriate library/ies. If you compiled an
+     8-bit library, pcretest also needs the pcreposix wrapper library unless
+     you compiled it with -DNOPOSIX.
+
 (11) Run pcretest on the testinput files in the testdata directory, and check
-     that the output matches the corresponding testoutput files. Some tests are
-     relevant only when certain build-time options are selected. For example,
-     test 4 is for UTF-8 support, and will not run if you have build PCRE
-     without it. See the comments at the start of each testinput file. If you
-     have a suitable Unix-like shell, the RunTest script will run the
-     appropriate tests for you.
+     that the output matches the corresponding testoutput files. If you 
+     compiled both an 8-bit and a 16-bit library, you need to run pcretest with 
+     the -16 option to do 16-bit tests.
+       
+     Some tests are relevant only when certain build-time options are selected.
+     For example, test 4 is for UTF-8 or UTF-16 support, and will not run if
+     you have built PCRE without it. See the comments at the start of each
+     testinput file. If you have a suitable Unix-like shell, the RunTest script
+     will run the appropriate tests for you.


      Note that the supplied files are in Unix format, with just LF characters
      as line terminators. You may need to edit them to change this if your
@@ -167,17 +193,18 @@
      the JIT test program, pcre_jit_test.c.


 (13) If you want to use the pcregrep command, compile and link pcregrep.c; it
-     uses only the basic PCRE library (it does not need the pcreposix library).
+     uses only the basic 8-bit PCRE library (it does not need the pcreposix
+     library).



THE C++ WRAPPER FUNCTIONS

-The PCRE distribution also contains some C++ wrapper functions and tests,
-contributed by Google Inc. On a system that can use "configure" and "make",
-the functions are automatically built into a library called pcrecpp. It should
-be straightforward to compile the .cc files manually on other systems. The
-files called xxx_unittest.cc are test programs for each of the corresponding
-xxx.cc files.
+The PCRE distribution also contains some C++ wrapper functions and tests,
+applicable to the 8-bit library, which were contributed by Google Inc. On a
+system that can use "configure" and "make", the functions are automatically
+built into a library called pcrecpp. It should be straightforward to compile
+the .cc files manually on other systems. The files called xxx_unittest.cc are
+test programs for each of the corresponding xxx.cc files.


BUILDING FOR VIRTUAL PASCAL
@@ -547,5 +574,5 @@


=========================
-Last Updated: 9 October 2011
+Last Updated: 30 December 2011
****

Modified: code/trunk/PrepareRelease
===================================================================
--- code/trunk/PrepareRelease    2011-12-30 13:22:28 UTC (rev 839)
+++ code/trunk/PrepareRelease    2011-12-30 19:32:50 UTC (rev 840)
@@ -159,7 +159,9 @@
 # These files are detrailed; do not detrail the test data because there may be
 # significant trailing spaces. Do not detrail RunTest.bat, because it has CRLF
 # line endings and the detrail script removes all trailing white space. The
-# configure files are also omitted from the detrailing.
+# configure files are also omitted from the detrailing. We don't bother with
+# those pcre16_xx files that just define COMPILE_PCRE16 and then #include the
+# common file, because they aren't going to change.


files="\
Makefile.am \
@@ -181,10 +183,10 @@
RunTest \
pcre-config.in \
libpcre.pc.in \
+ libpcre16.pc.in \
libpcreposix.pc.in \
libpcrecpp.pc.in \
config.h.in \
- pcre_printint.src \
pcre_chartables.c.dist \
pcredemo.c \
pcregrep.c \
@@ -202,19 +204,23 @@
pcre_fullinfo.c \
pcre_get.c \
pcre_globals.c \
- pcre_info.c \
pcre_jit_compile.c \
pcre_jit_test.c \
pcre_maketables.c \
pcre_newline.c \
pcre_ord2utf8.c \
+ pcre_ord2utf16.c \
+ pcre_printint.c \
pcre_refcount.c \
+ pcre_stringutils.c \
pcre_study.c \
pcre_tables.c \
pcre_ucp_searchfuncs.c \
pcre_valid_utf8.c \
pcre_version.c \
pcre_xclass.c \
+ pcre16_utf16_utils.c \
+ pcre16_valid_utf16.c \
pcre_scanner.cc \
pcre_scanner.h \
pcre_scanner_unittest.cc \

Modified: code/trunk/README
===================================================================
--- code/trunk/README    2011-12-30 13:22:28 UTC (rev 839)
+++ code/trunk/README    2011-12-30 19:32:50 UTC (rev 840)
@@ -34,16 +34,19 @@
 The PCRE APIs
 -------------


-PCRE is written in C, and it has its own API. The distribution also includes a
-set of C++ wrapper functions (see the pcrecpp man page for details), courtesy
-of Google Inc.
+PCRE is written in C, and it has its own API. There are two sets of functions,
+one for the 8-bit library, which processes strings of bytes, and one for the
+16-bit library, which processes strings of 16-bit values. The distribution also
+includes a set of C++ wrapper functions (see the pcrecpp man page for details),
+courtesy of Google Inc., which can be used to call the 8-bit PCRE library from
+C++.

-In addition, there is a set of C wrapper functions that are based on the POSIX
-regular expression API (see the pcreposix man page). These end up in the
-library called libpcreposix. Note that this just provides a POSIX calling
-interface to PCRE; the regular expressions themselves still follow Perl syntax
-and semantics. The POSIX API is restricted, and does not give full access to
-all of PCRE's facilities.
+In addition, there is a set of C wrapper functions (again, just for the 8-bit
+library) that are based on the POSIX regular expression API (see the pcreposix
+man page). These end up in the library called libpcreposix. Note that this just
+provides a POSIX calling interface to PCRE; the regular expressions themselves
+still follow Perl syntax and semantics. The POSIX API is restricted, and does
+not give full access to all of PCRE's facilities.

The header file for the POSIX-style functions is called pcreposix.h. The
official POSIX name is regex.h, but I did not want to risk possible problems
@@ -143,9 +146,9 @@

CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local

-specifies that the C compiler should be run with the flags '-O2 -Wall' instead
-of the default, and that "make install" should install PCRE under /opt/local
-instead of the default /usr/local.
+This command specifies that the C compiler should be run with the flags '-O2
+-Wall' instead of the default, and that "make install" should install PCRE
+under /opt/local instead of the default /usr/local.

If you want to build in a different directory, just run "configure" with that
directory as current. For example, suppose you have unpacked the PCRE source
@@ -168,11 +171,16 @@
--disable-static

(See also "Shared libraries on Unix-like systems" below.)
+
+. By default, only the 8-bit library is built. If you add --enable-pcre16 to
+ the "configure" command, the 16-bit library is also built. If you want only
+ the 16-bit library, use "./configure --enable-pcre16 --disable-pcre8".

-. If you want to suppress the building of the C++ wrapper library, you can add
- --disable-cpp to the "configure" command. Otherwise, when "configure" is run,
- it will try to find a C++ compiler and C++ header files, and if it succeeds,
- it will try to build the C++ wrapper.
+. If you are building the 8-bit library and want to suppress the building of
+ the C++ wrapper library, you can add --disable-cpp to the "configure"
+ command. Otherwise, when "configure" is run without --disable-pcre8, it will
+ try to find a C++ compiler and C++ header files, and if it succeeds, it will
+ try to build the C++ wrapper.

. If you want to include support for just-in-time compiling, which can give
large performance improvements on certain platforms, add --enable-jit to the
@@ -184,19 +192,26 @@
you add --disable-pcregrep-jit to the "configure" command.

. If you want to make use of the support for UTF-8 Unicode character strings in
- PCRE, you must add --enable-utf8 to the "configure" command. Without it, the
- code for handling UTF-8 is not included in the library. Even when included,
- it still has to be enabled by an option at run time. When PCRE is compiled
- with this option, its input can only either be ASCII or UTF-8, even when
- running on EBCDIC platforms. It is not possible to use both --enable-utf8 and
- --enable-ebcdic at the same time.
+ the 8-bit library, or UTF-16 Unicode character strings in the 16-bit library,
+ you must add --enable-utf to the "configure" command. Without it, the code
+ for handling UTF-8 and UTF-16 is not included in the relevant library. Even
+ when --enable-utf is included, the use of UTF encoding still has to be enabled
+ by an option at run time. When PCRE is compiled with this option, its input
+ can only either be ASCII or UTF-8/16, even when running on EBCDIC platforms.
+ It is not possible to use both --enable-utf and --enable-ebcdic at the same
+ time.
+
+. The option --enable-utf8 is retained for backwards compatibility with earlier
+ releases that did not support 16-bit character strings. It is synonymous with
+ --enable-utf. It is not possible to configure one library with UTF support
+ and the other without in the same configuration.

-. If, in addition to support for UTF-8 character strings, you want to include
- support for the \P, \p, and \X sequences that recognize Unicode character
- properties, you must add --enable-unicode-properties to the "configure"
- command. This adds about 30K to the size of the library (in the form of a
- property table); only the basic two-letter properties such as Lu are
- supported.
+. If, in addition to support for UTF-8/16 character strings, you want to
+ include support for the \P, \p, and \X sequences that recognize Unicode
+ character properties, you must add --enable-unicode-properties to the
+ "configure" command. This adds about 30K to the size of the library (in the
+ form of a property table); only the basic two-letter properties such as Lu
+ are supported.

. You can build PCRE to recognize either CR or LF or the sequence CRLF or any
of the preceding, or any of the Unicode newline sequences as indicating the
@@ -249,10 +264,11 @@
sizes in the pcrestack man page.

. The default maximum compiled pattern size is around 64K. You can increase
- this by adding --with-link-size=3 to the "configure" command. You can
- increase it even more by setting --with-link-size=4, but this is unlikely
- ever to be necessary. Increasing the internal link size will reduce
- performance.
+ this by adding --with-link-size=3 to the "configure" command. In the 8-bit
+ library, PCRE then uses three bytes instead of two for offsets to different
+ parts of the compiled pattern. In the 16-bit library, --with-link-size=3 is
+ the same as --with-link-size=4, which (in both libraries) uses four-byte
+ offsets. Increasing the internal link size reduces performance.

. You can build PCRE so that its internal match() function that is called from
pcre_exec() does not call itself recursively. Instead, it uses memory blocks
@@ -287,10 +303,12 @@

This automatically implies --enable-rebuild-chartables (see above). However,
when PCRE is built this way, it always operates in EBCDIC. It cannot support
- both EBCDIC and UTF-8.
+ both EBCDIC and UTF-8/16.

-. It is possible to compile pcregrep to use libz and/or libbz2, in order to
- read .gz and .bz2 files (respectively), by specifying one or both of
+. The pcregrep program currently supports only 8-bit data files, and so
+ requires the 8-bit PCRE library. It is possible to compile pcregrep to use
+ libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
+ specifying one or both of

   --enable-pcregrep-libz
   --enable-pcregrep-libbz2
@@ -333,6 +351,7 @@
 . pcre-config          script that shows the building settings such as CFLAGS
                          that were set for "configure"
 . libpcre.pc         ) data for the pkg-config command
+. libpcre16.pc       )
 . libpcreposix.pc    )
 . libtool              script that builds shared and/or static libraries
 . RunTest              script for running tests on the basic C library
@@ -343,7 +362,8 @@
 have to build PCRE without using "configure" or CMake. If you use "configure"
 or CMake, the .generic versions are not used.


-If a C++ compiler is found, the following files are also built:
+When building the 8-bit library, if a C++ compiler is found, the following
+files are also built:

 . libpcrecpp.pc        data for the pkg-config command
 . pcrecpparg.h         header file for calling PCRE via the C++ wrapper
@@ -353,14 +373,17 @@
 script that can be run to recreate the configuration, and config.log, which
 contains compiler output from tests that "configure" runs.


-Once "configure" has run, you can run "make". It builds two libraries, called
-libpcre and libpcreposix, a test program called pcretest, and the pcregrep
-command. If a C++ compiler was found on your system, and you did not disable it
-with --disable-cpp, "make" also builds the C++ wrapper library, which is called
-libpcrecpp, and some test programs called pcrecpp_unittest,
-pcre_scanner_unittest, and pcre_stringpiece_unittest. If you enabled JIT
-support with --enable-jit, a test program called pcre_jit_test is also built.
+Once "configure" has run, you can run "make". This builds either or both of the
+libraries libpcre and libpcre16, and a test program called pcretest. If you
+enabled JIT support with --enable-jit, a test program called pcre_jit_test is
+built as well.

+If the 8-bit library is built, libpcreposix and the pcregrep command are also
+built, and if a C++ compiler was found on your system, and you did not disable
+it with --disable-cpp, "make" builds the C++ wrapper library, which is called
+libpcrecpp, as well as some test programs called pcrecpp_unittest,
+pcre_scanner_unittest, and pcre_stringpiece_unittest.
+
The command "make check" runs all the appropriate tests. Details of the PCRE
tests are given below in a separate section of this document.

@@ -370,15 +393,17 @@

   Commands (bin):
     pcretest
-    pcregrep
+    pcregrep (if 8-bit support is enabled)
     pcre-config


   Libraries (lib):
-    libpcre
-    libpcreposix
-    libpcrecpp (if C++ support is enabled)
+    libpcre16     (if 16-bit support is enabled)
+    libpcre       (if 8-bit support is enabled)
+    libpcreposix  (if 8-bit support is enabled)
+    libpcrecpp    (if 8-bit and C++ support is enabled)


   Configuration information (lib/pkgconfig):
+    libpcre16.pc
     libpcre.pc
     libpcreposix.pc
     libpcrecpp.pc (if C++ support is enabled)
@@ -558,8 +583,8 @@
 own man page) on each of the relevant testinput files in the testdata
 directory, and compares the output with the contents of the corresponding
 testoutput files. Some tests are relevant only when certain build-time options
-were selected. For example, the tests for UTF-8 support are run only if
---enable-utf8 was used. RunTest outputs a comment when it skips a test.
+were selected. For example, the tests for UTF-8/16 support are run only if
+--enable-utf was used. RunTest outputs a comment when it skips a test.


Many of the tests that are not skipped are run up to three times. The second
run forces pcre_study() to be called for all patterns except for a few in some
@@ -567,17 +592,22 @@
done). If JIT support is available, the non-DFA tests are run a third time,
this time with a forced pcre_study() with the PCRE_STUDY_JIT_COMPILE option.

+When both 8-bit and 16-bit support is enabled, the entire set of tests is run
+twice, once for each library. If you want to run just one set of tests, call
+RunTest with either the -8 or -16 option.
+
RunTest uses a file called testtry to hold the main output from pcretest
-(testsavedregex is also used as a working file). To run pcretest on just one of
-the test files, give its number as an argument to RunTest, for example:
+(testsavedregex is also used as a working file). To run pcretest on just one or
+more specific test files, give their numbers as arguments to RunTest, for
+example:

- RunTest 2
-
+ RunTest 2 7 11
+
The first test file can be fed directly into the perltest.pl script to check
that Perl gives the same results. The only difference you should see is in the
first few lines, where the Perl version is given instead of the PCRE version.

-The second set of tests check pcre_fullinfo(), pcre_info(), pcre_study(),
+The second set of tests checks pcre_fullinfo(), pcre_study(),
pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error
detection, and run-time flags that are specific to PCRE, as well as the POSIX
wrapper API. It also uses the debugging flags to check some of the internals of
@@ -612,36 +642,29 @@
Windows versions of test 2. More info on using RunTest.bat is included in the
document entitled NON-UNIX-USE.]

-The fourth test checks the UTF-8 support. This file can be also fed directly to
-the perltest.pl script, provided you are running Perl 5.8 or higher.
+The fourth and fifth tests check the UTF-8/16 support and error handling and
+internal UTF features of PCRE that are not relevant to Perl, respectively. The
+sixth and seventh tests do the same for Unicode character properties support.

-The fifth test checks error handling with UTF-8 encoding, and internal UTF-8
-features of PCRE that are not relevant to Perl.
+The eighth, ninth, and tenth tests check the pcre_dfa_exec() alternative
+matching function, in non-UTF-8/16 mode, UTF-8/16 mode, and UTF-8/16 mode with
+Unicode property support, respectively.

-The sixth test (which is Perl-5.10 compatible) checks the support for Unicode
-character properties. This file can be also fed directly to the perltest.pl
-script, provided you are running Perl 5.10 or higher.
-
-The seventh, eighth, and ninth tests check the pcre_dfa_exec() alternative
-matching function, in non-UTF-8 mode, UTF-8 mode, and UTF-8 mode with Unicode
-property support, respectively.
-
-The tenth test checks some internal offsets and code size features; it is run
-only when the default "link size" of 2 is set (in other cases the sizes
+The eleventh test checks some internal offsets and code size features; it is
+run only when the default "link size" of 2 is set (in other cases the sizes
change) and when Unicode property support is enabled.

-The eleventh and twelfth tests check out features that are new in Perl 5.10,
-without and with UTF-8 support, respectively. This file can be also fed
-directly to the perltest.pl script, provided you are running Perl 5.10 or
-higher.
+The twelfth test is run only when JIT support is available, and the thirteenth
+test is run only when JIT support is not available. They test some JIT-specific
+features such as information output from pcretest about JIT compilation.

-The thirteenth test checks a number internals and non-Perl features concerned
-with Unicode property support.
+The fourteenth, fifteenth, and sixteenth tests are run only in 8-bit mode, and
+the seventeenth, eighteenth, and nineteenth tests are run only in 16-bit mode.
+These are tests that generate different output in the two modes. They are for
+general cases, UTF-8/16 support, and Unicode property support, respectively.

-The fourteenth test is run only when JIT support is available, and the
-fifteenth test is run only when JIT support is not available. They test some
-JIT-specific features such as information output from pcretest about JIT
-compilation.
+The twentieth test is run only in 16-bit mode. It tests some specific 16-bit
+features of the DFA matching engine.


Character tables
@@ -701,7 +724,9 @@
File manifest
-------------

-The distribution should contain the following files:
+The distribution should contain the files listed below. Where a file name is
+given as pcre[16]_xxx it means that there are two files, one with the name
+pcre_xxx and the other with the name pcre16_xxx.

(A) Source files of the PCRE library functions and their headers:

@@ -710,31 +735,36 @@

   pcre_chartables.c.dist  a default set of character tables that assume ASCII
                             coding; used, unless --enable-rebuild-chartables is
-                            specified, by copying to pcre_chartables.c
+                            specified, by copying to pcre[16]_chartables.c


   pcreposix.c             )
-  pcre_byte_order.c       )
-  pcre_compile.c          )
-  pcre_config.c           )
-  pcre_dfa_exec.c         )
-  pcre_exec.c             )
-  pcre_fullinfo.c         )
-  pcre_get.c              ) sources for the functions in the library,
-  pcre_globals.c          )   and some internal functions that they use
-  pcre_info.c             )
-  pcre_jit_compile.c      )
-  pcre_maketables.c       )
-  pcre_newline.c          )
+  pcre[16]_byte_order.c   )
+  pcre[16]_compile.c      )
+  pcre[16]_config.c       )
+  pcre[16]_dfa_exec.c     )
+  pcre[16]_exec.c         )
+  pcre[16]_fullinfo.c     )
+  pcre[16]_get.c          ) sources for the functions in the library,
+  pcre[16]_globals.c      )   and some internal functions that they use
+  pcre[16]_jit_compile.c  )
+  pcre[16]_maketables.c   )
+  pcre[16]_newline.c      )
+  pcre[16]_refcount.c     )
+  pcre[16]_string_utils.c )
+  pcre[16]_study.c        )
+  pcre[16]_tables.c       )
+  pcre[16]_ucd.c          )
+  pcre[16]_version.c      )
+  pcre[16]_xclass.c       )
   pcre_ord2utf8.c         )
-  pcre_refcount.c         )
-  pcre_study.c            )
-  pcre_tables.c           )
-  pcre_ucd.c              )
   pcre_valid_utf8.c       )
-  pcre_version.c          )
-  pcre_xclass.c           )
-  pcre_printint.src       ) debugging function that is #included in pcretest,
+  pcre16_ord2utf16.c      )
+  pcre16_utf16_utils.c    )
+  pcre16_valid_utf16.c    )
+
+  pcre[16]_printint.c     ) debugging function that is used by pcretest,
                           )   and can also be #included in pcre_compile()
+
   pcre.h.in               template for pcre.h when built by "configure"
   pcreposix.h             header for the external POSIX wrapper API
   pcre_internal.h         header for internal use
@@ -796,6 +826,7 @@
   doc/pcretest.txt        plain text documentation of test program
   doc/perltest.txt        plain text documentation of Perl test program
   install-sh              a shell script for installing files
+  libpcre16.pc.in         template for libpcre16.pc for pkg-config
   libpcre.pc.in           template for libpcre.pc for pkg-config
   libpcreposix.pc.in      template for libpcreposix.pc for pkg-config
   libpcrecpp.pc.in        template for libpcrecpp.pc for pkg-config
@@ -812,6 +843,7 @@
   testdata/testinput*     test data for main library tests
   testdata/testoutput*    expected test results
   testdata/grep*          input and output for pcregrep tests
+  testdata/*              other supporting test files


(D) Auxiliary files for cmake support

@@ -842,4 +874,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 06 September 2011
+Last updated: 30 December 2011
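
As a reading aid for the --with-link-size paragraph changed in the README hunk
above, the following is a minimal sketch of one way to interpret the rule it
states: the requested link size is rounded up to a whole number of code units,
so 3 and 4 coincide in the 16-bit library. The function name
effective_link_bytes and its arguments are hypothetical and are not part of
PCRE or its build system; the rounding is an assumption that merely reproduces
the cases the text describes.

```shell
#!/bin/sh
# Hypothetical helper, not part of PCRE: report how many bytes each
# internal offset occupies for a requested --with-link-size value.
#   $1  requested link size (2, 3, or 4)
#   $2  code unit width in bytes (1 = 8-bit library, 2 = 16-bit library)
effective_link_bytes() {
    requested=$1
    unit_bytes=$2
    # Assumption: round the request up to a whole number of code units.
    echo $(( (requested + unit_bytes - 1) / unit_bytes * unit_bytes ))
}

# Show the result for each link size in both libraries.
for size in 2 3 4; do
    b8=$(effective_link_bytes "$size" 1)
    b16=$(effective_link_bytes "$size" 2)
    echo "--with-link-size=$size: 8-bit library $b8 bytes, 16-bit library $b16 bytes"
done
```

Under this reading, the 8-bit library uses two, three, or four bytes as
requested, while the 16-bit library uses four bytes for both 3 and 4, which
matches the wording of the revised paragraph.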