[Pcre-svn] [580] code/trunk/HACKING: Fix error in documentat…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [580] code/trunk/HACKING: Fix error in documentation.
Revision: 580
          http://www.exim.org/viewvc/pcre2?view=rev&revision=580
Author:   ph10
Date:     2016-10-28 17:08:44 +0100 (Fri, 28 Oct 2016)
Log Message:
-----------
Fix error in documentation.


Modified Paths:
--------------
    code/trunk/HACKING


Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING    2016-10-27 17:42:14 UTC (rev 579)
+++ code/trunk/HACKING    2016-10-28 16:08:44 UTC (rev 580)
@@ -7,7 +7,7 @@
 library is referred to as PCRE1 below. For information about testing PCRE2, see
 the pcre2test documentation and the comment at the head of the RunTest file.


-PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
+PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.

@@ -124,39 +124,39 @@
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated.

-While working on 10.22 I realized that I could simplify yet again by moving
+While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
after 10.22 was released, the code underwent yet another big refactoring. This
is how it is from 10.23 onwards:

-The function called parse_regex() scans the pattern characters, parsing them
-into literal data and meta characters. It converts escapes such as \x{123}
-into literals, handles \Q...\E, and skips over comments and non-significant
-white space. The result of the scanning is put into a vector of 32-bit unsigned
-integers. Values less than 0x80000000 are literal data. Higher values represent
+The function called parse_regex() scans the pattern characters, parsing them
+into literal data and meta characters. It converts escapes such as \x{123}
+into literals, handles \Q...\E, and skips over comments and non-significant
+white space. The result of the scanning is put into a vector of 32-bit unsigned
+integers. Values less than 0x80000000 are literal data. Higher values represent
meta-characters. The top 16-bits of such values identify the meta-character,
and these are given names such as META_CAPTURE. The lower 16-bits are available
-for data, for example, the capturing group number. The only situation in which
-literal data values greater than 0x7fffffff can appear is when the 32-bit
-library is running in non-UTF mode. This is handled by having a special
+for data, for example, the capturing group number. The only situation in which
+literal data values greater than 0x7fffffff can appear is when the 32-bit
+library is running in non-UTF mode. This is handled by having a special
meta-character that is followed by the 32-bit data value.

The size of the parsed pattern vector, when auto-callouts are not enabled, is
-bounded by the length of the pattern (with one exception). The code is written
-so that each item in the pattern uses no more vector elements than the number
+bounded by the length of the pattern (with one exception). The code is written
+so that each item in the pattern uses no more vector elements than the number
of code units in the item itself. The exception is the aforementioned large
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
-(which of course is only actually true if all code units are literals) plus one
+(which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the stack, but if this
is not big enough, heap memory is used.

-As before, the actual compiling function is run twice, the first time to
-determine the amount of memory needed for the final compiled pattern. It
+As before, the actual compiling function is run twice, the first time to
+determine the amount of memory needed for the final compiled pattern. It
now processes the parsed pattern vector, not the pattern itself, although some
of the parsed items refer to strings in the pattern - for example, group
-names. As escapes and comments have already been processed, the code is a bit
+names. As escapes and comments have already been processed, the code is a bit
simpler than before.

Most errors can be diagnosed during the parsing scan. For those that cannot
@@ -168,7 +168,7 @@
The elements of the parsed pattern vector
-----------------------------------------

-The word "offset" below means a code unit offset into the pattern. When
+The word "offset" below means a code unit offset into the pattern. When
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
stored in a single parsed pattern element. Otherwise (typically on 64-bit
systems) it occupies two elements. The following meta items occupy just one
@@ -175,57 +175,60 @@
element, with no data:

 META_ACCEPT           (*ACCEPT)
-META_ALT              | alternation 
-META_ASTERISK         *  
-META_ASTERISK_PLUS    *+ 
-META_ASTERISK_QUERY   *? 
-META_ATOMIC           (?> start of atomic group 
-META_CIRCUMFLEX       ^ metacharacter 
-META_CLASS            [ start of non-empty class 
-META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS 
+META_ASTERISK         *
+META_ASTERISK_PLUS    *+
+META_ASTERISK_QUERY   *?
+META_ATOMIC           (?> start of atomic group
+META_CIRCUMFLEX       ^ metacharacter
+META_CLASS            [ start of non-empty class
+META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
 META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
-META_CLASS_END        ] end of non-empty class 
-META_CLASS_NOT        [^ start non-empty negative class 
+META_CLASS_END        ] end of non-empty class
+META_CLASS_NOT        [^ start non-empty negative class
 META_COMMIT           (*COMMIT)
-META_DOLLAR           $ metacharacter 
-META_DOT              . metacharacter 
+META_DOLLAR           $ metacharacter
+META_DOT              . metacharacter
 META_END              End of pattern (this value is 0x80000000)
 META_FAIL             (*FAIL)
-META_KET              ) closing parenthesis 
+META_KET              ) closing parenthesis
 META_LOOKAHEAD        (?= start of lookahead
 META_LOOKAHEADNOT     (?! start of negative lookahead
-META_NOCAPTURE        (?: no capture parens 
-META_PLUS             +  
-META_PLUS_PLUS        ++ 
-META_PLUS_QUERY       +? 
+META_NOCAPTURE        (?: no capture parens
+META_PLUS             +
+META_PLUS_PLUS        ++
+META_PLUS_QUERY       +?
 META_PRUNE            (*PRUNE) - no argument
-META_QUERY            ?  
-META_QUERY_PLUS       ?+ 
-META_QUERY_QUERY      ?? 
-META_RANGE_ESCAPED    hyphen in class range with at least one escape 
-META_RANGE_LITERAL    hyphen in class range defined literally 
+META_QUERY            ?
+META_QUERY_PLUS       ?+
+META_QUERY_QUERY      ??
+META_RANGE_ESCAPED    hyphen in class range with at least one escape
+META_RANGE_LITERAL    hyphen in class range defined literally
 META_SKIP             (*SKIP) - no argument
 META_THEN             (*THEN) - no argument


-The two RANGE values occur only in character classes. They are positioned
-between two literals that define the start and end of the range. In an EBCDIC
-evironment it is necessary to know whether either of the range values was
-specified as an escape. In an ASCII/Unicode environment the distinction is not
+The two RANGE values occur only in character classes. They are positioned
+between two literals that define the start and end of the range. In an EBCDIC
+evironment it is necessary to know whether either of the range values was
+specified as an escape. In an ASCII/Unicode environment the distinction is not
relevant.

-The following have data in the lower 16 bits, and may be followed by other data
+The following have data in the lower 16 bits, and may be followed by other data
elements:

+META_ALT              | alternation
 META_BACKREF
 META_CAPTURE
 META_ESCAPE
 META_RECURSE


+If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
+is the length of its branch, for which OP_REVERSE must be generated.
+
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element.

META_BACKREF is followed by an offset if the back reference group number is 10
-or more. The offsets of the first ocurrences of references to groups whose
+or more. The offsets of the first ocurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error is
@@ -241,7 +244,7 @@

The following have one data item that follows in the next vector element:

-META_BIGVALUE         Next is a literal >= META_END 
+META_BIGVALUE         Next is a literal >= META_END
 META_OPTIONS          (?i) and friends (data is new option bits)
 META_POSIX            POSIX class item (data identifies the class)
 META_POSIX_NEG        negative POSIX class item (ditto)
@@ -249,19 +252,19 @@
 The following are followed by a length element, then a number of character code
 values (which should match with the length):


-META_MARK             (*MARK:xxxx) 
+META_MARK             (*MARK:xxxx)
 META_PRUNE_ARG        (*PRUNE:xxx)
 META_SKIP_ARG         (*SKIP:xxxx)
 META_THEN_ARG         (*THEN:xxxx)


-The following are followed by a length element, then an offset in the pattern
+The following are followed by a length element, then an offset in the pattern
that identifies the name:

-META_COND_NAME        (?(<name>) or (?('name') or (?(name) 
-META_COND_RNAME       (?(R&name) 
+META_COND_NAME        (?(<name>) or (?('name') or (?(name)
+META_COND_RNAME       (?(R&name)
 META_COND_RNUMBER     (?(Rdigits)
-META_RECURSE_BYNAME   (?&name) 
-META_BACKREF_BYNAME   \k'name' 
+META_RECURSE_BYNAME   (?&name)
+META_BACKREF_BYNAME   \k'name'


META_COND_RNUMBER is used for names that start with R and continue with digits,
because this is an ambiguous case. It could be a back reference to a group with
@@ -269,26 +272,31 @@

This one is followed by an offset, for use in error messages, then a number:

-META_COND_NUMBER       (?([+-]digits) 
+META_COND_NUMBER       (?([+-]digits)


The following are followed just by an offset, for use in error messages:

 META_COND_ASSERT      (?(?assertion)
 META_COND_DEFINE      (?(DEFINE)
-META_LOOKBEHIND       (?<= 
+
+In fact, META_COND_ASSERT is used for any group starting (?( that does not
+match any of the other META_COND cases. The check that this group is an
+assertion (optionally preceded by a callout) happens at compile time.
+
+The following are also followed just by an offset, but also the lower 16 bits
+of the main word contain the length of the first branch of the lookbehind
+group; this is used when generating OP_REVERSE for that branch.
+
+META_LOOKBEHIND       (?<=
 META_LOOKBEHINDNOT    (?<!


-In fact, META_COND_ASSERT is used for any group starting (?( that does not
-match any of the other META_COND cases. The check that this group is an
-assertion (optionally preceded by a callout) happens at compile time.
-
The following are followed by two values, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:

-META_MINMAX           {n,m}  repeat 
-META_MINMAX_PLUS      {n,m}+ repeat 
-META_MINMAX_QUERY     {n,m}? repeat 
+META_MINMAX           {n,m}  repeat
+META_MINMAX_PLUS      {n,m}+ repeat
+META_MINMAX_QUERY     {n,m}? repeat


This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
the next two are the major and minor numbers:
@@ -297,11 +305,11 @@

Callouts are converted into one of two items:

-META_CALLOUT_NUMBER (?C with numerical argument
-META_CALLOUT_STRING (?C with string argument
+META_CALLOUT_NUMBER (?C with numerical argument
+META_CALLOUT_STRING (?C with string argument

-In both cases, the next two elements contain the offset and length of the next
-item in the pattern. Then there is either one callout number, or a length and
+In both cases, the next two elements contain the offset and length of the next
+item in the pattern. Then there is either one callout number, or a length and
an offset for the string argument. The length includes both delimiters.


@@ -410,11 +418,11 @@
   OP_THEN                )


OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
-This ends the assertion, not the entire pattern match. The assertion (?!) is
+This ends the assertion, not the entire pattern match. The assertion (?!) is
always optimized to OP_FAIL.

OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
-non-UTF modes and in UTF-32 mode (since one code unit still equals one
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
character). Another use is for [^] when empty classes are permitted
(PCRE2_ALLOW_EMPTY_CLASS is set).

@@ -735,8 +743,8 @@
callout at this point. Only assertion conditions may have callouts preceding
the condition.

-A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
-parts of the pattern, so this is another opcode that may appear as a condition.
+A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
+parts of the pattern, so this is another opcode that may appear as a condition.
It is treated the same as OP_FALSE.
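
The parsed pattern element layout described in the HACKING text above (literal
values below 0x80000000, a meta code in the top 16 bits, data in the lower 16
bits, and offsets taking one or two elements) can be illustrated with a small C
sketch. This is not the PCRE2 source. Every name beginning with sketch_ or
SKETCH_ is hypothetical, the value used for SKETCH_META_CAPTURE is made up, and
storing the high half of a two-element offset first is an assumption. Only the
0x80000000 boundary and the 16/16 bit split come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* From the HACKING text: META_END is 0x80000000, and every element value
   below it is literal data. */
#define SKETCH_META_END 0x80000000u

/* Made-up value for illustration only. The real META_CAPTURE constant lives
   in the PCRE2 source and is not given in the text. */
#define SKETCH_META_CAPTURE 0x80010000u

/* An element is a meta item when it is not below the literal boundary. */
static int sketch_is_meta(uint32_t element)
{
return element >= SKETCH_META_END;
}

/* The top 16 bits identify the meta-character. */
static uint32_t sketch_meta_code(uint32_t element)
{
return element & 0xffff0000u;
}

/* The lower 16 bits carry data such as a capture group number. */
static uint32_t sketch_meta_data(uint32_t element)
{
return element & 0x0000ffffu;
}

/* Store a pattern offset in one element when it fits in 32 bits, otherwise
   in two. Returns the number of elements used. The high-half-first order is
   an assumption of this sketch. */
static size_t sketch_put_offset(uint32_t *vector, uint64_t offset,
  int fits_in_one_element)
{
assert(vector != NULL);
if (fits_in_one_element)
  {
  vector[0] = (uint32_t)offset;
  return 1;
  }
vector[0] = (uint32_t)(offset >> 32);
vector[1] = (uint32_t)(offset & 0xffffffffu);
return 2;
}

/* Read back an offset written by sketch_put_offset(). */
static uint64_t sketch_get_offset(const uint32_t *vector,
  int fits_in_one_element)
{
assert(vector != NULL);
if (fits_in_one_element) return vector[0];
return ((uint64_t)vector[0] << 32) | vector[1];
}
```

With these helpers, an element built as SKETCH_META_CAPTURE | 3 is recognised
as a meta item whose data field is 3, while 0x7fffffff is still treated as a
literal. A two-element offset is what would typically be needed on a 64-bit
system, where PCRE2_SIZE is wider than uint32_t.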