Revision: 1396
http://vcs.pcre.org/viewvc?view=rev&revision=1396
Author: ph10
Date: 2013-11-10 19:04:34 +0000 (Sun, 10 Nov 2013)
Log Message:
-----------
In /x mode, allow white space before a possessive + character.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcreapi.3
code/trunk/doc/pcrecompat.3
code/trunk/doc/pcrepattern.3
code/trunk/pcre_compile.c
code/trunk/testdata/testinput1
code/trunk/testdata/testoutput1
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/ChangeLog 2013-11-10 19:04:34 UTC (rev 1396)
@@ -173,6 +173,10 @@
36. Perl no longer allows group names to start with digits, so I have made this
change also in PCRE. It simplifies the code a bit.
+
+37. In extended mode, Perl ignores spaces before a + that indicates a
+ possessive quantifier. PCRE allowed a space before the quantifier, but not
+ before the possessive +. It now does.
Version 8.33 28-May-2013
Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/doc/pcreapi.3 2013-11-10 19:04:34 UTC (rev 1396)
@@ -651,16 +651,23 @@
.sp
PCRE_EXTENDED
.sp
-If this bit is set, white space data characters in the pattern are totally
-ignored except when escaped or inside a character class. White space did not
-used to include the VT character (code 11), because Perl did not treat this
-character as white space. However, Perl changed at release 5.18, so PCRE
-followed at release 8.34, and VT is now treated as white space. PCRE_EXTENDED
-also causes characters between an unescaped # outside a character class and the
-next newline, inclusive, to be ignored. PCRE_EXTENDED is equivalent to
-Perl's /x option, and it can be changed within a pattern by a (?x) option
-setting.
+If this bit is set, most white space characters in the pattern are totally
+ignored except when escaped or inside a character class. However, white space
+is not allowed within sequences such as (?> that introduce various
+parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
+However, ignorable white space is permitted between an item and a following
+quantifier and between a quantifier and a following + that indicates
+possessiveness.
.P
+White space did not used to include the VT character (code 11), because Perl
+did not treat this character as white space. However, Perl changed at release
+5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
+.P
+PCRE_EXTENDED also causes characters between an unescaped # outside a character
+class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
+equivalent to Perl's /x option, and it can be changed within a pattern by a
+(?x) option setting.
+.P
Which characters are interpreted as newlines is controlled by the options
passed to \fBpcre_compile()\fP or by a special sequence at the start of the
pattern, as described in the section entitled
Modified: code/trunk/doc/pcrecompat.3
===================================================================
--- code/trunk/doc/pcrecompat.3 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/doc/pcrecompat.3 2013-11-10 19:04:34 UTC (rev 1396)
@@ -1,4 +1,4 @@
-.TH PCRECOMPAT 3 "05 November 2013" "PCRE 8.34"
+.TH PCRECOMPAT 3 "10 November 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "DIFFERENCES BETWEEN PCRE AND PERL"
@@ -122,8 +122,8 @@
.P
15. Perl recognizes comments in some places that PCRE does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? but PCRE never does, even if the
-PCRE_EXTENDED option is set.
+Perl allows white space between ( and ? (though current Perls warn that this is
+deprecated) but PCRE never does, even if the PCRE_EXTENDED option is set.
.P
16. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no
@@ -195,6 +195,6 @@
.rs
.sp
.nf
-Last updated: 05 November 2013
+Last updated: 10 November 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/doc/pcrepattern.3 2013-11-10 19:04:34 UTC (rev 1396)
@@ -273,10 +273,11 @@
backslash. All other characters (in particular, those whose codepoints are
greater than 127) are treated as literals.
.P
-If a pattern is compiled with the PCRE_EXTENDED option, white space in the
-pattern (other than in a character class) and characters between a # outside
-a character class and the next newline are ignored. An escaping backslash can
-be used to include a white space or # character as part of the pattern.
+If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
+pattern (other than in a character class), and characters between a # outside a
+character class and the next newline, inclusive, are ignored. An escaping
+backslash can be used to include a white space or # character as part of the
+pattern.
.P
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \eQ and \eE. This is different from Perl in
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/pcre_compile.c 2013-11-10 19:04:34 UTC (rev 1396)
@@ -4446,7 +4446,7 @@
/* Get next character in the pattern */
c = *ptr;
-
+
/* If we are at the end of a nested substitution, revert to the outer level
string. Nesting only happens one level deep. */
@@ -4548,8 +4548,37 @@
}
goto NORMAL_CHAR;
}
+ /* Control does not reach here. */
}
+ /* In extended mode, skip white space and comments. We need a loop in order
+ to check for more white space and more comments after a comment. */
+
+ if ((options & PCRE_EXTENDED) != 0)
+ {
+ for (;;)
+ {
+ while (MAX_255(c) && (cd->ctypes[c] & ctype_space) != 0) c = *(++ptr);
+ if (c != CHAR_NUMBER_SIGN) break;
+ ptr++;
+ while (*ptr != CHAR_NULL)
+ {
+ if (IS_NEWLINE(ptr)) /* For non-fixed-length newline cases, */
+ { /* IS_NEWLINE sets cd->nllen. */
+ ptr += cd->nllen;
+ break;
+ }
+ ptr++;
+#ifdef SUPPORT_UTF
+ if (utf) FORWARDCHAR(ptr);
+#endif
+ }
+ c = *ptr; /* Either NULL or the char after a newline */
+ }
+ }
+
+ /* See if the next thing is a quantifier. */
+
is_quantifier =
c == CHAR_ASTERISK || c == CHAR_PLUS || c == CHAR_QUESTION_MARK ||
(c == CHAR_LEFT_CURLY_BRACKET && is_counted_repeat(ptr+1));
@@ -4565,42 +4594,21 @@
previous_callout = NULL;
}
- /* In extended mode, skip white space and comments. */
+ /* Create auto callout, except for quantifiers, or while processing property
+ strings that are substituted for \w etc in UCP mode. */
- if ((options & PCRE_EXTENDED) != 0)
- {
- if (MAX_255(*ptr) && (cd->ctypes[c] & ctype_space) != 0) continue;
- if (c == CHAR_NUMBER_SIGN)
- {
- ptr++;
- while (*ptr != CHAR_NULL)
- {
- if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
- ptr++;
-#ifdef SUPPORT_UTF
- if (utf) FORWARDCHAR(ptr);
-#endif
- }
- if (*ptr != CHAR_NULL) continue;
-
- /* Else fall through to handle end of string */
- c = 0;
- }
- }
-
- /* No auto callout for quantifiers, or while processing property strings that
- are substituted for \w etc in UCP mode. */
-
if ((options & PCRE_AUTO_CALLOUT) != 0 && !is_quantifier && nestptr == NULL)
{
previous_callout = code;
code = auto_callout(code, ptr, cd);
}
+
+ /* Process the next pattern item. */
switch(c)
{
/* ===================================================================*/
- case 0: /* The branch terminates at string end */
+ case CHAR_NULL: /* The branch terminates at string end */
case CHAR_VERTICAL_LINE: /* or | or ) */
case CHAR_RIGHT_PARENTHESIS:
*firstcharptr = firstchar;
@@ -5445,6 +5453,34 @@
insert something before it. */
tempcode = previous;
+
+ /* Before checking for a possessive quantifier, we must skip over
+ whitespace and comments in extended mode because Perl allows white space at
+ this point. */
+
+ if ((options & PCRE_EXTENDED) != 0)
+ {
+ const pcre_uchar *p = ptr + 1;
+ for (;;)
+ {
+ while (MAX_255(*p) && (cd->ctypes[*p] & ctype_space) != 0) p++;
+ if (*p != CHAR_NUMBER_SIGN) break;
+ p++;
+ while (*p != CHAR_NULL)
+ {
+ if (IS_NEWLINE(p)) /* For non-fixed-length newline cases, */
+ { /* IS_NEWLINE sets cd->nllen. */
+ p += cd->nllen;
+ break;
+ }
+ p++;
+#ifdef SUPPORT_UTF
+ if (utf) FORWARDCHAR(p);
+#endif
+ } /* Loop for comment characters */
+ } /* Loop for multiple comments */
+ ptr = p - 1; /* Character before the next significant one. */
+ }
/* If the next character is '+', we have a possessive quantifier. This
implies greediness, whatever the setting of the PCRE_UNGREEDY option.
@@ -7752,8 +7788,8 @@
/* ===================================================================*/
/* Handle a literal character. It is guaranteed not to be whitespace or #
- when the extended flag is set. If we are in UTF-8 mode, it may be a
- multi-byte literal character. */
+ when the extended flag is set. If we are in a UTF mode, it may be a
+ multi-unit literal character. */
default:
NORMAL_CHAR:
@@ -8899,7 +8935,7 @@
cd->nl[0] = newline;
}
}
-
+
/* Maximum back reference and backref bitmap. The bitmap records up to 31 back
references to help in deciding whether (.*) can be treated as anchored or not.
*/
@@ -8952,6 +8988,7 @@
ptr += skipatstart;
code = cworkspace;
*code = OP_BRA;
+
(void)compile_regex(cd->external_options, &code, &ptr, &errorcode, FALSE,
FALSE, 0, 0, &firstchar, &firstcharflags, &reqchar, &reqcharflags, NULL,
cd, &length);
Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/testdata/testinput1 2013-11-10 19:04:34 UTC (rev 1396)
@@ -2,7 +2,7 @@
Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
and 32-bit PCRE libraries. --/
-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<
/the quick brown fox/
the quick brown fox
@@ -5633,4 +5633,22 @@
/^A\o{123}B/
A\123B
+/ ^ a + + b $ /x
+ aaaab
+
+/ ^ a + #comment
+ + b $ /x
+ aaaab
+
+/ ^ a + #comment
+ #comment
+ + b $ /x
+ aaaab
+
+/ ^ (?> a + ) b $ /x
+ aaaab
+
+/ ^ ( a + ) + + \w $ /x
+ aaaab
+
/-- End of testinput1 --/
Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1 2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/testdata/testoutput1 2013-11-10 19:04:34 UTC (rev 1396)
@@ -2,7 +2,7 @@
Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
and 32-bit PCRE libraries. --/
-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<
/the quick brown fox/
the quick brown fox
@@ -9262,4 +9262,28 @@
A\123B
0: ASB
+/ ^ a + + b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ a + #comment
+ + b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ a + #comment
+ #comment
+ + b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ (?> a + ) b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ ( a + ) + + \w $ /x
+ aaaab
+ 0: aaaab
+ 1: aaaa
+
/-- End of testinput1 --/
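
Illustrative sketch (not part of the commit): the skipping loop that this
revision adds in two places in pcre_compile.c can be approximated by the
standalone C below. It is a simplified sketch under stated assumptions: the
helper names skip_extended() and is_possessive_after() are hypothetical, only
"\n" is treated as a newline (PCRE uses IS_NEWLINE and cd->nllen), isspace()
stands in for the ctype_space table lookup, and UTF handling is omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Skips ignorable white space and
   #-comments as extended (/x) mode does. The outer loop is needed because a
   comment may be followed by more white space and more comments. */
static const char *skip_extended(const char *p)
{
  assert(p != NULL);
  for (;;)
    {
    /* Skip a run of white space (assumption: isspace() approximates the
       set of characters PCRE treats as ignorable white space). */
    while (*p != '\0' && isspace((unsigned char)*p)) p++;

    /* Anything other than '#' is the next significant character. */
    if (*p != '#') break;

    /* Skip the comment up to and including the newline, or stop at the
       end of the pattern if the comment is unterminated. */
    p++;
    while (*p != '\0')
      {
      if (*p == '\n') { p++; break; }
      p++;
      }
    }
  return p;  /* Either the terminating zero or a significant character */
}

/* Hypothetical helper: given a pointer just past a quantifier, report whether
   the next significant character is a '+' marking it as possessive. */
static int is_possessive_after(const char *after_quantifier)
{
  return *skip_extended(after_quantifier) == '+';
}
```

Calling is_possessive_after() on the text that follows the first + in the new
test patterns (for example " + b $ ", or " #comment\n + b $ ") is intended to
mirror the check the compiler now makes before deciding that a quantifier is
possessive.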