[Pcre-svn] [1396] code/trunk: In /x mode, allow white space…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1396] code/trunk: In /x mode, allow white space before a possessive + character.
Revision: 1396
          http://vcs.pcre.org/viewvc?view=rev&revision=1396
Author:   ph10
Date:     2013-11-10 19:04:34 +0000 (Sun, 10 Nov 2013)


Log Message:
-----------
In /x mode, allow white space before a possessive + character.
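
As an illustration of the behaviour described above, here is a minimal standalone sketch of the skipping logic. It is not the code from pcre_compile.c: the helper names skip_extended() and possessive_follows() are hypothetical, and the sketch assumes single-byte characters, "\n" as the only newline, and the C library's isspace() as the definition of white space.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: return a pointer to the first
   significant character at or after p, skipping white space and
   #-to-newline comments the way extended (/x) mode does. The outer loop
   is needed because more white space and more comments may follow a
   comment. Assumes single-byte characters and "\n" as the only newline. */
static const char *skip_extended(const char *p)
{
  assert(p != NULL);
  for (;;)
    {
    while (*p != '\0' && isspace((unsigned char)*p)) p++;
    if (*p != '#') break;          /* Not a comment: significant char or end */
    p++;                           /* Step over the '#' */
    while (*p != '\0')             /* Skip to just past the next newline */
      {
      if (*p == '\n') { p++; break; }
      p++;
      }
    }
  return p;
}

/* Hypothetical helper: given a pointer just past a quantifier, report
   whether the next significant character is a possessive '+'. */
static int possessive_follows(const char *after_quantifier)
{
  return *skip_extended(after_quantifier) == '+';
}
```

With these helpers, possessive_follows(" #comment\n  + b") is non-zero, which mirrors the new test patterns such as / ^ a + #comment followed by + b $ /x.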

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrecompat.3
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/testdata/testinput1
    code/trunk/testdata/testoutput1


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/ChangeLog    2013-11-10 19:04:34 UTC (rev 1396)
@@ -173,6 +173,10 @@


 36. Perl no longer allows group names to start with digits, so I have made this
     change also in PCRE. It simplifies the code a bit. 
+    
+37. In extended mode, Perl ignores spaces before a + that indicates a 
+    possessive quantifier. PCRE allowed a space before the quantifier, but not 
+    before the possessive +. It now does.



Version 8.33 28-May-2013

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/doc/pcreapi.3    2013-11-10 19:04:34 UTC (rev 1396)
@@ -651,16 +651,23 @@
 .sp
   PCRE_EXTENDED
 .sp
-If this bit is set, white space data characters in the pattern are totally
-ignored except when escaped or inside a character class. White space did not
-used to include the VT character (code 11), because Perl did not treat this 
-character as white space. However, Perl changed at release 5.18, so PCRE
-followed at release 8.34, and VT is now treated as white space. PCRE_EXTENDED
-also causes characters between an unescaped # outside a character class and the
-next newline, inclusive, to be ignored. PCRE_EXTENDED is equivalent to
-Perl's /x option, and it can be changed within a pattern by a (?x) option
-setting.
+If this bit is set, most white space characters in the pattern are totally
+ignored except when escaped or inside a character class. However, white space
+is not allowed within sequences such as (?> that introduce various
+parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
+However, ignorable white space is permitted between an item and a following
+quantifier and between a quantifier and a following + that indicates 
+possessiveness.
 .P
+White space did not used to include the VT character (code 11), because Perl
+did not treat this character as white space. However, Perl changed at release
+5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
+.P
+PCRE_EXTENDED also causes characters between an unescaped # outside a character
+class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
+equivalent to Perl's /x option, and it can be changed within a pattern by a
+(?x) option setting.
+.P
 Which characters are interpreted as newlines is controlled by the options
 passed to \fBpcre_compile()\fP or by a special sequence at the start of the
 pattern, as described in the section entitled


Modified: code/trunk/doc/pcrecompat.3
===================================================================
--- code/trunk/doc/pcrecompat.3    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/doc/pcrecompat.3    2013-11-10 19:04:34 UTC (rev 1396)
@@ -1,4 +1,4 @@
-.TH PCRECOMPAT 3 "05 November 2013" "PCRE 8.34"
+.TH PCRECOMPAT 3 "10 November 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "DIFFERENCES BETWEEN PCRE AND PERL"
@@ -122,8 +122,8 @@
 .P
 15. Perl recognizes comments in some places that PCRE does not, for example,
 between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? but PCRE never does, even if the
-PCRE_EXTENDED option is set.
+Perl allows white space between ( and ? (though current Perls warn that this is 
+deprecated) but PCRE never does, even if the PCRE_EXTENDED option is set.
 .P
 16. Perl, when in warning mode, gives warnings for character classes such as
 [A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no 
@@ -195,6 +195,6 @@
 .rs
 .sp
 .nf
-Last updated: 05 November 2013
+Last updated: 10 November 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/doc/pcrepattern.3    2013-11-10 19:04:34 UTC (rev 1396)
@@ -273,10 +273,11 @@
 backslash. All other characters (in particular, those whose codepoints are
 greater than 127) are treated as literals.
 .P
-If a pattern is compiled with the PCRE_EXTENDED option, white space in the
-pattern (other than in a character class) and characters between a # outside
-a character class and the next newline are ignored. An escaping backslash can
-be used to include a white space or # character as part of the pattern.
+If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
+pattern (other than in a character class), and characters between a # outside a
+character class and the next newline, inclusive, are ignored. An escaping
+backslash can be used to include a white space or # character as part of the
+pattern.
 .P
 If you want to remove the special meaning from a sequence of characters, you
 can do so by putting them between \eQ and \eE. This is different from Perl in


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/pcre_compile.c    2013-11-10 19:04:34 UTC (rev 1396)
@@ -4446,7 +4446,7 @@
   /* Get next character in the pattern */


   c = *ptr;
-
+
   /* If we are at the end of a nested substitution, revert to the outer level
   string. Nesting only happens one level deep. */

@@ -4548,8 +4548,37 @@
         }
       goto NORMAL_CHAR;
       }
+    /* Control does not reach here. */   
     }


+  /* In extended mode, skip white space and comments. We need a loop in order 
+  to check for more white space and more comments after a comment. */
+  
+  if ((options & PCRE_EXTENDED) != 0)
+    {
+    for (;;)
+      {
+      while (MAX_255(c) && (cd->ctypes[c] & ctype_space) != 0) c = *(++ptr);
+      if (c != CHAR_NUMBER_SIGN) break;
+      ptr++;
+      while (*ptr != CHAR_NULL)
+        {
+        if (IS_NEWLINE(ptr))         /* For non-fixed-length newline cases, */
+          {                          /* IS_NEWLINE sets cd->nllen. */         
+          ptr += cd->nllen;            
+          break;
+          }
+        ptr++;
+#ifdef SUPPORT_UTF
+        if (utf) FORWARDCHAR(ptr);
+#endif
+        }
+      c = *ptr;     /* Either NULL or the char after a newline */
+      }   
+    }
+
+  /* See if the next thing is a quantifier. */
+
   is_quantifier =
     c == CHAR_ASTERISK || c == CHAR_PLUS || c == CHAR_QUESTION_MARK ||
     (c == CHAR_LEFT_CURLY_BRACKET && is_counted_repeat(ptr+1));
@@ -4565,42 +4594,21 @@
     previous_callout = NULL;
     }


-  /* In extended mode, skip white space and comments. */
+  /* Create auto callout, except for quantifiers, or while processing property
+  strings that are substituted for \w etc in UCP mode. */

-  if ((options & PCRE_EXTENDED) != 0)
-    {
-    if (MAX_255(*ptr) && (cd->ctypes[c] & ctype_space) != 0) continue;
-    if (c == CHAR_NUMBER_SIGN)
-      {
-      ptr++;
-      while (*ptr != CHAR_NULL)
-        {
-        if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
-        ptr++;
-#ifdef SUPPORT_UTF
-        if (utf) FORWARDCHAR(ptr);
-#endif
-        }
-      if (*ptr != CHAR_NULL) continue;
-
-      /* Else fall through to handle end of string */
-      c = 0;
-      }
-    }
-
-  /* No auto callout for quantifiers, or while processing property strings that
-  are substituted for \w etc in UCP mode. */
-
   if ((options & PCRE_AUTO_CALLOUT) != 0 && !is_quantifier && nestptr == NULL)
     {
     previous_callout = code;
     code = auto_callout(code, ptr, cd);
     }
+    
+  /* Process the next pattern item. */


   switch(c)
     {
     /* ===================================================================*/
-    case 0:                        /* The branch terminates at string end */
+    case CHAR_NULL:                /* The branch terminates at string end */
     case CHAR_VERTICAL_LINE:       /* or | or ) */
     case CHAR_RIGHT_PARENTHESIS:
     *firstcharptr = firstchar;
@@ -5445,6 +5453,34 @@
     insert something before it. */


     tempcode = previous;
+    
+    /* Before checking for a possessive quantifier, we must skip over 
+    whitespace and comments in extended mode because Perl allows white space at 
+    this point. */
+ 
+    if ((options & PCRE_EXTENDED) != 0)
+      {
+      const pcre_uchar *p = ptr + 1;
+      for (;;)
+        {
+        while (MAX_255(*p) && (cd->ctypes[*p] & ctype_space) != 0) p++;
+        if (*p != CHAR_NUMBER_SIGN) break;
+        p++;
+        while (*p != CHAR_NULL)
+          {
+          if (IS_NEWLINE(p))         /* For non-fixed-length newline cases, */
+            {                        /* IS_NEWLINE sets cd->nllen. */         
+            p += cd->nllen;            
+            break;
+            }
+          p++;
+#ifdef SUPPORT_UTF
+          if (utf) FORWARDCHAR(p);
+#endif
+          }           /* Loop for comment characters */
+        }             /* Loop for multiple comments */
+      ptr = p - 1;    /* Character before the next significant one. */
+      }


     /* If the next character is '+', we have a possessive quantifier. This
     implies greediness, whatever the setting of the PCRE_UNGREEDY option.
@@ -7752,8 +7788,8 @@


     /* ===================================================================*/
     /* Handle a literal character. It is guaranteed not to be whitespace or #
-    when the extended flag is set. If we are in UTF-8 mode, it may be a
-    multi-byte literal character. */
+    when the extended flag is set. If we are in a UTF mode, it may be a
+    multi-unit literal character. */


     default:
     NORMAL_CHAR:
@@ -8899,7 +8935,7 @@
     cd->nl[0] = newline;
     }
   }
-
+  
 /* Maximum back reference and backref bitmap. The bitmap records up to 31 back
 references to help in deciding whether (.*) can be treated as anchored or not.
 */
@@ -8952,6 +8988,7 @@
 ptr += skipatstart;
 code = cworkspace;
 *code = OP_BRA;
+
 (void)compile_regex(cd->external_options, &code, &ptr, &errorcode, FALSE,
   FALSE, 0, 0, &firstchar, &firstcharflags, &reqchar, &reqcharflags, NULL,
   cd, &length);


Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/testdata/testinput1    2013-11-10 19:04:34 UTC (rev 1396)
@@ -2,7 +2,7 @@
     Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
     and 32-bit PCRE libraries. --/


-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<

 /the quick brown fox/
     the quick brown fox
@@ -5633,4 +5633,22 @@
 /^A\o{123}B/
     A\123B


+/ ^ a + + b $ /x
+    aaaab
+    
+/ ^ a + #comment
+  + b $ /x
+    aaaab
+    
+/ ^ a + #comment
+  #comment
+  + b $ /x
+    aaaab
+    
+/ ^ (?> a + ) b $ /x
+    aaaab 
+
+/ ^ ( a + ) + + \w $ /x
+    aaaab 
+
 /-- End of testinput1 --/


Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1    2013-11-09 16:54:52 UTC (rev 1395)
+++ code/trunk/testdata/testoutput1    2013-11-10 19:04:34 UTC (rev 1396)
@@ -2,7 +2,7 @@
     Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
     and 32-bit PCRE libraries. --/


-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<

 /the quick brown fox/
     the quick brown fox
@@ -9262,4 +9262,28 @@
     A\123B
  0: ASB


+/ ^ a + + b $ /x
+    aaaab
+ 0: aaaab
+    
+/ ^ a + #comment
+  + b $ /x
+    aaaab
+ 0: aaaab
+    
+/ ^ a + #comment
+  #comment
+  + b $ /x
+    aaaab
+ 0: aaaab
+    
+/ ^ (?> a + ) b $ /x
+    aaaab 
+ 0: aaaab
+
+/ ^ ( a + ) + + \w $ /x
+    aaaab 
+ 0: aaaab
+ 1: aaaa
+
 /-- End of testinput1 --/