[Pcre-svn] [336] code/trunk: Added PCRE_JAVASCRIPT_COMPAT op…

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [336] code/trunk: Added PCRE_JAVASCRIPT_COMPAT option.
Revision: 336
          http://vcs.pcre.org/viewvc?view=rev&revision=336
Author:   ph10
Date:     2008-04-12 16:59:03 +0100 (Sat, 12 Apr 2008)


Log Message:
-----------
Added PCRE_JAVASCRIPT_COMPAT option.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre.3
    code/trunk/doc/pcre_compile.3
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcretest.1
    code/trunk/pcre_compile.c
    code/trunk/pcre_exec.c
    code/trunk/pcre_internal.h
    code/trunk/pcreposix.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/ChangeLog    2008-04-12 15:59:03 UTC (rev 336)
@@ -57,6 +57,11 @@
     (an internal error was given). Such groups are now left in the compiled 
     pattern, with a new opcode that causes them to be skipped at execution 
     time.
+    
+13. Added the PCRE_JAVASCRIPT_COMPAT option. This currently does two things:
+    (a) A lone ] character is dis-allowed (Perl treats it as data).
+    (b) A back reference to an unmatched subpattern matches an empty string 
+        (Perl fails the current match path).



Version 7.6 28-Jan-08

Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/doc/pcre.3    2008-04-12 15:59:03 UTC (rev 336)
@@ -6,8 +6,10 @@
 .sp
 The PCRE library is a set of functions that implement regular expression
 pattern matching using the same syntax and semantics as Perl, with just a few
-differences. (Certain features that appeared in Python and PCRE before they
-appeared in Perl are also available using the Python syntax.)
+differences. Certain features that appeared in Python and PCRE before they
+appeared in Perl are also available using the Python syntax. There is also some 
+support for certain .NET and Oniguruma syntax items, and there is an option for 
+requesting some minor changes that give better JavaScript compatibility.
 .P
 The current implementation of PCRE (release 7.x) corresponds approximately with
 Perl 5.10, including support for UTF-8 encoded strings and Unicode general
@@ -287,6 +289,6 @@
 .rs
 .sp
 .nf
-Last updated: 09 August 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 12 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre_compile.3
===================================================================
--- code/trunk/doc/pcre_compile.3    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/doc/pcre_compile.3    2008-04-12 15:59:03 UTC (rev 336)
@@ -30,31 +30,33 @@
 .sp
 The option bits are:
 .sp
-  PCRE_ANCHORED         Force pattern anchoring
-  PCRE_AUTO_CALLOUT     Compile automatic callouts
-  PCRE_BSR_ANYCRLF      \eR matches only CR, LF, or CRLF
-  PCRE_BSR_UNICODE      \eR matches all Unicode line endings
-  PCRE_CASELESS         Do caseless matching
-  PCRE_DOLLAR_ENDONLY   $ not to match newline at end
-  PCRE_DOTALL           . matches anything including NL
-  PCRE_DUPNAMES         Allow duplicate names for subpatterns
-  PCRE_EXTENDED         Ignore whitespace and # comments
-  PCRE_EXTRA            PCRE extra features
-                          (not much use currently)
-  PCRE_FIRSTLINE        Force matching to be before newline
-  PCRE_MULTILINE        ^ and $ match newlines within data
-  PCRE_NEWLINE_ANY      Recognize any Unicode newline sequence
-  PCRE_NEWLINE_ANYCRLF  Recognize CR, LF, and CRLF as newline sequences
-  PCRE_NEWLINE_CR       Set CR as the newline sequence
-  PCRE_NEWLINE_CRLF     Set CRLF as the newline sequence
-  PCRE_NEWLINE_LF       Set LF as the newline sequence
-  PCRE_NO_AUTO_CAPTURE  Disable numbered capturing paren-
-                          theses (named ones available)
-  PCRE_UNGREEDY         Invert greediness of quantifiers
-  PCRE_UTF8             Run in UTF-8 mode
-  PCRE_NO_UTF8_CHECK    Do not check the pattern for UTF-8
-                          validity (only relevant if
-                          PCRE_UTF8 is set)
+  PCRE_ANCHORED           Force pattern anchoring
+  PCRE_AUTO_CALLOUT       Compile automatic callouts
+  PCRE_BSR_ANYCRLF        \eR matches only CR, LF, or CRLF
+  PCRE_BSR_UNICODE        \eR matches all Unicode line endings
+  PCRE_CASELESS           Do caseless matching
+  PCRE_DOLLAR_ENDONLY     $ not to match newline at end
+  PCRE_DOTALL             . matches anything including NL
+  PCRE_DUPNAMES           Allow duplicate names for subpatterns
+  PCRE_EXTENDED           Ignore whitespace and # comments
+  PCRE_EXTRA              PCRE extra features
+                            (not much use currently)
+  PCRE_FIRSTLINE          Force matching to be before newline
+  PCRE_JAVASCRIPT_COMPAT  JavaScript compatibility
+  PCRE_MULTILINE          ^ and $ match newlines within data
+  PCRE_NEWLINE_ANY        Recognize any Unicode newline sequence
+  PCRE_NEWLINE_ANYCRLF    Recognize CR, LF, and CRLF as newline 
+                            sequences
+  PCRE_NEWLINE_CR         Set CR as the newline sequence
+  PCRE_NEWLINE_CRLF       Set CRLF as the newline sequence
+  PCRE_NEWLINE_LF         Set LF as the newline sequence
+  PCRE_NO_AUTO_CAPTURE    Disable numbered capturing paren-
+                            theses (named ones available)
+  PCRE_UNGREEDY           Invert greediness of quantifiers
+  PCRE_UTF8               Run in UTF-8 mode
+  PCRE_NO_UTF8_CHECK      Do not check the pattern for UTF-8
+                            validity (only relevant if
+                            PCRE_UTF8 is set)
 .sp
 PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
 PCRE_NO_UTF8_CHECK.


Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/doc/pcreapi.3    2008-04-12 15:59:03 UTC (rev 336)
@@ -549,6 +549,20 @@
 the first newline in the subject string, though the matched text may continue
 over the newline.
 .sp
+  PCRE_JAVASCRIPT_COMPAT
+.sp
+If this option is set, PCRE's behaviour is changed in some ways so that it is 
+compatible with JavaScript rather than Perl. The changes are as follows:
+.P
+(1) A lone closing square bracket in a pattern causes a compile-time error,
+because this is illegal in JavaScript (by default it is treated as a data
+character). Thus, the pattern AB]CD becomes illegal when this option is set.
+.P
+(2) At run time, a back reference to an unset subpattern group matches an empty
+string (by default this causes the current matching path to fail). A pattern 
+such as (\1)(a) succeeds when this option is set (assuming it can find an "a" 
+in the subject), whereas it fails by default, for Perl compatibility.
+.sp
   PCRE_MULTILINE
 .sp
 By default, PCRE treats the subject string as consisting of a single line of
@@ -717,14 +731,15 @@
   54  DEFINE group contains more than one branch
   55  repeating a DEFINE group is not allowed
   56  inconsistent NEWLINE options
-  57  \eg is not followed by a braced name or an optionally braced
-        non-zero number
-  58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
+  57  \eg is not followed by a braced, angle-bracketed, or quoted 
+        name/number or by a plain number 
+  58  a numbered reference must not be zero
   59  (*VERB) with an argument is not supported
   60  (*VERB) not recognized
   61  number is too big
   62  subpattern name expected
   63  digit expected after (?+
+  64  ] is an invalid data character in JavaScript compatibility mode
 .sp
 The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
 be used if the limits were changed when PCRE was built.
@@ -1960,6 +1975,6 @@
 .rs
 .sp
 .nf
-Last updated: 23 January 2008
+Last updated: 12 April 2008
 Copyright (c) 1997-2008 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/doc/pcretest.1    2008-04-12 15:59:03 UTC (rev 336)
@@ -171,6 +171,7 @@
   \fB/N\fP              PCRE_NO_AUTO_CAPTURE
   \fB/U\fP              PCRE_UNGREEDY
   \fB/X\fP              PCRE_EXTRA
+  \fB/<JS>\fP           PCRE_JAVASCRIPT_COMPAT 
   \fB/<cr>\fP           PCRE_NEWLINE_CR
   \fB/<lf>\fP           PCRE_NEWLINE_LF
   \fB/<crlf>\fP         PCRE_NEWLINE_CRLF
@@ -717,6 +718,6 @@
 .rs
 .sp
 .nf
-Last updated: 18 December 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 12 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/pcre_compile.c    2008-04-12 15:59:03 UTC (rev 336)
@@ -302,7 +302,8 @@
   "(*VERB) not recognized\0"
   "number is too big\0"
   "subpattern name expected\0"
-  "digit expected after (?+";
+  "digit expected after (?+\0"
+  "] is an invalid data character in JavaScript compatibility mode"; 



 /* Table to identify digits and hex digits. This is used when compiling
@@ -2654,7 +2655,17 @@
     opcode is compiled. It may optionally have a bit map for characters < 256,
     but those above are are explicitly listed afterwards. A flag byte tells
     whether the bitmap is present, and whether this is a negated class or not.
-    */
+    
+    In JavaScript compatibility mode, an isolated ']' causes an error. In
+    default (Perl) mode, it is treated as a data character. */
+    
+    case ']':
+    if ((cd->external_options & PCRE_JAVASCRIPT_COMPAT) != 0)
+      {
+      *errorcodeptr = ERR64;
+      goto FAILED;  
+      }
+    goto NORMAL_CHAR;      


     case '[':
     previous = code;
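
Note on the hunk above: the effect of the new case ']' branch can be approximated outside PCRE with a small scanner. This is only a sketch under stated assumptions; find_lone_bracket is a hypothetical helper, not a PCRE function, and it ignores \Q...\E quoting and POSIX classes such as [[:alpha:]]. The real check sits in the main compile switch, which only ever sees a ']' that is not part of a character class.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): return the offset of the first
   ']' that is neither escaped nor inside a character class, or -1 if there
   is none. Simplified: \Q...\E and POSIX classes are not handled. */
int find_lone_bracket(const char *pattern)
{
  int in_class = 0;
  size_t i;

  assert(pattern != NULL);
  for (i = 0; pattern[i] != '\0'; i++)
    {
    char c = pattern[i];
    if (c == '\\')
      {
      if (pattern[i + 1] == '\0') break;  /* trailing backslash */
      i++;                                /* skip the escaped character */
      continue;
      }
    if (!in_class)
      {
      if (c == '[')
        {
        in_class = 1;
        if (pattern[i + 1] == '^') i++;   /* negated class */
        if (pattern[i + 1] == ']') i++;   /* a leading ']' is a literal */
        }
      else if (c == ']') return (int)i;   /* lone ']' found */
      }
    else if (c == ']') in_class = 0;      /* end of class */
    }
  return -1;
}
```

For the pattern TA] used in the new tests, this returns 2, which is the offset reported in testoutput2. In JavaScript compatibility mode a non-negative result corresponds to ERR64; in the default mode the character is simply compiled as data.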


Modified: code/trunk/pcre_exec.c
===================================================================
--- code/trunk/pcre_exec.c    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/pcre_exec.c    2008-04-12 15:59:03 UTC (rev 336)
@@ -1731,17 +1731,26 @@
     case OP_REF:
       {
       offset = GET2(ecode, 1) << 1;               /* Doubled ref number */
-      ecode += 3;                                 /* Advance past item */
+      ecode += 3;   
+      
+      /* If the reference is unset, there are two possibilities:
+      
+      (a) In the default, Perl-compatible state, set the length to be longer
+      than the amount of subject left; this ensures that every attempt at a
+      match fails. We can't just fail here, because of the possibility of
+      quantifiers with zero minima.
+      
+      (b) If the JavaScript compatibility flag is set, set the length to zero 
+      so that the back reference matches an empty string. 
+      
+      Otherwise, set the length to the length of what was matched by the 
+      referenced subpattern. */
+      
+      if (offset >= offset_top || md->offset_vector[offset] < 0)
+        length = (md->jscript_compat)? 0 : md->end_subject - eptr + 1;  
+      else
+        length = md->offset_vector[offset+1] - md->offset_vector[offset];


-      /* If the reference is unset, set the length to be longer than the amount
-      of subject left; this ensures that every attempt at a match fails. We
-      can't just fail here, because of the possibility of quantifiers with zero
-      minima. */
-
-      length = (offset >= offset_top || md->offset_vector[offset] < 0)?
-        md->end_subject - eptr + 1 :
-        md->offset_vector[offset+1] - md->offset_vector[offset];
-
       /* Set up for repetition, or handle the non-repeated case */


       switch (*ecode)
@@ -4458,6 +4467,7 @@


 md->endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0;
 utf8 = md->utf8 = (re->options & PCRE_UTF8) != 0;
+md->jscript_compat = (re->options & PCRE_JAVASCRIPT_COMPAT) != 0;

 md->notbol = (options & PCRE_NOTBOL) != 0;
 md->noteol = (options & PCRE_NOTEOL) != 0;
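
Note on the OP_REF hunk above: the length computation can be modelled in isolation roughly as follows. This is a sketch, not the PCRE source; backref_length and backref_can_match are hypothetical names, and the parameters stand in for md->offset_vector, offset_top, the reference number, the amount of subject left (md->end_subject - eptr), and md->jscript_compat.

```c
#include <assert.h>

/* Hypothetical model of the OP_REF length rule. Group n owns slots 2n and
   2n+1 of the offset vector (start and end); a negative start means the
   group is unset. */
int backref_length(const int *offset_vector, int offset_top, int group,
                   int remaining, int jscript_compat)
{
  int offset;

  assert(group >= 0 && remaining >= 0);
  offset = group << 1;                      /* doubled reference number */
  if (offset >= offset_top || offset_vector[offset] < 0)
    return jscript_compat ? 0 : remaining + 1;
  return offset_vector[offset + 1] - offset_vector[offset];
}

/* A back reference of the given length can only match if that much
   subject is left. */
int backref_can_match(int length, int remaining)
{
  return length <= remaining;
}
```

With an unset group, the default (Perl) branch returns one more than the remaining subject length, so backref_can_match is always false and every attempt fails, while a quantifier with a zero minimum can still skip the reference. With the JavaScript flag the length is 0, so the reference matches an empty string, which is why /(\3)(\1)(a)/<JS> matches "a" in "cat" with groups 1 and 2 empty.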

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/pcre_internal.h    2008-04-12 15:59:03 UTC (rev 336)
@@ -514,7 +514,8 @@
   (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
    PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
    PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
-   PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)
+   PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
+   PCRE_JAVASCRIPT_COMPAT)


 #define PUBLIC_EXEC_OPTIONS \
   (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NO_UTF8_CHECK| \
@@ -884,7 +885,7 @@
        ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
        ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
        ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
-       ERR60, ERR61, ERR62, ERR63 };
+       ERR60, ERR61, ERR62, ERR63, ERR64 };


 /* The real format of the start of the pcre block; the index of names and the
 code vector run on as long as necessary after the end. We store an explicit
@@ -1009,6 +1010,7 @@
   BOOL   notbol;                /* NOTBOL flag */
   BOOL   noteol;                /* NOTEOL flag */
   BOOL   utf8;                  /* UTF8 flag */
+  BOOL   jscript_compat;        /* JAVASCRIPT_COMPAT flag */ 
   BOOL   endonly;               /* Dollar not before final \n */
   BOOL   notempty;              /* Empty string match not wanted */
   BOOL   partial;               /* PARTIAL flag */
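
Note on ERR64: the error texts in pcre_compile.c are stored as one long string with \0 between entries, which is why this revision appends \0 to the previous last entry before adding the new one. Without that separator, the two adjacent string literals would be concatenated into a single entry. A minimal sketch of looking up the nth text in such a table, assuming a three-entry demo table and a hypothetical function name (demo_find_error_text is illustrative, not the PCRE routine):

```c
#include <assert.h>
#include <stddef.h>

/* Demo table only: entries are separated by \0 and the compiler adds a
   final \0 after the last one. */
static const char demo_error_texts[] =
  "no error\0"
  "digit expected after (?+\0"
  "] is an invalid data character in JavaScript compatibility mode";

/* Return the nth entry (counting from 0), or NULL if n is past the end. */
const char *demo_find_error_text(int n)
{
  const char *s = demo_error_texts;
  const char *end = demo_error_texts + sizeof(demo_error_texts);

  assert(n >= 0);
  for (; n > 0; n--)
    {
    while (*s++ != '\0') { }      /* advance past the current entry */
    if (s >= end) return NULL;    /* ran off the end of the table */
    }
  return s;
}
```

The position of each text in the table has to line up with the ERRnn enumeration in pcre_internal.h, so adding ERR64 to the enum and the new text to the table are two halves of the same change.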


Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/pcreposix.c    2008-04-12 15:59:03 UTC (rev 336)
@@ -126,7 +126,8 @@
   REG_BADPAT,  /* (?+ or (?- must be followed by a non-zero number */
   REG_BADPAT,  /* number is too big */
   REG_BADPAT,  /* subpattern name expected */
-  REG_BADPAT   /* digit expected after (?+ */
+  REG_BADPAT,  /* digit expected after (?+ */
+  REG_BADPAT   /* ] is an invalid data character in JavaScript compatibility mode */
 };


 /* Table of texts corresponding to POSIX error codes */

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/pcretest.c    2008-04-12 15:59:03 UTC (rev 336)
@@ -1247,10 +1247,18 @@


       case '<':
         {
-        int x = check_newline(pp, outfile);
-        if (x == 0) goto SKIP_DATA;
-        options |= x;
-        while (*pp++ != '>');
+        if (strncmp((char *)pp, "JS>", 3) == 0)
+          {
+          options |= PCRE_JAVASCRIPT_COMPAT;
+          pp += 3;  
+          }
+        else
+          {      
+          int x = check_newline(pp, outfile);
+          if (x == 0) goto SKIP_DATA;
+          options |= x;
+          while (*pp++ != '>');
+          } 
         }
       break;



Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/testdata/testinput2    2008-04-12 15:59:03 UTC (rev 336)
@@ -2655,4 +2655,16 @@
     ** Failers
     xxz  


+/(\3)(\1)(a)/
+    cat
+
+/(\3)(\1)(a)/<JS>
+    cat
+    
+/TA]/
+    The ACTA] comes 
+
+/TA]/<JS>
+    The ACTA] comes 
+
 / End of testinput2 /


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2008-04-12 14:36:14 UTC (rev 335)
+++ code/trunk/testdata/testoutput2    2008-04-12 15:59:03 UTC (rev 336)
@@ -9527,4 +9527,22 @@
     xxz  
 No match


+/(\3)(\1)(a)/
+    cat
+No match
+
+/(\3)(\1)(a)/<JS>
+    cat
+ 0: a
+ 1: 
+ 2: 
+ 3: a
+    
+/TA]/
+    The ACTA] comes 
+ 0: TA]
+
+/TA]/<JS>
+Failed: ] is an invalid data character in JavaScript compatibility mode at offset 2
+
 / End of testinput2 /