[Pcre-svn] [514] code/trunk: Add support for \N.

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [514] code/trunk: Add support for \N.
Revision: 514
          http://vcs.pcre.org/viewvc?view=rev&revision=514
Author:   ph10
Date:     2010-05-03 13:54:22 +0100 (Mon, 03 May 2010)


Log Message:
-----------
Add support for \N.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcrepattern.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2010-05-03 11:13:37 UTC (rev 513)
+++ code/trunk/ChangeLog    2010-05-03 12:54:22 UTC (rev 514)
@@ -16,6 +16,9 @@
 4.  Inside a character class, PCRE always treated \R and \X as literals, 
     whereas Perl faults them if its -w option is set. I have changed PCRE so
     that it faults them when PCRE_EXTRA is set.
+    
+5.  Added support for \N, which always matches any character other than
+    newline. (It is the same as "." when PCRE_DOTALL is not set.)



Version 8.02 19-Mar-2010

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2010-05-03 11:13:37 UTC (rev 513)
+++ code/trunk/doc/pcrepattern.3    2010-05-03 12:54:22 UTC (rev 514)
@@ -95,9 +95,11 @@
 they must be in upper case. If more than one of them is present, the last one
 is used.
 .P
-The newline convention does not affect what the \eR escape sequence matches. By
-default, this is any Unicode newline sequence, for Perl compatibility. However,
-this can be changed; see the description of \eR in the section entitled
+The newline convention affects the interpretation of the dot metacharacter when
+PCRE_DOTALL is not set, and also the behaviour of \eN. However, it does not
+affect what the \eR escape sequence matches. By default, this is any Unicode
+newline sequence, for Perl compatibility. However, this can be changed; see the
+description of \eR in the section entitled
 .\" HTML <a href="#newlineseq">
 .\" </a>
 "Newline sequences"
@@ -296,14 +298,10 @@
 All the sequences that define a single character value can be used both inside
 and outside character classes. In addition, inside a character class, the
 sequence \eb is interpreted as the backspace character (hex 08). The sequences
-\eB, \eR, and \eX are not special inside a character class. Like any other
+\eB, \eN, \eR, and \eX are not special inside a character class. Like any other
 unrecognized escape sequences, they are treated as the literal characters "B",
-"R", and "X" by default, but cause an error if the PCRE_EXTRA option is set.
-Outside a character class, these sequences have different meanings
-.\" HTML <a href="#uniextseq">
-.\" </a>
-(see below).
-.\"
+"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is
+set. Outside a character class, these sequences have different meanings.
 .
 .
 .SS "Absolute and relative back references"
@@ -345,8 +343,7 @@
 .SS "Generic character types"
 .rs
 .sp
-Another use of backslash is for specifying generic character types. The
-following are always recognized:
+Another use of backslash is for specifying generic character types:
 .sp
   \ed     any decimal digit
   \eD     any character that is not a decimal digit
@@ -359,9 +356,18 @@
   \ew     any "word" character
   \eW     any "non-word" character
 .sp
-Each pair of escape sequences partitions the complete set of characters into
-two disjoint sets. Any given character matches one, and only one, of each pair.
+There is also the single sequence \eN, which matches a non-newline character. 
+This is the same as 
+.\" HTML <a href="#fullstopdot">
+.\" </a>
+the "." metacharacter 
+.\"
+when PCRE_DOTALL is not set.
 .P
+Each pair of lower and upper case escape sequences partitions the complete set
+of characters into two disjoint sets. Any given character matches one, and only
+one, of each pair.
+.P
 These character type sequences can appear both inside and outside character
 classes. They each match one character of the appropriate type. If the current
 matching point is at the end of the subject string, all of them fail, since
@@ -864,7 +870,8 @@
 \eA it is always anchored, whether or not PCRE_MULTILINE is set.
 .
 .
-.SH "FULL STOP (PERIOD, DOT)"
+.\" HTML <a name="fullstopdot"></a>
+.SH "FULL STOP (PERIOD, DOT) AND \eN"
 .rs
 .sp
 Outside a character class, a dot in the pattern matches any one character in
@@ -886,6 +893,10 @@
 The handling of dot is entirely independent of the handling of circumflex and
 dollar, the only relationship being that they both involve newlines. Dot has no
 special meaning in a character class.
+.P
+The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not 
+set. In other words, it matches any one character except one that signifies the 
+end of a line.
 .
 .
 .SH "MATCHING A SINGLE BYTE"


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2010-05-03 11:13:37 UTC (rev 513)
+++ code/trunk/pcre_compile.c    2010-05-03 12:54:22 UTC (rev 514)
@@ -124,7 +124,7 @@
      -ESC_H,                  0,
      0,                       -ESC_K,
      0,                       0,
-     0,                       0,
+     -ESC_N,                  0,
      -ESC_P,                  -ESC_Q,
      -ESC_R,                  -ESC_S,
      0,                       0,
@@ -171,7 +171,7 @@
 /*  B8 */     0,     0,      0,       0,      0,   ']',    '=',    '-',
 /*  C0 */   '{',-ESC_A, -ESC_B,  -ESC_C, -ESC_D,-ESC_E,      0, -ESC_G,
 /*  C8 */-ESC_H,     0,      0,       0,      0,     0,      0,      0,
-/*  D0 */   '}',     0, -ESC_K,       0,      0,     0,      0, -ESC_P,
+/*  D0 */   '}',     0, -ESC_K,       0,      0,-ESC_N,      0, -ESC_P,
 /*  D8 */-ESC_Q,-ESC_R,      0,       0,      0,     0,      0,      0,
 /*  E0 */  '\\',     0, -ESC_S,       0,      0,-ESC_V, -ESC_W, -ESC_X,
 /*  E8 */     0,-ESC_Z,      0,       0,      0,     0,      0,      0,
@@ -324,7 +324,7 @@
   /* 35 */
   "invalid condition (?(0)\0"
   "\\C not allowed in lookbehind assertion\0"
-  "PCRE does not support \\L, \\l, \\N, \\U, or \\u\0"
+  "PCRE does not support \\L, \\l, \\N{name}, \\U, or \\u\0"
   "number after (?C is > 255\0"
   "closing ) for (?C expected\0"
   /* 40 */
@@ -593,7 +593,6 @@


     case CHAR_l:
     case CHAR_L:
-    case CHAR_N:
     case CHAR_u:
     case CHAR_U:
     *errorcodeptr = ERR37;
@@ -830,7 +829,13 @@
     break;
     }
   }
+  
+/* Perl supports \N{name} for character names, as well as plain \N for "not 
+newline". PCRE does not support \N{name}. */


+if (c == -ESC_N && ptr[1] == CHAR_LEFT_CURLY_BRACKET)
+  *errorcodeptr = ERR37; 
+
 *ptrptr = ptr;
 return c;
 }
@@ -3500,14 +3505,11 @@
           d = check_escape(&ptr, errorcodeptr, cd->bracount, options, TRUE);
           if (*errorcodeptr != 0) goto FAILED;


-          /* \b is backspace; \X is literal X; \R is literal R; any other
-          special means the '-' was literal */
+          /* \b is backspace; any other special means the '-' was literal */


           if (d < 0)
             {
-            if (d == -ESC_b) d = CHAR_BS;
-            else if (d == -ESC_X) d = CHAR_X;
-            else if (d == -ESC_R) d = CHAR_R; else
+            if (d == -ESC_b) d = CHAR_BS; else
               {
               ptr = oldptr;
               goto LONE_SINGLE_CHARACTER;  /* A few lines below */


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2010-05-03 11:13:37 UTC (rev 513)
+++ code/trunk/pcre_internal.h    2010-05-03 12:54:22 UTC (rev 514)
@@ -1209,9 +1209,10 @@
 /* These are escaped items that aren't just an encoding of a particular data
 value such as \n. They must have non-zero values, as check_escape() returns
 their negation. Also, they must appear in the same order as in the opcode
-definitions below, up to ESC_z. There's a dummy for OP_ANY because it
-corresponds to "." rather than an escape sequence, and another for OP_ALLANY
-(which is used for [^] in JavaScript compatibility mode).
+definitions below, up to ESC_z. There's a dummy for OP_ALLANY because it
+corresponds to "." in DOTALL mode rather than an escape sequence. It is also
+used for [^] in JavaScript compatibility mode. In non-DOTALL mode, "." behaves 
+like \N.


 The final escape must be ESC_REF as subsequent values are used for
 backreferences (\1, \2, \3, etc). There are two tests in the code for an escape
@@ -1221,7 +1222,7 @@
 */

 enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
-       ESC_W, ESC_w, ESC_dum1, ESC_dum2, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
+       ESC_W, ESC_w, ESC_N, ESC_dum, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
        ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k,
        ESC_REF };


@@ -1249,8 +1250,8 @@
   OP_WHITESPACE,         /*  9 \s */
   OP_NOT_WORDCHAR,       /* 10 \W */
   OP_WORDCHAR,           /* 11 \w */
-  OP_ANY,            /* 12 Match any character (subject to DOTALL) */
-  OP_ALLANY,         /* 13 Match any character (not subject to DOTALL) */
+  OP_ANY,            /* 12 Match any character except newline */
+  OP_ALLANY,         /* 13 Match any character */
   OP_ANYBYTE,        /* 14 Match any byte (\C); different to OP_ANY for UTF-8 */
   OP_NOTPROP,        /* 15 \P (not Unicode property) */
   OP_PROP,           /* 16 \p (Unicode property) */


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2010-05-03 11:13:37 UTC (rev 513)
+++ code/trunk/testdata/testinput2    2010-05-03 12:54:22 UTC (rev 514)
@@ -3463,4 +3463,22 @@
 /(.(*ACCEPT))*5/
     abcde


+/A\NB./BZ
+  ACBD
+  ** Failers
+  A\nB
+  ACB\n   
+
+/A\NB./sBZ
+  ACBD
+  ACB\n 
+  ** Failers
+  A\nB  
+  
+/A\NB/<crlf>
+  A\nB
+  A\rB
+  ** Failers
+  A\r\nB    
+
 /-- End of testinput2 --/


Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2010-05-03 11:13:37 UTC (rev 513)
+++ code/trunk/testdata/testoutput2    2010-05-03 12:54:22 UTC (rev 514)
@@ -3228,19 +3228,19 @@
 Failed: POSIX named classes are supported only within a class at offset 0


 /\l/I
-Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
+Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

 /\L/I
-Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
+Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

 /\N{name}/I
-Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
+Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

 /\u/I
-Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
+Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

 /\U/I
-Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
+Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

 /[/I
 Failed: missing terminating ] for character class at offset 1
@@ -11033,4 +11033,52 @@
  0: a
  1: a

+/A\NB./BZ
+------------------------------------------------------------------
+        Bra
+        A
+        Any
+        B
+        Any
+        Ket
+        End
+------------------------------------------------------------------
+  ACBD
+ 0: ACBD
+  ** Failers
+No match
+  A\nB
+No match
+  ACB\n   
+No match
+
+/A\NB./sBZ
+------------------------------------------------------------------
+        Bra
+        A
+        Any
+        B
+        AllAny
+        Ket
+        End
+------------------------------------------------------------------
+  ACBD
+ 0: ACBD
+  ACB\n 
+ 0: ACB\x0a
+  ** Failers
+No match
+  A\nB  
+No match
+  
+/A\NB/<crlf>
+  A\nB
+ 0: A\x0aB
+  A\rB
+ 0: A\x0dB
+  ** Failers
+No match
+  A\r\nB    
+No match
+
 /-- End of testinput2 --/
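
The new test cases above exercise three behaviours: \N never matches a newline, "." matches a newline only under /s (PCRE_DOTALL), and under <crlf> only the two-character sequence CR LF counts as a newline. As a rough illustration of those semantics, and not the code PCRE actually uses, the following self-contained C sketch models them. Every name in it (newline_mode, at_newline, match_backslash_N, match_dot, search_A_N_B) is hypothetical and invented for this example, and only the LF and CRLF conventions are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; none of these are PCRE identifiers. */
typedef enum { NL_LF, NL_CRLF } newline_mode;

/* True if a newline sequence, under the given convention, starts at s[i]. */
static bool at_newline(const char *s, size_t len, size_t i, newline_mode nl)
{
if (i >= len) return false;
if (nl == NL_LF) return s[i] == '\n';
return s[i] == '\r' && i + 1 < len && s[i + 1] == '\n';
}

/* \N: any one character that does not start a newline. DOTALL is ignored. */
static bool match_backslash_N(const char *s, size_t len, size_t i,
  newline_mode nl)
{
return i < len && !at_newline(s, len, i, nl);
}

/* ".": behaves like \N unless dotall is set, when any character matches. */
static bool match_dot(const char *s, size_t len, size_t i, newline_mode nl,
  bool dotall)
{
if (i >= len) return false;
return dotall || !at_newline(s, len, i, nl);
}

/* Unanchored search for the fixed pattern A\NB, optionally followed by ".".
This mirrors only the shape of the test patterns; it is not a regex engine. */
bool search_A_N_B(const char *s, newline_mode nl, bool trailing_dot,
  bool dotall)
{
size_t len;
assert(s != NULL);
len = strlen(s);
for (size_t i = 0; i + 2 < len; i++)
  {
  if (s[i] != 'A') continue;
  if (!match_backslash_N(s, len, i + 1, nl)) continue;
  if (s[i + 2] != 'B') continue;
  if (trailing_dot && !match_dot(s, len, i + 3, nl, dotall)) continue;
  return true;
  }
return false;
}
```

Under these assumptions, search_A_N_B("ACB\n", NL_LF, true, true) succeeds while the same call with dotall set to false fails, and search_A_N_B("A\r\nB", NL_CRLF, false, false) fails even though "A\nB" and "A\rB" succeed, which agrees with the expected results in testoutput2.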