[Pcre-svn] [530] code/trunk: Support \C in lookbehinds and DFA matching in UTF-32 mode.

Author: Subversion repository
To: pcre-svn

Revision: 530

          http://www.exim.org/viewvc/pcre2?view=rev&revision=530
Author:   ph10
Date:     2016-06-20 19:14:51 +0100 (Mon, 20 Jun 2016)
Log Message:
-----------
Support \C in lookbehinds and DFA matching in UTF-32 mode.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/HACKING
    code/trunk/doc/pcre2pattern.3
    code/trunk/src/pcre2_compile.c
    code/trunk/src/pcre2_error.c
    code/trunk/testdata/testinput22
    code/trunk/testdata/testoutput22-16
    code/trunk/testdata/testoutput22-32
    code/trunk/testdata/testoutput22-8

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/ChangeLog    2016-06-20 18:14:51 UTC (rev 530)
@@ -151,7 +151,10 @@
 the match extended over a line boundary, as it tried to find more matches "on 
 the same line" - but it was already over the end.

+39. Allow \C in lookbehinds and DFA matching in UTF-32 mode (by converting it
+to the same code as '.' when PCRE2_DOTALL is set).

+
Version 10.21 12-January-2016
-----------------------------

Modified: code/trunk/HACKING
===================================================================
--- code/trunk/HACKING    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/HACKING    2016-06-20 18:14:51 UTC (rev 530)
@@ -228,7 +228,12 @@
 This ends the assertion, not the entire pattern match. The assertion (?!) is 
 always optimized to OP_FAIL.

+OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
+character). Another use is for [^] when empty classes are permitted
+(PCRE2_ALLOW_EMPTY_CLASS is set).

+
Backtracking control verbs with optional data
---------------------------------------------

@@ -601,4 +606,4 @@
correct length, in order to catch updating errors.

Philip Hazel
-June 2015
+June 2016

Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/doc/pcre2pattern.3    2016-06-20 18:14:51 UTC (rev 530)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "13 November 2015" "PCRE2 10.21"
+.TH PCRE2PATTERN 3 "20 June 2016" "PCRE2 10.22"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -1256,16 +1256,20 @@
 .\" </a>
 (described below)
 .\"
-in a UTF mode, because this would make it impossible to calculate the length of
-the lookbehind. Neither the alternative matching function
-\fBpcre2_dfa_match()\fP nor the JIT optimizer support \eC in a UTF mode. The
-former gives a match-time error; the latter fails to optimize and so the match
-is always run using the interpreter.
+in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
+the length of the lookbehind. Neither the alternative matching function
+\fBpcre2_dfa_match()\fP nor the JIT optimizer support \eC in these UTF modes.
+The former gives a match-time error; the latter fails to optimize and so the
+match is always run using the interpreter.
 .P
+In the 32-bit library, however, \eC is always supported (when not explicitly 
+locked out) because it always matches a single code unit, whether or not UTF-32
+is specified.
+.P
 In general, the \eC escape sequence is best avoided. However, one way of using
-it that avoids the problem of malformed UTF characters is to use a lookahead to
-check the length of the next character, as in this pattern, which could be used
-with a UTF-8 string (ignore white space and line breaks):
+it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
+lookahead to check the length of the next character, as in this pattern, which
+could be used with a UTF-8 string (ignore white space and line breaks):
 .sp
   (?| (?=[\ex00-\ex7f])(\eC) |
       (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
@@ -3425,6 +3429,6 @@
 .rs
 .sp
 .nf
-Last updated: 13 November 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 20 June 2016
+Copyright (c) 1997-2016 University of Cambridge.
 .fi
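
The documentation change above turns on how many code units one character occupies. As a rough illustration only (not code from PCRE2; the function names utf8_units and utf32_units are made up for this note), the sketch below shows why a \C inside a lookbehind has no fixed character meaning in UTF-8, whereas in UTF-32 one code unit is always one character:

```c
/* Hypothetical helper, not part of PCRE2: number of code units in the
   UTF-8 character whose first byte is `lead`, or 0 if `lead` cannot
   start a well-formed character. */
int utf8_units(unsigned char lead)
{
  if (lead < 0x80) return 1;                    /* ASCII */
  if (lead >= 0xc2 && lead <= 0xdf) return 2;   /* U+0080..U+07FF */
  if (lead >= 0xe0 && lead <= 0xef) return 3;   /* U+0800..U+FFFF */
  if (lead >= 0xf0 && lead <= 0xf4) return 4;   /* U+10000..U+10FFFF */
  return 0;                                     /* continuation or invalid */
}

/* Hypothetical helper: in UTF-32 every character is exactly one code
   unit, so \C and "any character" coincide. */
int utf32_units(unsigned int c)
{
  (void)c;
  return 1;
}
```

The lookahead alternatives in the pattern quoted in the man page do the same classification by character range before consuming one, two, or more units with \C.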

Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/src/pcre2_compile.c    2016-06-20 18:14:51 UTC (rev 530)
@@ -1117,8 +1117,8 @@
     cc++;
     break;

-    /* The single-byte matcher isn't allowed. This only happens in UTF mode;
-    otherwise \C is coded as OP_ALLANY. */
+    /* The single-byte matcher isn't allowed. This only happens in UTF-8 or 
+    UTF-16 mode; otherwise \C is coded as OP_ALLANY. */

     case OP_ANYBYTE:
     return FFL_BACKSLASHC;
@@ -7420,12 +7420,17 @@
           }
         else
 #endif
-        /* In non-UTF mode, we turn \C into OP_ALLANY instead of OP_ANYBYTE
-        so that it works in DFA mode and in lookbehinds. */
+        /* In non-UTF mode, and for both 32-bit modes, we turn \C into
+        OP_ALLANY instead of OP_ANYBYTE so that it works in DFA mode and in
+        lookbehinds. */

           {
           previous = (escape > ESC_b && escape < ESC_Z)? code : NULL;
+#if PCRE2_CODE_UNIT_WIDTH == 32
+          *code++ = (escape == ESC_C)? OP_ALLANY : escape;
+#else
           *code++ = (!utf && escape == ESC_C)? OP_ALLANY : escape;
+#endif          
           }
         }
       continue;
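
For readers who want the effect of the hunk above in one place, here is a minimal sketch of the resulting decision as a plain function. It is an approximation for illustration: the enum values are invented stand-ins, and opcode_for_backslash_C is a hypothetical name; the real code makes the width choice at preprocessing time via PCRE2_CODE_UNIT_WIDTH rather than at run time.

```c
/* Invented stand-in values; the real opcodes live in the PCRE2 sources. */
enum { OP_ANYBYTE = 100, OP_ALLANY = 101 };

/* Hypothetical helper: which opcode \C compiles to, given the code unit
   width (8, 16 or 32) and whether a UTF mode is in force. */
int opcode_for_backslash_C(int code_unit_width, int utf)
{
  /* 32-bit library: one code unit is one character, UTF or not. */
  if (code_unit_width == 32) return OP_ALLANY;

  /* 8-bit and 16-bit libraries: only non-UTF mode can use OP_ALLANY. */
  return utf ? OP_ANYBYTE : OP_ALLANY;
}
```

Only the UTF-8 and UTF-16 cases still produce OP_ANYBYTE, which is what the lookbehind length check earlier in this file rejects with FFL_BACKSLASHC.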

Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/src/pcre2_error.c    2016-06-20 18:14:51 UTC (rev 530)
@@ -106,7 +106,7 @@
   "character code point value in \\x{} or \\o{} is too large\0"
   /* 35 */
   "invalid condition (?(0)\0"
-  "\\C is not allowed in a lookbehind assertion\0"
+  "\\C is not allowed in a lookbehind assertion in UTF-" XSTRING(PCRE2_CODE_UNIT_WIDTH) " mode\0"
   "PCRE does not support \\L, \\l, \\N{name}, \\U, or \\u\0"
   "number after (?C is greater than 255\0"
   "closing parenthesis for (?C expected\0"
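
The new message relies on two-level stringification so that the numeric value of PCRE2_CODE_UNIT_WIDTH, rather than the macro name, ends up in the text. The sketch below reproduces that technique in isolation. It assumes a width of 8 and defines STRING/XSTRING locally in the usual two-step form; lookbehind_message is a hypothetical accessor added only for this note.

```c
/* Two-step stringification: XSTRING expands its argument first, then
   STRING turns the result into a string literal. */
#define STRING(a)  # a
#define XSTRING(s) STRING(s)

/* Assumed for this sketch; the build normally supplies this value. */
#define PCRE2_CODE_UNIT_WIDTH 8

/* Adjacent string literals are concatenated by the compiler. */
static const char backslash_c_message[] =
  "\\C is not allowed in a lookbehind assertion in UTF-"
  XSTRING(PCRE2_CODE_UNIT_WIDTH) " mode";

/* Hypothetical accessor, not part of PCRE2. */
const char *lookbehind_message(void)
{
  return backslash_c_message;
}
```

With the width defined as 16, the same literal would read "... in UTF-16 mode", matching the updated expected output in testoutput22-16. Using STRING directly would instead embed the text PCRE2_CODE_UNIT_WIDTH.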

Modified: code/trunk/testdata/testinput22
===================================================================
--- code/trunk/testdata/testinput22    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/testdata/testinput22    2016-06-20 18:14:51 UTC (rev 530)
@@ -6,9 +6,11 @@
 /ab\Cde/utf,info
     abXde

-# This should produce an error diagnostic (\C in UTF lookbehind)
+# This should produce an error diagnostic (\C in UTF lookbehind) in 8-bit and
+# 16-bit modes, but not in 32-bit mode.

 /(?<=ab\Cde)X/utf
+    ab!deXYZ

# Autopossessification tests

Modified: code/trunk/testdata/testoutput22-16
===================================================================
--- code/trunk/testdata/testoutput22-16    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/testdata/testoutput22-16    2016-06-20 18:14:51 UTC (rev 530)
@@ -13,10 +13,12 @@
     abXde
  0: abXde

-# This should produce an error diagnostic (\C in UTF lookbehind)
+# This should produce an error diagnostic (\C in UTF lookbehind) in 8-bit and
+# 16-bit modes, but not in 32-bit mode.

 /(?<=ab\Cde)X/utf
-Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion
+Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion in UTF-16 mode
+    ab!deXYZ

# Autopossessification tests

Modified: code/trunk/testdata/testoutput22-32
===================================================================
--- code/trunk/testdata/testoutput22-32    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/testdata/testoutput22-32    2016-06-20 18:14:51 UTC (rev 530)
@@ -9,14 +9,16 @@
 Options: utf
 First code unit = 'a'
 Last code unit = 'e'
-Subject length lower bound = 0
+Subject length lower bound = 5
     abXde
  0: abXde

-# This should produce an error diagnostic (\C in UTF lookbehind)
+# This should produce an error diagnostic (\C in UTF lookbehind) in 8-bit and
+# 16-bit modes, but not in 32-bit mode.

 /(?<=ab\Cde)X/utf
-Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion
+    ab!deXYZ
+ 0: X

# Autopossessification tests

@@ -34,10 +36,10 @@
 /\C+\X \X+\C/Bx,utf
 ------------------------------------------------------------------
         Bra
-        Anybyte+
+        AllAny+
         extuni
         extuni+
-        Anybyte
+        AllAny
         Ket
         End
 ------------------------------------------------------------------

Modified: code/trunk/testdata/testoutput22-8
===================================================================
--- code/trunk/testdata/testoutput22-8    2016-06-19 16:07:56 UTC (rev 529)
+++ code/trunk/testdata/testoutput22-8    2016-06-20 18:14:51 UTC (rev 530)
@@ -13,10 +13,12 @@
     abXde
  0: abXde

-# This should produce an error diagnostic (\C in UTF lookbehind)
+# This should produce an error diagnostic (\C in UTF lookbehind) in 8-bit and
+# 16-bit modes, but not in 32-bit mode.

 /(?<=ab\Cde)X/utf
-Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion
+Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion in UTF-8 mode
+    ab!deXYZ

# Autopossessification tests
