[Pcre-svn] [1058] code/trunk: Fix issues with BAD_ESCAPE_IS_…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [1058] code/trunk: Fix issues with BAD_ESCAPE_IS_LITERAL in character classes.
Revision: 1058
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1058
Author:   ph10
Date:     2019-01-04 16:41:32 +0000 (Fri, 04 Jan 2019)
Log Message:
-----------
Fix issues with BAD_ESCAPE_IS_LITERAL in character classes.
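
The effect of this change on escapes inside a character class can be summarised with a small standalone model. This is only a sketch of the control flow visible in the pcre2_compile.c hunks of this commit, not the PCRE2 source: the type names, enumerators, and function names (esc_kind, outcome, classify_class_escape, check_documented_cases) are hypothetical and were invented for illustration. ERR7, ERR50 and ERR71 are the error codes named in the diff. The later check that rejects [\s-_] at the start of a range is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of what check_escape() found. */
typedef enum {
  ESC_LITERAL,       /* plain code point (escape == 0 in the real code) */
  ESC_MALFORMED,     /* unrecognized or malformed, e.g. \j or \N{ */
  ESC_CLASS_TYPE,    /* \d, \s, \w and similar: valid in a class */
  ESC_NOT_IN_CLASS,  /* \B, \R, \X: recognized, never valid in a class */
  ESC_BARE_N         /* unqualified \N: recognized, unsupported in a class */
} esc_kind;

typedef enum {
  OUT_LITERAL,         /* handled as a literal character */
  OUT_CLASS_TYPE,      /* handled as a character type */
  OUT_ERR_BAD_ESCAPE,  /* whatever error check_escape() itself reported */
  OUT_ERR7,            /* escape sequence is invalid in character class */
  OUT_ERR50,           /* invalid range in character class */
  OUT_ERR71            /* \N in a class */
} outcome;

/* Model of the post-change decision order for one escape in a class. */
outcome classify_class_escape(esc_kind kind, bool range_started,
                              bool bad_escape_is_literal)
{
  /* Only unrecognized or malformed escapes are rescued by the option. */
  if (kind == ESC_MALFORMED) {
    if (!bad_escape_is_literal) return OUT_ERR_BAD_ESCAPE;
    kind = ESC_LITERAL;
  }

  /* Literals, and escapes that are always errors, are settled before
     the range state is examined. */
  switch (kind) {
    case ESC_LITERAL:      return OUT_LITERAL;
    case ESC_NOT_IN_CLASS: return OUT_ERR7;
    default:               break;
  }

  /* Any remaining escape cannot end a range, whatever the option says. */
  if (range_started) return OUT_ERR50;

  if (kind == ESC_BARE_N) return OUT_ERR71;
  return OUT_CLASS_TYPE;
}

/* Cases taken from the ChangeLog entry, documentation and test data. */
void check_documented_cases(void)
{
  /* [\N{] with the option: malformed, so treated as literal text */
  assert(classify_class_escape(ESC_MALFORMED, false, true) == OUT_LITERAL);
  /* [\N] with the option: still an error */
  assert(classify_class_escape(ESC_BARE_N, false, true) == OUT_ERR71);
  /* [_-\s] and [Q-\N] with the option: invalid range */
  assert(classify_class_escape(ESC_CLASS_TYPE, true, true) == OUT_ERR50);
  assert(classify_class_escape(ESC_BARE_N, true, true) == OUT_ERR50);
  /* [\B\R\X] and [A-\BP-\RV-\X]: ERR7 with or without the option */
  assert(classify_class_escape(ESC_NOT_IN_CLASS, false, false) == OUT_ERR7);
  assert(classify_class_escape(ESC_NOT_IN_CLASS, true, true) == OUT_ERR7);
}
```

Calling check_documented_cases() from a test driver exercises the cases above. The ordering is the point of the model: the option is consulted once, immediately after the escape is parsed, so none of the later error paths can fall back to literal treatment.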


Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/html/pcre2api.html
    code/trunk/doc/pcre2.txt
    code/trunk/doc/pcre2api.3
    code/trunk/src/pcre2_compile.c
    code/trunk/src/pcre2_error.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/ChangeLog    2019-01-04 16:41:32 UTC (rev 1058)
@@ -102,7 +102,17 @@


25. Insert a cast in pcre2_dfa_match.c to suppress a compiler warning.

+26. With PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL set, escape sequences such as \s
+which are valid in character classes, but not as the end of ranges, were being
+treated as literals. An example is [_-\s] (but not [\s-_] because that gave an
+error at the *start* of a range). Now an "invalid range" error is given
+independently of PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL.

+27. Related to 26 above, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL was affecting known
+escape sequences such as \X when they appeared invalidly in a character class.
+Now the option applies only to unrecognized or malformed escape sequences.
+
+
Version 10.32 10-September-2018
-------------------------------


Modified: code/trunk/doc/html/pcre2api.html
===================================================================
--- code/trunk/doc/html/pcre2api.html    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/doc/html/pcre2api.html    2019-01-04 16:41:32 UTC (rev 1058)
@@ -1870,11 +1870,14 @@
 </P>
 <P>
 If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
-<b>pcre2_compile()</b>, all unrecognized or erroneous escape sequences are
+<b>pcre2_compile()</b>, all unrecognized or malformed escape sequences are
 treated as single-character escapes. For example, \j is a literal "j" and
 \x{2z} is treated as the literal string "x{2z}". Setting this option means
-that typos in patterns may go undetected and have unexpected results. This is a
-dangerous option. Use with care.
+that typos in patterns may go undetected and have unexpected results. Also note 
+that a sequence such as [\N{] is interpreted as a malformed attempt at
+[\N{...}] and so is treated as [N{] whereas [\N] gives an error because an
+unqualified \N is a valid escape sequence but is not supported in a character
+class. To reiterate: this is a dangerous option. Use with great care.
 <pre>
   PCRE2_EXTRA_ESCAPED_CR_IS_LF
 </pre>
@@ -3782,9 +3785,9 @@
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 27 November 2018
+Last updated: 04 January 2019
 <br>
-Copyright &copy; 1997-2018 University of Cambridge.
+Copyright &copy; 1997-2019 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.


Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/doc/pcre2.txt    2019-01-04 16:41:32 UTC (rev 1058)
@@ -1846,11 +1846,15 @@
        Perl.


        If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is  passed  to
-       pcre2_compile(),  all  unrecognized  or  erroneous escape sequences are
+       pcre2_compile(),  all  unrecognized  or  malformed escape sequences are
        treated as single-character escapes. For example, \j is a  literal  "j"
        and  \x{2z}  is  treated  as  the  literal string "x{2z}". Setting this
        option means that typos in patterns may go undetected  and  have  unex-
-       pected results. This is a dangerous option. Use with care.
+       pected  results. Also note that a sequence such as [\N{] is interpreted
+       as a malformed attempt at [\N{...}] and so is treated as  [N{]  whereas
+       [\N]  gives  an  error  because  an  unqualified  \N  is a valid escape
+       sequence but is not supported in a character class. To reiterate:  this
+       is a dangerous option. Use with great care.


          PCRE2_EXTRA_ESCAPED_CR_IS_LF


@@ -3654,8 +3658,8 @@

REVISION

-       Last updated: 27 November 2018
-       Copyright (c) 1997-2018 University of Cambridge.
+       Last updated: 04 January 2019
+       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------




Modified: code/trunk/doc/pcre2api.3
===================================================================
--- code/trunk/doc/pcre2api.3    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/doc/pcre2api.3    2019-01-04 16:41:32 UTC (rev 1058)
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "27 November 2018" "PCRE2 10.33"
+.TH PCRE2API 3 "04 January 2019" "PCRE2 10.33"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -1825,11 +1825,14 @@
 always causes an error in Perl.
 .P
 If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
-\fBpcre2_compile()\fP, all unrecognized or erroneous escape sequences are
+\fBpcre2_compile()\fP, all unrecognized or malformed escape sequences are
 treated as single-character escapes. For example, \ej is a literal "j" and
 \ex{2z} is treated as the literal string "x{2z}". Setting this option means
-that typos in patterns may go undetected and have unexpected results. This is a
-dangerous option. Use with care.
+that typos in patterns may go undetected and have unexpected results. Also note 
+that a sequence such as [\eN{] is interpreted as a malformed attempt at
+[\eN{...}] and so is treated as [N{] whereas [\eN] gives an error because an
+unqualified \eN is a valid escape sequence but is not supported in a character
+class. To reiterate: this is a dangerous option. Use with great care.
 .sp
   PCRE2_EXTRA_ESCAPED_CR_IS_LF
 .sp
@@ -3790,6 +3793,6 @@
 .rs
 .sp
 .nf
-Last updated: 27 November 2018
-Copyright (c) 1997-2018 University of Cambridge.
+Last updated: 04 January 2019
+Copyright (c) 1997-2019 University of Cambridge.
 .fi


Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/src/pcre2_compile.c    2019-01-04 16:41:32 UTC (rev 1058)
@@ -7,7 +7,7 @@


                        Written by Philip Hazel
      Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2018 University of Cambridge
+          New API code Copyright (c) 2016-2019 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -3346,9 +3346,9 @@
         tempptr = ptr;
         escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
           options, TRUE, cb);
+
         if (errorcode != 0)
           {
-          CLASS_ESCAPE_FAILED:
           if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
             goto FAILED;
           ptr = tempptr;
@@ -3359,30 +3359,32 @@
           escape = 0;                 /* Treat as literal character */
           }


-        if (escape == 0)  /* Escaped character code point is in c */
+        switch(escape)
           {
+          case 0:  /* Escaped character code point is in c */
           char_is_literal = FALSE;
           goto CLASS_LITERAL;
-          }


-        /* These three escapes do not alter the class range state. */
-
-        if (escape == ESC_b)
-          {
-          c = CHAR_BS;   /* \b is backspace in a class */
+          case ESC_b:
+          c = CHAR_BS;    /* \b is backspace in a class */
           char_is_literal = FALSE;
           goto CLASS_LITERAL;
-          }


-        else if (escape == ESC_Q)
-          {
+          case ESC_Q:
           inescq = TRUE;  /* Enter literal mode */
           goto CLASS_CONTINUE;
-          }


-        else if (escape == ESC_E)  /* Ignore orphan \E */
+          case ESC_E:     /* Ignore orphan \E */
           goto CLASS_CONTINUE;


+          case ESC_B:     /* Always an error in a class */
+          case ESC_R:
+          case ESC_X:
+          errorcode = ERR7;
+          ptr--;
+          goto FAILED;
+          }
+
         /* The second part of a range can be a single-character escape
         sequence (detected above), but not any of the other escapes. Perl
         treats a hyphen as a literal in such circumstances. However, in Perl's
@@ -3392,7 +3394,7 @@
         if (class_range_state == RANGE_STARTED)
           {
           errorcode = ERR50;
-          goto CLASS_ESCAPE_FAILED;
+          goto FAILED;  /* Not CLASS_ESCAPE_FAILED; always an error */
           }


         /* Of the remaining escapes, only those that define characters are
@@ -3402,8 +3404,8 @@
         switch(escape)
           {
           case ESC_N:
-          errorcode = ERR71;  /* Not supported in a class */
-          goto CLASS_ESCAPE_FAILED;
+          errorcode = ERR71;
+          goto FAILED;


           case ESC_H:
           case ESC_h:
@@ -3466,7 +3468,7 @@
             }
 #else
           errorcode = ERR45;
-          goto CLASS_ESCAPE_FAILED;
+          goto FAILED;
 #endif
           break;  /* End \P and \p */


@@ -3473,7 +3475,7 @@
           default:    /* All others are not allowed in a class */
           errorcode = ERR7;
           ptr--;
-          goto CLASS_ESCAPE_FAILED;
+          goto FAILED;
           }


         /* Perl gives a warning unless a following hyphen is the last character


Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/src/pcre2_error.c    2019-01-04 16:41:32 UTC (rev 1058)
@@ -7,7 +7,7 @@


                        Written by Philip Hazel
      Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2018 University of Cambridge
+          New API code Copyright (c) 2016-2019 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -71,7 +71,7 @@
/* 5 */
"number too big in {} quantifier\0"
"missing terminating ] for character class\0"
- "invalid escape sequence in character class\0"
+ "escape sequence is invalid in character class\0"
"range out of order in character class\0"
"quantifier does not follow a repeatable item\0"
/* 10 */

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/testdata/testinput2    2019-01-04 16:41:32 UTC (rev 1058)
@@ -5304,10 +5304,22 @@


/\N{\c/IB,bad_escape_is_literal

-/[\j\x{z}\o\gA-\Nb-\g]/B,bad_escape_is_literal
+/[\j\x{z}\o\gAb\g]/B,bad_escape_is_literal

/[Q-\N]/B,bad_escape_is_literal

+/[\s-_]/bad_escape_is_literal
+
+/[_-\s]/bad_escape_is_literal
+
+/[\B\R\X]/B
+
+/[\B\R\X]/B,bad_escape_is_literal
+
+/[A-\BP-\RV-\X]/B
+
+/[A-\BP-\RV-\X]/B,bad_escape_is_literal
+
# ----------------------------------------------------------------------

/a\b(c/literal

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2019-01-03 09:34:42 UTC (rev 1057)
+++ code/trunk/testdata/testoutput2    2019-01-04 16:41:32 UTC (rev 1058)
@@ -135,13 +135,13 @@
 Failed: error 106 at offset 5: missing terminating ] for character class


/[\B]/B
-Failed: error 107 at offset 2: invalid escape sequence in character class
+Failed: error 107 at offset 2: escape sequence is invalid in character class

/[\R]/B
-Failed: error 107 at offset 2: invalid escape sequence in character class
+Failed: error 107 at offset 2: escape sequence is invalid in character class

/[\X]/B
-Failed: error 107 at offset 2: invalid escape sequence in character class
+Failed: error 107 at offset 2: escape sequence is invalid in character class

/[z-a]/
Failed: error 108 at offset 3: range out of order in character class
@@ -16224,17 +16224,35 @@
Last code unit = 'c'
Subject length lower bound = 3

-/[\j\x{z}\o\gA-\Nb-\g]/B,bad_escape_is_literal
+/[\j\x{z}\o\gAb\g]/B,bad_escape_is_literal
 ------------------------------------------------------------------
         Bra
-        [A-Nb-gjoxz{}]
+        [Abgjoxz{}]
         Ket
         End
 ------------------------------------------------------------------


/[Q-\N]/B,bad_escape_is_literal
-Failed: error 108 at offset 4: range out of order in character class
+Failed: error 150 at offset 5: invalid range in character class

+/[\s-_]/bad_escape_is_literal
+Failed: error 150 at offset 3: invalid range in character class
+
+/[_-\s]/bad_escape_is_literal
+Failed: error 150 at offset 5: invalid range in character class
+
+/[\B\R\X]/B
+Failed: error 107 at offset 2: escape sequence is invalid in character class
+
+/[\B\R\X]/B,bad_escape_is_literal
+Failed: error 107 at offset 2: escape sequence is invalid in character class
+
+/[A-\BP-\RV-\X]/B
+Failed: error 107 at offset 4: escape sequence is invalid in character class
+
+/[A-\BP-\RV-\X]/B,bad_escape_is_literal
+Failed: error 107 at offset 4: escape sequence is invalid in character class
+
# ----------------------------------------------------------------------

/a\b(c/literal