[Pcre-svn] [285] code/trunk: Make \c operate like Perl in EB…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [285] code/trunk: Make \c operate like Perl in EBCDIC environments.
Revision: 285
          http://www.exim.org/viewvc/pcre2?view=rev&revision=285
Author:   ph10
Date:     2015-06-13 17:10:14 +0100 (Sat, 13 Jun 2015)
Log Message:
-----------
Make \c operate like Perl in EBCDIC environments.


Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre2pattern.3
    code/trunk/doc/pcre2syntax.3
    code/trunk/src/pcre2_compile.c
    code/trunk/src/pcre2_error.c
    code/trunk/testdata/testinputEBC
    code/trunk/testdata/testoutputEBC


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/ChangeLog    2015-06-13 16:10:14 UTC (rev 285)
@@ -161,7 +161,10 @@
 41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
 instead of the EBCDIC value.


+42. The handling of \c in an EBCDIC environment has been revised so that it is
+now compatible with the specification in Perl's perlebcdic page.

+
 Version 10.10 06-March-2015
 ---------------------------


Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/doc/pcre2pattern.3    2015-06-13 16:10:14 UTC (rev 285)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -337,10 +337,11 @@
 in patterns in a visible manner. There is no restriction on the appearance of
 non-printing characters in a pattern, but when a pattern is being prepared by
 text editing, it is often easier to use one of the following escape sequences
-than the binary character it represents:
+than the binary character it represents. In an ASCII or Unicode environment, 
+these escapes are as follows:
 .sp
   \ea        alarm, that is, the BEL character (hex 07)
-  \ecx       "control-x", where x is any ASCII character
+  \ecx       "control-x", where x is any printable ASCII character
   \ee        escape (hex 1B)
   \ef        form feed (hex 0C)
   \en        linefeed (hex 0A)
@@ -351,27 +352,40 @@
   \eo{ddd..} character with octal code ddd..
   \exhh      character with hex code hh
   \ex{hhh..} character with hex code hhh.. (default mode)
-  \euhhhh    character with hex code hhhh (only when PCRE2_ALT_BSUX is set)
+  \euhhhh    character with hex code hhhh (when PCRE2_ALT_BSUX is set)
 .sp
 The precise effect of \ecx on ASCII characters is as follows: if x is a lower
 case letter, it is converted to upper case. Then bit 6 of the character (hex
 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
 but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
-code unit following \ec has a value greater than 127, a compile-time error
-occurs. This locks out non-ASCII characters in all modes.
+code unit following \ec has a value less than 32 or greater than 126, a
+compile-time error occurs. This locks out non-printable ASCII characters in all
+modes.
 .P
-The \ec facility was designed for use with ASCII characters, but with the
-extension to Unicode it is even less useful than it once was. It is, however,
-recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
-bytes. In this mode, all values are valid after \ec. If the next character is a
-lower case letter, it is converted to upper case. Then the 0xc0 bits of the
-byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
-the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
-characters also generate different values.
+When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
+generate the appropriate EBCDIC code values. The \ec escape is processed
+as specified for Perl in the \fBperlebcdic\fP document. The only characters
+that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
+other character provokes a compile-time error. The sequence \ec@ encodes
+character code 0; the letters (in either case) encode characters 1-26 (hex 01
+to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
+\ec? becomes either 255 (hex FF) or 95 (hex 5F).
 .P
+Thus, apart from \ec?, these escapes generate the same character code values as
+they do in an ASCII environment, though the meanings of the values mostly
+differ. For example, \ecG always generates code value 7, which is BEL in ASCII
+but DEL in EBCDIC.
+.P
+The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
+because 127 is not a control character in EBCDIC, Perl makes it generate the
+APC character. Unfortunately, there are several variants of EBCDIC. In most of
+them the APC character has the value 255 (hex FF), but in the one Perl calls
+POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
+values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
+.P
 After \e0 up to two further octal digits are read. If there are fewer than two
-digits, just those that are present are used. Thus the sequence \e0\ex\e07
-specifies two binary zeros followed by a BEL character (code value 7). Make
+digits, just those that are present are used. Thus the sequence \e0\ex\e015
+specifies two binary zeros followed by a CR character (code value 13). Make
 sure you supply two digits after the initial zero if the pattern character that
 follows is itself an octal digit.
 .P
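
The rule for \0 in the documentation hunk above (at most two further octal
digits are read) can be illustrated with a short C sketch. The function name
read_octal_after_zero and its pointer-advancing interface are assumptions made
for this example; they are not taken from the PCRE2 source.

```c
/* Hypothetical helper, not a PCRE2 function. On entry *pp points just past
   the "\0" of the escape. At most two further octal digits are consumed.
   Returns the resulting code value and leaves *pp at the first unread
   character. */
static int read_octal_after_zero(const char **pp)
{
  const char *p = *pp;
  int value = 0;
  int count = 0;

  while (count < 2 && *p >= '0' && *p <= '7')
    {
    value = value * 8 + (*p++ - '0');
    count++;
    }

  *pp = p;
  return value;
}
```

Given the text "15" after \0, this sketch returns 13 (CR). Given "7" followed
by a non-digit, it returns 7 and stops at the non-digit, which is why two
digits are needed when the next pattern character is itself an octal digit.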
@@ -3347,6 +3361,6 @@
 .rs
 .sp
 .nf
-Last updated: 19 May 2015
+Last updated: 13 June 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi
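
As an illustration of the ASCII rule for \cx described in the pcre2pattern.3
text above, here is a minimal sketch in C. The function name
ascii_control_escape and its -1 error return are assumptions made for this
example; in PCRE2 itself the error is reported through an error code.

```c
/* Hypothetical helper, not a PCRE2 function. Returns the code value that
   \cx stands for in an ASCII or Unicode environment, or -1 if x is not a
   printable ASCII character (less than 32 or greater than 126). */
static int ascii_control_escape(int x)
{
  if (x < 32 || x > 126) return -1;      /* Not printable ASCII */
  if (x >= 'a' && x <= 'z') x -= 32;     /* Upper-case a lower case letter */
  return x ^ 0x40;                       /* Invert the hex 40 bit */
}
```

With this sketch, ascii_control_escape('A') gives hex 01,
ascii_control_escape('{') gives hex 3B, and ascii_control_escape(';') gives
hex 7B, matching the values quoted in the documentation.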


Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/doc/pcre2syntax.3    2015-06-13 16:10:14 UTC (rev 285)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -22,8 +22,10 @@
 .SH "ESCAPED CHARACTERS"
 .rs
 .sp
+This table applies to ASCII and Unicode environments.
+.sp
   \ea         alarm, that is, the BEL character (hex 07)
-  \ecx        "control-x", where x is any ASCII character
+  \ecx        "control-x", where x is any ASCII printing character
   \ee         escape (hex 1B)
   \ef         form feed (hex 0C)
   \en         newline (hex 0A)
@@ -47,7 +49,8 @@
 .\" HREF
 \fBpcre2pattern\fP
 .\"
-documentation.
+documentation, where details of escape processing in EBCDIC environments are 
+also given.
 .P
 When \ex is not followed by {, from zero to two hexadecimal digits are read,
 but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@@ -567,6 +570,6 @@
 .rs
 .sp
 .nf
-Last updated: 23 April 2015
+Last updated: 13 June 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi


Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/src/pcre2_compile.c    2015-06-13 16:10:14 UTC (rev 285)
@@ -268,8 +268,9 @@
 in UTF-8 mode. It runs from '0' to 'z'. */


 #ifndef EBCDIC
-#define ESCAPES_FIRST  CHAR_0
-#define ESCAPES_LAST   CHAR_z
+#define ESCAPES_FIRST       CHAR_0
+#define ESCAPES_LAST        CHAR_z
+#define ESCAPES_UPPER_CASE  (-32)    /* Add this to upper case a letter */


 static const short int escapes[] = {
      0,                       0,
@@ -319,12 +320,14 @@
 is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
 because it is defined as 'a', which of course picks up the ASCII value. */


-#if 'a' == 0x81                 /* Check for a real EBCDIC environment */
-#define ESCAPES_FIRST  CHAR_a
-#define ESCAPES_LAST   CHAR_9
-#else                           /* Testing in an ASCII environment */
+#if 'a' == 0x81                    /* Check for a real EBCDIC environment */
+#define ESCAPES_FIRST       CHAR_a
+#define ESCAPES_LAST        CHAR_9
+#define ESCAPES_UPPER_CASE  (+64)  /* Add this to upper case a letter */
+#else                              /* Testing in an ASCII environment */
 #define ESCAPES_FIRST  ((unsigned char)'\x81')   /* EBCDIC 'a' */
 #define ESCAPES_LAST   ((unsigned char)'\xf9')   /* EBCDIC '9' */
+#define ESCAPES_UPPER_CASE  (-32)  /* Add this to upper case a letter */
 #endif


 static const short int escapes[] = {
@@ -346,6 +349,11 @@
 /*  F8 */     0,     0
 };


+/* We also need a table of characters that may follow \c in an EBCDIC
+environment for characters 0-31. */
+
+static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
+
 #endif /* EBCDIC */
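
The two ESCAPES_UPPER_CASE offsets defined above (-32 for ASCII, +64 for a real
EBCDIC environment) can be checked with a small sketch that works on explicit
code values, so it behaves the same on any host. The function names are
hypothetical. The EBCDIC letter ranges used here are the usual EBCDIC layout,
in which 'a' is hex 81, 'A' is hex C1 and 'Z' is hex E9, as quoted elsewhere in
this commit. Unlike the one-line range test in the patch, this sketch leaves
the non-letter code values that sit between the EBCDIC letter blocks unchanged.

```c
/* Hypothetical helpers, not PCRE2 functions. Both take and return raw code
   values rather than character constants. */

/* ASCII: lower case letters are hex 61 to 7A; adding -32 upper-cases them. */
static unsigned char upper_case_ascii(unsigned char c)
{
  if (c >= 0x61 && c <= 0x7A) return (unsigned char)(c - 32);
  return c;
}

/* EBCDIC: lower case letters occupy three separate blocks (a-i, j-r, s-z);
   adding +64 upper-cases them. */
static unsigned char upper_case_ebcdic(unsigned char c)
{
  if ((c >= 0x81 && c <= 0x89) ||
      (c >= 0x91 && c <= 0x99) ||
      (c >= 0xA2 && c <= 0xA9))
    return (unsigned char)(c + 64);
  return c;
}
```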


@@ -1238,7 +1246,7 @@
 PCRE2_SPTR ccode;

 c = *code;
-
+
 /* Skip over forward assertions; the other assertions are skipped by
 first_significant_code() with a TRUE final argument. */

@@ -2076,18 +2084,36 @@
       }       /* End of Perl-style \x handling */
     break;


-    /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
-    An error is given if the byte following \c is not a printable ASCII
-    character. This coding is ASCII-specific, but then the whole concept of \cx
-    is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
+    /* The handling of \c is different in ASCII and EBCDIC environments. In an
+    ASCII (or Unicode) environment, an error is given if the character
+    following \c is not a printable ASCII character. Otherwise, the following
+    character is upper-cased if it is a letter, and after that the 0x40 bit is
+    flipped. The result is the value of the escape.


+    In an EBCDIC environment the handling of \c is compatible with the
+    specification in the perlebcdic document. The following character must be
+    a letter or one of a small number of special characters. These provide a
+    means of defining the character values 0-31.
+
+    For testing the EBCDIC handling of \c in an ASCII environment, recognize
+    the EBCDIC value of 'c' explicitly. */
+
+#if defined EBCDIC && 'a' != 0x81
+    case 0x83:
+#else
     case CHAR_c:
+#endif
+
     c = *(++ptr);
+    if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
     if (c == CHAR_NULL && ptr >= cb->end_pattern)
       {
       *errorcodeptr = ERR2;
       break;
       }
+
+    /* Handle \c in an ASCII/Unicode environment. */
+
 #ifndef EBCDIC    /* ASCII/UTF-8 coding */
     if (c < 32 || c > 126)  /* Excludes all non-printable ASCII */
       {
@@ -2094,12 +2120,26 @@
       *errorcodeptr = ERR68;
       break;
       }
-    if (c >= CHAR_a && c <= CHAR_z) c -= 32;
     c ^= 0x40;
-#else             /* EBCDIC coding */
-    if (c >= CHAR_a && c <= CHAR_z) c += 64;
-    c ^= 0xC0;
-#endif
+
+    /* Handle \c in an EBCDIC environment. The special case \c? is converted to
+    255 (0xff) or 95 (0x5f) if other characters suggest we are using the
+    POSIX-BC encoding. (This is the way Perl indicates that it handles \c?.)
+    The other valid sequences correspond to a list of specific characters. */
+
+#else
+    if (c == CHAR_QUESTION_MARK)
+      c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
+    else
+      {
+      for (i = 0; i < 32; i++)
+        {
+        if (c == ebcdic_escape_c[i]) break;
+        }
+      if (i < 32) c = i; else *errorcodeptr = ERR68;
+      }
+#endif  /* EBCDIC */
+
     break;


     /* Any other alphanumeric following \ is an error. Perl gives an error only
@@ -6492,7 +6532,7 @@
               goto FAILED;
               }
             recno = recno * 10 + *ptr++ - CHAR_0;
-            } 
+            }


           if (*ptr != (PCRE2_UCHAR)terminator)
             {
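
The EBCDIC branch in the hunk above amounts to a table lookup: the position of
the character in a 32-entry list is the resulting code value. A standalone
sketch of that idea follows. The function name perl_style_control_escape, the
apc_value parameter and the -1 error return are assumptions made for this
example; the patch itself chooses the \c? value with a compile-time test on
character constants and reports failure through ERR68. The sketch uses
toupper() in place of the ESCAPES_UPPER_CASE offset.

```c
#include <ctype.h>

/* The 32 characters that may follow \c, in code-value order (0 to 31). The
   contents match ebcdic_escape_c in the patch. */
static const char control_names[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";

/* Hypothetical helper, not a PCRE2 function. Returns the code value for \cx
   under the perlebcdic rules, or -1 if x is not allowed after \c. apc_value
   is the value to use for \c? (255 in most EBCDIC variants, 95 in
   POSIX-BC). */
static int perl_style_control_escape(unsigned char x, int apc_value)
{
  int i;

  if (x == '?') return apc_value;
  x = (unsigned char)toupper(x);         /* Letters are accepted in either case */

  for (i = 0; i < 32; i++)               /* Never compares the terminating zero */
    {
    if (x == (unsigned char)control_names[i]) return i;
    }

  return -1;                             /* Any other character is an error */
}
```

For example, perl_style_control_escape('b', 255) gives 2,
perl_style_control_escape('[', 255) gives 27, and
perl_style_control_escape('&', 255) gives -1, which corresponds to the
/\\x83&/ test that is expected to fail.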


Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/src/pcre2_error.c    2015-06-13 16:10:14 UTC (rev 285)
@@ -145,7 +145,11 @@
   "different names for subpatterns of the same number are not allowed\0"
   "(*MARK) must have an argument\0"
   "non-hex character in \\x{} (closing brace missing?)\0"
+#ifndef EBCDIC   
   "\\c must be followed by a printable ASCII character\0"
+#else   
+  "\\c must be followed by a letter or one of [\\]^_?\0"
+#endif
   "\\k is not followed by a braced, angle-bracketed, or quoted name\0"
   /* 70 */
   "internal error: unknown opcode in find_fixedlength()\0"


Modified: code/trunk/testdata/testinputEBC
===================================================================
--- code/trunk/testdata/testinputEBC    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/testdata/testinputEBC    2015-06-13 16:10:14 UTC (rev 285)
@@ -1,5 +1,5 @@
 # This is a specialized test for checking, when PCRE2 is compiled with the
-# EBCDIC option but in an ASCII environment, that newline and white space
+# EBCDIC option but in an ASCII environment, that newline, white space, and \c
 # functionality is working. It catches cases where explicit values such as 0x0a
 # have been used instead of names like CHAR_LF. Needless to say, it is not a
 # genuine EBCDIC test! In patterns, alphabetic characters that follow a
@@ -117,5 +117,18 @@
     A\x25B
     A\x0bB
     A\x0cB
+    
+# Test \c functionality 
+    
+/\\x83@\\x83A\\x83b\\x83C\\x83d\\x83E\\x83f\\x83G\\x83h\\x83I\\x83J\\x83K\\x83l\\x83m\\x83N\\x83O\\x83p\\x83q\\x83r\\x83S\\x83T\\x83u\\x83V\\x83W\\x83X\\x83y\\x83Z/
+    \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f


+/\\x83[\\x83\\\x83]\\x83^\\x83_/
+    \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+    
+/\\x83?/
+    A\xffB
+
+/\\x83&/
+
 # End


Modified: code/trunk/testdata/testoutputEBC
===================================================================
--- code/trunk/testdata/testoutputEBC    2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/testdata/testoutputEBC    2015-06-13 16:10:14 UTC (rev 285)
@@ -1,5 +1,5 @@
 # This is a specialized test for checking, when PCRE2 is compiled with the
-# EBCDIC option but in an ASCII environment, that newline and white space
+# EBCDIC option but in an ASCII environment, that newline, white space, and \c
 # functionality is working. It catches cases where explicit values such as 0x0a
 # have been used instead of names like CHAR_LF. Needless to say, it is not a
 # genuine EBCDIC test! In patterns, alphabetic characters that follow a
@@ -178,5 +178,22 @@
 No match
     A\x0cB
 No match
+    
+# Test \c functionality 
+    
+/\\x83@\\x83A\\x83b\\x83C\\x83d\\x83E\\x83f\\x83G\\x83h\\x83I\\x83J\\x83K\\x83l\\x83m\\x83N\\x83O\\x83p\\x83q\\x83r\\x83S\\x83T\\x83u\\x83V\\x83W\\x83X\\x83y\\x83Z/
+    \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+ 0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a


+/\\x83[\\x83\\\x83]\\x83^\\x83_/
+    \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+ 0: \x1b\x1c\x1d\x1e\x1f
+    
+/\\x83?/
+    A\xffB
+ 0: \xff
+
+/\\x83&/
+Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
+
 # End