Revision: 285
http://www.exim.org/viewvc/pcre2?view=rev&revision=285
Author: ph10
Date: 2015-06-13 17:10:14 +0100 (Sat, 13 Jun 2015)
Log Message:
-----------
Make \c operate like Perl in EBCDIC environments.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre2pattern.3
code/trunk/doc/pcre2syntax.3
code/trunk/src/pcre2_compile.c
code/trunk/src/pcre2_error.c
code/trunk/testdata/testinputEBC
code/trunk/testdata/testoutputEBC
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/ChangeLog 2015-06-13 16:10:14 UTC (rev 285)
@@ -161,7 +161,10 @@
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
instead of the EBCDIC value.
+42. The handling of \c in an EBCDIC environment has been revised so that it is
+now compatible with the specification in Perl's perlebcdic page.
+
Version 10.10 06-March-2015
---------------------------
Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/doc/pcre2pattern.3 2015-06-13 16:10:14 UTC (rev 285)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -337,10 +337,11 @@
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
-than the binary character it represents:
+than the binary character it represents. In an ASCII or Unicode environment,
+these escapes are as follows:
.sp
\ea alarm, that is, the BEL character (hex 07)
- \ecx "control-x", where x is any ASCII character
+ \ecx "control-x", where x is any printable ASCII character
\ee escape (hex 1B)
\ef form feed (hex 0C)
\en linefeed (hex 0A)
@@ -351,27 +352,40 @@
\eo{ddd..} character with octal code ddd..
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (default mode)
- \euhhhh character with hex code hhhh (only when PCRE2_ALT_BSUX is set)
+ \euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
-code unit following \ec has a value greater than 127, a compile-time error
-occurs. This locks out non-ASCII characters in all modes.
+code unit following \ec has a value less than 32 or greater than 126, a
+compile-time error occurs. This locks out non-printable ASCII characters in all
+modes.
.P
-The \ec facility was designed for use with ASCII characters, but with the
-extension to Unicode it is even less useful than it once was. It is, however,
-recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
-bytes. In this mode, all values are valid after \ec. If the next character is a
-lower case letter, it is converted to upper case. Then the 0xc0 bits of the
-byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
-the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
-characters also generate different values.
+When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
+generate the appropriate EBCDIC code values. The \ec escape is processed
+as specified for Perl in the \fBperlebcdic\fP document. The only characters
+that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
+other character provokes a compile-time error. The sequence \ec@ encodes
+character code 0; the letters (in either case) encode characters 1-26 (hex 01
+to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
+\ec? becomes either 255 (hex FF) or 95 (hex 5F).
.P
+Thus, apart from \ec?, these escapes generate the same character code values as
+they do in an ASCII environment, though the meanings of the values mostly
+differ. For example, \ecG always generates code value 7, which is BEL in ASCII
+but DEL in EBCDIC.
+.P
+The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
+because 127 is not a control character in EBCDIC, Perl makes it generate the
+APC character. Unfortunately, there are several variants of EBCDIC. In most of
+them the APC character has the value 255 (hex FF), but in the one Perl calls
+POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
+values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
+.P
After \e0 up to two further octal digits are read. If there are fewer than two
-digits, just those that are present are used. Thus the sequence \e0\ex\e07
-specifies two binary zeros followed by a BEL character (code value 7). Make
+digits, just those that are present are used. Thus the sequence \e0\ex\e015
+specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
.P
@@ -3347,6 +3361,6 @@
.rs
.sp
.nf
-Last updated: 19 May 2015
+Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/doc/pcre2syntax.3 2015-06-13 16:10:14 UTC (rev 285)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -22,8 +22,10 @@
.SH "ESCAPED CHARACTERS"
.rs
.sp
+This table applies to ASCII and Unicode environments.
+.sp
\ea alarm, that is, the BEL character (hex 07)
- \ecx "control-x", where x is any ASCII character
+ \ecx "control-x", where x is any ASCII printing character
\ee escape (hex 1B)
\ef form feed (hex 0C)
\en newline (hex 0A)
@@ -47,7 +49,8 @@
.\" HREF
\fBpcre2pattern\fP
.\"
-documentation.
+documentation, where details of escape processing in EBCDIC environments are
+also given.
.P
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@@ -567,6 +570,6 @@
.rs
.sp
.nf
-Last updated: 23 April 2015
+Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/src/pcre2_compile.c 2015-06-13 16:10:14 UTC (rev 285)
@@ -268,8 +268,9 @@
in UTF-8 mode. It runs from '0' to 'z'. */
#ifndef EBCDIC
-#define ESCAPES_FIRST CHAR_0
-#define ESCAPES_LAST CHAR_z
+#define ESCAPES_FIRST CHAR_0
+#define ESCAPES_LAST CHAR_z
+#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
static const short int escapes[] = {
0, 0,
@@ -319,12 +320,14 @@
is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
because it is defined as 'a', which of course picks up the ASCII value. */
-#if 'a' == 0x81 /* Check for a real EBCDIC environment */
-#define ESCAPES_FIRST CHAR_a
-#define ESCAPES_LAST CHAR_9
-#else /* Testing in an ASCII environment */
+#if 'a' == 0x81 /* Check for a real EBCDIC environment */
+#define ESCAPES_FIRST CHAR_a
+#define ESCAPES_LAST CHAR_9
+#define ESCAPES_UPPER_CASE (+64) /* Add this to upper case a letter */
+#else /* Testing in an ASCII environment */
#define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */
#define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */
+#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
#endif
static const short int escapes[] = {
@@ -346,6 +349,11 @@
/* F8 */ 0, 0
};
+/* We also need a table of characters that may follow \c in an EBCDIC
+environment for characters 0-31. */
+
+static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
+
#endif /* EBCDIC */
@@ -1238,7 +1246,7 @@
PCRE2_SPTR ccode;
c = *code;
-
+
/* Skip over forward assertions; the other assertions are skipped by
first_significant_code() with a TRUE final argument. */
@@ -2076,18 +2084,36 @@
} /* End of Perl-style \x handling */
break;
- /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
- An error is given if the byte following \c is not a printable ASCII
- character. This coding is ASCII-specific, but then the whole concept of \cx
- is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
+ /* The handling of \c is different in ASCII and EBCDIC environments. In an
+ ASCII (or Unicode) environment, an error is given if the character
+ following \c is not a printable ASCII character. Otherwise, the following
+ character is upper-cased if it is a letter, and after that the 0x40 bit is
+ flipped. The result is the value of the escape.
+ In an EBCDIC environment the handling of \c is compatible with the
+ specification in the perlebcdic document. The following character must be
+ a letter or one of a small number of special characters. These provide a
+ means of defining the character values 0-31.
+
+ For testing the EBCDIC handling of \c in an ASCII environment, recognize
+ the EBCDIC value of 'c' explicitly. */
+
+#if defined EBCDIC && 'a' != 0x81
+ case 0x83:
+#else
case CHAR_c:
+#endif
+
c = *(++ptr);
+ if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
if (c == CHAR_NULL && ptr >= cb->end_pattern)
{
*errorcodeptr = ERR2;
break;
}
+
+ /* Handle \c in an ASCII/Unicode environment. */
+
#ifndef EBCDIC /* ASCII/UTF-8 coding */
if (c < 32 || c > 126) /* Excludes all non-printable ASCII */
{
@@ -2094,12 +2120,26 @@
*errorcodeptr = ERR68;
break;
}
- if (c >= CHAR_a && c <= CHAR_z) c -= 32;
c ^= 0x40;
-#else /* EBCDIC coding */
- if (c >= CHAR_a && c <= CHAR_z) c += 64;
- c ^= 0xC0;
-#endif
+
+ /* Handle \c in an EBCDIC environment. The special case \c? is converted to
+ 255 (0xff) or 95 (0x5f) if other characters suggest we are using the POSIX-BC
+ encoding. (This is the way Perl indicates that it handles \c?.) The other
+ valid sequences correspond to a list of specific characters. */
+
+#else
+ if (c == CHAR_QUESTION_MARK)
+ c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
+ else
+ {
+ for (i = 0; i < 32; i++)
+ {
+ if (c == ebcdic_escape_c[i]) break;
+ }
+ if (i < 32) c = i; else *errorcodeptr = ERR68;
+ }
+#endif /* EBCDIC */
+
break;
/* Any other alphanumeric following \ is an error. Perl gives an error only
@@ -6492,7 +6532,7 @@
goto FAILED;
}
recno = recno * 10 + *ptr++ - CHAR_0;
- }
+ }
if (*ptr != (PCRE2_UCHAR)terminator)
{
Modified: code/trunk/src/pcre2_error.c
===================================================================
--- code/trunk/src/pcre2_error.c 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/src/pcre2_error.c 2015-06-13 16:10:14 UTC (rev 285)
@@ -145,7 +145,11 @@
"different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0"
"non-hex character in \\x{} (closing brace missing?)\0"
+#ifndef EBCDIC
"\\c must be followed by a printable ASCII character\0"
+#else
+ "\\c must be followed by a letter or one of [\\]^_?\0"
+#endif
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
/* 70 */
"internal error: unknown opcode in find_fixedlength()\0"
Modified: code/trunk/testdata/testinputEBC
===================================================================
--- code/trunk/testdata/testinputEBC 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/testdata/testinputEBC 2015-06-13 16:10:14 UTC (rev 285)
@@ -1,5 +1,5 @@
# This is a specialized test for checking, when PCRE2 is compiled with the
-# EBCDIC option but in an ASCII environment, that newline and white space
+# EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a
# have been used instead of names like CHAR_LF. Needless to say, it is not a
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
@@ -117,5 +117,18 @@
A\x25B
A\x0bB
A\x0cB
+
+# Test \c functionality
+
+/\\x83@\\x83A\\x83b\\x83C\\x83d\\x83E\\x83f\\x83G\\x83h\\x83I\\x83J\\x83K\\x83l\\x83m\\x83N\\x83O\\x83p\\x83q\\x83r\\x83S\\x83T\\x83u\\x83V\\x83W\\x83X\\x83y\\x83Z/
+ \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+/\\x83[\\x83\\\x83]\\x83^\\x83_/
+ \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+
+/\\x83?/
+ A\xffB
+
+/\\x83&/
+
# End
Modified: code/trunk/testdata/testoutputEBC
===================================================================
--- code/trunk/testdata/testoutputEBC 2015-06-12 16:25:23 UTC (rev 284)
+++ code/trunk/testdata/testoutputEBC 2015-06-13 16:10:14 UTC (rev 285)
@@ -1,5 +1,5 @@
# This is a specialized test for checking, when PCRE2 is compiled with the
-# EBCDIC option but in an ASCII environment, that newline and white space
+# EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a
# have been used instead of names like CHAR_LF. Needless to say, it is not a
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
@@ -178,5 +178,22 @@
No match
A\x0cB
No match
+
+# Test \c functionality
+
+/\\x83@\\x83A\\x83b\\x83C\\x83d\\x83E\\x83f\\x83G\\x83h\\x83I\\x83J\\x83K\\x83l\\x83m\\x83N\\x83O\\x83p\\x83q\\x83r\\x83S\\x83T\\x83u\\x83V\\x83W\\x83X\\x83y\\x83Z/
+ \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+ 0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a
+/\\x83[\\x83\\\x83]\\x83^\\x83_/
+ \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+ 0: \x1b\x1c\x1d\x1e\x1f
+
+/\\x83?/
+ A\xffB
+ 0: \xff
+
+/\\x83&/
+Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
+
# End
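
The two \c mappings described in this revision can be sketched in standalone C. This is an illustrative sketch only, not the PCRE2 code: ctrl_ascii and ctrl_ebcdic are hypothetical names, the posix_bc flag is an assumed stand-in for the compile-time character checks in pcre2_compile.c, and character literals are assumed to have ASCII values. A return value of -1 marks the cases where PCRE2 would give a compile-time error.

```c
#include <stddef.h>

/* Hypothetical helper (not PCRE2 API): value of \cx in an ASCII or
   Unicode environment, or -1 where a compile-time error would occur. */
int ctrl_ascii(int x)
{
  if (x < 32 || x > 126) return -1;     /* only printable ASCII is allowed */
  if (x >= 'a' && x <= 'z') x -= 32;    /* upper-case a letter */
  return x ^ 0x40;                      /* invert bit 6 (hex 40) */
}

/* Hypothetical helper (not PCRE2 API): value of \cx under the perlebcdic
   rules. The posix_bc argument is an assumption standing in for the
   check on character values that selects 95 rather than 255 for \c?. */
int ctrl_ebcdic(int x, int posix_bc)
{
  /* Position in this table is the resulting code value, 0-31. */
  static const char allowed[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
  size_t i;

  if (x >= 'a' && x <= 'z') x -= 32;    /* letters match in either case */
  if (x == '?') return posix_bc ? 0x5f : 0xff;
  for (i = 0; i < 32; i++)
    if (x == allowed[i]) return (int)i;
  return -1;                            /* any other character is an error */
}
```

With these definitions, ctrl_ascii(';') gives hex 7B and ctrl_ebcdic('G', 0) gives 7, the values quoted in the pcre2pattern.3 text, while ctrl_ebcdic('&', 0) returns -1, which corresponds to the error 168 case in testoutputEBC.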