Revision: 1387
http://vcs.pcre.org/viewvc?view=rev&revision=1387
Author: ph10
Date: 2013-11-02 18:29:05 +0000 (Sat, 02 Nov 2013)
Log Message:
-----------
Update POSIX class handling in UCP mode.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcrepattern.3
code/trunk/pcre_compile.c
code/trunk/pcre_internal.h
code/trunk/pcre_printint.c
code/trunk/pcre_xclass.c
code/trunk/testdata/testinput6
code/trunk/testdata/testinput7
code/trunk/testdata/testoutput6
code/trunk/testdata/testoutput7
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/ChangeLog 2013-11-02 18:29:05 UTC (rev 1387)
@@ -147,6 +147,10 @@
properties for \w, \d, etc) is present in a test regex. Otherwise if the
test contains no characters greater than 255, Perl doesn't realise it
should be using Unicode semantics.
+
+31. Upgraded the handling of the POSIX classes [:graph:], [:print:], and
+ [:punct:] when PCRE_UCP is set so as to include the same characters as Perl
+ does in Unicode mode.
Version 8.33 28-May-2013
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/doc/pcrepattern.3 2013-11-02 18:29:05 UTC (rev 1387)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "12 October 2013" "PCRE 8.34"
+.TH PCREPATTERN 3 "02 November 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -925,9 +925,9 @@
.sp
As well as the standard Unicode properties described above, PCRE supports four
more that make it possible to convert traditional escape sequences such as \ew
-and \es and POSIX character classes to use Unicode properties. PCRE uses these
-non-standard, non-Perl properties internally when PCRE_UCP is set. However,
-they may also be used explicitly. These properties are:
+and \es to use Unicode properties. PCRE uses these non-standard, non-Perl
+properties internally when PCRE_UCP is set. However, they may also be used
+explicitly. These properties are:
.sp
Xan Any alphanumeric character
Xps Any POSIX space character
@@ -937,8 +937,9 @@
Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
-Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
-same characters as Xan, plus underscore.
+Xsp is the same as Xps; it used to exclude vertical tab, for Perl
+compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
+matches the same characters as Xan, plus underscore.
.P
There is another non-standard property, Xuc, which matches any character that
can be represented by a Universal Character Name in C++ and other programming
@@ -1332,8 +1333,8 @@
By default, in UTF modes, characters with values greater than 128 do not match
any of the POSIX character classes. However, if the PCRE_UCP option is passed
to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing the POSIX classes
-by other sequences, as follows:
+character properties are used. This is achieved by replacing certain POSIX
+classes by other sequences, as follows:
.sp
[:alnum:] becomes \ep{Xan}
[:alpha:] becomes \ep{L}
@@ -1344,9 +1345,30 @@
[:upper:] becomes \ep{Lu}
[:word:] becomes \ep{Xwd}
.sp
-Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX
-classes are unchanged, and match only characters with code points less than
-128.
+Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX
+classes are handled specially in UCP mode:
+.TP 10
+[:graph:]
+This matches characters that have glyphs that mark the page when printed. In
+Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
+properties, except for:
+.sp
+ U+061C Arabic Letter Mark
+ U+180E Mongolian Vowel Separator
+ U+2066 - U+2069 Various "isolate"s
+.sp
+.TP 10
+[:print:]
+This matches the same characters as [:graph:] plus space characters that are
+not controls, that is, characters with the Zs property.
+.TP 10
+[:punct:]
+This matches all characters that have the Unicode P (punctuation) property,
+plus those characters whose code points are less than 128 that have the S
+(Symbol) property.
+.P
+The other POSIX classes are unchanged, and match only characters with code
+points less than 128.
.
.
.SH "VERTICAL BAR"
@@ -3176,6 +3198,6 @@
.rs
.sp
.nf
-Last updated: 12 October 2013
+Last updated: 02 November 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/pcre_compile.c 2013-11-02 18:29:05 UTC (rev 1387)
@@ -264,7 +264,8 @@
now all in a single string, to reduce the number of relocations when a shared
library is dynamically loaded. The list of lengths is terminated by a zero
length entry. The first three must be alpha, lower, upper, as this is assumed
-for handling case independence. */
+for handling case independence. The indices for graph, print, and punct are
+needed, so identify them. */
static const char posix_names[] =
STRING_alpha0 STRING_lower0 STRING_upper0 STRING_alnum0
@@ -275,6 +276,11 @@
static const pcre_uint8 posix_name_lengths[] = {
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 6, 0 };
+#define PC_GRAPH 8
+#define PC_PRINT 9
+#define PC_PUNCT 10
+
+
/* Table of class bit maps for each POSIX class. Each class is formed from a
base map, with an optional addition or removal of another map. Then, for some
classes, there is some additional tweaking: for [:blank:] the vertical space
@@ -302,9 +308,8 @@
cbit_xdigit,-1, 0 /* xdigit */
};
-/* Table of substitutes for \d etc when PCRE_UCP is set. The POSIX class
-substitutes must be in the order of the names, defined above, and there are
-both positive and negative cases. NULL means no substitute. */
+/* Table of substitutes for \d etc when PCRE_UCP is set. They are replaced by
+Unicode property escapes. */
#ifdef SUPPORT_UCP
static const pcre_uchar string_PNd[] = {
@@ -329,12 +334,18 @@
static const pcre_uchar *substitutes[] = {
string_PNd, /* \D */
string_pNd, /* \d */
- string_PXsp, /* \S */ /* NOTE: Xsp is Perl space */
- string_pXsp, /* \s */
+ string_PXsp, /* \S */ /* Xsp is Perl space, but from 8.34, Perl */
+ string_pXsp, /* \s */ /* space and POSIX space are the same. */
string_PXwd, /* \W */
string_pXwd /* \w */
};
+/* The POSIX class substitutes must be in the order of the POSIX class names,
+defined above, and there are both positive and negative cases. NULL means no
+general substitute of a Unicode property escape (\p or \P). However, for some
+POSIX classes (e.g. graph, print, punct) a special property code is compiled
+directly. */
+
static const pcre_uchar string_pL[] = {
CHAR_BACKSLASH, CHAR_p, CHAR_LEFT_CURLY_BRACKET,
CHAR_L, CHAR_RIGHT_CURLY_BRACKET, '\0' };
@@ -382,8 +393,8 @@
NULL, /* graph */
NULL, /* print */
NULL, /* punct */
- string_pXps, /* space */ /* NOTE: Xps is POSIX space */
- string_pXwd, /* word */
+ string_pXps, /* space */ /* Xps is POSIX space, but from 8.34 */
+ string_pXwd, /* word */ /* Perl and POSIX space are the same */
NULL, /* xdigit */
/* Negated cases */
string_PL, /* ^alpha */
@@ -397,8 +408,8 @@
NULL, /* ^graph */
NULL, /* ^print */
NULL, /* ^punct */
- string_PXps, /* ^space */ /* NOTE: Xps is POSIX space */
- string_PXwd, /* ^word */
+ string_PXps, /* ^space */ /* Xps is POSIX space, but from 8.34 */
+ string_PXwd, /* ^word */ /* Perl and POSIX space are the same */
NULL /* ^xdigit */
};
#define POSIX_SUBSIZE (sizeof(posix_substitutes) / sizeof(pcre_uchar *))
@@ -2973,7 +2984,6 @@
case OP_CLASS:
#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
case OP_XCLASS:
-
if (c == OP_XCLASS)
end = code + GET(code, 0) - 1;
else
@@ -4830,24 +4840,58 @@
posix_class = 0;
/* When PCRE_UCP is set, some of the POSIX classes are converted to
- different escape sequences that use Unicode properties. */
+ different escape sequences that use Unicode properties \p or \P. Others
+ that are not available via \p or \P generate XCL_PROP/XCL_NOTPROP
+ directly. */
#ifdef SUPPORT_UCP
if ((options & PCRE_UCP) != 0)
{
+ unsigned int ptype = 0;
int pc = posix_class + ((local_negate)? POSIX_SUBSIZE/2 : 0);
+
+ /* The posix_substitutes table specifies which POSIX classes can be
+ converted to \p or \P items. */
+
if (posix_substitutes[pc] != NULL)
{
nestptr = tempptr + 1;
ptr = posix_substitutes[pc] - 1;
continue;
}
+
+ /* There are three other classes that generate special property calls
+ that are recognized only in an XCLASS. */
+
+ else switch(posix_class)
+ {
+ case PC_GRAPH:
+ ptype = PT_PXGRAPH;
+ /* Fall through */
+ case PC_PRINT:
+ if (ptype == 0) ptype = PT_PXPRINT;
+ /* Fall through */
+ case PC_PUNCT:
+ if (ptype == 0) ptype = PT_PXPUNCT;
+ *class_uchardata++ = local_negate? XCL_NOTPROP : XCL_PROP;
+ *class_uchardata++ = ptype;
+ *class_uchardata++ = 0;
+ ptr = tempptr + 1;
+ continue;
+
+ /* For all other POSIX classes, no special action is taken in UCP
+ mode. Fall through to the non_UCP case. */
+
+ default:
+ break;
+ }
}
#endif
- /* In the non-UCP case, we build the bit map for the POSIX class in a
- chunk of local store because we may be adding and subtracting from it,
- and we don't want to subtract bits that may be in the main map already.
- At the end we or the result into the bit map that is being built. */
+ /* In the non-UCP case, or when UCP makes no difference, we build the
+ bit map for the POSIX class in a chunk of local store because we may be
+ adding and subtracting from it, and we don't want to subtract bits that
+ may be in the main map already. At the end we or the result into the
+ bit map that is being built. */
posix_class *= 3;
@@ -6136,20 +6180,20 @@
len = (int)(code - tempcode);
if (len > 0)
- {
+ {
unsigned int repcode = *tempcode;
-
+
/* There is a table for possessifying opcodes, all of which are less
than OP_CALLOUT. A zero entry means there is no possessified version.
*/
-
+
if (repcode < OP_CALLOUT && opcode_possessify[repcode] > 0)
*tempcode = opcode_possessify[repcode];
-
+
/* For opcode without a special possessified version, wrap the item in
ONCE brackets. Because we are moving code along, we must ensure that any
pending recursive references are updated. */
-
+
else
{
*code = OP_END;
@@ -6162,7 +6206,7 @@
PUTINC(code, 0, len);
PUT(tempcode, 1, len);
}
- }
+ }
#ifdef NEVER
if (len > 0) switch (*tempcode)
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/pcre_internal.h 2013-11-02 18:29:05 UTC (rev 1387)
@@ -1855,6 +1855,16 @@
#define PT_UCNC 10 /* Universal Character nameable character */
#define PT_TABSIZE 11 /* Size of square table for autopossessify tests */
+/* The following special properties are used only in XCLASS items, when POSIX
+classes are specified and PCRE_UCP is set - in other words, for Unicode
+handling of these classes. They are not available via the \p or \P escapes like
+those in the above list, and so they do not take part in the autopossessifying
+table. */
+
+#define PT_PXGRAPH 11 /* [:graph:] - characters that mark the paper */
+#define PT_PXPRINT 12 /* [:print:] - [:graph:] plus non-control spaces */
+#define PT_PXPUNCT 13 /* [:punct:] - punctuation characters */
+
/* Flag bits and data types for the extended class (OP_XCLASS) for classes that
contain characters with values greater than 255. */
@@ -1868,9 +1878,9 @@
#define XCL_NOTPROP 4 /* Unicode inverted property (ditto) */
/* These are escaped items that aren't just an encoding of a particular data
-value such as \n. They must have non-zero values, as check_escape() returns
-0 for a data character. Also, they must appear in the same order as in the opcode
-definitions below, up to ESC_z. There's a dummy for OP_ALLANY because it
+value such as \n. They must have non-zero values, as check_escape() returns 0
+for a data character. Also, they must appear in the same order as in the
+opcode definitions below, up to ESC_z. There's a dummy for OP_ALLANY because it
corresponds to "." in DOTALL mode rather than an escape sequence. It is also
used for [^] in JavaScript compatibility mode, and for \C in non-utf mode. In
non-DOTALL mode, "." behaves like \N.
Modified: code/trunk/pcre_printint.c
===================================================================
--- code/trunk/pcre_printint.c 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/pcre_printint.c 2013-11-02 18:29:05 UTC (rev 1387)
@@ -633,9 +633,9 @@
print_prop(f, code, " ", "");
break;
- /* OP_XCLASS can only occur in UTF or PCRE16 modes. However, there's no
- harm in having this code always here, and it makes it less messy without
- all those #ifdefs. */
+ /* OP_XCLASS cannot occur in 8-bit, non-UTF mode. However, there's no harm
+ in having this code always here, and it makes it less messy without all
+ those #ifdefs. */
case OP_CLASS:
case OP_NCLASS:
@@ -696,27 +696,52 @@
pcre_uchar ch;
while ((ch = *ccode++) != XCL_END)
{
- if (ch == XCL_PROP)
+ BOOL not = FALSE;
+ const char *notch = "";
+
+ switch(ch)
{
- unsigned int ptype = *ccode++;
- unsigned int pvalue = *ccode++;
- fprintf(f, "\\p{%s}", get_ucpname(ptype, pvalue));
- }
- else if (ch == XCL_NOTPROP)
- {
- unsigned int ptype = *ccode++;
- unsigned int pvalue = *ccode++;
- fprintf(f, "\\P{%s}", get_ucpname(ptype, pvalue));
- }
- else
- {
+ case XCL_NOTPROP:
+ not = TRUE;
+ notch = "^";
+ /* Fall through */
+
+ case XCL_PROP:
+ {
+ unsigned int ptype = *ccode++;
+ unsigned int pvalue = *ccode++;
+
+ switch(ptype)
+ {
+ case PT_PXGRAPH:
+ fprintf(f, "[:%sgraph:]", notch);
+ break;
+
+ case PT_PXPRINT:
+ fprintf(f, "[:%sprint:]", notch);
+ break;
+
+ case PT_PXPUNCT:
+ fprintf(f, "[:%spunct:]", notch);
+ break;
+
+ default:
+ fprintf(f, "\\%c{%s}", (not? 'P':'p'),
+ get_ucpname(ptype, pvalue));
+ break;
+ }
+ }
+ break;
+
+ default:
ccode += 1 + print_char(f, ccode, utf);
if (ch == XCL_RANGE)
{
fprintf(f, "-");
ccode += 1 + print_char(f, ccode, utf);
}
- }
+ break;
+ }
}
}
Modified: code/trunk/pcre_xclass.c
===================================================================
--- code/trunk/pcre_xclass.c 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/pcre_xclass.c 2013-11-02 18:29:05 UTC (rev 1387)
@@ -128,34 +128,35 @@
else /* XCL_PROP & XCL_NOTPROP */
{
const ucd_record *prop = GET_UCD(c);
+ BOOL isprop = t == XCL_PROP;
switch(*data)
{
case PT_ANY:
- if (t == XCL_PROP) return !negated;
+ if (isprop) return !negated;
break;
case PT_LAMP:
if ((prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
- prop->chartype == ucp_Lt) == (t == XCL_PROP)) return !negated;
+ prop->chartype == ucp_Lt) == isprop) return !negated;
break;
case PT_GC:
- if ((data[1] == PRIV(ucp_gentype)[prop->chartype]) == (t == XCL_PROP))
+ if ((data[1] == PRIV(ucp_gentype)[prop->chartype]) == isprop)
return !negated;
break;
case PT_PC:
- if ((data[1] == prop->chartype) == (t == XCL_PROP)) return !negated;
+ if ((data[1] == prop->chartype) == isprop) return !negated;
break;
case PT_SC:
- if ((data[1] == prop->script) == (t == XCL_PROP)) return !negated;
+ if ((data[1] == prop->script) == isprop) return !negated;
break;
case PT_ALNUM:
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
- PRIV(ucp_gentype)[prop->chartype] == ucp_N) == (t == XCL_PROP))
+ PRIV(ucp_gentype)[prop->chartype] == ucp_N) == isprop)
return !negated;
break;
@@ -169,11 +170,11 @@
{
HSPACE_CASES:
VSPACE_CASES:
- if (t == XCL_PROP) return !negated;
+ if (isprop) return !negated;
break;
default:
- if ((PRIV(ucp_gentype)[prop->chartype] == ucp_Z) == (t == XCL_PROP))
+ if ((PRIV(ucp_gentype)[prop->chartype] == ucp_Z) == isprop)
return !negated;
break;
}
@@ -182,7 +183,7 @@
case PT_WORD:
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
PRIV(ucp_gentype)[prop->chartype] == ucp_N || c == CHAR_UNDERSCORE)
- == (t == XCL_PROP))
+ == isprop)
return !negated;
break;
@@ -190,16 +191,60 @@
if (c < 0xa0)
{
if ((c == CHAR_DOLLAR_SIGN || c == CHAR_COMMERCIAL_AT ||
- c == CHAR_GRAVE_ACCENT) == (t == XCL_PROP))
+ c == CHAR_GRAVE_ACCENT) == isprop)
return !negated;
}
else
{
- if ((c < 0xd800 || c > 0xdfff) == (t == XCL_PROP))
+ if ((c < 0xd800 || c > 0xdfff) == isprop)
return !negated;
}
break;
+
+ /* The following three properties can occur only in an XCLASS, as there
+ is no \p or \P coding for them. */
+ /* Graphic character. Implement this as not Z (space or separator) and
+ not C (other), except for Cf (format) with a few exceptions. This seems
+ to be what Perl does. The exceptional characters are:
+
+ U+061C Arabic Letter Mark
+ U+180E Mongolian Vowel Separator
+ U+2066 - U+2069 Various "isolate"s
+ */
+
+ case PT_PXGRAPH:
+ if ((PRIV(ucp_gentype)[prop->chartype] != ucp_Z &&
+ (PRIV(ucp_gentype)[prop->chartype] != ucp_C ||
+ (prop->chartype == ucp_Cf &&
+ c != 0x061c && c != 0x180e && (c < 0x2066 || c > 0x2069))
+ )) == isprop)
+ return !negated;
+ break;
+
+ /* Printable character: same as graphic, with the addition of Zs, i.e.
+ not Zl and not Zp, and U+180E. */
+
+ case PT_PXPRINT:
+ if ((prop->chartype != ucp_Zl &&
+ prop->chartype != ucp_Zp &&
+ (PRIV(ucp_gentype)[prop->chartype] != ucp_C ||
+ (prop->chartype == ucp_Cf &&
+ c != 0x061c && (c < 0x2066 || c > 0x2069))
+ )) == isprop)
+ return !negated;
+ break;
+
+ /* Punctuation: all Unicode punctuation, plus ASCII characters that
+ Unicode treats as symbols rather than punctuation, for Perl
+ compatibility (these are $+<=>^`|~). */
+
+ case PT_PXPUNCT:
+ if ((PRIV(ucp_gentype)[prop->chartype] == ucp_P ||
+ (c < 256 && PRIV(ucp_gentype)[prop->chartype] == ucp_S)) == isprop)
+ return !negated;
+ break;
+
/* This should never occur, but compilers may mutter if there is no
default. */
Modified: code/trunk/testdata/testinput6
===================================================================
--- code/trunk/testdata/testinput6 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/testdata/testinput6 2013-11-02 18:29:05 UTC (rev 1387)
@@ -1298,9 +1298,7 @@
/\x{1f80}+/8i
\x{1f88}\x{1f80}
-/\d+\s{0,5}=\s*\S?=\w{0,4}\W*/8WBZ
-
/-- Perl 5.12.4 gets these wrong, but 5.15.3 is OK --/
/\x{004b}+/8i
@@ -1338,12 +1336,150 @@
A\x{2005}Z
A\x{85}\x{180e}\x{2005}Z
-/\D+\X \d+\X \S+\X \s+\X \W+\X \w+\X \C+\X \R+\X \H+\X \h+\X \V+\X \v+\X a+\X \n+\X .+\X/BZx
+/^[[:graph:]]+$/8W
+ Letter:ABC
+ Mark:\x{300}\x{1d172}\x{1d17b}
+ Number:9\x{660}
+ Punctuation:\x{66a},;
+ Symbol:\x{6de}<>\x{fffc}
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ \x{feff}
+ \x{fff9}\x{fffa}\x{fffb}
+ \x{110bd}
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ \x{e0001}
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+ ** Failers
+ \x{09}
+ \x{0a}
+ \x{1D}
+ \x{20}
+ \x{85}
+ \x{a0}
+ \x{61c}
+ \x{1680}
+ \x{180e}
+ \x{2028}
+ \x{2029}
+ \x{202f}
+ \x{2065}
+ \x{2066}
+ \x{2067}
+ \x{2068}
+ \x{2069}
+ \x{3000}
+ \x{e0002}
+ \x{e001f}
+ \x{e0080}
-/.+\X/BZxs
+/^[[:print:]]+$/8W
+ Space: \x{a0}
+ \x{1680}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}
+ \x{2006}\x{2007}\x{2008}\x{2009}\x{200a}
+ \x{202f}\x{205f}
+ \x{3000}
+ Letter:ABC
+ Mark:\x{300}\x{1d172}\x{1d17b}
+ Number:9\x{660}
+ Punctuation:\x{66a},;
+ Symbol:\x{6de}<>\x{fffc}
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ \x{180e}
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ \x{202f}
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ \x{feff}
+ \x{fff9}\x{fffa}\x{fffb}
+ \x{110bd}
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ \x{e0001}
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+ ** Failers
+ \x{09}
+ \x{1D}
+ \x{85}
+ \x{61c}
+ \x{2028}
+ \x{2029}
+ \x{2065}
+ \x{2066}
+ \x{2067}
+ \x{2068}
+ \x{2069}
+ \x{e0002}
+ \x{e001f}
+ \x{e0080}
-/\X+$/BZxm
+/^[[:punct:]]+$/8W
+ \$+<=>^`|~
+ !\"#%&'()*,-./:;?@[\\]_{}
+ \x{a1}\x{a7}
+ \x{37e}
+ ** Failers
+ abcde
-/\X+\D \X+\d \X+\S \X+\s \X+\W \X+\w \X+. \X+\C \X+\R \X+\H \X+\h \X+\V \X+\v \X+\X \X+\Z \X+\z \X+$/BZx
+/^[[:^graph:]]+$/8W
+ \x{09}\x{0a}\x{1D}\x{20}\x{85}\x{a0}\x{61c}\x{1680}\x{180e}
+ \x{2028}\x{2029}\x{202f}\x{2065}\x{2066}\x{2067}\x{2068}\x{2069}
+ \x{3000}\x{e0002}\x{e001f}\x{e0080}
+ ** Failers
+ Letter:ABC
+ Mark:\x{300}\x{1d172}\x{1d17b}
+ Number:9\x{660}
+ Punctuation:\x{66a},;
+ Symbol:\x{6de}<>\x{fffc}
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ \x{feff}
+ \x{fff9}\x{fffa}\x{fffb}
+ \x{110bd}
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ \x{e0001}
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+/^[[:^print:]]+$/8W
+ \x{09}\x{1D}\x{85}\x{61c}\x{2028}\x{2029}\x{2065}\x{2066}\x{2067}
+ \x{2068}\x{2069}\x{e0002}\x{e001f}\x{e0080}
+ ** Failers
+ Space: \x{a0}
+ \x{1680}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}
+ \x{2006}\x{2007}\x{2008}\x{2009}\x{200a}
+ \x{202f}\x{205f}
+ \x{3000}
+ Letter:ABC
+ Mark:\x{300}\x{1d172}\x{1d17b}
+ Number:9\x{660}
+ Punctuation:\x{66a},;
+ Symbol:\x{6de}<>\x{fffc}
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ \x{180e}
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ \x{202f}
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ \x{feff}
+ \x{fff9}\x{fffa}\x{fffb}
+ \x{110bd}
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ \x{e0001}
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+
+/^[[:^punct:]]+$/8W
+ abcde
+ ** Failers
+ \$+<=>^`|~
+ !\"#%&'()*,-./:;?@[\\]_{}
+ \x{a1}\x{a7}
+ \x{37e}
+
/-- End of testinput6 --/
Modified: code/trunk/testdata/testinput7
===================================================================
--- code/trunk/testdata/testinput7 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/testdata/testinput7 2013-11-02 18:29:05 UTC (rev 1387)
@@ -819,4 +819,14 @@
/[\p{L}ab]{2,3}+/BZO
+/\D+\X \d+\X \S+\X \s+\X \W+\X \w+\X \C+\X \R+\X \H+\X \h+\X \V+\X \v+\X a+\X \n+\X .+\X/BZx
+
+/.+\X/BZxs
+
+/\X+$/BZxm
+
+/\X+\D \X+\d \X+\S \X+\s \X+\W \X+\w \X+. \X+\C \X+\R \X+\H \X+\h \X+\V \X+\v \X+\X \X+\Z \X+\z \X+$/BZx
+
+/\d+\s{0,5}=\s*\S?=\w{0,4}\W*/8WBZ
+
/-- End of testinput7 --/
Modified: code/trunk/testdata/testoutput6
===================================================================
--- code/trunk/testdata/testoutput6 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/testdata/testoutput6 2013-11-02 18:29:05 UTC (rev 1387)
@@ -1330,15 +1330,15 @@
/^[[:graph:]]*/8W
A\x{a1}\x{a0}
- 0: A
+ 0: A\x{a1}
/^[[:print:]]*/8W
A z\x{a0}\x{a1}
- 0: A z
+ 0: A z\x{a0}\x{a1}
/^[[:punct:]]*/8W
.+\x{a1}\x{a0}
- 0: .+
+ 0: .+\x{a1}
/\p{Zs}*?\R/
** Failers
@@ -2111,22 +2111,7 @@
\x{1f88}\x{1f80}
0: \x{1f88}\x{1f80}
-/\d+\s{0,5}=\s*\S?=\w{0,4}\W*/8WBZ
-------------------------------------------------------------------
- Bra
- prop Nd ++
- prop Xsp {0,5}+
- =
- prop Xsp *+
- notprop Xsp ?
- =
- prop Xwd {0,4}+
- notprop Xwd *+
- Ket
- End
-------------------------------------------------------------------
-
/-- Perl 5.12.4 gets these wrong, but 5.15.3 is OK --/
/\x{004b}+/8i
@@ -2178,100 +2163,284 @@
A\x{85}\x{180e}\x{2005}Z
0: A\x{85}\x{180e}\x{2005}Z
-/\D+\X \d+\X \S+\X \s+\X \W+\X \w+\X \C+\X \R+\X \H+\X \h+\X \V+\X \v+\X a+\X \n+\X .+\X/BZx
-------------------------------------------------------------------
- Bra
- \D+
- extuni
- \d+
- extuni
- \S+
- extuni
- \s+
- extuni
- \W+
- extuni
- \w+
- extuni
- AllAny+
- extuni
- \R+
- extuni
- \H+
- extuni
- \h+
- extuni
- \V+
- extuni
- \v+
- extuni
- a+
- extuni
- \x0a+
- extuni
- Any+
- extuni
- Ket
- End
-------------------------------------------------------------------
+/^[[:graph:]]+$/8W
+ Letter:ABC
+ 0: Letter:ABC
+ Mark:\x{300}\x{1d172}\x{1d17b}
+ 0: Mark:\x{300}\x{1d172}\x{1d17b}
+ Number:9\x{660}
+ 0: Number:9\x{660}
+ Punctuation:\x{66a},;
+ 0: Punctuation:\x{66a},;
+ Symbol:\x{6de}<>\x{fffc}
+ 0: Symbol:\x{6de}<>\x{fffc}
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ 0: Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ 0: \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ 0: \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ 0: \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ 0: \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ \x{feff}
+ 0: \x{feff}
+ \x{fff9}\x{fffa}\x{fffb}
+ 0: \x{fff9}\x{fffa}\x{fffb}
+ \x{110bd}
+ 0: \x{110bd}
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ 0: \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ \x{e0001}
+ 0: \x{e0001}
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+ 0: \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+ ** Failers
+No match
+ \x{09}
+No match
+ \x{0a}
+No match
+ \x{1D}
+No match
+ \x{20}
+No match
+ \x{85}
+No match
+ \x{a0}
+No match
+ \x{61c}
+No match
+ \x{1680}
+No match
+ \x{180e}
+No match
+ \x{2028}
+No match
+ \x{2029}
+No match
+ \x{202f}
+No match
+ \x{2065}
+No match
+ \x{2066}
+No match
+ \x{2067}
+No match
+ \x{2068}
+No match
+ \x{2069}
+No match
+ \x{3000}
+No match
+ \x{e0002}
+No match
+ \x{e001f}
+No match
+ \x{e0080}
+No match
-/.+\X/BZxs
-------------------------------------------------------------------
- Bra
- AllAny+
- extuni
- Ket
- End
-------------------------------------------------------------------
+/^[[:print:]]+$/8W
+ Space: \x{a0}
+ 0: Space: \x{a0}
+ \x{1680}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}
+ 0: \x{1680}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}
+ \x{2006}\x{2007}\x{2008}\x{2009}\x{200a}
+ 0: \x{2006}\x{2007}\x{2008}\x{2009}\x{200a}
+ \x{202f}\x{205f}
+ 0: \x{202f}\x{205f}
+ \x{3000}
+ 0: \x{3000}
+ Letter:ABC
+ 0: Letter:ABC
+ Mark:\x{300}\x{1d172}\x{1d17b}
+ 0: Mark:\x{300}\x{1d172}\x{1d17b}
+ Number:9\x{660}
+ 0: Number:9\x{660}
+ Punctuation:\x{66a},;
+ 0: Punctuation:\x{66a},;
+ Symbol:\x{6de}<>\x{fffc}
+ 0: Symbol:\x{6de}<>\x{fffc}
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ 0: Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+ \x{180e}
+ 0: \x{180e}
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ 0: \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ 0: \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+ \x{202f}
+ 0: \x{202f}
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ 0: \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ 0: \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+ \x{feff}
+ 0: \x{feff}
+ \x{fff9}\x{fffa}\x{fffb}
+ 0: \x{fff9}\x{fffa}\x{fffb}
+ \x{110bd}
+ 0: \x{110bd}
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ 0: \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+ \x{e0001}
+ 0: \x{e0001}
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+ 0: \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+ ** Failers
+ 0: ** Failers
+ \x{09}
+No match
+ \x{1D}
+No match
+ \x{85}
+No match
+ \x{61c}
+No match
+ \x{2028}
+No match
+ \x{2029}
+No match
+ \x{2065}
+No match
+ \x{2066}
+No match
+ \x{2067}
+No match
+ \x{2068}
+No match
+ \x{2069}
+No match
+ \x{e0002}
+No match
+ \x{e001f}
+No match
+ \x{e0080}
+No match
-/\X+$/BZxm
-------------------------------------------------------------------
- Bra
- extuni+
- /m $
- Ket
- End
-------------------------------------------------------------------
+/^[[:punct:]]+$/8W
+ \$+<=>^`|~
+ 0: $+<=>^`|~
+ !\"#%&'()*,-./:;?@[\\]_{}
+ 0: !"#%&'()*,-./:;?@[\]_{}
+ \x{a1}\x{a7}
+ 0: \x{a1}\x{a7}
+ \x{37e}
+ 0: \x{37e}
+ ** Failers
+No match
+ abcde
+No match
-/\X+\D \X+\d \X+\S \X+\s \X+\W \X+\w \X+. \X+\C \X+\R \X+\H \X+\h \X+\V \X+\v \X+\X \X+\Z \X+\z \X+$/BZx
-------------------------------------------------------------------
- Bra
- extuni+
- \D
- extuni+
- \d
- extuni+
- \S
- extuni+
- \s
- extuni+
- \W
- extuni+
- \w
- extuni+
- Any
- extuni+
- AllAny
- extuni+
- \R
- extuni+
- \H
- extuni+
- \h
- extuni+
- \V
- extuni+
- \v
- extuni+
- extuni
- extuni+
- \Z
- extuni++
- \z
- extuni+
- $
- Ket
- End
-------------------------------------------------------------------
+/^[[:^graph:]]+$/8W
+ \x{09}\x{0a}\x{1D}\x{20}\x{85}\x{a0}\x{61c}\x{1680}\x{180e}
+ 0: \x{09}\x{0a}\x{1d} \x{85}\x{a0}\x{61c}\x{1680}\x{180e}
+ \x{2028}\x{2029}\x{202f}\x{2065}\x{2066}\x{2067}\x{2068}\x{2069}
+ 0: \x{2028}\x{2029}\x{202f}\x{2065}\x{2066}\x{2067}\x{2068}\x{2069}
+ \x{3000}\x{e0002}\x{e001f}\x{e0080}
+ 0: \x{3000}\x{e0002}\x{e001f}\x{e0080}
+ ** Failers
+No match
+ Letter:ABC
+No match
+ Mark:\x{300}\x{1d172}\x{1d17b}
+No match
+ Number:9\x{660}
+No match
+ Punctuation:\x{66a},;
+No match
+ Symbol:\x{6de}<>\x{fffc}
+No match
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+No match
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+No match
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+No match
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+No match
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+No match
+ \x{feff}
+No match
+ \x{fff9}\x{fffa}\x{fffb}
+No match
+ \x{110bd}
+No match
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+No match
+ \x{e0001}
+No match
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+No match
+/^[[:^print:]]+$/8W
+ \x{09}\x{1D}\x{85}\x{61c}\x{2028}\x{2029}\x{2065}\x{2066}\x{2067}
+ 0: \x{09}\x{1d}\x{85}\x{61c}\x{2028}\x{2029}\x{2065}\x{2066}\x{2067}
+ \x{2068}\x{2069}\x{e0002}\x{e001f}\x{e0080}
+ 0: \x{2068}\x{2069}\x{e0002}\x{e001f}\x{e0080}
+ ** Failers
+No match
+ Space: \x{a0}
+No match
+ \x{1680}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}
+No match
+ \x{2006}\x{2007}\x{2008}\x{2009}\x{200a}
+No match
+ \x{202f}\x{205f}
+No match
+ \x{3000}
+No match
+ Letter:ABC
+No match
+ Mark:\x{300}\x{1d172}\x{1d17b}
+No match
+ Number:9\x{660}
+No match
+ Punctuation:\x{66a},;
+No match
+ Symbol:\x{6de}<>\x{fffc}
+No match
+ Cf-property:\x{ad}\x{600}\x{601}\x{602}\x{603}\x{604}\x{6dd}\x{70f}
+No match
+ \x{180e}
+No match
+ \x{200b}\x{200c}\x{200d}\x{200e}\x{200f}
+No match
+ \x{202a}\x{202b}\x{202c}\x{202d}\x{202e}
+No match
+ \x{202f}
+No match
+ \x{2060}\x{2061}\x{2062}\x{2063}\x{2064}
+No match
+ \x{206a}\x{206b}\x{206c}\x{206d}\x{206e}\x{206f}
+No match
+ \x{feff}
+No match
+ \x{fff9}\x{fffa}\x{fffb}
+No match
+ \x{110bd}
+No match
+ \x{1d173}\x{1d174}\x{1d175}\x{1d176}\x{1d177}\x{1d178}\x{1d179}\x{1d17a}
+No match
+ \x{e0001}
+No match
+ \x{e0020}\x{e0030}\x{e0040}\x{e0050}\x{e0060}\x{e0070}\x{e007f}
+No match
+
+/^[[:^punct:]]+$/8W
+ abcde
+ 0: abcde
+ ** Failers
+No match
+ \$+<=>^`|~
+No match
+ !\"#%&'()*,-./:;?@[\\]_{}
+No match
+ \x{a1}\x{a7}
+No match
+ \x{37e}
+No match
+
/-- End of testinput6 --/
Modified: code/trunk/testdata/testoutput7
===================================================================
--- code/trunk/testdata/testoutput7 2013-10-29 17:15:47 UTC (rev 1386)
+++ code/trunk/testdata/testoutput7 2013-11-02 18:29:05 UTC (rev 1387)
@@ -859,7 +859,7 @@
/[[:graph:]]/WBZ
------------------------------------------------------------------
Bra
- [!-~]
+ [[:graph:]]
Ket
End
------------------------------------------------------------------
@@ -867,7 +867,7 @@
/[[:print:]]/WBZ
------------------------------------------------------------------
Bra
- [ -~]
+ [[:print:]]
Ket
End
------------------------------------------------------------------
@@ -875,7 +875,7 @@
/[[:punct:]]/WBZ
------------------------------------------------------------------
Bra
- [!-/:-@[-`{-~]
+ [[:punct:]]
Ket
End
------------------------------------------------------------------
@@ -2152,4 +2152,115 @@
End
------------------------------------------------------------------
+/\D+\X \d+\X \S+\X \s+\X \W+\X \w+\X \C+\X \R+\X \H+\X \h+\X \V+\X \v+\X a+\X \n+\X .+\X/BZx
+------------------------------------------------------------------
+ Bra
+ \D+
+ extuni
+ \d+
+ extuni
+ \S+
+ extuni
+ \s+
+ extuni
+ \W+
+ extuni
+ \w+
+ extuni
+ AllAny+
+ extuni
+ \R+
+ extuni
+ \H+
+ extuni
+ \h+
+ extuni
+ \V+
+ extuni
+ \v+
+ extuni
+ a+
+ extuni
+ \x0a+
+ extuni
+ Any+
+ extuni
+ Ket
+ End
+------------------------------------------------------------------
+
+/.+\X/BZxs
+------------------------------------------------------------------
+ Bra
+ AllAny+
+ extuni
+ Ket
+ End
+------------------------------------------------------------------
+
+/\X+$/BZxm
+------------------------------------------------------------------
+ Bra
+ extuni+
+ /m $
+ Ket
+ End
+------------------------------------------------------------------
+
+/\X+\D \X+\d \X+\S \X+\s \X+\W \X+\w \X+. \X+\C \X+\R \X+\H \X+\h \X+\V \X+\v \X+\X \X+\Z \X+\z \X+$/BZx
+------------------------------------------------------------------
+ Bra
+ extuni+
+ \D
+ extuni+
+ \d
+ extuni+
+ \S
+ extuni+
+ \s
+ extuni+
+ \W
+ extuni+
+ \w
+ extuni+
+ Any
+ extuni+
+ AllAny
+ extuni+
+ \R
+ extuni+
+ \H
+ extuni+
+ \h
+ extuni+
+ \V
+ extuni+
+ \v
+ extuni+
+ extuni
+ extuni+
+ \Z
+ extuni++
+ \z
+ extuni+
+ $
+ Ket
+ End
+------------------------------------------------------------------
+
+/\d+\s{0,5}=\s*\S?=\w{0,4}\W*/8WBZ
+------------------------------------------------------------------
+ Bra
+ prop Nd ++
+ prop Xsp {0,5}+
+ =
+ prop Xsp *+
+ notprop Xsp ?
+ =
+ prop Xwd {0,4}+
+ notprop Xwd *+
+ Ket
+ End
+------------------------------------------------------------------
+
/-- End of testinput7 --/
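
Note on the pcre_compile.c hunk at -4830 above: the following is a minimal
standalone sketch of the fall-through switch that picks the special property
code for [:graph:], [:print:], and [:punct:] in UCP mode. The PC_* and PT_PX*
values are copied from the diff. The helper names select_px_ptype() and
px_class_name() are hypothetical, chosen for illustration only, and are not
part of PCRE.

```c
#include <stddef.h>

/* Values copied from the pcre_compile.c and pcre_internal.h hunks. */
#define PC_GRAPH    8
#define PC_PRINT    9
#define PC_PUNCT   10

#define PT_PXGRAPH 11
#define PT_PXPRINT 12
#define PT_PXPUNCT 13

/* Hypothetical helper (not in PCRE): return the special property code for a
POSIX class index, or 0 when the class has no special UCP handling. */
static unsigned int select_px_ptype(int posix_class)
{
unsigned int ptype = 0;
switch (posix_class)
  {
  case PC_GRAPH:
  ptype = PT_PXGRAPH;
  /* Fall through */
  case PC_PRINT:
  if (ptype == 0) ptype = PT_PXPRINT;
  /* Fall through */
  case PC_PUNCT:
  if (ptype == 0) ptype = PT_PXPUNCT;
  break;

  default:      /* All other classes keep their non-UCP bit maps */
  break;
  }
return ptype;
}

/* Hypothetical helper (not in PCRE), loosely mirroring the new cases in
pcre_printint.c: return the POSIX class name used in the debug listing, or
NULL for an ordinary \p or \P property. */
static const char *px_class_name(unsigned int ptype)
{
switch (ptype)
  {
  case PT_PXGRAPH: return "graph";
  case PT_PXPRINT: return "print";
  case PT_PXPUNCT: return "punct";
  default:         return NULL;
  }
}
```

Because ptype starts at 0 and each later case assigns only while it is still
0, entering the switch at PC_GRAPH yields PT_PXGRAPH, at PC_PRINT yields
PT_PXPRINT, and at PC_PUNCT yields PT_PXPUNCT, so the three cases can share
the code that emits the XCL_PROP or XCL_NOTPROP item.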