Revision: 1219
http://vcs.pcre.org/viewvc?view=rev&revision=1219
Author: ph10
Date: 2012-11-11 18:04:37 +0000 (Sun, 11 Nov 2012)
Log Message:
-----------
Support (*UTF) in all libraries.
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre.3
code/trunk/doc/pcrepattern.3
code/trunk/doc/pcresyntax.3
code/trunk/doc/pcreunicode.3
code/trunk/pcre_compile.c
code/trunk/pcre_internal.h
code/trunk/testdata/testinput15
code/trunk/testdata/testinput18
code/trunk/testdata/testoutput15
code/trunk/testdata/testoutput18-16
code/trunk/testdata/testoutput18-32
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/ChangeLog 2012-11-11 18:04:37 UTC (rev 1219)
@@ -159,6 +159,8 @@
34. If PCRE is built with --enable-valgrind, certain memory regions are marked
unaddressable using valgrind annotations, allowing valgrind to detect
invalid memory accesses. This is mainly for the benefit of the developers.
+
+25. (*UTF) can now be used to start a pattern in any of the three libraries.
Version 8.31 06-July-2012
Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcre.3 2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCRE 3 "30 October 2012" "PCRE 8.32"
+.TH PCRE 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH INTRODUCTION
@@ -114,10 +114,12 @@
arbitrary patterns for compilation, you should be aware of a feature that
allows users to turn on UTF support from within a pattern, provided that PCRE
was built with UTF support. For example, an 8-bit pattern that begins with
-"(*UTF8)" turns on UTF-8 mode. This causes both the pattern and any data
-against which it is matched to be checked for UTF-8 validity. If the data
-string is very long, such a check might use sufficiently many resources as to
-cause your application to lose performance.
+"(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and
+subjects as strings of UTF-8 characters instead of individual 8-bit characters.
+This causes both the pattern and any data against which it is matched to be
+checked for UTF-8 validity. If the data string is very long, such a check might
+use sufficiently many resources as to cause your application to lose
+performance.
.P
The best way of guarding against this possibility is to use the
\fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
@@ -195,6 +197,6 @@
.rs
.sp
.nf
-Last updated: 30 October 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcrepattern.3 2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "07 November 2012" "PCRE 8.32"
+.TH PCREPATTERN 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -22,17 +22,19 @@
.P
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 strings in the original library, an
-extra library that supports 16-bit and UTF-16 character strings, and an
-extra library that supports 32-bit and UTF-32 character strings. To use these
+extra library that supports 16-bit and UTF-16 character strings, and a
+third library that supports 32-bit and UTF-32 character strings. To use these
features, PCRE must be built to include appropriate support. When using UTF
strings you must either call the compiling function with the PCRE_UTF8,
-PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
these special sequences:
.sp
(*UTF8)
(*UTF16)
(*UTF32)
+ (*UTF)
.sp
+(*UTF) is a generic sequence that can be used with any of the libraries.
Starting a pattern with such a sequence is equivalent to setting the relevant
option. This feature is not Perl-compatible. How setting a UTF mode affects
pattern matching is mentioned in several places below. There is also a summary
@@ -43,7 +45,7 @@
page.
.P
Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) or (*UTF32) is:
+combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
.sp
(*UCP)
.sp
@@ -573,10 +575,10 @@
.sp
(*ANY)(*BSR_ANYCRLF)
.sp
-They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
-sequences. Inside a character class, \eR is treated as an unrecognized escape
-sequence, and so matches the letter "R" by default, but causes an error if
-PCRE_EXTRA is set.
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
+(*UCP) special sequences. Inside a character class, \eR is treated as an
+unrecognized escape sequence, and so matches the letter "R" by default, but
+causes an error if PCRE_EXTRA is set.
.
.
.\" HTML <a name="uniextseq"></a>
@@ -1349,10 +1351,11 @@
.\" </a>
"Newline sequences"
.\"
-above. There are also the (*UTF8), (*UTF16),(*UTF32) and (*UCP) leading
+above. There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading
sequences that can be used to set UTF and Unicode property modes; they are
equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
-options, respectively.
+options, respectively. The (*UTF) sequence is a generic version that can be
+used with any of the libraries.
.
.
.\" HTML <a name="subpattern"></a>
@@ -2975,6 +2978,6 @@
.rs
.sp
.nf
-Last updated: 07 November 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcresyntax.3 2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "24 June 2012" "PCRE 8.30"
+.TH PCRESYNTAX 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -349,6 +349,7 @@
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
(*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
+ (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE_UCP (use Unicode properties for \ed etc)
.
.
@@ -490,6 +491,6 @@
.rs
.sp
.nf
-Last updated: 25 August 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/doc/pcreunicode.3
===================================================================
--- code/trunk/doc/pcreunicode.3 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcreunicode.3 2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCREUNICODE 3 "08 November 2012" "PCRE 8.32"
+.TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
@@ -18,9 +18,9 @@
\fBpcre_compile()\fP
.\"
with the PCRE_UTF8 option flag, or the pattern must start with the sequence
-(*UTF8). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF-8 strings instead of
-strings of individual 1-byte characters.
+(*UTF8) or (*UTF). When either of these is the case, both the pattern and any
+subject strings that are matched against it are treated as UTF-8 strings
+instead of strings of individual 1-byte characters.
.
.
.SH "UTF-16 AND UTF-32 SUPPORT"
@@ -36,10 +36,11 @@
\fBpcre32_compile()\fP
.\"
with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively,
-the pattern must start with the sequence (*UTF16) or (*UTF32). When UTF mode is
-set, both the pattern and any subject strings that are matched against it are
-treated as UTF-16 or UTF-32 strings instead of strings of individual 16-bit or
-32-bit characters.
+the pattern must start with the sequence (*UTF16), (*UTF32), as appropriate, or
+(*UTF), which can be used with either library. When UTF mode is set, both the
+pattern and any subject strings that are matched against it are treated as
+UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit
+characters.
.
.
.SH "UTF SUPPORT OVERHEAD"
@@ -249,6 +250,6 @@
.rs
.sp
.nf
-Last updated: 08 November 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/pcre_compile.c 2012-11-11 18:04:37 UTC (rev 1219)
@@ -7848,19 +7848,26 @@
{
int newnl = 0;
int newbsr = 0;
+
+/* For completeness and backward compatibility, (*UTFn) is supported in the
+relevant libraries, but (*UTF) is generic and always supported. Note that
+PCRE_UTF8 == PCRE_UTF16 == PCRE_UTF32. */
#ifdef COMPILE_PCRE8
- if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 5) == 0)
+ if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF8_RIGHTPAR, 5) == 0)
{ skipatstart += 7; options |= PCRE_UTF8; continue; }
#endif
#ifdef COMPILE_PCRE16
- if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+ if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF16_RIGHTPAR, 6) == 0)
{ skipatstart += 8; options |= PCRE_UTF16; continue; }
#endif
#ifdef COMPILE_PCRE32
- if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+ if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF32_RIGHTPAR, 6) == 0)
{ skipatstart += 8; options |= PCRE_UTF32; continue; }
#endif
+
+ else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 4) == 0)
+ { skipatstart += 6; options |= PCRE_UTF8; continue; }
else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UCP_RIGHTPAR, 4) == 0)
{ skipatstart += 6; options |= PCRE_UCP; continue; }
else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/pcre_internal.h 2012-11-11 18:04:37 UTC (rev 1219)
@@ -1528,15 +1528,10 @@
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR "UTF8)"
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR "UTF16)"
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR "UTF32)"
-#endif
+#define STRING_UTF8_RIGHTPAR "UTF8)"
+#define STRING_UTF16_RIGHTPAR "UTF16)"
+#define STRING_UTF32_RIGHTPAR "UTF32)"
+#define STRING_UTF_RIGHTPAR "UTF)"
#define STRING_UCP_RIGHTPAR "UCP)"
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
@@ -1794,15 +1789,10 @@
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
-#endif
+#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UTF16_RIGHTPAR STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
+#define STRING_UTF32_RIGHTPAR STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
+#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
Modified: code/trunk/testdata/testinput15
===================================================================
--- code/trunk/testdata/testinput15 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testinput15 2012-11-11 18:04:37 UTC (rev 1219)
@@ -327,7 +327,7 @@
/(*UTF8)\x{1234}/
abcd\x{1234}pqr
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
/\h/SI8
ABC\x{09}
Modified: code/trunk/testdata/testinput18
===================================================================
--- code/trunk/testdata/testinput18 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testinput18 2012-11-11 18:04:37 UTC (rev 1219)
@@ -174,6 +174,9 @@
/(*UTF16)\x{11234}/
abcd\x{11234}pqr
+/(*UTF)\x{11234}/I
+ abcd\x{11234}pqr
+
/(*UTF-32)\x{11234}/
abcd\x{11234}pqr
Modified: code/trunk/testdata/testoutput15
===================================================================
--- code/trunk/testdata/testoutput15 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testoutput15 2012-11-11 18:04:37 UTC (rev 1219)
@@ -922,7 +922,7 @@
abcd\x{1234}pqr
0: \x{1234}
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
Capturing subpattern count = 0
Options: bsr_unicode utf
Forced newline sequence: CRLF
Modified: code/trunk/testdata/testoutput18-16
===================================================================
--- code/trunk/testdata/testoutput18-16 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testoutput18-16 2012-11-11 18:04:37 UTC (rev 1219)
@@ -641,6 +641,14 @@
abcd\x{11234}pqr
0: \x{11234}
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{d804}
+Need char = \x{de34}
+ abcd\x{11234}pqr
+ 0: \x{11234}
+
/(*UTF-32)\x{11234}/
Failed: (*VERB) not recognized at offset 5
Modified: code/trunk/testdata/testoutput18-32
===================================================================
--- code/trunk/testdata/testoutput18-32 2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testoutput18-32 2012-11-11 18:04:37 UTC (rev 1219)
@@ -638,6 +638,14 @@
/(*UTF16)\x{11234}/
Failed: (*VERB) not recognized at offset 5
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{11234}
+No need char
+ abcd\x{11234}pqr
+ 0: \x{11234}
+
/(*UTF-32)\x{11234}/
Failed: (*VERB) not recognized at offset 5
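
Illustrative note (not part of the commit): the pcre_compile.c hunk above accepts a width-specific sequence only in the matching library, while (*UTF) is accepted by all three. The sketch below models that rule in a self-contained form. The function name scan_leading_sequences, the enum lib_width, and the OPT_* values are hypothetical stand-ins. The real code selects the library with COMPILE_PCRE8/16/32 at build time and compares through STRNCMP_UC_C8 rather than plain strncmp, so treat this as a simplified model under those assumptions, not as the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option bits; the values are illustrative only. As the comment
   in the diff notes, the three real UTF options share a single value, so one
   bit stands in for all of them here. */
#define OPT_UTF 0x0800u
#define OPT_UCP 0x2000u

/* Hypothetical run-time stand-in for the COMPILE_PCRE8/16/32 build macros. */
enum lib_width { LIB_8 = 8, LIB_16 = 16, LIB_32 = 32 };

/* Consume leading "(*...)" option sequences. Returns the number of characters
   skipped and ORs any recognised options into *options. */
size_t scan_leading_sequences(const char *pattern, enum lib_width width,
                              unsigned *options)
{
    /* The width-specific name is accepted only by the matching library. */
    const char *specific = (width == LIB_8)  ? "UTF8)"
                         : (width == LIB_16) ? "UTF16)"
                                             : "UTF32)";
    size_t speclen = strlen(specific);
    size_t skip = 0;

    assert(pattern != NULL && options != NULL);

    while (pattern[skip] == '(' && pattern[skip + 1] == '*') {
        const char *name = pattern + skip + 2;

        if (strncmp(name, specific, speclen) == 0) {
            skip += speclen + 2;          /* "(*" plus e.g. "UTF16)" */
            *options |= OPT_UTF;
        } else if (strncmp(name, "UTF)", 4) == 0) {
            skip += 6;                    /* generic form, any library */
            *options |= OPT_UTF;
        } else if (strncmp(name, "UCP)", 4) == 0) {
            skip += 6;
            *options |= OPT_UCP;
        } else {
            break;                        /* not a leading option sequence */
        }
    }
    return skip;
}
```

In this model, scanning "(*UTF16)x" with LIB_32 skips nothing and sets no option, which corresponds to the "(*VERB) not recognized" results in testoutput18-32, whereas "(*UTF)" is consumed for every width.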