[Pcre-svn] [1219] code/trunk: Support (*UTF) in all librarie…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1219] code/trunk: Support (*UTF) in all libraries.
Revision: 1219
          http://vcs.pcre.org/viewvc?view=rev&revision=1219
Author:   ph10
Date:     2012-11-11 18:04:37 +0000 (Sun, 11 Nov 2012)


Log Message:
-----------
Support (*UTF) in all libraries.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcre.3
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcresyntax.3
    code/trunk/doc/pcreunicode.3
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/testdata/testinput15
    code/trunk/testdata/testinput18
    code/trunk/testdata/testoutput15
    code/trunk/testdata/testoutput18-16
    code/trunk/testdata/testoutput18-32


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/ChangeLog    2012-11-11 18:04:37 UTC (rev 1219)
@@ -159,6 +159,8 @@
 34. If PCRE is built with --enable-valgrind, certain memory regions are marked
     unaddressable using valgrind annotations, allowing valgrind to detect 
     invalid memory accesses. This is mainly for the benefit of the developers.
+    
+25. (*UTF) can now be used to start a pattern in any of the three libraries. 



Version 8.31 06-July-2012

Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcre.3    2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCRE 3 "30 October 2012" "PCRE 8.32"
+.TH PCRE 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH INTRODUCTION
@@ -114,10 +114,12 @@
 arbitrary patterns for compilation, you should be aware of a feature that
 allows users to turn on UTF support from within a pattern, provided that PCRE
 was built with UTF support. For example, an 8-bit pattern that begins with
-"(*UTF8)" turns on UTF-8 mode. This causes both the pattern and any data
-against which it is matched to be checked for UTF-8 validity. If the data
-string is very long, such a check might use sufficiently many resources as to
-cause your application to lose performance.
+"(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and 
+subjects as strings of UTF-8 characters instead of individual 8-bit characters.
+This causes both the pattern and any data against which it is matched to be
+checked for UTF-8 validity. If the data string is very long, such a check might
+use sufficiently many resources as to cause your application to lose
+performance.
 .P
 The best way of guarding against this possibility is to use the
 \fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
@@ -195,6 +197,6 @@
 .rs
 .sp
 .nf
-Last updated: 30 October 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcrepattern.3    2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "07 November 2012" "PCRE 8.32"
+.TH PCREPATTERN 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -22,17 +22,19 @@
 .P
 The original operation of PCRE was on strings of one-byte characters. However,
 there is now also support for UTF-8 strings in the original library, an
-extra library that supports 16-bit and UTF-16 character strings, and an
-extra library that supports 32-bit and UTF-32 character strings. To use these
+extra library that supports 16-bit and UTF-16 character strings, and a
+third library that supports 32-bit and UTF-32 character strings. To use these
 features, PCRE must be built to include appropriate support. When using UTF
 strings you must either call the compiling function with the PCRE_UTF8,
-PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
 these special sequences:
 .sp
   (*UTF8)
   (*UTF16)
   (*UTF32)
+  (*UTF) 
 .sp
+(*UTF) is a generic sequence that can be used with any of the libraries.
 Starting a pattern with such a sequence is equivalent to setting the relevant
 option. This feature is not Perl-compatible. How setting a UTF mode affects
 pattern matching is mentioned in several places below. There is also a summary
@@ -43,7 +45,7 @@
 page.
 .P
 Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) or (*UTF32) is:
+combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
 .sp
   (*UCP)
 .sp
@@ -573,10 +575,10 @@
 .sp
   (*ANY)(*BSR_ANYCRLF)
 .sp
-They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
-sequences. Inside a character class, \eR is treated as an unrecognized escape
-sequence, and so matches the letter "R" by default, but causes an error if
-PCRE_EXTRA is set.
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
+(*UCP) special sequences. Inside a character class, \eR is treated as an
+unrecognized escape sequence, and so matches the letter "R" by default, but
+causes an error if PCRE_EXTRA is set.
 .
 .
 .\" HTML <a name="uniextseq"></a>
@@ -1349,10 +1351,11 @@
 .\" </a>
 "Newline sequences"
 .\"
-above. There are also the (*UTF8), (*UTF16),(*UTF32) and (*UCP) leading
+above. There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading
 sequences that can be used to set UTF and Unicode property modes; they are
 equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
-options, respectively.
+options, respectively. The (*UTF) sequence is a generic version that can be
+used with any of the libraries.
 .
 .
 .\" HTML <a name="subpattern"></a>
@@ -2975,6 +2978,6 @@
 .rs
 .sp
 .nf
-Last updated: 07 November 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcresyntax.3    2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "24 June 2012" "PCRE 8.30"
+.TH PCRESYNTAX 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -349,6 +349,7 @@
   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
+  (*UTF)          set appropriate UTF mode for the library in use
   (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
 .
 .
@@ -490,6 +491,6 @@
 .rs
 .sp
 .nf
-Last updated: 25 August 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcreunicode.3
===================================================================
--- code/trunk/doc/pcreunicode.3    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/doc/pcreunicode.3    2012-11-11 18:04:37 UTC (rev 1219)
@@ -1,4 +1,4 @@
-.TH PCREUNICODE 3 "08 November 2012" "PCRE 8.32"
+.TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
@@ -18,9 +18,9 @@
 \fBpcre_compile()\fP
 .\"
 with the PCRE_UTF8 option flag, or the pattern must start with the sequence
-(*UTF8). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF-8 strings instead of
-strings of individual 1-byte characters.
+(*UTF8) or (*UTF). When either of these is the case, both the pattern and any
+subject strings that are matched against it are treated as UTF-8 strings
+instead of strings of individual 1-byte characters.
 .
 .
 .SH "UTF-16 AND UTF-32 SUPPORT"
@@ -36,10 +36,11 @@
 \fBpcre32_compile()\fP
 .\"
 with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively,
-the pattern must start with the sequence (*UTF16) or (*UTF32). When UTF mode is 
-set, both the pattern and any subject strings that are matched against it are
-treated as UTF-16 or UTF-32 strings instead of strings of individual 16-bit or
-32-bit characters.
+the pattern must start with the sequence (*UTF16), (*UTF32), as appropriate, or
+(*UTF), which can be used with either library. When UTF mode is set, both the
+pattern and any subject strings that are matched against it are treated as
+UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit
+characters.
 .
 .
 .SH "UTF SUPPORT OVERHEAD"
@@ -249,6 +250,6 @@
 .rs
 .sp
 .nf
-Last updated: 08 November 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/pcre_compile.c    2012-11-11 18:04:37 UTC (rev 1219)
@@ -7848,19 +7848,26 @@
   {
   int newnl = 0;
   int newbsr = 0;
+  
+/* For completeness and backward compatibility, (*UTFn) is supported in the
+relevant libraries, but (*UTF) is generic and always supported. Note that 
+PCRE_UTF8 == PCRE_UTF16 == PCRE_UTF32. */ 


 #ifdef COMPILE_PCRE8
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 5) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF8_RIGHTPAR, 5) == 0)
     { skipatstart += 7; options |= PCRE_UTF8; continue; }
 #endif
 #ifdef COMPILE_PCRE16
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF16_RIGHTPAR, 6) == 0)
     { skipatstart += 8; options |= PCRE_UTF16; continue; }
 #endif
 #ifdef COMPILE_PCRE32
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF32_RIGHTPAR, 6) == 0)
     { skipatstart += 8; options |= PCRE_UTF32; continue; }
 #endif
+
+  else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 4) == 0)
+    { skipatstart += 6; options |= PCRE_UTF8; continue; } 
   else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UCP_RIGHTPAR, 4) == 0)
     { skipatstart += 6; options |= PCRE_UCP; continue; }
   else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
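
The hunk above changes how leading "(*...)" sequences are recognised: each library still accepts its own width-specific name, and the generic (*UTF) is accepted everywhere. The following is a minimal standalone sketch of that idea, not the PCRE code itself. The function name scan_leading_options, the SKETCH_* bits, and the run-time width parameter are assumptions made for illustration; in PCRE the width is chosen at build time via COMPILE_PCRE8/16/32, and the real loop also handles newline and BSR sequences that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option bits for this sketch; these are not PCRE's values. */
#define SKETCH_UTF 0x1u
#define SKETCH_UCP 0x2u

/* Scan leading "(*NAME)" sequences and return how many characters to skip.
   width (8, 16 or 32) stands in for the library being built. */
static size_t scan_leading_options(const char *pattern, int width,
                                   unsigned *options)
{
  size_t skip = 0;
  assert(pattern != NULL && options != NULL);

  while (pattern[skip] == '(' && pattern[skip + 1] == '*')
    {
    const char *p = pattern + skip + 2;

    /* Width-specific names are only recognised by the matching library. */
    if (width == 8 && strncmp(p, "UTF8)", 5) == 0)
      { skip += 7; *options |= SKETCH_UTF; }
    else if (width == 16 && strncmp(p, "UTF16)", 6) == 0)
      { skip += 8; *options |= SKETCH_UTF; }
    else if (width == 32 && strncmp(p, "UTF32)", 6) == 0)
      { skip += 8; *options |= SKETCH_UTF; }

    /* The generic name is recognised regardless of width. */
    else if (strncmp(p, "UTF)", 4) == 0)
      { skip += 6; *options |= SKETCH_UTF; }
    else if (strncmp(p, "UCP)", 4) == 0)
      { skip += 6; *options |= SKETCH_UCP; }
    else break;   /* unknown sequence: leave it for the main compiler */
    }

  return skip;
}
```

With this sketch, "(*UTF)abc" sets the UTF bit for any width, while "(*UTF16)abc" scanned with width 32 is left untouched, which mirrors the "(*VERB) not recognized" results in the 32-bit test output.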


Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/pcre_internal.h    2012-11-11 18:04:37 UTC (rev 1219)
@@ -1528,15 +1528,10 @@
 #define STRING_ANYCRLF_RIGHTPAR        "ANYCRLF)"
 #define STRING_BSR_ANYCRLF_RIGHTPAR    "BSR_ANYCRLF)"
 #define STRING_BSR_UNICODE_RIGHTPAR    "BSR_UNICODE)"
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR            "UTF8)"
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR            "UTF16)"
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR            "UTF32)"
-#endif
+#define STRING_UTF8_RIGHTPAR           "UTF8)"
+#define STRING_UTF16_RIGHTPAR          "UTF16)"
+#define STRING_UTF32_RIGHTPAR          "UTF32)"
+#define STRING_UTF_RIGHTPAR            "UTF)"
 #define STRING_UCP_RIGHTPAR            "UCP)"
 #define STRING_NO_START_OPT_RIGHTPAR   "NO_START_OPT)"


@@ -1794,15 +1789,10 @@
 #define STRING_ANYCRLF_RIGHTPAR        STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
 #define STRING_BSR_ANYCRLF_RIGHTPAR    STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
 #define STRING_BSR_UNICODE_RIGHTPAR    STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
-#endif
+#define STRING_UTF8_RIGHTPAR           STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UTF16_RIGHTPAR          STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
+#define STRING_UTF32_RIGHTPAR          STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
+#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
 #define STRING_UCP_RIGHTPAR            STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
 #define STRING_NO_START_OPT_RIGHTPAR   STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS



Modified: code/trunk/testdata/testinput15
===================================================================
--- code/trunk/testdata/testinput15    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testinput15    2012-11-11 18:04:37 UTC (rev 1219)
@@ -327,7 +327,7 @@
 /(*UTF8)\x{1234}/
   abcd\x{1234}pqr


-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I

 /\h/SI8
     ABC\x{09}


Modified: code/trunk/testdata/testinput18
===================================================================
--- code/trunk/testdata/testinput18    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testinput18    2012-11-11 18:04:37 UTC (rev 1219)
@@ -174,6 +174,9 @@
 /(*UTF16)\x{11234}/
   abcd\x{11234}pqr


+/(*UTF)\x{11234}/I
+ abcd\x{11234}pqr
+
/(*UTF-32)\x{11234}/
abcd\x{11234}pqr


Modified: code/trunk/testdata/testoutput15
===================================================================
--- code/trunk/testdata/testoutput15    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testoutput15    2012-11-11 18:04:37 UTC (rev 1219)
@@ -922,7 +922,7 @@
   abcd\x{1234}pqr
  0: \x{1234}


-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
Capturing subpattern count = 0
Options: bsr_unicode utf
Forced newline sequence: CRLF

Modified: code/trunk/testdata/testoutput18-16
===================================================================
--- code/trunk/testdata/testoutput18-16    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testoutput18-16    2012-11-11 18:04:37 UTC (rev 1219)
@@ -641,6 +641,14 @@
   abcd\x{11234}pqr
  0: \x{11234}


+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{d804}
+Need char = \x{de34}
+ abcd\x{11234}pqr
+ 0: \x{11234}
+
/(*UTF-32)\x{11234}/
Failed: (*VERB) not recognized at offset 5
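
The 16-bit expected output above reports "First char = \x{d804}" and "Need char = \x{de34}" for a pattern containing U+11234, whereas the 32-bit output reports the code point itself. This is consistent with standard UTF-16 encoding, where a code point above U+FFFF is stored as a surrogate pair. The sketch below works through that arithmetic; the function name utf16_surrogates is an assumption for illustration and is not part of PCRE.

```c
#include <assert.h>
#include <stdint.h>

/* Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
   high and low surrogate code units. */
static void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
  assert(cp >= 0x10000u && cp <= 0x10FFFFu);
  cp -= 0x10000u;                              /* 20-bit offset */
  *high = (uint16_t)(0xD800u + (cp >> 10));    /* top 10 bits */
  *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* bottom 10 bits */
}
```

For U+11234 the offset is 0x1234, so the high surrogate is 0xD800 + 0x4 = 0xD804 and the low surrogate is 0xDC00 + 0x234 = 0xDE34, matching the two values in the test output.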


Modified: code/trunk/testdata/testoutput18-32
===================================================================
--- code/trunk/testdata/testoutput18-32    2012-11-11 17:27:22 UTC (rev 1218)
+++ code/trunk/testdata/testoutput18-32    2012-11-11 18:04:37 UTC (rev 1219)
@@ -638,6 +638,14 @@
 /(*UTF16)\x{11234}/
 Failed: (*VERB) not recognized at offset 5


+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{11234}
+No need char
+ abcd\x{11234}pqr
+ 0: \x{11234}
+
/(*UTF-32)\x{11234}/
Failed: (*VERB) not recognized at offset 5