[Pcre-svn] [1309] code/trunk: Implement PCRE_NEVER_UTF

Author: Subversion repository
To: pcre-svn
Revision: 1309
          http://vcs.pcre.org/viewvc?view=rev&revision=1309
Author:   ph10
Date:     2013-04-05 16:35:59 +0100 (Fri, 05 Apr 2013)


Log Message:
-----------
Implement PCRE_NEVER_UTF

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcretest.1
    code/trunk/pcre.h.in
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcreposix.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput15
    code/trunk/testdata/testinput18
    code/trunk/testdata/testoutput15
    code/trunk/testdata/testoutput18-16
    code/trunk/testdata/testoutput18-32


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/ChangeLog    2013-04-05 15:35:59 UTC (rev 1309)
@@ -132,7 +132,10 @@


34. Auto-detect and optimize limited repetitions in JIT.

+35. Implement PCRE_NEVER_UTF to lock out the use of UTF, in particular, 
+    blocking (*UTF) etc.


+
Version 8.32 30-November-2012
-----------------------------


Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/doc/pcreapi.3    2013-04-05 15:35:59 UTC (rev 1309)
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "26 March 2013" "PCRE 8.33"
+.TH PCREAPI 3 "05 April 2013" "PCRE 8.33"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .sp
@@ -755,6 +755,15 @@
 (?m) option setting. If there are no newlines in a subject string, or no
 occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.
 .sp
+  PCRE_NEVER_UTF
+.sp
+This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or
+UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
+creator of the pattern from switching to UTF interpretation by starting the
+pattern with (*UTF). This may be useful in applications that process patterns
+from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also 
+causes an error.
+.sp
   PCRE_NEWLINE_CR
   PCRE_NEWLINE_LF
   PCRE_NEWLINE_CRLF
@@ -2834,6 +2843,6 @@
 .rs
 .sp
 .nf
-Last updated: 26 March 2013
+Last updated: 05 April 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/doc/pcrepattern.3    2013-04-05 15:35:59 UTC (rev 1309)
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "27 March 2013" "PCRE 8.33"
+.TH PCREPATTERN 3 "05 April 2013" "PCRE 8.33"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -1366,7 +1366,8 @@
 sequences that can be used to set UTF and Unicode property modes; they are
 equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
 options, respectively. The (*UTF) sequence is a generic version that can be
-used with any of the libraries.
+used with any of the libraries. However, the application can set the 
+PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.
 .
 .
 .\" HTML <a name="subpattern"></a>
@@ -3100,6 +3101,6 @@
 .rs
 .sp
 .nf
-Last updated: 27 March 2013
+Last updated: 05 April 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/doc/pcretest.1    2013-04-05 15:35:59 UTC (rev 1309)
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "22 February 2013" "PCRE 8.33"
+.TH PCRETEST 1 "05 April 2013" "PCRE 8.33"
 .SH NAME
 pcretest - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -287,6 +287,7 @@
 sections.
 .sp
   \fB/8\fP              set UTF mode
+  \fB/9\fP              set PCRE_NEVER_UTF (locks out UTF mode) 
   \fB/?\fP              disable UTF validity check
   \fB/+\fP              show remainder of subject after match
   \fB/=\fP              show all captures (not just those that are set)
@@ -357,6 +358,7 @@
   \fB/8\fP              PCRE_UTF32          ) when using the 32-bit
   \fB/?\fP              PCRE_NO_UTF32_CHECK )   library
 .sp
+  \fB/9\fP              PCRE_NEVER_UTF
   \fB/A\fP              PCRE_ANCHORED
   \fB/C\fP              PCRE_AUTO_CALLOUT
   \fB/E\fP              PCRE_DOLLAR_ENDONLY
@@ -1081,6 +1083,6 @@
 .rs
 .sp
 .nf
-Last updated: 22 February 2013
+Last updated: 05 April 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi


Modified: code/trunk/pcre.h.in
===================================================================
--- code/trunk/pcre.h.in    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/pcre.h.in    2013-04-05 15:35:59 UTC (rev 1309)
@@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, to be #included by
 applications that call the PCRE functions.


-           Copyright (c) 1997-2012 University of Cambridge
+           Copyright (c) 1997-2013 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -96,11 +96,14 @@
#endif

/* Public options. Some are compile-time only, some are run-time only, and some
-are both, so we keep them all distinct. However, almost all the bits in the
-options word are now used. In the long run, we may have to re-use some of the
-compile-time only bits for runtime options, or vice versa. Any of the
-compile-time options may be inspected during studying (and therefore JIT
-compiling).
+are both. Most of the compile-time options are saved with the compiled regex so
+that they can be inspected during studying (and therefore JIT compiling). Note
+that pcre_study() has its own set of options. Originally, all the options
+defined here used distinct bits. However, almost all the bits in a 32-bit word
+are now used, so in order to conserve them, option bits that were previously
+only recognized at matching time (i.e. by pcre_exec() or pcre_dfa_exec()) may
+also be used for compile-time options that affect only compiling and are not
+relevant for studying or JIT compiling.

 Some options for pcre_compile() change its behaviour but do not affect the
 behaviour of the execution functions. Other options are passed through to the
@@ -142,7 +145,11 @@
 #define PCRE_AUTO_CALLOUT       0x00004000  /* C1       */
 #define PCRE_PARTIAL_SOFT       0x00008000  /*    E D J  ) Synonyms */
 #define PCRE_PARTIAL            0x00008000  /*    E D J  )          */
-#define PCRE_DFA_SHORTEST       0x00010000  /*      D   */
+
+/* This pair use the same bit. */
+#define PCRE_NEVER_UTF          0x00010000  /* C1        ) Overlaid */
+#define PCRE_DFA_SHORTEST       0x00010000  /*      D    ) Overlaid */
+
 #define PCRE_DFA_RESTART        0x00020000  /*      D   */
 #define PCRE_FIRSTLINE          0x00040000  /* C3       */
 #define PCRE_DUPNAMES           0x00080000  /* C1       */
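
Note (not part of the commit): the overlay above gives the bit 0x00010000 two meanings, and which one applies depends on the function that receives the options word. The following is a minimal sketch of that idea only. The two values are copied from the hunk above; the DEMO_ names, the demo_stage enum and demo_shared_bit_name() are hypothetical and exist so that the sketch compiles without pcre.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values copied from the pcre.h.in hunk. The DEMO_ prefix is hypothetical. */
#define DEMO_NEVER_UTF     0x00010000u  /* "C1": a pcre_compile() option  */
#define DEMO_DFA_SHORTEST  0x00010000u  /* "D":  a pcre_dfa_exec() option */

/* Hypothetical tag naming which kind of call receives the options word. */
typedef enum { DEMO_AT_COMPILE, DEMO_AT_DFA_EXEC } demo_stage;

/* Return the name the shared bit carries at the given stage, or NULL when
   the bit is not set in the options word at all. */
const char *demo_shared_bit_name(unsigned int options, demo_stage stage)
{
  if ((options & 0x00010000u) == 0) return NULL;
  return (stage == DEMO_AT_COMPILE) ? "PCRE_NEVER_UTF" : "PCRE_DFA_SHORTEST";
}
```

Because the two names are indistinguishable as numbers, an options word built for one call should not be reused for the other without clearing the shared bit.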


Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/pcre_compile.c    2013-04-05 15:35:59 UTC (rev 1309)
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2012 University of Cambridge
+           Copyright (c) 1997-2013 University of Cambridge


-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -508,6 +508,7 @@
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
"character value in \\u.... sequence is too large\0"
"invalid UTF-32 string\0"
+ "setting UTF is disabled by the application\0"
;

/* Table to identify digits and hex digits. This is used when compiling
@@ -7771,6 +7772,7 @@
int errorcode = 0;
int skipatstart = 0;
BOOL utf;
+BOOL never_utf = FALSE;
size_t size;
pcre_uchar *code;
const pcre_uchar *codestart;
@@ -7829,7 +7831,16 @@
errorcode = ERR17;
goto PCRE_EARLY_ERROR_RETURN;
}
+
+/* If PCRE_NEVER_UTF is set, remember it. As this option steals a bit that is
+also used for execution options, flatten it just in case. */

+if ((options & PCRE_NEVER_UTF) != 0)
+ {
+ never_utf = TRUE;
+ options &= ~PCRE_NEVER_UTF;
+ }
+
/* Check for global one-time settings at the start of the pattern, and remember
the offset for later. */

@@ -7885,9 +7896,14 @@
     options = (options & ~(PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) | newbsr;
   else break;
   }
-
+  
 /* PCRE_UTF(16|32) have the same value as PCRE_UTF8. */
 utf = (options & PCRE_UTF8) != 0;
+if (utf && never_utf)
+  {
+  errorcode = ERR78;
+  goto PCRE_EARLY_ERROR_RETURN2;
+  } 


/* Can't support UTF unless PCRE has been compiled to include the code. The
return of an error code from PRIV(valid_utf)() is a new feature, introduced in
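
Note (not part of the commit): the two pcre_compile.c hunks above first remember and clear the PCRE_NEVER_UTF bit, and later reject the pattern if UTF mode ends up enabled. The sketch below models only that control flow, under stated assumptions. All DEMO_ and demo_ names are hypothetical. The pattern_sets_utf flag stands in for the (*UTF) parsing that happens between the two hunks. The value 0x00000800 is assumed to match PCRE_UTF8 in pcre.h, and 78 is assumed to be the numeric value of ERR78, given that the enum starts at ERR0.

```c
#include <assert.h>

#define DEMO_UTF8              0x00000800u  /* assumed value of PCRE_UTF8 */
#define DEMO_NEVER_UTF         0x00010000u  /* from the pcre.h.in hunk    */
#define DEMO_ERR_UTF_DISABLED  78           /* assumed value of ERR78     */

/* Model of the lock-out check. Returns 0 on success or the error code.
   *options is updated in the same way as the real options word. */
int demo_check_utf_lockout(unsigned int *options, int pattern_sets_utf)
{
  int never_utf = 0;

  /* Remember the request, then clear ("flatten") the bit, because the same
     bit means something else to the execution functions. */
  if ((*options & DEMO_NEVER_UTF) != 0)
    {
    never_utf = 1;
    *options &= ~DEMO_NEVER_UTF;
    }

  /* Stand-in for a leading (*UTF) or (*UTF8) in the pattern. */
  if (pattern_sets_utf) *options |= DEMO_UTF8;

  /* UTF requested by either route while locked out is an error. */
  if ((*options & DEMO_UTF8) != 0 && never_utf) return DEMO_ERR_UTF_DISABLED;
  return 0;
}
```

The two failing cases correspond to the new tests in this commit: /(*UTF8)abc/9 is the call with only DEMO_NEVER_UTF set and pattern_sets_utf non-zero, and /abc/89 is the call with both option bits set.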

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/pcre_internal.h    2013-04-05 15:35:59 UTC (rev 1309)
@@ -7,7 +7,7 @@
 and semantics are as close as possible to those of the Perl 5 language.


                        Written by Philip Hazel
-           Copyright (c) 1997-2012 University of Cambridge
+           Copyright (c) 1997-2013 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -1164,7 +1164,7 @@
    PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
    PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
    PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
-   PCRE_JAVASCRIPT_COMPAT|PCRE_UCP|PCRE_NO_START_OPTIMIZE)
+   PCRE_JAVASCRIPT_COMPAT|PCRE_UCP|PCRE_NO_START_OPTIMIZE|PCRE_NEVER_UTF)


 #define PUBLIC_EXEC_OPTIONS \
   (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NOTEMPTY_ATSTART| \
@@ -2271,7 +2271,7 @@
        ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
        ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
        ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69,
-       ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERRCOUNT };
+       ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERRCOUNT };


/* JIT compiling modes. The function list is indexed by them. */
enum { JIT_COMPILE, JIT_PARTIAL_SOFT_COMPILE, JIT_PARTIAL_HARD_COMPILE,

Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/pcreposix.c    2013-04-05 15:35:59 UTC (rev 1309)
@@ -162,7 +162,8 @@
   /* 75 */
   REG_BADPAT,  /* overlong MARK name */
   REG_BADPAT,  /* character value in \u.... sequence is too large */
-  REG_BADPAT   /* invalid UTF-32 string (should not occur) */
+  REG_BADPAT,  /* invalid UTF-32 string (should not occur) */
+  REG_BADPAT   /* setting UTF is disabled by the application */ 
 };


/* Table of texts corresponding to POSIX error codes */

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/pcretest.c    2013-04-05 15:35:59 UTC (rev 1309)
@@ -3689,6 +3689,7 @@
       case 'Y': options |= PCRE_NO_START_OPTIMISE; break;
       case 'Z': debug_lengths = 0; break;
       case '8': options |= PCRE_UTF8; use_utf = 1; break;
+      case '9': options |= PCRE_NEVER_UTF; break; 
       case '?': options |= PCRE_NO_UTF8_CHECK; break;


       case 'T':


Modified: code/trunk/testdata/testinput15
===================================================================
--- code/trunk/testdata/testinput15    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/testdata/testinput15    2013-04-05 15:35:59 UTC (rev 1309)
@@ -357,4 +357,8 @@
   \x{ff000041}
   \x{7f000041} 


+/(*UTF8)abc/9
+
+/abc/89
+
/-- End of testinput15 --/

Modified: code/trunk/testdata/testinput18
===================================================================
--- code/trunk/testdata/testinput18    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/testdata/testinput18    2013-04-05 15:35:59 UTC (rev 1309)
@@ -291,4 +291,8 @@
 /\x{a0}+\s!/8BZT1
     \x{a0}\x20!


+/(*UTF)abc/9
+
+/abc/89
+
/-- End of testinput18 --/

Modified: code/trunk/testdata/testoutput15
===================================================================
--- code/trunk/testdata/testoutput15    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/testdata/testoutput15    2013-04-05 15:35:59 UTC (rev 1309)
@@ -1128,4 +1128,10 @@
   \x{7f000041} 
 Error -10 (bad UTF-8 string) offset=0 reason=12


+/(*UTF8)abc/9
+Failed: setting UTF is disabled by the application at offset 0
+
+/abc/89
+Failed: setting UTF is disabled by the application at offset 0
+
/-- End of testinput15 --/

Modified: code/trunk/testdata/testoutput18-16
===================================================================
--- code/trunk/testdata/testoutput18-16    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/testdata/testoutput18-16    2013-04-05 15:35:59 UTC (rev 1309)
@@ -1015,4 +1015,10 @@
     \x{a0}\x20!
  0: \x{a0} !


+/(*UTF)abc/9
+Failed: setting UTF is disabled by the application at offset 0
+
+/abc/89
+Failed: setting UTF is disabled by the application at offset 0
+
/-- End of testinput18 --/

Modified: code/trunk/testdata/testoutput18-32
===================================================================
--- code/trunk/testdata/testoutput18-32    2013-04-02 06:58:55 UTC (rev 1308)
+++ code/trunk/testdata/testoutput18-32    2013-04-05 15:35:59 UTC (rev 1309)
@@ -1012,4 +1012,10 @@
     \x{a0}\x20!
  0: \x{a0} !


+/(*UTF)abc/9
+Failed: setting UTF is disabled by the application at offset 0
+
+/abc/89
+Failed: setting UTF is disabled by the application at offset 0
+
/-- End of testinput18 --/