Revision: 412
http://vcs.pcre.org/viewvc?view=rev&revision=412
Author: ph10
Date: 2009-04-11 11:34:37 +0100 (Sat, 11 Apr 2009)
Log Message:
-----------
Add support for (*UTF8).
Modified Paths:
--------------
code/trunk/ChangeLog
code/trunk/doc/pcre.3
code/trunk/doc/pcreapi.3
code/trunk/doc/pcrepattern.3
code/trunk/doc/pcresyntax.3
code/trunk/pcre_compile.c
code/trunk/pcre_internal.h
code/trunk/pcretest.c
code/trunk/testdata/testinput5
code/trunk/testdata/testoutput5
Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/ChangeLog 2009-04-11 10:34:37 UTC (rev 412)
@@ -1,7 +1,7 @@
ChangeLog for PCRE
------------------
-Version 7.9 10-Apr-09
+Version 7.9 11-Apr-09
---------------------
1. When building with support for bzlib/zlib (pcregrep) and/or readline
@@ -116,6 +116,8 @@
27. Wrapped the definitions of fileno and isatty for Windows, which appear in
pcretest.c, inside #ifndefs, because it seems they are sometimes already
pre-defined.
+
+28. Added support for (*UTF8) at the start of a pattern.
Modified: code/trunk/doc/pcre.3
===================================================================
--- code/trunk/doc/pcre.3 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/doc/pcre.3 2009-04-11 10:34:37 UTC (rev 412)
@@ -15,7 +15,7 @@
Perl 5.10, including support for UTF-8 encoded strings and Unicode general
category properties. However, UTF-8 and Unicode support has to be explicitly
enabled; it is not the default. The Unicode tables correspond to Unicode
-release 5.0.0.
+release 5.1.
.P
In addition to the Perl-compatible matching function, PCRE contains an
alternative matching function that matches the same compiled patterns in a
@@ -163,9 +163,10 @@
.\" HREF
\fBpcre_compile()\fP
.\"
-with the PCRE_UTF8 option flag. When you do this, both the pattern and any
-subject strings that are matched against it are treated as UTF-8 strings
-instead of just strings of bytes.
+with the PCRE_UTF8 option flag, or the pattern must start with the sequence
+(*UTF8). When either of these is the case, both the pattern and any subject
+strings that are matched against it are treated as UTF-8 strings instead of
+just strings of bytes.
.P
If you compile PCRE with UTF-8 support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
@@ -290,6 +291,6 @@
.rs
.sp
.nf
-Last updated: 18 March 2009
+Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/doc/pcreapi.3 2009-04-11 10:34:37 UTC (rev 412)
@@ -406,16 +406,17 @@
.P
The \fIoptions\fP argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available
-options are described below. Some of them, in particular, those that are
-compatible with Perl, can also be set and unset from within the pattern (see
-the detailed description in the
+options are described below. Some of them (in particular, those that are
+compatible with Perl, but also some others) can also be set and unset from
+within the pattern (see the detailed description in the
.\" HREF
\fBpcrepattern\fP
.\"
-documentation). For these options, the contents of the \fIoptions\fP argument
-specifies their initial settings at the start of compilation and execution. The
-PCRE_ANCHORED and PCRE_NEWLINE_\fIxxx\fP options can be set at the time of
-matching as well as at compile time.
+documentation). For those options that can be different in different parts of
+the pattern, the contents of the \fIoptions\fP argument specifies their initial
+settings at the start of compilation and execution. The PCRE_ANCHORED and
+PCRE_NEWLINE_\fIxxx\fP options can be set at the time of matching as well as at
+compile time.
.P
If \fIerrptr\fP is NULL, \fBpcre_compile()\fP returns NULL immediately.
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns
@@ -1995,6 +1996,6 @@
.rs
.sp
.nf
-Last updated: 17 March 2009
+Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/doc/pcrepattern.3 2009-04-11 10:34:37 UTC (rev 412)
@@ -23,8 +23,15 @@
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 character strings. To use this, you must
build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with
-the PCRE_UTF8 option. How this affects pattern matching is mentioned in several
-places below. There is also a summary of UTF-8 features in the
+the PCRE_UTF8 option. There is also a special sequence that can be given at the
+start of a pattern:
+.sp
+ (*UTF8)
+.sp
+Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
+option. This feature is not Perl-compatible. How setting UTF-8 mode affects
+pattern matching is mentioned in several places below. There is also a summary
+of UTF-8 features in the
.\" HTML <a href="pcre.html#utf8support">
.\" </a>
section on UTF-8 support
@@ -1032,11 +1039,11 @@
changed in the same way as the Perl-compatible options by using the characters
J, U and X respectively.
.P
-When an option change occurs at top level (that is, not inside subpattern
-parentheses), the change applies to the remainder of the pattern that follows.
-If the change is placed right at the start of a pattern, PCRE extracts it into
-the global options (and it will therefore show up in data extracted by the
-\fBpcre_fullinfo()\fP function).
+When one of these option changes occurs at top level (that is, not inside
+subpattern parentheses), the change applies to the remainder of the pattern
+that follows. If the change is placed right at the start of a pattern, PCRE
+extracts it into the global options (and it will therefore show up in data
+extracted by the \fBpcre_fullinfo()\fP function).
.P
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows it, so
@@ -1057,13 +1064,15 @@
.P
\fBNote:\fP There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some cases the
-pattern can contain special leading sequences to override what the application
-has set or what has been defaulted. Details are given in the section entitled
+pattern can contain special leading sequences such as (*CRLF) to override what
+the application has set or what has been defaulted. Details are given in the
+section entitled
.\" HTML <a href="#newlineseq">
.\" </a>
"Newline sequences"
.\"
-above.
+above. There is also the (*UTF8) leading sequence that can be used to set UTF-8
+mode; this is equivalent to setting the PCRE_UTF8 option.
.
.
.\" HTML <a name="subpattern"></a>
@@ -2245,6 +2254,6 @@
.rs
.sp
.nf
-Last updated: 18 March 2009
+Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/doc/pcresyntax.3 2009-04-11 10:34:37 UTC (rev 412)
@@ -120,6 +120,8 @@
Buginese,
Buhid,
Canadian_Aboriginal,
+Carian,
+Cham,
Cherokee,
Common,
Coptic,
@@ -143,12 +145,16 @@
Inherited,
Kannada,
Katakana,
+Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
+Lepcha,
Limbu,
Linear_B,
+Lycian,
+Lydian,
Malayalam,
Mongolian,
Myanmar,
@@ -157,13 +163,17 @@
Ogham,
Old_Italic,
Old_Persian,
+Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
+Rejang,
Runic,
+Saurashtra,
Shavian,
Sinhala,
+Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
@@ -176,6 +186,7 @@
Tibetan,
Tifinagh,
Ugaritic,
+Vai,
Yi.
.
.
@@ -231,7 +242,7 @@
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
- \eb word boundary
+ \eb word boundary (only ASCII letters recognized)
\eB not a word boundary
^ start of subject
also after internal newline in multiline mode
@@ -260,19 +271,19 @@
.SH "CAPTURING"
.rs
.sp
- (...) capturing group
- (?<name>...) named capturing group (Perl)
- (?'name'...) named capturing group (Perl)
- (?P<name>...) named capturing group (Python)
- (?:...) non-capturing group
- (?|...) non-capturing group; reset group numbers for
- capturing groups in each alternative
+ (...) capturing group
+ (?<name>...) named capturing group (Perl)
+ (?'name'...) named capturing group (Perl)
+ (?P<name>...) named capturing group (Python)
+ (?:...) non-capturing group
+ (?|...) non-capturing group; reset group numbers for
+ capturing groups in each alternative
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
- (?>...) atomic, non-capturing group
+ (?>...) atomic, non-capturing group
.
.
.
@@ -280,28 +291,33 @@
.SH "COMMENT"
.rs
.sp
- (?#....) comment (not nestable)
+ (?#....) comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
- (?i) caseless
- (?J) allow duplicate names
- (?m) multiline
- (?s) single line (dotall)
- (?U) default ungreedy (lazy)
- (?x) extended (ignore white space)
- (?-...) unset option(s)
+ (?i) caseless
+ (?J) allow duplicate names
+ (?m) multiline
+ (?s) single line (dotall)
+ (?U) default ungreedy (lazy)
+ (?x) extended (ignore white space)
+ (?-...) unset option(s)
+.sp
+The following is recognized only at the start of a pattern or after one of the
+newline-setting options with similar syntax:
+.sp
+ (*UTF8) set UTF-8 mode
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
- (?=...) positive look ahead
- (?!...) negative look ahead
- (?<=...) positive look behind
- (?<!...) negative look behind
+ (?=...) positive look ahead
+ (?!...) negative look ahead
+ (?<=...) positive look behind
+ (?<!...) negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
.
@@ -309,34 +325,34 @@
.SH "BACKREFERENCES"
.rs
.sp
- \en reference by number (can be ambiguous)
- \egn reference by number
- \eg{n} reference by number
- \eg{-n} relative reference by number
- \ek<name> reference by name (Perl)
- \ek'name' reference by name (Perl)
- \eg{name} reference by name (Perl)
- \ek{name} reference by name (.NET)
- (?P=name) reference by name (Python)
+ \en reference by number (can be ambiguous)
+ \egn reference by number
+ \eg{n} reference by number
+ \eg{-n} relative reference by number
+ \ek<name> reference by name (Perl)
+ \ek'name' reference by name (Perl)
+ \eg{name} reference by name (Perl)
+ \ek{name} reference by name (.NET)
+ (?P=name) reference by name (Python)
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
- (?R) recurse whole pattern
- (?n) call subpattern by absolute number
- (?+n) call subpattern by relative number
- (?-n) call subpattern by relative number
- (?&name) call subpattern by name (Perl)
- (?P>name) call subpattern by name (Python)
- \eg<name> call subpattern by name (Oniguruma)
- \eg'name' call subpattern by name (Oniguruma)
- \eg<n> call subpattern by absolute number (Oniguruma)
- \eg'n' call subpattern by absolute number (Oniguruma)
- \eg<+n> call subpattern by relative number (PCRE extension)
- \eg'+n' call subpattern by relative number (PCRE extension)
- \eg<-n> call subpattern by relative number (PCRE extension)
- \eg'-n' call subpattern by relative number (PCRE extension)
+ (?R) recurse whole pattern
+ (?n) call subpattern by absolute number
+ (?+n) call subpattern by relative number
+ (?-n) call subpattern by relative number
+ (?&name) call subpattern by name (Perl)
+ (?P>name) call subpattern by name (Python)
+ \eg<name> call subpattern by name (Oniguruma)
+ \eg'name' call subpattern by name (Oniguruma)
+ \eg<n> call subpattern by absolute number (Oniguruma)
+ \eg'n' call subpattern by absolute number (Oniguruma)
+ \eg<+n> call subpattern by relative number (PCRE extension)
+ \eg'+n' call subpattern by relative number (PCRE extension)
+ \eg<-n> call subpattern by relative number (PCRE extension)
+ \eg'-n' call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
@@ -345,17 +361,17 @@
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
.sp
- (?(n)... absolute reference condition
- (?(+n)... relative reference condition
- (?(-n)... relative reference condition
- (?(<name>)... named reference condition (Perl)
- (?('name')... named reference condition (Perl)
- (?(name)... named reference condition (PCRE)
- (?(R)... overall recursion condition
- (?(Rn)... specific group recursion condition
- (?(R&name)... specific recursion condition
- (?(DEFINE)... define subpattern for reference
- (?(assert)... assertion condition
+ (?(n)... absolute reference condition
+ (?(+n)... relative reference condition
+ (?(-n)... relative reference condition
+ (?(<name>)... named reference condition (Perl)
+ (?('name')... named reference condition (Perl)
+ (?(name)... named reference condition (PCRE)
+ (?(R)... overall recursion condition
+ (?(Rn)... specific group recursion condition
+ (?(R&name)... specific recursion condition
+ (?(DEFINE)... define subpattern for reference
+ (?(assert)... assertion condition
.
.
.SH "BACKTRACKING CONTROL"
@@ -363,41 +379,41 @@
.sp
The following act immediately they are reached:
.sp
- (*ACCEPT) force successful match
- (*FAIL) force backtrack; synonym (*F)
+ (*ACCEPT) force successful match
+ (*FAIL) force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
- (*COMMIT) overall failure, no advance of starting point
- (*PRUNE) advance to next starting character
- (*SKIP) advance start to current matching position
- (*THEN) local failure, backtrack to next alternation
+ (*COMMIT) overall failure, no advance of starting point
+ (*PRUNE) advance to next starting character
+ (*SKIP) advance start to current matching position
+ (*THEN) local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
-(*BSR_...) option.
+(*BSR_...) or (*UTF8) option.
.sp
- (*CR)
- (*LF)
- (*CRLF)
- (*ANYCRLF)
- (*ANY)
+ (*CR) carriage return only
+ (*LF) linefeed only
+ (*CRLF) carriage return followed by linefeed
+ (*ANYCRLF) all three of the above
+ (*ANY) any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
-(*...) option that sets the newline convention.
+(*...) option that sets the newline convention or UTF-8 mode.
.sp
- (*BSR_ANYCRLF)
- (*BSR_UNICODE)
+ (*BSR_ANYCRLF) CR, LF, or CRLF
+ (*BSR_UNICODE) any Unicode newline sequence
.
.
.SH "CALLOUTS"
@@ -428,6 +444,6 @@
.rs
.sp
.nf
-Last updated: 09 April 2008
-Copyright (c) 1997-2008 University of Cambridge.
+Last updated: 11 April 2009
+Copyright (c) 1997-2009 University of Cambridge.
.fi
Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/pcre_compile.c 2009-04-11 10:34:37 UTC (rev 412)
@@ -6226,38 +6226,22 @@
*erroroffset = 0;
-/* Can't support UTF8 unless PCRE has been compiled to include the code. */
+/* Set up pointers to the individual character tables */
-#ifdef SUPPORT_UTF8
-utf8 = (options & PCRE_UTF8) != 0;
-if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
- (*erroroffset = _pcre_valid_utf8((uschar *)pattern, -1)) >= 0)
- {
- errorcode = ERR44;
- goto PCRE_EARLY_ERROR_RETURN2;
- }
-#else
-if ((options & PCRE_UTF8) != 0)
- {
- errorcode = ERR32;
- goto PCRE_EARLY_ERROR_RETURN;
- }
-#endif
+if (tables == NULL) tables = _pcre_default_tables;
+cd->lcc = tables + lcc_offset;
+cd->fcc = tables + fcc_offset;
+cd->cbits = tables + cbits_offset;
+cd->ctypes = tables + ctypes_offset;
+/* Check that all undefined public option bits are zero */
+
if ((options & ~PUBLIC_COMPILE_OPTIONS) != 0)
{
errorcode = ERR17;
goto PCRE_EARLY_ERROR_RETURN;
}
-/* Set up pointers to the individual character tables */
-
-if (tables == NULL) tables = _pcre_default_tables;
-cd->lcc = tables + lcc_offset;
-cd->fcc = tables + fcc_offset;
-cd->cbits = tables + cbits_offset;
-cd->ctypes = tables + ctypes_offset;
-
/* Check for global one-time settings at the start of the pattern, and remember
the offset for later. */
@@ -6267,6 +6251,9 @@
int newnl = 0;
int newbsr = 0;
+ if (strncmp((char *)(ptr+skipatstart+2), STRING_UTF8_RIGHTPAR, 5) == 0)
+ { skipatstart += 7; options |= PCRE_UTF8; continue; }
+
if (strncmp((char *)(ptr+skipatstart+2), STRING_CR_RIGHTPAR, 3) == 0)
{ skipatstart += 5; newnl = PCRE_NEWLINE_CR; }
else if (strncmp((char *)(ptr+skipatstart+2), STRING_LF_RIGHTPAR, 3) == 0)
@@ -6290,6 +6277,24 @@
else break;
}
+/* Can't support UTF8 unless PCRE has been compiled to include the code. */
+
+#ifdef SUPPORT_UTF8
+utf8 = (options & PCRE_UTF8) != 0;
+if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
+ (*erroroffset = _pcre_valid_utf8((uschar *)pattern, -1)) >= 0)
+ {
+ errorcode = ERR44;
+ goto PCRE_EARLY_ERROR_RETURN2;
+ }
+#else
+if ((options & PCRE_UTF8) != 0)
+ {
+ errorcode = ERR32;
+ goto PCRE_EARLY_ERROR_RETURN;
+ }
+#endif
+
/* Check validity of \R options. */
switch (options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE))
Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/pcre_internal.h 2009-04-11 10:34:37 UTC (rev 412)
@@ -881,6 +881,7 @@
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
+#define STRING_UTF8_RIGHTPAR "UTF8)"
#else /* SUPPORT_UTF8 */
@@ -1132,6 +1133,7 @@
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
+#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
#endif /* SUPPORT_UTF8 */
Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/pcretest.c 2009-04-11 10:34:37 UTC (rev 412)
@@ -1325,6 +1325,8 @@
#endif /* !defined NOPOSIX */
{
+ unsigned long int get_options;
+
if (timeit > 0)
{
register int i;
@@ -1367,10 +1369,17 @@
}
goto CONTINUE;
}
+
+ /* Compilation succeeded. It is now possible to set the UTF-8 option from
+ within the regex; check for this so that we know how to process the data
+ lines. */
+
+ new_info(re, NULL, PCRE_INFO_OPTIONS, &get_options);
+ if ((get_options & PCRE_UTF8) != 0) use_utf8 = 1;
- /* Compilation succeeded; print data if required. There are now two
- info-returning functions. The old one has a limited interface and
- returns only limited data. Check that it agrees with the newer one. */
+ /* Print information if required. There are now two info-returning
+ functions. The old one has a limited interface and returns only limited
+ data. Check that it agrees with the newer one. */
if (log_store)
fprintf(outfile, "Memory allocation (code space): %d\n",
@@ -1454,10 +1463,12 @@
fprintf(outfile, "------------------------------------------------------------------\n");
pcre_printint(re, outfile, debug_lengths);
}
+
+ /* We already have the options in get_options (see above) */
if (do_showinfo)
{
- unsigned long int get_options, all_options;
+ unsigned long int all_options;
#if !defined NOINFOCHECK
int old_first_char, old_options, old_count;
#endif
@@ -1466,7 +1477,6 @@
int nameentrysize, namecount;
const uschar *nametable;
- new_info(re, NULL, PCRE_INFO_OPTIONS, &get_options);
new_info(re, NULL, PCRE_INFO_SIZE, &size);
new_info(re, NULL, PCRE_INFO_CAPTURECOUNT, &count);
new_info(re, NULL, PCRE_INFO_BACKREFMAX, &backrefmax);
Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/testdata/testinput5 2009-04-11 10:34:37 UTC (rev 412)
@@ -480,4 +480,9 @@
/X/8f<any>
A\x{1ec5}ABCXYZ
+/(*UTF8)\x{1234}/
+ abcd\x{1234}pqr
+
+/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+
/ End of testinput5 /
Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5 2009-04-10 15:40:21 UTC (rev 411)
+++ code/trunk/testdata/testoutput5 2009-04-11 10:34:37 UTC (rev 412)
@@ -1641,4 +1641,15 @@
A\x{1ec5}ABCXYZ
0: X
+/(*UTF8)\x{1234}/
+ abcd\x{1234}pqr
+ 0: \x{1234}
+
+/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+Capturing subpattern count = 0
+Options: bsr_unicode utf8
+Forced newline sequence: CRLF
+First char = 'a'
+Need char = 'b'
+
/ End of testinput5 /
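
The scan of leading (*...) items that this revision extends in pcre_compile.c can be sketched roughly as below. This is a standalone illustration, not the code from the commit: the OPT_* flag values, the leading_items table, and the scan_leading_options() function are hypothetical names invented for the sketch, and the real code uses PCRE's own option bits and the STRING_..._RIGHTPAR macros. The sketch assumes, as the diff suggests, that (*UTF8) simply sets a flag, while the newline and \R items replace any earlier setting in their own group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for this sketch; these are NOT PCRE's constants. */
enum {
  OPT_UTF8        = 1u << 0,
  OPT_NL_CR       = 1u << 1,
  OPT_NL_LF       = 1u << 2,
  OPT_NL_CRLF     = 1u << 3,
  OPT_NL_ANYCRLF  = 1u << 4,
  OPT_NL_ANY      = 1u << 5,
  OPT_BSR_ANYCRLF = 1u << 6,
  OPT_BSR_UNICODE = 1u << 7,
  OPT_NL_MASK     = OPT_NL_CR | OPT_NL_LF | OPT_NL_CRLF |
                    OPT_NL_ANYCRLF | OPT_NL_ANY,
  OPT_BSR_MASK    = OPT_BSR_ANYCRLF | OPT_BSR_UNICODE
};

typedef struct {
  const char *text;   /* text after "(*", including the closing ")" */
  unsigned    flag;   /* bit to set when the item is seen */
  unsigned    clear;  /* bits to clear first (mutually exclusive group) */
} leading_item;

static const leading_item leading_items[] = {
  { "UTF8)",        OPT_UTF8,        0            },
  { "CR)",          OPT_NL_CR,       OPT_NL_MASK  },
  { "LF)",          OPT_NL_LF,       OPT_NL_MASK  },
  { "CRLF)",        OPT_NL_CRLF,     OPT_NL_MASK  },
  { "ANYCRLF)",     OPT_NL_ANYCRLF,  OPT_NL_MASK  },
  { "ANY)",         OPT_NL_ANY,      OPT_NL_MASK  },
  { "BSR_ANYCRLF)", OPT_BSR_ANYCRLF, OPT_BSR_MASK },
  { "BSR_UNICODE)", OPT_BSR_UNICODE, OPT_BSR_MASK }
};

/* Consume recognized "(*NAME)" items at the start of the pattern, update
   *options, and return the number of characters to skip. Scanning stops at
   the first item that is not in the table, for example "(*COMMIT)". */
size_t scan_leading_options(const char *pattern, unsigned *options)
{
  size_t skip = 0;
  assert(pattern != NULL && options != NULL);

  while (pattern[skip] == '(' && pattern[skip + 1] == '*') {
    const char *p = pattern + skip + 2;
    size_t i, n = sizeof leading_items / sizeof leading_items[0];

    for (i = 0; i < n; i++) {
      size_t len = strlen(leading_items[i].text);
      /* strncmp stops at a NUL byte, so a short pattern is not over-read. */
      if (strncmp(p, leading_items[i].text, len) == 0) {
        *options = (*options & ~leading_items[i].clear) | leading_items[i].flag;
        skip += 2 + len;
        break;
      }
    }
    if (i == n) break;  /* unrecognized item: leave it for the main compiler */
  }
  return skip;
}
```

Because each table entry includes the closing parenthesis, "CR)" cannot match the start of "CRLF)", so the order of entries does not matter. For the new test pattern (*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb, this sketch skips 28 characters and ends with the UTF-8, CRLF, and BSR_UNICODE flags set, which lines up with the "Options: bsr_unicode utf8" and "Forced newline sequence: CRLF" lines in testoutput5.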