[Pcre-svn] [1370] code/trunk: Add \o{} and tidy up \x{} handling.

Author: Subversion repository
Date:
To: pcre-svn
Subject: [Pcre-svn] [1370] code/trunk: Add \o{} and tidy up \x{} handling.

Revision: 1370

          http://vcs.pcre.org/viewvc?view=rev&revision=1370
Author:   ph10
Date:     2013-10-09 11:18:26 +0100 (Wed, 09 Oct 2013)

Log Message:
-----------
Add \o{} and tidy up \x{} handling. Minor update to RunTest.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/RunTest
    code/trunk/doc/pcreapi.3
    code/trunk/doc/pcrepattern.3
    code/trunk/doc/pcresyntax.3
    code/trunk/doc/pcretest.1
    code/trunk/pcre_compile.c
    code/trunk/pcre_internal.h
    code/trunk/pcreposix.c
    code/trunk/pcretest.c
    code/trunk/testdata/testinput1
    code/trunk/testdata/testinput14
    code/trunk/testdata/testinput2
    code/trunk/testdata/testinput23
    code/trunk/testdata/testinput26
    code/trunk/testdata/testinput5
    code/trunk/testdata/testoutput1
    code/trunk/testdata/testoutput11-16
    code/trunk/testdata/testoutput11-32
    code/trunk/testdata/testoutput11-8
    code/trunk/testdata/testoutput14
    code/trunk/testdata/testoutput2
    code/trunk/testdata/testoutput23
    code/trunk/testdata/testoutput26
    code/trunk/testdata/testoutput5

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/ChangeLog    2013-10-09 10:18:26 UTC (rev 1370)
@@ -110,7 +110,20 @@
     encountered capturing group of those numbers, they are treated as the 
     literal characters 8 and 9 instead of a binary zero followed by the 
     literals. PCRE now does the same.
+    
+22. Following Perl, added \o{} to specify codepoints in octal, making it
+    possible to specify values greater than 0777 and also making them 
+    unambiguous. 
+    
+23. Perl now gives an error for missing closing braces after \x{... instead of 
+    treating the string as literal. PCRE now does the same.
+    
+24. RunTest used to grumble if an inappropriate test was selected explicitly, 
+    but just skip it when running all tests. This made it awkward to run ranges 
+    of tests when one of them was inappropriate. Now it just skips any 
+    inappropriate tests, as it always did when running all tests.

+

Version 8.33 28-May-2013
------------------------

Modified: code/trunk/RunTest
===================================================================
--- code/trunk/RunTest    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/RunTest    2013-10-09 10:18:26 UTC (rev 1370)
@@ -14,11 +14,11 @@
 # UTF-8 with the UTF-8 check turned off; for this, studying must also be
 # disabled with /SS.
 #
-# When JIT support is available, all the tests are also run with -s+ to test
-# (again, almost) everything with studying and the JIT option, unless "nojit"
-# is given on the command line. There are also two tests for JIT-specific
-# features, one to be run when JIT support is available (unless "nojit" is
-# specified), and one when it is not.
+# When JIT support is available, all appropriate tests are also run with -s+ to
+# test (again, almost) everything with studying and the JIT option, unless
+# "nojit" is given on the command line. There are also two tests for
+# JIT-specific features, one to be run when JIT support is available (unless
+# "nojit" is specified), and one when it is not.
 #
 # Whichever of the 8-, 16- and 32-bit libraries exist are tested. It is also
 # possible to select which to test by giving "-8", "-16" or "-32" on the
@@ -30,9 +30,13 @@
 # runs tests 3 to 15, excluding test 10, and just "~10" runs all the tests
 # except test 10. Whatever order the arguments are in, the tests are always run
 # in numerical order.
-
+#
+# Inappropriate tests are automatically skipped (with a comment to say so): for
+# example, if JIT support is not compiled, test 12 is skipped, whereas if JIT
+# support is compiled, test 13 is skipped.
+#
 # Other arguments can be one of the words "valgrind", "valgrind-log", or "sim"
-# followed by an argument to run cross- compiled executables under a simulator,
+# followed by an argument to run cross-compiled executables under a simulator,
 # for example:
 #
 # RunTest 3 sim "qemu-arm -s 8388608"
@@ -62,8 +66,8 @@
 title9="Test 9: DFA matching with UTF"
 title10="Test 10: DFA matching with Unicode properties"
 title11="Test 11: Internal offsets and code size tests"
-title12="Test 12: JIT-specific features (JIT available)"
-title13="Test 13: JIT-specific features (JIT not available)"
+title12="Test 12: JIT-specific features (when JIT is available)"
+title13="Test 13: JIT-specific features (when JIT is not available)"
 title14="Test 14: Specials for the basic 8-bit library"
 title15="Test 15: Specials for the 8-bit library with UTF-8 support"
 title16="Test 16: Specials for the 8-bit library with Unicode propery support"
@@ -350,79 +354,6 @@
   jitopt=-s+
 fi

-# Handle any explicit skips
-
-for i in $skip; do eval do$i=no; done
-
-# If any unsuitable tests were explicitly requested, grumble.
-
-if [ $utf -eq 0 ] ; then
-  if [ $do4 = yes ] ; then
-    echo "Can't run test 4 because UTF support is not configured"
-    exit 1
-  fi
-  if [ $do5 = yes ] ; then
-    echo "Can't run test 5 because UTF support is not configured"
-    exit 1
-  fi
-  if [ $do9 = yes ] ; then
-    echo "Can't run test 8 because UTF support is not configured"
-    exit 1
-  fi
-  if [ $do15 = yes ] ; then
-    echo "Can't run test 15 because UTF support is not configured"
-    exit 1
-  fi
-  if [ $do18 = yes ] ; then
-    echo "Can't run test 18 because UTF support is not configured"
-  fi
-  if [ $do22 = yes ] ; then
-    echo "Can't run test 22 because UTF support is not configured"
-  fi
-fi
-
-if [ $ucp -eq 0 ] ; then
-  if [ $do6 = yes ] ; then
-    echo "Can't run test 6 because Unicode property support is not configured"
-    exit 1
-  fi
-  if [ $do7 = yes ] ; then
-    echo "Can't run test 7 because Unicode property support is not configured"
-    exit 1
-  fi
-  if [ $do10 = yes ] ; then
-    echo "Can't run test 10 because Unicode property support is not configured"
-    exit 1
-  fi
-  if [ $do16 = yes ] ; then
-    echo "Can't run test 16 because Unicode property support is not configured"
-    exit 1
-  fi
-  if [ $do19 = yes ] ; then
-    echo "Can't run test 19 because Unicode property support is not configured"
-    exit 1
-  fi
-fi
-
-if [ $link_size -ne 2 ] ; then
-  if [ $do11 = yes ] ; then
-    echo "Can't run test 11 because the link size ($link_size) is not 2"
-    exit 1
-  fi
-fi
-
-if [ $jit -eq 0 ] ; then
-  if [ $do12 = "yes" ] ; then
-    echo "Can't run test 12 because JIT support is not configured"
-    exit 1
-  fi
-else
-  if [ $do13 = "yes" ] ; then
-    echo "Can't run test 13 because JIT support is configured"
-    exit 1
-  fi
-fi
-
 # If no specific tests were requested, select all. Those that are not
 # relevant will be automatically skipped.

@@ -461,8 +392,8 @@
do26=yes
fi

-# Handle any explicit skips (again, so that an argument list may consist only
-# of explicit skips).
+# Handle any explicit skips at this stage, so that an argument list may consist
+# only of explicit skips.

for i in $skip; do eval do$i=no; done

Modified: code/trunk/doc/pcreapi.3
===================================================================
--- code/trunk/doc/pcreapi.3    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/doc/pcreapi.3    2013-10-09 10:18:26 UTC (rev 1370)
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "05 October 2013" "PCRE 8.34"
+.TH PCREAPI 3 "08 October 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .sp
@@ -919,7 +919,7 @@
   31  POSIX collating elements are not supported
   32  this version of PCRE is compiled without UTF support
   33  [this code is not in use]
-  34  character value in \ex{...} sequence is too large
+  34  character value in \ex{} or \eo{} is too large
   35  invalid condition (?(0)
   36  \eC not allowed in lookbehind assertion
   37  PCRE does not support \eL, \el, \eN{name}, \eU, or \eu
@@ -967,6 +967,10 @@
   75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
   76  character value in \eu.... sequence is too large
   77  invalid UTF-32 string (specifically UTF-32)
+  78  setting UTF is disabled by the application
+  79  non-hex character in \ex{} (closing brace missing?)
+  80  non-octal character in \eo{} (closing brace missing?)
+  81  missing opening brace after \eo 
 .sp
 The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
 be used if the limits were changed when PCRE was built.
@@ -2866,6 +2870,6 @@
 .rs
 .sp
 .nf
-Last updated: 05 October 2013
+Last updated: 08 October 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi

Modified: code/trunk/doc/pcrepattern.3
===================================================================
--- code/trunk/doc/pcrepattern.3    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/doc/pcrepattern.3    2013-10-09 10:18:26 UTC (rev 1370)
@@ -300,7 +300,9 @@
   \en        linefeed (hex 0A)
   \er        carriage return (hex 0D)
   \et        tab (hex 09)
+  \e0dd      character with octal code 0dd 
   \eddd      character with octal code ddd, or back reference
+  \eo{ddd..} character with octal code ddd.. 
   \exhh      character with hex code hh
   \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
   \euhhhh    character with hex code hhhh (JavaScript mode only)
@@ -321,44 +323,23 @@
 the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
 characters also generate different values.
 .P
-By default, after \ex, from zero to two hexadecimal digits are read (letters
-can be in upper or lower case). Any number of hexadecimal digits may appear
-between \ex{ and }, but the character code is constrained as follows:
-.sp
-  8-bit non-UTF mode    less than 0x100
-  8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
-  16-bit non-UTF mode   less than 0x10000
-  16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
-  32-bit non-UTF mode   less than 0x80000000
-  32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
-.sp
-Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
-"surrogate" codepoints), and 0xffef.
-.P
-If characters other than hexadecimal digits appear between \ex{ and }, or if
-there is no terminating }, this form of escape is not recognized. Instead, the
-initial \ex will be interpreted as a basic hexadecimal escape, with no
-following digits, giving a character whose value is zero.
-.P
-If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
-as just described only when it is followed by two hexadecimal digits.
-Otherwise, it matches a literal "x" character. In JavaScript mode, support for
-code points greater than 256 is provided by \eu, which must be followed by
-four hexadecimal digits; otherwise it matches a literal "u" character.
-Character codes specified by \eu in JavaScript mode are constrained in the same
-was as those specified by \ex in non-JavaScript mode.
-.P
-Characters whose value is less than 256 can be defined by either of the two
-syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
-way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
-\eu00dc in JavaScript mode).
-.P
 After \e0 up to two further octal digits are read. If there are fewer than two
 digits, just those that are present are used. Thus the sequence \e0\ex\e07
 specifies two binary zeros followed by a BEL character (code value 7). Make
 sure you supply two digits after the initial zero if the pattern character that
 follows is itself an octal digit.
 .P
+The escape \eo must be followed by a sequence of octal digits, enclosed in 
+braces. An error occurs if this is not the case. This escape is a recent
+addition to Perl; it provides a way of specifying character code points as octal
+numbers greater than 0777, and it also allows octal numbers and back references
+to be unambiguously specified.
+.P
+For greater clarity and unambiguity, it is best to avoid following \e by a
+digit greater than zero. Instead, use \eo{} or \ex{} to specify character
+numbers, and \eg{} to specify back references. The following paragraphs
+describe the old, ambiguous syntax.
+.P
 The handling of a backslash followed by a digit other than 0 is complicated,
 and Perl has changed in recent releases, causing PCRE also to change. Outside a
 character class, PCRE reads the digit and any following digits as a decimal
@@ -379,9 +360,7 @@
 7 and there have not been that many capturing subpatterns, PCRE handles \e8 and
 \e9 as the literal characters "8" and "9", and otherwise re-reads up to three
 octal digits following the backslash, using them to generate a data character.
-Any subsequent digits stand for themselves. The value of the character is
-constrained in the same way as characters specified in hexadecimal. For
-example:
+Any subsequent digits stand for themselves. For example:
 .sp
   \e040   is another way of writing an ASCII space
 .\" JOIN
@@ -403,9 +382,48 @@
   \e81    is either a back reference, or the two
             characters "8" and "1"
 .sp
-Note that octal values of 100 or greater must not be introduced by a leading
-zero, because no more than three octal digits are ever read.
+Note that octal values of 100 or greater that are specified using this syntax
+must not be introduced by a leading zero, because no more than three octal
+digits are ever read.
 .P
+By default, after \ex that is not followed by {, from zero to two hexadecimal
+digits are read (letters can be in upper or lower case). Any number of
+hexadecimal digits may appear between \ex{ and }. If a character other than
+a hexadecimal digit appears between \ex{ and }, or if there is no terminating
+}, an error occurs.
+.P
+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
+as just described only when it is followed by two hexadecimal digits.
+Otherwise, it matches a literal "x" character. In JavaScript mode, support for
+code points greater than 256 is provided by \eu, which must be followed by
+four hexadecimal digits; otherwise it matches a literal "u" character.
+.P
+Characters whose value is less than 256 can be defined by either of the two
+syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
+way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
+\eu00dc in JavaScript mode).
+.
+.
+.SS "Constraints on character values"
+.rs
+.sp
+Characters that are specified using octal or hexadecimal numbers are
+limited to certain values, as follows:
+.sp
+  8-bit non-UTF mode    less than 0x100
+  8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
+  16-bit non-UTF mode   less than 0x10000
+  16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
+  32-bit non-UTF mode   less than 0x80000000
+  32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
+.sp
+Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
+"surrogate" codepoints), and 0xffef. 
+.
+.
+.SS "Escape sequences in character classes"
+.rs
+.sp
 All the sequences that define a single character value can be used both inside
 and outside character classes. In addition, inside a character class, \eb is
 interpreted as the backspace character (hex 08).
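
Annotation (not part of the commit): the \o{} rules documented above can be sketched as a small stand-alone C function. This is only an illustration under stated assumptions, not the code in pcre_compile.c. It models the 8-bit library only, so the limit is 0xff in non-UTF mode and 0x10ffff in UTF mode. The name parse_o_escape and the OCT_* constants are hypothetical; their numeric values simply reuse the error numbers 34, 73, 80 and 81 from the pcreapi.3 list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes. The values reuse PCRE error numbers:
   34 = value too large, 73 = disallowed surrogate code point,
   80 = non-octal character in \o{}, 81 = missing opening brace after \o. */
enum { OCT_OK = 0, OCT_TOO_LARGE = 34, OCT_SURROGATE = 73,
       OCT_BAD_DIGIT = 80, OCT_NO_BRACE = 81 };

/* Parse the text that follows "\o" (so s should point at '{').
   Assumption: 8-bit library limits only (0xff non-UTF, 0x10ffff UTF).
   On success the code point is stored in *value. */
static int parse_o_escape(const char *s, int utf, unsigned long *value)
{
  const unsigned long limit = utf ? 0x10ffffUL : 0xffUL;
  unsigned long c = 0;

  assert(s != NULL && value != NULL);

  if (*s != '{') return OCT_NO_BRACE;
  s++;

  /* Accumulate octal digits; stop as soon as the limit is exceeded, so
     c never grows beyond 24 bits and cannot overflow unsigned long. */
  while (*s >= '0' && *s <= '7')
    {
    c = (c << 3) + (unsigned long)(*s++ - '0');
    if (c > limit) return OCT_TOO_LARGE;
    }

  /* Anything other than '}' here is a bad digit or a missing brace. */
  if (*s != '}') return OCT_BAD_DIGIT;

  /* Surrogates are rejected only in UTF mode. */
  if (utf && c >= 0xd800 && c <= 0xdfff) return OCT_SURROGATE;

  *value = c;
  return OCT_OK;
}
```

With this sketch, parse_o_escape("{123}", 0, &v) succeeds with v equal to 0x53 (the letter S), while "{400}" in non-UTF mode gives 34 and "{1239}" gives 80, because the 9 is not an octal digit.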

Modified: code/trunk/doc/pcresyntax.3
===================================================================
--- code/trunk/doc/pcresyntax.3    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/doc/pcresyntax.3    2013-10-09 10:18:26 UTC (rev 1370)
@@ -29,7 +29,9 @@
   \en         newline (hex 0A)
   \er         carriage return (hex 0D)
   \et         tab (hex 09)
+  \e0dd       character with octal code 0dd 
   \eddd       character with octal code ddd, or backreference
+  \eo{ddd..}  character with octal code ddd.. 
   \exhh       character with hex code hh
   \ex{hhh..}  character with hex code hhh..
 .sp

Modified: code/trunk/doc/pcretest.1
===================================================================
--- code/trunk/doc/pcretest.1    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/doc/pcretest.1    2013-10-09 10:18:26 UTC (rev 1370)
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "01 October 2013" "PCRE 8.34"
+.TH PCRETEST 1 "09 October 2013" "PCRE 8.34"
 .SH NAME
 pcretest - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -618,6 +618,7 @@
   \ev         vertical tab (\ex0b)
   \ennn       octal character (up to 3 octal digits); always
                a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
+  \eo{dd...}  octal character (any number of octal digits)
   \exhh       hexadecimal byte (up to 2 hex digits)
   \ex{hh...}  hexadecimal character (any number of hex digits)
 .\" JOIN
@@ -1104,6 +1105,6 @@
 .rs
 .sp
 .nf
-Last updated: 01 October 2013
+Last updated: 09 October 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi

Modified: code/trunk/pcre_compile.c
===================================================================
--- code/trunk/pcre_compile.c    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/pcre_compile.c    2013-10-09 10:18:26 UTC (rev 1370)
@@ -462,7 +462,7 @@
   "POSIX collating elements are not supported\0"
   "this version of PCRE is compiled without UTF support\0"
   "spare error\0"  /** DEAD **/
-  "character value in \\x{...} sequence is too large\0"
+  "character value in \\x{} or \\o{} is too large\0"
   /* 35 */
   "invalid condition (?(0)\0"
   "\\C not allowed in lookbehind assertion\0"
@@ -516,6 +516,10 @@
   "character value in \\u.... sequence is too large\0"
   "invalid UTF-32 string\0"
   "setting UTF is disabled by the application\0"
+  "non-hex character in \\x{} (closing brace missing?)\0" 
+  /* 80 */ 
+  "non-octal character in \\o{} (closing brace missing?)\0" 
+  "missing opening brace after \\o\0" 
   ;

 /* Table to identify digits and hex digits. This is used when compiling
@@ -1162,16 +1166,51 @@
     if (!utf && c > 0xff) *errorcodeptr = ERR51;
 #endif
     break;
+    
+    /* \o is a relatively new Perl feature, supporting a more general way of 
+    specifying character codes in octal. The only supported form is \o{ddd}. */
+    
+    case CHAR_o:
+    if (ptr[1] != CHAR_LEFT_CURLY_BRACKET) *errorcodeptr = ERR81; else
+      {
+      ptr += 2; 
+      c = 0;
+      overflow = FALSE;
+      while (*ptr >= CHAR_0 && *ptr <= CHAR_7)
+        {
+        register pcre_uint32 cc = *ptr++;
+        if (c == 0 && cc == CHAR_0) continue;     /* Leading zeroes */
+#ifdef COMPILE_PCRE32
+        if (c >= 0x10000000l) { overflow = TRUE; break; }
+#endif
+        c = (c << 3) + cc - CHAR_0 ;
+#if defined COMPILE_PCRE8
+        if (c > (utf ? 0x10ffffU : 0xffU)) { overflow = TRUE; break; }
+#elif defined COMPILE_PCRE16
+        if (c > (utf ? 0x10ffffU : 0xffffU)) { overflow = TRUE; break; }
+#elif defined COMPILE_PCRE32
+        if (utf && c > 0x10ffffU) { overflow = TRUE; break; }
+#endif
+        }
+      if (overflow)
+        {
+        while (*ptr >= CHAR_0 && *ptr <= CHAR_7) ptr++;
+        *errorcodeptr = ERR34;
+        }
+      else if (*ptr == CHAR_RIGHT_CURLY_BRACKET)
+        {
+        if (utf && c >= 0xd800 && c <= 0xdfff) *errorcodeptr = ERR73;
+        }
+      else *errorcodeptr = ERR80;
+      }
+    break;

-    /* \x is complicated. \x{ddd} is a character number which can be greater
-    than 0xff in utf or non-8bit mode, but only if the ddd are hex digits.
-    If not, { is treated as a data character. */
+    /* \x is complicated. In JavaScript, \x must be followed by two hexadecimal
+    numbers. Otherwise it is a lowercase x letter. */

     case CHAR_x:
     if ((options & PCRE_JAVASCRIPT_COMPAT) != 0)
       {
-      /* In JavaScript, \x must be followed by two hexadecimal numbers.
-      Otherwise it is a lowercase x letter. */
       if (MAX_255(ptr[1]) && (digitab[ptr[1]] & ctype_xdigit) != 0
         && MAX_255(ptr[2]) && (digitab[ptr[2]] & ctype_xdigit) != 0)
         {
@@ -1188,73 +1227,86 @@
 #endif
           }
         }
-      break;
-      }
-
-    if (ptr[1] == CHAR_LEFT_CURLY_BRACKET)
-      {
-      const pcre_uchar *pt = ptr + 2;
-
-      c = 0;
-      overflow = FALSE;
-      while (MAX_255(*pt) && (digitab[*pt] & ctype_xdigit) != 0)
+      }    /* End JavaScript handling */
+    
+    /* Handle \x in Perl's style. \x{ddd} is a character number which can be
+    greater than 0xff in utf or non-8bit mode, but only if the ddd are hex
+    digits. If not, { used to be treated as a data character. However, Perl
+    seems to read hex digits up to the first non-such, and ignore the rest, so
+    that, for example \x{zz} matches a binary zero. This seems crazy, so PCRE
+    now gives an error. */
+      
+    else
+      {    
+      if (ptr[1] == CHAR_LEFT_CURLY_BRACKET)
         {
-        register pcre_uint32 cc = *pt++;
-        if (c == 0 && cc == CHAR_0) continue;     /* Leading zeroes */
-
+        ptr += 2;
+        c = 0;
+        overflow = FALSE;
+        while (MAX_255(*ptr) && (digitab[*ptr] & ctype_xdigit) != 0)
+          {
+          register pcre_uint32 cc = *ptr++;
+          if (c == 0 && cc == CHAR_0) continue;     /* Leading zeroes */
+      
 #ifdef COMPILE_PCRE32
-        if (c >= 0x10000000l) { overflow = TRUE; break; }
+          if (c >= 0x10000000l) { overflow = TRUE; break; }
 #endif
-
+      
 #ifndef EBCDIC  /* ASCII/UTF-8 coding */
-        if (cc >= CHAR_a) cc -= 32;               /* Convert to upper case */
-        c = (c << 4) + cc - ((cc < CHAR_A)? CHAR_0 : (CHAR_A - 10));
+          if (cc >= CHAR_a) cc -= 32;               /* Convert to upper case */
+          c = (c << 4) + cc - ((cc < CHAR_A)? CHAR_0 : (CHAR_A - 10));
 #else           /* EBCDIC coding */
-        if (cc >= CHAR_a && cc <= CHAR_z) cc += 64;  /* Convert to upper case */
-        c = (c << 4) + cc - ((cc >= CHAR_0)? CHAR_0 : (CHAR_A - 10));
+          if (cc >= CHAR_a && cc <= CHAR_z) cc += 64;  /* Convert to upper case */
+          c = (c << 4) + cc - ((cc >= CHAR_0)? CHAR_0 : (CHAR_A - 10));
 #endif
-
+      
 #if defined COMPILE_PCRE8
-        if (c > (utf ? 0x10ffffU : 0xffU)) { overflow = TRUE; break; }
+          if (c > (utf ? 0x10ffffU : 0xffU)) { overflow = TRUE; break; }
 #elif defined COMPILE_PCRE16
-        if (c > (utf ? 0x10ffffU : 0xffffU)) { overflow = TRUE; break; }
+          if (c > (utf ? 0x10ffffU : 0xffffU)) { overflow = TRUE; break; }
 #elif defined COMPILE_PCRE32
-        if (utf && c > 0x10ffffU) { overflow = TRUE; break; }
+          if (utf && c > 0x10ffffU) { overflow = TRUE; break; }
 #endif
-        }
+          }
+      
+        if (overflow)
+          {
+          while (MAX_255(*ptr) && (digitab[*ptr] & ctype_xdigit) != 0) ptr++;
+          *errorcodeptr = ERR34;
+          }
+      
+        else if (*ptr == CHAR_RIGHT_CURLY_BRACKET)
+          {
+          if (utf && c >= 0xd800 && c <= 0xdfff) *errorcodeptr = ERR73;
+          }
+      
+        /* If the sequence of hex digits does not end with '}', give an error.
+        We used just to recognize this construct and fall through to the normal
+        \x handling, but nowadays Perl gives an error, which seems much more
+        sensible, so we do too. */
+        
+        else *errorcodeptr = ERR79;
+        }   /* End of \x{} processing */

-      if (overflow)
-        {
-        while (MAX_255(*pt) && (digitab[*pt] & ctype_xdigit) != 0) pt++;
-        *errorcodeptr = ERR34;
-        }
+      /* Read a single-byte hex-defined char (up to two hex digits after \x) */

-      if (*pt == CHAR_RIGHT_CURLY_BRACKET)
-        {
-        if (utf && c >= 0xd800 && c <= 0xdfff) *errorcodeptr = ERR73;
-        ptr = pt;
-        break;
-        }
-
-      /* If the sequence of hex digits does not end with '}', then we don't
-      recognize this construct; fall through to the normal \x handling. */
-      }
-
-    /* Read just a single-byte hex-defined char */
-
-    c = 0;
-    while (i++ < 2 && MAX_255(ptr[1]) && (digitab[ptr[1]] & ctype_xdigit) != 0)
-      {
-      pcre_uint32 cc;                          /* Some compilers don't like */
-      cc = *(++ptr);                           /* ++ in initializers */
+      else
+        { 
+        c = 0;
+        while (i++ < 2 && MAX_255(ptr[1]) && (digitab[ptr[1]] & ctype_xdigit) != 0)
+          {
+          pcre_uint32 cc;                          /* Some compilers don't like */
+          cc = *(++ptr);                           /* ++ in initializers */
 #ifndef EBCDIC  /* ASCII/UTF-8 coding */
-      if (cc >= CHAR_a) cc -= 32;              /* Convert to upper case */
-      c = c * 16 + cc - ((cc < CHAR_A)? CHAR_0 : (CHAR_A - 10));
+          if (cc >= CHAR_a) cc -= 32;              /* Convert to upper case */
+          c = c * 16 + cc - ((cc < CHAR_A)? CHAR_0 : (CHAR_A - 10));
 #else           /* EBCDIC coding */
-      if (cc <= CHAR_z) cc += 64;              /* Convert to upper case */
-      c = c * 16 + cc - ((cc >= CHAR_0)? CHAR_0 : (CHAR_A - 10));
+          if (cc <= CHAR_z) cc += 64;              /* Convert to upper case */
+          c = c * 16 + cc - ((cc >= CHAR_0)? CHAR_0 : (CHAR_A - 10));
 #endif
-      }
+          }
+        }     /* End of \xdd handling */
+      }       /* End of Perl-style \x handling */
     break;

     /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
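
Annotation (not part of the commit): the behaviour change for \x{...} is easier to see in isolation. The following is a simplified stand-alone sketch, not the PCRE source. It assumes an ASCII character set and the 8-bit library limits, and it leaves out the PCRE_JAVASCRIPT_COMPAT branch. The names parse_x_escape, hex_digit_value and the HEX_* constants are hypothetical; the numeric values reuse error numbers 34, 73 and 79.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes reusing PCRE error numbers:
   34 = value too large, 73 = disallowed surrogate code point,
   79 = non-hex character in \x{} (closing brace missing?). */
enum { HEX_OK = 0, HEX_TOO_LARGE = 34, HEX_SURROGATE = 73,
       HEX_BAD_DIGIT = 79 };

/* Value of one hex digit; assumes ASCII and that ch is a hex digit. */
static unsigned long hex_digit_value(char ch)
{
  int lc = tolower((unsigned char)ch);
  return (unsigned long)(isdigit(lc) ? lc - '0' : lc - 'a' + 10);
}

/* Parse the text that follows "\x" in non-JavaScript mode.
   Assumption: 8-bit library limits (0xff non-UTF, 0x10ffff UTF). */
static int parse_x_escape(const char *s, int utf, unsigned long *value)
{
  const unsigned long limit = utf ? 0x10ffffUL : 0xffUL;
  unsigned long c = 0;

  assert(s != NULL && value != NULL);

  if (*s == '{')
    {
    s++;
    while (isxdigit((unsigned char)*s))
      {
      c = (c << 4) + hex_digit_value(*s++);
      if (c > limit) return HEX_TOO_LARGE;
      }
    /* New in this revision: a non-hex character or a missing '}' is an
       error instead of falling back to the plain \x form. */
    if (*s != '}') return HEX_BAD_DIGIT;
    if (utf && c >= 0xd800 && c <= 0xdfff) return HEX_SURROGATE;
    }
  else
    {
    /* Plain \x: read zero, one or two hex digits. */
    int i;
    for (i = 0; i < 2 && isxdigit((unsigned char)*s); i++)
      c = c * 16 + hex_digit_value(*s++);
    }

  *value = c;
  return HEX_OK;
}
```

Under these assumptions, "{zz}", "{12Z" and "{" all give 79, matching the three new patterns added to testinput2, while a plain \x followed by a non-hex character such as "Z" still succeeds with the value zero, as in the existing /^A\xZ/ test.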

Modified: code/trunk/pcre_internal.h
===================================================================
--- code/trunk/pcre_internal.h    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/pcre_internal.h    2013-10-09 10:18:26 UTC (rev 1370)
@@ -2315,7 +2315,8 @@
        ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
        ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
        ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69,
-       ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERRCOUNT };
+       ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79,
+       ERR80, ERR81, ERRCOUNT };

/* JIT compiling modes. The function list is indexed by them. */
enum { JIT_COMPILE, JIT_PARTIAL_SOFT_COMPILE, JIT_PARTIAL_HARD_COMPILE,

Modified: code/trunk/pcreposix.c
===================================================================
--- code/trunk/pcreposix.c    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/pcreposix.c    2013-10-09 10:18:26 UTC (rev 1370)
@@ -110,7 +110,7 @@
   REG_BADPAT,  /* POSIX collating elements are not supported */
   REG_INVARG,  /* this version of PCRE is not compiled with PCRE_UTF8 support */
   REG_BADPAT,  /* spare error */
-  REG_BADPAT,  /* character value in \x{...} sequence is too large */
+  REG_BADPAT,  /* character value in \x{} or \o{} is too large */
   /* 35 */
   REG_BADPAT,  /* invalid condition (?(0) */
   REG_BADPAT,  /* \C not allowed in lookbehind assertion */
@@ -163,7 +163,11 @@
   REG_BADPAT,  /* overlong MARK name */
   REG_BADPAT,  /* character value in \u.... sequence is too large */
   REG_BADPAT,  /* invalid UTF-32 string (should not occur) */
-  REG_BADPAT   /* setting UTF is disabled by the application */
+  REG_BADPAT,  /* setting UTF is disabled by the application */
+  REG_BADPAT,  /* non-hex character in \\x{} (closing brace missing?) */ 
+  /* 80 */ 
+  REG_BADPAT,  /* non-octal character in \o{} (closing brace missing?) */ 
+  REG_BADPAT   /* missing opening brace after \o */
 };

/* Table of texts corresponding to POSIX error codes */

Modified: code/trunk/pcretest.c
===================================================================
--- code/trunk/pcretest.c    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/pcretest.c    2013-10-09 10:18:26 UTC (rev 1370)
@@ -4497,6 +4497,23 @@
         while (i++ < 2 && isdigit(*p) && *p != '8' && *p != '9')
           c = c * 8 + *p++ - '0';
         break;
+        
+        case 'o':
+        if (*p == '{')
+          {
+          pcre_uint8 *pt = p;
+          c = 0;
+          for (pt++; isdigit(*pt) && *pt != '8' && *pt != '9'; pt++)
+            {
+            if (++i == 12)
+              fprintf(outfile, "** Too many octal digits in \\o{...} item; "
+                               "using only the first twelve.\n");
+            else c = c * 8 + *pt - '0';
+            }
+          if (*pt == '}') p = pt + 1;
+            else fprintf(outfile, "** Missing } after \\o{ (assumed)\n");   
+          }
+        break;

         case 'x':
         if (*p == '{')

Modified: code/trunk/testdata/testinput1
===================================================================
--- code/trunk/testdata/testinput1    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testinput1    2013-10-09 10:18:26 UTC (rev 1370)
@@ -5635,4 +5635,7 @@
 /^A\xZ/
     A\0Z

+/^A\o{123}B/
+    A\123B
+
 /-- End of testinput1 --/

Modified: code/trunk/testdata/testinput14
===================================================================
--- code/trunk/testdata/testinput14    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testinput14    2013-10-09 10:18:26 UTC (rev 1370)
@@ -85,9 +85,12 @@
     a\nb
     ** Failers (too big char) 
     A\x{123}B 
+    A\o{443}B

/\x{100}/I

+/\o{400}/I
+
 /  (?: [\040\t] |  \(
 (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
 \)  )*                          # optional leading comment

Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testinput2    2013-10-09 10:18:26 UTC (rev 1370)
@@ -3892,4 +3892,15 @@

/-- End of special auto-possessive tests --/

+/^A\o{1239}B/
+    A\123B
+
+/^A\oB/
+    
+/^A\x{zz}B/ 
+
+/^A\x{12Z/
+
+/^A\x{/
+
 /-- End of testinput2 --/

Modified: code/trunk/testdata/testinput23
===================================================================
--- code/trunk/testdata/testinput23    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testinput23    2013-10-09 10:18:26 UTC (rev 1370)
@@ -7,6 +7,8 @@

/\x{10000}/

+/\o{20000}/
+
/-- Check character ranges --/

/[\H]/BZSI

Modified: code/trunk/testdata/testinput26
===================================================================
--- code/trunk/testdata/testinput26    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testinput26    2013-10-09 10:18:26 UTC (rev 1370)
@@ -4,6 +4,8 @@

/\x{110000}/8

+/\o{4200000}/8
+
 /\C/8
     \x{110000}

Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testinput5    2013-10-09 10:18:26 UTC (rev 1370)
@@ -4,18 +4,32 @@

/\x{110000}/8DZ

+/\o{4200000}/8DZ
+
/\x{ffffffff}/8

+/\o{37777777777}/8
+
/\x{100000000}/8

+/\o{77777777777}/8
+
/\x{d800}/8

+/\o{154000}/8
+
/\x{dfff}/8

+/\o{157777}/8
+
/\x{d7ff}/8

+/\o{153777}/8
+
/\x{e000}/8

+/\o{170000}/8
+
 /^\x{100}a\x{1234}/8
     \x{100}a\x{1234}bcd

Modified: code/trunk/testdata/testoutput1
===================================================================
--- code/trunk/testdata/testoutput1    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput1    2013-10-09 10:18:26 UTC (rev 1370)
@@ -9268,4 +9268,8 @@
     A\0Z 
  0: A\x00Z

+/^A\o{123}B/
+    A\123B
+ 0: ASB
+
 /-- End of testinput1 --/

Modified: code/trunk/testdata/testoutput11-16
===================================================================
--- code/trunk/testdata/testoutput11-16    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput11-16    2013-10-09 10:18:26 UTC (rev 1370)
@@ -328,7 +328,7 @@
 ------------------------------------------------------------------

/\x{110000}/8BM
-Failed: character value in \x{...} sequence is too large at offset 9
+Failed: character value in \x{} or \o{} is too large at offset 9

/[\x{ff}]/8BM
Memory allocation (code space): 14

Modified: code/trunk/testdata/testoutput11-32
===================================================================
--- code/trunk/testdata/testoutput11-32    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput11-32    2013-10-09 10:18:26 UTC (rev 1370)
@@ -328,7 +328,7 @@
 ------------------------------------------------------------------

Modified: code/trunk/testdata/testoutput11-8
===================================================================
--- code/trunk/testdata/testoutput11-8    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput11-8    2013-10-09 10:18:26 UTC (rev 1370)
@@ -328,7 +328,7 @@
 ------------------------------------------------------------------

Modified: code/trunk/testdata/testoutput14
===================================================================
--- code/trunk/testdata/testoutput14    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput14    2013-10-09 10:18:26 UTC (rev 1370)
@@ -147,10 +147,17 @@
 ** Character \x{123} is greater than 255 and UTF-8 mode is not enabled.
 ** Truncation will probably give the wrong result.
 No match
+    A\o{443}B 
+** Character \x{123} is greater than 255 and UTF-8 mode is not enabled.
+** Truncation will probably give the wrong result.
+No match

/\x{100}/I
-Failed: character value in \x{...} sequence is too large at offset 6
+Failed: character value in \x{} or \o{} is too large at offset 6

+/\o{400}/I
+Failed: character value in \x{} or \o{} is too large at offset 6
+
 /  (?: [\040\t] |  \(
 (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  |  \( (?:  [^\\\x80-\xff\n\015()]  |  \\ [^\x80-\xff]  )* \)  )*
 \)  )*                          # optional leading comment

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput2    2013-10-09 10:18:26 UTC (rev 1370)
@@ -13513,4 +13513,19 @@

/-- End of special auto-possessive tests --/

+/^A\o{1239}B/
+Failed: non-octal character in \o{} (closing brace missing?) at offset 8
+
+/^A\oB/
+Failed: missing opening brace after \o at offset 3
+
+/^A\x{zz}B/
+Failed: non-hex character in \x{} (closing brace missing?) at offset 5
+
+/^A\x{12Z/
+Failed: non-hex character in \x{} (closing brace missing?) at offset 7
+
+/^A\x{/
+Failed: non-hex character in \x{} (closing brace missing?) at offset 5
+
/-- End of testinput2 --/

Modified: code/trunk/testdata/testoutput23
===================================================================
--- code/trunk/testdata/testoutput23    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput23    2013-10-09 10:18:26 UTC (rev 1370)
@@ -7,8 +7,10 @@
  0: \x{ffff}

/\x{10000}/
-Failed: character value in \x{...} sequence is too large at offset 8
+Failed: character value in \x{} or \o{} is too large at offset 8

+/\o{20000}/
+
/-- Check character ranges --/

/[\H]/BZSI

Modified: code/trunk/testdata/testoutput26
===================================================================
--- code/trunk/testdata/testoutput26    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput26    2013-10-09 10:18:26 UTC (rev 1370)
@@ -3,8 +3,11 @@
 /-- Non-UTF characters --/

/\x{110000}/8
-Failed: character value in \x{...} sequence is too large at offset 9
+Failed: character value in \x{} or \o{} is too large at offset 9

+/\o{4200000}/8
+Failed: character value in \x{} or \o{} is too large at offset 10
+
 /\C/8
     \x{110000}
 Error -10 (bad UTF-32 string) offset=0 reason=3

Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2013-10-08 15:06:46 UTC (rev 1369)
+++ code/trunk/testdata/testoutput5    2013-10-09 10:18:26 UTC (rev 1370)
@@ -3,24 +3,43 @@
     results in 8-bit and 16-bit modes are excluded (see tests 16 and 17). --/

/\x{110000}/8DZ
-Failed: character value in \x{...} sequence is too large at offset 9
+Failed: character value in \x{} or \o{} is too large at offset 9

+/\o{4200000}/8DZ
+Failed: character value in \x{} or \o{} is too large at offset 10
+
/\x{ffffffff}/8
-Failed: character value in \x{...} sequence is too large at offset 11
+Failed: character value in \x{} or \o{} is too large at offset 11

+/\o{37777777777}/8
+Failed: character value in \x{} or \o{} is too large at offset 14
+
/\x{100000000}/8
-Failed: character value in \x{...} sequence is too large at offset 12
+Failed: character value in \x{} or \o{} is too large at offset 12

+/\o{77777777777}/8
+Failed: character value in \x{} or \o{} is too large at offset 14
+
/\x{d800}/8
Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 7

+/\o{154000}/8
+Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 9
+
/\x{dfff}/8
Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 7

+/\o{157777}/8
+Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 9
+
/\x{d7ff}/8

+/\o{153777}/8
+
/\x{e000}/8

+/\o{170000}/8
+
 /^\x{100}a\x{1234}/8
     \x{100}a\x{1234}bcd
  0: \x{100}a\x{1234}
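
Note on the octal values used in the tests above: several of the new \o{} patterns are octal spellings of the neighbouring \x{} patterns (for example \o{154000} is \x{d800}, and \o{4200000} is \x{110000}). The following is a minimal Python sketch for checking that arithmetic and the kinds of failure the expected output shows. It is not PCRE's C implementation. The function name parse_octal_escape, its parameters, and the shortened error strings are hypothetical, and treating an empty \o{} as an error is an assumption, since no test in this diff covers it.

```python
# Illustrative sketch only: NOT the code in pcre_compile.c.
# All names below are hypothetical and exist only for this example.

UTF_MAX = 0x10FFFF                   # largest Unicode code point
SURROGATES = range(0xD800, 0xE000)   # 0xd800..0xdfff inclusive


def parse_octal_escape(body, utf=True, max_value=UTF_MAX):
    """Return the code point for the text between the braces of \\o{...}.

    body      -- the characters found between '{' and '}'
    utf       -- if True, reject surrogate code points
    max_value -- largest value accepted (0xff would model non-UTF 8-bit mode)
    """
    # Assumption: an empty body is rejected like any other bad digit string.
    if body == "" or any(ch not in "01234567" for ch in body):
        raise ValueError("non-octal character in \\o{}")
    value = int(body, 8)
    if value > max_value:
        raise ValueError("character value in \\x{} or \\o{} is too large")
    if utf and value in SURROGATES:
        raise ValueError("disallowed Unicode code point")
    return value


if __name__ == "__main__":
    for digits in ("123", "443", "153777", "154000", "4200000", "1239"):
        try:
            print(digits, "->", hex(parse_octal_escape(digits)))
        except ValueError as exc:
            print(digits, "-> Failed:", exc)
```

Passing max_value=0xff approximates the non-UTF 8-bit case, in which \o{400} (0x100) is reported as too large.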
