I don't know if anyone has had a chance yet to study the change I proposed in an earlier message. In case it's gotten lost in the shuffle, here's the message again. Also, please note that the Unicode conformance requirements that currently appear to be violated include these:
"C4 A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence."
"C10 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."
The standard also notes, "There are important security issues associated with encoding conversion, especially with the conversion of malformed text." See also <http://www.unicode.org/reports/tr36/>.
It's fine to have explicit ways to override these conformance requirements (I propose a macro PCRE_MASK_UTF32_BEYOND_1FFFFF), but PCRE_NO_UTF32_CHECK in itself should only turn off the error reporting, not turn off the conformance. In other words, if the user specifies PCRE_NO_UTF32_CHECK, of course PCRE is freed from the responsibility to report an error for ill-formed UTF-32, but it still has the responsibility not to report a regex as matching when (according to the standard) it doesn't match. For example, the regex "A" matches the code for the letter "A" (0x00000041 in UTF-32), but not a code such as 0x10000041, which becomes 0x00000041 only after masking with 0x1fffff.
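To make the "A" example concrete, here is a minimal sketch of the two behaviours. The function names are hypothetical and are not PCRE code; the only things taken from PCRE are the 0x1fffff mask and the idea of comparing a subject code unit with a pattern literal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: a UTF-32 code unit is well-formed
   only if it is a Unicode scalar value (0..0x10FFFF, excluding surrogates). */
bool is_wellformed_utf32(uint32_t unit)
{
    return unit <= 0x10FFFFu && !(unit >= 0xD800u && unit <= 0xDFFFu);
}

/* Comparison without masking: the code unit is used exactly as stored. */
bool literal_matches_unmasked(uint32_t unit, uint32_t literal)
{
    assert(is_wellformed_utf32(literal)); /* pattern literals are assumed valid */
    return unit == literal;
}

/* Comparison with masking: bits above bit 20 are discarded first, so an
   ill-formed code unit can be silently turned into a valid character. */
bool literal_matches_masked(uint32_t unit, uint32_t literal)
{
    assert(is_wellformed_utf32(literal)); /* pattern literals are assumed valid */
    return (unit & 0x1FFFFFu) == literal;
}
```

With these definitions, literal_matches_unmasked(0x10000041, 0x41) is false, while literal_matches_masked(0x10000041, 0x41) is true, which is the match I'm arguing should not be reported by default.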
--------
It's good that the masking with 0x1fffff now only occurs if PCRE_NO_UTF32_CHECK is specified. The Unicode conformance can be improved, and the code made slightly smaller, faster, and more flexible, with a simple change to pcre_internal.h. By default, PCRE_NO_UTF32_CHECK should disable checking without enabling masking. Masking can be enabled by a compile-time option. The definition of UTF32_MASK can be replaced by the following:
#if defined PCRE_MASK_UTF32_BEYOND_1FFFFF
#define ADJUST_UTF32_CODE_UNIT(c) ((c) & 0x1fffffu)
#else
#define ADJUST_UTF32_CODE_UNIT(c) (c)
#endif
and these macros can be revised as follows:
#define GETCHAR(c, eptr) \
  c = ADJUST_UTF32_CODE_UNIT(*(eptr));
#define GETCHARTEST(c, eptr) \
  c = *(eptr); \
  if (utf) c = ADJUST_UTF32_CODE_UNIT(c);
#define GETCHARINC(c, eptr) \
  c = ADJUST_UTF32_CODE_UNIT(*(eptr)++);
#define GETCHARINCTEST(c, eptr) \
  c = *(eptr)++; \
  if (utf) c = ADJUST_UTF32_CODE_UNIT(c);
#define RAWUCHAR(eptr) \
  ADJUST_UTF32_CODE_UNIT(*(eptr))
#define RAWUCHARINC(eptr) \
  ADJUST_UTF32_CODE_UNIT(*(eptr)++)
#define RAWUCHARTEST(eptr) \
  (utf ? (ADJUST_UTF32_CODE_UNIT(*(eptr))) : *(eptr))
#define RAWUCHARINCTEST(eptr) \
  (utf ? (ADJUST_UTF32_CODE_UNIT(*(eptr)++)) : *(eptr)++)
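As a rough illustration of how the compile-time switch would behave, here is a small stand-alone sketch. It copies ADJUST_UTF32_CODE_UNIT and GETCHARINC from above; count_literal is a hypothetical helper written only for this example and is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Masking happens only when explicitly requested at compile time. */
#if defined PCRE_MASK_UTF32_BEYOND_1FFFFF
#define ADJUST_UTF32_CODE_UNIT(c) ((c) & 0x1fffffu)
#else
#define ADJUST_UTF32_CODE_UNIT(c) (c)
#endif

/* Same shape as the revised GETCHARINC above. */
#define GETCHARINC(c, eptr) \
  c = ADJUST_UTF32_CODE_UNIT(*(eptr)++);

/* Hypothetical helper: counts the code units in subject that compare equal
   to literal after being fetched through GETCHARINC. */
size_t count_literal(const uint32_t *subject, size_t length, uint32_t literal)
{
    const uint32_t *eptr = subject;
    const uint32_t *end;
    size_t count = 0;

    assert(subject != NULL || length == 0);
    end = subject + length;

    while (eptr < end) {
        uint32_t c;
        GETCHARINC(c, eptr)   /* the macro supplies its own semicolon */
        if (c == literal)
            count++;
    }
    return count;
}
```

For the subject { 0x00000041, 0x10000041, 0x00000042 } and the literal 0x41, a default build counts one match. Building with -DPCRE_MASK_UTF32_BEYOND_1FFFFF should count two, because the ill-formed second unit is masked down to 0x41.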
Best wishes,
Tom
文林 Wenlin Institute, Inc. Software for Learning Chinese
E-mail: wenlin@??? Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯