On Sep 18, 2012, at 3:31 PM, Christian Persch <chpe@???> wrote:
> Actually it would be more useful to look at it sooner; you can get the
> current code from gitorious at
> https://gitorious.org/~chpe/pcre/chpe-pcre/ (pcre32 branch). However
> note that the masking isn't actually working yet! :-)
OK, I checked out the code as follows:
git clone -b pcre32 git://gitorious.org/~chpe/pcre/chpe-pcre.git chpe-pcre
There are a few issues.
In pcre32_valid_utf32.c, the validity checking is too lenient in two respects. First of all, each code unit is masked with UTF32_MASK (0x1ffffful) before checking, so most invalid code units will be treated as though they were valid. You could change this line:
c = *p & UTF32_MASK;
to
c = *p;
Or else, make the masking conditional on something like PCRE_MASK_UTF32_BEYOND_1FFFFF, which should be false by default, for Unicode conformance.
Second, the test for noncharacters is incomplete. It should reject all noncharacters, namely, the range U+FDD0..U+FDEF, and any code unit whose last four digits are FFFE or FFFF. You could change this line:
if (c == 0xfffeu)
to
if (c >= 0xfdd0 && (c <= 0xfdef || ((c & 0xffff) | 1) == 0xffff))
I suggest generalizing the meaning of the macro PCRE_UTF32_ERR2 from "Disallowed character 0xfffe" to "Noncharacter".
In pcre32_ord2utf32.c, there are similar issues. UTF32_MASK is used when it shouldn't be (by default); the checking for noncharacters is incomplete; and if a surrogate or out-of-range code unit is encountered, it's replaced by U+FFFE, which is itself a noncharacter. What is the rationale for that? To ensure that the output is valid UTF-32, the conventional value to use is U+FFFD REPLACEMENT CHARACTER.
In pcre_internal.h, there's a macro GETCHAR:
#define GETCHAR(c, eptr) \
c = *eptr & UTF32_MASK;
I wouldn't use UTF32_MASK here at all; or else, make the masking conditional on something like PCRE_MASK_UTF32_BEYOND_1FFFFF, which should be false by default.
By the way, these macros could be made more robust by adding parentheses to avoid side-effects; and they seem to have superfluous semi-colons. On first glance, it doesn't look like the lack of parentheses would actually cause any bugs the way the macros are currently called, but I can't say for sure, not having studied all the calls carefully.
You said, "the masking isn't actually working yet". Do you mean that you're trying to use the masking to enable the feature where you store extra information in the upper eleven bits of each code unit, and it's failing when you have nonzero bits up there? Or do you mean that the code is failing even when you test it with valid UTF-32?
I'll be happy to do further studying and testing, but first I'd like to understand what the real issue is at this point. The masking feature might be more trouble than it's worth at this stage. Probably the first priority is to get the implementation right for valid UTF-32. I'm interested in optional support for code points beyond U+10FFFF, but if the code isn't working right for ordinary code points, we shouldn't put the cart before the horse.
Best wishes,
Tom
文林 Wenlin Institute, Inc. Software for Learning Chinese
E-mail: wenlin@??? Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯