Re: [pcre-dev] [Bug 1295] add 32-bit library

Author: Christian Persch
Date:
To: Tom Bishop, Wenlin Institute
CC: PCRE Development Mailing List
Subject: Re: [pcre-dev] [Bug 1295] add 32-bit library

Hi!

Am Thu, 20 Sep 2012 11:44:42 -0400
schrieb "Tom Bishop, Wenlin Institute" <tangmu@???>:
> In pcre32_valid_utf32.c, the validity checking is too lenient in two
> respects. First of all, each code unit is masked with UTF32_MASK
> (0x1ffffful) before checking, so most invalid code units will be
> treated as though they were valid. You could change this line:
>
> c = *p & UTF32_MASK;
>
> to
>
> c = *p;

Coincidentally, this is the bit of code that I said was temporary :-)

> Or else, make the masking conditional on something like
> PCRE_MASK_UTF32_BEYOND_1FFFFF, which should be false by default, for
> Unicode conformance.
>
> Second, the test for noncharacters is incomplete. It should reject
> all noncharacters, namely, the range U+FDD0..U+FDEF, and any code
> unit whose last four digits are FFFE or FFFF.

Yes, I had already noticed this discrepancy, too. The UTF-8 code
doesn't do any checks like that, and the UTF-16 code from which the
32-bit code is copied only rejects 0xfffe. Anyway, I have now
implemented this for all of UTF-8, 16 and 32.

> I suggest generalizing the meaning of the macro PCRE_UTF32_ERR2 from
> "Disallowed character 0xfffe" to "Noncharacter".

Done as part of the above.

> In pcre32_ord2utf32.c, there are similar issues. UTF32_MASK is used
> when it shouldn't be (by default); the checking for noncharacters is
> incomplete; and if a surrogate or out-of-range code unit is
> encountered, it's replaced by U+FFFE, which is itself a noncharacter.
> What is the rationale for that? To ensure that the output is valid
> UTF-32, the conventional value to use is U+FFFD REPLACEMENT CHARACTER.

Right. Again this is copied from the 16-bit code (and the 8-bit code
does the same too), and since afaik the only places this code is called
(it's a PRIV function) is from internal code where we know the
character is valid... so all that shouldn't be necessary. Taking this
code out still lets the test suite pass, so I've committed that to my
branch.

> In pcre_internal.h, there's a macro GETCHAR:
>
> #define GETCHAR(c, eptr) \
> c = *eptr & UTF32_MASK;
>
> I wouldn't use UTF32_MASK here at all; or else, make the masking
> conditional on something like PCRE_MASK_UTF32_BEYOND_1FFFFF, which
> should be false by default.

No, this is exactly the place where the masking should take place; it's
these macros that are used to iterate over the data string. Now if we
do want to support that PCRE_MASK_UTF32_BEYOND_1FFFFF as a *compile
time* define, then we can simply redefine UTF32_MASK to 0xffffffffu in
that case and rely on the compiler to do away with the no-op AND (or
just re-define the macros); but making that a runtime option isn't so
easy, and again I don't think we need that facility for now.

> By the way, these macros could be made more robust by adding
> parentheses to avoid side-effects; and they seem to have superfluous
> semi-colons. On first glance, it doesn't look like the lack of
> parentheses would actually cause any bugs the way the macros are
> currently called, but I can't say for sure, not having studied all
> the calls carefully.

You're right. But the 8- and 16-bit macros do the same, so I'm not
going to change this at this time. May still be worthwhile to do
separately, of course.

> You said, "the masking isn't actually working yet". Do you mean that
> you're trying to use the masking to enable the feature where you
> store extra information in the upper eleven bits of each code unit,
> and it's failing when you have nonzero bits up there? Or do you mean
> that the code is failing even when you test it with valid UTF-32?

No, the test suite as it exists on the branch *passes all tests*,
including the UTF ones. It's only when I change pcretest to pass in
high bits or'd to UTF-32 that things aren't working completely yet; I'm
still working on fixing that. I already have a good idea where the
problem lays, and it shouldn't be too much work. It really helps that
pcre has such an extensive test suite!

Regards,
    Christian

This message is part of the following thread:
	the complete thread tree sorted by date
	Tom Bishop, Wenlin Institute at
	Tom Bishop, Wenlin Institute at