Re: [pcre-dev] [Bug 1295] add 32-bit library

Author: Philip Hazel
Date:
To: Christian Persch
CC: PCRE Development Mailing List
Subject: Re: [pcre-dev] [Bug 1295] add 32-bit library
On Sun, 28 Oct 2012, Christian Persch wrote:

> I'm against a compile time switch like this;


(I knew I should have kept out of this discussion, but now that I
haven't...)

I too am against a compile-time switch.

> that there is just one bit left in the 32-bit integer for the pcre
> flags, so I'd be reluctant to use it for something like this).


Actually, there are two bits. Also, if this is an exec-time only feature
(as is the current implementation), you *could* re-use one of the
existing bits that is currently compile-time only (e.g. PCRE_EXTENDED).

> * If the data isn't pre-checked by the programme itself, it simply
> must _not_ pass PCRE_NO_UTF32_CHECK; pcre32_exec() will do that
> check, returning an error for anything that's not UTF-32 (including
> the case of those high bits being set!).
>
> * If the data has been pre-checked for UTF-32-ness by the programme,
> you can save a bit of runtime by passing PCRE_NO_UTF32_CHECK. This
> applies whether the programme leaves the high bits as 0, or stores
> internal stuff there. The bit masking simply allows the programme to
> save making a sanitised copy of the data just to pass it to
> pcre32_exec().


I am not unhappy with that, when it comes down to it. My only worry is
the fact that a single switch does two things, but I can probably live
with that as long as there is good documentation (something along the
lines of those two paragraphs). I'll make sure there is. :-)

> It _is not_ matching the code 0x10000042. The input to pcre32_exec()
> may contain that bit pattern, but it is simply transformed (just like
> the gzip example above) before it reaches the matching stage.


I think I see your point here: by setting PCRE_NO_UTF32_CHECK the caller
has said "the bottom 21 bits of each 32-bit value are guaranteed by me to
be valid UTF-32; just use them without checking, and ignore the top 11
bits". This is a different situation to UTF-8 and UTF-16 where there are
no "spare" bits in the data values. I suppose a hypothetical analogy
would be storing UTF-16 in the bottom 16 bits of 32-bit values.
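
To make the "bottom 21 bits" reading concrete, here is a minimal sketch in C.
It does not call PCRE at all; the names CODE_POINT_MASK, low21(),
is_valid_utf32() and first_invalid_unit() are made up for illustration and are
not part of the PCRE API. It only shows the arithmetic being discussed: a
value such as 0x10000042 fails a full UTF-32 validity check, but its bottom 21
bits are the ordinary code point 0x42.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; none of this is PCRE API. */
#define CODE_POINT_MASK 0x001FFFFFu   /* bottom 21 bits of a 32-bit unit */

/* Keep the bottom 21 bits and discard the top 11 "spare" bits. */
static uint32_t low21(uint32_t unit)
{
    return unit & CODE_POINT_MASK;
}

/* A UTF-32 value is valid if it is at most 0x10FFFF and not a surrogate. */
static int is_valid_utf32(uint32_t c)
{
    return c <= 0x10FFFFu && !(c >= 0xD800u && c <= 0xDFFFu);
}

/*
 * The "check enabled" case: inspect each full 32-bit unit, high bits
 * included. Returns the index of the first invalid unit, or -1 if every
 * unit is valid UTF-32.
 */
static long first_invalid_unit(const uint32_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!is_valid_utf32(buf[i]))
            return (long)i;
    }
    return -1;
}
```

Note that 21 bits can hold values up to 0x1FFFFF, which is larger than
0x10FFFF, so masking alone does not make a value valid; in the no-check case
the caller is still the one guaranteeing that the masked values are valid
UTF-32.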

> In conclusion: I don't think we need this as a compile-time flag, nor
> as a run-time flag.


I'm beginning to agree. :-)

Philip

--
Philip Hazel