[pcre-dev] [Bug 1295] add 32-bit library

From: Christian Persch
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1295] add 32-bit library
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1295




--- Comment #3 from Christian Persch (GNOME) <chpe@???> 2012-09-17 22:09:55 ---
Tom Bishop wrote:
> > ...Since UTF-32 only occupies 21 bits of the 32-bit characters,
> > it's useful for implementations to use the upper bits to store
> > extra info (flags, etc). Since it's more efficient to pass the
> > unmodified strings to pcre32, I aim to make pcre32 mask out those
> > upper bits. This is done in the code but hasn't been debugged yet
> > (it's not working yet).
>
> I suggest that such masking behavior should not be the default, but
> only enabled, if at all, by explicitly setting some configuration
> option.


I don't see a problem with masking the values. Unless the UTF-32 check is
disabled with PCRE_NO_UTF32_CHECK, these values will still be faulted (IOW, the
masking in _pcre32_valid_utf() is only a temporary measure while developing
this); the extra bits are let through only when that check is bypassed.
Making the masking optional would only complicate the code without any gain,
IMO.
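
To make the distinction concrete, here is a minimal sketch of the two
behaviours under discussion. It is not the code from the pcre32 branch: the
names UTF32_MASK, mask_code_unit() and is_valid_utf32() are hypothetical, and
the 21-bit mask is an assumption taken from the "21 bits" remark quoted above.

```c
#include <stdint.h>

/* Assumption: only the low 21 bits of a code unit carry the code point. */
#define UTF32_MASK 0x1fffffu

/* Hypothetical helper: drop any caller-defined flag bits above bit 20. */
static uint32_t mask_code_unit(uint32_t unit)
{
    return unit & UTF32_MASK;
}

/* Hypothetical strict check: accept only Unicode scalar values, i.e.
 * nothing above 0x10ffff and nothing in the surrogate range. */
static int is_valid_utf32(uint32_t unit)
{
    if (unit > 0x10ffffu)
        return 0;
    if (unit >= 0xd800u && unit <= 0xdfffu)
        return 0;
    return 1;
}
```

With these definitions, the unmodified unit 0x10000021 fails the strict check,
while masking it gives 0x21. A masked value can still lie above 0x10ffff (for
example 0x1fffff), so masking does not by itself make a unit valid.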

> If a 32-bit string contains a code unit such as 0x10000021, the safer
> assumption is that it is *not* equivalent to U+0021. 0x10000021 might
> trigger a warning that the string is not valid UTF-32, or it might
> just be treated as a different character. But to treat it by default
> as matching U+0021 would be just as wrong as an ASCII-based program
> treating 0xA1 as equivalent to 0x21.
>
> The originally ASCII-based programs that continue to work well today
> (for Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1
> differently from 0x21, and refrain from
> masking/bending/folding/mutilating it.


There will be the non-UTF 32-bit mode where you can pass any characters; we
don't need to complicate the UTF mode with this.
This 'masking' is purely a convenience for the API user; you don't *have to*
use it.

> Using the upper bits of 32-bit code units for flags, etc., risks
> incompatibility with future use of code points beyond U+10FFFF (such
> for extended private use); developers need to weigh the risks and
> benefits of such an approach carefully. Anyway, if they do it, they
> should at least be responsible for setting an option instructing PCRE
> to mask the high bits. In general, most libraries shouldn't be
> expected to mask or ignore those bits.
>
> I hope this suggestion is helpful. A 32-bit PCRE is likely to be
> useful for the long-term future, especially if code points beyond
> U+10FFFF are eventually employed.


It's absolutely certain that there will never be Unicode characters beyond
U+10FFFF, so there's no forward compatibility problem.

Now you seem to want some sort of "UCS-4" mode that would allow any character
from the 31-bit range (up to 0x7FFFFFFF) of UCS-4? I don't see how that would
be useful; for example, which properties would those characters beyond the
UTF-32 range have? (And if an actual use case for that UCS-4 mode ever arises,
we can just add it at that point as a _new_ flag/mode.)

(In reply to comment #1)
> The html docs are created automatically from the man pages when the
> script PrepareRelease is run. I will check this all out once your
> patches make it into the svn repo. I guess we'll have to do a bit of
> merging because independent changes are happening. (I'm currently
> tidying up code for OP_HSPACE and OP_VSPACE so that the case lists of
> values are defined only once, in a macro.)


I'll 'git rebase' the branch whenever new svn commits happen (I have already
done so for the OP_[HV]SPACE changes).

> I don't know if you've already picked this up, but I recently noticed in
> the code a few places where
>
> #ifdef COMPILE_PCRE16
>
> should be changed to
>
> #ifndef COMPILE_PCRE8
>    ^
>    ^
>
> so that it applies to 32-bit as well as 16-bit.


In my patch I have generally been changing

#ifdef COMPILE_PCRE16

to

#if defined COMPILE_PCRE16 || defined COMPILE_PCRE32

which I find more readable, but if you prefer I can switch to
#ifndef COMPILE_PCRE8 instead.
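
For comparison, a small sketch of the two guard styles side by side. The
COMPILE_PCRE* names are the ones discussed above; WIDE_EXPLICIT, WIDE_IMPLICIT
and guards_agree() are hypothetical names for this illustration only, and the
fallback definition of COMPILE_PCRE32 exists just so the fragment compiles on
its own.

```c
/* Sketch only: a real build is assumed to define exactly one of
 * COMPILE_PCRE8, COMPILE_PCRE16 or COMPILE_PCRE32. */
#if !defined COMPILE_PCRE8 && !defined COMPILE_PCRE16 && !defined COMPILE_PCRE32
#define COMPILE_PCRE32  /* fallback for this standalone illustration */
#endif

/* Explicit style: names every wide-character library. */
#if defined COMPILE_PCRE16 || defined COMPILE_PCRE32
#define WIDE_EXPLICIT 1
#else
#define WIDE_EXPLICIT 0
#endif

/* Implicit style: anything that is not the 8-bit library. */
#ifndef COMPILE_PCRE8
#define WIDE_IMPLICIT 1
#else
#define WIDE_IMPLICIT 0
#endif

/* Returns 1 when both guard styles select the same code path. */
static int guards_agree(void)
{
    return WIDE_EXPLICIT == WIDE_IMPLICIT;
}
```

The two styles agree as long as exactly one COMPILE_PCRE* macro is defined.
They differ only when none is defined: the #ifndef form then takes the wide
path, while the explicit form does not.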

(In reply to comment #2)
> > The JIT compiler also works in pcre32; I only had to comment out the use of the
> > fast_forward_first_two_chars() function since I couldn't figure out how to port
> > it to 32-bit; help appreciated there (and for everything else too :-).
>
> I have implemented a less platform dependent forward search, which should be
> compatible with any machine and any supported code format. Those ugly ifdefs
> are gone forever.


Thanks! I rebased the branch, and the #warning is now gone :-)

> > To check out the code, get the "pcre32" branch from my gitorious repository at
> > https://gitorious.org/~chpe/pcre/chpe-pcre . (It'll be frequently rebased for
> > updates from svn.)
> > (BTW, I've also set up a (manually updated) git-svn clone of the PCRE svn
> > repository at https://gitorious.org/pcre/pcre ).
>
> I think we should also setup a branch as we did when the 16 bit mode was
> developed.


Would you prefer that I push a branch to svn instead of keeping the work on
gitorious until it lands in svn trunk?


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email