Re: [pcre-dev] A PCRE 32-bit library?

Author: Christian Persch
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] A PCRE 32-bit library?
Hi!

On Wed, 18 Jul 2012 07:36:46 +0200 (CEST),
Zoltán Herczeg <hzmester@???> wrote:
> wow, this is awesome! The UTF16 support was really designed that way,
> that the UTF32 support shouldn't be too much trouble, but I suspect
> there are some places, where you needed to do a lot of work.


I've put the WIP patch to 8.31 (svn rev 984) at [1]. (I'm
using git-svn but I can't upload a git repo right now; it's just a
straight git-svn setup however, nothing special.) It's a git patch and
contains new files (where git had some strange idea what file it was
copied from) and some binary files (the saved test patterns for 32-bit
mode), so I'm not sure if it applies with plain patch to a svn
checkout, instead of git am -3 to the git-svn checkout.

It's quite big! But a large part of that is the updated docs, and
the new testoutputN-32 files. And the changes to pcretest.c and
pcre_jit_test.c to accommodate 3 libraries were quite large too. Many
of the actual code changes in the library are just changing/adding
COMPILE_PCRE32 ifdefs. There were a couple of tricky places, of
course :-)

> I have some questions:
>
> 1) The first / required character data is only 16 bit wide. Did you
> change them to 32 bit? What about the alignment of the new
> structure?


Yes, that was a problem. If I enlarged the real_pcre8 and real_pcre16
structs, that would be an ABI break and saved patterns from older
versions couldn't be loaded anymore. And I couldn't use an ifdef inside
the struct because pcretest needs to handle all 3 versions of the
struct. So I 'forked' the real_pcre struct into 8/16 and 32 bit
versions. The struct shouldn't have any holes: the 16 bit int members
are 2-byte aligned, the 32 bit ints 4-byte aligned, and sizeof is divisible by 8 (I
even added a compile-time assert for the last one). See the diff to
pcre_internal.h in the patch.
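
To illustrate the layout constraints described above, here is a minimal sketch. It is not taken from the patch: the struct name, the field names and the field order are all hypothetical, and only the three properties (no holes, natural alignment, sizeof divisible by 8) come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down header for a 32-bit compiled pattern.
   Names and order are illustrative only, not copied from pcre_internal.h. */
typedef struct sketch_pcre32 {
    uint32_t magic_number;
    uint32_t size;
    uint32_t options;
    uint32_t first_char;       /* widened from 16 to 32 bits */
    uint32_t req_char;         /* widened from 16 to 32 bits */
    uint16_t flags;
    uint16_t top_bracket;
    uint16_t name_count;
    uint16_t name_entry_size;
    uint32_t dummy;            /* explicit padding so that sizeof is a multiple of 8 */
} sketch_pcre32;

/* Compile-time check: the build fails if the size is not divisible by 8. */
_Static_assert(sizeof(sketch_pcre32) % 8 == 0,
               "sketch_pcre32 size must be a multiple of 8");

/* Run-time check for holes: the member sizes must add up to sizeof. */
static int sketch_has_no_holes(void)
{
    size_t members = 5 * sizeof(uint32_t)   /* the five leading 32-bit fields */
                   + 4 * sizeof(uint16_t)   /* the four 16-bit fields */
                   + sizeof(uint32_t);      /* the trailing padding field */
    return members == sizeof(sketch_pcre32);
}
```

Placing the 32-bit members first and the 16-bit members in pairs afterwards is one way to keep every member naturally aligned without compiler-inserted padding.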

> 2) Did you make new tests?


Yes, I added some new tests, and ran the existing tests on 32 bit where
it makes sense, too. For some of them I forked the testoutputN
into testoutputN-16 and testoutputN-32 where the results legitimately
differ between 16 and 32 bit (e.g. where surrogate characters occur,
etc.). The diff between those testoutputN-{16,32} is quite small and
thus easily verifiable; the results that differ in 32-bit mode appear
correct to me. All tests pass here on x86. And test runs are
valgrind-clean, as well.
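
As background on why results can legitimately differ where surrogates occur: a code point above U+FFFF takes two 16-bit code units (a surrogate pair) in UTF-16 but a single code unit in UTF-32, so offsets and lengths reported in code units differ. The sketch below only illustrates the encoding; the function name is hypothetical and nothing here is taken from the patch or the PCRE sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Returns the number of 16-bit code units written (1 or 2),
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
static size_t sketch_encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                  /* not a valid scalar value */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                     /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));    /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00 in UTF-16, while in UTF-32 it stays the single unit 0x0001F600.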

> 3) What is the status of the
> non-utf, plain 32 bit mode? I remember places where the uint32
> characters also contain some flags in the higher bits.


You spotted the problem right away :-) Yes, there are a couple of problems
here to be solved. First, in some places 32-bit signed ints are used to
handle characters, so effectively that means the 32-bit is really
31-bit only. Doesn't look too hard to fix, however. Furthermore,
there's the REQ_VARY and REQ_CASELESS flags or'd to the characters, but
that again looks fixable. I just haven't done so yet.
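
The two problems can be sketched as follows. This is an illustration only: the flag values below are assumptions chosen for the example (the real definitions are in pcre_internal.h), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag bits for illustration; not necessarily the values PCRE uses. */
#define SKETCH_REQ_CASELESS 0x10000000u
#define SKETCH_REQ_VARY     0x20000000u
#define SKETCH_FLAG_MASK    (SKETCH_REQ_CASELESS | SKETCH_REQ_VARY)

/* Flags OR'd into the same word as the character. */
static uint32_t sketch_pack(uint32_t ch, uint32_t flags)
{
    return ch | flags;
}

/* Recover the character by masking the flag bits off again. */
static uint32_t sketch_unpack_char(uint32_t packed)
{
    return packed & ~SKETCH_FLAG_MASK;
}

/* Returns 1 if the character survives packing and unpacking.
   Fine for Unicode (up to 0x10FFFF), but lossy for arbitrary 32-bit
   values that already use the flag bits. */
static int sketch_round_trips(uint32_t ch)
{
    return sketch_unpack_char(sketch_pack(ch, SKETCH_REQ_VARY)) == ch;
}

/* Returns 1 if the character can be held in a signed 32-bit int
   without becoming negative, i.e. it is at most 31 bits wide. */
static int sketch_fits_signed(uint32_t ch)
{
    return ch <= (uint32_t)INT32_MAX;
}
```

In UTF mode neither limit is reached, since code points stop at 0x10FFFF; the problems only show up in the non-UTF mode, where a character may be any 32-bit value.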

> 4) Which build systems support UTF32?


I've tested everything with the autotools build system; that one
definitely works. I've also updated the NON-AUTOTOOLS-BUILD file, and
the CMake setup, but those are untested so far.

> 5) What about JIT?


JIT works too, and all tests (pcre_jit_test and RunTest) pass on x86.
It's quite possible that it's broken on non-x86, of course.

> Again, I think this is really nice work! I suspect only new symbols
> were added to pcre.h, so we shouldn't worry about compatibility.


Right, the public API is the same as the 16-bit library, with just
s/16/32/g.

Regards,
    Christian


[1] http://people.gnome.org/~chpe/patches/pcre32.diff