[pcre-dev] [Bug 1295] New: add 32-bit library

Author: Christian Persch
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1295] New: add 32-bit library

http://bugs.exim.org/show_bug.cgi?id=1295
           Summary: add 32-bit library
           Product: PCRE
           Version: N/A
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: chpe@???
                CC: pcre-dev@???



PCRE should add support to handle 32-bit character strings.

Like the 8- and 16-bit libraries, there will be two modes of operation:
(A) UTF-32 strings
(B) arbitrary 32-bit character strings
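
As a rough illustration of how the two modes differ for a caller, here is a
minimal sketch, assuming that mode (A) requires every code unit to be a Unicode
scalar value while mode (B) accepts any 32-bit value. The function names are
hypothetical and are not part of the pcre32 API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mode (A): a UTF-32 code unit is taken to be valid only if it is a
   Unicode scalar value (hypothetical helper, not a pcre32 function). */
bool is_unicode_scalar(uint32_t c)
{
    if (c > 0x10FFFFu)
        return false;                 /* beyond the Unicode range */
    if (c >= 0xD800u && c <= 0xDFFFu)
        return false;                 /* surrogate code point */
    return true;
}

/* Returns true if every unit in s[0..len) is acceptable in the given mode. */
bool subject_is_acceptable(const uint32_t *s, size_t len, bool utf_mode)
{
    if (!utf_mode)
        return true;                  /* mode (B): any 32-bit value is allowed */
    for (size_t i = 0; i < len; i++) {
        if (!is_unicode_scalar(s[i]))
            return false;
    }
    return true;
}
```

With this sketch, a string containing 0xD800 would be rejected in mode (A) but
accepted in mode (B).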

Like the 16-bit support, this will be done by adding a new libpcre32 library
alongside the existing 8- and 16-bit libraries. The API is exactly the same as
that of the existing 16-bit library, except that every instance of "pcre16" is
replaced with "pcre32" and the character data type is an unsigned 32-bit
integer type.

Since the existing PCRE_INFO_FIRSTCHAR and PCRE_INFO_LASTLITERAL return the
first/required character in an int, but also return negative numbers in some
special cases (i.e. to indicate that no such character was set), it was
necessary to add new PCRE_INFO_FIRSTLITERAL and PCRE_INFO_LASTLITERAL2 items
that return those characters in an unsigned 32-bit integer, and
PCRE_INFO_FIRSTLITERALSET and PCRE_INFO_LASTLITERAL2SET items that return the
information previously communicated via the negative values.
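
The idea of splitting one signed result into a "value" query and an "is it set"
query can be sketched as follows. This is only an illustration: the struct and
function names are hypothetical and do not reflect how the patch or
pcre32_fullinfo is actually implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of what a compiled pattern knows about a literal. */
typedef struct {
    bool     is_set;   /* what a "...SET" query would report */
    uint32_t value;    /* what the unsigned value query would report */
} literal_info;

/* Reports whether a literal is known at all. This replaces the old
   convention of returning a negative number for "not set". */
bool query_literal_set(const literal_info *li)
{
    return li->is_set;
}

/* Returns the literal itself. All 32 bits are available for the
   character because no value is reserved as a sentinel. */
uint32_t query_literal(const literal_info *li)
{
    assert(li->is_set);   /* callers are expected to check the flag first */
    return li->value;
}
```

A caller would check query_literal_set() first and only then read the value, so
even a character such as 0xFFFFFFFF can be reported unambiguously.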

All tests pass. (Some tests have different output on 16/32 bit, so I've split
the output into -16 and -32 for some tests, and also added new tests 23..26 for
testing things really specific to just 16 or 32-bit and non-UTF vs. UTF.)

Since UTF-32 only occupies 21 bits of each 32-bit character, it's useful for
implementations to use the upper bits to store extra info (flags, etc). Since
it's more efficient to pass the unmodified strings to pcre32, I aim to make
pcre32 mask out those upper bits. The masking is implemented in the code but
hasn't been debugged yet, and it doesn't work yet.
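
For illustration, masking a tagged code unit down to its 21-bit code point
could look like the sketch below. The macro and function names, and the
FLAG_BOLD bit, are assumptions made up for this example; they are not taken
from the branch.

```c
#include <stdint.h>

/* Unicode code points go up to U+10FFFF, which fits in 21 bits. */
#define CODEPOINT_MASK 0x001FFFFFu

/* Hypothetical application flag kept in the otherwise unused top bit. */
#define FLAG_BOLD 0x80000000u

/* Drops any application data stored above bit 20. */
uint32_t strip_flags(uint32_t unit)
{
    return unit & CODEPOINT_MASK;
}

/* Compares a (possibly tagged) subject unit with an untagged pattern
   character, ignoring the upper bits of the subject unit. */
int units_match(uint32_t subject_unit, uint32_t pattern_char)
{
    return strip_flags(subject_unit) == pattern_char;
}
```

Masking at comparison time, as in units_match(), is what lets the caller pass
the tagged string unmodified instead of making a cleaned copy first.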

To allow arbitrary 32-bit character strings for goal (B), I had to make some
extra changes to the code in places where characters were previously passed
around in an int, with negative values reserved for special purposes. Matching
itself has not been tested yet beyond the existing tests, so I suspect there
will be more work to do: possibly a full audit of the code for signed/unsigned
conversions and for assignments from one integer type to another (truncation).
Compiler warning flags can help us there.
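
The two kinds of hazard such an audit would look for can be sketched with a
pair of checks. The helper names are hypothetical. On GCC and Clang, the
-Wconversion and -Wsign-conversion options warn about implicit conversions of
this sort.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hazard 1 (signedness): a character kept in a signed 32-bit variable
   keeps its value only up to INT32_MAX. Anything above that cannot be
   told apart from the negative "special purpose" values. */
bool fits_in_int32(uint32_t c)
{
    return c <= (uint32_t)INT32_MAX;
}

/* Hazard 2 (truncation): a character assigned to a narrower type loses
   its high bits. This checks whether a round trip through a 16-bit
   type preserves the value. */
bool survives_uint16(uint32_t c)
{
    return (uint32_t)(uint16_t)c == c;
}
```

Every valid UTF-32 character passes the first check, but in mode (B) values
such as 0x80000000 do not, which is why code paths that carry characters in an
int need attention.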

The JIT compiler also works in pcre32; I only had to comment out the use of the
fast_forward_first_two_chars() function since I couldn't figure out how to port
it to 32-bit; help appreciated there (and for everything else too :-).

The docs have already been updated to include the 32-bit library (except the
new values for pcre32_fullinfo), but the HTML docs haven't been updated yet (is
there some automation for that?).

To check out the code, get the "pcre32" branch from my gitorious repository at
https://gitorious.org/~chpe/pcre/chpe-pcre . (It'll be frequently rebased for
updates from svn.)
(BTW, I've also set up a (manually updated) git-svn clone of the PCRE svn
repository at https://gitorious.org/pcre/pcre ).

