[pcre-dev] [Bug 978] New: Generalized support for alternate …

Author: Doug Cook
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 978] New: Generalized support for alternate character encodings

http://bugs.exim.org/show_bug.cgi?id=978
           Summary: Generalized support for alternate character encodings
           Product: PCRE
           Version: 8.01
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: wishlist
          Priority: low
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: idigdoug@???
                CC: pcre-dev@???



Two aspects of PCRE make it less than ideal for my current application. First,
it directly supports only untranslated 8-bit input and UTF-8, making it unwieldy
for cases where other encodings need to be selected dynamically (primarily
UTF-16, but also others). Second, even when UTF-8 is used, it must be
pre-validated, requiring multiple passes over the data (a significant
performance penalty for input that exceeds the CPU's data cache).

One possible solution would be to define the input character sequence using a
more object-oriented system. Instead of getting a start pointer, a length, and
a utf8 flag, PCRE could accept a pointer to a structure such as the following:

struct pcre_input
{
    /* Reads the Unicode value of the character starting at pos into *pc.
       If successful, returns the position of the start of the next character.
       If unsuccessful, returns NULL. */
    USPTR (*next)(
        const struct pcre_input* self,
        USPTR pos,
        int* pc);


    /* Reads the Unicode value of the character ending before pos into *pc.
       If successful, returns the position of the start of the character that
       was returned. If unsuccessful, returns NULL. */
    USPTR (*prev)(
        const struct pcre_input* self,
        USPTR pos,
        int* pc);


    USPTR start_pos;
    USPTR end_pos;
};
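
To make the idea concrete, here is a rough sketch of what a caller-supplied
UTF-16 (little-endian) pair of callbacks might look like. Everything other than
the struct itself is hypothetical: the names utf16le_next, utf16le_prev and
pcre_input_init_utf16le are made up for illustration, and USPTR is assumed to
be const unsigned char *. Note that decoding and validation happen in the same
step, so no separate validation pass is needed.

```c
#include <stddef.h>

typedef const unsigned char *USPTR;  /* assumption: byte pointer, as in PCRE */

struct pcre_input
{
    USPTR (*next)(const struct pcre_input* self, USPTR pos, int* pc);
    USPTR (*prev)(const struct pcre_input* self, USPTR pos, int* pc);
    USPTR start_pos;
    USPTR end_pos;
};

/* Reads one little-endian 16-bit code unit. */
static unsigned int unit_at(USPTR p)
{
    return (unsigned int)p[0] | ((unsigned int)p[1] << 8);
}

/* Combines a validated surrogate pair into a code point. */
static int combine(unsigned int hi, unsigned int lo)
{
    return (int)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

/* Hypothetical next(): decodes the character starting at pos. */
static USPTR utf16le_next(const struct pcre_input* self, USPTR pos, int* pc)
{
    unsigned int hi, lo;
    if (pos < self->start_pos || self->end_pos - pos < 2) return NULL;
    hi = unit_at(pos);
    if (hi < 0xD800 || hi > 0xDFFF) { *pc = (int)hi; return pos + 2; }
    /* Lone low surrogate, or high surrogate with no room for its partner. */
    if (hi > 0xDBFF || self->end_pos - pos < 4) return NULL;
    lo = unit_at(pos + 2);
    if (lo < 0xDC00 || lo > 0xDFFF) return NULL;
    *pc = combine(hi, lo);
    return pos + 4;
}

/* Hypothetical prev(): decodes the character ending just before pos. */
static USPTR utf16le_prev(const struct pcre_input* self, USPTR pos, int* pc)
{
    unsigned int hi, lo;
    if (pos > self->end_pos || pos - self->start_pos < 2) return NULL;
    lo = unit_at(pos - 2);
    if (lo < 0xD800 || lo > 0xDFFF) { *pc = (int)lo; return pos - 2; }
    /* Lone high surrogate, or low surrogate with nothing before it. */
    if (lo < 0xDC00 || pos - self->start_pos < 4) return NULL;
    hi = unit_at(pos - 4);
    if (hi < 0xD800 || hi > 0xDBFF) return NULL;
    *pc = combine(hi, lo);
    return pos - 4;
}

/* Hypothetical helper: fills out the structure for a UTF-16LE buffer. */
static void pcre_input_init_utf16le(struct pcre_input* in,
                                    const void* data, size_t bytes)
{
    in->next = utf16le_next;
    in->prev = utf16le_prev;
    in->start_pos = (USPTR)data;
    in->end_pos = (USPTR)data + bytes;
}
```

The matcher would then only ever call in->next and in->prev and would not need
to know which encoding sits behind them.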


(For backwards compatibility, the existing APIs could fill out the structure
themselves, providing function pointers for untranslated or UTF-8 input.)
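
As an illustration of that compatibility path, the following is a minimal
sketch of how an existing entry point might build the structure from its
current arguments (subject pointer, length, utf8 flag). The helper name
pcre_input_init and the four callbacks are hypothetical, not existing PCRE
functions, and USPTR is again assumed to be const unsigned char *. The UTF-8
callbacks reject malformed sequences while decoding, which is what would remove
the separate validation pass.

```c
#include <stddef.h>

typedef const unsigned char *USPTR;  /* assumption: byte pointer, as in PCRE */

struct pcre_input
{
    USPTR (*next)(const struct pcre_input* self, USPTR pos, int* pc);
    USPTR (*prev)(const struct pcre_input* self, USPTR pos, int* pc);
    USPTR start_pos;
    USPTR end_pos;
};

/* Untranslated 8-bit input: every byte is one character. */
static USPTR byte_next(const struct pcre_input* self, USPTR pos, int* pc)
{
    if (pos < self->start_pos || pos >= self->end_pos) return NULL;
    *pc = *pos;
    return pos + 1;
}

static USPTR byte_prev(const struct pcre_input* self, USPTR pos, int* pc)
{
    if (pos <= self->start_pos || pos > self->end_pos) return NULL;
    *pc = pos[-1];
    return pos - 1;
}

/* UTF-8 input: decode and validate in a single step. */
static USPTR utf8_next(const struct pcre_input* self, USPTR pos, int* pc)
{
    /* Smallest code point allowed for each sequence length (no overlongs). */
    static const int min_value[4] = { 0, 0x80, 0x800, 0x10000 };
    int c, extra, i;
    if (pos < self->start_pos || pos >= self->end_pos) return NULL;
    c = *pos;
    if (c < 0x80) extra = 0;
    else if ((c & 0xE0) == 0xC0) { extra = 1; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; c &= 0x07; }
    else return NULL;                 /* stray continuation or bad lead byte */
    if (self->end_pos - pos <= extra) return NULL;   /* truncated sequence */
    for (i = 1; i <= extra; i++) {
        if ((pos[i] & 0xC0) != 0x80) return NULL;
        c = (c << 6) | (pos[i] & 0x3F);
    }
    if (c < min_value[extra] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return NULL;                  /* overlong, out of range, or surrogate */
    *pc = c;
    return pos + extra + 1;
}

static USPTR utf8_prev(const struct pcre_input* self, USPTR pos, int* pc)
{
    USPTR p = pos;
    int c, steps = 0;
    if (pos <= self->start_pos || pos > self->end_pos) return NULL;
    /* Step back over at most three continuation bytes to find a lead byte. */
    do { p--; steps++; }
    while (p > self->start_pos && steps < 4 && (*p & 0xC0) == 0x80);
    /* The character found there must end exactly at pos. */
    if (utf8_next(self, p, &c) != pos) return NULL;
    *pc = c;
    return p;
}

/* Hypothetical helper: maps today's (subject, length, utf8) arguments
   onto the proposed structure. */
static void pcre_input_init(struct pcre_input* in,
                            const char* subject, int length, int utf8)
{
    in->next = utf8 ? utf8_next : byte_next;
    in->prev = utf8 ? utf8_prev : byte_prev;
    in->start_pos = (USPTR)subject;
    in->end_pos = (USPTR)subject + length;
}
```

Existing callers would see no change, since the structure would be built and
discarded inside the current API.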

The potential disadvantages of this (other than implementation cost) are an
unknown performance impact (the only way to determine it would be to try it)
and a breaking change for anyone using custom subject pointers.

I am willing to try to implement this if you think a feature like this would be
well-accepted.

