Author: Philip Hazel
Date:  
To: Rich Siegel
CC: pcre-dev
Subject: Re: [pcre-dev] Character class to bitmask (or other representation)?
Rich,

Funnily enough, I was thinking about character classes recently. The way
PCRE2 handles them internally is an area that could do with some work (and
Perl now has more sophisticated classes). Originally, when PCRE was first
implemented as an 8-bit ASCII program, it just used a bit map internally.
Now it's more complicated, with a bit map for characters whose code points
are less than 256 and an extended encoding for higher characters and
character types such as \pL. It has all got a bit messy, which is why I was
wondering whether to think of a different way of encoding classes. But
that's as far as I've got.
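
As a rough illustration of the bit-map idea for code points below 256, here is a minimal sketch in C. It is not PCRE2's internal code; the class_map type and the function names are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for this sketch: one bit per code point 0..255. */
typedef struct { uint8_t bits[32]; } class_map;   /* 32 bytes = 256 bits */

/* Start with an empty class. */
static void class_map_clear(class_map *m)
{
    memset(m->bits, 0, sizeof m->bits);
}

/* Add every code point in the inclusive range lo..hi to the class. */
static void class_map_add_range(class_map *m, unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 256);
    for (unsigned c = lo; c <= hi; c++)
        m->bits[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Return 1 if c is in the class; code points >= 256 never are here. */
static int class_map_has(const class_map *m, unsigned c)
{
    return c < 256 && ((m->bits[c / 8] >> (c % 8)) & 1);
}
```

Membership is a single shift and mask, which is why a bit map suits the "rapidly test one character" case. Anything at or above 256 needs a separate representation, which is where the extended encoding comes in.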

If you are working in an ASCII-only environment, there is a way to get what
you want. Take a look at the PCRE2_INFO_FIRSTBITMAP request to the
pcre2_pattern_info() function. If your pattern consists of just the class,
the "match must start with one of these code units" bit map (a 256-bit
table) will correspond to the bit map for that class. Of course, you have
to compile the pattern first. You can see how this works by using the -i
option of pcre2test:

$ pcre2test -i
PCRE2 version 10.35 2020-05-09
re> "[a-z0-9_?$]"

Capture group count = 0
Starting code units: $ 0 1 2 3 4 5 6 7 8 9 ? _ a b c d e f g h i j k l m n
o p q r s t u v w x y z
Subject length lower bound = 1
data>

Regards,
Philip


On Mon, 7 Dec 2020 at 15:43, Rich Siegel <siegel@???> wrote:

> Good afternoon,
>
> There's something I'd like to use pcre2 for, but I'm not sure if it's
> possible (or if it is, quite how to get there).
>
> Given a valid character class string (e.g. "[a-z0-9_?$]" in a very
> simple case), I'd like to get back some representation which describes
> all of the characters included in the class. A bit mask would be fine,
> but a list of code point ranges would do as well.
>
> My use case is that I need to rapidly test whether a given character
> matches a user-specified character class. I know I can do this by
> compiling a pattern and then attempting to match, but that's a little
> "heavy" for my use case.
>
> I haven't looked at how character class matching works, but I *assume*
> that some sort of representation of the class is compiled that allows
> rapid testing. So I guess a way to expose that, or parse it into a
> bitmask/code point ranges would be ideal.
>
> Is this currently possible, or could it be? (I'll be happy to write this
> up in Bugzilla if you think it's feasible, but I'm looking for a sense
> of whether it's doable.)
>
> Thanks for any advice,
>
> R.
>
> --
> Rich Siegel                                 Bare Bones Software, Inc.
> <siegel@???>                      <https://www.barebones.com/>
>
> Someday I'll look back on all this and laugh... until they sedate me.
>
> --
> ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
>