Re: [pcre-dev] Extracting trigrams from PCRE syntax

Author: ph10
Date:
To: Ævar Arnfjörð Bjarmason
CC: pcre-dev
Subject: Re: [pcre-dev] Extracting trigrams from PCRE syntax

On Sat, 25 Nov 2017, Ævar Arnfjörð Bjarmason via Pcre-dev wrote:

> Hence this E-Mail. Is there some way I can use the PCRE API to extract
> the parse tree, in particular some nested and/or tree of the fixed
> strings to be found in the regex, and if not could such an API be
> added / does any other library parsing PCRE-like regexes provide that?

Unfortunately not, not least because there is no parse tree. :-) At
least not if you really mean a tree data structure. The input is first
scanned to discover the number and names of capturing parentheses and at
the same time remove comments and translate escaped characters. The result
is then converted directly into a vector that represents the compiled
pattern. This is all described in the file called HACKING in the PCRE2
distribution.

There is no record of fixed strings in the regex as they are broken into
individual characters. Some early versions of PCRE did use strings.
This was changed at some point, probably because it makes life easier in
UTF mode.

There is nothing to stop you scanning the compiled regex, but there are
no guarantees that the internal format won't change. I suppose one might
consider providing a function similar to pcre2_callout_enumerate(),
which enumerates the callouts in a compiled pattern. Something like
pcre2_fixed_strings_enumerate() which would pass back the strings (it
could bundle up runs of individual characters). Off the top of my head,
a specification something like:

  int pcre2_fixed_strings_enumerate(
    const pcre2_code *code, 
    PCRE2_SIZE       *offset, 
    PCRE2_UCHAR      *buffer,
    PCRE2_SIZE        buffer_length,
    int              *caseless
    );

  Arguments:
    code             points to compiled pattern 
    offset           must be set to 0 for the first call;
                     updated to remember where we are in the pattern
    buffer           where to put the next string
    buffer_length    length of buffer
    caseless         set to 0 for caseful match
                     set to 1 for caseless match

  Returns:
    0        no more strings
    > 0      length of returned string
    < 0      error code (e.g. buffer too small, offset too large)
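
To make the calling convention concrete, here is a minimal sketch of how
such an enumerator and its caller loop might behave. Everything in it is
hypothetical: pcre2_fixed_strings_enumerate() does not exist, and the
toy_op array is an invented stand-in for a compiled pattern, not the real
PCRE2 internal format. The toy function also takes an explicit code_len
and uses made-up error codes, both assumptions for the sake of a
self-contained example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a compiled pattern. NOT the real PCRE2 format:
   each item is a caseful character, a caseless character, or
   "something else" (a quantifier, class, group, and so on). */
typedef enum { TOY_CHAR, TOY_CHARI, TOY_OTHER } toy_kind;

typedef struct {
    toy_kind kind;
    char     ch;   /* meaningful only for TOY_CHAR and TOY_CHARI */
} toy_op;

/* Hypothetical error codes, invented for this sketch. */
#define TOY_ERR_NOMEMORY  (-1)   /* buffer too small */
#define TOY_ERR_BADOFFSET (-2)   /* offset too large */

/* Returns 0 when there are no more strings, > 0 for the length of the
   string placed in buffer (NUL-terminated), < 0 for an error. */
int toy_fixed_strings_enumerate(const toy_op *code, size_t code_len,
                                size_t *offset, char *buffer,
                                size_t buffer_length, int *caseless)
{
    assert(code != NULL && offset != NULL);
    assert(buffer != NULL && caseless != NULL);

    size_t i = *offset;
    if (i > code_len) return TOY_ERR_BADOFFSET;

    /* Skip items that are not literal characters. */
    while (i < code_len && code[i].kind == TOY_OTHER) i++;
    if (i == code_len) { *offset = i; return 0; }

    /* Bundle up a run of characters that share the same case mode. */
    toy_kind run_kind = code[i].kind;
    size_t n = 0;
    while (i < code_len && code[i].kind == run_kind) {
        /* Keep one byte for the terminating NUL. On failure *offset is
           left untouched so the caller can retry with a larger buffer. */
        if (n + 1 >= buffer_length) return TOY_ERR_NOMEMORY;
        buffer[n++] = code[i].ch;
        i++;
    }
    buffer[n] = '\0';
    *caseless = (run_kind == TOY_CHARI);
    *offset = i;   /* remember where we are in the pattern */
    return (int)n;
}

/* Caller loop: start with offset 0 and call until the result is <= 0.
   Returns the number of strings found, or a negative error code. */
int toy_count_strings(const toy_op *code, size_t code_len)
{
    char buf[64];
    size_t offset = 0;
    int caseless, rc, count = 0;

    while ((rc = toy_fixed_strings_enumerate(code, code_len, &offset,
                                             buf, sizeof buf,
                                             &caseless)) > 0)
        count++;
    return rc < 0 ? rc : count;
}
```

Keeping the position in a caller-owned offset means the enumerator needs
no hidden state, so the same compiled pattern can be scanned from several
places at once.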

The only problem with this is that it would return strings in negative
lookarounds, which is NOT what you want! But perhaps with a bit more
thought this API could be expanded to pass back the information as to
whether the string is in a lookaround, and if so, what type.
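
As a rough illustration of that expansion, the enumerator could report
the kind of lookaround through one extra output value, and the caller
could discard strings it must not rely on. The enum and helper below are
invented names for this sketch, not part of any PCRE2 API.

```c
#include <stdbool.h>

/* Hypothetical classification of where a fixed string was found. */
typedef enum {
    TOY_CTX_NONE,            /* not inside any lookaround */
    TOY_CTX_POS_LOOKAHEAD,   /* (?=...)  */
    TOY_CTX_NEG_LOOKAHEAD,   /* (?!...)  */
    TOY_CTX_POS_LOOKBEHIND,  /* (?<=...) */
    TOY_CTX_NEG_LOOKBEHIND   /* (?<!...) */
} toy_lookaround;

/* True when the string sits inside a negative lookaround, so a caller
   building a trigram index should not treat it as text that has to
   appear in a matching subject. */
bool toy_skip_for_index(toy_lookaround ctx)
{
    return ctx == TOY_CTX_NEG_LOOKAHEAD || ctx == TOY_CTX_NEG_LOOKBEHIND;
}
```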

Would something like this be helpful?

Philip

--
Philip Hazel
