Re: [pcre-dev] Extracting trigrams from PCRE syntax

Author: ph10
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Extracting trigrams from PCRE syntax
On Mon, 27 Nov 2017, I wrote:

> I suppose one might consider providing a function similar to
> pcre2_callout_enumerate(), which enumerates the callouts in a compiled
> pattern. Something like pcre2_fixed_strings_enumerate() which would
> pass back the strings (it could bundle up runs of individual
> characters).


Thinking about this some more ... knowing the fixed strings is not good
enough. Consider a pattern such as ABC|\d\d\d which can match lines that
do not contain ABC. An external trigram indexing scheme could only work
if the pattern has no wildcards and no verbs such as (*ACCEPT). It
would, of course, be possible to implement a pcre2_pattern_info() option
that gives TRUE only if the pattern contains nothing but literal
characters, vertical bars, non-lookaround parentheses, circumflex, and
dollar. I
suppose quantifiers whose minimum is 1 could be permitted in some cases.
Also maybe back references.
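
To make the ABC|\d\d\d problem concrete, here is a minimal sketch. It uses Python's re module as a stand-in for PCRE2 (the two agree on this pattern), and the helper names trigrams() and index_would_keep() are made up for illustration; they are not part of any PCRE2 API.

```python
import re

def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

pattern = re.compile(r"ABC|\d\d\d")

# The only fixed string in the pattern is "ABC".
literal_trigrams = trigrams("ABC")

def index_would_keep(line):
    """Naive prefilter: keep a line only if it has every literal trigram."""
    return literal_trigrams <= trigrams(line)

line = "order 123 shipped"
matches = pattern.search(line) is not None  # the regex matches via \d\d\d
kept = index_would_keep(line)               # the prefilter discards the line
```

Here the line matches the pattern but the prefilter throws it away, which is exactly the false negative that makes a plain list of fixed strings insufficient.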

Is all this going to be worth it?

What you really need (I think) is a function that doesn't just give a
list of strings in the pattern, but gives a list of strings, at least
one of which *must* be present in the subject for there to be a match.
That is something to think about.
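
A rough sketch of what such a function might compute, for a toy subset only: literal characters, vertical bar, plain parentheses, circumflex, and dollar. This is an illustration in Python, not a proposal for the PCRE2 implementation; required_strings() is a hypothetical name, and it works on the pattern text rather than on a compiled pattern. It ignores compile options such as caseless matching, and returns None for anything outside the subset.

```python
from itertools import product

# Any of these makes the pattern unsupported in this sketch.
META = set("\\.[]{}*+?")

def required_strings(pattern):
    """Return a set of strings, at least one of which must occur in any
    matching subject, or None if the pattern is outside the toy subset."""
    pos = 0

    def parse_alt():
        # alt := seq ('|' seq)*  -> union of the branches
        nonlocal pos
        result = parse_seq()
        while result is not None and pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            branch = parse_seq()
            result = None if branch is None else result | branch
        return result

    def parse_seq():
        # seq := (literal | anchor | '(' alt ')')*  -> concatenation
        nonlocal pos
        strings = {""}
        while pos < len(pattern) and pattern[pos] not in "|)":
            ch = pattern[pos]
            if ch in META:
                return None  # wildcard, quantifier, escape, (?...) or (*VERB)
            pos += 1
            if ch in "^$":
                continue  # anchors are zero-width, so they add no text
            if ch == "(":
                inner = parse_alt()
                if inner is None or pos >= len(pattern) or pattern[pos] != ")":
                    return None
                pos += 1
            else:
                inner = {ch}
            strings = {a + b for a, b in product(strings, inner)}
        return strings

    result = parse_alt()
    if result is None or pos != len(pattern):
        return None
    return result

ok = required_strings("^(foo|ba(r|z))qux$")
bad = required_strings(r"ABC|\d\d\d")
```

With this, ok is the set of fooqux, barqux and bazqux, and a subject can be rejected without running the matcher if it contains none of them. An empty string in the result means that some branch imposes no constraint at all. The set can grow exponentially with nested alternatives, so a real version would need a size limit.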

Some time ago I spent a bit of time playing with code that, given a
compiled pattern, generates strings that match it. I had some success
until I got to lookarounds, when I realized that I needed a whole new
approach that included backtracking, and I haven't gone back to it. This
requirement of yours seems similar in some ways.

I'll think about it, but please do not hold your breath.

Philip

--
Philip Hazel