Re: [pcre-dev] Extracting trigrams from PCRE syntax

Top Page
Delete this message
Author: Ævar Arnfjörð Bjarmason
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Extracting trigrams from PCRE syntax
On Tue, Nov 28, 2017 at 6:42 PM, <ph10@???> wrote:
> On Mon, 27 Nov 2017, I wrote:
>
>> I suppose one might consider providing a function similar to
>> pcre2_callout_enumerate(), which enumerates the callouts in a compiled
>> pattern. Something like pcre2_fixed_strings_enumerate() which would
>> pass back the strings (it could bundle up runs of individual
>> characters).
>
> Thinking about this some more ... knowing the fixed strings is not good
> enough. Consider a pattern such as ABC|\d\d\d which can match lines that
> do not contain ABC. An external indexing trigram scheme could only work
> if the pattern has no wild cards and no verbs such as (*ACCEPT). It
> would, of course, be possible to implement a pcre2_pattern_info() option
> that gives TRUE only if the pattern contains literal characters,
> vertical bar, non-lookaround, parentheses, circumflex, and dollar. I
> suppose quantifiers whose minimum is 1 could be permitted in some cases.
> Also maybe back references.
>
> Is all this going to be worth it?
>
> What you really need (I think) is a function that doesn't just give a
> list of strings in the pattern, but gives a list of strings, at least
> one of which *must* be present in the subject for there to be a match.
> That is something to think about.


Right, the way Hound and Google Code search (IIRC) deal with this is
that you just can't search for patterns like these, i.e. Hound in
particular will just start and then return an error saying there's too
many matches.

So it's really a partial regex search, i.e. the pattern *must* contain
some string that you can anchor on, the typical case will be something
like:

(somefunc|otherfunc)\(void\s*\*

Or whatever, i.e. searching for something that has large fixed
strings. There were other limitations, e.g. RE2 which GCS used was an
NFA that couldn't handle backrefernces or other complex constructs.

> Some time ago I spent a bit of time playing with code that, given a
> compiled pattern, generates strings that match it. I had some success
> until I got to lookarounds, when I realized that I needed a whole new
> approach that included backtracking, and I haven't gone back to it. This
> requirement of yours seems similar in some ways.
>
> I'll think about it, but please do not hold your breath.


Sorry, gotta run now. Can't reply to the rest of your questions,
thought I'd send this first...