[pcre-dev] Extracting trigrams from PCRE syntax

Top Page
Delete this message
Author: Ævar Arnfjörð Bjarmason
Date:  
To: pcre-dev
Subject: [pcre-dev] Extracting trigrams from PCRE syntax
One of the pie-in-the-sky ideas I have for PCRE use in Git is to start
pre-indexing regular expressions.

This method is covered in this paper/blog post:
https://swtch.com/~rsc/regexp/regexp4.html

Basically, consider a regex like:

    foox.*(bary|bazz)


It can be broken into a trigram tree like:

    (FOO AND OOX) AND ((BAR AND BARY) or (BAZ AND AZZ))


With the idea being that you'd previously gone through all the text
that could possibly matched and indexed which line/file/whatever
contained those trigrams, so you'd only have to do a PCRE match
against those that did.

This is how Google Code search is fast & Etsy Hound as well
(https://github.com/etsy/hound). The idea is that most regexes contain
some fixed strings, and you can just do a hash lookup in an index
first to look at the documents you need to consider.

So specifically, the idea is to make git-grep fast by having
pre-indexed the working tree so PCRE only needs to look at those files
that contain the relevant trigrams, and make stuff like log search /
diff search similarly fast by having having a trigram:<list of
commits> index.

Hence this E-Mail. Is there some way I can use the PCRE API to extract
the parse tree, in particular some nested and/or tree of the fixed
strings to be found in the regex, and if not could such an API be
added / does any other library parsing PCRE-like regexes provide that?