Re: [pcre-dev] Extracting trigrams from PCRE syntax

Góra strony
Delete this message
Autor: Zoltán Herczeg
Data:  
Dla: Ævar Arnfjörð Bjarmason, pcre-dev@exim.org
Temat: Re: [pcre-dev] Extracting trigrams from PCRE syntax
Hi,

for some time I have been thinking about creating a regex optimizer which does everything except matching. It could analyze and optimize patterns. Unfortunately I don't have time working on it., so it is unlikely it will be available in the foreseeable future :(

Regards,
Zoltan
 
-------- Eredeti levél --------
Feladó: Ævar Arnfjörð Bjarmason via Pcre-dev < pcre-dev@??? (Link -> mailto:pcre-dev@exim.org) >
Dátum: 2017 november 25 22:40:09
Tárgy: [pcre-dev] Extracting trigrams from PCRE syntax
Címzett: pcre-dev@??? (Link -> mailto:pcre-dev@exim.org)
 
One of the pie-in-the-sky ideas I have for PCRE use in Git is to start
pre-indexing regular expressions.
This method is covered in this paper/blog post:
https://swtch.com/~rsc/regexp/regexp4.html
Basically, consider a regex like:
foox.*(bary|bazz)
It can be broken into a trigram tree like:
(FOO AND OOX) AND ((BAR AND BARY) or (BAZ AND AZZ))
With the idea being that you'd previously gone through all the text
that could possibly matched and indexed which line/file/whatever
contained those trigrams, so you'd only have to do a PCRE match
against those that did.
This is how Google Code search is fast & Etsy Hound as well
(https://github.com/etsy/hound). The idea is that most regexes contain
some fixed strings, and you can just do a hash lookup in an index
first to look at the documents you need to consider.
So specifically, the idea is to make git-grep fast by having
pre-indexed the working tree so PCRE only needs to look at those files
that contain the relevant trigrams, and make stuff like log search /
diff search similarly fast by having having a trigram:<list of
commits> index.
Hence this E-Mail. Is there some way I can use the PCRE API to extract
the parse tree, in particular some nested and/or tree of the fixed
strings to be found in the regex, and if not could such an API be
added / does any other library parsing PCRE-like regexes provide that?
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev