Following a suggestion from Ævar Arnfjörð Bjarmason I've been doing some
experimental work on extracting trigrams (sequences of 3 consecutive
characters) from compiled patterns. For example, if the pattern is
/foo.(bar|baz)/ the extraction yields AND("foo",OR("bar","baz")).
The present code makes a "best effort". It does not find every possible
trigram from every pattern. It handles text leading into parentheses,
but does not manage to carry text onwards after a group:
  /a(bc|de)/    OR("abc","ade")
  /(bc|de)a/    No mandatory 3-character sequences identified
There are also some patterns (for example, those using (*ACCEPT)) where
it gives up completely.
The idea behind this is to improve performance for certain applications
that run very many searches on the same large data sets. If a trigram
index is created, a full pattern match need only be applied to those
lines (or records) that contain the relevant trigrams.
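As an illustration of what an application might do (this is just a sketch,
not my test program), assume the trigrams for /foo.(bar|baz)/ are already
known, in other words AND("foo",OR("bar","baz")); the filter-then-match step
could then look like this, using only the existing PCRE2 API:

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
  int errcode;
  PCRE2_SIZE erroffset;
  pcre2_code *re = pcre2_compile((PCRE2_SPTR)"foo.(bar|baz)",
    PCRE2_ZERO_TERMINATED, 0, &errcode, &erroffset, NULL);
  if (re == NULL) return 1;
  pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);

  char line[4096];
  long lineno = 0;

  while (fgets(line, sizeof(line), stdin) != NULL)
  {
    lineno++;

    /* Trigram pre-filter: "foo" must be present, and so must "bar" or "baz".
       A real application would get these hits from an index instead of
       searching every line for the literal strings. */
    if (strstr(line, "foo") == NULL) continue;
    if (strstr(line, "bar") == NULL && strstr(line, "baz") == NULL) continue;

    /* Only lines that pass the filter get the full (more expensive) match. */
    if (pcre2_match(re, (PCRE2_SPTR)line, strlen(line), 0, 0, md, NULL) >= 0)
      printf("%ld: %s", lineno, line);
  }

  pcre2_match_data_free(md);
  pcre2_code_free(re);
  return 0;
}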
I also wrote a proof-of-concept grep-like program that makes a simple
trigram index (using a balanced binary tree, because I had code to
manage that) and then scans a file with and without optimization. The
file is 476K of English text. Some crude timings on simple searches
showed that, approximately:
Using JIT cut the time for the complete scan by 50%.
Building the index took about 5 times the non-JIT scan time.
Scanning via the index and then running the JIT match only on those lines
whose trigrams matched was 10 times faster than running the JIT match on
every line.
These relative timings were very much the same for patterns that had
only one or two matches and also those with many matches. (Patterns with
no matches because none of their trigrams were in the index were very
fast, of course.) I could probably make the index scan faster still by
storing file offsets in the index instead of line numbers, but this is
just testing code. It is clear that making the index is only worth it if
there are going to be many searches on the same data.
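For anyone who wants to experiment, the indexing step can be sketched as
below. This is not my test code (which used a balanced binary tree); it uses
a simple chained hash table instead, the sizes are arbitrary, and error
checking is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536

/* One entry maps a trigram to the list of line numbers on which it occurs. */
typedef struct entry {
  struct entry *next;
  char trigram[3];
  size_t count, size;
  long *lines;
} entry;

static entry *buckets[NBUCKETS];

static unsigned hash3(const char *p)
{
  return (((unsigned char)p[0] * 131u + (unsigned char)p[1]) * 131u +
          (unsigned char)p[2]) % NBUCKETS;
}

/* Record that the trigram at p occurs on line lineno. */
static void index_add(const char *p, long lineno)
{
  entry **slot = &buckets[hash3(p)];
  entry *e;

  for (e = *slot; e != NULL; e = e->next)
    if (memcmp(e->trigram, p, 3) == 0) break;

  if (e == NULL)
  {
    e = calloc(1, sizeof(*e));
    memcpy(e->trigram, p, 3);
    e->next = *slot;
    *slot = e;
  }

  if (e->count > 0 && e->lines[e->count - 1] == lineno) return; /* already noted */
  if (e->count == e->size)
  {
    e->size = (e->size == 0)? 8 : 2 * e->size;
    e->lines = realloc(e->lines, e->size * sizeof(long));
  }
  e->lines[e->count++] = lineno;
}

int main(void)
{
  char line[4096];
  long lineno = 0;

  while (fgets(line, sizeof(line), stdin) != NULL)
  {
    size_t len = strcspn(line, "\n");
    lineno++;
    for (size_t i = 0; i + 3 <= len; i++) index_add(line + i, lineno);
  }

  printf("indexed %ld lines\n", lineno);
  return 0;
}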
So, what to do next? I'd welcome comments and opinions from this list.
1. Is this a useful addition to PCRE2? (It could, of course, be optional
at build time.)
2. If so, what kind of API is best? Something along the lines of
rc = pcre2_find_ngrams(re, n, options, buffer, buffsize, context)
comes to mind, where re is a compiled pattern, n is 3 for a trigram, or
4, 5, 6 etc (perhaps limited to the range 2-10?), and the context is
needed to provide memory allocation/free functions because the code
builds a structure in memory before outputting it as an AND/OR string in
the buffer.
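To be clear, no such function exists yet; the following is just a sketch of
how the proposed call might look (the buffer type and the kind of context
are guesses on my part):

PCRE2_UCHAR buffer[1024];
int rc = pcre2_find_ngrams(
  re,              /* the compiled pattern */
  3,               /* n: 3 for trigrams */
  0,               /* options: none defined yet */
  buffer,          /* would receive e.g. AND("foo",OR("bar","baz")) */
  sizeof(buffer),  /* buffsize */
  gcontext);       /* a context supplying the malloc/free functions used
                      for the temporary in-memory structure */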
3. If the pattern sets UTF, the code works in characters, not code
units, and outputs its strings in UTF. In all cases, internal quotes are
doubled. But maybe code units would be better? Or an option?
4. Is there a more useful format for the output instead of quoted
strings? This would be needed if it was done in code units.
5. I have done nothing about case-independent matching. Something like
/(?i)abc/ just outputs "abc" instead of the 8 possible cased variants.
This could be changed; it could be an option. Equally, there could be an
option just to output the ngrams without the AND/OR wrapping. A simpler
application could use this to run the full match only on lines that
contain at least one of them, without having to implement AND/OR logic.
6. All the testing so far has been with the 8-bit PCRE2 library, but it
should be straightforward to make it general for all three widths.