Re: [pcre-dev] Extracting trigrams from patterns: opinions w…

Top Page

Reply to this message
Author: Zoltán Herczeg
Subject: Re: [pcre-dev] Extracting trigrams from patterns: opinions wanted

my opinion is that this is kind of beyond the scope of pcre. For some time (years actually, but I had no time to start it) I have been thinking that the world would need a regex optimizer framework. It could do a lot of things except pattern matching. It would build an AST from the pattern, and could do both pattern analysis and pattern optimization. It would support millions of options (e.g. target engine, optimization level, emulating missing features based on the target engine (e.g. removing comments, support multiple unicode versions), bash file pattern to regex conversion, whatever...). It could extract trigrams (part of pattern analysis) as well.

Reason: The internal representation of PCRE is designed for matching a pattern, not for analysing a pattern. The AST created by this other project would be the opposite, designed for analysing and optimizing a pattern rather than focusing on matching. Adding removing nodes would be easy, and the result could be serialized to a new pattern.

It could be part of PCRE, but in a separate src directory. The main pcre and this project should have zero dependency.

-------- Eredeti levél --------
Feladó: ph10@??? (Link ->
Dátum: 2018 december 14 16:34:05
Tárgy: [pcre-dev] Extracting trigrams from patterns: opinions wanted
Címzett: pcre-dev@??? (Link ->
Following a suggestion from Ævar Arnfjörð Bjarmason I've been doing some
experimental work on extracting trigrams (sequences of 3 consecutive
characters) from compiled patterns. For example, if the pattern is
/foo.(bar|baz)/ the extraction yields AND("foo",OR("bar","baz")).
Here are some other examples:
The present code does "best efforts". It does not find every possible
trigram from every pattern. It handles text leading into parentheses,
but does not manage to carry text on afterwards:
No mandatory 3-character sequences identified
There are also some patterns (for example, those using (*ACCEPT)) where
it gives up completely.
The idea behind this is to improve performance for certain applications
that run very many searches on the same large data sets. If a trigram
index is created, a full pattern match need only be applied to those
lines (or records) that contain the relevant trigrams.
I also wrote a proof-of-concept grep-like program that makes a simple
trigram index (using a balanced binary tree, because I had code to
manage that) and then scans a file with and without optimization. The
file is 476K of English text. Some crude timings on simple searches
showed that, approximately:
Using JIT cut the time for the complete scan by 50%.
Building the index took about 5 times the non-JIT scan time.
The scan using the index and then JIT matching on those lines whose
trigrams matched was 10 times faster than JIT on every line.
These relative timings were very much the same for patterns that had
only one or two matches and also those with many matches. (Patterns with
no matches because none of their trigrams were in the index were very
fast, of course.) I probably could make the index scan faster still by
remembering offsets in the file in the index instead of line numbers,
but this is just testing code. It is clear that making the index is only
worth it if there are going to be many searches on the same data.
So, what to do next? I'd welcome comments and options from this list.
1. Is this a useful addition to PCRE2? (It could, of course, be optional
at build time.)
2. If so, what kind of API is best? Something along the lines of
rc = pcre2_find_ngrams(re, n, options, buffer, buffsize, context)
comes to mind, where re is a compiled pattern, n is 3 for a trigram, or
4, 5, 6 etc (perhaps limited to the range 2-10?), and the context is
needed to provide memory allocation/free functions because the code
builds a structure in memory before outputting it as an AND/OR string in
the buffer.
3. If the pattern sets UTF, the code works in characters, not code
units, and outputs its strings in UTF. In all cases, internal quotes are
doubled. But maybe code units would be better? Or an option?
4. Is there a more useful format for the output instead of quoted
strings? This would be needed if it was done in code units.
5. I have done nothing about case-independent matching. Something like
/(?i)abc/ just outputs "abc" instead of the 8 possible cased variants.
This could be changed; it could be an option. Equally, there could be an
option just to output the ngrams without the AND/OR wrapping. A simpler
application could use this to run the full match only on lines that
contain at least one of them, without having to implement AND/OR logic.
6. All the testing so far has been with the 8-bit PCRE2 library, but it
should be straightforward to make it general for all three widths.
Feedback awaited.... Thanks!
Philip Hazel
## List details at