[pcre-dev] Auto-possessifying improvements in PCRE

Top Page
Delete this message
Author: Zoltán Herczeg
Date:  
To: pcre-dev
Subject: [pcre-dev] Auto-possessifying improvements in PCRE
Hi all,

we have been working on the auto-possessifying optimization in PCRE for some time, and I would like to give you a brief summary of the results we achived. This optimization replaces greedy/non-greedy single character repetitions with their appropriate posessive form. A simple example is rewriting /a+b/ to /a++b/. This optimization has been part of PCRE for a very long time, but it only supported simple cases before. Now it can even replace \s* to \s*+ in /\s*(?:left|right)?hand/. Of course using \s*+ directly in the pattern would provide the same effect, but possessive quantifiers are among the less known regular expression features, and they are rarely used (this is my impression at least). The performance of possessive quantifiers are usually much higher than other quantifiers, since the backtracking phase can be totally skipped.

However, before I show some results, let me tell you a bit more about possessive quantifiers. Their primary purpose is to define unbreakable multi-byte character sequences. The improved matching speed is just a side effect. For example, "sch" represent a single consonant in german, splitting it to "sc" and "h" is meaningless. Another nice example is newlines: most of the time a newline can be \r, \n and \r\n. If we want
to find the "aa" and "bb" strings, which are separated by at least two newlines, we can use the /aa(?>\r\n|\r|\n){2,}bb/ pattern. Without the ?> bracket type, the pattern would happily accept aa\r\nbb as well, and that is incorrect.

Back to the results, then. Once I got a pattern set used by an Intrusion Detection System, and I use it for benchmarking and also getting ideas how people use regular expressions. Sometimes I browse http://regexlib.com/ as well. I realized that most patterns are not exactly efficient, so regex compiler optimizations such as auto-possessifying seems very important. The gain provided by this particular optimization is the following (INT: interpreter, JIT: PCRE-JIT compiler, s: seconds):

was: INT: 412.16 s, JIT: 86.22 s
now: INT: 182.94 s, JIT: 45.46 s
progress: INT: 125% JIT: 90%

Of course on other pattern sets the results might be totally different, but we hope this helps to improve the overall performance of our favourite regex engine.

Regards,
Zoltan