Hi all,
if you are not interested in pcre performance, you can ignore this mail.
I started to work to speed up pcre_exec with a just-in-time compiler (generates machine executable code during runtime). The work is still in an early phase, it suppots the quantifiers (*, +, ?, operator, their greedy/non-greedy/possessive versions, single character / any length), capturing brackets, atomic blocks, case-insensitive matching, character ranges (except OP_XCLASS) and some special characters like \w, \d and their inverted versions. Still there is a lot of work ahead, and I would be happy if some of you could join to this work (someone who is interested in low-level programming and regular expressions or students could gain a lot of experince, even work on it as their master thesis).
Some performance results on an x86-32 machine:
C - compile time = pcre_compile (in ms)
J - jit comping time = pcre_study (in ms)
Int runtime - interpreter (pcre_exec) runtime (in ms)
JIT runtime - machine code runtime (in ms)
The input.txt (size: 20045118) is loaded.
Pattern: 'how to' Matches: 395
C: 0 ms, J: 0 ms, Int runtime: 110 ms JIT runtime: 50 ms
Pattern: 'how to' Matches: 420 Caseless
C: 0 ms, J: 0 ms, Int runtime: 130 ms JIT runtime: 50 ms
Pattern: 'walk|get|she|make' Matches: 27953
C: 0 ms, J: 0 ms, Int runtime: 600 ms JIT runtime: 140 ms
Pattern: 'walk|get|she|make' Matches: 31511 Caseless
C: 0 ms, J: 0 ms, Int runtime: 640 ms JIT runtime: 130 ms
Pattern: '[Ww]alk|[Gg]et|[Ss]he|[Mm]ake' Matches: 31388
C: 0 ms, J: 0 ms, Int runtime: 780 ms JIT runtime: 150 ms
Pattern: '(?:ba|lu|ro)+(?:r|ck)*' Matches: 87122
C: 0 ms, J: 0 ms, Int runtime: 510 ms JIT runtime: 120 ms
Pattern: '(?:ba|lu|ro)+(?:r|ck)*' Matches: 91714 Caseless
C: 0 ms, J: 0 ms, Int runtime: 550 ms JIT runtime: 140 ms
Pattern: '\w+our\w*' Matches: 17230 Caseless
C: 0 ms, J: 0 ms, Int runtime: 4430 ms JIT runtime: 720 ms
Pattern: '\w{2,5} \w{7,9} \w{1,4}' Matches: 196487
C: 0 ms, J: 0 ms, Int runtime: 3250 ms JIT runtime: 590 ms
Pattern: '\WX{0,3}(?:(?:V?I{1,3})|V|IV|IX)\W' Matches: 51718 Caseless
C: 0 ms, J: 0 ms, Int runtime: 2480 ms JIT runtime: 230 ms
Pattern: '\W([a-k]([g-m]\w*?[h-o])+[d-m]\W*)+\W+[a-d]+' Matches: 6135 Caseless
C: 0 ms, J: 0 ms, Int runtime: 1420 ms JIT runtime: 250 ms
Pattern: '\W[e-z]{4,}+--[a-u]{3,6}+\W' Matches: 2491
C: 0 ms, J: 0 ms, Int runtime: 1070 ms JIT runtime: 250 ms
Pattern: '-?-?-?-?-?-?-?-?-?-?----------' Matches: 61
C: 0 ms, J: 0 ms, Int runtime: 960 ms JIT runtime: 130 ms
You can download the source code from:
http://sljit.sourceforge.net/pcre-8.11-sljit-all.tgz
I know it is not the best place, but I have no better idea for now. If we grow in numbers, a git or svn repository would be much better.
The work is not ready for use, it is under development, and may generate incorrect results or crashes in some cases.
Regards,
Zoltan