On Mon, 3 Aug 2009, Ronen Hod wrote:
> I encountered serious performance issues so I had to do a quick fix
> for my needs. It is not thread-safe, it has assumptions regarding the
> number of states (which are sufficient for me), and I am not 100% sure
> that I understand all the implications (maybe it can be optimized
> further). My application as a whole runs 4x faster now, and I assume
> that pcre_dfa_exec() is ~10x faster on my data. Attached is the code
> with the changes. Enable/Disable them using: "#define
> CRESCENDO_CHECK_FOR_DUPLICATES".
I have now studied your patch, and I understand how you are getting a
speed-up. Unfortunately, I cannot install the patch in PCRE because of
the problems you mention: it is not thread-safe and it makes assumptions
about the number of states.
I will try to think about other ways of speeding up the duplicate
checking that are thread-safe and do not make any assumptions, though I
have a feeling that this will not be easy.
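
As an illustration only (this is not code from PCRE or from the posted
patch), the sketch below shows one general way a duplicate check can be
made constant-time without static data: a generation-stamped table that
the caller owns and sizes. The names seen_set, seen_init, seen_next_char,
seen_add and seen_free are hypothetical, and the sketch assumes that each
state can be identified by a single integer offset, which may not hold
for the real matcher.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-call scratch area; not part of the PCRE API.
   Each matching call owns its own seen_set, so no state is shared
   between threads. */
typedef struct {
    unsigned *stamp;      /* stamp[i] == generation means "already added" */
    size_t    size;       /* number of state offsets that can be tracked */
    unsigned  generation; /* advanced once per subject character */
} seen_set;

/* Allocate a table for 'size' offsets. Returns 0 on success, -1 on failure. */
static int seen_init(seen_set *s, size_t size)
{
    assert(s != NULL);
    s->stamp = calloc(size, sizeof *s->stamp);
    if (s->stamp == NULL) return -1;
    s->size = size;
    s->generation = 1;    /* 0 is reserved for "never stamped" */
    return 0;
}

/* Start a new set of active states without clearing the whole table. */
static void seen_next_char(seen_set *s)
{
    if (++s->generation == 0) {
        /* The counter wrapped: clear the table once and restart. */
        memset(s->stamp, 0, s->size * sizeof *s->stamp);
        s->generation = 1;
    }
}

/* Returns 1 if the offset was newly added, 0 if it is a duplicate,
   and -1 if it is out of range (the caller can then fall back to a
   slower check instead of relying on a fixed limit). */
static int seen_add(seen_set *s, size_t offset)
{
    if (offset >= s->size) return -1;
    if (s->stamp[offset] == s->generation) return 0;
    s->stamp[offset] = s->generation;
    return 1;
}

static void seen_free(seen_set *s)
{
    free(s->stamp);
    s->stamp = NULL;
    s->size = 0;
}
```

The table size is chosen by the caller rather than fixed at compile time,
and an out-of-range offset is reported rather than assumed impossible.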
In the meantime, I hope you have read Jeffrey Friedl's book "Mastering
Regular Expressions" and made sure that the patterns you are using are
as optimised as possible.
Thanks for posting your code and bringing this area of performance to my
attention.