Re: [pcre-dev] optimizing matches for large strings

Author: Herczeg Zoltán
Date:
To: Alan Lehotsky
CC: pcre-dev
Subject: Re: [pcre-dev] optimizing matches for large strings

Hi,

Is P1 means two literals? 'P' and '1'? PCRE has clever optimizations for quantifiers used on single characters: it can count them, and go back when recursion needed. The story is more difficult for other expessions, since a recursion may happened inside the subexpression.

Anyway, back references also optimized in the same way as characters, so perhaps:

"(P1)*" could be rewritten to "(?:(P1)\1*)?"

Regards,
Zoltan

Alan Lehotsky <alehotsky@???> írta:
>I have a string that's 150,002 characters consisting of>
>
P1P2......P4....P9.....>
>
>
(Thats 'P1', followed by 50k each of 'P2', 'P4', 'P9')>
>
and a simpleminded regex of>
>

    (P1(P2)*(P4)*(P9)*)?>

>
With PCRE 7.9, this crashes in match() on the 5672'd recursive call (on a linux box with a moderately large stack)>
>
With the old Spencer regex, we match successfully.>
>
I tried the obvious things to improve this:>
>
1/ replaced capturing parentheses with (?: )>
>
2/ Added explicit ^$ anchors>
>
3/ possessive quantifiers.>
>
4/ atomic grouping>
>
In all cases I still stomp the stack via recursion in match().>
>
I assume that if I configured PCRE to not use the stack that I could make this run (although probably really slowly due to malloc/free overhead). >
>
Any advice?>
-- >
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev >

This message is part of the following thread:
	the complete thread tree sorted by date
	Alan Lehotsky at
	Philip Hazel at