Re: [pcre-dev] optimizing matches for large strings

Top Page
Delete this message
Author: Herczeg Zoltán
Date:  
To: Alan Lehotsky
CC: pcre-dev
Subject: Re: [pcre-dev] optimizing matches for large strings
Hi,

Is P1 means two literals? 'P' and '1'? PCRE has clever optimizations for quantifiers used on single characters: it can count them, and go back when recursion needed. The story is more difficult for other expessions, since a recursion may happened inside the subexpression.

Anyway, back references also optimized in the same way as characters, so perhaps:

"(P1)*" could be rewritten to "(?:(P1)\1*)?"

Regards,
Zoltan

Alan Lehotsky <alehotsky@???> írta:
>I have a string that's 150,002 characters consisting of>
>

        P1P2......P4....P9.....>
>
>

(Thats 'P1', followed by 50k each of 'P2', 'P4', 'P9')>
>

and a simpleminded regex of>
>

    (P1(P2)*(P4)*(P9)*)?>

>

With PCRE 7.9, this crashes in match() on the 5672'd recursive call (on a linux box with a moderately large stack)>
>

With the old Spencer regex, we match successfully.>
>

I tried the obvious things to improve this:>
>

1/ replaced capturing parentheses with (?: )>
>

2/ Added explicit ^$ anchors>
>

3/ possessive quantifiers.>
>

4/ atomic grouping>
>

In all cases I still stomp the stack via recursion in match().>
>

I assume that if I configured PCRE to not use the stack that I could make this run (although probably really slowly due to malloc/free overhead).  >
>

Any advice?>
-- >
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev >