Re: [pcre-dev] optimizing matches for large strings

Author: Alan Lehotsky
Date:
To: pcre-dev
Subject: Re: [pcre-dev] optimizing matches for large strings

The pattern consists of a sequence of P1, followed by a sequence of P2 followed by a sequence of P4
followed by a sequence of P9, so "P1P[249])*" would accept "P1P4P2" when it's not supposed to.

In the wild, the input strings don't have 50k of P2, followed by 50k of P4, followed by 50k of P9, this was just
the test-case distilled down for me by my user.

And in fact, it's part of an application where the patterns are not fixed; so other searches would be using different patterns to match data. I was just hoping there was some way to advise my user how to rewrite the regex pattern being used in this particular case to avoid the recursion.

Thanks,
Al

On Feb 12, 2011, at 11:14 AM, Philip Hazel wrote:

> On Fri, 11 Feb 2011, Alan Lehotsky wrote:
>
>> I have a string that's 150,002 characters consisting of
>>
>> P1P2......P4....P9.....

>>
>>
>> (Thats 'P1', followed by 50k each of 'P2', 'P4', 'P9')
>>
>> and a simpleminded regex of
>>
>> (P1(P2)*(P4)*(P9)*)?

>
> Have you tried (P1(P[249])*)? Actually, don't bother. I don't think it
> will make any difference. The thing is, there is an internal recursive
> call each time it enters a parenthesis.
>
>> I assume that if I configured PCRE to not use the stack that I could
>> make this run (although probably really slowly due to malloc/free
>> overhead).
>
> Well, it's still going to use the same amount of memory. So why not make
> the stack even bigger instead?
>
> If, as your comment suggests, you really know exactly what's in your
> string, what are you using PCRE to try to find? Perhaps there's another
> way of solving the problem.
>
> Philip
>
> --
> Philip Hazel

This message is part of the following thread:
	the complete thread tree sorted by date
	jcd at
	Philip Hazel at