Re: [pcre-dev] Matching file contents as one string using P…

Author: ph10
Date:  
To: pcunite
CC: pcre-dev
Subject: Re: [pcre-dev] Matching file contents as one string using PCRE_DOTALL
On Fri, 10 May 2013, pcunite@??? wrote:

> My personal thanks for taking the time to comment in this thread.


It's my birthday, so I thought I'd give you a present. <grin>

> I don't want any form of LookBehind (to minimize stack use) occurring
> at all. Also, if you're telling me I can drop the trailing .* when
> looking inside of files and it increases performance then I'm all for
> that! However, it will be awhile before I can create a parser that is
> highly efficient. Because of the grouping I thought .* made things
> work in testing. I'll double check.


I don't think lookbehinds affect stack usage any more than lookaheads,
which is what you are using. The only difference is where the look
starts from. But any kind of assertion is more resource-hungry than a
straight search. I created a line of data nearly 200,000 characters
long, and searched for the last word, which happened to be "STUDYING".
Using pcretest's "-tm 2000" option, I got these times (with only a few
tests, on a Linux box):

/STUDYING/           0.09 ms
/(?=.*STUDYING.*)/   0.2 ms
/(?=.*STUDYING)/     0.2 ms


So leaving off the trailing .* doesn't make much difference. I also
tried capturing and non-capturing brackets instead of your lookaheads,
and again got around 0.2 ms.
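
For anyone without pcretest to hand, a rough analogue of this comparison
can be sketched in Python. This is only an illustration under stated
assumptions: Python's re module is a different engine from PCRE, so its
optimizations and absolute timings will not match the figures above. The
test line (filler text repeated up to about 200,000 characters), the
time_pattern helper and the repeat count are all made up for the sketch.

```python
import re
import timeit

# Hypothetical test data: one line of 200,000 characters whose last
# word is the search target. The filler text is an assumption.
line = "lorem ipsum " * 16666 + "STUDYING"

patterns = [
    r"STUDYING",
    r"(?=.*STUDYING.*)",
    r"(?=.*STUDYING)",
]


def time_pattern(pattern, subject, repeats=20):
    """Return the mean time in milliseconds for one search of subject."""
    compiled = re.compile(pattern)
    # Sanity check: every pattern is expected to match this subject.
    assert compiled.search(subject) is not None
    total = timeit.timeit(lambda: compiled.search(subject), number=repeats)
    return total / repeats * 1000.0


# Time only the matching phase; compilation happens once per pattern.
results = {p: time_pattern(p, line) for p in patterns}

if __name__ == "__main__":
    for p, ms in results.items():
        print(f"/{p}/".ljust(24), f"{ms:.3f} ms")
```

As with the pcretest figures, a handful of runs on one machine only gives
a rough indication, so it is worth repeating the measurement before
drawing conclusions.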

But why is there such a difference? Answer: because of certain
optimizations, which can be checked with pcretest:

/STUDYING/
Capturing subpattern count = 0
No options
First char = 'S'
Need char = 'G'

In this situation, PCRE can whip through the string till it finds 'S',
before it starts doing the match. That's fast.

/(?:.*STUDYING.*)/
Capturing subpattern count = 0
No options
First char at start or follows newline
Need char = 'G'

In this case, match attempts will be tried at many more locations.
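
The effect of a known first character can be illustrated with a small
Python model. This is a simplified sketch, not PCRE's actual code: it
only handles a literal needle, the search_literal function and its
use_first_char flag are hypothetical names, and it does not reproduce
how PCRE treats a leading .* in a pattern. It simply counts how many
start offsets get a full comparison when the scanner can, or cannot,
skip ahead to the next occurrence of the first character.

```python
def search_literal(subject, needle, use_first_char=True):
    """Find a non-empty needle in subject, returning (offset, attempts).

    attempts counts the start offsets at which a full comparison was tried.
    """
    attempts = 0
    pos = 0
    last_start = len(subject) - len(needle)
    while pos <= last_start:
        if use_first_char:
            # Jump straight to the next occurrence of the first character.
            pos = subject.find(needle[0], pos)
            if pos == -1 or pos > last_start:
                break
        attempts += 1
        if subject.startswith(needle, pos):
            return pos, attempts
        pos += 1
    return -1, attempts


if __name__ == "__main__":
    # Hypothetical subject: filler text with the target as the last word.
    subject = "lorem ipsum " * 16666 + "STUDYING"
    print(search_literal(subject, "STUDYING", use_first_char=True))
    print(search_literal(subject, "STUDYING", use_first_char=False))
```

With this filler text there is no uppercase 'S' before the target, so the
first-character scan makes a single attempt, while the scan without it
tries every offset up to the match.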

What I'm really pointing out in this post is that there is a tool
(pcretest) that allows you to check out the performance of different
patterns to see which works best.

Philip

--
Philip Hazel