Author: pcunite
Date:
To: pcre-dev
Subject: Re: [pcre-dev] Matching file contents as one string using PCRE_DOTALL
> Philip > Some comments about that regex, independent of DOTALL:
My personal thanks for taking the time to comment in this thread. Using your
library I've been able to achieve my goals. However, as you've noted ... it
could be better. I'll briefly tell you what I'm doing and this might explain
the approach I've taken.
PCRE is the backend to a frontend GUI that supports DOS wildcards and Boolean terms. The
GUI therefore converts that simple syntax into PCRE syntax. I never know
what the user will type or in what order, so I have built a parser that does its
best to convert the input sanely. Here are some example conversions. Grouping allowed
me to make a smaller parser.
Typed into GUI ------------------> My parsed conversions
File names:
-------------
*.txt ---------------------------> (?=.*\.txt$)
txt* ----------------------------> (?=^txt.*)
*.txt OR *.exe -----------------> (?=.*\.txt$)|(?=.*\.exe$)
*.txt NOT *.exe -----------------> ((?=.*\.txt$))(?!.*\.exe$)
File contents
-------------
hello ---------------------------> (?=.*hello.*)
hello OR world -----------------> (?=.*hello.*)|(?=.*world.*)
hello NOT world -----------------> ((?=.*hello.*))(?!.*world.*)
hello AND world -----------------> (?=.*hello.*)(?=.*world.*)
hello AND world NOT sky ---------> ((?=.*hello.*)(?=.*world.*))(?!.*sky.*)
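
A minimal sketch of how the file-contents conversions above could be generated. This is not the actual parser: the names `lookahead` and `convert` are hypothetical, Python's `re` module stands in for PCRE, terms are assumed to be single whitespace-separated words, and two terms with no operator between them are assumed to mean AND.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

def lookahead(term, negate=False):
    """Wrap one literal term as (?=.*term.*) or (?!.*term.*)."""
    head = "(?!" if negate else "(?="
    return head + ".*" + re.escape(term) + ".*)"

def convert(query):
    """Translate e.g. 'hello AND world NOT sky' into a lookahead pattern.

    Hypothetical helper: terms are whitespace-separated, and a term that
    follows another term without an operator is treated as AND.
    """
    pattern = ""
    pending = None  # operator seen since the previous term
    for token in query.split():
        if token in OPERATORS:
            pending = token
            continue
        if pending == "NOT":
            # Group everything built so far, then append the negative lookahead.
            pattern = "(" + pattern + ")" + lookahead(token, negate=True)
        elif pending == "OR":
            pattern += "|" + lookahead(token)
        else:
            # AND, or the very first term: plain concatenation.
            pattern += lookahead(token)
        pending = None
    return pattern

print(convert("hello AND world NOT sky"))
# prints ((?=.*hello.*)(?=.*world.*))(?!.*sky.*)
```

Escaping each term with `re.escape` is an addition of this sketch; it keeps characters such as `.` or `(` in a user's search word from being read as regex syntax.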
As you can see, I use ^ or $ when the user is searching file names, since there are no
newlines to worry about. You may also notice that I'm building my strings by starting with (? and
appending =.* or !.* as appropriate. Your suggestion to use a lazy star (.*?) would not be
hard for me to add.
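
A rough sketch of the "start with (? and append =.* or !.*" construction for file names. The function name `wildcard_to_lookahead` is hypothetical and Python's `re` stands in for PCRE. A leading `*` reproduces the `$`-anchored form shown for `*.txt`; treating a trailing `*` as a prefix match with `^`, and a pattern with no `*` as an exact name, are assumptions of this sketch.

```python
import re

def wildcard_to_lookahead(pattern, negate=False):
    """Convert a simple DOS-style wildcard into an anchored lookahead.

    Only a single leading or trailing '*' is handled (an assumption).
    """
    head = "(?!" if negate else "(?="
    if pattern.startswith("*"):
        # '*.txt' -> name must END with '.txt'
        body = ".*" + re.escape(pattern[1:]) + "$"
    elif pattern.endswith("*"):
        # 'txt*' -> name must START with 'txt'
        body = "^" + re.escape(pattern[:-1]) + ".*"
    else:
        # No wildcard: exact file name.
        body = "^" + re.escape(pattern) + "$"
    return head + body + ")"

print(wildcard_to_lookahead("*.txt"))               # prints (?=.*\.txt$)
print(wildcard_to_lookahead("*.exe", negate=True))  # prints (?!.*\.exe$)
```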
I don't want any form of lookbehind occurring at all (to minimize stack use). Also, if
you're telling me I can drop the trailing .* when looking inside files and that it improves
performance, then I'm all for that! However, it will be a while before I can create a parser
that is highly efficient. Because of the grouping, I thought the trailing .* was what made
things work in testing. I'll double-check.
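
One way to double-check this is to compare the variants on sample text. The sketch below uses Python's `re` with its DOTALL flag as a stand-in for PCRE with PCRE_DOTALL; the helper names and sample strings are made up for illustration. Since `.*` can match the empty string, a trailing `.*` inside a lookahead cannot change whether the lookahead succeeds, with or without grouping. The sketch checks only the yes/no result and measures no performance.

```python
import re

WITH_TAIL = re.compile(r"(?=.*hello.*)", re.DOTALL)
WITHOUT_TAIL = re.compile(r"(?=.*hello)", re.DOTALL)
LAZY = re.compile(r"(?=.*?hello)", re.DOTALL)  # lazy star: stops at the first 'hello'

GROUPED_TAIL = re.compile(r"((?=.*hello.*)(?=.*world.*))(?!.*sky.*)", re.DOTALL)
GROUPED_BARE = re.compile(r"((?=.*hello)(?=.*world))(?!.*sky)", re.DOTALL)

def verdicts(text):
    """Return whether each single-term variant finds 'hello' in text."""
    return tuple(p.match(text) is not None
                 for p in (WITH_TAIL, WITHOUT_TAIL, LAZY))

def grouped_agree(text):
    """True if the grouped pattern gives the same answer with and without '.*'."""
    return (GROUPED_TAIL.match(text) is None) == (GROUPED_BARE.match(text) is None)

sample = "first line\nsay hello here\nlast line\n"
print(verdicts(sample))         # prints (True, True, True)
print(verdicts("no greeting"))  # prints (False, False, False)
print(grouped_agree("hello\nworld"), grouped_agree("hello world sky"))
# prints True True
```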