Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2106] Please add support for parsing POSIX basic & extended regular expressions
https://bugs.exim.org/show_bug.cgi?id=2106

--- Comment #6 from Philip Hazel <ph10@???> ---
(In reply to Kyle J. McKay from comment #5)
> > REG_STARTEND is already there.
>
> Except that it's not *BSD compatible -- see bug #2128


Now mended.

> I might expect matching a fixed pattern like "abcabcz" against
> a string like "abcabcabcabcabcz" to not be handled all that
> efficiently by a naive strstr (or memmem) implementation, but
> I'd expect a pattern matching engine to do better.


Interestingly, I would expect the opposite. A regex engine has to worry about
all the non-fixed stuff whereas an engine specifically looking for fixed
strings (even strstr) doesn't have to. It might be instructive to write a test
program that compares timings for strstr() vs PCRE2 for some fixed strings.
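
A rough sketch of such a test program follows, under stated assumptions. It is
written against the POSIX regcomp()/regexec() API, so as shown it times
whatever regex library <regex.h> resolves to on the build system; timing PCRE2
would mean building against PCRE2's POSIX wrapper (pcre2posix.h) instead, and
that build step is assumed, not shown. The function names and the subject
layout (copies of "abc" followed by "z") are illustrative only.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Offset of the first occurrence of needle in hay via strstr(), or -1. */
static long find_strstr(const char *hay, const char *needle)
{
    assert(hay != NULL && needle != NULL);
    const char *p = strstr(hay, needle);
    return p ? (long)(p - hay) : -1;
}

/* Offset of the first match of an already compiled POSIX regex, or -1. */
static long find_regex(const regex_t *re, const char *hay)
{
    regmatch_t m[1];
    return regexec(re, hay, 1, m, 0) == 0 ? (long)m[0].rm_so : -1;
}

/* Build a subject of `reps` copies of "abc" followed by 'z'. Caller frees. */
static char *make_subject(size_t reps)
{
    char *s = malloc(reps * 3 + 2);
    if (s == NULL) return NULL;
    for (size_t i = 0; i < reps; i++) memcpy(s + i * 3, "abc", 3);
    s[reps * 3] = 'z';
    s[reps * 3 + 1] = '\0';
    return s;
}

/* Run `iters` searches with each method and print the CPU time used.
   Returns 0 on success, 1 on allocation or compile failure. */
static int compare_timings(size_t reps, int iters)
{
    const char *needle = "abcabcz";
    char *hay = make_subject(reps);
    regex_t re;
    if (hay == NULL) return 1;
    if (regcomp(&re, needle, 0) != 0) { free(hay); return 1; }

    /* Volatile pointer and sink keep the compiler from hoisting or
       discarding the searches inside the timing loops. */
    const char *volatile h = hay;
    volatile long sink = 0;

    clock_t t0 = clock();
    for (int i = 0; i < iters; i++) sink += find_strstr(h, needle);
    clock_t t1 = clock();
    for (int i = 0; i < iters; i++) sink += find_regex(&re, h);
    clock_t t2 = clock();

    printf("strstr:  %.6f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("regexec: %.6f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);

    regfree(&re);
    free(hay);
    return 0;
}
```

Calling compare_timings(100000, 1000) from main() would give a first
impression; the numbers depend heavily on the C library, the regex library,
and the compiler options, so they should be read as indicative only.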

If you are looking for caseful, shortish fixed strings in shortish subject
strings, are in an 8-bit world, have no binary zeroes in your strings, and are
not too worried about performance, then I would have thought that strstr()
would be fine. For more "serious" searches, one of the Boyer-Moore type
algorithms is best. I don't know if anyone has written a B-M searching library.
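
For illustration, here is a minimal sketch of the Horspool simplification of
Boyer-Moore. It is not taken from any library; the name bmh_search and its
signature are hypothetical. It works on byte buffers with explicit lengths, so
binary zeroes in either string are not a problem.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search for needle[0..nlen) in hay[0..hlen).
   Returns a pointer to the first occurrence, or NULL if there is none.
   An empty needle matches at the start of hay. */
static const unsigned char *bmh_search(const unsigned char *hay, size_t hlen,
                                       const unsigned char *needle, size_t nlen)
{
    size_t shift[UCHAR_MAX + 1];

    assert(hay != NULL && needle != NULL);
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;

    /* Bad-character table: how far the window may move when its last byte
       is c. Bytes not in the needle allow a full-length jump. */
    for (size_t c = 0; c <= UCHAR_MAX; c++) shift[c] = nlen;
    for (size_t i = 0; i + 1 < nlen; i++) shift[needle[i]] = nlen - 1 - i;

    size_t pos = 0;
    while (pos <= hlen - nlen) {
        if (memcmp(hay + pos, needle, nlen) == 0) return hay + pos;
        /* Skip ahead based on the last byte of the current window. */
        pos += shift[hay[pos + nlen - 1]];
    }
    return NULL;
}
```

On the subject "abcabcabcabcabcz" with the needle "abcabcz", this moves the
window three bytes at a time and finds the match at offset 9 after four
window comparisons.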

For caseless matching, of course, that doesn't apply. PCRE2 just tries each
character one by one against both (all) of its cases.

> (With
> REG_UTF8 does PCRE perform virtual NFC canonicalization while
> matching so, for example, a decomposed e+accent matches the
> precomposed e+accent version? I'm thinking it probably does...)


No, I'm afraid it doesn't. It handles only individual characters, not
compositions.

> In any case, a wrapper that wants to implement REG_NOSPEC
> can just kludge it up with calls to strstr/memmem or
> producing a malloc'd duplicate starting with \Q and escaped
> \E (which is \E\\E\Q BTW) replacements -- I don't see why
> the pattern translator can't do that itself though in order to
> provide a REG_NOSPEC option.


Not quite sure what you mean by "pattern translator"? PCRE's regcomp() is just
an API wrapper; it doesn't translate anything (except options bits). I suppose
in theory it *could* translate as you suggest, though dealing with \E requires
more than just a simple search: consider this pattern "A\\EB". *If* anything
were to be added to PCRE2 (and it would be PCRE2, as PCRE1 is feature-frozen),
it might be better to add a PCRE2_LITERAL option to pcre2_compile() which
REG_NOSPEC could activate. However, I'm not keen (as you can probably guess).
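
For reference, the wrapper-side kludge described in the quoted text could be
sketched roughly as below. This assumes the \Q...\E rule that only the
two-character sequence \E ends a quoted run, and it assumes a NUL-terminated
literal, so it cannot handle embedded binary zeroes. The function name
quote_literal is hypothetical; nothing like it exists in PCRE or PCRE2.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Wrap a NUL-terminated literal in \Q...\E, replacing each \E inside it
   with \E\\E\Q (end the quote, match a backslash and an E, reopen the
   quote). Returns a malloc'd pattern that the caller must free, or NULL
   if allocation fails. */
static char *quote_literal(const char *lit)
{
    assert(lit != NULL);

    /* Count the \E sequences so the output size is known up front. */
    size_t hits = 0;
    for (const char *p = strstr(lit, "\\E"); p != NULL; p = strstr(p + 2, "\\E"))
        hits++;

    /* Each hit grows from 2 to 7 bytes; add \Q, \E and the trailing NUL. */
    char *out = malloc(strlen(lit) + hits * 5 + 5);
    if (out == NULL) return NULL;

    char *w = out;
    memcpy(w, "\\Q", 2);
    w += 2;
    for (const char *p = lit; *p != '\0'; ) {
        if (p[0] == '\\' && p[1] == 'E') {
            memcpy(w, "\\E\\\\E\\Q", 7);
            w += 7;
            p += 2;
        } else {
            *w++ = *p++;
        }
    }
    memcpy(w, "\\E", 3); /* copies the terminating NUL as well */
    return out;
}
```

Read as the five characters A, backslash, backslash, E, B, the input A\\EB
comes out as \QA\\E\\E\QB\E: the first backslash stays inside the quoted run
and only the second one, together with the E, is treated as a terminator that
needs replacing.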
