[pcre-dev] [Bug 2106] Please add support for parsing POSIX b…

Author: admin
To: pcre-dev
Subject: [pcre-dev] [Bug 2106] Please add support for parsing POSIX basic & extended regular expressions

--- Comment #9 from Kyle J. McKay <mackyle@???> ---
(In reply to Philip Hazel from comment #6)

> > > REG_STARTEND is already there.
> >
> > Except that it's not *BSD compatible -- see bug #2128
> Now mended.

Very nice. Thank you. :)

> > (With
> > REG_UTF8 does PCRE perform virtual NFC canonicalization while
> > matching so, for example, a decomposed e+accent matches the
> > precomposed e+accent version? I'm thinking it probably does...)
> No, I'm afraid it doesn't. It handles only individual characters,
> not compositions.

Bummer dude. I filed enhancement request bug #2131 asking for
support for Unicode Collation Algorithm matching (icu does that).
You can always just mark that "Won't Fix". ;)

> > In any case, a wrapper that wants to implement REG_NOSPEC
> > can just kludge it up with calls to strstr/memmem or
> > producing a malloc'd duplicate starting with \Q and escaped
> > \E (which is \E\\E\Q BTW) replacements -- I don't see why
> > the pattern translator can't do that itself though in order to
> > provide a REG_NOSPEC option.
> Not quite sure what you mean by "pattern translator"? PCRE's regcomp()
> is just an API wrapper; it doesn't translate anything (except options
> bits).

I apologize, I should have been clearer. I meant to refer to the
"translation functions" as described above in comment #2 used to
translate POSIX BREs into PCREs. No reason a similar function couldn't
support translation of a "fixed" string into a PCRE pattern.

My misunderstanding of \Q...\E operation has confused the
issue. Proper use of \Q...\E always seems to trip me up.
After some docs exploration I believe I grok it better now.
For readers following along at home, I think this may prove
somewhat illuminating if \Q...\E has confounded you too:

    perl -le '$x="\$\41"; print "\Q(hi$x)\E"'

I would propose a simplistic approach to translation for REG_NOSPEC
but I find I am at a loss for what PCRE does with \Q...\E sequences.

All of these produce one line of matching output:

    printf '%s\n' 'AxxEB' | perl -ne 'print if /AxxEB/'
    printf '%s\n' 'AxxEB' | perl -ne 'print if /\QAxxEB\E/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /A..EB/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /A\\\\EB/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /\QA\\EB/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /\QA\\EB\E/'

But only the first four of these do:

    printf '%s\n' 'AxxEB' | pcregrep 'AxxEB'
    printf '%s\n' 'AxxEB' | pcregrep '\QAxxEB\E'
    printf '%s\n' 'A\\EB' | pcregrep 'A..EB'
    printf '%s\n' 'A\\EB' | pcregrep 'A\\\\EB'
    printf '%s\n' 'A\\EB' | pcregrep '\QA\\EB'
    printf '%s\n' 'A\\EB' | pcregrep '\QA\\EB\E'

These all produce a line of matching output:

    printf '%s\n' 'A\\EB' | pcregrep '\QA\\E\\EB'
    printf '%s\n' 'A\\EB' | pcregrep '\QA\\\EEB'
    printf '%s\n' 'A\\EB' | pcregrep '\QA\\\E\QEB'
    printf '%s\n' 'A\\EB' | pcregrep '\QA\\\E\QEB\E'

However only the last three of these do:

    printf '%s\n' 'A\\EB' | perl -ne 'print if /\QA\\E\\EB/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /\QA\\\EEB/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /\QA\\\E\QEB/'
    printf '%s\n' 'A\\EB' | perl -ne 'print if /\QA\\\E\QEB\E/'

Which suggests that prefixing the string with \Q and replacing
all internal sequences of \E with \\E\QE might work, but PCRE seems to
handle the \Q...\E sequences differently than Perl so I'm unclear on that.
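
For what it's worth, the PCRE side of that replacement can be sketched in
Python. This is only a model under stated assumptions: the helper names
literal_to_pcre_q and pcre_q_decode are made up for illustration, and the
decoder encodes just the rule the pcregrep results above suggest (inside \Q
everything up to the next \E is literal, and a \Q with no closing \E runs to
the end of the pattern). It says nothing about Perl, which the tests above
show parsing these sequences differently.

```python
def literal_to_pcre_q(text: str) -> str:
    """Wrap text in \\Q...\\E, rewriting each embedded \\E as \\\\E\\QE.

    Hypothetical helper: the backslash of an embedded \\E stays literal,
    the inserted \\E ends quoting, \\Q restarts it, and E follows.
    """
    return r"\Q" + text.replace(r"\E", r"\\E\QE") + r"\E"


def pcre_q_decode(pattern: str) -> str:
    """Recover the literal from a pattern made only of \\Q...\\E runs.

    Assumed model of PCRE: quoting ends at the first \\E after \\Q.
    """
    out = []
    i = 0
    while i < len(pattern):
        if not pattern.startswith(r"\Q", i):
            raise ValueError("expected \\Q at offset %d" % i)
        i += 2
        end = pattern.find(r"\E", i)
        if end == -1:
            # Unterminated \Q: the rest of the pattern is literal.
            out.append(pattern[i:])
            break
        out.append(pattern[i:end])
        i = end + 2
    return "".join(out)


if __name__ == "__main__":
    subject = r"A\\EB"  # A, two backslashes, E, B
    pattern = literal_to_pcre_q(subject)
    print(pattern)  # prints \QA\\\E\QEB\E
    assert pcre_q_decode(pattern) == subject
```

The printed pattern is the same as the last pcregrep test above. The
round trip only checks the sketch against its own model of PCRE; running
the generated patterns through pcregrep is still the real test.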

But I'm sure that a variation of what Perl's quotemeta function does
("all characters not matching /[A-Za-z_0-9]/ will be preceded by a
backslash") would take care of it. I think the needed variation is to
limit the processing to byte values 0x20-0x7f in the string, to avoid
corrupting UTF-8 multibyte sequences; space can probably be excluded
as well for a minor efficiency.
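
A minimal sketch of that quotemeta variation, in Python over raw bytes. The
function name quote_ascii_meta is hypothetical and nothing here is PCRE API;
it only shows the byte-level rule described above: put a backslash before
every byte in 0x20-0x7f that is not in [A-Za-z_0-9], and pass everything
else through untouched so UTF-8 multibyte sequences stay intact.

```python
def quote_ascii_meta(data: bytes) -> bytes:
    """Backslash-escape ASCII non-word bytes; leave all other bytes alone."""
    out = bytearray()
    for b in data:
        is_word = (
            0x30 <= b <= 0x39      # 0-9
            or 0x41 <= b <= 0x5A   # A-Z
            or 0x61 <= b <= 0x7A   # a-z
            or b == 0x5F           # underscore
        )
        # Only bytes 0x20-0x7f are candidates; bytes >= 0x80 (UTF-8
        # lead and continuation bytes) and control bytes are copied as-is.
        if 0x20 <= b <= 0x7F and not is_word:
            out.append(0x5C)       # backslash
        out.append(b)
    return bytes(out)


if __name__ == "__main__":
    print(quote_ascii_meta(rb"A\\EB"))  # A, then four backslashes, then EB
```

For the A\\EB subject this yields A\\\\EB, the same pattern that matched in
both perl and pcregrep in the tests above. Space is still escaped here;
dropping it from the range is the minor efficiency mentioned above.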

(In reply to Philip Hazel from comment #7)

> 2. With the recently revised pcre2_compile() structure, it would be very
> straightforward to implement a PCRE2_LITERAL option, despite what I said
> in a previous post.

Which makes the above all moot anyway. ;)

> I've just spent some time fiddling around with timing experiments and
> the results are illuminating.

> I tried searching for a 10-character string near the end of a
> 500-character string. The rough ratio of the timings was:
> strstr():           15
> PCRE2 without JIT:  35
> PCRE2 with JIT:      9

> For PCRE2, these are matching times, not including compilation times. So, as
> I expected, strstr() is much better than the PCRE2 interpreter.

> However, PCRE2 with JIT does a lot better, which does surprise me a bit,
> though again it is perhaps cheating not to include the compilation time.
> I realized two important things while doing this:
> 1. The regcomp() API does not use JIT. Should it?

Yes, please. :) Is JIT compilation really that much slower than non-JIT?
Does PCRE2 do both so it can have an exportable pattern? Or is JIT an extra
step after regular pattern compilation so it will always take longer?

> There is always the cost of JIT compilation to weigh against the matching
> speed up.

If JIT is an extra step, then can you use the results of the initial pattern
compilation step to guide a default "PCRE2_AUTOJIT" mode? That is, if the
result of the initial pattern compilation suggests JIT would be either
1) extremely beneficial for the pattern in question or 2) very cheap to
compile for the pattern in question, it gets JITified (provided JIT is
available).
Future releases of PCRE2 could then improve on the metrics used to decide
without need for any client adjustments.

> Should there be a PCRE2-specific REG_JIT (or REG_NOJIT) option?

I'm inclined to auto-JITify by default, as you want to encourage folks to
adopt PCRE2 and one good way to accomplish that is to provide an easy-to-use
and familiar interface (i.e. pcreposix) that provides the fastest possible
pattern matching by default (i.e. use of extra options not required).

You also might want to have the pcreposix interface tamp down on any of the
defaults (if PCRE2 hasn't already done that) which allow malicious patterns
to consume excessive CPU. I doubt any of the POSIX pattern requirements
need a recursion or matching limit of a million for example. The reduced
defaults need only be applied while compiling patterns via the pcreposix
interface. Not good for adoption if someone tries one of the malicious
patterns with pcreposix just to see what happens and their CPU melts a
hole through their desk. ;)

(In reply to Philip Hazel from comment #8)

> Awaiting any feedback on my previous long comment.

Missives don't just grow on trees you know! Producing a missive is a
lengthy process that requires a carefully nurtured incubation period before
it's ready to be hatched into an actual bug comment. ;)

> FYI: I have implemented REG_PEND.

Very nice. Thank you. :)
