Hi
I forked perl5 to cperl, with several improvements but full backcompat:
https://github.com/perl11/cperl
The plan for 2017 is to make PCRE2 the default regex engine and to fall back to the
perl5 engine only on the failing corner cases.
The 2nd plan is to also use Hyperscan, which is much simpler and much faster, but very limited.
For simple boolean matches I think it’s worthwhile, falling back to PCRE2 or the perl5
Spencer regex engine otherwise.
Currently PCRE2 is only 10% faster, but I haven’t done proper realistic benchmarking yet. I’ve just
run the regression test suite, which uses each compiled pattern only once.
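A more realistic measurement would compile a pattern once and match it many times. The following is only a rough sketch of such a benchmark, using the core Benchmark module. The log lines and the pattern are made up for illustration, and the pcre2 candidate is only added when re::engine::PCRE2 is actually installed. Because the engine choice is lexically scoped, the pattern is compiled inside a string eval with the pragma in effect.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical workload: 1000 log-like lines, invented for this sketch.
my @lines = map { "user$_\@example.com logged in from 10.0.0." . ($_ % 255) } 1 .. 1000;

# Return a closure that counts matching lines. The pattern is compiled once,
# under whatever pragma is passed in, and then reused for every line.
# Returns undef if the pragma's module cannot be loaded.
sub make_counter {
    my ($pragma, $lines) = @_;
    my $src = $pragma . q{
        my $re = qr/^(\w+)\@example\.com .*?(\d+)$/;
        sub { my $n = 0; $_ =~ $re and $n++ for @$lines; $n }
    };
    return eval $src;
}

my %candidates = (core => make_counter('', \@lines));
my $pcre2 = make_counter('use re::engine::PCRE2;', \@lines);
$candidates{pcre2} = $pcre2 if $pcre2;

# Run each candidate for about one CPU second; compare only if both exist.
my $results = timethese(-1, \%candidates);
cmpthese($results) if keys %candidates > 1;
```

With only the core engine available this prints a single timing line, so it is safe to run on any perl. The numbers say nothing about other patterns or inputs.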
There are two pieces of news:
1. re-engine-PCRE2:
https://github.com/rurban/re-engine-PCRE2
This uses the pcre2 engine in all perls, via the regex engine hooks in perl5.
Just add "use re::engine::PCRE2;" and get a 10% speedup.
2. git mirror:
https://github.com/rurban/pcre
updated bi-hourly.
I’m using this to test perl5/cperl/re-engine-PCRE2 regressions.
I’m running the regression test suite constantly with pcre2 and re-engine-PCRE2 updates.
Also in normal perl with <export PERL5OPT=-Mre::engine::PCRE2>.
Results:
95% of the tests pass already.
When I find regressions that make sense to report against pcre, I’ll post them here on this list.
For all patterns or options that are not yet supported, I just fall back from pcre2 to the perl5 core regex engine.
https://github.com/rurban/re-engine-PCRE2#failing-tests
Perl code blocks (?{ and (??{ are already handled fine via fallbacks;
only unicode semantics and some illegal patterns cause minor problems.
The locale character set logic (/l) also needs to be improved a bit.
Some perl modules trip over this.
I’ll triage it soon.
New upstream perl features:
The new perl5 regex features so far are:
* /aa stricter ASCII - adds ASCII restrictions to /i matching
https://perl5.git.perl.org/perl.git/blob/HEAD:/pod/perlre.pod#l735
e.g. it forbids ASCII/non-ASCII matches under /i, such as "k" matching \N{KELVIN SIGN}.
* /xx RXf_PMf_EXTENDED_MORE (v5.26)
which would be PCRE2_EXTENDED_MORE.
https://perl5.git.perl.org/perl.git/blob/HEAD:/pod/perlre.pod#l450
"Starting in Perl v5.26, if the modifier has a second C<"x"> within it,
it does everything that a single C</x> does, but additionally
non-backslashed SPACE and TAB characters within bracketed character
classes are also generally ignored, and hence can be added to make the
classes more readable.
/ [d-e g-i 3-7]/xx
/[ ! @ " # $ % ^ & * () = ? <> ' ]/xx
may be easier to grasp than the squashed equivalents
/[d-eg-i3-7]/
/[!@"#$%^&*()=?<>']/
"
This is a pretty trivial extension to add.
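Both modifiers can be tried out in plain perl. This is a small sketch that assumes perl v5.26 or later (needed for /xx). The helper names are made up for illustration, and it only exercises the core engine, not pcre2.

```perl
use 5.026;      # /xx needs perl v5.26 or later; also enables strict and say
use warnings;

# Plain /i lets "k" match KELVIN SIGN (U+212A) through case folding.
# With /aa added, ASCII and non-ASCII characters never match each other.
sub kelvin_matches {
    my $kelvin = "\N{KELVIN SIGN}";
    return {
        i   => ($kelvin =~ /k/i   ? 1 : 0),
        aai => ($kelvin =~ /k/aai ? 1 : 0),
    };
}

# Under /xx the blanks inside the bracketed class are ignored, so a
# literal space is not a member of the class.
sub in_class_xx { return $_[0] =~ / \A [d-e g-i 3-7] \z /xx ? 1 : 0 }

# The squashed equivalent without any modifier.
sub in_class    { return $_[0] =~ /\A[d-eg-i3-7]\z/ ? 1 : 0 }

for my $c ('d', 'f', 'h', '5', ' ') {
    printf "%-3s xx=%d plain=%d\n", "'$c'", in_class_xx($c), in_class($c);
}

my $k = kelvin_matches();
say "KELVIN SIGN =~ /k/i: $k->{i}, =~ /k/aai: $k->{aai}";
```

The two class checks should agree for every input, which is the behaviour an engine claiming /xx support would have to reproduce.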
I haven’t found anything else yet, but I’ll keep you updated.
Reini Urban
rurban@???