Author: Giuseppe D'Angelo
Date:
To: pcre-dev
Subject: [pcre-dev] Limiting the Unicode validity check to the matched-over substring?

Howdy,
in a couple of projects we've found a bottleneck in the validity
check that PCRE performs by default when using Unicode.
Specifically: it looks like the check spans the entire subject
string, even when a non-zero starting offset is passed. This is a
serious performance hit for us, as we have a giant "string" (a
memory-mapped file) and want to match just a small portion of it, so
we pass a suitable offset and length.
Would it be possible / easy to instead just check the portion of the
string that is going to be inspected by PCRE? We've resorted to doing
the check ourselves and passing the PCRE_NO_UTF_CHECK option to
pcre_exec, otherwise the performance is a disaster.
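
For reference, the "check it ourselves" step can be sketched roughly as below. This is only a minimal illustration, not our actual code: utf8_range_is_valid is a hypothetical helper (it is not part of PCRE) that validates just the bytes in [start, end) and never touches the rest of the mapping. It assumes the range begins on a character boundary, and it does not cover any bytes before the start offset that a lookbehind or \b might inspect, so the range may need to begin a little earlier for such patterns.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   Returns 1 if the bytes s[start..end) are well-formed UTF-8, 0 otherwise.
   Bytes outside the range are never read. */
int utf8_range_is_valid(const unsigned char *s, size_t start, size_t end)
{
    size_t i = start;

    while (i < end) {
        unsigned char c = s[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal for this length */

        if (c < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (end - i - 1 < need) return 0;  /* sequence runs past the range */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */

        i += need + 1;
    }
    return 1;
}
```

Once the range has passed this check, the match is run over the original subject with PCRE's own check disabled, i.e. something along the lines of pcre_exec(re, extra, subject, length, start_offset, PCRE_NO_UTF_CHECK, ovector, ovecsize).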
If this requires introducing a new option in order to keep the
existing behaviour, I think it's perfectly acceptable.
(Sure, we could adjust the subject pointer passed to PCRE and
readjust the returned offsets afterwards... but that's a workaround
that is not always semantically equivalent (esp. with constructs
like \b, which look at the character before the starting point),
and I'm looking for a complete solution.)
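
To make the \b point concrete, here is a toy, ASCII-only word-boundary test. It is purely illustrative and is not how PCRE implements \b; at_word_boundary and is_word_char are made-up names. The point is only that the answer depends on whether the character before the starting position is visible.

```c
#include <stddef.h>
#include <ctype.h>

/* Toy ASCII-only notion of a "word" character (letters, digits, '_'). */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Toy \b test: true when exactly one of the characters around pos is a
   word character. Positions outside [0, len) count as non-word. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int before = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    int after  = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```

With the subject "foobar" and a starting offset of 3, at_word_boundary("foobar", 6, 3) is false, because the 'o' before the offset is visible. With the adjusted pointer, at_word_boundary("foobar" + 3, 3, 0) is true, because the preceding 'o' has been cut away, so a pattern such as \bbar would match in one case and not in the other.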