[pcre-dev] Limiting the Unicode validity check to the matched-over substring?

Author: Giuseppe D'Angelo
Date:
To: pcre-dev
Subject: [pcre-dev] Limiting the Unicode validity check to the matched-over substring?

Howdy,

in a couple of projects we've found a bottleneck in the validity
check that PCRE performs by default when using Unicode.

Specifically: it looks like the check performed by PCRE spans the
entire subject string, even when a non-zero starting offset is
passed. This is a serious performance hit for us, as we have a giant
"string" (a memory-mapped file) and want to match just a small
portion of it, so we pass a suitable offset and length.

Would it be possible / easy to instead just check the portion of the
string that is going to be inspected by PCRE? We've resorted to doing
the check ourselves and passing the PCRE_NO_UTF_CHECK option to
pcre_exec, otherwise the performance is a disaster.
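
For illustration, here is a minimal sketch of what "doing the check
ourselves" can look like for the 8-bit library. The function names
utf8_range_is_valid and window_is_valid_utf8 are hypothetical, not
part of PCRE, and this is not necessarily the check our projects use.
It validates only the bytes between the starting offset and the end
of the window; it does not look at bytes before the offset, which
lookbehinds or \b may still inspect.

```c
#include <stddef.h>
#include <stdbool.h>

/* Returns true if the bytes s[0..len) are well-formed UTF-8:
   no overlong forms, no surrogates, nothing above U+10FFFF. */
static bool utf8_range_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point for this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;  /* stray continuation or invalid lead byte */

        if (len - i - 1 < extra) return false;  /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

/* Hypothetical helper: validate only subject[offset..end), i.e. the
   part a match starting at `offset` can consume. */
static bool window_is_valid_utf8(const char *subject, size_t offset, size_t end)
{
    if (offset > end) return false;
    return utf8_range_is_valid((const unsigned char *)subject + offset,
                               end - offset);
}

/* Intended use (needs <pcre.h>, so it is left as a comment):
 *
 *   if (window_is_valid_utf8(subject, offset, end))
 *       rc = pcre_exec(re, NULL, subject, (int)end, (int)offset,
 *                      PCRE_NO_UTF8_CHECK, ovector, ovecsize);
 */
```

In the 8-bit library the option is spelled PCRE_NO_UTF8_CHECK. A
window that starts in the middle of a character is rejected, because
its first byte is then a stray continuation byte.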

If this requires introducing a new option in order to keep the
existing behaviour, I think it's perfectly acceptable.

(Sure, we could adjust the subject pointer passed to PCRE and then
adjust the returned offsets back... but that's a workaround which
is not always semantically equivalent (esp. in the case of assertions
like \b), and I'm looking for a complete solution.)
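
To make the \b point concrete, here is a small stand-alone sketch
that does not use PCRE at all. boundary_at is a hypothetical helper
that mimics an ASCII-only word-boundary test; it only shows that the
answer at the same byte changes once the preceding context is cut
off by moving the subject pointer.

```c
#include <stddef.h>
#include <stdbool.h>
#include <ctype.h>

/* ASCII-only notion of a "word" character, as \w would use
   without Unicode properties. */
static bool is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if there is a \b-style boundary at position pos of s[0..len):
   exactly one of the neighbouring characters is a word character.
   Positions outside the buffer count as non-word. */
static bool boundary_at(const char *s, size_t len, size_t pos)
{
    bool before = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    bool after  = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```

With the subject "foobar", position 3 is not a boundary, because 'o'
precedes 'b'. If the pointer is advanced by 3 so that the subject
becomes "bar", position 0 is a boundary, so a pattern such as \bbar
would match in the shifted subject but not at offset 3 of the
original one.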

Thanks,
--
Giuseppe D'Angelo
