Re: [pcre-dev] Limiting the Unicode validity check to the m…

Author: ph10
Date:  
To: Giuseppe D'Angelo
CC: pcre-dev
Subject: Re: [pcre-dev] Limiting the Unicode validity check to the matched-over substring?
On Sun, 16 Aug 2015, Giuseppe D'Angelo wrote:

> Specifically: it looks like the check performed by PCRE spans the
> entire subject string, even when a non-zero starting offset is
> passed. This is a serious performance hit, as we have a giant
> "string" (a memory-mapped file) and want to match just a small
> portion of it, so we pass a suitable offset and length.
>
> Would it be possible / easy to instead just check the portion of the
> string that is going to be inspected by PCRE? We've resorted to doing
> the check ourselves and passing the PCRE_NO_UTF_CHECK option to
> pcre_exec, otherwise the performance is a disaster.
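
[Illustration, not part of the original message.] The workaround described above amounts to validating only the bytes of interest and then telling PCRE to skip its own check. A minimal sketch of such a range validator follows; the function name utf8_valid_range is hypothetical, and the strictness rules (no overlong forms, no surrogates, nothing above U+10FFFF) are an assumption about what the caller wants, not a copy of PCRE's internal checker.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if bytes [start, end) of buf are
 * well-formed UTF-8, 0 otherwise. Only that range is inspected. */
int utf8_valid_range(const unsigned char *buf, size_t start, size_t end)
{
    assert(start <= end);
    size_t i = start;
    while (i < end) {
        unsigned char c = buf[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp;      /* decoded code point */

        if (c < 0x80) { i++; continue; }                     /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0)     { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return 0;  /* stray continuation, overlong C0/C1, or > F4 */

        if (end - i <= extra) return 0;  /* sequence truncated by range end */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = buf[i + k];
            if ((t & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return 0;        /* overlong 3-byte */
        if (extra == 3 && cp < 0x10000) return 0;      /* overlong 4-byte */
        if (cp > 0x10FFFF) return 0;                   /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    /* surrogate */

        i += extra + 1;
    }
    return 1;
}
```

If the range is valid, the caller would then add PCRE_NO_UTF_CHECK to the options argument of pcre_exec, as the quoted message says. Note that this sketch checks only from the given start; it does not account for any lookbehind in the pattern.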


This is something that had not occurred to me, and it seems to be a nice
idea. The maximum lookbehind for a pattern is known, so the check could
start at the offset minus the maximum lookbehind.
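
[Illustration, not part of the original message.] One way to picture "the offset minus the maximum lookbehind" is sketched below. It assumes the maximum lookbehind is expressed in characters rather than bytes (the value reported by PCRE_INFO_MAXLOOKBEHIND, or PCRE2_INFO_MAXLOOKBEHIND in PCRE2), so the start of the check has to be found by stepping backwards over whole UTF-8 characters. The function name utf8_check_start is hypothetical, and this is not how PCRE itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a byte offset into a UTF-8 subject and a
 * maximum lookbehind counted in characters, return the byte offset at
 * which a validity check would need to begin. */
size_t utf8_check_start(const unsigned char *subject, size_t length,
                        size_t start_offset, size_t max_lookbehind)
{
    assert(start_offset <= length);
    size_t pos = start_offset;

    while (max_lookbehind > 0 && pos > 0) {
        int skipped = 0;
        pos--;  /* step onto the last byte of the previous character */

        /* Back up over at most 3 continuation bytes (10xxxxxx) to reach
         * the lead byte. The cap keeps the walk bounded on malformed
         * input; the validator run afterwards will flag such data. */
        while (pos > 0 && skipped < 3 && (subject[pos] & 0xC0) == 0x80) {
            pos--;
            skipped++;
        }
        max_lookbehind--;
    }
    return pos;
}
```

With a subject of the bytes 'a', 0xC3, 0xA9, 'b', 'c' and a start offset of 3, a lookbehind of one character moves the check back to byte 1 (the start of the two-byte character), not to byte 2.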

> If this requires introducing a new option in order to keep the
> existing behaviour, I think it's perfectly acceptable.


I don't think it needs a new option, but as this is new development, I
would be tempted to do it only for PCRE2, not PCRE1.

Philip

--
Philip Hazel