Re: [pcre-dev] A clarification upon PCRE_NO_UTF(8|16)

Author: Zoltán Herczeg
Date:
To: Giuseppe D'Angelo
CC: pcre-dev
Subject: Re: [pcre-dev] A clarification upon PCRE_NO_UTF(8|16)_CHECK

Hi Giuseppe,

To answer your first question: the whole subject string is checked if PCRE_NO_UTF8_CHECK is not passed. This can be really costly if the subject is long (several Mbytes). The check exists because malformed strings can cause PCRE to read bytes before the beginning or after the end of the subject string, which can lead to crashes. The reason for this behaviour is speed: the matcher omits some checks, which makes PCRE faster, but it then does not work safely with malformed input.

So your guess is right: if the subject string does not change, PCRE_NO_UTF8_CHECK can be passed to subsequent calls (regardless of whether the pattern changes, whether it matches, and so on). Moreover, if QString guarantees that it always contains a valid UTF-16 string, you can pass PCRE_NO_UTF16_CHECK all the time. Just check that the starting offset also points to the beginning of a valid UTF-16 character. This is easy: check that the memory location does not point to the second half of a surrogate pair (a low surrogate).
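
A minimal sketch of that surrogate check, assuming the subject is already known to be valid UTF-16 and is stored as an array of 16-bit units. The helper name utf16_offset_is_char_start is hypothetical and not part of PCRE; it only tests whether the unit at the starting offset falls in the low-surrogate range 0xDC00..0xDFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE.
   Returns true if `offset` (counted in 16-bit units) is a safe starting
   offset into `subject`, which is ASSUMED to be valid UTF-16 already.
   `length` is the number of 16-bit units in the subject. */
static bool utf16_offset_is_char_start(const uint16_t *subject,
                                       size_t length, size_t offset)
{
    assert(subject != NULL);

    if (offset > length)
        return false;          /* past the end of the subject */
    if (offset == length)
        return true;           /* end of string: nothing to split */

    /* Low (trailing) surrogates are 0xDC00..0xDFFF, i.e. the top six
       bits are 110111. Starting there would split a surrogate pair. */
    return (subject[offset] & 0xFC00u) != 0xDC00u;
}
```

If the helper returns true for the starting offset, the caller could reasonably pass PCRE_NO_UTF16_CHECK to the 16-bit matching function; if it returns false, the offset should be adjusted first.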

Btw, I heard that the Qt5 alpha has been released and contains a PCRE-based QRegularExpression. Really nice job!

Regards,
Zoltan

"Giuseppe D'Angelo" <dangelog@???> wrote:
> From the docs it's not 100% clear that if PCRE_NO_UTF8_CHECK (or
> PCRE_NO_UTF16_CHECK) is NOT passed to pcre_exec then the *full* subject
> string undergoes the check. Is that the case, or is the check actually
> done in "chunks" or something like that?
>
> I'm thinking about emulating //g: if PCRE checks for the validity of
> the *whole* subject string, then subsequent calls to pcre_exec may
> safely omit this check by passing PCRE_NO_UTF8_CHECK.
>
> Thanks, and have a happy week-end.
> --
> Giuseppe D'Angelo
>
> --
> ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
