Re: [pcre-dev] First slot of the offset vector have a wrong …

Author: Sheri
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] First slot of the offset vector have a wrong value when PCRE_ERROR_SHORTUTF8 rises
On 5/7/2011 11:43 AM, Philip Hazel wrote:
> On Wed, 9 Feb 2011, ND wrote:
>
>>> Putting something other than a match start into the offsets vector rather
>>> breaks the philosophy of PCRE.
>>>
>> It's useful to give the main application information about the position when
>> PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8 occurs. It need not necessarily be
>> returned in the offsets vector; it could go in another memory block.
>> This information can help the main application analyze and fix an erroneous
>> stream.
> OK, I've changed my mind and decided that the offsets vector can be
> used. I also decided that if this was happening, I should do the job
> *properly*. I have just committed a patch which behaves like this:
>
> If the size of the ovector is at least 2, then, for PCRE_ERROR_BADUTF8
> or PCRE_ERROR_SHORTUTF8,
>
>    ovector[0] is set to the byte offset of the first byte of the invalid
>               character
>    ovector[1] is set to a reason code
>
> There are 21 different reason codes, documented in the pcreapi man page.
> They include codes for "short by n bytes" (where n is 1-5), so in fact
> PCRE_ERROR_SHORTUTF8 is no longer needed. However, I have not removed
> it because that would break backwards compatibility.
>
> Philip
>

I'm somewhat concerned about the possible scope of incompatibility of
this change. Might some existing applications blindly treat an execution
that filled the offset vector as a match? Also, might some applications
collect offset vectors while doing multiple match operations and
subsequently act on the accumulated vectors independently of the
executions? Keep in mind that when an application exposes regular
expressions to end users via PCRE, those users can activate UTF-8 mode
without notifying the host application by including (*UTF8) at the
beginning of the pattern. Up until now such users would have seen a
BADUTF8 behave like a non-match.
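
To make the concern concrete, here is a rough sketch of the kind of
check a host application would need so that it never treats an error
fill as a match. It is only an illustration under assumptions: the
names interpret_exec, exec_result and the SKETCH_* constants are
invented for this example, and the constants are local stand-ins for
the values I believe pcre.h uses, so the sketch compiles without
libpcre. A real program would include pcre.h and pass in the return
value of pcre_exec().

```c
#include <stddef.h>

/* Stand-ins for the pcre.h error codes (assumed values); a real
   program would #include <pcre.h> and use PCRE_ERROR_* directly. */
#define SKETCH_ERROR_NOMATCH   (-1)
#define SKETCH_ERROR_BADUTF8   (-10)
#define SKETCH_ERROR_SHORTUTF8 (-25)

typedef enum {
    RESULT_MATCH,
    RESULT_NOMATCH,
    RESULT_BAD_UTF8,
    RESULT_OTHER_ERROR
} result_kind;

/* Hypothetical summary of one execution; -1 means "not available". */
typedef struct {
    result_kind kind;
    int start;   /* match start, or byte offset of the invalid character */
    int end;     /* match end, only meaningful for RESULT_MATCH */
    int reason;  /* UTF-8 reason code, only meaningful for RESULT_BAD_UTF8 */
} exec_result;

/* Decide what the first two ovector slots mean by looking at the
   return code first, instead of assuming a filled vector is a match. */
exec_result interpret_exec(int rc, const int *ovector, int ovecsize)
{
    exec_result r = { RESULT_OTHER_ERROR, -1, -1, -1 };
    int have_pair = (ovector != NULL && ovecsize >= 2);

    if (rc >= 0) {
        /* Non-negative return: the slots hold match offsets. */
        r.kind = RESULT_MATCH;
        if (have_pair) {
            r.start = ovector[0];
            r.end = ovector[1];
        }
    } else if (rc == SKETCH_ERROR_NOMATCH) {
        r.kind = RESULT_NOMATCH;
    } else if (rc == SKETCH_ERROR_BADUTF8 || rc == SKETCH_ERROR_SHORTUTF8) {
        /* With the patch, the slots hold an offset and a reason code. */
        r.kind = RESULT_BAD_UTF8;
        if (have_pair) {
            r.start = ovector[0];
            r.reason = ovector[1];
        }
    }
    return r;
}
```

An application that stores vectors for later use could store the
exec_result (or at least the return code) alongside each vector, so
that a later pass can tell a match from an error report.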

Regards,
Sheri