Re: [pcre-dev] First slot of the offset vector have a wrong …

Author: Philip Hazel
Date:  
To: Sheri
CC: pcre-dev
Subject: Re: [pcre-dev] First slot of the offset vector have a wrong value when PCRE_ERROR_SHORTUTF8 rises
On Mon, 9 May 2011, Sheri wrote:

> I'm somewhat concerned about the possible scope of incompatibility of this
> change. Might some existing applications blindly treat an execution that filled
> the offset vector as a match?


That seems a rather strange thing to do. The API explicitly states that
the return code PCRE_ERROR_BADUTF8 means that the subject string is bad.
I suppose I could explicitly add "No matching takes place", but
surely this is rather obvious? Previously, the data in the offset vector
in this case was unchanged by PCRE. Any application that was using it
would have had very strange results indeed.
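
To illustrate the point, here is a minimal sketch of a caller that consults
the return code before touching the offset vector. It assumes the result of
pcre_exec() is passed in as rc; match_span() is a hypothetical helper, not
part of PCRE, and the error values are mirrored from pcre.h only so that the
sketch compiles on its own.

```c
#include <stddef.h>

/* Assumption: values mirrored from pcre.h so the sketch compiles without
   the library. A real program would include <pcre.h> instead. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH   (-1)
#define PCRE_ERROR_BADUTF8   (-10)
#define PCRE_ERROR_SHORTUTF8 (-25)
#endif

/* Hypothetical helper: rc is the value returned by pcre_exec().
   Returns 1 and fills *start and *end only when rc reports a match;
   returns 0 (leaving *start and *end untouched) for any negative rc,
   whatever the offset vector happens to contain. */
static int match_span(int rc, const int *ovector, int ovecsize,
                      int *start, int *end)
{
    if (rc < 0)
        return 0;            /* no match or error: ovector is not a match */
    if (ovector == NULL || ovecsize < 2)
        return 0;            /* no room for the first pair of offsets */
    *start = ovector[0];     /* offset of the first matched character */
    *end = ovector[1];       /* offset just past the end of the match */
    return 1;
}
```

A caller written this way is unaffected by whatever is stored in the vector
when PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8 is returned.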

> Also, might some applications collect offset vectors while doing
> multiple match operations and subsequently act on the accumulated
> vectors independent from the executions?


How do they know the vectors have changed? Do these applications seed the
vectors first?

> Keep in mind that for users of applications that extend regular
> expressions to end users via PCRE, end users can activate utf8 without
> notifying the host application by including (*UTF8) at the beginning
> of the pattern. Up till now such users would have seen a BADUTF8 behave
> like a non-match.


They should still see BADUTF8. The return codes have *not* changed; all
that has changed is that, when these return codes are given, the offset
vector (which previously was not set) has some information in it -
information that was requested by a user.
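
For a caller that does want the new information, a sketch along these lines
may help. It assumes, following my reading of the pcre_exec() documentation,
that for these two return codes the first slot holds the byte offset of the
offending character when the vector has at least two elements.
utf8_error_offset() is a hypothetical helper, and the constants are mirrored
from pcre.h only to keep the sketch self-contained.

```c
#include <stddef.h>

/* Assumption: values mirrored from pcre.h so the sketch compiles without
   the library. A real program would include <pcre.h> instead. */
#ifndef PCRE_ERROR_BADUTF8
#define PCRE_ERROR_BADUTF8   (-10)
#define PCRE_ERROR_SHORTUTF8 (-25)
#endif

/* Hypothetical helper: rc is the value returned by pcre_exec().
   Returns the byte offset stored in ovector[0] when rc is one of the
   two UTF-8 error codes and the vector has room for it; returns -1 in
   every other case, including successful matches. */
static int utf8_error_offset(int rc, const int *ovector, int ovecsize)
{
    if (rc != PCRE_ERROR_BADUTF8 && rc != PCRE_ERROR_SHORTUTF8)
        return -1;           /* not a UTF-8 error: nothing to report */
    if (ovector == NULL || ovecsize < 2)
        return -1;           /* vector too small to carry the offset */
    return ovector[0];       /* assumed: offset of the bad character */
}
```

The return code alone still decides how the result is classified; the vector
is read only after the code has identified a UTF-8 error.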

Thanks for taking note and considering this. As you see, I don't think
your worry is justified, but nevertheless it is good that people
consider all the issues.

Philip

--
Philip Hazel