On Wed, 9 Feb 2011, ND wrote:
> > Putting something other than a match start into the offsets vector rather
> > breaks the philosophy of PCRE.
> >
> It's useful to give the main application information about the position
> when PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8 occurs. It need not
> necessarily be returned in the offsets vector. Maybe in another memory block.
> This information can help the main application to analyze and fix an
> erroneous stream.
OK, I've changed my mind and decided that the offsets vector can be
used. I also decided that if this was happening, I should do the job
*properly*. I have just committed a patch which behaves like this:

If the size of the ovector is at least 2, then, for PCRE_ERROR_BADUTF8
or PCRE_ERROR_SHORTUTF8:

  ovector[0] is set to the byte offset of the first byte of the invalid
             character
  ovector[1] is set to a reason code

There are 21 different reason codes, documented in the pcreapi man page.
They include codes for "short by n bytes" (where n is 1-5), so in fact
PCRE_ERROR_SHORTUTF8 is no longer needed. However, I have not removed
it because that would break backwards compatibility.
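
As a rough sketch of how a calling application might pick this up after
pcre_exec() returns: the helper names utf8_error_details and
report_utf8_error below are made up for illustration and are not part of
PCRE. The two error values are copied from pcre.h only so that the fragment
stands alone, and real code should include <pcre.h> instead. The meaning of
each reason code is not spelled out here, so look it up in pcreapi.

```c
#include <stdio.h>

/* Copied from pcre.h so this fragment compiles on its own (assumption:
   real code includes <pcre.h> and drops these two definitions). */
#ifndef PCRE_ERROR_BADUTF8
#define PCRE_ERROR_BADUTF8   (-10)
#endif
#ifndef PCRE_ERROR_SHORTUTF8
#define PCRE_ERROR_SHORTUTF8 (-25)
#endif

/*
 * Hypothetical helper, not a PCRE function.
 * rc       : value returned by pcre_exec()
 * ovector  : the offsets vector that was passed to that call
 * ovecsize : the size that was passed with it
 * Returns 1 and fills *offset and *reason when the UTF-8 details are
 * available, 0 otherwise.
 */
int utf8_error_details(int rc, const int *ovector, int ovecsize,
                       int *offset, int *reason)
{
    if (rc != PCRE_ERROR_BADUTF8 && rc != PCRE_ERROR_SHORTUTF8)
        return 0;              /* not a UTF-8 validity error */
    if (ovector == NULL || ovecsize < 2)
        return 0;              /* vector too small to carry the details */
    *offset = ovector[0];      /* first byte of the invalid character */
    *reason = ovector[1];      /* reason code, documented in pcreapi */
    return 1;
}

/* Hypothetical wrapper that just reports what was found. */
void report_utf8_error(int rc, const int *ovector, int ovecsize)
{
    int offset, reason;

    if (utf8_error_details(rc, ovector, ovecsize, &offset, &reason))
        fprintf(stderr, "invalid UTF-8 at byte offset %d (reason code %d)\n",
                offset, reason);
    else
        fprintf(stderr, "no UTF-8 error details available (rc = %d)\n", rc);
}
```

The application would call report_utf8_error(rc, ovector, ovecsize) with the
same vector and size it gave to pcre_exec(). Because both error codes are
handled identically, code written this way keeps working whether or not
PCRE_ERROR_SHORTUTF8 is ever returned.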
Philip
--
Philip Hazel