Re: [pcre-dev] First slot of the offset vector have a wrong …

Author: Philip Hazel
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] First slot of the offset vector have a wrong value when PCRE_ERROR_SHORTUTF8 rises
On Sat, 5 Feb 2011, ND wrote:

> > I wrote:
> > I think the problem may be resolved if PCRE will put a position of
> > incomplete UTF-8 character as start offset.
>
>
> I must rectify the omission: not "as start offset" but "INSTEAD start
> offset".


I'm afraid I don't understand why this is needed. The position of the
incomplete UTF-8 character is easy to find. Start at the end of the byte
string and look backwards until you find a byte that has both of the 0xc0
bits set, that is, one for which (byte & 0xc0) == 0xc0. Continuation bytes
have only the 0x80 bit of that pair set, so the byte you find is the first
byte of the incomplete UTF-8 character.
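
The backward scan described above can be sketched in C roughly as follows. This is only an illustration: the function name last_utf8_lead_offset is hypothetical and not part of the PCRE API, and it assumes the subject is available as an unsigned char buffer of known length that PCRE has already reported as ending in an incomplete character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
 * Scans backwards from the end of buf[0..len) and returns the offset of
 * the last byte whose top two bits are both set, i.e. (byte & 0xc0) == 0xc0.
 * In UTF-8 such a byte is the first byte of a multibyte character.
 * Returns -1 if the buffer contains no such byte. */
static long last_utf8_lead_offset(const unsigned char *buf, size_t len)
{
    size_t i = len;

    assert(buf != NULL || len == 0);

    while (i > 0) {
        i--;
        if ((buf[i] & 0xc0) == 0xc0)
            return (long)i;   /* lead byte found */
    }
    return -1;                /* no lead byte anywhere in the buffer */
}
```

When the string is already known to end in an incomplete character, the returned offset is the position of that character. The function does not itself check that the character is incomplete, so it is only meaningful after PCRE_ERROR_SHORTUTF8 has been returned.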

Putting something other than a match start into the offsets vector
rather breaks the philosophy of PCRE.

Philip

--
Philip Hazel