Author: Philip Hazel Date: To: ND CC: Pcre-dev Subject: Re: [pcre-dev] Partial matching of UTF8 multibyte chars at the end
of segment
On Sat, 23 Oct 2010, ND wrote:
> A UTF8 data stream may arrive in segments whose final bytes are the
> leading bytes of a UTF8 multibyte character, but not the whole character.
> The remaining bytes of the character will come with the next segment. This
> occurs, for example, when UTF8 text is transmitted by a byte-oriented
> protocol like TCP.
> Now PCRE returns PCRE_ERROR_BADUTF8 in this situation. But IMHO it would
> be right to return PCRE_ERROR_PARTIAL.
PCRE does the check for a valid UTF-8 string before it starts the match.
I suppose, if PCRE_PARTIAL_HARD is set, it could adjust the end of the
subject string to leave off any partial UTF-8 character. Then you'd get
a normal partial match if the pattern partially matched the rest.
However, the returned string would not include the partial UTF-8
character.
This is, of course, something that the application could do for itself
easily enough. I am not sure if it should be part of PCRE.
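
As a rough illustration of what "doing it for itself" might look like on
the application side, here is a minimal sketch in C. The helper name
utf8_complete_length is hypothetical and is not part of PCRE. It assumes
characters of at most 4 bytes (as in RFC 3629) and only inspects the tail
of the buffer. Full validation is still left to PCRE's own check.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns the length of the longest
   prefix of buf[0..len) that does not end in the middle of a UTF-8
   character. Assumes characters are at most 4 bytes long. */
static size_t utf8_complete_length(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;   /* no lead byte found; leave it to the UTF-8 check */

    unsigned char lead = buf[i - 1];
    size_t need;      /* total bytes required by the last character */

    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return len;  /* not a valid lead byte; let the UTF-8 check report it */

    size_t have = len - (i - 1);   /* bytes present, counting the lead byte */

    /* Drop the last character only if it is incomplete. */
    return (have < need) ? i - 1 : len;
}
```

The application would then pass the returned length, rather than the full
segment length, as the subject length to pcre_exec() with
PCRE_PARTIAL_HARD set, and prepend the leftover bytes to the next segment
before matching again.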