Re: [pcre-dev] Partial matching of UTF8 multibyte chars at the end of segment

Author: Philip Hazel
Date:
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Partial matching of UTF8 multibyte chars at the end of segment

On Sat, 23 Oct 2010, ND wrote:

> A UTF8 data stream may arrive in segments whose final bytes are the
> leading bytes of a UTF8 multibyte character, but not a complete
> character. The remaining bytes of the character arrive with the next
> segment. This occurs, for example, when UTF8 text is transmitted over a
> byte-oriented protocol like TCP.
> At present PCRE returns PCRE_ERROR_BADUTF8 in this situation, but IMHO it
> would be right to return PCRE_ERROR_PARTIAL.

PCRE does the check for a valid UTF-8 string before it starts the match.
I suppose, if PCRE_PARTIAL_HARD is set, it could adjust the end of the
subject string to leave off any partial UTF-8 character. Then you'd get
a normal partial match if the pattern partially matched the rest.
However, the returned string would not include the partial UTF-8
character.

This is, of course, something that the application could do for itself
easily enough. I am not sure if it should be part of PCRE.
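
For illustration, here is a minimal sketch of how an application might
do that trimming itself. The function name utf8_complete_prefix is
made up for this example and is not part of PCRE. It only looks at the
last few bytes of the segment and does not validate the string as a
whole; anything that is not a plainly truncated character is left in
place so that PCRE's own UTF-8 check can report it.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Returns the length of the longest prefix of buf[0..len) that does not
   end in the middle of a UTF-8 character. Only the tail is inspected. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;
    size_t need;
    unsigned char lead;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;              /* no lead byte found: leave unchanged */

    /* Work out how many bytes the final character claims to need. */
    lead = buf[i - 1];
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else
        return len;              /* not a valid lead byte: leave unchanged */

    /* back + 1 bytes are present; drop them if the character is cut short. */
    return (back + 1 < need) ? i - 1 : len;
}
```

The application would then pass the returned length to pcre_exec()
instead of the full segment length, and prepend the bytes it held back
to the next segment when that arrives.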

Philip

--
Philip Hazel
