Author: Philip Hazel
Date:
To: Ondrej Hoferek
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE and UTF-8

On Thu, 21 Aug 2008, Ondrej Hoferek wrote:
> I am using the pcre library and have a small problem with the utf-8
> encoded strings. When retrieving the information about the beginning and
> end of the matched part of utf-8 encoded subject, I get wrong numbers.
> Every symbol that is encoded in more than one byte is counted once for
> each byte of its encoding. Do you also observe
> this problem or have any idea why this happens to me? Thanks for any
> advice.
The values returned by PCRE are not counts of characters. They are byte
offsets into the subject string. So what you are seeing is correct
behaviour. I will take a look at the documentation and see if I can make
this clearer.
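
If character positions are needed, one possible approach is to convert a byte offset by counting the bytes before it that start a character. The following is only a minimal sketch: utf8_char_offset is a hypothetical helper, not part of the PCRE API, and it assumes that the subject is valid UTF-8 and that the offset falls on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): return the number of UTF-8
   characters that precede byte_offset in subject. Assumes subject is
   valid UTF-8 and byte_offset lies on a character boundary. */
size_t utf8_char_offset(const char *subject, size_t byte_offset)
{
    size_t chars = 0;
    size_t i;

    assert(subject != NULL);

    for (i = 0; i < byte_offset; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
           starts a new character, so count only those. */
        if (((unsigned char)subject[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

After a successful pcre_exec() call, ovector[0] and ovector[1] hold the byte offsets of the start and end of the match, so passing each of them to a helper like this should give the corresponding character positions.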