Re: [pcre-dev] PCRE and UTF-8

Author: Philip Hazel
Date:  
To: Ondrej Hoferek
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE and UTF-8
On Thu, 21 Aug 2008, Ondrej Hoferek wrote:

> I am using the pcre library and have a small problem with the utf-8
> encoded strings. When retrieving the information about the beginning and
> end of the matched part of utf-8 encoded subject, I get wrong numbers.
> Every symbol that is encoded in more than one byte is counted as that
> many symbols as how many bytes it is encoded in. Do you also observe
> this problem or have any idea why this happens to me? Thanks for any
> advice.


The values returned by PCRE are not counts of characters. They are byte
offsets into the subject string, so in UTF-8 mode a character encoded in
several bytes advances the offset by that many bytes. What you are seeing
is therefore correct behaviour. I will take a look at the documentation
and see if I can make this clearer.
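
If character positions are needed, a byte offset can be converted by
counting the bytes before it that are not UTF-8 continuation bytes. The
following is only a minimal sketch: utf8_chars_before is a hypothetical
helper, not a PCRE function, and it assumes the subject is valid UTF-8
and that the offset falls on a character boundary (as the offsets that
pcre_exec() places in its ovector do).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): return the number of UTF-8
   characters contained in the first byte_offset bytes of subject.
   Assumes valid UTF-8 and an offset on a character boundary. */
size_t utf8_chars_before(const char *subject, size_t byte_offset)
{
    size_t chars = 0;
    size_t i;

    assert(subject != NULL);

    for (i = 0; i < byte_offset; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
           begins a new character, so count only those. */
        if (((unsigned char)subject[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

Applying this to the start and end byte offsets of a match would give
the corresponding character positions. Each call scans the subject from
the beginning, which should be fine for short strings.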

Philip

--
Philip Hazel