Re: [pcre-dev] PCRE and UTF-8

Author: Ondrej Hoferek
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] PCRE and UTF-8
OK, thanks for the explanation.

Ondrej

-----Original Message-----
From: Philip Hazel [mailto:ph10@hermes.cam.ac.uk]
Sent: Thursday, August 21, 2008 8:09 PM
To: Ondrej Hoferek
Cc: pcre-dev@???
Subject: Re: [pcre-dev] PCRE and UTF-8

On Thu, 21 Aug 2008, Ondrej Hoferek wrote:

> I am using the pcre library and have a small problem with the utf-8
> encoded strings. When retrieving the information about the beginning
> and end of the matched part of utf-8 encoded subject, I get wrong
> numbers. Every symbol that is encoded in more than one byte is counted
> as that many symbols as how many bytes it is encoded in. Do you also
> observe this problem or have any idea why this happens to me? Thanks
> for any advice.


The values returned by PCRE are not counts of characters. They are byte
offsets into the subject string. So what you are seeing is correct
behaviour. I will take a look at the documentation and see if I can make
this more clear.

Philip

--
Philip Hazel
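
For anyone who needs character positions rather than byte offsets, one
possible approach is to count the UTF-8 lead bytes that come before the
offset. The following is only a minimal sketch, not part of PCRE: the
helper name utf8_chars_before is hypothetical, and it assumes the subject
is valid UTF-8 and that the offset falls on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
 * Returns the number of UTF-8 characters contained in the first
 * byte_offset bytes of subject.
 * Assumes subject is valid UTF-8 and byte_offset lies on a character
 * boundary. */
static size_t utf8_chars_before(const char *subject, size_t byte_offset)
{
    size_t chars = 0;

    assert(subject != NULL);

    for (size_t i = 0; i < byte_offset; i++) {
        /* Continuation bytes have the form 10xxxxxx. Every other byte
         * begins a new character, so count only those. */
        if (((unsigned char)subject[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

After a successful pcre_exec() call, ovector[0] and ovector[1] hold the
byte offsets of the start and end of the match, so passing each of them
to a helper like this one would give the corresponding character
positions. The scan is linear in the offset, which should be acceptable
for occasional conversions but may be worth caching for long subjects.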