[pcre-dev] [Bug 633] pcre_get_substring*: Problems with UTF8

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 633] pcre_get_substring*: Problems with UTF8
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=633




--- Comment #3 from Philip Hazel <ph10@???> 2007-11-21 16:05:22 ---

> haystack = "some umlauts äöüß and more";
>
>
> *compiled_regexp = pcre_compile(regexp,      /* the pattern */


What was the pattern you were using? I tried matching that string using the
pattern /(.*)/ in pcretest, and it worked fine.

> When extracting to match from "haystack" with pcre_get_substring the result is
> truncated and misses one char for each umlaut.


pcre_get_substring returned 30 (the length in bytes, not characters), and all
30 bytes were returned. The test I used was

pcretest zz zzz

where zz contained these lines

 /(.*)/8
    some umlauts \x{e4}\x{f6}\x{fc}\x{df} and more\G1


The output in the zzz file was as expected. It doesn't show well on my screen
because of the top-bit characters. This has exposed a bug in pcretest; it
should be showing those characters as escapes. I will fix that.
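
As a rough illustration of the byte count mentioned above, here is a minimal
sketch that involves no PCRE calls at all. It assumes the haystack is encoded
as UTF-8, so each of the four non-ASCII characters occupies two bytes. The
names HAYSTACK and utf8_char_count are hypothetical helpers made up for this
example; they are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The subject string from the report, written with explicit UTF-8 byte
   escapes so that the source file's own encoding does not matter.
   (HAYSTACK is a name chosen for this sketch only.) */
static const char HAYSTACK[] =
    "some umlauts \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f and more";

/* Count UTF-8 characters by skipping continuation bytes, which always
   have the bit pattern 10xxxxxx. Assumes well-formed UTF-8 input. */
static size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With these definitions, strlen(HAYSTACK) gives 30 while
utf8_char_count(HAYSTACK) gives 26. A caller that compares a byte length
against a character count, or displays the bytes in a non-UTF-8 terminal,
could easily conclude that the substring is wrong when it is not.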

I am afraid that I am not convinced that this is a bug.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email