[pcre-dev] [Bug 1437] Using PCRE-8.34 on x86-64 Linux with …

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1437] Using PCRE-8.34 on x86-64 Linux with --enable-jit and --enable-utf , grep -iP '^S' gets stuck on a binary file consuming a lot of CPU for many seconds

http://bugs.exim.org/show_bug.cgi?id=1437




--- Comment #12 from Philip Hazel <ph10@???> 2014-01-24 17:13:56 ---
On Fri, 24 Jan 2014, Zoltan Herczeg wrote:

> Thanks. Now I have the answer. Grep puts a \x0a before the start of the string,
> and your string starts with an invalid utf-8 code, 0xbf. You can fully
> reproduce this issue on the following input:
>
> \x0a\xbf#


Can you reproduce this in pcretest? I don't seem to be able to do that.

> It is a good question what to do here, I would like to hear others' opinions
> (especially Philip). I think the best thing would be to switch back to ASCII
> mode in grep, if the input is not a valid UTF string.


Note that pcregrep runs in ASCII mode by default; you have to use the -u
option to make it use UTF-8. If you do that on the 1.dat file, it reports
error -10 (bad UTF-8 string).

I agree with your conclusion, so I guess this is really a grep issue,
not a PCRE issue. It should not be setting PCRE_NO_UTF_CHECK if it
cannot guarantee valid UTF-8 data.

Philip
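
[Editorial sketch, not part of the original message.] The conclusion above
amounts to: validate the subject first, and skip PCRE's own UTF check only
when validity is already known. The following is a minimal illustration in
Python rather than a patch to grep or PCRE. The function names
is_valid_utf8 and choose_mode are hypothetical, and Python's strict UTF-8
decoder is used here as a stand-in for PCRE's check; the two may differ in
edge cases.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (stand-in for PCRE's check)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def choose_mode(data: bytes) -> str:
    """Hypothetical policy: use UTF mode only for valid input, else fall back.

    "utf"   -> input is known valid, so skipping a second check would be safe
    "ascii" -> input is not valid UTF-8, so treat it as plain bytes
    """
    return "utf" if is_valid_utf8(data) else "ascii"


# The reproducer from the quoted comment: newline, stray 0xbf, then '#'.
sample = b"\x0a\xbf#"
print(is_valid_utf8(sample))   # 0xbf is a continuation byte with no lead byte
print(choose_mode(sample))
print(choose_mode(b"Some text\n"))
```

With this policy the three-byte reproducer would be handled in byte mode,
which is the fallback suggested in the quoted comment.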

