[pcre-dev] [Bug 1468] Segmentation fault for a text starting with 0x80

Author: Zoltan Herczeg
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1468] Segmentation fault for a text starting with 0x80 - 0xbf in UTF-8 mode

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1468

--- Comment #12 from Zoltan Herczeg <hzmester@???> 2014-04-23 07:50:13 ---
(In reply to comment #11)
> Zoltan, thanks for the advice.
> I keep what you said in my mind.
>

I think this is especially useful for programs like grep.

From time to time we get feedback that matching a simple pattern like /a/ is
very slow with PCRE, but people tend to forget to mention that it is a UTF match
without PCRE_NO_UTF8_CHECK and that the input size is 10 MBytes. Scanning
10 MBytes of data takes a long time. Sometimes they want to find all matches. If
the input contains 10000 'a' letters, the UTF check will take 10000 times
longer, because it runs again for every match. (The UTF check scans the whole
input regardless of the starting position!)

Hence we strongly suggest that the application take care of validating UTF
itself. The feature was added for simple applications, but a serious tool
optimized for performance (like grep) should do it by itself in the most
efficient way. For example, matching could be accelerated by doing the UTF
validation in a separate thread (multicore is widespread; even phones have 4-8
cores now).
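
To illustrate the "validate once in the application" idea, here is a minimal
sketch of a standalone UTF-8 check. It is not PCRE's own validator; the
function name utf8_is_valid is made up for this example, and it simply follows
the well-formedness rules of RFC 3629 (no stray continuation bytes such as
0x80 - 0xbf at the start, no overlong forms, no surrogates, nothing above
U+10FFFF).

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if s[0..len) is well-formed UTF-8
 * according to RFC 3629, otherwise 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* number of continuation bytes expected */
        size_t k;
        unsigned long min;  /* smallest code point allowed for this length */
        unsigned long cp;   /* decoded code point */

        if (c < 0x80) {                       /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {  /* 2-byte sequence */
            need = 1; min = 0x80; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {      /* 3-byte sequence */
            need = 2; min = 0x800; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {  /* 4-byte sequence */
            need = 3; min = 0x10000; cp = c & 0x07;
        } else {
            /* 0x80-0xBF (stray continuation), 0xC0, 0xC1, 0xF5-0xFF */
            return 0;
        }

        if (len - i - 1 < need)
            return 0;                         /* truncated sequence */

        for (k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;                     /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return 0;                         /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;                         /* beyond Unicode range */

        i += need + 1;
    }
    return 1;
}
```

If such a check succeeds for a buffer, the application could pass
PCRE_NO_UTF8_CHECK to every later pcre_exec() call on that buffer, so the
subject is scanned once instead of once per match.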

By the way, I would be curious about the performance drop of grep after the
default UTF check is enabled. I saw in the code that grep matches one line
at a time, so perhaps it is not a big overhead. But who knows. Another question
is: what does grep do when it matches UTF data without PCRE?

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
