[pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character…

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character classes.

http://bugs.exim.org/show_bug.cgi?id=1419




--- Comment #1 from Philip Hazel <ph10@???> 2013-12-14 16:48:31 ---
I am not surprised there is a difference, but it does seem excessive. I have
just tried a quick experiment using the timing facilities in pcretest to match
[w-z] and [\x{fe000}-\x{fefff}] against a 300K data string that contained no
relevant characters (so the result was no match). I am on an ancient x86 box
and the compiler optimization was turned off. For [w-z] the time was 28ms and
for the other case it was 35ms. I know this is crude, but it shows nothing like
the difference you report. (Studying the pattern speeds things up, as of course
does using JIT.)

For code points < 256, a class uses a bit map, whereas for your other pattern
it will be comparing character values with the range requested. The interesting
question is, what is different in your environment and mine? (I used [w-z]
rather than [b-z] because my file had no instances of b-z, but as that uses a
bit map, it shouldn't make any difference.)
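
To illustrate the two kinds of class test described above, here is a minimal
sketch in C. It is not PCRE's actual code: the names class_bitmap,
bitmap_add_range, bitmap_has and range_has are hypothetical, and the layout
(32 bytes, one bit per code point below 256) is an assumption chosen for
clarity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 256-bit map: one bit for each code point below 256. */
typedef struct {
    uint8_t bits[32];
} class_bitmap;

/* Set the bits for every code point in [lo, hi] that is below 256. */
void bitmap_add_range(class_bitmap *m, unsigned lo, unsigned hi)
{
    for (unsigned c = lo; c <= hi && c < 256; c++)
        m->bits[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Class test for small code points: one index and one mask. */
int bitmap_has(const class_bitmap *m, uint32_t c)
{
    return c < 256 && (m->bits[c >> 3] & (1u << (c & 7))) != 0;
}

/* Class test for a wide range: compare the character value
   against both ends of the requested range. */
int range_has(uint32_t c, uint32_t lo, uint32_t hi)
{
    return c >= lo && c <= hi;
}
```

In this sketch, [w-z] would be a zeroed class_bitmap with
bitmap_add_range(&m, 'w', 'z') applied, and [\x{fe000}-\x{fefff}] would be
range_has(c, 0xFE000, 0xFEFFF). Both tests cost only a few instructions per
character, which fits with the small difference in the timings above.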

Were you scanning one long string, or a number of strings from a file? Did you
find a match or not? My data string consisted entirely of ASCII characters,
which are of course quicker to load than multi-byte characters with code
points > 255.

(And there's always the possibility that I've not been running what I thought I
was running...)
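
As a rough illustration of why ASCII data is cheaper to load than multi-byte
data, here is a sketch in C of fetching one character from a UTF-8 string. It
is an assumption-laden example, not PCRE's own character-fetching code: the
function name utf8_decode is hypothetical, and it assumes the input is
already valid UTF-8 (no error checking).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from valid UTF-8 at p, store its code point in
   *out, and return the number of bytes consumed (1 to 4). */
size_t utf8_decode(const unsigned char *p, uint32_t *out)
{
    if (p[0] < 0x80) {                 /* ASCII: a single load, no shifts */
        *out = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {       /* two-byte sequence */
        *out = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {       /* three-byte sequence */
        *out = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6)
             | (p[2] & 0x3Fu);
        return 3;
    }
    /* four-byte sequence, e.g. U+FE000 encoded as F3 BE 80 80 */
    *out = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
         | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
    return 4;
}
```

An all-ASCII subject always takes the first branch, whereas data in the
U+FE000 to U+FEFFF range takes the last one, with four loads and several
shifts and masks for each character.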

