[pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character…

Author: Zoltan Herczeg
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character classes.

http://bugs.exim.org/show_bug.cgi?id=1419




--- Comment #23 from Zoltan Herczeg <hzmester@???> 2014-01-05 17:13:24 ---
According to my measurements, all tests now run at the same speed.

I also made another optimization: partial UTF character decoding. The
[\x{fe000}-\x{fefff}] range gave me the idea. Only four-byte UTF-8 sequences
can match this range, so there is no need to decode shorter UTF-8
sequences.

Now there is a read_char_range() function which returns the correct
character code for any character between a minimum and a maximum value. All
other characters may be decoded to an arbitrary value, but this value _must_
be outside the min-max range. read_char_range() generates optimized
character reading code (basically we can skip several checks). The function
has a boolean update_str_ptr argument, which says that the string pointer
must be updated for all characters, not only for those that are inside the
range. Updating the string pointer can be done without decoding the
character (and it is faster as well). This is useful for negative classes,
where every character outside the range must be accepted.
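
For illustration only, here is a rough C sketch of the idea. It is not the
PCRE JIT code (which emits machine code rather than running an interpreter
like this). The names read_char_range_sketch, utf8_len and utf8_len_of are
hypothetical, the input is assumed to be valid UTF-8, and as a
simplification the pointer is advanced whenever a character is fully
decoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a UTF-8 lead byte (valid input assumed). */
static size_t utf8_len(uint8_t lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xe0) return 2;
    if (lead < 0xf0) return 3;
    return 4;
}

/* Number of UTF-8 bytes needed to encode code point c. */
static size_t utf8_len_of(uint32_t c)
{
    if (c < 0x80) return 1;
    if (c < 0x800) return 2;
    if (c < 0x10000) return 3;
    return 4;
}

/* Hypothetical sketch: returns the exact code point when it may lie in
 * [min, max]; otherwise returns some value outside that range without
 * decoding the trailing bytes. */
static uint32_t read_char_range_sketch(const uint8_t **pp, uint32_t min,
                                       uint32_t max, bool update_str_ptr)
{
    const uint8_t *p = *pp;
    size_t len = utf8_len(p[0]);
    size_t min_len = utf8_len_of(min);
    uint32_t c;

    assert(min <= max);

    if (len < min_len || len > utf8_len_of(max)) {
        /* The sequence length alone rules out a match: skip decoding. */
        if (update_str_ptr)
            *pp = p + len;
        /* len < min_len implies min >= 0x80, so 0 is out of range;
         * otherwise max < 0x10000, so max + 1 cannot overflow. */
        return (len < min_len) ? 0 : max + 1;
    }

    switch (len) {
    case 1:
        c = p[0];
        break;
    case 2:
        c = ((uint32_t)(p[0] & 0x1f) << 6) | (uint32_t)(p[1] & 0x3f);
        break;
    case 3:
        c = ((uint32_t)(p[0] & 0x0f) << 12) |
            ((uint32_t)(p[1] & 0x3f) << 6) | (uint32_t)(p[2] & 0x3f);
        break;
    default:
        c = ((uint32_t)(p[0] & 0x07) << 18) |
            ((uint32_t)(p[1] & 0x3f) << 12) |
            ((uint32_t)(p[2] & 0x3f) << 6) | (uint32_t)(p[3] & 0x3f);
        break;
    }
    *pp = p + len;
    return c;
}
```

With min = 0xfe000 and max = 0xfefff, a two-byte sequence is rejected by
looking at its lead byte only, and the pointer moves past it only when
update_str_ptr is set, which is what a negative class needs.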

This optimization mostly targets UTF-8, but UTF-16 gets some benefit as well.

