[pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character…

Author: Zoltan Herczeg
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character classes.
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1419




--- Comment #9 from Zoltan Herczeg <hzmester@???> 2013-12-15 09:30:33 ---
I commented out the following lines in pcre_exec:

          if (utf)
            ACROSSCHAR(start_match < end_subject, *start_match,
              start_match++);


All regression tests pass (not surprisingly, since JIT has always worked this
way), and the performance of the first two cases is now similar even in the
interpreter (without utf check). I also tried inputs made of UTF-8 code points
that are two or three bytes long (where this optimization should do something
useful). There is a slight perf degradation, but it is still much smaller than
the cost of the optimization itself.
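For readers who do not have the macro in front of them: in 8-bit UTF mode the removed lines amount to stepping start_match over UTF-8 continuation bytes (those of the form 10xxxxxx), so that the next match attempt begins on a code point boundary. The following is only a minimal sketch of that idea, not PCRE's actual code; the function name skip_continuation_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE): advance p past any UTF-8
   continuation bytes (bit pattern 10xxxxxx) so that it ends up on the
   first byte of a code point, or on end if the subject runs out. */
static const uint8_t *skip_continuation_bytes(const uint8_t *p,
                                              const uint8_t *end)
{
  assert(p <= end);
  while (p < end && (*p & 0xc0) == 0x80)
    p++;
  return p;
}
```

With the lines commented out, the interpreter may attempt a match in the middle of a multi-byte character; such an attempt is expected to fail, which would explain the slight degradation measured on multi-byte input.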

The second one needs more time (not that much though, I think less than one
day), and it should not go into the next release because of the risk of
breaking things.

Btw, how would you optimize this with SSE:

uint8_t bitset[32]; // One bit for each character between 0..255
char* start;

// In 8 bit mode:
while (bit_is_set(*start, bitset) == FALSE)
  start++;

In 16 and 32 bit modes, you need to check whether *start is greater than 255 as
well.
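As a scalar baseline for the question above, the loop could be written out as follows. This is only a sketch under stated assumptions: bit_is_set is not defined in the message, so the bit layout used here (bit c stored at bitset[c >> 3], least significant bit first) is assumed; the end pointer and the names scan8 and scan16 are hypothetical additions; and the 16-bit variant assumes that code units above 255 never match, which a real matcher may handle differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: bit c of the 256-bit set lives in byte c >> 3,
   at bit position c & 7 (least significant bit first). */
static int bit_is_set(unsigned c, const uint8_t bitset[32])
{
  return (bitset[c >> 3] >> (c & 7)) & 1;
}

/* 8-bit mode: return the first position whose byte is in the set,
   or end if there is none. The end bound is an addition to the
   loop in the message. */
static const uint8_t *scan8(const uint8_t *start, const uint8_t *end,
                            const uint8_t bitset[32])
{
  assert(start <= end);
  while (start < end && !bit_is_set(*start, bitset))
    start++;
  return start;
}

/* 16-bit mode: code units above 255 have no bit in the set, so they
   must be tested before indexing. Assumption: they never match. */
static const uint16_t *scan16(const uint16_t *start, const uint16_t *end,
                              const uint8_t bitset[32])
{
  assert(start <= end);
  while (start < end && (*start > 255 || !bit_is_set(*start, bitset)))
    start++;
  return start;
}
```

The extra comparison in scan16 matters: the code unit 0x0178 has the same low byte as 'x' (0x78), so indexing the bitset with only the low bits would report a false hit.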

I don't plan to take your bounty, so feel free to work on this. It is funny
that they did not specify the amount of speedup they expect.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email