[pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character…

Author: Ben Maurer
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character classes.
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1419




--- Comment #13 from Ben Maurer <ben.maurer@???> 2013-12-15 18:25:41 ---
Just to be clear, I work for Facebook and put up the bounty. My criteria here
would be to get the 5288 ms time down an order of magnitude -- 500 ms would be
a good threshold (though I'm open to hearing any reasons that number isn't
possible). Even if you're not interested in the bounty, we'd be happy to find
something to do with it after these improvements are made.

As to how to optimize that loop with SSE, the SSE 4.2 instructions can't easily
do the check you listed on an arbitrary bitmask. However, they can handle masks
with:

- Up to 16 individual characters set
- Up to 8 contiguous ranges of characters
- The negation of either of the above.

This would handle many common regexes. For example, [a-zA-Z0-9] would use the
ranges comparison mode with three ranges. http://halobates.de/pcmpstr-js/pcmp.html
has an example of writing strspn with SSE 4.2 instructions.
http://repo.or.cz/w/glibc.git/blob?f=sysdeps/x86_64/multiarch/strspn-c.c has
the actual glibc implementation of strspn.
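
To make the ranges case concrete, here is a minimal sketch of scanning a run of
[a-zA-Z0-9] bytes 16 at a time with the explicit-length SSE 4.2 string compare.
This is not PCRE or glibc code. The function names (alnum_span,
alnum_span_sse42, alnum_span_scalar) and the HAVE_X86_SSE42_PATH macro are made
up for illustration, and the GCC/Clang target attribute and
__builtin_cpu_supports are assumed to be available. It only handles single
bytes, so it does not address UTF-8 code points above 127.

```c
#include <stddef.h>
#include <string.h>

/* Scalar reference: length of the leading run of [a-zA-Z0-9] bytes. */
static size_t alnum_span_scalar(const char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = (unsigned char)s[i];
        int ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                 (c >= '0' && c <= '9');
        if (!ok)
            break;
        i++;
    }
    return i;
}

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_SSE42_PATH 1   /* hypothetical macro, for this sketch only */
#include <nmmintrin.h>

/* SSE 4.2 version: one PCMPESTRI per 16-byte chunk. */
__attribute__((target("sse4.2")))
static size_t alnum_span_sse42(const char *s, size_t len)
{
    /* (low, high) pairs; three of the eight available pairs are used. */
    static const char pairs[16] = { 'a', 'z', 'A', 'Z', '0', '9' };
    const __m128i ranges = _mm_loadu_si128((const __m128i *)pairs);
    size_t i = 0;

    while (i < len) {
        size_t n = (len - i < 16) ? len - i : 16;
        __m128i chunk;

        if (n == 16) {
            chunk = _mm_loadu_si128((const __m128i *)(s + i));
        } else {
            /* Copy the tail so we never read past the end of the input. */
            char buf[16] = { 0 };
            memcpy(buf, s + i, n);
            chunk = _mm_loadu_si128((const __m128i *)buf);
        }

        /* Negative polarity: index of the first byte NOT in any range. */
        int idx = _mm_cmpestri(ranges, 6, chunk, (int)n,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES |
                               _SIDD_NEGATIVE_POLARITY |
                               _SIDD_LEAST_SIGNIFICANT);
        if ((size_t)idx < n)
            return i + (size_t)idx;
        i += n;   /* every valid byte in this chunk matched */
    }
    return len;
}
#endif

/* Pick the SSE 4.2 path when the CPU has it, else fall back to scalar. */
size_t alnum_span(const char *s, size_t len)
{
#ifdef HAVE_X86_SSE42_PATH
    if (__builtin_cpu_supports("sse4.2"))
        return alnum_span_sse42(s, len);
#endif
    return alnum_span_scalar(s, len);
}
```

Calling alnum_span("abc123 def", 10) stops at the space and returns 6. A class
given as up to 16 individual characters would use _SIDD_CMP_EQUAL_ANY in place
of _SIDD_CMP_RANGES, and dropping the negative-polarity flag finds the first
byte that is in the set instead of the first one that is not.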


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email