Re: [pcre-dev] [Bug 1208] Case folding in PCRE

Author: jcd
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] [Bug 1208] Case folding in PCRE
It seems my first attempt to send this didn't go through. Apologies if
it shows up twice.



Dear list,

>To tell you the truth this issue has not been raised since I am here, so I
>never estimated the required efforts. Is there any example implementation
>for it (for UTF caseless compare)?


A bit of background first:

I'm a long-term user of SQLite (a small but powerful embedded DB
engine) which, by default, doesn't come with Unicode folding, collation
beyond "lower ASCII" and such. You can use the ICU library, since
SQLite has built-in hooks for it as an extension, but ICU is a huge
(circa 18 MB) and slow baby.

I needed to handle a number of languages in the same DB and decided to
write my own extension after looking at what was freely available as
open source. I used existing code as a basis, but changed most of it
(including the tries) to fix many bugs and tailor it to wider needs.

So I came up with a decently small (~180 KB) extension in C which has
its own Unicode tries for folding and casing. It uses the Unicode v5.1
specs, to which I added unofficial support for the German eszett and a
couple of other codepoints.
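
To give an idea of what a trie-based caseless compare looks like, here
is a minimal sketch. It is not the code from my extension: the names
(fold_cp, next_cp, utf8_casecmp), the block size and the tables are all
made up for illustration. The tables cover only U+0000..U+00FF with
simple (one-to-one) folding, and the decoder assumes valid,
NUL-terminated UTF-8. A real table would be generated from the Unicode
data files and would span the whole code space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-stage table (the simplest form of a trie).
   Stage 1 maps the high bits of a code point to a block number,
   stage 2 holds the delta to add to reach the folded code point.
   Demo coverage only: U+0000..U+00FF, simple folding. */
#define BLOCK_BITS 5
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define MAX_CP     0x100u

static const int16_t stage2[3][BLOCK_SIZE] = {
    /* block 0: nothing to fold (shared by all "boring" ranges) */
    { 0 },
    /* block 1: U+0040..U+005F, A-Z fold to a-z (+32) */
    {  0, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
      32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,  0,  0,  0,  0,  0 },
    /* block 2: U+00C0..U+00DF; U+00D7 (multiplication sign) and
       U+00DF (eszett) have no one-to-one fold, so their delta is 0 */
    { 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
      32, 32, 32, 32, 32, 32, 32,  0, 32, 32, 32, 32, 32, 32, 32,  0 },
};

/* One entry per 32-code-point block of U+0000..U+00FF. */
static const uint8_t stage1[MAX_CP >> BLOCK_BITS] = { 0, 0, 1, 0, 0, 0, 2, 0 };

/* Fold one code point; anything outside the demo tables is unchanged. */
static uint32_t fold_cp(uint32_t cp)
{
    if (cp >= MAX_CP)
        return cp;
    int32_t delta = stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
    return (uint32_t)((int32_t)cp + delta);
}

/* Decode one code point and advance *p. Assumes valid UTF-8: no
   checks for overlong forms, surrogates or stray continuation bytes. */
static uint32_t next_cp(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp = *s++;
    int extra = 0;

    if (cp >= 0xF0)      { cp &= 0x07; extra = 3; }
    else if (cp >= 0xE0) { cp &= 0x0F; extra = 2; }
    else if (cp >= 0xC0) { cp &= 0x1F; extra = 1; }

    /* Stops early on a NUL or any non-continuation byte. */
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);

    *p = s;
    return cp;
}

/* Caseless compare of two NUL-terminated UTF-8 strings.
   Returns <0, 0 or >0, ordering by folded code point value. */
int utf8_casecmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = fold_cp(next_cp(&pa));
        uint32_t cb = fold_cp(next_cp(&pb));
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}
```

With these tables, utf8_casecmp("HELLO", "hello") returns 0, and so
does comparing the UTF-8 encodings of "Été" and "éTÉ". Sharing one
all-zero block between every range that has nothing to fold is what
keeps such tables small.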

I can't vouch it will do everything perfectly, but feedback from users
around the globe shows it isn't that far off the mark.

My requirements went beyond what ICU offers: for instance, ICU collation
support requires that you choose a single, specific locale for a given
comparison. But in the case of (say) a customer DB table, I have
people from 38 countries, using spellings/letters from various
languages. Choosing a specific locale in this context is
meaningless. I simply relied on (Windows) system calls to handle
locale-independent compares. I also included a fuzzy compare and a
number of other functions.

There is much code in there that can be removed for use alongside PCRE,
so the final result would be even smaller and much simpler.

If anyone wants to have a look, the source (and a Windows x86 DLL
build) is freely downloadable at
http://dl.dropbox.com/u/26433628/unifuzz.zip

The source code includes a long comment section, mostly about how to
use the SQLite extension functions it offers. I never had the need to
try compiling for a 64-bit OS, but I don't believe there would be
significant issues doing so.

I'm currently not in a position to do much tech work, but I will be
glad to help with pruning/adapting the code if needed. Compiling and
testing will be much harder for me at the moment.

Of course, the code comes without any warranty.