Hi,
On your homepage it said that I could contact the active PCRE developers via this mail address; I hope this is alright.
I've been using PCRE in a project of mine, which I have recently started internationalizing.
As far as I could understand, the way to use UTF8 strings with PCRE is to pass the PCRE_UTF8 option to pcre_compile.
Now, while I understand that passing this option flag to pcre_compile causes invalid UTF8 strings to fail compilation, it seems that I can still use UTF8 strings without passing this option to pcre_compile. In that case strings are treated like plain ASCII strings: English characters are compared case insensitively, and the rest of the chart (for example, Hebrew characters represented by the 128-256 part of the chart) is compared using a plain binary comparison. As far as I can see, this works perfectly for me - I can pass both ASCII and UTF8 strings, which are matched using case insensitive collation for English characters and binary comparison for any other character.
Am I missing something? Does PCRE_UTF8 benefit me in any other way that I've missed so far in my testing?
Thanks,
G.
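One difference that may matter here: without PCRE_UTF8, PCRE treats every byte of the pattern and subject as one character, so a construct such as `.` or a quantifier applied to a Hebrew letter acts on a single byte rather than on the whole letter. The sketch below does not call PCRE at all; it only shows, under the assumption of valid UTF-8 input, how the same string looks in the two views. The function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Byte view: the number of "characters" a matcher sees when it is
 * not in UTF-8 mode and treats every byte as one character. */
size_t count_bytes(const char *s)
{
    return strlen(s);
}

/* Character view: the number of code points, assuming s is valid
 * UTF-8. Every byte that is not a continuation byte (10xxxxxx)
 * starts a new character. */
size_t count_utf8_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the four-letter Hebrew word encoded as "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D", count_bytes gives 8 and count_utf8_chars gives 4, which is why a pattern that is meant to consume "one character" can behave differently in the two modes.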
From ph10@??? Mon Oct 19 19:39:32 2009
Date: Mon, 19 Oct 2009 19:39:26 +0100 (BST)
From: Philip Hazel <ph10@???>
Cc: pcre-dev@???
Subject: Re: [pcre-dev] [Bug 897] New: \w and others based on Unicode
properties
On Mon, 19 Oct 2009, Pavel Kostromitinov wrote:
> However, having to deal with international characters almost constantly, I
> would really appreciate something like a compile-time option (for compiling
> pcre) to force it into using Unicode properties always.
> I cannot just replace all the "\b" with complex constructions based on \p{},
> since I don't write patterns myself - end-users do it. And parsing their
> patterns just to make correct replacement doesn't look appealing to me either.
>
> At least, I would greatly appreciate a hint on where should I look in pcre
> sources to try and change this behaviour myself.
Look at all the places in pcre_exec.c where one of the following opcodes
is mentioned:
OP_WORD_BOUNDARY, OP_NOT_WORD_BOUNDARY, OP_DIGIT, OP_NOT_DIGIT,
OP_WHITESPACE, OP_NOT_WHITESPACE, OP_WORDCHAR, OP_NOT_WORDCHAR.
There are 44 places in the code where you would have to make changes.
They would be quite substantial changes because not only does the
current code use a look-up table, it knows that it just needs to test
one byte from the subject instead of looking for a general UTF-8
character.
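To make the contrast concrete, here is a rough sketch of the two kinds of test. It is not code from pcre_exec.c: the function names are invented, the byte test stands in for the look-up table, and the "property" check is a hypothetical stand-in that only knows ASCII word characters and the Hebrew letters U+05D0 to U+05EA. The decoder assumes well-formed UTF-8 and does no validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte mode: a single test on one subject byte (stand-in for the
 * look-up table). Only ASCII letters, digits and '_' count. */
int is_word_byte(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '_';
}

/* Decode one code point from well-formed UTF-8 at s and store the
 * number of bytes consumed in *len. No validation is performed. */
uint32_t decode_utf8(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12) |
               ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    }
    *len = 4;
    return ((uint32_t)(s[0] & 0x07) << 18) |
           ((uint32_t)(s[1] & 0x3F) << 12) |
           ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

/* Hypothetical stand-in for a Unicode property lookup: ASCII word
 * characters plus Hebrew letters U+05D0..U+05EA, nothing else. */
int is_word_codepoint(uint32_t cp)
{
    if (cp < 0x80)
        return is_word_byte((unsigned char)cp);
    return cp >= 0x05D0 && cp <= 0x05EA;
}
```

With the Hebrew letter alef encoded as the bytes 0xD7 0x90, is_word_byte(0xD7) is false, whereas decoding first and then calling is_word_codepoint on U+05D0 is true. Every one of the places mentioned above would need to move from the first style of test to the second.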
I suppose a compile-time option would be better than a runtime option,
because that would save testing the option many times during a run.
However, in theory there would still have to be a test for UTF-8 mode at
run time. Some of the tests are inside loops - you don't want to test
the flag every time round the loop, so two copies of the loop will
probably be needed.
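A small sketch of what "two copies of the loop" could look like, using a made-up function that counts how many characters something like \D* would consume. This is an illustration only, not PCRE code; the name match_non_digits and the utf8 parameter are assumptions, and the UTF-8 branch assumes well-formed input.

```c
#include <stddef.h>

/* Count the characters at the start of s (n bytes long) that are not
 * ASCII digits. The mode flag is tested once, outside the loops, and
 * each mode has its own copy of the loop. */
size_t match_non_digits(const unsigned char *s, size_t n, int utf8)
{
    size_t i = 0, chars = 0;

    if (utf8) {
        /* UTF-8 copy: step over a whole multi-byte character. */
        while (i < n && !(s[i] >= '0' && s[i] <= '9')) {
            i++;
            while (i < n && (s[i] & 0xC0) == 0x80)
                i++;            /* skip continuation bytes */
            chars++;
        }
    } else {
        /* Byte copy: every byte is one character. */
        while (i < n && !(s[i] >= '0' && s[i] <= '9')) {
            i++;
            chars++;
        }
    }
    return chars;
}
```

For the five-byte subject "\xD7\x90\xD7\x91" followed by "7" (two Hebrew letters and a digit), the byte copy counts 4 characters and the UTF-8 copy counts 2.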
Hmmm.... Maybe the compile-time option should be "force UTF-8 mode
always and use Unicode properties always". Then a lot of testing for
UTF-8 mode could be cut out and the PCRE_UTF8 option would be redundant.
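As a sketch of how such a build option could remove the run-time tests, the mode check can be hidden behind a macro that becomes a constant when the option is set. Everything here is hypothetical: SKETCH_FORCE_UTF8, SKETCH_OPT_UTF8 and IN_UTF8_MODE are invented names, not PCRE identifiers, and the flag value is arbitrary.

```c
/* Hypothetical run-time option bit, standing in for a flag such as
 * PCRE_UTF8. The value is arbitrary for this sketch. */
#define SKETCH_OPT_UTF8 0x01u

#ifdef SKETCH_FORCE_UTF8
/* Built with the hypothetical "always UTF-8" option: the test is a
 * constant, so the compiler can drop the non-UTF-8 branches. */
#define IN_UTF8_MODE(options) 1
#else
/* Default build: the mode still has to be tested at run time. */
#define IN_UTF8_MODE(options) (((options) & SKETCH_OPT_UTF8) != 0)
#endif
```

In a default build, IN_UTF8_MODE(SKETCH_OPT_UTF8) is 1 and IN_UTF8_MODE(0) is 0; when compiled with -DSKETCH_FORCE_UTF8 both are 1 and the run-time flag no longer matters.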
I have just (today) released PCRE 8.00 and I don't plan on working on
PCRE now for some time, except to fix any important bugs that show up.
I have, however, noted this item for thinking about sometime in the
future.
Philip
--
Philip Hazel