[pcre-dev] PCRE_UTF8 flag

Author: Steven Gerrard
Date:  
To: pcre-dev
Subject: [pcre-dev] PCRE_UTF8 flag
Hi,
 
On your homepage it said that I could contact the active PCRE developers via this mail address, hope this is alright.
 
I've been using PCRE in a project of mine, which I have recently started internationalizing.
 
As far as I understand, the way to use UTF-8 strings with PCRE is to pass the PCRE_UTF8 option to pcre_compile.
 
Now, while I understand that passing this option to pcre_compile causes invalid UTF-8 strings to fail compilation, it seems that I can still use UTF-8 strings without it. Strings are then treated as plain single-byte strings: English (ASCII) letters are compared case-insensitively, and everything else (for example, Hebrew characters, which are encoded as bytes in the 128-255 range) is compared byte for byte. As far as I can see, this works perfectly for me: I can pass both ASCII and UTF-8 strings, and they are matched case-insensitively for English letters and by binary comparison for any other character.
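
The behaviour described above can be sketched in plain C. This is only an illustration and does not use PCRE; the function names are hypothetical. It folds ASCII letters and compares every other byte exactly, which is why literal UTF-8 text still compares correctly byte by byte. The second function counts characters rather than bytes, which is where a byte-oriented view and a character-oriented view start to differ: without PCRE_UTF8 a single-character item such as `.` consumes one byte, whereas in UTF-8 mode it consumes one whole character.

```c
#include <stddef.h>

/* Fold only ASCII 'A'..'Z' to lower case; every other byte is unchanged. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Compare two byte strings of length n: ASCII letters caselessly, all
   other bytes (including UTF-8 lead and continuation bytes) exactly. */
int bytes_equal_ascii_caseless(const unsigned char *a,
                               const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (fold_ascii(a[i]) != fold_ascii(b[i]))
            return 0;
    return 1;
}

/* Count UTF-8 characters by skipping continuation bytes (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_length(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

For a Hebrew letter such as U+05E9, encoded as the two bytes 0xD7 0xA9, the comparison succeeds byte for byte, but the string is one byte longer than its character count.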
 
Am I missing something? Does PCRE_UTF8 benefit me in any other way that I've missed so far in my testing?
 
Thanks,
G.


From: Philip Hazel <ph10@???>
Date: Mon, 19 Oct 2009 19:39:26 +0100 (BST)
To: 897@???
Cc: pcre-dev@???
Subject: Re: [pcre-dev] [Bug 897] New: \w and others based on Unicode properties


On Mon, 19 Oct 2009, Pavel Kostromitinov wrote:

> However, having to deal with international characters almost constantly, I
> would really appreciate something like a compile-time option (for compiling
> pcre) to force it into using Unicode properties always.
> I cannot just replace all the "\b" with complex constructions based on \p{},
> since I don't write patterns myself - end-users do it. And parsing their
> patterns just to make correct replacement doesn't look appealing to me either.
>
> At least, I would greatly appreciate a hint on where should I look in pcre
> sources to try and change this behaviour myself.


Look at all the places in pcre_exec.c where one of the following opcodes
is mentioned:

OP_WORD_BOUNDARY, OP_NOT_WORD_BOUNDARY, OP_DIGIT, OP_NOT_DIGIT,
OP_WHITESPACE, OP_NOT_WHITESPACE, OP_WORDCHAR, OP_NOT_WORDCHAR.

There are 44 places in the code where you would have to make changes.
They would be quite substantial changes because not only does the
current code use a look-up table, it knows that it just needs to test
one byte from the subject instead of looking for a general UTF-8
character.
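
To illustrate the difference between the two kinds of test, here is a minimal sketch in C. It is not PCRE's code: the table, the function names, and the use of the Hebrew letter range as a stand-in for a real Unicode property lookup are all assumptions made for the example. The byte-mode test is a single table lookup, while the Unicode version first has to decode a multi-byte character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-entry table: nonzero for ASCII letters, digits and '_'.
   This is a stand-in, not PCRE's actual character table. */
static unsigned char word_table[256];

void build_word_table(void)
{
    for (int c = 0; c < 256; c++)
        word_table[c] = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                        (c >= 'a' && c <= 'z') || c == '_';
}

/* Byte-mode test: one subject byte, one table lookup. */
int is_word_byte(unsigned char c)
{
    return word_table[c];
}

/* Decode one UTF-8 character at p (assumed valid and complete).
   Stores the code point in *cp and returns the bytes consumed. */
size_t decode_utf8(const unsigned char *p, uint32_t *cp)
{
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(p[0] & 0x0F) << 12) |
              ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
          ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    return 4;
}

/* Character-mode test. The Hebrew letter range U+05D0..U+05EA stands in
   for a full Unicode property lookup, purely for illustration. */
int is_word_char(uint32_t cp)
{
    if (cp < 0x80)
        return word_table[cp];
    return cp >= 0x05D0 && cp <= 0x05EA;
}
```

Call build_word_table() once before using either test. In byte mode the lead byte 0xD7 of a Hebrew letter is simply "not a word character"; in character mode the two bytes 0xD7 0x90 decode to U+05D0 before any classification happens.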

I suppose a compile-time option would be better than a runtime option,
because that would save testing the option many times during a run.
However, in theory there would still have to be a test for UTF-8 mode at
run time. Some of the tests are inside loops - you don't want to test
the flag every time round the loop, so two copies of the loop will
probably be needed.
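
As a rough sketch of what "two copies of the loop" means, the C function below tests the mode flag once and then runs one of two separate loops. The helper names are hypothetical and this is not taken from pcre_exec.c; the UTF-8 loop is deliberately simplified to 1- and 2-byte sequences, and the Hebrew letter range again stands in for a real Unicode property lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not PCRE functions. */
static int is_word_byte(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '_';
}

/* Stand-in for a Unicode property lookup: ASCII word characters plus
   the Hebrew letters U+05D0..U+05EA, chosen only for illustration. */
static int is_word_char(uint32_t cp)
{
    if (cp < 0x80)
        return is_word_byte((unsigned char)cp);
    return cp >= 0x05D0 && cp <= 0x05EA;
}

/* Length in bytes of the leading run of word characters in s[0..len).
   The utf8 flag is tested once, outside both loops. The UTF-8 loop
   decodes only 1- and 2-byte sequences; anything else ends the run. */
size_t word_run(const unsigned char *s, size_t len, int utf8)
{
    size_t i = 0;
    if (utf8) {
        while (i < len) {
            uint32_t cp = s[i];
            size_t n = 1;
            if ((s[i] & 0xE0) == 0xC0 && i + 1 < len) {
                cp = ((uint32_t)(s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F);
                n = 2;
            }
            if (!is_word_char(cp))
                break;
            i += n;
        }
    } else {
        while (i < len && is_word_byte(s[i]))
            i++;
    }
    return i;
}
```

With the subject bytes 'a', 'b', 0xD7, 0x90, ' ', 'c', the byte loop stops at the first non-ASCII byte, while the UTF-8 loop also consumes the two-byte Hebrew letter before stopping at the space.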

Hmmm.... Maybe the compile-time option should be "force UTF-8 mode
always and use Unicode properties always". Then a lot of testing for
UTF-8 mode could be cut out and the PCRE_UTF8 option would be redundant.

I have just (today) released PCRE 8.00 and I don't plan on working on
PCRE now for some time, except to fix any important bugs that show up.
I have, however, noted this item for thinking about sometime in the
future.

Philip

--
Philip Hazel