[pcre-dev] [Bug 897] \w and others based on Unicode properties

Author: Philip Hazel
Date:
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 897] New: \w and others based on Unicode properties
Subject: [pcre-dev] [Bug 897] \w and others based on Unicode properties

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=897

--- Comment #13 from Philip Hazel <ph10@???> 2009-12-17 17:44:22 ---
On Wed, 16 Dec 2009, Pavel Kostromitinov wrote:

> Well, I added an attachment describing what I meant. It's a patch for
> pcre_internal.h and pcre_exec.c, seemingly handling all what is needed
> (in pcre_exec()) for UTF_USES_UCP to work.
>
> The only functions added are ucp_XXX at the end of pcre_exec, and they
> are only called if they are needed. Surely, one could do it all as
> #defines or just inline it, but this way it looks much cleaner in my
> opinion.

Adding the macros as you have done is indeed a nice tidy way of wrapping
up all the different cases. I have not looked at them in detail, but
they certainly seem to be reasonably at a quick glance, apart from one
thing:

#define CHAR_IS_WORDCHAR(c, utf8) (utf8 ? ucp_wordchar(c) : ((c < 256) &&
(md->ctypes[c] & ctype_word)))

I really do not like using a function here. I am sure it will lose
performance. The function is so short:

  int ucp_wordchar(int c) 
        const ucd_record *prop = GET_UCD(c);
        return 
                (_pcre_ucp_gentype[prop->chartype] == ucp_L) ||
                (_pcre_ucp_gentype[prop->chartype] == ucp_N) ||
                (c == '_');

I am sure there will be performance hit that can easily be avoided by
making the macro generate inline code. The variable "prop" can be
generally available in the function, and the macro can generate
this for the UCD case:

  prop = GET_UCD(c), 
                (_pcre_ucp_gentype[prop->chartype] == ucp_L) ||
                (_pcre_ucp_gentype[prop->chartype] == ucp_N) ||
                (c == '_')

If a function is used, it would have to be called _pcre_ucp_wordchar()
because it is an external (in the C sense) function. Also, it would have
to be put into a separate source file because, assuming pcre_dfa_exec()
is updated as well, some users may use only pcre_dfa_exec() and not
pcre_exec(), and you do not want them to load unnecessary code.

As a style point, I would never write (md->ctypes[c] & ctype_word) as a
true/false value. I would always write

((md->ctypes[c] & ctype_word) != 0)

partly to emphasize that it is a truth value, and partly to ensure that
the value actually is 0 or 1 so that if it is part of a larger
boolean expression, it will work as expected.

Regards,
Philip

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

This message is part of the following thread:
	the complete thread tree sorted by date
	Philip Hazel at
	Pavel Kostromitinov at

[pcre-dev] [Bug 897] \w and others based on Unicode properti…