[pcre-dev] [Bug 897] \w and others based on Unicode properti…

Author: Philip Hazel
Date:  
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 897] New: \w and others based on Unicode properties
Subject: [pcre-dev] [Bug 897] \w and others based on Unicode properties

http://bugs.exim.org/show_bug.cgi?id=897




--- Comment #4 from Philip Hazel <ph10@???> 2009-12-16 10:58:29 ---
On Tue, 15 Dec 2009, Pavel Kostromitinov wrote:

> Here (attached) is my attempt to implement checking for \w as \p{L} - as a
> testcase, all the others will follow.
> It requires UTF8_USE_UCP to be set, along with SUPPORT_UTF8 and SUPPORT_UCP.
>
> I would greatly appreciate if you could review the changes and correct me if I
> did something wrong, or missed something.


I think you have missed something. Here is the code of your first
change, with my comments:

    case OP_WORDCHAR:
    if (eptr >= md->end_subject)
      {
      SCHECK_PARTIAL();
      RRETURN(MATCH_NOMATCH);
      }
    GETCHARINCTEST(c, eptr);
    ^^^^^^^^^^^^^^ 
That macro tests for UTF-8 mode, and loads either one byte or a whole 
UTF-8 character into the variable c. 
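
To make that behaviour concrete, here is a minimal sketch of such a
load-and-advance step. It is not PCRE's macro: the name
sketch_getchar_inc is hypothetical, and it assumes well-formed input of
at most 4 bytes per character.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE): read one character at *pp and
   advance *pp past it. In non-UTF-8 mode this is always a single byte.
   Assumes the input is well-formed UTF-8 of at most 4 bytes. */
uint32_t sketch_getchar_inc(const unsigned char **pp, bool utf8)
{
  const unsigned char *p = *pp;
  uint32_t c = *p++;

  if (utf8 && c >= 0xC0)
    {
    /* Number of continuation bytes implied by the lead byte. */
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;

    c &= 0x3Fu >> extra;                 /* payload bits of the lead byte */
    while (extra-- > 0)
      c = (c << 6) | (*p++ & 0x3Fu);     /* 6 payload bits per byte */
    }

  *pp = p;
  return c;
}
```

With the two bytes 0xD0 0x96, the sketch yields U+0416 and advances by
two bytes when utf8 is true, and yields 0xD0 and advances by one byte
when it is false.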


#ifdef UTF8_USES_UCP
    {
    const ucd_record *prop = GET_UCD(c);
    if (_pcre_ucp_gentype[prop->chartype] != ucp_L)
      RRETURN(MATCH_NOMATCH);
    }


However, your patch runs unconditionally, even when the UTF-8 flag is
not set at runtime. I am not sure that this is right. In non-UTF-8 mode
I would expect everything to behave as ASCII, for backwards
compatibility if for no other reason.

#else
    if (
#ifdef SUPPORT_UTF8
       c >= 256 ||
#endif
       (md->ctypes[c] & ctype_word) == 0
       )
      RRETURN(MATCH_NOMATCH);
#endif
    ecode++;
    break;



It is true that the code currently makes use of GET_UCD() in non-UTF-8
mode, but only to process \P and \p.

With your code, the name UTF8_USES_UCP is not correct, because it always
uses UCP. Something like GENERICS_USE_UCP might be better (for "generic
character types"). However, I think I would prefer to keep your name,
and change the code so that the PCRE_UTF8 flag is needed to cause it to
be used.
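
A rough sketch of what I mean by making the flag decide at runtime. All
the names here are hypothetical stand-ins: sketch_is_unicode_letter
takes the place of the GET_UCD() / ucp_L test and only knows a few
ranges, and sketch_is_ascii_word takes the place of the ctypes table
lookup. It is an illustration under those assumptions, not a patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the Unicode property test (GET_UCD() and
   ucp_L in the patch). Only a few ranges are covered, for illustration. */
bool sketch_is_unicode_letter(uint32_t c)
{
  if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return true;
  if (c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7)
    return true;                          /* Latin-1 letters */
  if (c >= 0x0410 && c <= 0x044F) return true;   /* basic Cyrillic */
  return false;
}

/* Hypothetical stand-in for (md->ctypes[c] & ctype_word) with ASCII-only
   tables: letters, digits and underscore. */
bool sketch_is_ascii_word(uint32_t c)
{
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z') || c == '_';
}

/* The point of the sketch: the property lookup is used only when the
   UTF-8 flag was set at runtime; otherwise the old table is used. */
bool sketch_wordchar(uint32_t c, bool utf8)
{
  if (utf8) return sketch_is_unicode_letter(c);
  return sketch_is_ascii_word(c);
}
```

Because the UTF-8 branch mirrors the \p{L} test from the test-case
patch, digits and underscore do not match there; that is a property of
the test case, not a proposal.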

I see that you have not patched the code for OP_WORD_BOUNDARY, around
line 1633. That code is already split into UTF-8 and non-UTF-8 cases.
If you just patched the UTF-8 case, the result will be different to your
\w patch, for the reason I gave above.
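
To illustrate why the two must agree, here is a small sketch in which a
word boundary is computed from whatever "is a word character" test it is
given. The function names and the two predicates are hypothetical, and
the subject is shown as an array of already-decoded characters for
simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*sketch_word_fn)(uint32_t c);

/* Hypothetical ASCII-only word test: letters, digits, underscore. */
bool sketch_ascii_word(uint32_t c)
{
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z') || c == '_';
}

/* Hypothetical wider word test: ASCII plus basic Cyrillic letters. */
bool sketch_wide_word(uint32_t c)
{
  return sketch_ascii_word(c) || (c >= 0x0410 && c <= 0x044F);
}

/* A boundary exists at pos when exactly one of the characters on either
   side of pos is a word character. Positions outside the subject count
   as non-word. */
bool sketch_word_boundary(const uint32_t *s, size_t len, size_t pos,
                          sketch_word_fn is_word)
{
  bool prev = pos > 0 && is_word(s[pos - 1]);
  bool cur = pos < len && is_word(s[pos]);
  return prev != cur;
}
```

For the two characters 'a' and U+0416, the ASCII test reports a boundary
between them and the wider test does not, so a \b that uses one test
while \w uses the other would give inconsistent matches.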

> Also it seems pcre_study.c is to be corrected for this to work, but I
> just pass PCRE_NO_START_OPTIMIZE to pcre_exec for now.


pcre_dfa_exec.c will have to be changed too. I told you this would be a
big job! :-)

Philip

