Re: [pcre-dev] BACKREFERENCE with (PCRE_UTF8|PCRE_CASELESS) …

Author: Issaana
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] BACKREFERENCE with (PCRE_UTF8|PCRE_CASELESS) is a unexpected result
Hello,

>> I built PCRE with SUPPORT_UTF8 and SUPPORT_UCP, and tried the following
>> code.
>>
>> re=pcre_compile("(\xc3\x80)\\1",PCRE_UTF8|PCRE_CASELESS,&err,&erroff,NULL);
>> rc=pcre_exec(re,NULL,"\xc3\x80\xc3\x80",4,0,0,ov,6); //(A) rc=2
>> rc=pcre_exec(re,NULL,"\xc3\x80\xc3\xa0",4,0,0,ov,6); //(B) rc=PCRE_ERROR_NOMATCH
>>
>> \xc3\x80 is the UTF-8 encoding of U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE)
>> \xc3\xa0 is the UTF-8 encoding of U+00E0 (LATIN SMALL LETTER A WITH GRAVE)
>>
>> (B) is an unexpected result. Is it a bug, or my misunderstanding?
>
>You have misunderstood. A back reference matches *exactly* what the
>subpattern matched. If you want the subpattern to be re-evaluated, you
>must use a "subroutine" call instead.
>


Thank you for your explanation, but it is still not entirely clear to me.

The following code is an ASCII version of the example above.

re=pcre_compile("(A)\\1",PCRE_UTF8|PCRE_CASELESS,&err,&erroff,NULL);
rc=pcre_exec(re,NULL,"AA",2,0,0,ov,6); //(C) rc=2
rc=pcre_exec(re,NULL,"Aa",2,0,0,ov,6); //(D) rc=2

The only difference between (B) and (D) is whether the captured substring is
non-ASCII or ASCII. PCRE_CASELESS takes effect in the back reference in the
ASCII case, but not in the non-ASCII case. Is this the expected result?
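
For reference, here is a minimal sketch that decodes the two-byte UTF-8
sequences used in (A) and (B), to confirm that they really are U+00C0 and
U+00E0. The helper name decode_utf8_2 is my own invention for illustration;
it is not a PCRE function, and it handles only two-byte sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
   Returns -1 if the input does not start with a well-formed two-byte
   sequence. decode_utf8_2 is a hypothetical helper, not part of PCRE. */
long decode_utf8_2(const char *s)
{
    unsigned char b0, b1;
    long cp;

    assert(s != NULL);

    b0 = (unsigned char)s[0];
    if ((b0 & 0xE0) != 0xC0)   /* lead byte must be 110xxxxx */
        return -1;

    b1 = (unsigned char)s[1];
    if ((b1 & 0xC0) != 0x80)   /* continuation byte must be 10xxxxxx */
        return -1;

    /* 5 payload bits from the lead byte, 6 from the continuation byte */
    cp = ((long)(b0 & 0x1F) << 6) | (long)(b1 & 0x3F);

    if (cp < 0x80)             /* reject overlong encodings */
        return -1;
    return cp;
}
```

decode_utf8_2("\xc3\x80") yields 0xC0 and decode_utf8_2("\xc3\xa0") yields
0xE0. The two code points differ by 0x20, the same distance as between 'A'
and 'a', so (B) is the non-ASCII counterpart of (D).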

Thanks,
Issaana