Author: Issaana
Date:
To: pcre-dev
Subject: Re: [pcre-dev] BACKREFERENCE with (PCRE_UTF8|PCRE_CASELESS) is a unexpected result

Hello,
>> I built PCRE with SUPPORT_UTF8 and SUPPORT_UCP, and tried the following
>> code.
>>
>> re=pcre_compile("(\xc3\x80)\\1",PCRE_UTF8|PCRE_CASELESS,&err,&erroff,NULL);
>> rc=pcre_exec(re,NULL,"\xc3\x80\xc3\x80",4,0,0,ov,6); //(A) rc=2
>> rc=pcre_exec(re,NULL,"\xc3\x80\xc3\xa0",4,0,0,ov,6); //(B) rc=PCRE_ERROR_NOMATCH
>>
>> \xc3\x80 is UTF-8 code of U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE)
>> \xc3\xa0 is UTF-8 code of U+00E0 (LATIN SMALL LETTER A WITH GRAVE)
>>
>> (B) is an unexpected result. Is it a bug, or my misunderstanding?
>
>You have misunderstood. A back reference matches *exactly* what the
>subpattern matched. If you want the subpattern to be re-evaluated, you
>must use a "subroutine" call instead.
>
Thank you for the explanation, but it is still not entirely clear to me.
The only difference between (B) and (D) is whether the captured substring is
non-ASCII or ASCII. PCRE_CASELESS takes effect for the back reference in the
ASCII case, but not in the non-ASCII case. Is this the expected result?
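
For reference, the two byte sequences quoted above can be checked by hand.
The following is only a minimal sketch in plain C; it does not use PCRE and
is not how PCRE implements caseless matching. The helper names
(decode_utf8_2, latin1_lower, check_quoted_sequences) are hypothetical and
exist only for this illustration. The case mapping covers U+0000..U+00FF
only, which is enough for the two characters in question.

```c
#include <assert.h>

/* Decode one 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
   Returns -1 if the bytes are not a well-formed 2-byte sequence. */
static long decode_utf8_2(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    long cp;

    if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return -1;
    cp = ((long)(s[0] & 0x1F) << 6) | (long)(s[1] & 0x3F);
    return cp < 0x80 ? -1 : cp;   /* reject overlong encodings */
}

/* Hypothetical helper: simple lower-casing for U+0000..U+00FF only.
   Upper-case letters A-Z and U+00C0..U+00DE (except U+00D7, the
   multiplication sign) map to lower case by adding 0x20. */
static long latin1_lower(long cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) ||
        (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7))
        return cp + 0x20;
    return cp;
}

/* Checks the statements made about the two sequences in this thread. */
static void check_quoted_sequences(void)
{
    long upper = decode_utf8_2("\xc3\x80");   /* U+00C0 */
    long lower = decode_utf8_2("\xc3\xa0");   /* U+00E0 */

    assert(upper == 0xC0);
    assert(lower == 0xE0);
    /* The two code points differ only by case. */
    assert(latin1_lower(upper) == lower);
}
```

Calling check_quoted_sequences() from a test program confirms that the
subject string in (B) is U+00C0 followed by U+00E0, that is, the same letter
in upper and lower case. Those are the two characters I expected
PCRE_CASELESS to treat as equal in the back reference.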