[pcre-dev] Is it possible to use UTF-8 literal characters in…

Author: Frank Chang
Date:  
To: pcre-dev
Subject: [pcre-dev] Is it possible to use UTF-8 literal characters in a C/C++ PCRE regex?
Good afternoon. We are trying to match the German string "Munich
tausendschöne Jungfräulein ausendschçne" using a C/C++ PCRE regex compiled
with the PCRE_UTF8, PCRE_UCP, and PCRE_CASELESS options. Is it possible to
construct a valid PCRE regex that contains the literal characters ö, ä, or ç,
encoded as UTF-8, instead of code point escapes?
        We are able to match this string with a PCRE regex that uses a
positive lookahead and a sequence of Unicode code point escapes, for
example (?=.+(\x{0068}\x{00F6})){1}. However, when we add any of the
literals ö, ä, or ç to the regex, pcre_compile() reports that the pattern is
an invalid UTF-8 string. Thank you.
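
One way to narrow this down is to check the bytes that actually reach pcre_compile(). The sketch below assumes (this is only a guess, not something confirmed above) that the source file is saved in Latin-1 or Windows-1252, so that ö arrives as the single byte 0xF6 rather than the UTF-8 pair 0xC3 0xB6. The function is_valid_utf8() is a hypothetical helper written for this illustration and is not part of the PCRE API.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if the n bytes at s
   are well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t extra, k;
        unsigned long cp;

        if (c < 0x80) { i++; continue; }              /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return 0;                                /* stray continuation or bad lead byte */

        if (extra > n - i - 1) return 0;              /* sequence is truncated */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* expected a 10xxxxxx byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if (extra == 1 && cp < 0x80) return 0;
        if (extra == 2 && cp < 0x800) return 0;
        if (extra == 3 && (cp < 0x10000UL || cp > 0x10FFFFUL)) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;

        i += extra + 1;
    }
    return 1;
}

/* The same word in two encodings. Adjacent string literals are used so the
   hex escapes cannot absorb the letters that follow them. */
const unsigned char schoene_utf8[]   = "sch" "\xC3\xB6" "ne"; /* ö as UTF-8   */
const unsigned char schoene_latin1[] = "sch" "\xF6" "ne";     /* ö as Latin-1 */
```

Calling is_valid_utf8(schoene_utf8, 7) returns 1 and is_valid_utf8(schoene_latin1, 6) returns 0. If the pattern bytes fail a check like this, writing the literal with explicit byte escapes such as "\xC3\xB6", or saving the source file as UTF-8, would be worth trying before falling back to \x{...} escapes.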