[pcre-dev] Strange \w+ behavior with custom word characters …

Author: Ralf Junker
Date:  
To: pcre-dev@exim.org
Subject: [pcre-dev] Strange \w+ behavior with custom word characters (pcre_chartables.c)
The following examples each match \w or \w+ against "Ä", the uppercase
German A-umlaut (U+00C4).

With the default pcre_chartables.c, the first three patterns match and
the last one does not, because by default PCRE does not consider "Ä" a
word character.

However, if I change pcre_chartables.c so that "Ä" is a word character,
the first pattern no longer matches. Interestingly, only the first
pattern fails. The second pattern, where \w+ is parenthesized, still
matches. The last pattern now matches as well, showing that \w by itself
matches "Ä".

/\w+\x{C4}/8
a\x{C4}

/(\w+)\x{C4}/8
a\x{C4}

/\w\x{C4}/8
a\x{C4}

/\w/
\x{C4}
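
For reference, here is a minimal sketch of how the subject a\x{C4} is
laid out in memory for the three /8 patterns, assuming the test program
encodes \x{C4} as standard UTF-8 in that mode. utf8_encode is a
hypothetical helper written for illustration, not a PCRE function.
U+00C4 becomes the two bytes 0xC3 0x84, whereas the last pattern (no /8)
presumably sees the single byte 0xC4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): encode the code point cp as
   UTF-8 into out, which must have room for 4 bytes. Returns the number
   of bytes written. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {                 /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {              /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(0xC4, buf) writes 0xC3 0x84 and returns 2, so the
/8 subject is the three bytes 0x61 0xC3 0x84.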

I suspect that there is an issue with repetitions involving non-ASCII
characters. I searched the code but could not see why this is happening.

Any help is much appreciated. I can send my pcre_chartables.c if required.

Ralf