Re: [pcre-dev] 8.34-RC1: \s is locale dependent whereas [\s]…

Author: ph10
Date:  
To: Ralf Junker
CC: pcre-dev
Subject: Re: [pcre-dev] 8.34-RC1: \s is locale dependent whereas [\s] is not
On Thu, 21 Nov 2013, Ralf Junker wrote:

> The two patterns below do not match identically. [\s] always matches VT but \s
> only if the current locale defines VT as space.
>
> /[\s]+/
>     > \x09\x0a\x0c\x0d\x0b<
>
> /\s+/
>     > \x09\x0a\x0c\x0d\x0b<


Did you specify the locale at compile time as well as at runtime? The
choice of characters for [\s] is made at compile time, but for \s it is
done at run time. If you did not specify the locale at compile time,
this explains the difference. I updated the documentation for 8.34 to
point out that matching a pattern in a different locale to that in which
it was compiled probably won't give sensible results. This is what it
now says:

It is possible to pass a table pointer or NULL (indicating the use of
the internal tables) to pcre_exec() or pcre_dfa_exec() (see the
discussion below in the section on matching a pattern). This facility
is provided for use with pre-compiled patterns that have been saved
and reloaded. Character tables are not saved with patterns, so if a
non-standard table was used at compile time, it must be provided again
when the reloaded pattern is matched. Attempting to use this facility
to match a pattern in a different locale from the one in which it was
compiled is likely to lead to anomalous (usually incorrect) results.

If you did specify the locale at compile time, the tables should be
remembered and used at runtime. In this case, there is something that
needs to be investigated.

> I believe that according to SVN 1363, ChangeLog entry 18, \s should always
> match VT even if the current locale does not define it as such.


This is a documentation issue. Perl and PCRE can run in three different
modes:

(1) ASCII mode (Perl /a, PCRE default) in which just the 6 characters
    x09-x0d plus space match \s.

(2) Unicode mode (Perl /u, PCRE_UCP) in which Unicode spaces match.

(3) Locale mode, in which, to quote from the perlcharclass page:

          For code points below 256 ...
               if locale rules are in effect ...
                   "\s" matches whatever the locale considers to be
                   whitespace.


The PCRE documentation does say this: "The default \s characters are now
HT (9), LF (10), VT (11), FF (12), CR (13), and space (32), which are
defined as white space in the "C" locale. This list may vary if
locale-specific matching is taking place; in particular, in some locales
the "non-breaking space" character (\xA0) is recognized as white space."
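
The six-character list can be checked against the C library with a small sketch like the following. It uses only standard C, the function name is made up for illustration, and it says nothing about what any other locale on a given system reports.

```c
#include <ctype.h>
#include <locale.h>

/* Count the byte values that isspace() accepts in the "C" locale.
   Note: this switches the process's LC_CTYPE locale to "C" as a side
   effect. Returns -1 if the locale cannot be selected. */
int count_c_locale_spaces(void)
{
    if (setlocale(LC_CTYPE, "C") == NULL)
        return -1;

    int n = 0;
    for (int c = 0; c < 256; c++)
        if (isspace(c))
            n++;
    return n;
}
```

In the "C" locale the count should cover exactly HT, LF, VT, FF, CR, and space. Running the same loop after selecting a different locale is one way to see whether that locale adds characters such as \xA0.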

I will check the documentation again and see if there are any
clarifications that can be made.

Thanks for taking the time to test and comment.

Philip

--
Philip Hazel