Author: ph10
Date:
To: Ralf Junker
CC: pcre-dev
Subject: Re: [pcre-dev] 8.34-RC1: \s is locale dependent whereas [\s] is not
On Mon, 25 Nov 2013, Ralf Junker wrote:
> The problem surfaced when I tested PCRE 8.34-RC1 without updating
> pcre_chartables.c. The old file does not define VT as white space and with
> this, PCRE gives different results for [\s] and \s which I believed it should
> not.
It should certainly be the same for [\s] and \s. However, without
remaking pcre_chartables.c things will go wrong. I had forgotten that
pcre_maketables.c had been updated when removing the special nature of
VT from the code.
> On further investigation, it turned out that updating
> pcre_chartables.c solved the issue.
Good! I presume that means that I do not have to do anything more.
> It seems that for [\s], white space is determined by cbits in pcre_compile.c
> line 5030.
Yes, cbits are tables of "class bits" that can quickly be "or-ed" into
the table that is being built for [] classes. The "space" class always
did contain VT because it was/is used for POSIX [:space:], which always
recognized VT as a space character. There was a fiddle in the code for
[\s] to exclude VT. This fiddle has been removed.
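
To illustrate the "or-ing" idea, here is a minimal sketch of how a predefined
class bitmap could be merged into a class under construction. It is not the
PCRE source: the names CLASS_BYTES, build_space_bits, class_add and class_has
are hypothetical, and the 32-byte (256-bit) layout is an assumption made for
the example.

```c
#include <ctype.h>
#include <string.h>

#define CLASS_BYTES 32   /* assumed layout: 256 bits, one per byte value */

/* Hypothetical helper: set one bit for every byte that isspace() accepts
   in the current locale. */
static void build_space_bits(unsigned char bits[CLASS_BYTES])
{
    memset(bits, 0, CLASS_BYTES);
    for (int c = 0; c < 256; c++)
        if (isspace(c))
            bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* OR a predefined bitmap into the class that is being built. */
static void class_add(unsigned char cls[CLASS_BYTES],
                      const unsigned char bits[CLASS_BYTES])
{
    for (int i = 0; i < CLASS_BYTES; i++)
        cls[i] |= bits[i];
}

/* Test whether byte value c (0..255) is a member of the class. */
static int class_has(const unsigned char cls[CLASS_BYTES], int c)
{
    return (cls[c / 8] >> (c % 8)) & 1;
}
```

With a bitmap built this way in the default "C" locale, a class that has had
the space bits added accepts VT ('\v') without any special handling, because
isspace() reports VT as white space there.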
> For \s, the white space is determined by md->ctypes in pcre_exec.c line 4815.
This table is used for "is this character a (Perl) space?" while
compiling and matching, and it previously did not contain VT, but now it
does.
> I do not fully understand how these variables are filled, but they contain
> different values for VT.
The tables are created by pcre_maketables using functions like
isspace(), isupper(), etc., but previously there was a fiddle for VT.
Before 8.34, the tables would be different in their treatment of VT; for
8.34 they should be the same.
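
The difference between the two tables can be sketched roughly as follows. This
is an illustration under stated assumptions, not the pcre_maketables code: the
struct layout, the CTYPE_SPACE value, and the legacy_vt switch are all invented
for the example, with legacy_vt standing in for the pre-8.34 fiddle.

```c
#include <ctype.h>
#include <string.h>

#define CTYPE_SPACE 0x01   /* illustrative flag value, not taken from PCRE */

struct tables {
    unsigned char space_bits[32];   /* class bitmap, consulted for [\s] */
    unsigned char ctypes[256];      /* per-character flags, consulted for \s */
};

/* Build both tables from isspace(). A non-zero legacy_vt mimics the old
   behaviour described above: VT stays in the class bitmap but is left out
   of the per-character flags. */
static void make_tables(struct tables *t, int legacy_vt)
{
    memset(t, 0, sizeof *t);
    for (int c = 0; c < 256; c++) {
        if (!isspace(c))
            continue;
        t->space_bits[c / 8] |= (unsigned char)(1u << (c % 8));
        if (!(legacy_vt && c == '\v'))
            t->ctypes[c] |= CTYPE_SPACE;
    }
}

/* Membership test of the kind a [\s] class would use. */
static int in_class(const struct tables *t, int c)
{
    return (t->space_bits[c / 8] >> (c % 8)) & 1;
}

/* Flag test of the kind a bare \s would use. */
static int is_space(const struct tables *t, int c)
{
    return (t->ctypes[c] & CTYPE_SPACE) != 0;
}
```

Tables built with legacy_vt set disagree about VT (in_class is true, is_space
is false), which is the kind of mismatch a stale pcre_chartables.c would carry
into code that no longer compensates for it. Tables built with legacy_vt clear
agree.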
> Btw., PCRE_EXTENDED seems to be affected by the problem as well: Unless VT is
> defined as white space in pcre_chartables.c, VT is not removed from the
> pattern in extended mode.
Yes, indeed. The tables are used for that purpose as well.
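
As a rough picture of why extended mode is affected, here is a simplified
sketch of table-driven white space removal from a pattern. The function name
strip_pattern_space and the plain 0/1 lookup table are hypothetical, and the
sketch ignores comments, escapes and character classes, which real extended
mode also has to handle.

```c
#include <stddef.h>

/* Hypothetical helper: copy pattern into out, dropping every byte that the
   lookup table marks as white space. out must have room for the whole
   pattern plus the terminating NUL. Returns the length of the result. */
static size_t strip_pattern_space(const char *pattern,
                                  const unsigned char is_space[256],
                                  char *out)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)pattern;
         *p != '\0'; p++) {
        if (!is_space[*p])
            out[n++] = (char)*p;
    }
    out[n] = '\0';
    return n;
}
```

If the table does not mark VT as white space, a VT in the pattern is copied
through unchanged. With a table that does mark it, the VT is dropped along with
the other white space.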