Re: [pcre-dev] 8.34-RC1: \s is locale dependent whereas [\s]…

Author: ph10
Date:  
To: Ralf Junker
CC: pcre-dev
Subject: Re: [pcre-dev] 8.34-RC1: \s is locale dependent whereas [\s] is not
On Mon, 25 Nov 2013, Ralf Junker wrote:

> The problem surfaced when I tested PCRE 8.34-RC1 without updating
> pcre_chartables.c. The old file does not define VT as white space and with
> this, PCRE gives different results for [\s] and \s, which I believe it should
> not.


It should certainly be the same for [\s] and \s. However, without
remaking pcre_chartables.c things will go wrong. I had forgotten that
pcre_maketables.c had been updated when removing the special nature of
VT from the code.

> On further investigation, it turned out that updating
> pcre_chartables.c solved the issue.


Good! I presume that means that I do not have to do anything more.

> It seems that for [\s], white space is determined by cbits in pcre_compile.c
> line 5030.


Yes, cbits are tables of "class bits" that can quickly be "or-ed" into
the table that is being built for [] classes. The "space" class always
did contain VT because it was/is used for POSIX [:space:], which always
recognized VT as a space character. There was a fiddle in the code for
[\s] to exclude VT. This fiddle has been removed.
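
As a rough illustration of the or-ing described above, here is a minimal
sketch. It is not the PCRE source: every name in it (the sk_ prefix, the
helper functions) is hypothetical, and it assumes a 32-byte bitmap with one
bit per byte value, filled from isspace() in the current locale.

```c
#include <ctype.h>
#include <string.h>

#define SK_CLASS_BYTES 32   /* 256 bits: one per possible byte value */
#define SK_CH_VT 0x0b       /* vertical tab */

/* Build a "space" bitmap from isspace(), as a [:space:] table would be. */
static void sk_build_space_cbits(unsigned char cbits[SK_CLASS_BYTES])
{
    memset(cbits, 0, SK_CLASS_BYTES);
    for (int c = 0; c < 256; c++)
        if (isspace(c))
            cbits[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Without a fiddle: OR the whole space bitmap into the class being built. */
static void sk_class_add_space(unsigned char classbits[SK_CLASS_BYTES],
                               const unsigned char cbits[SK_CLASS_BYTES])
{
    for (int i = 0; i < SK_CLASS_BYTES; i++)
        classbits[i] |= cbits[i];
}

/* With a fiddle: OR in the same bitmap, but leave the VT bit out. */
static void sk_class_add_space_no_vt(unsigned char classbits[SK_CLASS_BYTES],
                                     const unsigned char cbits[SK_CLASS_BYTES])
{
    for (int i = 0; i < SK_CLASS_BYTES; i++) {
        unsigned char bits = cbits[i];
        if (i == SK_CH_VT / 8)
            bits &= (unsigned char)~(1u << (SK_CH_VT % 8));
        classbits[i] |= bits;
    }
}

/* Test whether character c is a member of the class bitmap. */
static int sk_class_has(const unsigned char classbits[SK_CLASS_BYTES], int c)
{
    return (classbits[c / 8] >> (c % 8)) & 1;
}
```

In the default "C" locale, isspace() accepts VT, so the first variant puts VT
in the class and the second does not; all other space characters are treated
identically.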

> For \s, the white space is determined by md->ctypes in pcre_exec.c line 4815.


This table is used for "is this character a (Perl) space?" while
compiling and matching. It did not previously contain VT, but now it
does.

> I do not fully understand how these variables are filled, but they contain
> different values for VT.


The tables are created by pcre_maketables using functions like
isspace(), isupper(), etc., but previously there was a fiddle for VT.
Before 8.34, the tables would be different in their treatment of VT; for
8.34 they should be the same.
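
To make the difference concrete, here is a hedged sketch of building both
kinds of table from isspace() and checking whether they agree. The struct,
the function names, and the flag value are assumptions made for illustration;
this is not the code in pcre_maketables.c.

```c
#include <ctype.h>
#include <string.h>

#define MT_CTYPE_SPACE 0x01   /* illustrative flag value */
#define MT_CH_VT 0x0b         /* vertical tab */

struct mt_tables {
    unsigned char cbits_space[32];  /* bitmap of the kind used for [] classes */
    unsigned char ctypes[256];      /* per-character flags of the kind used for \s */
};

/* Fill both tables from isspace(). If vt_fiddle is non-zero, VT is left
   out of ctypes only, which imitates the older treatment. */
static void mt_make_tables(struct mt_tables *t, int vt_fiddle)
{
    memset(t, 0, sizeof *t);
    for (int c = 0; c < 256; c++) {
        if (!isspace(c))
            continue;
        t->cbits_space[c / 8] |= (unsigned char)(1u << (c % 8));
        if (!(vt_fiddle && c == MT_CH_VT))
            t->ctypes[c] |= MT_CTYPE_SPACE;
    }
}

/* Return the first character on which the two tables disagree, or -1. */
static int mt_first_mismatch(const struct mt_tables *t)
{
    for (int c = 0; c < 256; c++) {
        int in_cbits = (t->cbits_space[c / 8] >> (c % 8)) & 1;
        int in_ctypes = (t->ctypes[c] & MT_CTYPE_SPACE) != 0;
        if (in_cbits != in_ctypes)
            return c;
    }
    return -1;
}
```

With the fiddle enabled, the only disagreement in the "C" locale is at VT.
Mixing tables built the old way with code that expects the new treatment
gives the kind of [\s] versus \s difference reported here.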

> Btw., PCRE_EXTENDED seems to be affected by the problem as well: Unless VT is
> defined as white space in pcre_chartables.c, VT is not removed from the
> pattern in extended mode.


Yes, indeed. The tables are used for that purpose as well.
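
For illustration only, a simplified sketch of how a table lookup can decide
which pattern characters are skipped in extended mode. The function name and
flag value are hypothetical, and the sketch ignores # comments, escapes, and
character classes, all of which a real implementation has to handle.

```c
#include <stddef.h>

#define XM_CTYPE_SPACE 0x01   /* illustrative flag value */

/* Copy pattern to out, dropping every character that the table flags as
   a space. out must have room for the pattern length plus one byte.
   Returns the number of characters written, excluding the terminator. */
static size_t xm_strip_space(const char *pattern,
                             const unsigned char ctypes[256],
                             char *out)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)pattern; *p != 0; p++)
        if ((ctypes[*p] & XM_CTYPE_SPACE) == 0)
            out[n++] = (char)*p;
    out[n] = '\0';
    return n;
}
```

If the table does not flag VT as a space, a VT in the pattern is copied
through unchanged, which is the behaviour described in the report.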

Philip

--
Philip Hazel