[pcre-dev] [Bug 1412] \s Randomly Matches xA0

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1412] \s Randomly Matches xA0
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1412




--- Comment #1 from Philip Hazel <ph10@???> 2013-11-07 17:14:51 ---
On Thu, 7 Nov 2013, miqrogroove wrote:

> I am working on a bug patch for WordPress. With the help of other developers
> there, we have reached the conclusion that the PCRE \s sequence will randomly
> match the \xA0 byte in some PHP environments but not in others.
>
> Specifically, this PHP command will return 1 on some computers and 0 on other
> computers:
>
> var_dump( preg_match( '/^\s$/', "\xA0" ) );
>
> In some cases it will produce different results within phpunit than without.
>
> We have not been able to identify any specific environmental cause such as
> hardware, OS, or PHP version.
>
> Can you shed some light on why this regexp has random results?


Yes. PCRE (and Perl) will match \xa0 with \s when running in "Unicode"
mode, because U+00A0 is a non-breaking space. In the case of PCRE, this
means when the PCRE_UCP option is passed to pcre_compile(). Here is output
from the pcretest program that demonstrates this:

PCRE version 8.33 2013-05-28

/\s/
\x{a0}
No match

/\s/8
\x{a0}
No match

/\s/W
\x{a0}
0: \xa0

/\s/8W
\x{a0}
0: \x{a0}

The four tests are, in order: ASCII mode (no modifiers); UTF-8 mode
without Unicode properties (/8); non-UTF mode with Unicode properties
(/W); and both UTF-8 and Unicode properties (/8W). As you can see, \s
matches \xa0 only when Unicode property support is invoked.

Clearly, the different PHP implementations you are using must be calling
pcre_compile() with different options.
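
One way to take the guesswork out of a PHP test is to state the intended
mode in the pattern itself. The following is only a sketch, not a
diagnosis of your environments. It assumes that the PHP build's "u"
modifier enables both UTF-8 and Unicode property handling (the /8W case
above), and the variable names are purely illustrative:

```php
<?php
// Sketch only. Assumption: this PHP build's "u" modifier turns on both
// UTF-8 and Unicode-property handling in PCRE (the /8W case above).

// U+00A0 encoded as UTF-8 is the two bytes C2 A0, not the single byte A0.
$nbsp_utf8 = "\xC2\xA0";

// With "u", \s is asked to follow Unicode rules explicitly, so the
// result should not hinge on whatever options are passed by default.
$unicode_result = preg_match('/^\s$/u', $nbsp_utf8);

// An explicit byte class avoids \s altogether: only these six ASCII
// whitespace bytes can match, so a bare A0 byte cannot.
$ascii_result = preg_match('/^[ \t\n\x0B\f\r]$/', "\xA0");

var_dump($unicode_result);
var_dump($ascii_result);
```

Where that assumption holds, the first call should report a match and the
second should not. Note that in UTF-8 mode the subject must be valid
UTF-8, so a lone \xA0 byte is not a suitable subject for the first
pattern.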

Regards,
Philip


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email