[pcre-dev] \u in JavaScript compat mode

Author: Zoltán Herczeg
Date:  
To: pcre-dev
Subject: [pcre-dev] \u in JavaScript compat mode
Hi,

I recently got a notification that PCRE does not support \u in PCRE_JAVASCRIPT_COMPAT mode.

I have checked the latest standard here:
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

And it says:

15.10.1 Patterns

CharacterEscape ::
[...]
HexEscapeSequence
UnicodeEscapeSequence
[...]

Later, in 15.10.2.10 CharacterEscape:

The production CharacterEscape :: HexEscapeSequence evaluates by evaluating the CV of the HexEscapeSequence (see 7.8.4) and returning its character result.
The production CharacterEscape :: UnicodeEscapeSequence evaluates by evaluating the CV of the UnicodeEscapeSequence (see 7.8.4) and returning its character result.

7.8.4 String Literals

[...]
HexEscapeSequence ::
x HexDigit HexDigit
UnicodeEscapeSequence ::
u HexDigit HexDigit HexDigit HexDigit
[...]

Thus, in PCRE_JAVASCRIPT_COMPAT mode, a \x hex escape must be followed by exactly two hexadecimal digits and is evaluated as a byte, while \u must be followed by exactly four hexadecimal digits and is evaluated as an unsigned short (a 16-bit value).
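
To make the rule concrete, here is a minimal sketch of how such an escape could be decoded. The function name decodeHexEscape and its return shape are hypothetical, chosen only for illustration; this is not PCRE code and says nothing about how PCRE would implement it.

```javascript
// Hypothetical helper (not part of PCRE): decode a \x or \u escape
// following the ECMA-262 grammar quoted above.
// `src` is the pattern text; `pos` is the index of the 'x' or 'u'
// that follows the backslash.
function decodeHexEscape(src, pos) {
  // \x takes exactly two hex digits, \u exactly four.
  const width = src[pos] === 'x' ? 2 : src[pos] === 'u' ? 4 : 0;
  if (width === 0) return null; // not a hex or Unicode escape

  const digits = src.slice(pos + 1, pos + 1 + width);
  // Reject the escape if fewer than `width` hex digits follow.
  if (digits.length !== width || !/^[0-9A-Fa-f]+$/.test(digits)) return null;

  // `value` is the character value; `next` is the index after the escape.
  return { value: parseInt(digits, 16), next: pos + 1 + width };
}
```

A null result means the text is not a well-formed escape, so the caller has to decide what to do with the x or u instead.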

I have checked what happens when \u is not followed by four hex digits:
/b\u0041x/ matches "bAx"
/b\u041x/ matches "bu041x"
/b\x41x/ matches "bAx"
/b\x1x/ matches "bx1x"
So \u is simply treated as a literal u, and \x as a literal x, when not followed by enough hex digits.
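
The four cases above can be rechecked in any JavaScript engine with a short script like the following. It is only a sketch: the names escapeCases and escapeResults are made up for this example, the patterns are built without the "u" flag, and each one is anchored so that the whole subject has to match.

```javascript
// Each entry: [pattern source, subject that is expected to match in full].
// The doubled backslashes make the RegExp constructor receive \u and \x.
const escapeCases = [
  ['b\\u0041x', 'bAx'],    // well-formed \u escape
  ['b\\u041x',  'bu041x'], // too few digits: \u acts as a literal u
  ['b\\x41x',   'bAx'],    // well-formed \x escape
  ['b\\x1x',    'bx1x'],   // too few digits: \x acts as a literal x
];

// Anchor each pattern and record whether the subject matches.
const escapeResults = escapeCases.map(
  ([source, subject]) => new RegExp('^' + source + '$').test(subject)
);

console.log(escapeResults);
```

An engine that behaves as described above should report a match for every entry.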

Philip, could we follow the standard here?

Regards,
Zoltan