[pcre-dev] [Bug 2301] Wish: (?w=[class]) modifier redefines …

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2301] Wish: (?w=[class]) modifier redefines word characters for \w \W shorthand and \b \B anchors.
https://bugs.exim.org/show_bug.cgi?id=2301

--- Comment #1 from Philip Hazel <ph10@???> ---
(In reply to Raccoon from comment #0)
> I wish I had requested this over a decade ago...


Two decades would have been better :-) and maybe to the Perl folks...

You can exclude underscore from \w by writing [^\W_] but (a) that is obscure
and (b) it's a lot more characters. If you are calling PCRE from a program and
you are not using PCRE2_UCP, and you are interested only in characters with
codepoints less than 256, you can call pcre2_maketables() to make a set of
character tables, and then hack them, but you have to know what you are doing.
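As a quick illustration of the [^\W_] idiom, here is a minimal sketch. It uses Python's re module rather than PCRE itself, purely for convenience, with re.ASCII standing in for running PCRE without PCRE2_UCP. The sample string and variable names are made up for the example.

```python
import re

# In ASCII mode \w is [A-Za-z0-9_]; [^\W_] is the same set minus underscore.
WORD = re.compile(r"\w+", re.ASCII)
WORD_NO_UNDERSCORE = re.compile(r"[^\W_]+", re.ASCII)

sample = "foo_bar baz42"  # hypothetical input

with_underscore = WORD.findall(sample)
without_underscore = WORD_NO_UNDERSCORE.findall(sample)

print(with_underscore)     # ['foo_bar', 'baz42']
print(without_underscore)  # ['foo', 'bar', 'baz42']
```

Note that this only covers \w; \b and \B still treat underscore as a word character, which is the gap the request is about.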

Having said that, thinking about this in general:

These days, we must think in terms of Unicode. Perl has added \b{wb} as a
Unicode word boundary. It seems to think in terms of Roman-alphabet languages,
since it knows that an apostrophe can appear in the middle of a word (for
example). I looked at this recently, but decided against implementing anything
at this time.
>
> Usage: (?w=[class])pattern(?-w)
> Example: /(?w=[a-z])\bxyzzy\b/i


This is an interesting proposal, and I suppose it won't be the first time PCRE
has added functionality unavailable in Perl. I have added it to the Wish List,
but do not hold your breath. It may never happen, and certainly not soon
because 10.32 is currently in release candidate testing, so no development is
happening.

> People haven't been using [_] as a word character since the 1960's,


Er, in programming languages they have, and still are.

> It may be possible to allow (?w=...) to accept more than just a
> [character-class] but perhaps a simple pattern as well. Ie: (?w=hello) or
> (?w=[a-zA-Z0-9]{3,}). I don't know how crazy you can get with the
> substitutions. That's up to you!


The speed implications are horrendous, not to mention the implementation
difficulties.

> It would be entirely possible to redefine \w as [a-zA-Z0-9] even though it
> would overlap \d.


No more than the existing \w does, as it matches all characters for which
isalnum() is true, plus underscore.

I was wondering about simpler solutions. An option to remove underscore from
the default tables (PCRE2_WORD_NO_UNDERSCORE) could be invented, I suppose.
The fastest implementation would be to copy the tables and remove underscore,
but then the tables would have to be packaged with each compiled pattern.[*] Ignoring
underscore at runtime would be a minor performance penalty, but needs work in
both interpretive matching functions plus work in the JIT. So none of this is
straightforward. Simply copying your class (and its negative) into the pattern
at the relevant points would also hit performance.
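The "copy the class (and its negative) into the pattern" approach can already be done by hand with lookarounds. Below is a minimal sketch, again in Python's re module rather than PCRE. The helper name boundary_for and the test strings are hypothetical, and the class body is assumed to match exactly one character so that the lookbehinds stay fixed-width.

```python
import re

def boundary_for(class_body):
    """Build stand-ins for \\b and \\B that use [class_body] as the word class.

    class_body is the inside of a character class, for example "a-z".
    """
    inside = "[" + class_body + "]"
    # Boundary: exactly one side of the current position is a "word" character.
    word_b = f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"
    # Non-boundary: both sides agree (both word, or both non-word).
    non_word_b = f"(?:(?<!{inside})(?!{inside})|(?<={inside})(?={inside}))"
    return word_b, non_word_b

word_b, non_word_b = boundary_for("a-z")

# Rough equivalent of the proposed /(?w=[a-z])\bxyzzy\b/i
custom = re.compile(word_b + "xyzzy" + word_b, re.IGNORECASE)
standard = re.compile(r"\bxyzzy\b", re.IGNORECASE | re.ASCII)

subject = "say xyzzy_1 now"  # hypothetical input
hits_custom = bool(custom.search(subject))
hits_standard = bool(standard.search(subject))

print(hits_custom)    # True: "_" is not in [a-z], so a boundary follows "y"
print(hits_standard)  # False: "_" is a word character for the normal \b
```

Each boundary expands into four lookarounds, which gives a feel for why this substitution is not free.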

[*] Unless there were two default tables in the library, with and without
underscore.
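To make the two-tables idea concrete, here is a conceptual sketch in Python. It is only a model: it uses one flag byte per code point below 256, which is an assumption for illustration and does not reflect the real layout of PCRE2's character tables. All names here are hypothetical.

```python
def make_word_table(include_underscore=True):
    """Return a 256-byte table; entry is 1 if that code point is a word character."""
    table = bytearray(256)
    for cp in range(128):  # ASCII only, roughly what the C locale gives
        ch = chr(cp)
        if ch.isalnum() or (include_underscore and ch == "_"):
            table[cp] = 1
    return bytes(table)

# Two prebuilt tables, so nothing has to be packaged with a compiled pattern.
WORD_TABLE = make_word_table()
WORD_TABLE_NO_UNDERSCORE = make_word_table(include_underscore=False)

def is_word_boundary(subject, pos, table):
    """Model of \\b: true if exactly one neighbour of pos is a word character."""
    before = pos > 0 and table[subject[pos - 1]] == 1
    after = pos < len(subject) and table[subject[pos]] == 1
    return before != after

subject = b"foo_bar"  # hypothetical input; position 3 lies between "o" and "_"
default_result = is_word_boundary(subject, 3, WORD_TABLE)
no_underscore_result = is_word_boundary(subject, 3, WORD_TABLE_NO_UNDERSCORE)
differing = [cp for cp in range(256)
             if WORD_TABLE[cp] != WORD_TABLE_NO_UNDERSCORE[cp]]

print(default_result)        # False
print(no_underscore_result)  # True
print(differing)             # [95], the code point of "_"
```

The per-character test costs the same with either table; the matcher just needs to know which one to consult.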
