https://bugs.exim.org/show_bug.cgi?id=2301
Bug ID: 2301
Summary: Wish: (?w=[class]) modifier redefines word characters
for \w \W shorthand and \b \B anchors.
Product: PCRE
Version: N/A
Hardware: x86
OS: Windows
Status: NEW
Severity: wishlist
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: agvulp-forums@???
CC: pcre-dev@???
I wish I had requested this over a decade ago...
I'd like a modifier that lets you redefine the word characters recognized by
the shorthands '\w' and '\W' and by the anchors '\b' and '\B', so that these
simple and convenient constructs can be used even when [_] is not wanted as a
word character, or when other characters such as [-] or ['] or [0-9] are wanted
as word characters in your expression (or part of it).
Usage: (?w=[class])pattern(?-w)
Example: /(?w=[a-z])\bxyzzy\b/i
The class is case sensitive, except in the presence of (?i) or /i.
In the above example, because the whole pattern is case-insensitive, \b
recognizes word boundaries between [a-zA-Z] and [^a-zA-Z], but not inclusive of
[_] as it typically would.
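
For comparison, here is a rough sketch of how that first example can be
approximated today with lookarounds. It uses Python's re module rather than
PCRE, and custom_boundary is a hypothetical helper name, not part of any
library. It assumes the word class is a single-character class, since Python
lookbehinds must be fixed width.

```python
import re

def custom_boundary(word_class: str) -> str:
    """Hypothetical stand-in for \\b under a custom word class.

    A boundary is a position where exactly one neighbour is a word character.
    """
    w = word_class
    return rf"(?:(?<={w})(?!{w})|(?<!{w})(?={w}))"

B = custom_boundary("[a-z]")

# Rough equivalent of the proposed /(?w=[a-z])\bxyzzy\b/i
custom = re.compile(B + "xyzzy" + B, re.IGNORECASE)

# The stock \b treats "_" as a word character, so it finds no boundary here.
stock = re.compile(r"\bxyzzy\b", re.IGNORECASE)

text = "say_XYZZY_now"
print(bool(custom.search(text)))  # True
print(bool(stock.search(text)))   # False
```

The lookaround version works, but it is exactly the kind of noise the
proposed modifier would remove.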
Example: /(?w=[a-z])(\w+)/
Since the above pattern is case sensitive, the pattern will only capture a
series of lower case alpha characters, and again, exclusive of [_] that \w
would normally match.
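
A small illustration of the difference, again in Python's re as a stand-in
for PCRE. The explicit class [a-z] plays the role that \w would have under
the proposed (?w=[a-z]); the sample string is made up.

```python
import re

sample = "foo_bar Baz9"

# What /(?w=[a-z])(\w+)/ is meant to behave like: lower-case letters only.
redefined = re.findall(r"([a-z]+)", sample)

# The stock \w also accepts "_", digits and upper-case letters.
stock = re.findall(r"(\w+)", sample)

print(redefined)  # ['foo', 'bar', 'az']
print(stock)      # ['foo_bar', 'Baz9']
```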
People haven't been using [_] as a word character since the 1960s, so I look
forward to the excitement and revelry created by the implementation of my
suggestion. :D
Fingers crossed.
Addendum:
It may be possible to allow (?w=...) to accept not just a [character-class]
but a simple pattern as well, e.g. (?w=hello) or (?w=[a-zA-Z0-9]{3,}). I don't
know how crazy you can get with the substitutions. That's up to you!
An observation:
It would be entirely possible to redefine \w as [a-zA-Z0-9] even though it
would overlap \d. However, \d has no symbiotic relationship with \b and \B or
other such anchors, which is partly why we're doing this, so it doesn't much
matter, does it? The pattern \d+(\w+) would just mean that, as long as a
non-digit follows the leading digits, your captured word won't begin with
numbers, but may contain or end with them.
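
To make the overlap concrete, here is a quick check in Python's re, writing
the redefined \w out by hand as [a-zA-Z0-9]. The inputs are invented. Note
the all-digit case, where backtracking lets the capture start with a digit
after all.

```python
import re

# \d+(\w+) with \w spelled out as the redefined class [a-zA-Z0-9]
pattern = re.compile(r"\d+([a-zA-Z0-9]+)")

# Greedy \d+ takes all the leading digits, so the word starts at the letter.
mixed = pattern.search("123abc456").group(1)

# With nothing but digits, \d+ gives one back so the group can match.
digits_only = pattern.search("1234").group(1)

print(mixed)        # abc456
print(digits_only)  # 4
```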
Other thoughts?