[pcre-dev] [Bug 2301] New: Wish: (?w=[class]) modifier redef…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2301] New: Wish: (?w=[class]) modifier redefines word characters for \w \W shorthand and \b \B anchors.
https://bugs.exim.org/show_bug.cgi?id=2301

            Bug ID: 2301
           Summary: Wish: (?w=[class]) modifier redefines word characters
                    for \w \W shorthand and \b \B anchors.
           Product: PCRE
           Version: N/A
          Hardware: x86
                OS: Windows
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: agvulp-forums@???
                CC: pcre-dev@???


I wish I had requested this over a decade ago...

A modifier that allows you to redefine the word-characters recognized by the
shorthand '\w' and '\W' and by the anchors '\b' and '\B', so that these simple
and convenient constructs may be used even when [_] is undesired as a
word-character, or when other characters like [-] or ['] or [0-9] are desired
as word-characters in your expression (or partial expression).

Usage: (?w=[class])pattern(?-w)
Example: /(?w=[a-z])\bxyzzy\b/i

The class is case sensitive, except in the presence of (?i) or /i.

In the above example, because the whole pattern is case-insensitive, \b
recognizes word boundaries between [a-zA-Z] and [^a-zA-Z], and no longer
treats [_] as a word character as it typically would.
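
For comparison, the boundary behaviour described above can be approximated
today with lookarounds. The following is only a rough Python sketch, not how
PCRE would implement the modifier: (?w=...) does not exist, `boundary` is a
helper name made up for illustration, and it assumes the class matches exactly
one character (Python's lookbehind needs a fixed width).

```python
import re

def boundary(cls: str) -> str:
    """Build a lookaround that acts like \\b for a custom word class.

    Assumption: cls is a single-character class such as "[a-z]".
    """
    # A boundary is a word char followed by a non-word char, or the reverse.
    return rf"(?:(?<={cls})(?!{cls})|(?<!{cls})(?={cls}))"

WB = boundary("[a-z]")

# Rough equivalent of the proposed /(?w=[a-z])\bxyzzy\b/i
custom = re.compile(WB + "xyzzy" + WB, re.IGNORECASE)
# The built-in \b, where [_] counts as a word character
builtin = re.compile(r"\bxyzzy\b", re.IGNORECASE)

text = "foo_XYZZY_bar"
print(bool(custom.search(text)))   # True: "_" is not a word character here
print(bool(builtin.search(text)))  # False: "_" is a word character for \b
```

The sketch also shows why the wish exists: the lookaround form is a lot to
type compared with a plain \b.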

Example: /(?w=[a-z])(\w+)/

Since the above pattern is case sensitive, it will only capture a run of
lowercase alphabetic characters, again excluding the [_] that \w would
normally match.
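
The same kind of emulation works for the shorthand classes. A minimal Python
sketch, assuming the class is [a-z] and the pattern is case sensitive; the
variable names are illustrative only, and the negative-lookahead form for \W
is just one possible way to negate an arbitrary class.

```python
import re

WORD = "[a-z]"  # assumed stand-in for the class given to the proposed (?w=...)

# Rough equivalent of /(?w=[a-z])(\w+)/
word_run = re.compile(f"({WORD}+)")
# Rough equivalent of /(?w=[a-z])(\W+)/: any character the class rejects
non_word = re.compile(rf"((?:(?!{WORD})[\s\S])+)")
# The built-in \w, which also accepts [_], digits and upper case
builtin = re.compile(r"(\w+)")

text = "foo_Bar9"
print(word_run.search(text).group(1))  # foo
print(non_word.search(text).group(1))  # _B
print(builtin.search(text).group(1))   # foo_Bar9
```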

People haven't been using [_] as a word character since the 1960s, so I look
forward to the excitement and revelry created by the implementation of my
suggestion. :D

Fingers crossed.

Addendum:
It may be possible to allow (?w=...) to accept not just a [character-class]
but a simple pattern as well, e.g. (?w=hello) or (?w=[a-zA-Z0-9]{3,}). I
don't know how crazy you can get with the substitutions. That's up to you!

An observation:
It would be entirely possible to redefine \w as [a-zA-Z0-9] even though it
would overlap \d. However, \d doesn't have any symbiotic relationship with \b
and \B or other such anchors, which is partly why we're doing this, so it
doesn't much matter, does it? The pattern \d+(\w+) would just mean your
captured word usually won't begin with digits, since the greedy \d+ takes them
first, but it may contain or end with them.
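
To make the overlap concrete, here is a small Python sketch with \w written
out as [a-zA-Z0-9], which is an assumption standing in for the proposed
(?w=[a-zA-Z0-9]). It also shows one edge case: when the input is all digits,
\d+ backtracks by one character so that the group can still match.

```python
import re

# \d+(\w+) with \w spelled out as the redefined class [a-zA-Z0-9]
pattern = re.compile(r"\d+([a-zA-Z0-9]+)", re.ASCII)

# Greedy \d+ takes the leading digits, so the capture starts at a letter.
print(pattern.search("2024release7").group(1))  # release7

# All digits: \d+ gives back the last one so the group can match.
print(pattern.search("12345").group(1))         # 5
```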

Other thoughts?
