[pcre-dev] [Bug 891] Support [[:<:]] and [[:>:]] patterns

Author: Alan Lehotsky
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 891] Support [[:<:]] and [[:>:]] patterns
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=891




--- Comment #2 from Alan Lehotsky <alehotsky@???> 2009-09-23 18:34:34 ---
I had never heard of the syntax either (and agree that it's not really
needed for completeness). But one of my users ran across this.
If I get some free time, I'll try and implement it and contribute the
code. I did find a citation (below) to prior implementation.

Regards,
Al Lehotsky

From http://arglist.com/regex/regex7.html, purporting to be the man pages
for Spencer's BSD 4.4 regex.


There are two special cases of bracket expressions: the bracket
expressions `[[:<:]]' and `[[:>:]]' match the null string at the beginning
and end of a word respectively. A word is defined as a sequence of word
characters which is neither preceded nor followed by word characters. A
word character is an alnum character (as defined by ctype(3)) or an
underscore. This is an extension, compatible with but not specified by
POSIX 1003.2, and should be used with caution in software intended to be
portable to other systems.
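
To make the definition above concrete, here is a minimal Python sketch that lists the zero-width positions where a word starts or ends. It is not Spencer's implementation: the names is_word_char and word_boundaries are made up for illustration, and "alnum as defined by ctype(3)" is approximated as ASCII letters and digits (roughly the C locale), which is an assumption.

```python
def is_word_char(ch: str) -> bool:
    # Assumption: ASCII-only stand-in for ctype(3) isalnum(), plus underscore.
    return ch == "_" or (ch.isascii() and ch.isalnum())


def word_boundaries(text: str):
    """Return (starts, ends): offsets where a word begins or ends."""
    starts, ends = [], []
    for i in range(len(text) + 1):
        before = i > 0 and is_word_char(text[i - 1])
        after = i < len(text) and is_word_char(text[i])
        if after and not before:
            starts.append(i)   # where [[:<:]] would match the null string
        if before and not after:
            ends.append(i)     # where [[:>:]] would match the null string
    return starts, ends
```

For the string "foo bar_1!" this reports word starts at offsets 0 and 4 and word ends at offsets 3 and 9.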



Philip Hazel <ph10@???>
Sent by: admin@???
09/23/2009 08:45 AM
Please respond to: 891@???
To: alehotsky@???
cc:
Subject: [Bug 891] Support [[:<:]] and [[:>:]] patterns

------- You are receiving this mail because: -------
You reported the bug.

http://bugs.exim.org/show_bug.cgi?id=891




--- Comment #1 from Philip Hazel <ph10@???> 2009-09-23 13:45:47 ---
On Tue, 22 Sep 2009, Alan Lehotsky wrote:

> Apparently one or more implementations (including possibly Henry Spencer's
> UCB regex code) support these as synonyms for the beginning of a word and
> the end of a word respectively.
>
> It would be handy for compatibility to recognize these two also in PCRE.


Are you sure about that? The patterns [[:<:]] and [[:>:]] look like a
modification of the POSIX character class syntax - and a character class
always matches a character. What would be the meaning of [abc[:<:]def]
for example?

I did a google to try to find any documentation about this, and I
couldn't. What I did find was that several engines use \< and \> for
beginning and end of word. This is incompatible with Perl, and so could
not be added to PCRE. (In Perl, and PCRE, backslash followed by a non-
alphanumeric character always matches a literal character. That is a
nice, clean rule, and I would not want to violate it, even with a
special option.)

If you can point me at some documentation that specifies what [[:<:]]
and [[:>:]] actually mean in some other regex engine, I will think about
it. They are heckish long sequences, though admittedly the Perl and PCRE
equivalents take one or two more characters:

\b(?=\w)      start of word
\b(?<=\w)     end of word
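
As a quick check of the two lookaround forms above, here is a small sketch using Python's re module, which accepts the same \b, (?=...) and (?<=...) syntax. Using Python rather than PCRE itself is an assumption made for convenience, the helper name positions is hypothetical, and re.ASCII is set so that \w means only [A-Za-z0-9_].

```python
import re

# The two patterns from the message, compiled with ASCII-only \w and \b.
START_OF_WORD = re.compile(r"\b(?=\w)", re.ASCII)
END_OF_WORD = re.compile(r"\b(?<=\w)", re.ASCII)


def positions(pattern, text):
    """Offsets of every zero-width match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "foo bar_1!"
    print("starts:", positions(START_OF_WORD, sample))
    print("ends:  ", positions(END_OF_WORD, sample))
```

On "foo bar_1!" the start pattern matches at offsets 0 and 4 and the end pattern at offsets 3 and 9, which is what the word definition quoted from the regex man page would predict.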


Regards,
Philip


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email