[pcre-dev] [Bug 1883] Allow parsing of any context-sensitive…

Author: admin
Date:  
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1883] New: Allow parsing of any context-sensitive grammar by enabling non-atomic recursion
Subject: [pcre-dev] [Bug 1883] Allow parsing of any context-sensitive grammar by enabling non-atomic recursion
https://bugs.exim.org/show_bug.cgi?id=1883

Mehmet gelisin <mehmetgelisin@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mehmetgelisin@???


--- Comment #3 from Mehmet gelisin <mehmetgelisin@???> ---
Two decades would have been better :-) and maybe to the Perl folks...

You can exclude underscore from \w by writing [^\W_] but (a) that is obscure
and (b) it's a lot more characters. If you are calling PCRE from a program and
you are not using PCRE2_UCP, and you are interested only in characters with
codepoints less than 256, you can call pcre2_maketables() to make a set of
character tables, and then hack them, but you have to know what you are doing.
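
As a rough illustration of the [^\W_] idiom, here is a minimal sketch. It uses
Python's re module as a stand-in, not PCRE itself; the re.ASCII flag is there
only to approximate matching without PCRE2_UCP, and the names
WORD_NO_UNDERSCORE and split_words are made up for this example.

```python
import re

# [^\W_] reads as "not a non-word character and not an underscore",
# i.e. \w with the underscore taken out.
# re.ASCII keeps \w and \W restricted to ASCII (an approximation of
# running PCRE without PCRE2_UCP; this is Python's engine, not PCRE).
WORD_NO_UNDERSCORE = re.compile(r"[^\W_]+", re.ASCII)


def split_words(text):
    """Return runs of letters and digits, treating '_' as a separator."""
    return WORD_NO_UNDERSCORE.findall(text)


if __name__ == "__main__":
    print(split_words("foo_bar baz9"))
```

With plain \w+ the string "foo_bar" would come back as a single word; with
[^\W_]+ the underscore splits it.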

Having said that, thinking about this in general:

These days, we must think in terms of Unicode. Perl has added \b{wb} as a
Unicode word boundary, and even that seems to think in terms of Roman alphabet
languages, as it knows that an apostrophe can be in the middle of a word (for
example). I looked at this recently, but decided against implementing anything
at this time.
>
> Usage: (?w=[class])pattern(?-w)
> Example: /(?w=[a-z])\bxyzzy\b/i


This is an interesting proposal, and I suppose it won't be the first time PCRE
has added functionality unavailable in Perl. I have added it to the Wish List,
but do not hold your breath. It may never happen, and certainly not soon
because 10.32 is currently in release candidate testing, so no development is
happening.

> People haven't been using [_] as a word character since the 1960's,


Er, in programming languages they have, and still are.

> It may be possible to allow (?w=...) to accept more than just a
> [character-class] but perhaps a simple pattern as well. Ie: (?w=hello) or
> (?w=[a-zA-Z0-9]{3,}). I don't know how crazy you can get with the
> substitutions. That's up to you!


The speed implications are horrendous, not to mention the implementation
difficulties.

> It would be entirely possible to redefine \w as [a-zA-Z0-9] even though it
> would overlap \d.


No more than the existing \w does, as it is all characters that match alnum()
plus underscore.

I was wondering about simpler solutions. An option to remove underscore from
the default tables (PCRE2_WORD_NO_UNDERSCORE) could be invented, I suppose.
The fastest implementation would be to copy the tables and remove underscore,
but then the tables would have to be packaged with each compiled pattern.[*]
Ignoring underscore at runtime would incur a minor performance penalty, but
needs work in both interpretive matching functions plus work in the JIT. So
none of this is straightforward. Simply copying your class (and its negative)
into the pattern at the relevant points would also hit performance.
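
To make the "copy the tables and remove underscore" idea concrete, here is a
toy sketch. It does not reflect the real layout of PCRE2's character tables;
the 256-entry flag table and the names make_word_table and is_word_boundary
are assumptions made purely for illustration, and only ASCII letters and
digits are treated as word characters.

```python
def make_word_table(include_underscore=True):
    """Return 256 flags: table[c] is 1 if byte c counts as a word character.

    Toy model only: ASCII letters and digits, plus '_' when requested.
    """
    table = bytearray(256)
    for c in range(128):
        ch = chr(c)
        if ch.isalnum() or ch == "_":
            table[c] = 1
    if not include_underscore:
        # The "hack": clear the flag for underscore in the copied table.
        table[ord("_")] = 0
    return bytes(table)


# Two prebuilt tables, with and without underscore.
WORD_TABLE = make_word_table()
WORD_TABLE_NO_UNDERSCORE = make_word_table(include_underscore=False)


def is_word_boundary(subject, pos, table):
    """Report whether position pos in a bytes subject is a \\b-style boundary."""
    before = pos > 0 and table[subject[pos - 1]] == 1
    after = pos < len(subject) and table[subject[pos]] == 1
    return before != after


if __name__ == "__main__":
    print(is_word_boundary(b"a_b", 1, WORD_TABLE))
    print(is_word_boundary(b"a_b", 1, WORD_TABLE_NO_UNDERSCORE))
```

The lookup cost at match time is the same for either table; the difference is
only in which table the matcher is handed.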

[*] Unless there were two default tables in the library, with and without
underscore.
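
For the last option, copying the class and its negative into the pattern, a
user can already approximate the proposed /(?w=[a-z])\bxyzzy\b/i by hand with
lookarounds. The sketch below again uses Python's re module rather than PCRE,
and custom_boundary and compile_with_word_class are hypothetical helper names.
The textual substitution is deliberately naive: it would also rewrite a \b
that appears inside a character class or after an escaped backslash.

```python
import re


def custom_boundary(word_class):
    """Build a \\b replacement for a caller-supplied word class.

    A boundary is a position with a word character on exactly one side.
    word_class must match exactly one character, e.g. "[a-z]".
    """
    w = word_class
    return rf"(?:(?<={w})(?!{w})|(?<!{w})(?={w}))"


def compile_with_word_class(pattern, word_class, flags=0):
    """Compile pattern with every literal \\b swapped for the custom boundary.

    Naive text replacement, for illustration only.
    """
    expanded = pattern.replace(r"\b", custom_boundary(word_class))
    return re.compile(expanded, flags)


if __name__ == "__main__":
    rx = compile_with_word_class(r"\bxyzzy\b", "[a-z]", re.IGNORECASE)
    print(bool(rx.search("say_xyzzy_now")))
    print(bool(re.search(r"\bxyzzy\b", "say_xyzzy_now", re.IGNORECASE)))
```

Each \b turns into four lookaround assertions, which is the kind of extra
matching work referred to above.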


--
You are receiving this mail because:
You are on the CC list for the bug.