Re: [pcre-dev] [Bug 1099] New: Ability to reference 'variabl…

Author: Philip Hazel
Date:  
To: 1099
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1099] New: Ability to reference 'variables' from regexp
On Wed, 23 Mar 2011, Pavel Kostromitinov wrote:

> In Perl, one can use $var in regexp and usually the variable reference
> will be expanded.


The variable is interpolated *before* the string is interpreted as a
regex. Consider this example:

$x = "\\d+";
print (("abc" =~ /$x/)? "abc: yes\n" : "abc: no\n");
print (("123" =~ /$x/)? "123: yes\n" : "123: no\n");

The output is:

abc: no
123: yes

I am not familiar with the code of Perl, but I would be surprised if it
was not implemented exactly as one might expect: first the variables are
interpolated into the string and then the string is interpreted as a
regex.

> It would be very helpful in some situations to allow PCRE to reference
> such external variables too. The way I see it, without (seemingly)
> breaking backward-compatibility, is to make some way to set values of
> 'named subpatterns' before pcre_exec(), so they can be referenced
> using existing \k{name} syntax. Surely encountering subpattern with
> same name inside of pattern would take precedence.


It would have to be pcre_compile(), not pcre_exec(), for a start. The
obvious implementation is to do what I think Perl does: first
interpolate the variables and then call pcre_compile(). However, I do
not have any plans to add this kind of functionality.
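
For illustration only, the interpolate-then-compile approach could be
sketched in a caller-side wrapper along the following lines. Nothing
here is part of PCRE: the $name syntax, struct var, and expand_vars()
are hypothetical names chosen for this sketch, and values are inserted
as pattern text (Perl-style), not as literals.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variable table entry; not part of PCRE. */
struct var {
    const char *name;   /* variable name, without the leading '$' */
    const char *value;  /* zero-terminated text inserted as pattern text */
};

/* Find the value for a name of the given length, or NULL if undefined. */
const char *lookup_var(const struct var *vars, size_t nvars,
                       const char *name, size_t len)
{
    for (size_t i = 0; i < nvars; i++)
        if (strlen(vars[i].name) == len &&
            memcmp(vars[i].name, name, len) == 0)
            return vars[i].value;
    return NULL;
}

/* Expand $name references in src into dst (capacity cap, including the
   terminating zero). An undefined name, or a '$' that is not followed by
   a name, is copied unchanged, so a trailing '$' anchor survives.
   Returns 0 on success, -1 if dst is too small. */
int expand_vars(const char *src, const struct var *vars, size_t nvars,
                char *dst, size_t cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return -1;

    while (*src != '\0') {
        const char *val = NULL;
        size_t n = 0;

        if (*src == '$') {
            /* Measure the identifier that follows the '$'. */
            while (isalnum((unsigned char)src[1 + n]) || src[1 + n] == '_')
                n++;
            if (n > 0)
                val = lookup_var(vars, nvars, src + 1, n);
        }

        if (val != NULL) {
            size_t vlen = strlen(val);
            if (out + vlen >= cap)
                return -1;
            memcpy(dst + out, val, vlen);
            out += vlen;
            src += 1 + n;       /* skip '$' and the name */
        } else {
            if (out + 1 >= cap)
                return -1;
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
    return 0;
}
```

With x defined as \d+, expanding the text ^$x$ gives ^\d+$, and it is
that expanded text which the caller would then pass to pcre_compile().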

The main objection I have to doing anything like this is that PCRE is
not a string-manipulating library. It does not change strings. It
provides just a regex-matching facility. Your suggestion is just one of
many "add-on" features that people might like. Another, which has been
suggested before, is a "replace" function. And no doubt there are
others. In my opinion, if these kinds of function are generally wanted,
somebody should design and implement a general-purpose string
manipulation library, which of course could make use of PCRE for pattern
matching.

Issues that immediately spring to mind are: How are the variable
contents coded? Zero-terminated or by length? If the latter, is a binary
zero value allowed as part of the string? (If yes, for PCRE it has to be
turned into \0.) Should variables be interpolated as in Perl, or as
literals, or should there be an option? How to deal with UTF-8 or not
UTF-8? How to handle feedback data for compiling errors? The
pcre_compile() function gives an offset in the string it is compiling,
but the caller of a wrapper function that interpolated variables would
need an offset into the original string.
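
To make two of these issues concrete, here is a rough sketch of what a
wrapper might need. It is not a proposed PCRE API: quote_literal(),
struct subst, and map_offset() are hypothetical names. The first
function quotes a length-coded value for literal use, writing a binary
zero as \000 (three octal digits, so that a following digit in the
value cannot be absorbed into the escape). The second maps an offset in
the expanded pattern back to the original text, assuming the wrapper
recorded each substitution as it made it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Quote a length-coded value so that it matches literally. ASCII
   characters that are not letters or digits get a backslash, a binary
   zero becomes \000, and bytes >= 0x80 are copied unchanged.
   Returns the length written (excluding the terminating zero), or
   (size_t)-1 if dst (capacity cap) is too small. */
size_t quote_literal(const char *val, size_t len, char *dst, size_t cap)
{
    size_t out = 0;

    assert(val != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)val[i];

        if (c == 0) {
            if (out + 4 >= cap)
                return (size_t)-1;
            memcpy(dst + out, "\\000", 4);
            out += 4;
        } else if (c < 0x80 && !isalnum(c)) {
            if (out + 2 >= cap)
                return (size_t)-1;
            dst[out++] = '\\';
            dst[out++] = (char)c;
        } else {
            if (out + 1 >= cap)
                return (size_t)-1;
            dst[out++] = (char)c;
        }
    }
    if (out >= cap)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}

/* One substitution made during expansion (hypothetical bookkeeping). */
struct subst {
    size_t src_start;  /* offset of the reference in the original text */
    size_t src_len;    /* length of the reference, e.g. 2 for "$x" */
    size_t dst_start;  /* offset of the inserted value in the expansion */
    size_t dst_len;    /* length of the inserted value */
};

/* Map an offset in the expanded pattern back to the original pattern.
   subs must be in order of increasing dst_start. An offset that falls
   inside an inserted value maps to the start of its reference. */
size_t map_offset(const struct subst *subs, size_t nsubs, size_t off)
{
    size_t result = off;  /* text before the first substitution is 1:1 */

    for (size_t i = 0; i < nsubs; i++) {
        size_t dst_end = subs[i].dst_start + subs[i].dst_len;

        if (off < subs[i].dst_start)
            break;
        if (off < dst_end)
            return subs[i].src_start;
        /* Past this substitution: count on from the end of the reference. */
        result = subs[i].src_start + subs[i].src_len + (off - dst_end);
    }
    return result;
}
```

For example, if a$x+b( is expanded with x set to [0-9], the result is
a[0-9]+b( and the single substitution record is {1, 2, 1, 5}. An error
offset of 9 (the end of the expanded text) then maps back to 6, the end
of the original text, and an offset of 3, inside the inserted value,
maps back to 1, the position of $x.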

My feeling is that a substantial design effort is needed to come up with
an API that is sufficiently general as to be widely useful. I am not
myself planning on doing anything about this.

Philip

--
Philip Hazel