Re: [pcre-dev] Remove some restrictions of lookbehind assert…

Author: ph10
Date:  
To: Zoltán Herczeg
CC: Pcre-dev@exim.org, ND
Subject: Re: [pcre-dev] Remove some restrictions of lookbehind assertions
On Mon, 29 Jul 2019, Zoltán Herczeg wrote:

> > Maybe it is not quite effective and still has restrictions, but it is useful.
> > Is it simple to add such functionality?
>
> Definitely not easy in JIT.


Not easy in the interpreter either.

> I have an alternative solution which might be able to solve many of
> the issues raised here. We already have (*SKIP:name), so we need to
> record specific string positions in the input, and since these are
> mark positions, they even have a name.
>
> I am open to other names, but I would propose the following control verbs:
>
> (*MOVE:mark_name)
>   - This verb changes the current string position to the position recorded by the last mark whose name is mark_name.


Note that this new verb can also be used instead of non-atomic
lookaheads. I know that Zoltán doesn't like (*napla: and (*naplb:
because (a) they will be hard to implement in JIT and (b) they aren't
really "assertions", but more like "groups with position reset".
Consider the example that is currently implemented:

/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/

This could be rewritten like this:

/\A(*MARK:X)(?:.*\b(\w++))(*MOVE:X)(?>.*?\b\1\b){3}/

The difference is that the (?:.*\b(\w++)) group is now a regular group,
not an assertion. This makes a difference because items like (*ACCEPT)
are treated differently in assertions. I have been looking at some of
these and realized that the current situation in non-atomic assertions
is inconsistent in some ways and hard to explain.
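
To make the intended behaviour concrete, here is a rough Python sketch of what both patterns above are meant to compute: the right-most word that occurs at least three times in the subject. It is only an emulation under stated assumptions, not PCRE2 code: the function name is made up for illustration, the subject is assumed to be a single line (so that "." can cross everything), and the backtracking into the first group is modelled by simply trying candidate words from right to left.

```python
import re


def rightmost_word_repeated(subject, times=3):
    """Emulate /\\A(*MARK:X)(?:.*\\b(\\w++))(*MOVE:X)(?>.*?\\b\\1\\b){3}/.

    Hypothetical helper for illustration only; (*MOVE) is a proposed
    verb, so this models its intended effect rather than calling it.
    """
    mark_x = 0  # (*MARK:X) directly after \A records the subject start

    # .*\b(\w++) with backtracking into .* tries whole words, right-most first.
    words = list(re.finditer(r"\b\w+", subject))
    for candidate in reversed(words):
        word = candidate.group(0)
        occurrence = re.compile(r"\b" + re.escape(word) + r"\b")

        pos = mark_x  # (*MOVE:X): jump back to the recorded position
        found_all = True
        for _ in range(times):
            # (?>.*?\b\1\b): lazily find the next whole-word occurrence
            hit = occurrence.search(subject, pos)
            if hit is None:
                found_all = False
                break
            pos = hit.end()

        if found_all:
            return word

    return None  # no word occurs often enough: the overall match fails
```

For example, rightmost_word_repeated("b b b a a") skips "a" (only two occurrences) and returns "b".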

I think *MOVE would be relatively easy to implement in the
interpreter, and I am prepared to withdraw *napla and *naplb because
they haven't yet been released. I would propose first implementing *MOVE
and playing with it, and then removing the non-atomic lookarounds.

> (*SETEND:mark_name)
>   - This verb changes the end position to the position recorded by the last mark whose name is mark_name. If the position is smaller than the current string position, it is set to the current string position.


By "end position" do you mean "end of subject"? I'm misunderstanding
something here because won't a MARK name usually be earlier than the
current position? Or do you envisage using this in some kind of loop? In
the interpreter, this will be easy to implement only if the MARK is
earlier on the matching path. Oh, are you thinking of something like
this?

(*MARK:A)<stuff>(*MARK:B)<stuff>(*MOVE:A)(*SETEND:B)<more stuff>

That would be straightforward in the interpreter, I think.
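
As a way of checking my understanding, here is a minimal Python sketch of the bookkeeping this would need, following the descriptions quoted above. Everything in it is hypothetical: the class, its method names, and the use of a fixed character count to stand in for <stuff> are assumptions for illustration, and it ignores backtracking entirely.

```python
class MatchState:
    """Hypothetical matcher state: current position, end position, named marks."""

    def __init__(self, subject):
        self.subject = subject
        self.pos = 0               # current string position
        self.end = len(subject)    # end position, initially end of subject
        self.marks = {}            # mark name -> recorded position

    def advance(self, count):
        # Stand-in for <stuff>: consume `count` characters.
        if self.pos + count > self.end:
            raise ValueError("cannot match past the end position")
        self.pos += count

    def mark(self, name):
        # (*MARK:name): record the current position under this name.
        self.marks[name] = self.pos

    def move(self, name):
        # (*MOVE:name): set the current position to the recorded one.
        self.pos = self.marks[name]

    def setend(self, name):
        # (*SETEND:name): set the end position to the recorded one, but
        # never before the current position, as in the quoted proposal.
        self.end = max(self.marks[name], self.pos)

    def remaining(self):
        # The text that <more stuff> is still allowed to see.
        return self.subject[self.pos:self.end]


def demo(subject="abcdefghij"):
    # (*MARK:A)<stuff>(*MARK:B)<stuff>(*MOVE:A)(*SETEND:B)
    state = MatchState(subject)
    state.mark("A")
    state.advance(3)
    state.mark("B")
    state.advance(4)
    state.move("A")
    state.setend("B")
    return state
```

With the default subject, demo() leaves the current position at A and the end position at B, so <more stuff> would see only the text between the two marks. Calling setend with a mark that lies before the current position just clamps the end to the current position.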

Philip

--
Philip Hazel