Re: [pcre-dev] Return last bumpalong offset in partial_hard …

Author: ph10
Date:
To: Zoltán Herczeg
CC: pcre-dev
Subject: Re: [pcre-dev] Return last bumpalong offset in partial_hard matching
On Sun, 17 Feb 2013, Zoltán Herczeg wrote:

> Just my two cents: we made incompatible changes before, such as removing
> pcre_info, or disallowing 0xD800-0xDFFF range in UTF character sets.
> After these changes people needed to fix their software (Apache or
> PHP), and they did it and moved on. We never actually removed a
> feature, just reworked it.


That is of course true.

> The original behavior was always odd to me, because it depends on the
> current subject. I feel that is not exactly consistent. After this
> change the matching code will be less complex (a little faster and
> more maintainable), and I think the use cases will not even change
> because ovector[0]-max_lookbehind character must be kept even now.


I started running tests. Consider this:

/(?<=123)abc/
Capturing subpattern count = 0
No options
First char = 'a'
Need char = 'c'
Max lookbehind = 3
    xxxx123a\P\P
Partial match: 123a


The returned offset is 4, and pcretest uses this value to show the exact
string that was partially matched. I can see that having only that
value plus the max lookbehind of 3 means that the application will
compute the offset of the characters to save as 4-3 = 1, when in fact
the optimum value is 4. Providing the bumpalong offset of 7 makes it
possible to compute 7-3 = 4.
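
To make the arithmetic concrete, here is a minimal sketch in C of the
computation an application would perform. The helper name retain_from is
hypothetical and not part of the PCRE API, and the sketch assumes one code
unit per character (no UTF mode).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Given an offset reported for a partial match and the pattern's max
 * lookbehind, return the offset of the first subject character the caller
 * should keep for the next segment.
 * Assumption: one code unit per character (no UTF mode). */
static size_t retain_from(size_t offset, size_t max_lookbehind)
{
    /* Clamp at 0 so we never step back past the start of the subject. */
    return offset > max_lookbehind ? offset - max_lookbehind : 0;
}
```

With the values from the example, retain_from(4, 3) gives 1 (the start of
the inspected string minus the lookbehind), while retain_from(7, 3) gives 4
(the bumpalong offset minus the lookbehind).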

So, yes, OK, it's a good idea. However, if we take away the current
data, it becomes impossible for pcretest to show exactly which character
string was matched. It could compute 4 in the above case, so the result
would not change, but consider a different pattern:

/ab(?<=123ab)c/
Capturing subpattern count = 0
No options
First char = 'a'
Need char = 'c'
Max lookbehind = 5
    xxxx123a\P\P
Partial match: a


"Partial match: a" shows that only the single character "a" was
inspected for this partial match. Using the bumpalong offset of 7 and
the max lookbehind of 5 would show "xx123a", which is definitely
confusing, in my opinion, though those would be the characters you have
to retain for the next match.
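
The difference between the two strings can be sketched in C as follows. The
function retained_tail is a hypothetical illustration, not something PCRE or
pcretest provides; it assumes single-byte characters and that the bumpalong
offset does not exceed the subject length.

```c
#include <stddef.h>

/* Hypothetical illustration, not part of PCRE or pcretest.
 * Returns a pointer to the part of the subject that must be retained for
 * the next match: everything from (bumpalong - max_lookbehind) onwards.
 * Assumptions: single-byte characters, bumpalong <= length of subject. */
static const char *retained_tail(const char *subject, size_t bumpalong,
                                 size_t max_lookbehind)
{
    size_t keep = bumpalong > max_lookbehind ? bumpalong - max_lookbehind : 0;
    return subject + keep;
}
```

For the subject "xxxx123a" with a bumpalong offset of 7 and a max lookbehind
of 5, this points at "xx123a", whereas the text actually inspected by the
partial match starts at offset 7 and is just "a".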

Hmm. I am not sure what I think now...

Philip

--
Philip Hazel