https://bugs.exim.org/show_bug.cgi?id=2495
Bug ID: 2495
Summary: Captures made in a lookaround inside a loop are not rolled back when backtracking
Product: PCRE
Version: 8.43
Hardware: All
OS: All
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: davidell@???
CC: pcre-dev@???
When a capture group makes captures inside of a lookaround in a looped group,
and then the looped group backtracks due to subsequent conditions outside of
itself not matching, the content of the capture group is not rolled back to the
state it had on that earlier iteration of the loop. This bug is fixed in PCRE2.
Here is a minimal demonstration of this bug, with lookaheads and lookbehinds:
$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:(?=(.)).)*_\1'
a_bb
$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:.(?<=(.)))*_\1'
a_bb
$ (echo 'a_ab'; echo 'a_bb') | pcre2grep '^(?:(?=(.)).)*_\1'
a_ab
$ (echo 'a_ab'; echo 'a_bb') | pcre2grep '^(?:.(?<=(.)))*_\1'
a_ab
The loop initially consumes all characters, ending by consuming and capturing
"b". It then fails to match "_", and backtracks until it does. Each time it
backtracks a character, the content of \1 should be rolled back as well, from
"b" to "a" to "_" to "a", after which the match of "_" succeeds and \1 should
contain "a" (and does, in PCRE2). But in PCRE, it still contains "b", the last
thing it contained before backtracking.
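The backtracking sequence described above can be sketched as a small
hand-written model. This is not PCRE code and does not call any regex engine;
the function name and the "rollback" flag are hypothetical, and the
rollback=False branch simply encodes the PCRE 8.43 behavior as described in
this report (group 1 keeps the value from the longest iteration):

```python
def loop_then_underscore_backref(subject, rollback=True):
    # Model of ^(?:(?=(.)).)*_\1 applied to a single line of input.
    # rollback=True : group 1 is restored on each backtrack (expected behavior).
    # rollback=False: group 1 keeps its last captured value (behavior reported
    #                 for PCRE 8.43); an assumption taken from the report.
    n = len(subject)
    # The greedy loop first consumes every character, then gives them back
    # one at a time, so try iteration counts from n down to 0.
    for kept in range(n, -1, -1):
        if kept == 0:
            group1 = None               # zero iterations: group 1 is unset
        elif rollback:
            group1 = subject[kept - 1]  # capture made by iteration `kept`
        else:
            group1 = subject[n - 1]     # stale capture from the last iteration
        # Tail of the pattern: a literal "_" followed by \1.
        if subject[kept:kept + 1] != "_":
            continue
        if group1 is not None and subject[kept + 1:kept + 2] == group1:
            return True
    return False


for line in ("a_ab", "a_bb"):
    print(line,
          "rollback:", loop_then_underscore_backref(line, rollback=True),
          "no rollback:", loop_then_underscore_backref(line, rollback=False))
```

With rollback the model accepts only "a_ab", and without it only "a_bb",
which agrees with the pcre2grep and pcregrep output shown earlier.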
I apologize for not reporting this bug earlier. I found it back in 2014:
https://gist.github.com/Davidebyzero/9090628#gistcomment-1187218
Then teukon described a small example in which the bug causes the wrong result
(which can now be seen to have been fixed in PCRE2):
https://gist.github.com/Davidebyzero/9090628#gistcomment-1188164
$ echo 'x'|pcregrep '^(?=((?=(x*))x)+)\2$'
$ echo 'x'|pcre2grep '^(?=((?=(x*))x)+)\2$'
x
It's not merely the atomicity of the lookahead that triggers this bug; it
doesn't happen with capturing in an atomic group:
$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:(?>(.)))*_\1'
a_ab
The only situation in which the capture actually is rolled back is if the loop
backtracks all the way to zero iterations (rolling the capture back to being
unset):
$ echo 'a' | pcregrep '^(?:(?=(.)).)*^(?(1)(?!))'
a
But if it only backtracks to the first iteration, the bug still happens:
$ (echo 'aab'; echo 'abb') | pcregrep '^(?:(?=(.)).)*(?<=^.)\1'
abb
$ (echo 'aab'; echo 'abb') | pcre2grep '^(?:(?=(.)).)*(?<=^.)\1'
aab
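The difference between backtracking to zero iterations and backtracking to
the first iteration can be sketched the same way. Again this is only a
hand-written model under the assumptions stated in this report, with
hypothetical function names; it does not run PCRE:

```python
def group1_after_backtracking(subject, kept, rollback):
    # Content of group 1 once the loop has backtracked to `kept` iterations.
    if kept == 0:
        return None              # unset in both behaviors, per the report
    if rollback:
        return subject[kept - 1]  # expected: capture of iteration `kept`
    return subject[-1]            # reported PCRE 8.43 behavior: stale capture


def first_iteration_probe(subject, rollback):
    # Model of ^(?:(?=(.)).)*(?<=^.)\1 on a single line of input.
    # (?<=^.) only succeeds at offset 1, so the loop must end up keeping
    # exactly one iteration; \1 must then match the character at offset 1.
    if len(subject) < 2:
        return False
    return subject[1] == group1_after_backtracking(subject, 1, rollback)


for line in ("aab", "abb"):
    print(line,
          "rollback:", first_iteration_probe(line, rollback=True),
          "no rollback:", first_iteration_probe(line, rollback=False))
```

With rollback the model accepts only "aab", and without it only "abb", the
same split as between pcre2grep and pcregrep above.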
This is an annoying bug, resulting in undesired and unexpected behavior, and I
can't think of any useful way to exploit it to do something PCRE could
otherwise not do. On the contrary, I think it actually prevents some things
in PCRE that would otherwise be possible.