Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2495] New: Captures made in a lookaround inside a loop are not rolled back when backtracking
https://bugs.exim.org/show_bug.cgi?id=2495

            Bug ID: 2495
           Summary: Captures made in a lookaround inside a loop are not
                    rolled back when backtracking
           Product: PCRE
           Version: 8.43
          Hardware: All
                OS: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: davidell@???
                CC: pcre-dev@???


When a capture group sits inside a lookaround within a repeated group, it
captures on every iteration of the loop. If the loop then backtracks because
something after it fails to match, the content of the capture group is not
rolled back to the value it had on the earlier iteration. This bug is fixed in
PCRE2.

Here is a minimal demonstration of this bug, once with a lookahead and once
with a lookbehind, run under both pcregrep and pcre2grep:

$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:(?=(.)).)*_\1'
a_bb
$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:.(?<=(.)))*_\1'
a_bb
$ (echo 'a_ab'; echo 'a_bb') | pcre2grep '^(?:(?=(.)).)*_\1'
a_ab
$ (echo 'a_ab'; echo 'a_bb') | pcre2grep '^(?:.(?<=(.)))*_\1'
a_ab

The loop initially consumes all four characters, ending by consuming and
capturing "b". It then fails to match "_", and backtracks one iteration at a
time until it can. Each time it gives back a character, the content of \1
should be rolled back as well, from "b" to "a" to "_" to "a", after which the
match of "_" succeeds and \1 should contain "a" (and does, in PCRE2). But in
PCRE, it still contains "b", the last thing it captured before backtracking.
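
The backtracking walk described above can be sketched without any regex
engine. The following Python is only a hand-written model of the pattern
'^(?:(?=(.)).)*_\1', not PCRE's actual implementation. The function name and
the "rollback" flag are hypothetical, and it assumes the subject contains no
newline (so the greedy loop first consumes every character) and that a
backreference to an unset group fails to match.

```python
def model_match(subject, rollback):
    """Model '^(?:(?=(.)).)*_\\1' and return the final \\1, or None if no match.

    rollback=True : \\1 is restored on each backtrack (the behavior the
                    report expects, and observes in PCRE2).
    rollback=False: \\1 keeps the last value captured (the behavior the
                    report observes in PCRE 8.43).
    """
    n = len(subject)
    # Assumption: no newline in subject, so the greedy loop runs n times
    # and the last capture it makes is the final character.
    last_capture = subject[n - 1] if n else None

    # Backtrack from n iterations down to zero iterations.
    for k in range(n, -1, -1):
        if k == 0:
            group1 = None                  # zero iterations: \1 is unset
        elif rollback:
            group1 = subject[k - 1]        # capture made on iteration k
        else:
            group1 = last_capture          # stale capture is never undone
        if group1 is None:
            continue                       # assumed: unset backreference fails
        # After k iterations the pattern needs "_" and then the text of \1.
        if subject[k:k + 1] == "_" and subject[k + 1:k + 2] == group1:
            return group1
    return None


for line in ("a_ab", "a_bb"):
    print(line, model_match(line, rollback=True), model_match(line, rollback=False))
```

With rollback the model accepts only "a_ab", and without it only "a_bb",
which lines up with the pcre2grep and pcregrep outputs shown earlier.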

I apologize for not reporting this bug earlier. I found it back in 2014:
https://gist.github.com/Davidebyzero/9090628#gistcomment-1187218

Then teukon described a small example in which the bug causes the wrong result
(which can now be seen to have been fixed in PCRE2):
https://gist.github.com/Davidebyzero/9090628#gistcomment-1188164

$ echo 'x'|pcregrep '^(?=((?=(x*))x)+)\2$'
$ echo 'x'|pcre2grep '^(?=((?=(x*))x)+)\2$'
x

It's not merely the atomicity of the lookahead that triggers this bug; it
doesn't happen with capturing in an atomic group:

$ (echo 'a_ab'; echo 'a_bb') | pcregrep '^(?:(?>(.)))*_\1'
a_ab

The only situation in which the capture actually is rolled back is if the loop
backtracks all the way to zero iterations (rolling the capture back to being
unset):

$ echo 'a' | pcregrep '^(?:(?=(.)).)*^(?(1)(?!))'
a

But if it only backtracks to the first iteration, the bug still happens:

$ (echo 'aab'; echo 'abb') | pcregrep '^(?:(?=(.)).)*(?<=^.)\1'
abb
$ (echo 'aab'; echo 'abb') | pcre2grep '^(?:(?=(.)).)*(?<=^.)\1'
aab

This is an annoying bug, resulting in undesired and unexpected behavior, and I
can't think of any useful way to exploit it to do something PCRE could
otherwise not do. On the contrary, I think it actually prevents some things
in PCRE that would otherwise be possible.
