Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2531] New: Regression: Lookahead inside lookbehind, after a define
https://bugs.exim.org/show_bug.cgi?id=2531

            Bug ID: 2531
           Summary: Regression: Lookahead inside lookbehind, after a
                    define
           Product: PCRE
           Version: 10.34 (PCRE2)
          Hardware: x86
                OS: Linux
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: egmont@???
                CC: pcre-dev@???


Hi,

10.34 (r1133) introduced a regression. It shows up in GNOME Terminal's fairly
complex regexes for locating URLs in the text flow, as reported at
https://gitlab.gnome.org/GNOME/gnome-terminal/issues/221

While r1133's commit message says "Fix lookbehind within lookahead within
lookbehind", this regression involves solely "lookahead within lookbehind";
well, combined with a define.

Also r1212 is similar, but here the define is not within but in front of the
lookbehind.

Regex definitions:
https://gitlab.gnome.org/GNOME/gnome-terminal/-/blob/3.34.2/src/terminal-regex.h
Unittests:
https://gitlab.gnome.org/GNOME/gnome-terminal/-/blob/3.34.2/src/terminal-regex.c

---

We define HOSTNAMESEGMENTCHARS_CLASS to match exactly one character, which has
to be alphanumeric or the dash character, or any non-ASCII graphic Unicode
character. In other words, out of all the graphic Unicode characters, only
ASCII punctuation and symbols other than the dash are excluded.

Since (according to my latest info from a couple of years ago) PCRE2 doesn't
support class subtraction, we went with the workaround suggested at
http://rexegg.com/regex-class-operations.html#subtraction and implemented this
using a one-character negative lookahead:

#define HOSTNAMESEGMENTCHARS_CLASS "(?x: [-[:alnum:]] | (?! [[:ascii:]] ) [[:graph:]] )"

We use that later in REGEX_URL_HTTP, but I'm simplifying it here.

We want to match a pattern only where its preceding character is not of this
class. We do this with a negative lookbehind (whose body contains the negative
lookahead from the class definition). The pattern is

"(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word"

which should (and does) match the standalone word "word".

However, our code is a tiny bit more complicated: the pattern begins with a
regex subroutine definition, like this:

"(?(DEFINE)(?<foo>bar))(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word"

In this example the definition of "foo" isn't used anywhere, so defining it
shouldn't have an effect on the rest. It does, however; as of 10.34 this
pattern no longer matches the word "word".
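
For reference, the C sketch below spells out how the string literals above
concatenate into the two patterns under discussion. The names PATTERN_PLAIN,
PATTERN_WITH_DEFINE and print_patterns are made up for this illustration and
do not appear in gnome-terminal or VTE. The sketch only assembles the strings;
it does not call PCRE2.

```c
#include <stdio.h>

/* Class from the report: one alphanumeric or dash, or any graphic
   character that is not ASCII (subtraction emulated by a lookahead). */
#define HOSTNAMESEGMENTCHARS_CLASS \
    "(?x: [-[:alnum:]] | (?! [[:ascii:]] ) [[:graph:]] )"

/* Hypothetical name: the pattern that matches "word" as expected. */
#define PATTERN_PLAIN "(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word"

/* Hypothetical name: the same pattern preceded by an unused subroutine
   definition; reported above as no longer matching "word" with 10.34. */
#define PATTERN_WITH_DEFINE "(?(DEFINE)(?<foo>bar))" PATTERN_PLAIN

/* Hypothetical helper: writes both assembled patterns, one per line. */
void print_patterns(FILE *out)
{
    fprintf(out, "%s\n%s\n", PATTERN_PLAIN, PATTERN_WITH_DEFINE);
}
```

Calling print_patterns(stdout) from a small main() gives the exact strings
that can be fed to whatever PCRE2 front end is at hand, with the subject
"word".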

---

Note: gnome-terminal itself, as per the above linked files, uses GRegex instead
of PCRE2. In order to run our tests with PCRE2, you could use a
work-in-progress branch of VTE from
https://gitlab.gnome.org/GNOME/vte/-/tree/wip/regex-builtins where the regex
definitions are in src/regex-builtins-patterns.hh, and the unit tests are in
src/regex-test.cc. In the latter, at around line 460 you will find the tests
that are skipped for 10.34, but should pass.

You can compile with "meson" and then "ninja" and run the tests with "ninja
test". In src/meson.build, remove the 'parser' and 'reaper' entries from
test_units[] to speed it up. Or I'm happy to test any candidate patch.
