https://bugs.exim.org/show_bug.cgi?id=2531
Bug ID: 2531
Summary: Regression: Lookahead inside lookbehind, after a define
Product: PCRE
Version: 10.34 (PCRE2)
Hardware: x86
OS: Linux
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: egmont@???
CC: pcre-dev@???
Hi,
10.34 (r1133) introduced a regression. It shows up in GNOME Terminal's fairly
complex regexes for locating URLs in the text flow, as reported at
https://gitlab.gnome.org/GNOME/gnome-terminal/issues/221
While r1133's commit message says "Fix lookbehind within lookahead within
lookbehind", this regression involves solely "lookahead within lookbehind";
well, combined with a define.
Also r1212 is similar, but here the define is not within but in front of the
lookbehind.
Regex definitions:
https://gitlab.gnome.org/GNOME/gnome-terminal/-/blob/3.34.2/src/terminal-regex.h
Unittests:
https://gitlab.gnome.org/GNOME/gnome-terminal/-/blob/3.34.2/src/terminal-regex.c
---
We define HOSTNAMESEGMENTCHARS_CLASS to match exactly 1 character, which has to
be either alphanumeric or the dash character, or any non-ASCII graphic Unicode
character. In other words, out of all the graphic Unicode characters, only some
ASCII punctuation and symbols are excluded.
Since (according to my latest info from a couple of years ago) pcre2 doesn't
support class subtraction, we went with the workaround suggested at
http://rexegg.com/regex-class-operations.html#subtraction and implemented this
using a one-character negative lookahead:
#define HOSTNAMESEGMENTCHARS_CLASS "(?x: [-[:alnum:]] | (?! [[:ascii:]] ) [[:graph:]] )"
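The alternation can be read as a set operation: accept a character if it is
alphanumeric or a dash, or if it is graphic and not ASCII. Below is a minimal C
sketch of that logic for a single code point. The function name
is_hostname_segment_char is hypothetical, and treating every code point at or
above 0x80 as graphic is a simplifying assumption; the real [[:graph:]] class
is more selective.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring
 *   "(?x: [-[:alnum:]] | (?! [[:ascii:]] ) [[:graph:]] )"
 * for one code point. Not part of gnome-terminal or PCRE2. */
static bool is_hostname_segment_char(uint32_t cp)
{
    if (cp < 0x80) {
        /* First branch: [-[:alnum:]]. The second branch can never accept
         * an ASCII character because of the (?! [[:ascii:]] ) lookahead. */
        return cp == '-' || isalnum((int)cp) != 0;
    }
    /* Second branch: non-ASCII. ASSUMPTION: every such code point is
     * treated as graphic here, which over-approximates [[:graph:]]. */
    return true;
}
```

With this model, 'a' and '-' are accepted, while '.' and ' ' are rejected.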
We use that later in REGEX_URL_HTTP, but I'm simplifying it here.
We want to make sure that the character preceding our match is not of this
class. We do this with a negative lookbehind. The pattern
is
"(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word"
which should (and does) match the standalone word "word".
However, our code is a tiny bit more complicated: the pattern begins with a
regex subroutine definition, like this:
"(?(DEFINE)(?<foo>bar))(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word"
In this example the definition of "foo" isn't used anywhere, so defining it
shouldn't have an effect on the rest. It does, however; as of 10.34 this
pattern no longer matches the word "word".
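For reference, here is a small C sketch that spells out the two patterns after
string-literal concatenation, so they can be pasted into any PCRE2 test
harness. The names PATTERN_PLAIN, PATTERN_WITH_DEFINE, DEFINE_PREFIX and
differs_only_by_define_prefix are hypothetical and exist only for this
illustration.

```c
#include <string.h>

#define HOSTNAMESEGMENTCHARS_CLASS \
    "(?x: [-[:alnum:]] | (?! [[:ascii:]] ) [[:graph:]] )"

/* Hypothetical name for the unused subroutine definition from the report. */
#define DEFINE_PREFIX "(?(DEFINE)(?<foo>bar))"

/* Per the report, this matches the standalone word "word". */
static const char PATTERN_PLAIN[] =
    "(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word";

/* Per the report, this no longer matches "word" as of 10.34. */
static const char PATTERN_WITH_DEFINE[] =
    DEFINE_PREFIX "(?<!" HOSTNAMESEGMENTCHARS_CLASS ")word";

/* Returns 1 if the two patterns differ only by the DEFINE group in front. */
static int differs_only_by_define_prefix(void)
{
    size_t n = strlen(DEFINE_PREFIX);
    return strncmp(PATTERN_WITH_DEFINE, DEFINE_PREFIX, n) == 0
        && strcmp(PATTERN_WITH_DEFINE + n, PATTERN_PLAIN) == 0;
}
```

The expanded form of the first pattern is
"(?<!(?x: [-[:alnum:]] | (?! [[:ascii:]] ) [[:graph:]] ))word"; the second is
the same string with the DEFINE group prepended.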
---
Note: gnome-terminal itself, as per the above linked files, uses GRegex instead
of PCRE2. To run our tests with PCRE2, you could use a
work-in-progress branch of VTE from
https://gitlab.gnome.org/GNOME/vte/-/tree/wip/regex-builtins where the regex
definitions are in src/regex-builtins-patterns.hh, and the unittests are in
src/regex-test.cc. In the latter, at around line 460 you find the tests that
are skipped for 10.34, but should pass.
You can compile with "meson" and then "ninja" and run the tests with "ninja
test". In src/meson.build, remove the 'parser' and 'reaper' entries from
test_units[] to speed it up. Or I'm happy to test any candidate patch.