Author: Markus Mottl
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1490] New: Awkward semantics of anchored patterns with lookbehind assertions

http://bugs.exim.org/show_bug.cgi?id=1490
           Summary: Awkward semantics of anchored patterns with lookbehind
                    assertions
           Product: PCRE
           Version: N/A
          Platform: All
               URL: https://bitbucket.org/mmottl/pcre-ocaml/issue/3/zero-width-split-problem
        OS/Version: All
            Status: NEW
          Keywords: work:small
          Severity: wishlist
          Priority: medium
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: markus.mottl@???
                CC: pcre-dev@???



The URL associated with this report gives some more background information on
this problem, which makes it hard to use non-C bindings with PCRE.

In order to implement "split" in a Perl-compatible way, it is necessary to
check for empty (zero-length) matches as the function advances over the
subject string. My OCaml bindings for PCRE use the "ANCHORED" and "NOTEMPTY"
flags for that purpose. But this apparently causes problems when lookbehind
assertions are used in the user's pattern, because they can still inspect
characters before the current position, even when the match is anchored.
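
The effect can be illustrated outside of PCRE. The following is only a small
sketch using Python's standard "re" module as a stand-in, on the assumption
that its "pos" argument is a fair analogue of an anchored PCRE match at a start
offset; the pattern and subject are made up for illustration. An anchored
attempt at an offset lets the lookbehind see the preceding character, while the
same attempt on a truncated subject does not:

```python
import re

# Hypothetical user pattern: "b", but only if preceded by "a".
pattern = re.compile(r"(?<=a)b")
subject = "ab"

# Anchored attempt at offset 1 of the full subject:
# the lookbehind still inspects subject[0].
m_offset = pattern.match(subject, 1)

# Same attempt with the subject truncated at offset 1:
# nothing precedes "b", so the lookbehind cannot succeed.
m_sliced = pattern.match(subject[1:])

print("match at offset:", m_offset is not None)
print("match on truncated subject:", m_sliced is not None)
```

A "split" that advances through the subject by offset gets the first behaviour,
whereas the second is what treating the current position as the true start of
the subject would give.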

It is in principle possible to work around this issue. I could change the API
of my OCaml-bindings such that users can pass two offsets to all
PCRE-functions: one for the beginning of the subject string, one for the match
position. The C-stubs of the bindings would then have to add the new offset to
the string pointer before calling PCRE and, if non-zero, readjust the match
indexes afterwards.
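
As a rough sketch of what such a two-offset call would do, again written
against Python's "re" module rather than the actual C stubs: the function name
"match_from" and both parameter names are hypothetical and only serve to show
the offset arithmetic.

```python
import re

def match_from(pattern, subject, subject_start, pos):
    """Anchored match at `pos`, with `subject_start` treated as the true
    beginning of the subject. Returns (start, end) in terms of the original
    subject, or None if there is no match."""
    if not 0 <= subject_start <= pos <= len(subject):
        raise ValueError("need 0 <= subject_start <= pos <= len(subject)")
    # Equivalent of adding the offset to the string pointer.
    view = subject[subject_start:]
    m = pattern.match(view, pos - subject_start)
    if m is None:
        return None
    # Readjust the match indexes to the original subject.
    start, end = m.span()
    return (start + subject_start, end + subject_start)

lookbehind = re.compile(r"(?<=a)b")
# Subject starts at 0: the lookbehind may see the "a" at index 0.
full = match_from(lookbehind, "ab", 0, 1)
# Subject starts at the match position: the "a" is out of reach.
cut = match_from(lookbehind, "ab", 1, 1)
```

Both offsets have to be supplied and kept consistent by the caller, which is
the part of this design that users would have to reason about.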

Besides being somewhat involved, this API-change would likely be very
confusing to users, because they would now have to reason about two string
offsets.

I don't think the current ANCHORED-semantics should be changed, because some
people may depend on it. But it would be great if there were another flag that
would prevent lookbehind assertions from matching before the start of the
match, essentially treating this position as the true start of the subject
string.

