[pcre-dev] [Bug 1474] New: Mechanism to record token ID

Author: Jonathan S Shapiro
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1474] New: Mechanism to record token ID
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1474
           Summary: Mechanism to record token ID
           Product: PCRE
           Version: N/A
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: wishlist
          Priority: low
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: shap@???
                CC: pcre-dev@???



I've recently been looking at PCRE for tokenization. It can handle all of the
recognition requirements, but I can't see an easy way to identify which token
was recognized. I can abuse the named submatch mechanism to get close (and that
is useful), but it's frustrating that it only gets me 95% of the way there.
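
For illustration, the named-submatch workaround might look roughly like the
sketch below. It uses Python's re module as a stand-in for PCRE (an
assumption made so the example is self-contained); the token names and
patterns are hypothetical. The name of the group that matched serves as the
token label, which is close to, but not quite, a numeric token ID.

```python
import re

# Hypothetical token set, chosen only for illustration.
# Each alternative is wrapped in a named group so the group name
# can double as a token label.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)"
    r"|(?P<IDENT>[A-Za-z_]\w*)"
    r"|(?P<OP>[+\-*/=])"
    r"|(?P<WS>\s+)"
)

def tokenize(text):
    """Return (label, lexeme) pairs, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("no token matches at offset %d" % pos)
        # lastgroup is the name of the named group that matched.
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

The caller still has to translate the group name into whatever numeric ID the
rest of the lexer expects, which is the missing piece described above.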

I'm wondering if it might be appropriate to introduce a new form of memoization
that allows a token number to be recorded. Like named submatch, this would
take the form of a new bracketing syntax. In contrast to named submatch, the
bracketing would specify a number, and the library API would be extended to
provide for recall of this number. If the RE does not specify a token number
for the match, the number returned would be zero.
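
A rough emulation of the intended behaviour, again using Python's re module
in place of PCRE, is sketched below. No new bracketing syntax is shown, since
none has been defined; instead a hypothetical TOKEN_IDS table maps group
names to numbers, and match_with_token_id is a made-up helper standing in for
the proposed API call. An alternative without a token number yields zero.

```python
import re

# Hypothetical mapping from group name to numeric token ID.
# Zero is reserved for "no token number specified".
TOKEN_IDS = {"NUMBER": 1, "IDENT": 2}

# The operator alternative deliberately carries no named group,
# so it has no token number associated with it.
PATTERN = re.compile(r"(?P<NUMBER>\d+)|(?P<IDENT>[A-Za-z_]\w*)|[+\-*/=]")

def match_with_token_id(text, pos=0):
    """Return (token_id, lexeme), or None if nothing matches at pos."""
    m = PATTERN.match(text, pos)
    if m is None:
        return None
    # lastgroup is None when no named group took part in the match,
    # in which case the lookup falls back to zero.
    return TOKEN_IDS.get(m.lastgroup, 0), m.group()
```

With native support, the table lookup would disappear and the number would
come straight from the match result.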

When executed with engines implementing ordered choice, the token ID returned
should be the one associated with the last token-numbered open paren resulting
in a match. When executed with engines implementing a finite automaton rather
than a pushdown automaton, the token ID returned should follow the
leftmost-longest rule.
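
The difference between the two rules can be seen in a small sketch. The rule
list and both helper functions are hypothetical, and Python's re module is
used only to match the individual alternatives; the leftmost-longest variant
compares the top-level alternatives by match length and does not model a real
DFA engine.

```python
import re

# Hypothetical rules: (token ID, pattern), listed in priority order.
RULES = [(1, r"if"), (2, r"[a-z]+")]

def ordered_choice(text, pos=0):
    """First alternative that matches wins (backtracking-style engines)."""
    for token_id, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        if m:
            return token_id, m.group()
    return 0, ""

def leftmost_longest(text, pos=0):
    """Longest match wins; ties go to the earlier rule."""
    best = (0, "")
    for token_id, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        if m and len(m.group()) > len(best[1]):
            best = (token_id, m.group())
    return best
```

On the input "iffy", ordered choice reports token 1 for "if", while
leftmost-longest reports token 2 for "iffy". On the input "if", both report
token 1.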

This is complicated enough that I hesitate to volunteer to implement it, but
I'd be willing to give it a try and submit a patch for consideration if there
is interest in the feature.

