[pcre-dev] [Bug 2472] New: Feature Request: PCRE2_SUBSTITUT…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2472] New: Feature Request: PCRE2_SUBSTITUTE_LITERAL option for pcre2_substitute without processing replacement strings
https://bugs.exim.org/show_bug.cgi?id=2472

            Bug ID: 2472
           Summary: Feature Request: PCRE2_SUBSTITUTE_LITERAL option for
                    pcre2_substitute without processing replacement
                    strings
           Product: PCRE
           Version: N/A
          Hardware: x86
                OS: Windows
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: ew3652@???
                CC: pcre-dev@???


Hi,

I am following the guidelines on https://pcre.org/ to file a feature request by
opening a bug ticket.
I also tried searching for literal and pcre2_substitute in the closed and open
bug section but was not able to find a similar feature request.

Description:
------------
I think an additional option e.g. PCRE2_SUBSTITUTE_LITERAL
which specifies that the replacement string in pcre2_substitute
should not be processed at all would be useful for many programs
that utilize pcre2_substitute.

Rationale
---------
I believe a common use case is when arbitrary replacement strings
are obtained from an external source and copying replacement strings
for preprocessing/escaping is to be avoided.
One example that should be quite common is a long string
containing monetary values, such as

            "....amounts to $10 in value...."


(here the $ in the replacement string is the currency symbol of a monetary
dollar value, not the start of a group reference).
Currently this would have to be escaped as

            "....amounts to $$10 in value...."


or with extended syntax

            "\Q....amounts to $10 in value....\E"


according to https://pcre.org/current/doc/html/pcre2api.html#substitutions.
My personal use case is obtaining the replacement strings inside a
user defined function of a database application.

Comparison to other PCRE2 options
---------------------------------
A similar option, PCRE2_LITERAL, is available for pcre2_compile even though
regular expressions are not efficient for its use case.
The proposed option would be the counterpart to PCRE2_SUBSTITUTE_EXTENDED.
While PCRE2_SUBSTITUTE_EXTENDED increases replacement string processing
complexity, PCRE2_SUBSTITUTE_LITERAL would decrease it.


Disadvantages of Alternatives
-----------------------------
Escape Replacement String
Replacement strings need to be copied to a new buffer and escaped.
This requires extra memory and knowledge of which characters are to be
escaped ($).
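
As a rough illustration of this cost, a caller would need a helper along the
following lines. The name escape_dollars is hypothetical (it is not part of
PCRE2), and the sketch assumes that without PCRE2_SUBSTITUTE_EXTENDED only $
has to be doubled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: returns a newly allocated,
   NUL-terminated copy of src in which every '$' is doubled.
   Assumption: the copy is passed to pcre2_substitute() without
   PCRE2_SUBSTITUTE_EXTENDED, so '$' is the only character to escape.
   The caller frees the result; NULL is returned on allocation failure. */
char *escape_dollars(const char *src, size_t len, size_t *out_len)
{
    /* First pass: count the dollars to size the new buffer. */
    size_t extra = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '$') extra++;
    }

    char *dst = malloc(len + extra + 1);
    if (dst == NULL) return NULL;

    /* Second pass: copy, writing an additional '$' before each '$'. */
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '$') dst[j++] = '$';
        dst[j++] = src[i];
    }
    dst[j] = '\0';

    if (out_len != NULL) *out_len = j;
    return dst;
}
```

Both the second pass over the string and the extra allocation would be
unnecessary if the replacement could be marked as literal.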

Extended syntax e.g. \Q \E
Extended syntax also requires a new copy, adding \Q and \E around it,
and escaping any \E that occurs inside the replacement string.
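
A sketch of what that wrapping might look like is given here. The helper name
quote_extended is hypothetical, and the way an embedded \E is handled (close
the quote, emit an escaped backslash and a plain E, reopen the quote) is an
assumption about how PCRE2_SUBSTITUTE_EXTENDED treats \\ outside \Q...\E; it
has not been tested against PCRE2.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: returns a newly allocated,
   NUL-terminated copy of src wrapped in \Q...\E.
   Assumption: each embedded "\E" can be rewritten as "\E\\E\Q".
   The caller frees the result; NULL is returned on allocation failure. */
char *quote_extended(const char *src, size_t len, size_t *out_len)
{
    /* Count embedded "\E" sequences to size the new buffer. */
    size_t hits = 0;
    for (size_t i = 0; i + 1 < len; i++) {
        if (src[i] == '\\' && src[i + 1] == 'E') { hits++; i++; }
    }

    /* 4 bytes for the outer \Q and \E, 5 extra bytes per embedded \E. */
    char *dst = malloc(len + 4 + 5 * hits + 1);
    if (dst == NULL) return NULL;

    size_t j = 0;
    memcpy(dst + j, "\\Q", 2);
    j += 2;
    for (size_t i = 0; i < len; i++) {
        if (i + 1 < len && src[i] == '\\' && src[i + 1] == 'E') {
            memcpy(dst + j, "\\E\\\\E\\Q", 7);  /* the 7 bytes \E\\E\Q */
            j += 7;
            i++;                                /* skip the 'E' as well */
        } else {
            dst[j++] = src[i];
        }
    }
    memcpy(dst + j, "\\E", 2);
    j += 2;
    dst[j] = '\0';

    if (out_len != NULL) *out_len = j;
    return dst;
}
```

As with plain escaping, this needs a scan of the whole replacement and a
second buffer.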

Substitution callouts
A placeholder replacement string could be handed to pcre2_substitute
(e.g. an empty string) and the literal replacement handled by a callout.
This is not only cumbersome but also makes
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
hard to use because callouts are not called for overflows.

Implementing a separate routine based on pcre2_substitute
Implementing a correct routine that behaves as pcre2_substitute does
is not trivial and some internal methods that pcre2_substitute uses are
not exported.
(e.g. UTF checks or direct access to the callouts set in the match context
which would require a different parameter set in the separate
implementation to handle callouts).
(Exposing get_callout and get_substitute_callout functionality
in the public headers also seems like something that could be useful, but it
is not part of this feature request.)

Implementation Thoughts
-----------------------
I hope some thoughts on untested code are appropriate here.
I could not find a guideline with respect to that and I saw some code in other
reports.
My first impression is that since the size of the replacement is known and it
is constant, one could call the CHECKMEMCPY macro in pcre2_substitute.c before
replacement processing and skip entering the replacement string processing loop
with the next relevant section being the callout section.
E.g.

     BOOL all_literal = ((options & PCRE2_SUBSTITUTE_LITERAL) != 0);
     ...
     if (all_literal) {
        CHECKMEMCPY(replacement, rlength);
        /* skip replacement processing loop ... */
     } else {
        /* replacement processing loop ... */
     }
     /* callout section */


The cost of such an implementation would then be one option bit of the match
options and one additional if check within the global loop of pcre2_substitute.
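
To make the idea concrete outside of the PCRE2 sources, here is a stand-alone
toy model. Everything in it (out_buf, append_bounded, emit_replacement) is
hypothetical and only imitates the general shape of a bounded copy that keeps
counting after an overflow; it is not the actual CHECKMEMCPY macro, and the
non-literal branch handles only "$$" for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy output buffer; all names here are hypothetical, not PCRE2 internals. */
typedef struct {
    char  *buf;     /* destination buffer */
    size_t cap;     /* capacity in bytes */
    size_t used;    /* bytes actually written */
    size_t needed;  /* bytes that a large enough buffer would hold */
} out_buf;

/* Bounded append: copies only while no overflow has happened and the
   chunk fits, but always adds to the required length. */
static void append_bounded(out_buf *o, const char *from, size_t n)
{
    if (o->needed == o->used && n <= o->cap - o->used) {
        memcpy(o->buf + o->used, from, n);
        o->used += n;
    }
    o->needed += n;
}

/* Emits one replacement, either verbatim or through a toy scanner. */
void emit_replacement(out_buf *o, const char *rep, size_t rlen, bool literal)
{
    if (literal) {
        /* Literal mode: one bounded copy, no scanning of rep. */
        append_bounded(o, rep, rlen);
        return;
    }
    for (size_t i = 0; i < rlen; i++) {
        /* Toy rule only: "$$" yields one '$'. Real replacement
           processing also handles group references and more. */
        if (rep[i] == '$' && i + 1 < rlen && rep[i + 1] == '$') i++;
        append_bounded(o, rep + i, 1);
    }
}
```

In this model the literal branch does a single copy regardless of the content
of the replacement, while the required length is still tracked after an
overflow.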
