[Pcre-svn] [991] code/trunk/maint/README: Documentation upda…

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [991] code/trunk/maint/README: Documentation update.
Revision: 991
          http://www.exim.org/viewvc/pcre2?view=rev&revision=991
Author:   ph10
Date:     2018-08-23 17:53:45 +0100 (Thu, 23 Aug 2018)
Log Message:
-----------
Documentation update.


Modified Paths:
--------------
    code/trunk/maint/README


Modified: code/trunk/maint/README
===================================================================
--- code/trunk/maint/README    2018-08-21 11:27:35 UTC (rev 990)
+++ code/trunk/maint/README    2018-08-23 16:53:45 UTC (rev 991)
@@ -323,7 +323,7 @@
   example "a" followed by an accent would, together, match "a".


. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
- supports \N{U+dd..} everywhere but, not in EBCDIC.
+ supports \N{U+dd..} everywhere, but not in EBCDIC.

. Unicode stuff from Perl:

@@ -331,7 +331,7 @@
     \b{sb}              sentence boundary
     \b{wb}              word boundary


- See Unicode TR 29.
+ See Unicode TR 29. The last two are very much aimed at natural language.

. (?[...]) extended classes: big project.

@@ -359,7 +359,86 @@

. Some #defines could be replaced with enums to improve robustness.

+. There was a request for an option for pcre2_match() to return the longest 
+  match. This would mean searching for all possible matches, of course.
+  
+. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters, 
+  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
+  Perl also has /aa, which, in addition, disables ASCII/non-ASCII caseless
+  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In 
+  practice, this just means not using the ucd_caseless_sets[] table.
+  
+. There is more that could be done to the oss-fuzz setup (needs some research). 
+  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE. 
+  The test function could make use of get_substrings() to cover more code.
+  
+. A neater way of handling recursion file names in pcre2grep, e.g. a single 
+  buffer that can grow.  
+  
+. A user suggested that before/after parameters in pcre2grep could have 
+  negative values, to list lines near to the matched line, but not necessarily 
+  the line itself. For example, --before-context=-1 would list the line *after* 
+  each matched line, without showing the matched line. The problem here is what
+  to do with matches that are close together. Maybe a simpler way would be a 
+  flag to disable showing matched lines, only valid with either -A or -B?
+  
+. There was a suggestion for a pcre2grep colour default, or possibly a more
+  general PCRE2GREP_OPT, but only for some options - not file names or patterns. 
+
+. Breaking loops that match an empty string: perhaps find a way of continuing 
+  if *something* has changed, but this might mean remembering additional data.
+  "Something" could be a capture value, but then a list of previous values 
+  would be needed to avoid a cycle of changes. Bugzilla #2182.
+  
+. The use of \K in assertions is problematic. There was some talk of Perl 
+  banning this, but it hasn't happened. Some problems could be avoided by 
+  not allowing it to set a value before the match start; others by not allowing 
+  it to set a value after the match end. This could be controlled by an option 
+  such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane 
+  behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
+  
+. If a function could be written to find 3-character (or other length) fixed 
+  strings, at least one of which must be present for a match, efficient
+  pre-searching of large datasets could be implemented.
+  
+. There's a Perl proposal for some new (* things, including alpha synonyms for 
+  the lookaround assertions:
+
+  (*pla: …)
+  (*plb: …)
+  (*nla: …)
+  (*nlb: …)
+  (*atomic: …)
+  (*positive_look_ahead:...)
+  (*negative_look_ahead:...)
+  (*positive_look_behind:...)
+  (*negative_look_behind:...)
+
+  Also a new one (with synonyms):
+
+  (*script_run: ...)        Ensure all captured chars are in the same script
+  (*sr: …)
+  (*atomic_script_run: …)   A combination of script_run and atomic
+  (*asr:...)
+
+. If pcre2grep had --first-line (match only in the first line) it could be 
+  efficiently used to find files "starting with xxx". What about --last-line?
+  
+. A user requested a means of determining whether a failed match was failed by
+  the start-of-match optimizations, or by running the match engine. Easy enough 
+  to define a bit in the match data, but all three matchers would need work.
+  
+. Would inlining "simple" recursions provide a useful performance boost for the 
+  interpreters? JIT already does some of this.
+  
+. There was a request for a way of re-defining \w (and therefore \W, \b, and 
+  \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way 
+  would be simply to inline the class, with lookarounds for \b and \B. Ideally 
+  the setting should last till the end of the group, which means remembering 
+  all previous settings; maybe a fixed amount of stack would do - how deep 
+  would anyone want to nest these things? Bugzilla #2301.
+
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 13 August 2018
+Last updated: 21 August 2018
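
The pre-searching idea in the notes above (find fixed 3-character strings, at least one of which must be present for a match) can be illustrated with a rough sketch. This is not PCRE2 code: Python's re module stands in for the match engine, the pattern is assumed to be a plain alternation of literal strings, and the names required_trigrams and prefiltered_search are hypothetical. A real implementation would have to derive the required strings from an arbitrary compiled pattern, which is the hard part the note alludes to.

```python
import re


def required_trigrams(literals):
    """Pick one 3-character substring from each literal alternative.

    Any subject that matches the alternation must contain at least one of
    the returned strings, so if none is present there can be no match.
    Returns None when some literal is shorter than 3 characters, because
    this sketch cannot build a filter in that case.
    """
    trigrams = set()
    for lit in literals:
        if len(lit) < 3:
            return None
        trigrams.add(lit[:3])
    return trigrams


def prefiltered_search(literals, lines):
    """Return the lines matched by the alternation of the given literals.

    The regex is run only on lines that pass the cheap trigram check.
    """
    pattern = re.compile("|".join(re.escape(lit) for lit in literals))
    trigrams = required_trigrams(literals)
    hits = []
    for line in lines:
        # Cheap rejection: no required trigram present means no match.
        if trigrams is not None and not any(t in line for t in trigrams):
            continue
        if pattern.search(line):
            hits.append(line)
    return hits
```

The trigram check can only rule lines out. A line that contains a required trigram still has to go through the full match, so the result is the same as running the regex on every line, only with fewer calls on data where matches are rare.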