[Pcre-svn] [1197] code/trunk: Add (?* and (?<* synonyms for …

Author: Subversion repository
Date:  
To: pcre-svn
Subject: [Pcre-svn] [1197] code/trunk: Add (?* and (?<* synonyms for non-atomic lookarounds.
Revision: 1197
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1197
Author:   ph10
Date:     2019-12-28 13:53:59 +0000 (Sat, 28 Dec 2019)
Log Message:
-----------
Add (?* and (?<* synonyms for non-atomic lookarounds.
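
For readers skimming the diff below, here is a minimal standalone sketch of the decision the parser now makes when it sees "(?<". It is not the PCRE2 code itself: the enum labels and the function name classify_less_than are hypothetical, chosen only for illustration, and the real code writes META_* values into the parsed pattern instead of returning a label.

```c
#include <stddef.h>

/* Hypothetical labels for this sketch only. The real pcre2_compile.c emits
   META_LOOKBEHIND, META_LOOKBEHINDNOT or META_LOOKBEHIND_NA, or jumps to
   DEFINE_NAME for a named capturing group. */
typedef enum {
  GROUP_NAMED_CAPTURE,    /* (?<name>...)               */
  GROUP_LOOKBEHIND,       /* (?<=...)                   */
  GROUP_LOOKBEHIND_NOT,   /* (?<!...)                   */
  GROUP_LOOKBEHIND_NA     /* (?<*...), new in this rev  */
} group_kind;

/* ptr points at the '<' that follows "(?"; ptrend is one past the last
   character of the pattern. Mirrors the test in the diff: when nothing
   follows '<', or the next character is not '=', '!' or '*', the text is
   treated as the start of a group name. */
group_kind classify_less_than(const char *ptr, const char *ptrend)
{
  if (ptrend - ptr <= 1) return GROUP_NAMED_CAPTURE;
  switch (ptr[1])
    {
    case '=': return GROUP_LOOKBEHIND;
    case '!': return GROUP_LOOKBEHIND_NOT;
    case '*': return GROUP_LOOKBEHIND_NA;
    default:  return GROUP_NAMED_CAPTURE;
    }
}
```

Under these assumptions, a pointer to "<*ab)" classifies as GROUP_LOOKBEHIND_NA, while "<name>x)" still classifies as GROUP_NAMED_CAPTURE, so existing named groups are unaffected.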


Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/html/pcre2pattern.html
    code/trunk/doc/html/pcre2syntax.html
    code/trunk/doc/pcre2.txt
    code/trunk/doc/pcre2pattern.3
    code/trunk/doc/pcre2syntax.3
    code/trunk/src/pcre2_compile.c
    code/trunk/testdata/testinput2
    code/trunk/testdata/testoutput2


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/ChangeLog    2019-12-28 13:53:59 UTC (rev 1197)
@@ -28,7 +28,11 @@


7. Added PCRE2_SUBSTITUTE_MATCHED.

+8. Added (?* and (?<* as synonyms for (*napla: and (*naplb: to match another
+regex engine. The Perl regex folks are aware of this usage and have made a note
+about it.

+
Version 10.34 21-November-2019
------------------------------


Modified: code/trunk/doc/html/pcre2pattern.html
===================================================================
--- code/trunk/doc/html/pcre2pattern.html    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/doc/html/pcre2pattern.html    2019-12-28 13:53:59 UTC (rev 1197)
@@ -2624,8 +2624,8 @@
 positive assertions can be useful. PCRE2 provides these using the following
 syntax:
 <pre>
-  (*non_atomic_positive_lookahead:  or (*napla:
-  (*non_atomic_positive_lookbehind: or (*naplb:
+  (*non_atomic_positive_lookahead:  or (*napla: or (?*
+  (*non_atomic_positive_lookbehind: or (*naplb: or (?&#60;*
 </pre>
 Consider the problem of finding the right-most word in a string that also
 appears earlier in the string, that is, it must appear at least twice in total.
@@ -3833,7 +3833,7 @@
 </P>
 <br><a name="SEC32" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 18 December 2019
+Last updated: 28 December 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>


Modified: code/trunk/doc/html/pcre2syntax.html
===================================================================
--- code/trunk/doc/html/pcre2syntax.html    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/doc/html/pcre2syntax.html    2019-12-28 13:53:59 UTC (rev 1197)
@@ -553,11 +553,13 @@
 <P>
 These assertions are specific to PCRE2 and are not Perl-compatible.
 <pre>
-  (*napla:...)
-  (*non_atomic_positive_lookahead:...)
+  (?*...)                                )
+  (*napla:...)                           ) synonyms
+  (*non_atomic_positive_lookahead:...)   )


-  (*naplb:...)
-  (*non_atomic_positive_lookbehind:...)
+  (?&#60;*...)                               )
+  (*naplb:...)                           ) synonyms
+  (*non_atomic_positive_lookbehind:...)  )
 </PRE>
 </P>
 <br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
@@ -683,7 +685,7 @@
 </P>
 <br><a name="SEC29" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 29 July 2019
+Last updated: 28 December 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>


Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/doc/pcre2.txt    2019-12-28 13:53:59 UTC (rev 1197)
@@ -8354,8 +8354,8 @@
        some  cases  where  non-atomic positive assertions can be useful. PCRE2
        provides these using the following syntax:


-         (*non_atomic_positive_lookahead:  or (*napla:
-         (*non_atomic_positive_lookbehind: or (*naplb:
+         (*non_atomic_positive_lookahead:  or (*napla: or (?*
+         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*


        Consider the problem of finding the right-most word in  a  string  that
        also  appears  earlier  in the string, that is, it must appear at least
@@ -9487,7 +9487,7 @@


REVISION

-       Last updated: 18 December 2019
+       Last updated: 28 December 2019
        Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------


@@ -10716,11 +10716,13 @@

        These assertions are specific to PCRE2 and are not Perl-compatible.


-         (*napla:...)
-         (*non_atomic_positive_lookahead:...)
+         (?*...)                                )
+         (*napla:...)                           ) synonyms
+         (*non_atomic_positive_lookahead:...)   )


-         (*naplb:...)
-         (*non_atomic_positive_lookbehind:...)
+         (?<*...)                               )
+         (*naplb:...)                           ) synonyms
+         (*non_atomic_positive_lookbehind:...)  )



SCRIPT RUNS
@@ -10844,7 +10846,7 @@

REVISION

-       Last updated: 29 July 2019
+       Last updated: 28 December 2019
        Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------



Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/doc/pcre2pattern.3    2019-12-28 13:53:59 UTC (rev 1197)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "18 December 2019" "PCRE2 10.35"
+.TH PCRE2PATTERN 3 "28 December 2019" "PCRE2 10.35"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -2637,8 +2637,8 @@
 positive assertions can be useful. PCRE2 provides these using the following
 syntax:
 .sp
-  (*non_atomic_positive_lookahead:  or (*napla:
-  (*non_atomic_positive_lookbehind: or (*naplb:
+  (*non_atomic_positive_lookahead:  or (*napla: or (?*
+  (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
 .sp
 Consider the problem of finding the right-most word in a string that also
 appears earlier in the string, that is, it must appear at least twice in total.
@@ -3874,6 +3874,6 @@
 .rs
 .sp
 .nf
-Last updated: 18 December 2019
+Last updated: 28 December 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi


Modified: code/trunk/doc/pcre2syntax.3
===================================================================
--- code/trunk/doc/pcre2syntax.3    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/doc/pcre2syntax.3    2019-12-28 13:53:59 UTC (rev 1197)
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "29 July 2019" "PCRE2 10.34"
+.TH PCRE2SYNTAX 3 "28 December 2019" "PCRE2 10.35"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -531,11 +531,13 @@
 .sp
 These assertions are specific to PCRE2 and are not Perl-compatible.
 .sp
-  (*napla:...)
-  (*non_atomic_positive_lookahead:...)
-.sp
-  (*naplb:...)
-  (*non_atomic_positive_lookbehind:...)
+  (?*...)                                )
+  (*napla:...)                           ) synonyms
+  (*non_atomic_positive_lookahead:...)   )
+.sp                                      
+  (?<*...)                               )
+  (*naplb:...)                           ) synonyms
+  (*non_atomic_positive_lookbehind:...)  )
 .
 .
 .SH "SCRIPT RUNS"
@@ -670,6 +672,6 @@
 .rs
 .sp
 .nf
-Last updated: 29 July 2019
+Last updated: 28 December 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi


Modified: code/trunk/src/pcre2_compile.c
===================================================================
--- code/trunk/src/pcre2_compile.c    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/src/pcre2_compile.c    2019-12-28 13:53:59 UTC (rev 1197)
@@ -3653,7 +3653,7 @@
     if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS;


     /* If ( is not followed by ? it is either a capture or a special verb or an
-    alpha assertion. */
+    alpha assertion or a positive non-atomic lookahead. */


     if (*ptr != CHAR_QUESTION_MARK)
       {
@@ -3685,10 +3685,10 @@
         break;


       /* Handle "alpha assertions" such as (*pla:...). Most of these are
-      synonyms for the historical symbolic assertions, but the script run ones
-      are new. They are distinguished by starting with a lower case letter.
-      Checking both ends of the alphabet makes this work in all character
-      codes. */
+      synonyms for the historical symbolic assertions, but the script run and
+      non-atomic lookaround ones are new. They are distinguished by starting
+      with a lower case letter. Checking both ends of the alphabet makes this
+      work in all character codes. */


       else if (CHMAX_255(c) && (cb->ctypes[c] & ctype_lcletter) != 0)
         {
@@ -3747,9 +3747,7 @@
           goto POSITIVE_LOOK_AHEAD;


           case META_LOOKAHEAD_NA:
-          *parsed_pattern++ = meta;
-          ptr++;
-          goto POST_ASSERTION;
+          goto POSITIVE_NONATOMIC_LOOK_AHEAD;


           case META_LOOKAHEADNOT:
           goto NEGATIVE_LOOK_AHEAD;
@@ -4438,6 +4436,12 @@
       ptr++;
       goto POST_ASSERTION;


+      case CHAR_ASTERISK:
+      POSITIVE_NONATOMIC_LOOK_AHEAD:         /* Come from (*napla: */
+      *parsed_pattern++ = META_LOOKAHEAD_NA;
+      ptr++;
+      goto POST_ASSERTION;
+
       case CHAR_EXCLAMATION_MARK:
       NEGATIVE_LOOK_AHEAD:                   /* Come from (*nla: */
       *parsed_pattern++ = META_LOOKAHEADNOT;
@@ -4447,20 +4451,23 @@


       /* ---- Lookbehind assertions ---- */


-      /* (?< followed by = or ! is a lookbehind assertion. Otherwise (?< is the
-      start of the name of a capturing group. */
+      /* (?< followed by = or ! or * is a lookbehind assertion. Otherwise (?<
+      is the start of the name of a capturing group. */


       case CHAR_LESS_THAN_SIGN:
       if (ptrend - ptr <= 1 ||
-         (ptr[1] != CHAR_EQUALS_SIGN && ptr[1] != CHAR_EXCLAMATION_MARK))
+         (ptr[1] != CHAR_EQUALS_SIGN &&
+          ptr[1] != CHAR_EXCLAMATION_MARK &&
+          ptr[1] != CHAR_ASTERISK))
         {
         terminator = CHAR_GREATER_THAN_SIGN;
         goto DEFINE_NAME;
         }
       *parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
-        META_LOOKBEHIND : META_LOOKBEHINDNOT;
+        META_LOOKBEHIND : (ptr[1] == CHAR_EXCLAMATION_MARK)?
+        META_LOOKBEHINDNOT : META_LOOKBEHIND_NA;


-      POST_LOOKBEHIND:              /* Come from (*plb: (*naplb: and (*nlb: */
+      POST_LOOKBEHIND:           /* Come from (*plb: (*naplb: and (*nlb: */
       *has_lookbehind = TRUE;
       offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
       PUTOFFSET(offset, parsed_pattern);
@@ -4633,8 +4640,6 @@
         *parsed_pattern++ = META_KET;
         }


-
-
       if (top_nest == (nest_save *)(cb->start_workspace)) top_nest = NULL;
         else top_nest--;
       }


Modified: code/trunk/testdata/testinput2
===================================================================
--- code/trunk/testdata/testinput2    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/testdata/testinput2    2019-12-28 13:53:59 UTC (rev 1197)
@@ -5670,6 +5670,9 @@
 /\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
     word1 word3 word1 word2 word3 word2 word2 word1 word3 word4


+/\A(?*.*\b(\w++))(?>.*?\b\1\b){3}/
+    word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
+
 /(*plb:(.)..|(.)...)(\1|\2)/
     abcdb\=offset=4 
     abcda\=offset=4 
@@ -5678,6 +5681,10 @@
     abcdb\=offset=4 
     abcda\=offset=4 


+/(?<*(.)..|(.)...)(\1|\2)/
+    abcdb\=offset=4 
+    abcda\=offset=4 
+    
 /(*non_atomic_positive_lookahead:ab)/B


/(*non_atomic_positive_lookbehind:ab)/B

Modified: code/trunk/testdata/testoutput2
===================================================================
--- code/trunk/testdata/testoutput2    2019-12-27 13:35:17 UTC (rev 1196)
+++ code/trunk/testdata/testoutput2    2019-12-28 13:53:59 UTC (rev 1197)
@@ -17088,6 +17088,11 @@
  0: word1 word3 word1 word2 word3 word2 word2 word1 word3
  1: word3


+/\A(?*.*\b(\w++))(?>.*?\b\1\b){3}/
+    word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
+ 0: word1 word3 word1 word2 word3 word2 word2 word1 word3
+ 1: word3
+
 /(*plb:(.)..|(.)...)(\1|\2)/
     abcdb\=offset=4 
  0: b
@@ -17109,6 +17114,18 @@
  2: a
  3: a


+/(?<*(.)..|(.)...)(\1|\2)/
+    abcdb\=offset=4 
+ 0: b
+ 1: b
+ 2: <unset>
+ 3: b
+    abcda\=offset=4 
+ 0: a
+ 1: <unset>
+ 2: a
+ 3: a
+    
 /(*non_atomic_positive_lookahead:ab)/B
 ------------------------------------------------------------------
         Bra