[Pcre-svn] [839] code/trunk: Update grapheme breaking rules for Unicode 10.0.0.

Author: Subversion repository
To: pcre-svn

Revision: 839

          http://www.exim.org/viewvc/pcre2?view=rev&revision=839
Author:   ph10
Date:     2017-07-05 09:55:49 +0100 (Wed, 05 Jul 2017)
Log Message:
-----------
Update grapheme breaking rules for Unicode 10.0.0.

Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/doc/html/pcre2pattern.html
    code/trunk/doc/pcre2.txt
    code/trunk/doc/pcre2pattern.3
    code/trunk/src/pcre2_dfa_match.c
    code/trunk/src/pcre2_match.c
    code/trunk/src/pcre2_tables.c
    code/trunk/testdata/testinput5
    code/trunk/testdata/testoutput5

Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/ChangeLog    2017-07-05 08:55:49 UTC (rev 839)
@@ -213,7 +213,10 @@

48. Add the callout_no_where modifier to pcre2test.

+49. Update extended grapheme breaking rules to the latest set that are in
+Unicode Standard Annex #29.

+
Version 10.23 14-February-2017
------------------------------

Modified: code/trunk/doc/html/pcre2pattern.html
===================================================================
--- code/trunk/doc/html/pcre2pattern.html    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/doc/html/pcre2pattern.html    2017-07-05 08:55:49 UTC (rev 839)
@@ -177,7 +177,7 @@
 goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
 this counter, which therefore limits the amount of computing resource used for
 a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also 
+indirectly restricts the amount of heap memory that is used, but there is also
 an explicit memory limit that can be set.
 </P>
 <P>
@@ -198,7 +198,7 @@
 setting of one of these limits, the lower value is used.
 </P>
 <P>
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is 
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 </P>
 <P>
@@ -233,7 +233,7 @@
   (*CRLF)      carriage return, followed by linefeed
   (*ANYCRLF)   any of the three above
   (*ANY)       all Unicode newline sequences
-  (*NUL)       the NUL character (binary zero) 
+  (*NUL)       the NUL character (binary zero)
 </pre>
 These override the default and the options given to the compiling function. For
 example, on a Unix system where LF is the default newline sequence, the pattern
@@ -249,7 +249,7 @@
 true. It also affects the interpretation of the dot metacharacter when
 PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
 what the \R escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the next 
+sequence, for Perl compatibility. However, this can be changed; see the next
 section and the description of \R in the section entitled
 <a href="#newlineseq">"Newline sequences"</a>
 below. A change of \R setting can be combined with a change of newline
@@ -1001,11 +1001,14 @@
 <a href="#atomicgroup">(see below).</a>
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \X always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
 </P>
 <P>
+\X always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
+</P>
+<P>
 1. End at the end of the subject string.
 </P>
 <P>
@@ -1018,13 +1021,27 @@
 character; an LVT or T character may be follwed only by a T character.
 </P>
 <P>
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 </P>
 <P>
 5. Do not end after prepend characters.
 </P>
 <P>
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+</P>
+<P>
+7. Do not break within emoji zwj sequences (zero-width jointer followed by
+"glue after ZWJ" or "base glue after ZWJ").
+</P>
+<P>
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+</P>
+<P>
 6. Otherwise, end the cluster.
 <a name="extraprops"></a></P>
 <br><b>
@@ -1562,15 +1579,15 @@
 <pre>
   i  for PCRE2_CASELESS
   m  for PCRE2_MULTILINE
-  n  for PCRE2_NO_AUTO_CAPTURE 
+  n  for PCRE2_NO_AUTO_CAPTURE
   s  for PCRE2_DOTALL
   x  for PCRE2_EXTENDED
   xx for PCRE2_EXTENDED_MORE
 </pre>
 For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended" 
-options are not independent; unsetting either one cancels the effects of both 
-of them. 
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
 </P>
 <P>
 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
@@ -2249,7 +2266,7 @@
 numbering the capturing subpatterns in the whole pattern. However, substring
 capturing is carried out only for positive assertions that succeed, that is,
 one of their branches matches, so matching continues after the assertion. If
-all branches of a positive assertion fail to match, nothing is captured, and 
+all branches of a positive assertion fail to match, nothing is captured, and
 control is passed to the previous backtracking point.
 </P>
 <P>
@@ -2256,7 +2273,7 @@
 No capturing is done for a negative assertion unless it is being used as a
 condition in a
 <a href="#subpatternsassubroutines">conditional subpattern</a>
-(see the discussion below). Matching continues after a non-conditional negative 
+(see the discussion below). Matching continues after a non-conditional negative
 assertion only if all its branches fail to match.
 </P>
 <P>
@@ -2824,14 +2841,14 @@
 failure. (Historical note: PCRE implemented recursion before Perl did.)
 </P>
 <P>
-Starting with release 10.30, recursive subroutine calls are no longer treated 
-as atomic. That is, they can be re-entered to try unused alternatives if there 
-is a matching failure later in the pattern. This is now compatible with the way 
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
 Perl works. If you want a subroutine call to be atomic, you must explicitly
 enclose it in an atomic group.
 </P>
 <P>
-Supporting backtracking into recursions simplifies certain types of recursive 
+Supporting backtracking into recursions simplifies certain types of recursive
 pattern. For example, this pattern matches palindromic strings:
 <pre>
   ^((.)(?1)\2|.?)$
@@ -2863,7 +2880,7 @@
 This pattern matches "bab". The first capturing parentheses match "b", then in
 the second group, when the back reference \1 fails to match "b", the second
 alternative matches "a" and then recurses. In the recursion, \1 does now match
-"b" and so the whole match succeeds. This match used to fail in Perl, but in 
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 <a name="subpatternsassubroutines"></a></P>
 <br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
@@ -3398,17 +3415,17 @@
 </P>
 <P>
 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
-a positive assertion and false for a negative one; captured substrings are 
+a positive assertion and false for a negative one; captured substrings are
 retained in both cases.
 </P>
 <P>
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there 
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
 are no more branches to try, (*THEN) causes a positive assertion to be false,
 and a negative assertion to be true.
 </P>
 <P>
 The other backtracking verbs are not treated specially if they appear in a
-standalone positive assertion. In a conditional positive assertion, 
+standalone positive assertion. In a conditional positive assertion,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
 false. However, for both standalone and conditional negative assertions,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@@ -3455,7 +3472,7 @@
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 <br>
 Copyright &copy; 1997-2017 University of Cambridge.
 <br>
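
Note on rule 8 in the documentation hunk above: the flag-sequence rule comes down to a parity count of regional indicator (RI) characters. The following is a minimal sketch of that count over an array of code points, not the PCRE2 implementation. The names is_regional_indicator and ri_break_required are hypothetical, and the sketch assumes the caller has already established that both characters around the candidate break point are RIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Regional indicator symbols occupy U+1F1E6..U+1F1FF. */
static int is_regional_indicator(uint32_t c)
{
return c >= 0x1F1E6 && c <= 0x1F1FF;
}

/* Hypothetical helper. cp[pos-1] and cp[pos] are assumed to be RIs.
Returns 1 if a grapheme break is required between them, 0 otherwise. */
int ri_break_required(const uint32_t *cp, size_t pos)
{
size_t ricount = 0;
size_t i;

assert(pos >= 1);
i = pos - 1;            /* index of the left-hand character */

/* Count the RIs that precede the left-hand character. */
while (i > 0 && is_regional_indicator(cp[i - 1]))
  {
  ricount++;
  i--;
  }

/* An odd count means the left-hand RI already completes a pair. */
return (ricount & 1) != 0;
}
```

For four consecutive RIs (two flags), this reports a required break only at index 2, so the run splits into two pairs.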

Modified: code/trunk/doc/pcre2.txt
===================================================================
--- code/trunk/doc/pcre2.txt    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/doc/pcre2.txt    2017-07-05 08:55:49 UTC (rev 839)
@@ -6433,27 +6433,41 @@
        (see  below).  Unicode supports various kinds of composite character by
        giving each character a grapheme breaking property,  and  having  rules
        that use these properties to define the boundaries of extended grapheme
-       clusters. \X always matches at least one  character.  Then  it  decides
-       whether  to  add additional characters according to the following rules
-       for ending a cluster:
+       clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode
+       Text Segmentation".

+       \X  always  matches  at least one character. Then it decides whether to
+       add additional characters according to the following rules for ending a
+       cluster:
+
        1. End at the end of the subject string.

-       2. Do not end between CR and LF; otherwise end after any control  char-
+       2.  Do not end between CR and LF; otherwise end after any control char-
        acter.

-       3.  Do  not  break  Hangul (a Korean script) syllable sequences. Hangul
-       characters are of five types: L, V, T, LV, and LVT. An L character  may
-       be  followed by an L, V, LV, or LVT character; an LV or V character may
+       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
+       characters  are of five types: L, V, T, LV, and LVT. An L character may
+       be followed by an L, V, LV, or LVT character; an LV or V character  may
        be followed by a V or T character; an LVT or T character may be follwed
        only by a T character.

-       4.  Do not end before extending characters or spacing marks. Characters
-       with the "mark" property always have  the  "extend"  grapheme  breaking
-       property.
+       4. Do not end before extending  characters  or  spacing  marks  or  the
+       "zero-width  joiner"  characters.  Characters  with the "mark" property
+       always have the "extend" grapheme breaking property.

        5. Do not end after prepend characters.

+       6. Do not break within emoji modifier sequences (a base character  fol-
+       lowed by a modifier). Extending characters are allowed before the modi-
+       fier.
+
+       7. Do not break within emoji zwj sequences (zero-width jointer followed
+       by "glue after ZWJ" or "base glue after ZWJ").
+
+       8.  Do  not  break  within  emoji flag sequences. That is, do not break
+       between regional indicator (RI) characters if there are an  odd  number
+       of RI characters before the break point.
+
        6. Otherwise, end the cluster.

    PCRE2's additional properties
@@ -8744,7 +8758,7 @@

REVISION

-       Last updated: 02 July 2017
+       Last updated: 05 July 2017
        Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------

Modified: code/trunk/doc/pcre2pattern.3
===================================================================
--- code/trunk/doc/pcre2pattern.3    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/doc/pcre2pattern.3    2017-07-05 08:55:49 UTC (rev 839)
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
+.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -145,7 +145,7 @@
 goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
 this counter, which therefore limits the amount of computing resource used for
 a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also 
+indirectly restricts the amount of heap memory that is used, but there is also
 an explicit memory limit that can be set.
 .P
 These facilities are provided to catch runaway matches that are provoked by
@@ -164,7 +164,7 @@
 limits set by the programmer, but not raise them. If there is more than one
 setting of one of these limits, the lower value is used.
 .P
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is 
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 .P
 The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
@@ -203,7 +203,7 @@
   (*CRLF)      carriage return, followed by linefeed
   (*ANYCRLF)   any of the three above
   (*ANY)       all Unicode newline sequences
-  (*NUL)       the NUL character (binary zero) 
+  (*NUL)       the NUL character (binary zero)
 .sp
 These override the default and the options given to the compiling function. For
 example, on a Unix system where LF is the default newline sequence, the pattern
@@ -218,7 +218,7 @@
 true. It also affects the interpretation of the dot metacharacter when
 PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
 what the \eR escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the next 
+sequence, for Perl compatibility. However, this can be changed; see the next
 section and the description of \eR in the section entitled
 .\" HTML <a href="#newlineseq">
 .\" </a>
@@ -998,10 +998,12 @@
 .\"
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \eX always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
 .P
+\eX always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
+.P
 1. End at the end of the subject string.
 .P
 2. Do not end between CR and LF; otherwise end after any control character.
@@ -1011,11 +1013,22 @@
 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be follwed only by a T character.
 .P
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 .P
 5. Do not end after prepend characters.
 .P
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+.P
+7. Do not break within emoji zwj sequences (zero-width jointer followed by
+"glue after ZWJ" or "base glue after ZWJ").
+.P
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+.P
 6. Otherwise, end the cluster.
 .
 .
@@ -1560,15 +1573,15 @@
 .sp
   i  for PCRE2_CASELESS
   m  for PCRE2_MULTILINE
-  n  for PCRE2_NO_AUTO_CAPTURE 
+  n  for PCRE2_NO_AUTO_CAPTURE
   s  for PCRE2_DOTALL
   x  for PCRE2_EXTENDED
   xx for PCRE2_EXTENDED_MORE
 .sp
 For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended" 
-options are not independent; unsetting either one cancels the effects of both 
-of them. 
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
 .P
 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
 and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
@@ -2256,7 +2269,7 @@
 numbering the capturing subpatterns in the whole pattern. However, substring
 capturing is carried out only for positive assertions that succeed, that is,
 one of their branches matches, so matching continues after the assertion. If
-all branches of a positive assertion fail to match, nothing is captured, and 
+all branches of a positive assertion fail to match, nothing is captured, and
 control is passed to the previous backtracking point.
 .P
 No capturing is done for a negative assertion unless it is being used as a
@@ -2263,9 +2276,9 @@
 condition in a
 .\" HTML <a href="#subpatternsassubroutines">
 .\" </a>
-conditional subpattern 
+conditional subpattern
 .\"
-(see the discussion below). Matching continues after a non-conditional negative 
+(see the discussion below). Matching continues after a non-conditional negative
 assertion only if all its branches fail to match.
 .P
 For compatibility with Perl, most assertion subpatterns may be repeated; though
@@ -2846,13 +2859,13 @@
 if it contained untried alternatives and there was a subsequent matching
 failure. (Historical note: PCRE implemented recursion before Perl did.)
 .P
-Starting with release 10.30, recursive subroutine calls are no longer treated 
-as atomic. That is, they can be re-entered to try unused alternatives if there 
-is a matching failure later in the pattern. This is now compatible with the way 
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
 Perl works. If you want a subroutine call to be atomic, you must explicitly
 enclose it in an atomic group.
 .P
-Supporting backtracking into recursions simplifies certain types of recursive 
+Supporting backtracking into recursions simplifies certain types of recursive
 pattern. For example, this pattern matches palindromic strings:
 .sp
   ^((.)(?1)\e2|.?)$
@@ -2883,7 +2896,7 @@
 This pattern matches "bab". The first capturing parentheses match "b", then in
 the second group, when the back reference \e1 fails to match "b", the second
 alternative matches "a" and then recurses. In the recursion, \e1 does now match
-"b" and so the whole match succeeds. This match used to fail in Perl, but in 
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 .
 .
@@ -3427,15 +3440,15 @@
 processing; captured substrings are discarded.
 .P
 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
-a positive assertion and false for a negative one; captured substrings are 
+a positive assertion and false for a negative one; captured substrings are
 retained in both cases.
 .P
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there 
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
 are no more branches to try, (*THEN) causes a positive assertion to be false,
 and a negative assertion to be true.
 .P
 The other backtracking verbs are not treated specially if they appear in a
-standalone positive assertion. In a conditional positive assertion, 
+standalone positive assertion. In a conditional positive assertion,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
 false. However, for both standalone and conditional negative assertions,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@@ -3485,6 +3498,6 @@
 .rs
 .sp
 .nf
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 Copyright (c) 1997-2017 University of Cambridge.
 .fi
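
Note on the pairwise test used in the source hunks of this commit: each loop first checks (PRIV(ucp_gbtable)[lgb] & (1u << rgb)) before applying any of the new rules. The sketch below illustrates that bitmask lookup with a much reduced, hypothetical property set. The enum values, the table name no_break_after, its contents, and the function may_join are all assumptions made for illustration; they are not the contents of pcre2_tables.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, much reduced set of grapheme break properties. */
enum { GB_CR, GB_LF, GB_EXTEND, GB_ZWJ, GB_OTHER, GB_COUNT };

#define BIT(p) (1u << (p))

/* Bit rgb of no_break_after[lgb] is set when no break is allowed between
a left-hand character with property lgb and a right-hand one with rgb.
The entries are illustrative only. */
static const uint32_t no_break_after[GB_COUNT] = {
  [GB_CR]     = BIT(GB_LF),
  [GB_LF]     = 0,
  [GB_EXTEND] = BIT(GB_EXTEND) | BIT(GB_ZWJ),
  [GB_ZWJ]    = BIT(GB_EXTEND) | BIT(GB_ZWJ),
  [GB_OTHER]  = BIT(GB_EXTEND) | BIT(GB_ZWJ),
};

/* Returns 1 if the pair (lgb, rgb) stays in the same cluster. */
int may_join(int lgb, int rgb)
{
assert(lgb >= 0 && lgb < GB_COUNT);
assert(rgb >= 0 && rgb < GB_COUNT);
return (no_break_after[lgb] & BIT(rgb)) != 0;
}
```

A table of this shape handles every rule that depends only on the two adjacent characters. Rules that need more context, such as the RI count, have to be checked separately after the lookup succeeds.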

Modified: code/trunk/src/pcre2_dfa_match.c
===================================================================
--- code/trunk/src/pcre2_dfa_match.c    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/src/pcre2_dfa_match.c    2017-07-05 08:55:49 UTC (rev 839)
@@ -1379,8 +1379,46 @@
           if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
           rgb = UCD_GRAPHBREAK(d);
           if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
           ncount++;
-          lgb = rgb;
           nptr += dlen;
           }
         count++;
@@ -1641,8 +1679,46 @@
           if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
           rgb = UCD_GRAPHBREAK(d);
           if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
           ncount++;
-          lgb = rgb;
           nptr += dlen;
           }
         ADD_NEW_DATA(-(state_offset + count), 0, ncount);
@@ -1912,8 +1988,46 @@
           if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
           rgb = UCD_GRAPHBREAK(d);
           if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
           ncount++;
-          lgb = rgb;
           nptr += dlen;
           }
         if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@@ -2102,8 +2216,46 @@
           if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
           rgb = UCD_GRAPHBREAK(d);
           if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
           ncount++;
-          lgb = rgb;
           nptr += dlen;
           }
         if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@@ -2129,7 +2281,7 @@
         case 0x2029:
 #endif  /* Not EBCDIC */
         if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
-        /* Fall through */ 
+        /* Fall through */

         case CHAR_LF:
         ADD_NEW(state_offset + 1, 0);
@@ -3427,7 +3579,7 @@
       while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
       end_subject = t;
       }
-      
+
     /* Anchored: check the first code unit if one is recorded. This may seem
     pointless but it can help in detecting a no match case without scanning for
     the required code unit. */
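
Note on the lgb update in the hunks above: the unconditional "lgb = rgb" is replaced by a conditional assignment so that an E_Base or E_Base_GAZ character stays the left-hand property while Extend characters follow it. The sketch below isolates that condition. The enum and the names next_left and left_after are hypothetical stand-ins for the ucp_gb* values, and the sketch leaves out the table test that decides whether a break occurs at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ucp_gb* property values. */
enum gb { GB_OTHER, GB_EXTEND, GB_E_BASE, GB_E_BASE_GAZ, GB_E_MODIFIER };

/* Same condition as in the diff: when Extend follows E_Base[_GAZ], keep
the old left-hand property; otherwise move on to the new character. */
enum gb next_left(enum gb lgb, enum gb rgb)
{
if (rgb != GB_EXTEND || (lgb != GB_E_BASE && lgb != GB_E_BASE_GAZ))
  return rgb;
return lgb;
}

/* Left-hand property after consuming props[0..n-1] with no break. */
enum gb left_after(const enum gb *props, size_t n)
{
enum gb lgb;
size_t i;

assert(n > 0);
lgb = props[0];
for (i = 1; i < n; i++) lgb = next_left(lgb, props[i]);
return lgb;
}
```

After E_Base, Extend, Extend the left-hand property is still E_Base, so a following E_Modifier is compared against the base character rather than against the last Extend.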

Modified: code/trunk/src/pcre2_match.c
===================================================================
--- code/trunk/src/pcre2_match.c    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/src/pcre2_match.c    2017-07-05 08:55:49 UTC (rev 839)
@@ -2449,7 +2449,44 @@
         if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
         rgb = UCD_GRAPHBREAK(fc);
         if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-        lgb = rgb;
+
+        /* Not breaking between Regional Indicators is allowed only if there
+        are an even number of preceding RIs. */
+
+        if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
+          {
+          int ricount = 0;
+          PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+          if (utf) BACKCHAR(bptr);
+#endif
+          /* bptr is pointing to the left-hand character */
+
+          while (bptr > mb->start_subject)
+            {
+            bptr--;
+#ifdef SUPPORT_UNICODE
+            if (utf)
+              {
+              BACKCHAR(bptr);
+              GETCHAR(fc, bptr);
+              }
+            else
+#endif
+            fc = *bptr;
+            if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+            ricount++;
+            }
+          if ((ricount & 1) != 0) break;  /* Grapheme break required */
+          }
+
+        /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+        any number of Extend before a following E_Modifier. */
+
+        if (rgb != ucp_gbExtend ||
+            (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+          lgb = rgb;
+
         Feptr += len;
         }
       }
@@ -2757,7 +2794,45 @@
               if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
               rgb = UCD_GRAPHBREAK(fc);
               if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
               Feptr += len;
               }
             }
@@ -3527,7 +3602,45 @@
               if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
               rgb = UCD_GRAPHBREAK(fc);
               if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
               Feptr += len;
               }
             }
@@ -4063,7 +4176,45 @@
               if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
               rgb = UCD_GRAPHBREAK(fc);
               if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
               Feptr += len;
               }
             }
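The Regional Indicator check added in the hunks above can be illustrated outside PCRE2 with a minimal standalone sketch. It assumes the subject is already decoded into an array of code points, so the BACKCHAR/GETCHAR handling is not needed. The names is_ri and ri_break_required are hypothetical and are not PCRE2 identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: Regional Indicator symbols are U+1F1E6..U+1F1FF. */
static bool is_ri(uint32_t c)
{
  return c >= 0x1F1E6 && c <= 0x1F1FF;
}

/* s[pos-1] and s[pos] are both assumed to be Regional Indicators.
Count the RIs that immediately precede the left-hand character s[pos-1],
as the ricount loop in the patch does. An odd count means the left-hand
RI already closes a pair, so a grapheme break is required before s[pos]. */
static bool ri_break_required(const uint32_t *s, size_t pos)
{
  size_t ricount = 0;
  size_t i;

  assert(pos > 0);
  i = pos - 1;                       /* index of the left-hand character */
  while (i > 0 && is_ri(s[i - 1]))
    {
    i--;
    ricount++;
    }
  return (ricount & 1) != 0;
}
```

For the four RIs used in the new testinput5 case, this sketch requires a break only between the second and third, which agrees with the two-pair split recorded in testoutput5.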

Modified: code/trunk/src/pcre2_tables.c
===================================================================
--- code/trunk/src/pcre2_tables.c    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/src/pcre2_tables.c    2017-07-05 08:55:49 UTC (rev 839)
@@ -157,9 +157,9 @@
     LV or V may be followed by V or T
     LVT or T may be followed by T

-4. Do not break before extending characters.
+4. Do not break before extending characters or zero-width-joiner (ZWJ).

-The next two rules are only for extended grapheme clusters (but that's what we
+The following rules are only for extended grapheme clusters (but that's what we
are implementing).

5. Do not break before SpacingMarks.
@@ -166,40 +166,53 @@

6. Do not break after Prepend characters.

-7. Otherwise, break everywhere.
+7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
+ by E_Modifier). Extend characters are allowed before the modifier; this
+ cannot be represented in this table, so the code has to deal with it.
+
+8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
+ E_Base_GAZ).
+
+9. Do not break within emoji flag sequences. That is, do not break between
+ regional indicator (RI) symbols if there are an odd number of RI characters
+ before the break point. This table encodes "join RI characters"; the code
+ has to deal with checking for previous adjoining RIs.
+
+10. Otherwise, break everywhere.
*/

+#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)
+
 const uint32_t PRIV(ucp_gbtable)[] = {
    (1<<ucp_gbLF),                                           /*  0 CR */
    0,                                                       /*  1 LF */
    0,                                                       /*  2 Control */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  3 Extend */
-   (1<<ucp_gbExtend)|(1<<ucp_gbPrepend)|                    /*  4 Prepend */
-     (1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
-     (1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
-     (1<<ucp_gbLVT)|(1<<ucp_gbOther),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  5 SpacingMark */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|   /*  6 L */
-     (1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  7 V */
-     (1<<ucp_gbT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /*  8 T */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  9 LV */
-     (1<<ucp_gbT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /* 10 LVT */
+   ESZ,                                                     /*  3 Extend */
+   ESZ|(1<<ucp_gbPrepend)|                                  /*  4 Prepend */
+       (1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
+       (1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
+       (1<<ucp_gbRegionalIndicator)|
+       (1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
+       (1<<ucp_gbE_Base_GAZ)|
+       (1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
+   ESZ,                                                     /*  5 SpacingMark */
+   ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)|             /*  6 L */
+       (1<<ucp_gbLVT),
+   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  7 V */
+   ESZ|(1<<ucp_gbT),                                        /*  8 T */
+   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  9 LV */
+   ESZ|(1<<ucp_gbT),                                        /* 10 LVT */
    (1<<ucp_gbRegionalIndicator),                            /* 11 RegionalIndicator */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 12 Other */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 13 E_Base */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 14 E_Modifier */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 15 E_Base_GAZ */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 16 ZWJ */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)                 /* 12 Glue_After_Zwj */
+   ESZ,                                                     /* 12 Other */
+   ESZ|(1<<ucp_gbE_Modifier),                               /* 13 E_Base */
+   ESZ,                                                     /* 14 E_Modifier */
+   ESZ|(1<<ucp_gbE_Modifier),                               /* 15 E_Base_GAZ */
+   ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ),     /* 16 ZWJ */
+   ESZ                                                      /* 17 Glue_After_Zwj */
 };

+#undef ESZ
+
#ifdef SUPPORT_JIT
/* This table reverses PRIV(ucp_gentype). We can save the cost
of a memory load. */
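How the table above combines with the "do not update lgb" rule from pcre2_match.c can be shown with a reduced sketch. It assumes a cut-down set of seven break properties and a correspondingly smaller table. The GB_* names, gbtable, EZ, and first_cluster_len are hypothetical stand-ins, not the real ucp_gb* values or PRIV(ucp_gbtable), and the Regional Indicator parity check is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical property set; the real list has 18 entries. */
enum { GB_CR, GB_LF, GB_Extend, GB_ZWJ, GB_E_Base, GB_E_Modifier, GB_Other };

/* Parenthesized so that it is safe in any expression context. */
#define EZ ((1u << GB_Extend) | (1u << GB_ZWJ))

/* Bit rgb set in gbtable[lgb] means: do not break between lgb and rgb. */
static const uint32_t gbtable[] = {
  (1u << GB_LF),               /* CR */
  0,                           /* LF */
  EZ,                          /* Extend */
  EZ,                          /* ZWJ */
  EZ | (1u << GB_E_Modifier),  /* E_Base */
  EZ,                          /* E_Modifier */
  EZ                           /* Other */
};

/* Number of items in the first extended grapheme cluster of gb[0..n-1]. */
static size_t first_cluster_len(const int *gb, size_t n)
{
  int lgb;
  size_t i;

  assert(gb != NULL || n == 0);
  if (n == 0) return 0;
  lgb = gb[0];
  for (i = 1; i < n; i++)
    {
    int rgb = gb[i];
    if ((gbtable[lgb] & (1u << rgb)) == 0) break;

    /* Keep lgb at E_Base across Extend so that a later E_Modifier joins. */
    if (rgb != GB_Extend || lgb != GB_E_Base) lgb = rgb;
    }
  return i;
}
```

With this reduced table, E_Base Extend E_Modifier stays one cluster, whereas Other Extend E_Modifier breaks before the modifier because lgb has moved on to Extend.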

Modified: code/trunk/testdata/testinput5
===================================================================
--- code/trunk/testdata/testinput5    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/testdata/testinput5    2017-07-05 08:55:49 UTC (rev 839)
@@ -2043,4 +2043,23 @@
 /^(?:(\X)(?C))+$/utf
     \x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where

+# These two are here because JIT is not yet updated. Also, the very first data
+# line is handled differently by Perl.
+
+/^\X/utf
+    A\x{200d}B                     A ZWJ
+    \x{261D}\x{1F3FB}B             E_Base E_Modifier
+    \x{1F466}\x{1F3FF}B            E_Base_GAZ E_Modifier 
+    \x{200d}\x{1F3A4}B             ZWJ Glue_After_ZWJ
+    \x{200d}\x{1F469}B             ZWJ E_Base_GAZ  
+    \x{1F1E6}\x{1F1E7}B            RegionalIndicator RegionalIndicator 
+    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit    E_Base Extend E_Modifier
+    
+# Regional indicators
+
+/^(\X)(\X)/utf,aftertext
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
+
+
 # End of testinput5

Modified: code/trunk/testdata/testoutput5
===================================================================
--- code/trunk/testdata/testoutput5    2017-07-02 16:32:01 UTC (rev 838)
+++ code/trunk/testdata/testoutput5    2017-07-05 08:55:49 UTC (rev 839)
@@ -4669,4 +4669,38 @@
  0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
  1: \x{11a00}\x{11a07}\x{11a47}

+# These two are here because JIT is not yet updated. Also, the very first data
+# line is handled differently by Perl.
+
+/^\X/utf
+    A\x{200d}B                     A ZWJ
+ 0: A\x{200d}
+    \x{261D}\x{1F3FB}B             E_Base E_Modifier
+ 0: \x{261d}\x{1f3fb}
+    \x{1F466}\x{1F3FF}B            E_Base_GAZ E_Modifier 
+ 0: \x{1f466}\x{1f3ff}
+    \x{200d}\x{1F3A4}B             ZWJ Glue_After_ZWJ
+ 0: \x{200d}\x{1f3a4}
+    \x{200d}\x{1F469}B             ZWJ E_Base_GAZ  
+ 0: \x{200d}\x{1f469}
+    \x{1F1E6}\x{1F1E7}B            RegionalIndicator RegionalIndicator 
+ 0: \x{1f1e6}\x{1f1e7}
+    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit    E_Base Extend E_Modifier
+** /n is not valid here
+    
+# Regional indicators
+
+/^(\X)(\X)/utf,aftertext
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
+ 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
+ 0+ B
+ 1: \x{1f1e6}\x{1f1e7}
+ 2: \x{1f1e7}
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
+ 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
+ 0+ B
+ 1: \x{1f1e6}\x{1f1e7}
+ 2: \x{1f1e7}\x{1f1e6}
+
+
 # End of testinput5
