[Pcre-svn] [1022] code/trunk/maint: Documentation update for…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1022] code/trunk/maint: Documentation update for Script Extensions property coding.
Revision: 1022
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1022
Author:   ph10
Date:     2018-10-07 17:29:51 +0100 (Sun, 07 Oct 2018)
Log Message:
-----------
Documentation update for Script Extensions property coding.


Modified Paths:
--------------
    code/trunk/maint/MultiStage2.py
    code/trunk/maint/README


Modified: code/trunk/maint/MultiStage2.py
===================================================================
--- code/trunk/maint/MultiStage2.py    2018-10-06 17:39:52 UTC (rev 1021)
+++ code/trunk/maint/MultiStage2.py    2018-10-07 16:29:51 UTC (rev 1022)
@@ -8,7 +8,7 @@
 # the upgrading of Unicode property support. The new code speeds up property
 # matching many times. The script is for the use of PCRE maintainers, to
 # generate the pcre2_ucd.c file that contains a digested form of the Unicode
-# data tables.
+# data tables. A number of extensions have been added to the original script.
 #
 # The script has now been upgraded to Python 3 for PCRE2, and should be run in
 # the maint subdirectory, using the command
@@ -15,19 +15,21 @@
 #
 # [python3] ./MultiStage2.py >../src/pcre2_ucd.c
 #
-# It requires five Unicode data tables: DerivedGeneralCategory.txt,
-# GraphemeBreakProperty.txt, Scripts.txt, CaseFolding.txt, and emoji-data.txt.
-# These must be in the maint/Unicode.tables subdirectory.
+# It requires six Unicode data tables: DerivedGeneralCategory.txt,
+# GraphemeBreakProperty.txt, Scripts.txt, ScriptExtensions.txt,
+# CaseFolding.txt, and emoji-data.txt. These must be in the
+# maint/Unicode.tables subdirectory.
 #
 # DerivedGeneralCategory.txt is found in the "extracted" subdirectory of the
 # Unicode database (UCD) on the Unicode web site; GraphemeBreakProperty.txt is
-# in the "auxiliary" subdirectory. Scripts.txt and CaseFolding.txt are directly
-# in the UCD directory. The emoji-data.txt file is in files associated with
-# Unicode Technical Standard #51 ("Unicode Emoji"), for example:
+# in the "auxiliary" subdirectory. Scripts.txt, ScriptExtensions.txt, and
+# CaseFolding.txt are directly in the UCD directory. The emoji-data.txt file is
+# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"),
+# for example:
 #
-#  http://unicode.org/Public/emoji/11.0/emoji-data.txt
+# http://unicode.org/Public/emoji/11.0/emoji-data.txt
 #
-#
+# -----------------------------------------------------------------------------
 # Minor modifications made to this script:
 #  Added #! line at start
 #  Removed tabs
@@ -61,9 +63,33 @@
 #  property, which is used by PCRE2 as a grapheme breaking property. This was
 #  done when updating to Unicode 11.0.0 (July 2018).
 #
-#  Added code to add a Script Extensions field to records.
+#  Added code to add a Script Extensions field to records. This has increased
+#  their size from 8 to 12 bytes, only 10 of which are currently used.
 #
+# 01-March-2010:     Updated list of scripts for Unicode 5.2.0
+# 30-April-2011:     Updated list of scripts for Unicode 6.0.0
+#     July-2012:     Updated list of scripts for Unicode 6.1.0
+# 20-August-2012:    Added scan of GraphemeBreakProperty.txt and added a new
+#                      field in the record to hold the value. Luckily, the
+#                      structure had a hole in it, so the resulting table is
+#                      not much bigger than before.
+# 18-September-2012: Added code for multiple caseless sets. This uses the
+#                      final hole in the structure.
+# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
+# 13-May-2014:       Updated for PCRE2
+# 03-June-2014:      Updated for Python 3
+# 20-June-2014:      Updated for Unicode 7.0.0
+# 12-August-2014:    Updated to put Unicode version into the file
+# 19-June-2015:      Updated for Unicode 8.0.0
+# 02-July-2017:      Updated for Unicode 10.0.0
+# 03-July-2018:      Updated for Unicode 11.0.0
+# 07-July-2018:      Added code to scan emoji-data.txt for the Extended
+#                      Pictographic property.
+# 01-October-2018:   Added the 'Unknown' script name
+# 03-October-2018:   Added new field for Script Extensions
+# ----------------------------------------------------------------------------
 #
+#
 # The main tables generated by this script are used by macros defined in
 # pcre2_internal.h. They look up Unicode character properties using short
 # sequences of code that contains no branches, which makes for greater speed.
@@ -71,11 +97,11 @@
 # Conceptually, there is a table of records (of type ucd_record), containing a
 # script number, script extension value, character type, grapheme break type,
 # offset to caseless matching set, offset to the character's other case, for
-# every character. However, a real table covering all Unicode characters would
-# be far too big. It can be efficiently compressed by observing that many
-# characters have the same record, and many blocks of characters (taking 128
-# characters in a block) have the same set of records as other blocks. This
-# leads to a 2-stage lookup process.
+# every Unicode character. However, a real table covering all Unicode
+# characters would be far too big. It can be efficiently compressed by
+# observing that many characters have the same record, and many blocks of
+# characters (taking 128 characters in a block) have the same set of records as
+# other blocks. This leads to a 2-stage lookup process.
 #
 # This script constructs six tables. The ucd_caseless_sets table contains
 # lists of characters that all match each other caselessly. Each list is
@@ -92,26 +118,32 @@
 # Script Extensions properties of certain characters. Each list is terminated
 # by zero (ucp_Unknown). A character with more than one script listed for its
 # Script Extension property has a negative value in its record. This is the
-# negated offset to the start of the relevant list.
+# negated offset to the start of the relevant list in the ucd_script_sets
+# vector.
 #
 # The ucd_records table contains one instance of every unique record that is
-# required. The ucd_stage1 table is indexed by a character's block number, and
-# yields what is in effect a "virtual" block number. The ucd_stage2 table is a
-# table of "virtual" blocks; each block is indexed by the offset of a character
-# within its own block, and the result is the offset of the required record.
+# required. The ucd_stage1 table is indexed by a character's block number,
+# which is the character's code point divided by 128, since 128 is the size
+# of each block. The result of a lookup in ucd_stage1 is a "virtual" block number.
 #
+# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by
+# the offset of a character within its own block, and the result is the index
+# number of the required record in the ucd_records vector.
+#
 # The following examples are correct for the Unicode 11.0.0 database. Future
 # updates may change the actual lookup values.
 #
 # Example: lowercase "a" (U+0061) is in block 0
 #          lookup 0 in stage1 table yields 0
-#          lookup 97 in the first table in stage2 yields 16
-#          record 17 is { 33, 5, 11, 0, -32 }
-#            33 = ucp_Latin   => Latin script
+#          lookup 97 (0x61) in the first table in stage2 yields 17
+#          record 17 is { 34, 5, 12, 0, -32, 34, 0 }
+#            34 = ucp_Latin   => Latin script
 #             5 = ucp_Ll      => Lower case letter
 #            12 = ucp_gbOther => Grapheme break property "Other"
-#             0               => not part of a caseless set
-#           -32               => Other case is U+0041
+#             0               => Not part of a caseless set
+#           -32 (-0x20)       => Other case is U+0041
+#            34 = ucp_Latin   => No special Script Extension property
+#             0               => Dummy value, unused at present
 #
 # Almost all lowercase latin characters resolve to the same record. One or two
 # are different because they are part of a multi-character caseless set (for
@@ -119,42 +151,34 @@
 #
 # Example: hiragana letter A (U+3042) is in block 96 (0x60)
 #          lookup 96 in stage1 table yields 90
-#          lookup 66 in the 90th table in stage2 yields 515
-#          record 515 is { 26, 7, 11, 0, 0 }
-#            26 = ucp_Hiragana => Hiragana script
+#          lookup 66 (0x42) in table 90 in stage2 yields 564
+#          record 564 is { 27, 7, 12, 0, 0, 27, 0 }
+#            27 = ucp_Hiragana => Hiragana script
 #             7 = ucp_Lo       => Other letter
 #            12 = ucp_gbOther  => Grapheme break property "Other"
-#             0                => not part of a caseless set
+#             0                => Not part of a caseless set
 #             0                => No other case
+#            27 = ucp_Hiragana => No special Script Extension property
+#             0                => Dummy value, unused at present
 #
-# In these examples, no other blocks resolve to the same "virtual" block, as it
-# happens, but plenty of other blocks do share "virtual" blocks.
+# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39)
+#          lookup 57 in stage1 table yields 55
+#          lookup 80 (0x50) in table 55 in stage2 yields 458
+#          record 458 is { 28, 12, 3, 0, 0, -101, 0 }
+#            28 = ucp_Inherited => Script inherited from predecessor
+#            12 = ucp_Mn        => Non-spacing mark
+#             3 = ucp_gbExtend  => Grapheme break property "Extend"
+#             0                 => Not part of a caseless set
+#             0                 => No other case
+#          -101                 => Script Extension list offset = 101
+#             0                 => Dummy value, unused at present
 #
+# At offset 101 in the ucd_script_sets vector we find the list 3, 15, 107, 29,
+# and terminator 0. This means that this character is expected to be used with
+# any of those scripts, which are Bengali, Devanagari, Grantha, and Kannada.
+#
 #  Philip Hazel, 03 July 2008
-#  Last Updated: 03 October 2018
-#
-#
-# 01-March-2010:     Updated list of scripts for Unicode 5.2.0
-# 30-April-2011:     Updated list of scripts for Unicode 6.0.0
-#     July-2012:     Updated list of scripts for Unicode 6.1.0
-# 20-August-2012:    Added scan of GraphemeBreakProperty.txt and added a new
-#                      field in the record to hold the value. Luckily, the
-#                      structure had a hole in it, so the resulting table is
-#                      not much bigger than before.
-# 18-September-2012: Added code for multiple caseless sets. This uses the
-#                      final hole in the structure.
-# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
-# 13-May-2014:       Updated for PCRE2
-# 03-June-2014:      Updated for Python 3
-# 20-June-2014:      Updated for Unicode 7.0.0
-# 12-August-2014:    Updated to put Unicode version into the file
-# 19-June-2015:      Updated for Unicode 8.0.0
-# 02-July-2017:      Updated for Unicode 10.0.0
-# 03-July-2018:      Updated for Unicode 11.0.0
-# 07-July-2018:      Added code to scan emoji-data.txt for the Extended
-#                      Pictographic property.
-# 01-October-2018:   Added the 'Unknown' script name
-# 03-October-2018:   Added new field for Script Extensions
+#  Last Updated: 07 October 2018
 ##############################################################################



@@ -175,13 +199,13 @@
         if chardata[1] == 'C' or chardata[1] == 'S':
           return int(chardata[2], 16) - int(chardata[0], 16)
         return 0
-        
+
 # Parse a line of ScriptExtensions.txt
 def get_script_extension(chardata):
         this_script_list = list(chardata[1].split(' '))
         if len(this_script_list) == 1:
           return script_abbrevs.index(this_script_list[0])
-            
+
         script_numbers = []
         for d in this_script_list:
           script_numbers.append(script_abbrevs.index(d))
@@ -190,18 +214,18 @@


         for i in range(1, len(script_lists) - script_numbers_length + 1):
           for j in range(0, script_numbers_length):
-            found = True 
+            found = True
             if script_lists[i+j] != script_numbers[j]:
-              found = False 
+              found = False
               break
           if found:
             return -i
-            
-        # Not found in existing lists 
-        
+
+        # Not found in existing lists
+
         return_value = len(script_lists)
         script_lists.extend(script_numbers)
-        return -return_value 
+        return -return_value


# Read the whole table in memory, setting/checking the Unicode version
def read_table(file_name, get_value, default_value):
@@ -402,7 +426,7 @@
'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin',
'Old_Sogdian', 'Sogdian'
]
-
+
script_abbrevs = [
'Zzzz', 'Arab', 'Armn', 'Beng', 'Bopo', 'Brai', 'Bugi', 'Buhd', 'Cans',
'Cher', 'Zyyy', 'Copt', 'Cprt', 'Cyrl', 'Dsrt', 'Deva', 'Ethi', 'Geor',
@@ -434,7 +458,7 @@
'Zanb',
#New for Unicode 11.0.0
'Dogr', 'Gong', 'Rohg', 'Maka', 'Medf', 'Sogo', 'Sogd'
- ]
+ ]

category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
@@ -499,10 +523,10 @@

 for i in range(0, MAX_UNICODE):
   if scriptx[i] == script_abbrevs_default:
-    scriptx[i] = script[i] 
+    scriptx[i] = script[i]


-# With the addition of the new Script Extensions field, we need some padding
-# to get the Unicode records up to 12 bytes (multiple of 4). Set a value
+# With the addition of the new Script Extensions field, we need some padding
+# to get the Unicode records up to 12 bytes (multiple of 4). Set a value
# greater than 255 to make the field 16 bits.

 padding_dummy = [0] * MAX_UNICODE
@@ -690,11 +714,11 @@
   m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line)
   if m is None:
     continue
-  first = int(m.group(1),16)   
-  last  = int(m.group(2),16)   
+  first = int(m.group(1),16)
+  last  = int(m.group(2),16)
   if ((last - first + 1) % 10) != 0:
     print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last),
-      file=sys.stderr) 
+      file=sys.stderr)
   while first < last:
     digitsets.append(first + 9)
     first += 10
@@ -724,9 +748,9 @@
 print("  /*   0 */", end='')
 for d in script_lists:
   print(" %3d," % d, end='')
-  count += 1   
+  count += 1
   if d == 0:
-    print("\n  /* %3d */" % count, end='')  
+    print("\n  /* %3d */" % count, end='')
 print("\n};\n")


# Output the main UCD tables.
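The two-stage lookup and the negated Script Extensions offset described in the comments above can be sketched as follows. The table names match those generated into pcre2_ucd.c, but the miniature tables and record values here are invented for illustration; they are not taken from the real generated data.

```python
# Sketch of the 2-stage Unicode property lookup described in the comments.
# The tiny stand-in tables below are invented for illustration only.

BLOCK_SIZE = 128  # characters per block, as in MultiStage2.py

# Records are (script, chartype, gbprop, caseset, othercase, scriptx, dummy).
# A negative scriptx is the negated offset into ucd_script_sets.
ucd_records = [
    (34, 5, 12, 0, -32, 34, 0),   # e.g. a lowercase Latin letter
    (28, 12, 3, 0, 0, -1, 0),     # e.g. a mark with a Script Extensions list
]

# Script Extensions lists, each terminated by 0 (ucp_Unknown).
ucd_script_sets = [0, 3, 15, 107, 29, 0]

# Stage 1 maps a block number to a "virtual" block number; stage 2 holds the
# virtual blocks, each BLOCK_SIZE entries of record indices.
ucd_stage1 = [0]
ucd_stage2 = [0] * BLOCK_SIZE
ucd_stage2[0x61] = 0          # pretend U+0061 uses record 0
ucd_stage2[0x50] = 1          # pretend U+0050 uses record 1

def get_record(c):
    """Two-stage lookup: block number, then offset within the virtual block."""
    vblock = ucd_stage1[c // BLOCK_SIZE]
    index = ucd_stage2[vblock * BLOCK_SIZE + (c % BLOCK_SIZE)]
    return ucd_records[index]

def script_extensions(record):
    """Return the list of script numbers for a record's Script Extensions field."""
    sx = record[5]
    if sx >= 0:
        return [sx]               # single script, stored directly
    scripts, i = [], -sx          # negative value: offset into ucd_script_sets
    while ucd_script_sets[i] != 0:
        scripts.append(ucd_script_sets[i])
        i += 1
    return scripts

print(script_extensions(get_record(0x61)))  # single-script case -> [34]
print(script_extensions(get_record(0x50)))  # list case -> [3, 15, 107, 29]
```

The negative-offset convention lets a single 16-bit-sized field either hold a script number directly or point into the shared ucd_script_sets vector, which is why the lists there are zero-terminated.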

Modified: code/trunk/maint/README
===================================================================
--- code/trunk/maint/README    2018-10-06 17:39:52 UTC (rev 1021)
+++ code/trunk/maint/README    2018-10-07 16:29:51 UTC (rev 1022)
@@ -23,11 +23,12 @@
 ManyConfigTests  A shell script that runs "configure, make, test" a number of
                  times with different configuration settings.


-MultiStage2.py   A Python script that generates the file pcre2_ucd.c from five
-                 Unicode data tables, which are themselves downloaded from the
+MultiStage2.py   A Python script that generates the file pcre2_ucd.c from six
+                 Unicode data files, which are themselves downloaded from the
                  Unicode web site. Run this script in the "maint" directory.
-                 The generated file contains the tables for a 2-stage lookup
-                 of Unicode properties.
+                 The generated file is written to stdout. It contains the
+                 tables for a 2-stage lookup of Unicode properties, along with
+                 some auxiliary tables.


 pcre2_chartables.c.non-standard
                  This is a set of character tables that came from a Windows
@@ -40,14 +41,15 @@
 Unicode.tables   The files in this directory were downloaded from the Unicode 
                  web site. They contain information about Unicode characters
                  and scripts. The ones used by the MultiStage2.py script are
-                 CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt, 
-                 GraphemeBreakProperty.txt, and emoji-data.txt. I've kept 
-                 UnicodeData.txt (which is no longer used by the script)
-                 because it is useful occasionally for manually looking up the
-                 details of certain characters. However, note that character
-                 names in this file such as "Arabic sign sanah" do NOT mean 
-                 that the character is in a particular script (in this case, 
-                 Arabic). Scripts.txt is where to look for script information.
+                 CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
+                 ScriptExtensions.txt, GraphemeBreakProperty.txt, and
+                 emoji-data.txt. I've kept UnicodeData.txt (which is no longer
+                 used by the script) because it is useful occasionally for
+                 manually looking up the details of certain characters.
+                 However, note that character names in this file such as
+                 "Arabic sign sanah" do NOT mean that the character is in a
+                 particular script (in this case, Arabic). Scripts.txt and
+                 ScriptExtensions.txt are where to look for script information.


 ucptest.c        A short C program for testing the Unicode property macros
                  that do lookups in the pcre2_ucd.c data, mainly useful after
@@ -61,7 +63,7 @@
                  point into a sequence of bytes in the UTF-8 encoding, and vice
                  versa. If its argument is a hex number such as 0x1234, it
                  outputs a list of the equivalent UTF-8 bytes. If its argument
-                 is sequence of concatenated UTF-8 bytes (e.g. e188b4) it
+                 is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                  treats them as a UTF-8 character and outputs the equivalent
                  code point in hex.


@@ -72,27 +74,33 @@
When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
-GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
-can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
-run to generate the tricky tables for inclusion in pcre2_tables.c.
+GenerateUtt.py scripts must be edited to add the new names. I have been adding
+each new group at the end of the relevant list, with a comment. Note also that
+both the pcre2syntax.3 and pcre2pattern.3 man pages contain lists of Unicode
+script names.

-If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
-the cause is usually a missing (or misspelt) name in the list of scripts. I
-couldn't find a straightforward list of scripts on the Unicode site, but
-there's a useful Wikipedia page that lists them, and notes the Unicode version
-in which they were introduced:
+MultiStage2.py has two lists: the full names and the abbreviations that are
+found in the ScriptExtensions.txt file. A list of script names and their
+abbreviations can be found in the PropertyValueAliases.txt file on the
+Unicode web site. There is also a Wikipedia page that lists them, and notes the
+Unicode version in which they were introduced:

http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts

+Once the script name lists have been updated, MultiStage2.py can be run to
+generate a new version of pcre2_ucd.c, and GenerateUtt.py can be run to
+generate the tricky tables for inclusion in pcre2_tables.c (which must be
+hand-edited). If MultiStage2.py gives the error "ValueError: list.index(x): x
+not in list", the cause is usually a missing (or misspelt) name in one of the
+lists of scripts.
+
The ucptest program can be compiled and used to check that the new tables in
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
-number of test characters. The source file ucptest.c must be updated whenever
-new Unicode script names are added.
+number of test characters. The source file ucptest.c should also be updated
+whenever new Unicode script names are added, and adding a few tests for new
+scripts is a good idea.

-Note also that both the pcre2syntax.3 and pcre2pattern.3 man pages contain
-lists of Unicode script names.

-
Preparing for a PCRE2 release
=============================

@@ -401,26 +409,6 @@
strings, at least one of which must be present for a match, efficient
pre-searching of large datasets could be implemented.

-. There's a Perl proposal for some new (* things, including alpha synonyms for 
-  the lookaround assertions:
-
-  (*pla: …)
-  (*plb: …)
-  (*nla: …)
-  (*nlb: …)
-  (*atomic: …)
-  (*positive_look_ahead:...)
-  (*negative_look_ahead:...)
-  (*positive_look_behind:...)
-  (*negative_look_behind:...)
-
-  Also a new one (with synonyms):
-
-  (*script_run: ...)        Ensure all captured chars are in the same script
-  (*sr: …)
-  (*atomic_script_run: …)   A combination of script_run and atomic
-  (*asr:...)
-
 . If pcre2grep had --first-line (match only in the first line) it could be 
   efficiently used to find files "starting with xxx". What about --last-line?


@@ -441,4 +429,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 21 August 2018
+Last updated: 07 October 2018
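As a footnote to the utf8 utility described in the README diff: the conversion it performs in both directions can be sketched with Python's standard codecs. The utility itself is a C program; this is only an illustration of the same mapping, using the README's own e188b4 example.

```python
# Sketch of the code point <-> UTF-8 byte mapping performed by the maint
# utf8 utility; Python's built-in UTF-8 codec does the same conversion.

def codepoint_to_utf8_hex(cp):
    """E.g. 0x1234 -> the hex string of its UTF-8 bytes."""
    return chr(cp).encode("utf-8").hex()

def utf8_hex_to_codepoint(hexbytes):
    """Concatenated UTF-8 bytes, e.g. 'e188b4' -> the code point."""
    return ord(bytes.fromhex(hexbytes).decode("utf-8"))

print(codepoint_to_utf8_hex(0x1234))         # e188b4
print(hex(utf8_hex_to_codepoint("e188b4")))  # 0x1234
```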