Revision: 1022
http://www.exim.org/viewvc/pcre2?view=rev&revision=1022
Author: ph10
Date: 2018-10-07 17:29:51 +0100 (Sun, 07 Oct 2018)
Log Message:
-----------
Documentation update for Script Extensions property coding.
Modified Paths:
--------------
code/trunk/maint/MultiStage2.py
code/trunk/maint/README
Modified: code/trunk/maint/MultiStage2.py
===================================================================
--- code/trunk/maint/MultiStage2.py 2018-10-06 17:39:52 UTC (rev 1021)
+++ code/trunk/maint/MultiStage2.py 2018-10-07 16:29:51 UTC (rev 1022)
@@ -8,7 +8,7 @@
# the upgrading of Unicode property support. The new code speeds up property
# matching many times. The script is for the use of PCRE maintainers, to
# generate the pcre2_ucd.c file that contains a digested form of the Unicode
-# data tables.
+# data tables. A number of extensions have been added to the original script.
#
# The script has now been upgraded to Python 3 for PCRE2, and should be run in
# the maint subdirectory, using the command
@@ -15,19 +15,21 @@
#
# [python3] ./MultiStage2.py >../src/pcre2_ucd.c
#
-# It requires five Unicode data tables: DerivedGeneralCategory.txt,
-# GraphemeBreakProperty.txt, Scripts.txt, CaseFolding.txt, and emoji-data.txt.
-# These must be in the maint/Unicode.tables subdirectory.
+# It requires six Unicode data tables: DerivedGeneralCategory.txt,
+# GraphemeBreakProperty.txt, Scripts.txt, ScriptExtensions.txt,
+# CaseFolding.txt, and emoji-data.txt. These must be in the
+# maint/Unicode.tables subdirectory.
#
# DerivedGeneralCategory.txt is found in the "extracted" subdirectory of the
# Unicode database (UCD) on the Unicode web site; GraphemeBreakProperty.txt is
-# in the "auxiliary" subdirectory. Scripts.txt and CaseFolding.txt are directly
-# in the UCD directory. The emoji-data.txt file is in files associated with
-# Unicode Technical Standard #51 ("Unicode Emoji"), for example:
+# in the "auxiliary" subdirectory. Scripts.txt, ScriptExtensions.txt, and
+# CaseFolding.txt are directly in the UCD directory. The emoji-data.txt file is
+# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"),
+# for example:
#
-# http://unicode.org/Public/emoji/11.0/emoji-data.txt
+# http://unicode.org/Public/emoji/11.0/emoji-data.txt
#
-#
+# -----------------------------------------------------------------------------
# Minor modifications made to this script:
# Added #! line at start
# Removed tabs
@@ -61,9 +63,33 @@
# property, which is used by PCRE2 as a grapheme breaking property. This was
# done when updating to Unicode 11.0.0 (July 2018).
#
-# Added code to add a Script Extensions field to records.
+# Added code to add a Script Extensions field to records. This has increased
+# their size from 8 to 12 bytes, only 10 of which are currently used.
#
+# 01-March-2010: Updated list of scripts for Unicode 5.2.0
+# 30-April-2011: Updated list of scripts for Unicode 6.0.0
+# July-2012: Updated list of scripts for Unicode 6.1.0
+# 20-August-2012: Added scan of GraphemeBreakProperty.txt and added a new
+# field in the record to hold the value. Luckily, the
+# structure had a hole in it, so the resulting table is
+# not much bigger than before.
+# 18-September-2012: Added code for multiple caseless sets. This uses the
+# final hole in the structure.
+# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
+# 13-May-2014: Updated for PCRE2
+# 03-June-2014: Updated for Python 3
+# 20-June-2014: Updated for Unicode 7.0.0
+# 12-August-2014: Updated to put Unicode version into the file
+# 19-June-2015: Updated for Unicode 8.0.0
+# 02-July-2017: Updated for Unicode 10.0.0
+# 03-July-2018: Updated for Unicode 11.0.0
+# 07-July-2018: Added code to scan emoji-data.txt for the Extended
+# Pictographic property.
+# 01-October-2018: Added the 'Unknown' script name
+# 03-October-2018: Added new field for Script Extensions
+# ----------------------------------------------------------------------------
#
+#
# The main tables generated by this script are used by macros defined in
# pcre2_internal.h. They look up Unicode character properties using short
# sequences of code that contains no branches, which makes for greater speed.
@@ -71,11 +97,11 @@
# Conceptually, there is a table of records (of type ucd_record), containing a
# script number, script extension value, character type, grapheme break type,
# offset to caseless matching set, offset to the character's other case, for
-# every character. However, a real table covering all Unicode characters would
-# be far too big. It can be efficiently compressed by observing that many
-# characters have the same record, and many blocks of characters (taking 128
-# characters in a block) have the same set of records as other blocks. This
-# leads to a 2-stage lookup process.
+# every Unicode character. However, a real table covering all Unicode
+# characters would be far too big. It can be efficiently compressed by
+# observing that many characters have the same record, and many blocks of
+# characters (taking 128 characters in a block) have the same set of records as
+# other blocks. This leads to a 2-stage lookup process.
#
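+# With the new Script Extensions field and two bytes of padding (described
+# further below), each record occupies 12 bytes, of which 10 are used. As a
+# hypothetical sketch of that layout in terms of Python's struct module (the
+# authoritative definition is the ucd_record structure in the PCRE2 C
+# sources, whose exact field types may differ):
+#
+#   import struct
+#   # script, chartype, gbprop, caseset: one byte each; other-case offset:
+#   # signed 32-bit; Script Extensions: signed 16-bit, since it may hold a
+#   # negated list offset; plus a 16-bit dummy field for padding.
+#   ucd_record_layout = struct.Struct('=BBBBihh')
+#   assert ucd_record_layout.size == 12
+#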
# This script constructs six tables. The ucd_caseless_sets table contains
# lists of characters that all match each other caselessly. Each list is
@@ -92,26 +118,32 @@
# Script Extensions properties of certain characters. Each list is terminated
# by zero (ucp_Unknown). A character with more than one script listed for its
# Script Extension property has a negative value in its record. This is the
-# negated offset to the start of the relevant list.
+# negated offset to the start of the relevant list in the ucd_script_sets
+# vector.
#
# The ucd_records table contains one instance of every unique record that is
-# required. The ucd_stage1 table is indexed by a character's block number, and
-# yields what is in effect a "virtual" block number. The ucd_stage2 table is a
-# table of "virtual" blocks; each block is indexed by the offset of a character
-# within its own block, and the result is the offset of the required record.
+# required. The ucd_stage1 table is indexed by a character's block number,
+# which is the character's code point divided by 128, since 128 is the size of
+# each block. The result of a lookup in ucd_stage1 is a "virtual" block number.
#
+# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by
+# the offset of a character within its own block, and the result is the index
+# number of the required record in the ucd_records vector.
+#
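+# As an illustrative sketch, not part of this script: if the generated
+# tables were loaded into Python lists named ucd_stage1, ucd_stage2, and
+# ucd_records, the two-stage lookup would amount to
+#
+#   def ucd_lookup(c):
+#     vblock = ucd_stage1[c // 128]   # stage 1: block number -> virtual block
+#     offset = c % 128                # offset of the character in its block
+#     return ucd_records[ucd_stage2[vblock * 128 + offset]]
+#
+# In PCRE2 itself the equivalent lookup is performed by branch-free macros in
+# pcre2_internal.h.
+#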
# The following examples are correct for the Unicode 11.0.0 database. Future
# updates may change the actual lookup values.
#
# Example: lowercase "a" (U+0061) is in block 0
# lookup 0 in stage1 table yields 0
-# lookup 97 in the first table in stage2 yields 16
-# record 17 is { 33, 5, 11, 0, -32 }
-# 33 = ucp_Latin => Latin script
+# lookup 97 (0x61) in the first table in stage2 yields 17
+# record 17 is { 34, 5, 12, 0, -32, 34, 0 }
+# 34 = ucp_Latin => Latin script
# 5 = ucp_Ll => Lower case letter
# 12 = ucp_gbOther => Grapheme break property "Other"
-# 0 => not part of a caseless set
-# -32 => Other case is U+0041
+# 0 => Not part of a caseless set
+# -32 (-0x20) => Other case is U+0041
+# 34 = ucp_Latin => No special Script Extension property
+# 0 => Dummy value, unused at present
#
# Almost all lowercase Latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
@@ -119,42 +151,34 @@
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
# lookup 96 in stage1 table yields 90
-# lookup 66 in the 90th table in stage2 yields 515
-# record 515 is { 26, 7, 11, 0, 0 }
-# 26 = ucp_Hiragana => Hiragana script
+# lookup 66 (0x42) in table 90 in stage2 yields 564
+# record 564 is { 27, 7, 12, 0, 0, 27, 0 }
+# 27 = ucp_Hiragana => Hiragana script
# 7 = ucp_Lo => Other letter
# 12 = ucp_gbOther => Grapheme break property "Other"
-# 0 => not part of a caseless set
+# 0 => Not part of a caseless set
# 0 => No other case
+# 27 = ucp_Hiragana => No special Script Extension property
+# 0 => Dummy value, unused at present
#
-# In these examples, no other blocks resolve to the same "virtual" block, as it
-# happens, but plenty of other blocks do share "virtual" blocks.
+# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39)
+# lookup 57 in stage1 table yields 55
+# lookup 80 (0x50) in table 55 in stage2 yields 458
+# record 458 is { 28, 12, 3, 0, 0, -101, 0 }
+# 28 = ucp_Inherited => Script inherited from predecessor
+# 12 = ucp_Mn => Non-spacing mark
+# 3 = ucp_gbExtend => Grapheme break property "Extend"
+# 0 => Not part of a caseless set
+# 0 => No other case
+# -101 => Script Extension list offset = 101
+# 0 => Dummy value, unused at present
#
+# At offset 101 in the ucd_script_sets vector we find the list 3, 15, 107, 29,
+# and terminator 0. This means that this character is expected to be used with
+# any of those scripts, which are Bengali, Devanagari, Grantha, and Kannada.
+#
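+# As a sketch of how a consumer decodes the Script Extensions field (again
+# assuming the tables are available as Python lists): a non-negative value
+# is a single script number; a negative value is the negated offset of a
+# zero-terminated list in ucd_script_sets:
+#
+#   def script_extension_list(scriptx):
+#     if scriptx >= 0:
+#       return [scriptx]              # the single script number itself
+#     offset = -scriptx               # negated offset into ucd_script_sets
+#     scripts = []
+#     while ucd_script_sets[offset] != 0:
+#       scripts.append(ucd_script_sets[offset])
+#       offset += 1
+#     return scripts
+#
+# For U+1CD0 above, script_extension_list(-101) yields [3, 15, 107, 29],
+# that is, Bengali, Devanagari, Grantha, and Kannada.
+#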
# Philip Hazel, 03 July 2008
-# Last Updated: 03 October 2018
-#
-#
-# 01-March-2010: Updated list of scripts for Unicode 5.2.0
-# 30-April-2011: Updated list of scripts for Unicode 6.0.0
-# July-2012: Updated list of scripts for Unicode 6.1.0
-# 20-August-2012: Added scan of GraphemeBreakProperty.txt and added a new
-# field in the record to hold the value. Luckily, the
-# structure had a hole in it, so the resulting table is
-# not much bigger than before.
-# 18-September-2012: Added code for multiple caseless sets. This uses the
-# final hole in the structure.
-# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
-# 13-May-2014: Updated for PCRE2
-# 03-June-2014: Updated for Python 3
-# 20-June-2014: Updated for Unicode 7.0.0
-# 12-August-2014: Updated to put Unicode version into the file
-# 19-June-2015: Updated for Unicode 8.0.0
-# 02-July-2017: Updated for Unicode 10.0.0
-# 03-July-2018: Updated for Unicode 11.0.0
-# 07-July-2018: Added code to scan emoji-data.txt for the Extended
-# Pictographic property.
-# 01-October-2018: Added the 'Unknown' script name
-# 03-October-2018: Added new field for Script Extensions
+# Last Updated: 07 October 2018
##############################################################################
@@ -175,13 +199,13 @@
if chardata[1] == 'C' or chardata[1] == 'S':
return int(chardata[2], 16) - int(chardata[0], 16)
return 0
-
+
# Parse a line of ScriptExtensions.txt
def get_script_extension(chardata):
this_script_list = list(chardata[1].split(' '))
if len(this_script_list) == 1:
return script_abbrevs.index(this_script_list[0])
-
+
script_numbers = []
for d in this_script_list:
script_numbers.append(script_abbrevs.index(d))
@@ -190,18 +214,18 @@
for i in range(1, len(script_lists) - script_numbers_length + 1):
for j in range(0, script_numbers_length):
- found = True
+ found = True
if script_lists[i+j] != script_numbers[j]:
- found = False
+ found = False
break
if found:
return -i
-
- # Not found in existing lists
-
+
+ # Not found in existing lists
+
return_value = len(script_lists)
script_lists.extend(script_numbers)
- return -return_value
+ return -return_value
# Read the whole table in memory, setting/checking the Unicode version
def read_table(file_name, get_value, default_value):
@@ -402,7 +426,7 @@
'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin',
'Old_Sogdian', 'Sogdian'
]
-
+
script_abbrevs = [
'Zzzz', 'Arab', 'Armn', 'Beng', 'Bopo', 'Brai', 'Bugi', 'Buhd', 'Cans',
'Cher', 'Zyyy', 'Copt', 'Cprt', 'Cyrl', 'Dsrt', 'Deva', 'Ethi', 'Geor',
@@ -434,7 +458,7 @@
'Zanb',
#New for Unicode 11.0.0
'Dogr', 'Gong', 'Rohg', 'Maka', 'Medf', 'Sogo', 'Sogd'
- ]
+ ]
category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
@@ -499,10 +523,10 @@
for i in range(0, MAX_UNICODE):
if scriptx[i] == script_abbrevs_default:
- scriptx[i] = script[i]
+ scriptx[i] = script[i]
-# With the addition of the new Script Extensions field, we need some padding
-# to get the Unicode records up to 12 bytes (multiple of 4). Set a value
+# With the addition of the new Script Extensions field, we need some padding
+# to get the Unicode records up to 12 bytes (multiple of 4). Set a value
# greater than 255 to make the field 16 bits.
padding_dummy = [0] * MAX_UNICODE
@@ -690,11 +714,11 @@
m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line)
if m is None:
continue
- first = int(m.group(1),16)
- last = int(m.group(2),16)
+ first = int(m.group(1),16)
+ last = int(m.group(2),16)
if ((last - first + 1) % 10) != 0:
print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last),
- file=sys.stderr)
+ file=sys.stderr)
while first < last:
digitsets.append(first + 9)
first += 10
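+# (Illustration, not in the script: a line such as
+# "0030..0039 ; Common # Nd [10] DIGIT ZERO..DIGIT NINE" contributes the
+# single entry 0x0039, the code point of "9", to digitsets.)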
@@ -724,9 +748,9 @@
print(" /* 0 */", end='')
for d in script_lists:
print(" %3d," % d, end='')
- count += 1
+ count += 1
if d == 0:
- print("\n /* %3d */" % count, end='')
+ print("\n /* %3d */" % count, end='')
print("\n};\n")
# Output the main UCD tables.
Modified: code/trunk/maint/README
===================================================================
--- code/trunk/maint/README 2018-10-06 17:39:52 UTC (rev 1021)
+++ code/trunk/maint/README 2018-10-07 16:29:51 UTC (rev 1022)
@@ -23,11 +23,12 @@
ManyConfigTests A shell script that runs "configure, make, test" a number of
times with different configuration settings.
-MultiStage2.py A Python script that generates the file pcre2_ucd.c from five
- Unicode data tables, which are themselves downloaded from the
+MultiStage2.py A Python script that generates the file pcre2_ucd.c from six
+ Unicode data files, which are themselves downloaded from the
Unicode web site. Run this script in the "maint" directory.
- The generated file contains the tables for a 2-stage lookup
- of Unicode properties.
+ The generated file is written to stdout. It contains the
+ tables for a 2-stage lookup of Unicode properties, along with
+ some auxiliary tables.
pcre2_chartables.c.non-standard
This is a set of character tables that came from a Windows
@@ -40,14 +41,15 @@
Unicode.tables The files in this directory were downloaded from the Unicode
web site. They contain information about Unicode characters
and scripts. The ones used by the MultiStage2.py script are
- CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
- GraphemeBreakProperty.txt, and emoji-data.txt. I've kept
- UnicodeData.txt (which is no longer used by the script)
- because it is useful occasionally for manually looking up the
- details of certain characters. However, note that character
- names in this file such as "Arabic sign sanah" do NOT mean
- that the character is in a particular script (in this case,
- Arabic). Scripts.txt is where to look for script information.
+ CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
+ ScriptExtensions.txt, GraphemeBreakProperty.txt, and
+ emoji-data.txt. I've kept UnicodeData.txt (which is no longer
+ used by the script) because it is useful occasionally for
+ manually looking up the details of certain characters.
+ However, note that character names in this file such as
+ "Arabic sign sanah" do NOT mean that the character is in a
+ particular script (in this case, Arabic). Scripts.txt and
+ ScriptExtensions.txt are where to look for script information.
ucptest.c A short C program for testing the Unicode property macros
that do lookups in the pcre2_ucd.c data, mainly useful after
@@ -61,7 +63,7 @@
point into a sequence of bytes in the UTF-8 encoding, and vice
versa. If its argument is a hex number such as 0x1234, it
outputs a list of the equivalent UTF-8 bytes. If its argument
- is sequence of concatenated UTF-8 bytes (e.g. e188b4) it
+ is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
treats them as a UTF-8 character and outputs the equivalent
code point in hex.
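
As an independent cross-check of the e188b4 example (plain Python, not
part of the PCRE2 tools):

  >>> chr(0x1234).encode('utf-8').hex()
  'e188b4'
  >>> hex(ord(bytes.fromhex('e188b4').decode('utf-8')))
  '0x1234'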
@@ -72,27 +74,33 @@
When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
-GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
-can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
-run to generate the tricky tables for inclusion in pcre2_tables.c.
+GenerateUtt.py scripts must be edited to add the new names. I have been adding
+each new group at the end of the relevant list, with a comment. Note also that
+both the pcre2syntax.3 and pcre2pattern.3 man pages contain lists of Unicode
+script names.
-If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
-the cause is usually a missing (or misspelt) name in the list of scripts. I
-couldn't find a straightforward list of scripts on the Unicode site, but
-there's a useful Wikipedia page that lists them, and notes the Unicode version
-in which they were introduced:
+MultiStage2.py has two lists: the full names and the abbreviations that are
+found in the ScriptExtensions.txt file. A list of script names and their
+abbreviations can be found in the PropertyValueAliases.txt file on the
+Unicode web site. There is also a Wikipedia page that lists them, and notes the
+Unicode version in which they were introduced:
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
+Once the script name lists have been updated, MultiStage2.py can be run to
+generate a new version of pcre2_ucd.c, and GenerateUtt.py can be run to
+generate the tricky tables for inclusion in pcre2_tables.c (which must be
+hand-edited). If MultiStage2.py gives the error "ValueError: list.index(x): x
+not in list", the cause is usually a missing (or misspelt) name in one of the
+lists of scripts.
+
The ucptest program can be compiled and used to check that the new tables in
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
-number of test characters. The source file ucptest.c must be updated whenever
-new Unicode script names are added.
+number of test characters. The source file ucptest.c should also be updated
+whenever new Unicode script names are added, and adding a few tests for new
+scripts is a good idea.
-Note also that both the pcre2syntax.3 and pcre2pattern.3 man pages contain
-lists of Unicode script names.
-
Preparing for a PCRE2 release
=============================
@@ -401,26 +409,6 @@
strings, at least one of which must be present for a match, efficient
pre-searching of large datasets could be implemented.
-. There's a Perl proposal for some new (* things, including alpha synonyms for
- the lookaround assertions:
-
- (*pla: …)
- (*plb: …)
- (*nla: …)
- (*nlb: …)
- (*atomic: …)
- (*positive_look_ahead:...)
- (*negative_look_ahead:...)
- (*positive_look_behind:...)
- (*negative_look_behind:...)
-
- Also a new one (with synonyms):
-
- (*script_run: ...) Ensure all captured chars are in the same script
- (*sr: …)
- (*atomic_script_run: …) A combination of script_run and atomic
- (*asr:...)
-
. If pcre2grep had --first-line (match only in the first line) it could be
efficiently used to find files "starting with xxx". What about --last-line?
@@ -441,4 +429,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 21 August 2018
+Last updated: 07 October 2018