Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [1241] code/trunk/maint: Tidies and updates to maintenance programs utf8 and ucptest.
Revision: 1241
          http://www.exim.org/viewvc/pcre2?view=rev&revision=1241
Author:   ph10
Date:     2020-04-01 18:00:33 +0100 (Wed, 01 Apr 2020)
Log Message:
-----------
Tidies and updates to maintenance programs utf8 and ucptest.


Modified Paths:
--------------
    code/trunk/maint/README
    code/trunk/maint/ucptest.c
    code/trunk/maint/ucptestdata/testinput1
    code/trunk/maint/ucptestdata/testoutput1
    code/trunk/maint/utf8.c


Added Paths:
-----------
    code/trunk/maint/ucptestdata/testinput2
    code/trunk/maint/ucptestdata/testoutput2


Modified: code/trunk/maint/README
===================================================================
--- code/trunk/maint/README    2020-03-25 17:18:33 UTC (rev 1240)
+++ code/trunk/maint/README    2020-04-01 17:00:33 UTC (rev 1241)
@@ -54,10 +54,12 @@
 ucptest.c        A short C program for testing the Unicode property macros
                  that do lookups in the pcre2_ucd.c data, mainly useful after
                  rebuilding the Unicode property table. Compile and run this in
-                 the "maint" directory (see comments at its head).
+                 the "maint" directory (see comments at its head). This program 
+                 can also be used to find characters with specific properties. 


-ucptestdata      A directory containing two files, testinput1 and testoutput1,
-                 to use in conjunction with the ucptest program.
+ucptestdata      A directory containing four files, testinput{1,2} and
+                 testoutput{1,2}, for use in conjunction with the ucptest
+                 program.


 utf8.c           A short, freestanding C program for converting a Unicode code
                  point into a sequence of bytes in the UTF-8 encoding, and vice
@@ -65,7 +67,7 @@
                  outputs a list of the equivalent UTF-8 bytes. If its argument
                  is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                  treats them as a UTF-8 character and outputs the equivalent
-                 code point in hex.
+                 code point in hex. See comments at its head for details.



Updating to a new Unicode release
@@ -96,9 +98,10 @@

The ucptest program can be compiled and used to check that the new tables in
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
-number of test characters. The source file ucptest.c should also be updated
-whenever new Unicode script names are added, and adding a few tests for new
-scripts is a good idea.
+number of test characters. It used to be necessary to update the source
+ucptest.c whenever new Unicode scripts were added, but this is no longer
+required because that program now uses the lists in the PCRE2 source. However,
+adding a few tests for new scripts to the files in ucptestdata is a good idea.


Preparing for a PCRE2 release
@@ -437,4 +440,4 @@
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 03 June 2019
+Last updated: 01 April 2020
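
Note on the utf8.c README entry above: it gives e188b4 as an example of a UTF-8 byte sequence. The following is a minimal sketch of the two conversions, not the code in utf8.c. The names encode_utf8 and decode_utf8 are hypothetical. It assumes code points no larger than U+10FFFF (so at most 4 bytes, unlike the 6-byte ord2utf8 added to ucptest.c in this revision), does not reject surrogates, and does not validate continuation bytes when decoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode cp into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp > 0x10ffff. */
static size_t encode_utf8(unsigned int cp, unsigned char *out)
{
  assert(out != NULL);
  if (cp <= 0x7f)
    {
    out[0] = (unsigned char)cp;
    return 1;
    }
  if (cp <= 0x7ff)
    {
    out[0] = (unsigned char)(0xc0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3f));
    return 2;
    }
  if (cp <= 0xffff)
    {
    out[0] = (unsigned char)(0xe0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (cp & 0x3f));
    return 3;
    }
  if (cp <= 0x10ffff)
    {
    out[0] = (unsigned char)(0xf0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char)(0x80 | (cp & 0x3f));
    return 4;
    }
  return 0;
}

/* Hypothetical helper: decode one character of len bytes (1 to 4).
   Returns 0xffffffff for an unsupported length. The continuation
   bytes are masked but not checked for the 10xxxxxx pattern. */
static unsigned int decode_utf8(const unsigned char *in, size_t len)
{
  assert(in != NULL);
  switch (len)
    {
    case 1: return in[0];
    case 2: return ((in[0] & 0x1fu) << 6) | (in[1] & 0x3fu);
    case 3: return ((in[0] & 0x0fu) << 12) | ((in[1] & 0x3fu) << 6) |
                   (in[2] & 0x3fu);
    case 4: return ((in[0] & 0x07u) << 18) | ((in[1] & 0x3fu) << 12) |
                   ((in[2] & 0x3fu) << 6) | (in[3] & 0x3fu);
    default: return 0xffffffffu;
    }
}
```

With these helpers, U+1234 encodes to the three bytes e1 88 b4, and decoding those bytes gives 0x1234 again, which matches the README example.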

Modified: code/trunk/maint/ucptest.c
===================================================================
--- code/trunk/maint/ucptest.c    2020-03-25 17:18:33 UTC (rev 1240)
+++ code/trunk/maint/ucptest.c    2020-04-01 17:00:33 UTC (rev 1241)
@@ -16,28 +16,45 @@
 /* This is a hacked-up program for testing the Unicode properties tables of
 PCRE2. It can also be used for finding characters with certain properties.
 I wrote it to help with debugging PCRE, and have added things that I found
-useful, in a rather haphazard way. The code has never been "tidied" or checked
-for robustness.
+useful, in a rather haphazard way. The code has never been seriously tidied or
+checked for robustness, but it shouldn't now give compiler warnings.


-If there are arguments, they are a list of hexadecimal code points whose
-properties are to be output. Otherwise, the program expects to read commands on
-stdin, and it writes output to stdout. There are two commands:
+There is only one option: "-s". If given, it applies only to the "findprop"
+command. It causes the UTF-8 sequence of bytes that encode the character to be
+output between angle brackets at the end of the line. On a UTF-8 terminal, this
+will show the appropriate graphic for the code point.

-"findprop" must be followed by a list of Unicode code points as hex numbers
-(without any prefixes). The output is one line per character, giving its
-Unicode properties followed by its other case if there is one, followed by its
-Script Extension list if it is not just the same as the base script.
+If the command has arguments, they are concatenated into a buffer, separated by
+spaces. If the first argument starts "U+" or consists entirely of hexadecimal
+digits, "findprop" is inserted at the start. The buffer is then processed as a
+single line file, after which the program exits. If there are no arguments, the
+program reads commands line by line on stdin and writes output to stdout. The
+return code is always zero.

-"find" must be followed by a list of property names and their values. This
-finds characters that have those properties. If multiple properties are listed,
-they must all be matched. Currently supported:
+There are three commands:

+"findprop" must be followed by a space-separated list of Unicode code points as
+hex numbers, either without any prefix or starting with "U+". The output is one
+line per character, giving its Unicode properties followed by its other case or 
+cases if one or more exist, followed by its Script Extension list if it is not
+just the same as the base script. This list is in square brackets. The
+properties are:
+
+General type        e.g. Letter
+Specific type       e.g. Upper case letter
+Script              e.g. Medefaidrin
+Grapheme break type e.g. Extend (most common is Other)
+
+"find" must be followed by a list of property names and their values. The 
+values are case-sensitive. This finds characters that have those properties. If
+multiple properties are listed, they must all be matched. Currently supported:
+
   script <name>    The character must have this script property. Only one
                      such script may be given.
   scriptx <name>   This script must be in the character's Script Extension
                      property list. If this is used many times, all the given
                      scripts must be present.
-  type <abbrev>    The character's type (e.g. Lu or Nd) must match.
+  type <abbrev>    The character's specific type (e.g. Lu or Nd) must match.
   gbreak <name>    The grapheme break property must match.


If a <name> or <abbrev> is preceded by !, the value must NOT be present. For
@@ -44,9 +61,14 @@
Script Extensions, there may be a mixture of positive and negative
requirements. All must be satisfied.

-No more than 100 characters are output. If there are more, the list ends with
-... */
+Sequences of two or more characters are shown as ranges, for example
+U+0041..U+004A. No more than 100 lines are output. If there are more
+characters, the list ends with ...

+"list" must be followed by a property name (script, type, or gbreak). The
+defined values for that property are listed. */
+
+
#ifdef HAVE_CONFIG_H
#include "../src/config.h"
#endif
@@ -91,231 +113,102 @@

/* -------------------------------------------------------------------*/

+static BOOL show_character = FALSE;

-const unsigned char *script_names[] = {
- US"Unknown",
- US"Arabic",
- US"Armenian",
- US"Bengali",
- US"Bopomofo",
- US"Braille",
- US"Buginese",
- US"Buhid",
- US"Canadian_Aboriginal",
- US"Cherokee",
- US"Common",
- US"Coptic",
- US"Cypriot",
- US"Cyrillic",
- US"Deseret",
- US"Devanagari",
- US"Ethiopic",
- US"Georgian",
- US"Glagolitic",
- US"Gothic",
- US"Greek",
- US"Gujarati",
- US"Gurmukhi",
- US"Han",
- US"Hangul",
- US"Hanunoo",
- US"Hebrew",
- US"Hiragana",
- US"Inherited",
- US"Kannada",
- US"Katakana",
- US"Kharoshthi",
- US"Khmer",
- US"Lao",
- US"Latin",
- US"Limbu",
- US"Linear_B",
- US"Malayalam",
- US"Mongolian",
- US"Myanmar",
- US"New_Tai_Lue",
- US"Ogham",
- US"Old_Italic",
- US"Old_Persian",
- US"Oriya",
- US"Osmanya",
- US"Runic",
- US"Shavian",
- US"Sinhala",
- US"Syloti_Nagri",
- US"Syriac",
- US"Tagalog",
- US"Tagbanwa",
- US"Tai_Le",
- US"Tamil",
- US"Telugu",
- US"Thaana",
- US"Thai",
- US"Tibetan",
- US"Tifinagh",
- US"Ugaritic",
- US"Yi",
- /* New for Unicode 5.0: */
- US"Balinese",
- US"Cuneiform",
- US"Nko",
- US"Phags_Pa",
- US"Phoenician",
- /* New for Unicode 5.1: */
- US"Carian",
- US"Cham",
- US"Kayah_Li",
- US"Lepcha",
- US"Lycian",
- US"Lydian",
- US"Ol_Chiki",
- US"Rejang",
- US"Saurashtra",
- US"Sundanese",
- US"Vai",
- /* New for Unicode 5.2: */
- US"Avestan",
- US"Bamum",
- US"Egyptian_Hieroglyphs",
- US"Imperial_Aramaic",
- US"Inscriptional_Pahlavi",
- US"Inscriptional_Parthian",
- US"Javanese",
- US"Kaithi",
- US"Lisu",
- US"Meetei_Mayek",
- US"Old_South_Arabian",
- US"Old_Turkic",
- US"Samaritan",
- US"Tai_Tham",
- US"Tai_Viet",
- /* New for Unicode 6.0.0 */
- US"Batak",
- US"Brahmi",
- US"Mandaic",
- /* New for Unicode 6.1.0 */
- US"Chakma",
- US"Meroitic_Cursive",
- US"Meroitic_Hieroglyphs",
- US"Miao",
- US"Sharada",
- US"Sora Sompent",
- US"Takri",
- /* New for Unicode 7.0.0 */
- US"Bassa_Vah",
- US"Caucasian_Albanian",
- US"Duployan",
- US"Elbasan",
- US"Grantha",
- US"Khojki",
- US"Khudawadi",
- US"Linear_A",
- US"Mahajani",
- US"Manichaean",
- US"Mende_Kikakui",
- US"Modi",
- US"Mro",
- US"Nabataean",
- US"Old_North_Arabian",
- US"Old_Permic",
- US"Pahawh_Hmong",
- US"Palmyrene",
- US"Psalter_Pahlavi",
- US"Pau_Cin_Hau",
- US"Siddham",
- US"Tirhuta",
- US"Warang_Citi",
- /* New for Unicode 8.0.0 */
- US"Ahom",
- US"Anatolian_Hieroglyphs",
- US"Hatran",
- US"Multani",
- US"Old_Hungarian",
- US"SignWriting",
- /* New for Unicode 10.0.0 (no update since 8.0.0) */
- US"Adlam",
- US"Bhaiksuki",
- US"Marchen",
- US"Newa",
- US"Osage",
- US"Tangut",
- US"Masaram_Gondi",
- US"Nushu",
- US"Soyombo",
- US"Zanabazar_Square",
- /* New for Unicode 11.0.0 */
- US"Dogra",
- US"Gunjala_Gondi",
- US"Hanifi_Rohingya",
- US"Makasar",
- US"Medefaidrin",
- US"Old_Sogdian",
- US"Sogdian",
- /* New for Unicode 12.0.0 */
- US"Elymaic",
- US"Nandinagari",
- US"Nyiakeng_Puachue_Hmong",
- US"Wancho",
- /* New for Unicode 13.0.0 */
- US"Chorasmian",
- US"Dives_Akuru",
- US"Khitan_Small_Script",
- US"Yezidi"
+static const unsigned char *type_names[] = {
+ US"Cc", US"Control",
+ US"Cf", US"Format",
+ US"Cn", US"Unassigned",
+ US"Co", US"Private use",
+ US"Cs", US"Surrogate",
+ US"Ll", US"Lower case letter",
+ US"Lm", US"Modifier letter",
+ US"Lo", US"Other letter",
+ US"Lt", US"Title case letter",
+ US"Lu", US"Upper case letter",
+ US"Mc", US"Spacing mark",
+ US"Me", US"Enclosing mark",
+ US"Mn", US"Non-spacing mark",
+ US"Nd", US"Decimal number",
+ US"Nl", US"Letter number",
+ US"No", US"Other number",
+ US"Pc", US"Connector punctuation",
+ US"Pd", US"Dash punctuation",
+ US"Pe", US"Close punctuation",
+ US"Pf", US"Final punctuation",
+ US"Pi", US"Initial punctuation",
+ US"Po", US"Other punctuation",
+ US"Ps", US"Open punctuation",
+ US"Sc", US"Currency symbol",
+ US"Sk", US"Modifier symbol",
+ US"Sm", US"Mathematical symbol",
+ US"So", US"Other symbol",
+ US"Zl", US"Line separator",
+ US"Zp", US"Paragraph separator",
+ US"Zs", US"Space separator"
};

-const unsigned char *type_names[] = {
-  US"Cc",
-  US"Cf",
-  US"Cn",
-  US"Co",
-  US"Cs",
-  US"Ll",
-  US"Lm",
-  US"Lo",
-  US"Lt",
-  US"Lu",
-  US"Mc",
-  US"Me",
-  US"Mn",
-  US"Nd",
-  US"Nl",
-  US"No",
-  US"Pc",
-  US"Pd",
-  US"Pe",
-  US"Pf",
-  US"Pi",
-  US"Po",
-  US"Ps",
-  US"Sc",
-  US"Sk",
-  US"Sm",
-  US"So",
-  US"Zl",
-  US"Zp",
-  US"Zs"
+static const unsigned char *gb_names[] = {
+  US"CR",                    US"carriage return",
+  US"LF",                    US"linefeed",
+  US"Control",               US"",
+  US"Extend",                US"",
+  US"Prepend",               US"",
+  US"SpacingMark",           US"",
+  US"L",                     US"Hangul syllable type L",
+  US"V",                     US"Hangul syllable type V",
+  US"T",                     US"Hangul syllable type T",
+  US"LV",                    US"Hangul syllable type LV",
+  US"LVT",                   US"Hangul syllable type LVT",
+  US"RegionalIndicator",     US"",
+  US"Other",                 US"",
+  US"ZWJ",                   US"zero width joiner",
+  US"Extended_Pictographic", US""
 };


-const unsigned char *gb_names[] = {
- US"CR",
- US"LF",
- US"Control",
- US"Extend",
- US"Prepend",
- US"SpacingMark",
- US"L",
- US"V",
- US"T",
- US"LV",
- US"LVT",
- US"RegionalIndicator",
- US"Other",
- US"ZWJ",
- US"Extended_Pictographic"
-};

+static const unsigned int utf8_table1[] = {
+ 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};

+static const int utf8_table2[] = {
+  0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};
+
+
 /*************************************************
+*       Convert character value to UTF-8         *
+*************************************************/
+
+/* This function takes an unsigned long integer value in the range 0 -
+0x7fffffff and encodes it as a UTF-8 character in 1 to 6 bytes.
+
+Arguments:
+  cvalue     the character value
+  buffer     pointer to buffer for result - at least 6 bytes long
+
+Returns:     number of bytes placed in the buffer
+             0 if input code point is too big
+*/
+
+static size_t
+ord2utf8(unsigned int cvalue, unsigned char *buffer)
+{
+size_t i, j;
+for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
+  if (cvalue <= utf8_table1[i]) break;
+if (i >= sizeof(utf8_table1)/sizeof(int)) return 0;
+buffer += i;
+for (j = i; j > 0; j--)
+ {
+ *buffer-- = 0x80 | (cvalue & 0x3f);
+ cvalue >>= 6;
+ }
+*buffer = utf8_table2[i] | cvalue;
+return i + 1;
+}
+
+
+
+/*************************************************
 *             Test for interaction               *
 *************************************************/


@@ -331,11 +224,33 @@


 /*************************************************
+*      Get script name from ucp ident            *
+*************************************************/
+
+static const char *
+get_scriptname(int script)
+{
+size_t i;
+const ucp_type_table *u;
+
+for (i = 0; i < PRIV(utt_size); i++)
+  {
+  u = PRIV(utt) + i; 
+  if (u->type == PT_SC && u->value == script) break;
+  }
+if (i < PRIV(utt_size))
+  return PRIV(utt_names) + u->name_offset;
+  
+return "??";
+}  
+
+
+/*************************************************
 *      Print Unicode property info for a char    *
 *************************************************/


static void
-print_prop(int c)
+print_prop(unsigned int c, BOOL is_just_one)
{
int type = UCD_CATEGORY(c);
int fulltype = UCD_CHARTYPE(c);
@@ -342,17 +257,14 @@
int script = UCD_SCRIPT(c);
int scriptx = UCD_SCRIPTX(c);
int gbprop = UCD_GRAPHBREAK(c);
-int othercase = UCD_OTHERCASE(c);
+unsigned int othercase = UCD_OTHERCASE(c);
int caseset = UCD_CASESET(c);

const unsigned char *fulltypename = US"??";
const unsigned char *typename = US"??";
-const unsigned char *scriptname = US"??";
const unsigned char *graphbreak = US"??";
+const unsigned char *scriptname = CUS get_scriptname(script);

-if (script < sizeof(script_names)/sizeof(char *))
-  scriptname = script_names[script];
-
 switch (type)
   {
   case ucp_C: typename = US"Control"; break;
@@ -420,15 +332,18 @@
   default:                 graphbreak = US"Unknown"; break;
   }


-printf("%04x %s: %s, %s, %s", c, typename, fulltypename, scriptname, graphbreak);
-if (othercase != c)
+printf("U+%04X %s: %s, %s, %s", c, typename, fulltypename, scriptname, graphbreak);
+if (is_just_one && othercase != c)
   {
-  printf(", %04x", othercase);
+  printf(", U+%04X", othercase);
   if (caseset != 0)
     {
     const uint32_t *p = PRIV(ucd_caseless_sets) + caseset - 1;
     while (*(++p) < NOTACHAR)
-      if (*p != othercase && *p != c) printf(", %04x", *p);
+      {
+      unsigned int d = *p;  
+      if (d != othercase && d != c) printf(", U+%04X", d);
+      } 
     }
   }


@@ -436,25 +351,26 @@
   {
   printf(", [");
   if (scriptx >= 0)
-    {
-    scriptname = (scriptx >= sizeof(script_names)/sizeof(char *))?
-      US"??" : script_names[scriptx];
-    printf("%s", scriptname);
-    }
+    printf("%s", get_scriptname(scriptx));
   else
     {
-    char *sep = "";
+    const char *sep = "";
     const uint8_t *p = PRIV(ucd_script_sets) - scriptx;
     while (*p != 0)
       {
-      scriptname = (*p >= sizeof(script_names)/sizeof(char *))?
-        US"??" : script_names[*p++];
-      printf("%s%s", sep, scriptname);
+      printf("%s%s", sep, get_scriptname(*p++));
       sep = ", ";
       }
     }
   printf("]");
   }
+  
+if (show_character && is_just_one)
+  {
+  unsigned char buffer[8];
+  size_t len = ord2utf8(c, buffer);
+  printf(", >%.*s<", (int)len, buffer);  
+  }  


 printf("\n");
 }
@@ -483,7 +399,7 @@
 BOOL gbreak_not = FALSE;
 BOOL hadrange = FALSE;
 const ucd_record *ucd, *next_ucd;
-const char *pad = "      ";
+const char *pad = "        ";


 while (*s != 0)
   {
@@ -508,17 +424,20 @@
       offset = 1;
       }


-    for (i = 0; i < sizeof(script_names)/sizeof(char *); i++)
+    for (i = 0; i < PRIV(utt_size); i++)
       {
-      if (strcmp(CS value + offset, script_names[i]) == 0)
+      const ucp_type_table *u = PRIV(utt) + i; 
+      if (u->type == PT_SC && strcmp(CS(value + offset), 
+            PRIV(utt_names) + u->name_offset) == 0)
         {
+        c = u->value; 
         if (name[6] == 'x')
           {
-          scriptx_list[scriptx_count++] = scriptx_not? (-i):i;
+          scriptx_list[scriptx_count++] = scriptx_not? (-c):c;
           }
         else
           {
-          if (script < 0) script = i; else
+          if (script < 0) script = c; else
             {
             printf("** Only 1 script value allowed\n");
             return;
@@ -528,9 +447,9 @@
         }
       }


-    if (i >= sizeof(script_names)/sizeof(char *))
+    if (i >= PRIV(utt_size))
       {
-      printf("** Unrecognized script name '%s'\n", value);
+      printf("** Unrecognized script name \"%s\"\n", value);
       return;
       }
     }
@@ -550,17 +469,17 @@
         offset = 1;
         }


-      for (i = 0; i < sizeof(type_names)/sizeof(char *); i++)
+      for (i = 0; i < sizeof(type_names)/sizeof(char *); i += 2)
         {
-        if (strcmp(CS (value + offset), type_names[i]) == 0)
+        if (strcmp(CS (value + offset), CS type_names[i]) == 0)
           {
-          type = i;
+          type = i/2;
           break;
           }
         }
       if (i >= sizeof(type_names)/sizeof(char *))
         {
-        printf("** Unrecognized type name '%s'\n", value);
+        printf("** Unrecognized type name \"%s\"\n", value);
         return;
         }
       }
@@ -581,17 +500,17 @@
         offset = 1;
         }


-      for (i = 0; i < sizeof(gb_names)/sizeof(char *); i++)
+      for (i = 0; i < sizeof(gb_names)/sizeof(char *); i += 2)
         {
-        if (strcmp(CS (value + offset), gb_names[i]) == 0)
+        if (strcmp(CS (value + offset), CS gb_names[i]) == 0)
           {
-          gbreak = i;
+          gbreak = i/2;
           break;
           }
         }
       if (i >= sizeof(gb_names)/sizeof(char *))
         {
-        printf("** Unrecognized gbreak name '%s'\n", value);
+        printf("** Unrecognized gbreak name \"%s\"\n", value);
         return;
         }
       }
@@ -599,7 +518,7 @@


   else
     {
-    printf("** Unrecognized property name '%s'\n", name);
+    printf("** Unrecognized property name \"%s\"\n", name);
     return;
     }
   }
@@ -617,7 +536,7 @@
   if (scriptx_count > 0)
     {
     const uint8_t *char_scriptx = NULL;
-    int found = 0;
+    unsigned int found = 0;
     int scriptx = UCD_SCRIPTX(c);


     if (scriptx < 0) char_scriptx = PRIV(ucd_script_sets) - scriptx;
@@ -701,13 +620,13 @@


   if (--i > c)
     {
-    printf("%04x..", c);
+    printf("U+%04X..", c);
     c = i;
     hadrange = TRUE;
     }
   else if (hadrange) printf("%s", pad);


-  print_prop(c);
+  print_prop(c, FALSE);
   if (c >= 0x100000) pad = "        ";
     else if (c >= 0x10000) pad = "       ";
   count++;
@@ -723,6 +642,101 @@



 /*************************************************
+*        Process command line                    *
+*************************************************/
+
+static void
+process_command_line(unsigned char *buffer)
+{
+unsigned char *s, *t;
+unsigned char name[24];
+
+s = buffer;
+while (isspace(*s)) s++;
+if (*s == 0) return;
+
+for (t = name; *s != 0 && !isspace(*s); s++) *t++ = *s;
+*t = 0;
+while (isspace(*s)) s++;
+
+if (strcmp(CS name, "findprop") == 0)
+  {
+  while (*s != 0)
+    {
+    unsigned int c; 
+    unsigned char *endptr;
+    t = s; 
+    if (strncmp(CS t, "U+", 2) == 0) t += 2;
+    c = strtoul(CS t, CSS(&endptr), 16);
+    if (*endptr != 0 && !isspace(*endptr))
+      {
+      while (*endptr != 0 && !isspace(*endptr)) endptr++;
+      printf("** Invalid hex number: ignored \"%.*s\"\n", (int)(endptr-s), s);
+      }
+    else  
+      {
+      if (c > 0x10ffff) 
+        printf("** U+%x is too big for a Unicode code point\n", c);
+      else   
+        print_prop(c, TRUE);
+      } 
+    s = endptr;
+    while (isspace(*s)) s++;
+    }
+  }
+
+else if (strcmp(CS name, "find") == 0)
+  {
+  find_chars(s);
+  }
+  
+else if (strcmp(CS name, "list") == 0)
+  {
+  while (*s != 0)
+    {
+    size_t i;
+    for (t = name; *s != 0 && !isspace(*s); s++) *t++ = *s;
+    *t = 0;
+    while (isspace(*s)) s++;
+    
+    if (strcmp(CS name, "script") == 0 || strcmp(CS name, "scripts") == 0)
+      {
+      for (i = 0; i < PRIV(utt_size); i++) 
+        if (PRIV(utt)[i].type == PT_SC)
+          printf("%s\n", PRIV(utt_names) + PRIV(utt)[i].name_offset);  
+      }
+      
+    else if (strcmp(CS name, "type") == 0 || strcmp(CS name, "types") == 0)
+      {
+      for (i = 0; i < sizeof(type_names)/sizeof(char *); i += 2)
+        printf("%s %s\n", type_names[i], type_names[i+1]); 
+      }  
+      
+    else if (strcmp(CS name, "gbreak") == 0 || strcmp(CS name, "gbreaks") == 0)
+      {
+      for (i = 0; i < sizeof(gb_names)/sizeof(char *); i += 2)
+        {
+        if (gb_names[i+1][0] != 0)  
+          printf("%-3s (%s)\n", gb_names[i], gb_names[i+1]);
+        else   
+          printf("%s\n", gb_names[i]);
+        } 
+      }    
+
+    else 
+      {
+      printf("** Unknown property \"%s\"\n", name);  
+      break;
+      }  
+    }  
+  }  
+
+else printf("** Unknown test command \"%s\"\n", name);
+}
+
+
+
+/*************************************************
 *               Main program                     *
 *************************************************/


@@ -730,19 +744,42 @@
main(int argc, char **argv)
{
BOOL interactive;
+int first_arg = 1;
unsigned char buffer[1024];

-if (argc > 1)
+if (argc > 1 && strcmp(argv[1], "-s") == 0)
   {
+  show_character = TRUE;
+  first_arg++;
+  }   
+
+if (argc > first_arg)
+  {
   int i;
-  for (i = 1; i < argc; i++)
+  BOOL hexfirst = TRUE; 
+  char *arg = argv[first_arg]; 
+  unsigned char *s = buffer;
+  
+  if (strncmp(arg, "U+", 2) != 0 && !isdigit(*arg)) 
     {
-    unsigned char *endptr;
-    int c = strtoul(argv[i], CSS(&endptr), 16);
-    if (*endptr != 0)
-      printf("** Hex number expected; ignored '%s'\n", argv[i]);
-    else print_prop(c);
+    while (*arg != 0) 
+      {
+      if (!isxdigit(*arg++)) { hexfirst = FALSE; break; }  
+      } 
+    } 
+     
+  if (hexfirst)
+    {
+    strcpy(CS s, "findprop ");
+    s += 9;
     }
+    
+  for (i = first_arg; i < argc; i++)
+    {
+    s += sprintf(CS s, "%s ", argv[i]);       
+    }
+
+  process_command_line(buffer);
   return 0;
   }


@@ -754,17 +791,14 @@

 for(;;)
   {
-  unsigned char name[24];
-  unsigned char *s, *t;
-
 #if defined(SUPPORT_LIBREADLINE) || defined(SUPPORT_LIBEDIT)
   if (interactive)
     {
     size_t len;
-    s = readline("> ");
+    unsigned char *s = US readline("> ");
     if (s == NULL) break;
-    len = strlen(s);
-    if (len > 0) add_history(s);
+    len = strlen(CS s);
+    if (len > 0) add_history(CS s);
     memcpy(buffer, s, len);
     buffer[len] = '\n';
     buffer[len+1] = 0;
@@ -778,39 +812,8 @@
     if (fgets(CS buffer, sizeof(buffer), stdin) == NULL) break;
     if (!interactive) printf("%s", buffer);
     }
-
-  s = buffer;
-  while (isspace(*s)) s++;
-  if (*s == 0) continue;
-
-  for (t = name; *s != 0 && !isspace(*s); s++) *t++ = *s;
-  *t = 0;
-  while (isspace(*s)) s++;
-
-  if (strcmp(CS name, "findprop") == 0)
-    {
-    while (*s != 0)
-      {
-      unsigned char *endptr;
-      int c = strtoul(CS s, CSS(&endptr), 16);
-
-      if (*endptr != 0 && !isspace(*endptr))
-        {
-        while (*endptr != 0 && !isspace(*endptr)) endptr++;
-        printf("** Hex number expected; ignored '%.*s'\n", endptr-s, s);
-        }
-      else  print_prop(c);
-      s = endptr;
-      while (isspace(*s)) s++;
-      }
-    }
-
-  else if (strcmp(CS name, "find") == 0)
-    {
-    find_chars(s);
-    }
-
-  else printf("** Unknown test command %s\n", name);
+    
+  process_command_line(buffer);
   }


if (interactive) printf("\n");
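
Note on the new argument handling in main() above: the head comment says that the arguments are joined with spaces and that "findprop" is inserted when the first argument starts "U+" or consists entirely of hexadecimal digits. The following is a minimal sketch of that documented rule, not a copy of the committed code. The names looks_like_code_point and build_command are hypothetical, and the sketch uses snprintf with an explicit size where the commit uses sprintf into a fixed 1024-byte buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: non-zero if arg starts "U+" or is a
   non-empty string made up only of hexadecimal digits. */
static int looks_like_code_point(const char *arg)
{
  assert(arg != NULL);
  if (strncmp(arg, "U+", 2) == 0) return 1;
  if (*arg == '\0') return 0;
  for (; *arg != '\0'; arg++)
    if (!isxdigit((unsigned char)*arg)) return 0;
  return 1;
}

/* Hypothetical helper: join argv[first..argc-1] into buf, each
   followed by one space, prefixing "findprop " when the first
   argument looks like a code point. Returns 0 on success or -1
   if the result does not fit in size bytes. */
static int build_command(char *buf, size_t size, int argc, char **argv,
  int first)
{
  size_t used = 0;
  int i;

  if (size == 0) return -1;
  buf[0] = '\0';

  if (first < argc && looks_like_code_point(argv[first]))
    {
    int n = snprintf(buf, size, "findprop ");
    if (n < 0 || (size_t)n >= size) return -1;
    used = (size_t)n;
    }

  for (i = first; i < argc; i++)
    {
    int n = snprintf(buf + used, size - used, "%s ", argv[i]);
    if (n < 0 || (size_t)n >= size - used) return -1;
    used += (size_t)n;
    }

  return 0;
}
```

Under this rule the arguments "U+0041 42" become the line "findprop U+0041 42 ", whereas "find type Pd" is passed through unchanged because "find" contains non-hexadecimal letters.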

Modified: code/trunk/maint/ucptestdata/testinput1
===================================================================
--- code/trunk/maint/ucptestdata/testinput1    2020-03-25 17:18:33 UTC (rev 1240)
+++ code/trunk/maint/ucptestdata/testinput1    2020-04-01 17:00:33 UTC (rev 1241)
@@ -45,4 +45,4 @@


findprop 1f16d

-findprop 10e93 10eaa
+findprop U+10e93 U+10eaa

Added: code/trunk/maint/ucptestdata/testinput2
===================================================================
--- code/trunk/maint/ucptestdata/testinput2                            (rev 0)
+++ code/trunk/maint/ucptestdata/testinput2    2020-04-01 17:00:33 UTC (rev 1241)
@@ -0,0 +1,5 @@
+find script Han
+find type Pe script Common scriptx Hangul
+find type Sk
+find type Pd
+find gbreak LVT
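
Note on the "find" commands above: the ucptest.c head comment says that sequences of two or more matching characters are shown as ranges, for example U+0041..U+004A. The following is a minimal sketch of that range collapsing only. The name format_ranges is hypothetical, the input is assumed to be a sorted array of code points, and the property text that ucptest prints after each range is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write one line per run of consecutive code
   points in cps[0..n-1] (assumed sorted ascending) into out.
   A run of two or more is written as U+XXXX..U+YYYY, a single
   code point as U+XXXX. Returns the number of complete lines
   written; output stops early if the buffer is too small. */
static size_t format_ranges(const unsigned int *cps, size_t n, char *out,
  size_t size)
{
  size_t i = 0, used = 0, lines = 0;

  assert(out != NULL && size > 0);
  out[0] = '\0';

  while (i < n)
    {
    size_t j = i;
    int len;

    /* Extend the run while the next code point is adjacent. */
    while (j + 1 < n && cps[j + 1] == cps[j] + 1) j++;

    if (j > i)
      len = snprintf(out + used, size - used, "U+%04X..U+%04X\n",
        cps[i], cps[j]);
    else
      len = snprintf(out + used, size - used, "U+%04X\n", cps[i]);

    if (len < 0 || (size_t)len >= size - used)
      {
      out[used] = '\0';   /* Drop the partially written line. */
      break;
      }

    used += (size_t)len;
    lines++;
    i = j + 1;
    }

  return lines;
}
```

For the input 0x41, 0x42, 0x43, 0x61, 0x1F16D this writes three lines: U+0041..U+0043, U+0061, and U+1F16D.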


Modified: code/trunk/maint/ucptestdata/testoutput1
===================================================================
--- code/trunk/maint/ucptestdata/testoutput1    2020-03-25 17:18:33 UTC (rev 1240)
+++ code/trunk/maint/ucptestdata/testoutput1    2020-04-01 17:00:33 UTC (rev 1241)
@@ -1,398 +1,398 @@
 findprop 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
-0000 Control: Control, Common, Control
-0001 Control: Control, Common, Control
-0002 Control: Control, Common, Control
-0003 Control: Control, Common, Control
-0004 Control: Control, Common, Control
-0005 Control: Control, Common, Control
-0006 Control: Control, Common, Control
-0007 Control: Control, Common, Control
-0008 Control: Control, Common, Control
-0009 Control: Control, Common, Control
-000a Control: Control, Common, LF
-000b Control: Control, Common, Control
-000c Control: Control, Common, Control
-000d Control: Control, Common, CR
-000e Control: Control, Common, Control
-000f Control: Control, Common, Control
+U+0000 Control: Control, Common, Control
+U+0001 Control: Control, Common, Control
+U+0002 Control: Control, Common, Control
+U+0003 Control: Control, Common, Control
+U+0004 Control: Control, Common, Control
+U+0005 Control: Control, Common, Control
+U+0006 Control: Control, Common, Control
+U+0007 Control: Control, Common, Control
+U+0008 Control: Control, Common, Control
+U+0009 Control: Control, Common, Control
+U+000A Control: Control, Common, LF
+U+000B Control: Control, Common, Control
+U+000C Control: Control, Common, Control
+U+000D Control: Control, Common, CR
+U+000E Control: Control, Common, Control
+U+000F Control: Control, Common, Control
 findprop 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 
-0010 Control: Control, Common, Control
-0011 Control: Control, Common, Control
-0012 Control: Control, Common, Control
-0013 Control: Control, Common, Control
-0014 Control: Control, Common, Control
-0015 Control: Control, Common, Control
-0016 Control: Control, Common, Control
-0017 Control: Control, Common, Control
-0018 Control: Control, Common, Control
-0019 Control: Control, Common, Control
-001a Control: Control, Common, Control
-001b Control: Control, Common, Control
-001c Control: Control, Common, Control
-001d Control: Control, Common, Control
-001e Control: Control, Common, Control
-001f Control: Control, Common, Control
+U+0010 Control: Control, Common, Control
+U+0011 Control: Control, Common, Control
+U+0012 Control: Control, Common, Control
+U+0013 Control: Control, Common, Control
+U+0014 Control: Control, Common, Control
+U+0015 Control: Control, Common, Control
+U+0016 Control: Control, Common, Control
+U+0017 Control: Control, Common, Control
+U+0018 Control: Control, Common, Control
+U+0019 Control: Control, Common, Control
+U+001A Control: Control, Common, Control
+U+001B Control: Control, Common, Control
+U+001C Control: Control, Common, Control
+U+001D Control: Control, Common, Control
+U+001E Control: Control, Common, Control
+U+001F Control: Control, Common, Control
 findprop 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 
-0020 Separator: Space separator, Common, Other
-0021 Punctuation: Other punctuation, Common, Other
-0022 Punctuation: Other punctuation, Common, Other
-0023 Punctuation: Other punctuation, Common, Other
-0024 Symbol: Currency symbol, Common, Other
-0025 Punctuation: Other punctuation, Common, Other
-0026 Punctuation: Other punctuation, Common, Other
-0027 Punctuation: Other punctuation, Common, Other
-0028 Punctuation: Open punctuation, Common, Other
-0029 Punctuation: Close punctuation, Common, Other
-002a Punctuation: Other punctuation, Common, Other
-002b Symbol: Mathematical symbol, Common, Other
-002c Punctuation: Other punctuation, Common, Other
-002d Punctuation: Dash punctuation, Common, Other
-002e Punctuation: Other punctuation, Common, Other
-002f Punctuation: Other punctuation, Common, Other
+U+0020 Separator: Space separator, Common, Other
+U+0021 Punctuation: Other punctuation, Common, Other
+U+0022 Punctuation: Other punctuation, Common, Other
+U+0023 Punctuation: Other punctuation, Common, Other
+U+0024 Symbol: Currency symbol, Common, Other
+U+0025 Punctuation: Other punctuation, Common, Other
+U+0026 Punctuation: Other punctuation, Common, Other
+U+0027 Punctuation: Other punctuation, Common, Other
+U+0028 Punctuation: Open punctuation, Common, Other
+U+0029 Punctuation: Close punctuation, Common, Other
+U+002A Punctuation: Other punctuation, Common, Other
+U+002B Symbol: Mathematical symbol, Common, Other
+U+002C Punctuation: Other punctuation, Common, Other
+U+002D Punctuation: Dash punctuation, Common, Other
+U+002E Punctuation: Other punctuation, Common, Other
+U+002F Punctuation: Other punctuation, Common, Other
 findprop 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 
-0030 Number: Decimal number, Common, Other
-0031 Number: Decimal number, Common, Other
-0032 Number: Decimal number, Common, Other
-0033 Number: Decimal number, Common, Other
-0034 Number: Decimal number, Common, Other
-0035 Number: Decimal number, Common, Other
-0036 Number: Decimal number, Common, Other
-0037 Number: Decimal number, Common, Other
-0038 Number: Decimal number, Common, Other
-0039 Number: Decimal number, Common, Other
-003a Punctuation: Other punctuation, Common, Other
-003b Punctuation: Other punctuation, Common, Other
-003c Symbol: Mathematical symbol, Common, Other
-003d Symbol: Mathematical symbol, Common, Other
-003e Symbol: Mathematical symbol, Common, Other
-003f Punctuation: Other punctuation, Common, Other
+U+0030 Number: Decimal number, Common, Other
+U+0031 Number: Decimal number, Common, Other
+U+0032 Number: Decimal number, Common, Other
+U+0033 Number: Decimal number, Common, Other
+U+0034 Number: Decimal number, Common, Other
+U+0035 Number: Decimal number, Common, Other
+U+0036 Number: Decimal number, Common, Other
+U+0037 Number: Decimal number, Common, Other
+U+0038 Number: Decimal number, Common, Other
+U+0039 Number: Decimal number, Common, Other
+U+003A Punctuation: Other punctuation, Common, Other
+U+003B Punctuation: Other punctuation, Common, Other
+U+003C Symbol: Mathematical symbol, Common, Other
+U+003D Symbol: Mathematical symbol, Common, Other
+U+003E Symbol: Mathematical symbol, Common, Other
+U+003F Punctuation: Other punctuation, Common, Other
 findprop 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f 
-0040 Punctuation: Other punctuation, Common, Other
-0041 Letter: Upper case letter, Latin, Other, 0061
-0042 Letter: Upper case letter, Latin, Other, 0062
-0043 Letter: Upper case letter, Latin, Other, 0063
-0044 Letter: Upper case letter, Latin, Other, 0064
-0045 Letter: Upper case letter, Latin, Other, 0065
-0046 Letter: Upper case letter, Latin, Other, 0066
-0047 Letter: Upper case letter, Latin, Other, 0067
-0048 Letter: Upper case letter, Latin, Other, 0068
-0049 Letter: Upper case letter, Latin, Other, 0069
-004a Letter: Upper case letter, Latin, Other, 006a
-004b Letter: Upper case letter, Latin, Other, 006b, 212a
-004c Letter: Upper case letter, Latin, Other, 006c
-004d Letter: Upper case letter, Latin, Other, 006d
-004e Letter: Upper case letter, Latin, Other, 006e
-004f Letter: Upper case letter, Latin, Other, 006f
+U+0040 Punctuation: Other punctuation, Common, Other
+U+0041 Letter: Upper case letter, Latin, Other, U+0061
+U+0042 Letter: Upper case letter, Latin, Other, U+0062
+U+0043 Letter: Upper case letter, Latin, Other, U+0063
+U+0044 Letter: Upper case letter, Latin, Other, U+0064
+U+0045 Letter: Upper case letter, Latin, Other, U+0065
+U+0046 Letter: Upper case letter, Latin, Other, U+0066
+U+0047 Letter: Upper case letter, Latin, Other, U+0067
+U+0048 Letter: Upper case letter, Latin, Other, U+0068
+U+0049 Letter: Upper case letter, Latin, Other, U+0069
+U+004A Letter: Upper case letter, Latin, Other, U+006A
+U+004B Letter: Upper case letter, Latin, Other, U+006B, U+212A
+U+004C Letter: Upper case letter, Latin, Other, U+006C
+U+004D Letter: Upper case letter, Latin, Other, U+006D
+U+004E Letter: Upper case letter, Latin, Other, U+006E
+U+004F Letter: Upper case letter, Latin, Other, U+006F
 findprop 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 
-0050 Letter: Upper case letter, Latin, Other, 0070
-0051 Letter: Upper case letter, Latin, Other, 0071
-0052 Letter: Upper case letter, Latin, Other, 0072
-0053 Letter: Upper case letter, Latin, Other, 0073, 017f
-0054 Letter: Upper case letter, Latin, Other, 0074
-0055 Letter: Upper case letter, Latin, Other, 0075
-0056 Letter: Upper case letter, Latin, Other, 0076
-0057 Letter: Upper case letter, Latin, Other, 0077
-0058 Letter: Upper case letter, Latin, Other, 0078
-0059 Letter: Upper case letter, Latin, Other, 0079
-005a Letter: Upper case letter, Latin, Other, 007a
-005b Punctuation: Open punctuation, Common, Other
-005c Punctuation: Other punctuation, Common, Other
-005d Punctuation: Close punctuation, Common, Other
-005e Symbol: Modifier symbol, Common, Other
-005f Punctuation: Connector punctuation, Common, Other
+U+0050 Letter: Upper case letter, Latin, Other, U+0070
+U+0051 Letter: Upper case letter, Latin, Other, U+0071
+U+0052 Letter: Upper case letter, Latin, Other, U+0072
+U+0053 Letter: Upper case letter, Latin, Other, U+0073, U+017F
+U+0054 Letter: Upper case letter, Latin, Other, U+0074
+U+0055 Letter: Upper case letter, Latin, Other, U+0075
+U+0056 Letter: Upper case letter, Latin, Other, U+0076
+U+0057 Letter: Upper case letter, Latin, Other, U+0077
+U+0058 Letter: Upper case letter, Latin, Other, U+0078
+U+0059 Letter: Upper case letter, Latin, Other, U+0079
+U+005A Letter: Upper case letter, Latin, Other, U+007A
+U+005B Punctuation: Open punctuation, Common, Other
+U+005C Punctuation: Other punctuation, Common, Other
+U+005D Punctuation: Close punctuation, Common, Other
+U+005E Symbol: Modifier symbol, Common, Other
+U+005F Punctuation: Connector punctuation, Common, Other
 findprop 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 
-0060 Symbol: Modifier symbol, Common, Other
-0061 Letter: Lower case letter, Latin, Other, 0041
-0062 Letter: Lower case letter, Latin, Other, 0042
-0063 Letter: Lower case letter, Latin, Other, 0043
-0064 Letter: Lower case letter, Latin, Other, 0044
-0065 Letter: Lower case letter, Latin, Other, 0045
-0066 Letter: Lower case letter, Latin, Other, 0046
-0067 Letter: Lower case letter, Latin, Other, 0047
-0068 Letter: Lower case letter, Latin, Other, 0048
-0069 Letter: Lower case letter, Latin, Other, 0049
-006a Letter: Lower case letter, Latin, Other, 004a
-006b Letter: Lower case letter, Latin, Other, 004b, 212a
-006c Letter: Lower case letter, Latin, Other, 004c
-006d Letter: Lower case letter, Latin, Other, 004d
-006e Letter: Lower case letter, Latin, Other, 004e
-006f Letter: Lower case letter, Latin, Other, 004f
+U+0060 Symbol: Modifier symbol, Common, Other
+U+0061 Letter: Lower case letter, Latin, Other, U+0041
+U+0062 Letter: Lower case letter, Latin, Other, U+0042
+U+0063 Letter: Lower case letter, Latin, Other, U+0043
+U+0064 Letter: Lower case letter, Latin, Other, U+0044
+U+0065 Letter: Lower case letter, Latin, Other, U+0045
+U+0066 Letter: Lower case letter, Latin, Other, U+0046
+U+0067 Letter: Lower case letter, Latin, Other, U+0047
+U+0068 Letter: Lower case letter, Latin, Other, U+0048
+U+0069 Letter: Lower case letter, Latin, Other, U+0049
+U+006A Letter: Lower case letter, Latin, Other, U+004A
+U+006B Letter: Lower case letter, Latin, Other, U+004B, U+212A
+U+006C Letter: Lower case letter, Latin, Other, U+004C
+U+006D Letter: Lower case letter, Latin, Other, U+004D
+U+006E Letter: Lower case letter, Latin, Other, U+004E
+U+006F Letter: Lower case letter, Latin, Other, U+004F
 findprop 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 
-0070 Letter: Lower case letter, Latin, Other, 0050
-0071 Letter: Lower case letter, Latin, Other, 0051
-0072 Letter: Lower case letter, Latin, Other, 0052
-0073 Letter: Lower case letter, Latin, Other, 0053, 017f
-0074 Letter: Lower case letter, Latin, Other, 0054
-0075 Letter: Lower case letter, Latin, Other, 0055
-0076 Letter: Lower case letter, Latin, Other, 0056
-0077 Letter: Lower case letter, Latin, Other, 0057
-0078 Letter: Lower case letter, Latin, Other, 0058
-0079 Letter: Lower case letter, Latin, Other, 0059
-007a Letter: Lower case letter, Latin, Other, 005a
-007b Punctuation: Open punctuation, Common, Other
-007c Symbol: Mathematical symbol, Common, Other
-007d Punctuation: Close punctuation, Common, Other
-007e Symbol: Mathematical symbol, Common, Other
-007f Control: Control, Common, Control
+U+0070 Letter: Lower case letter, Latin, Other, U+0050
+U+0071 Letter: Lower case letter, Latin, Other, U+0051
+U+0072 Letter: Lower case letter, Latin, Other, U+0052
+U+0073 Letter: Lower case letter, Latin, Other, U+0053, U+017F
+U+0074 Letter: Lower case letter, Latin, Other, U+0054
+U+0075 Letter: Lower case letter, Latin, Other, U+0055
+U+0076 Letter: Lower case letter, Latin, Other, U+0056
+U+0077 Letter: Lower case letter, Latin, Other, U+0057
+U+0078 Letter: Lower case letter, Latin, Other, U+0058
+U+0079 Letter: Lower case letter, Latin, Other, U+0059
+U+007A Letter: Lower case letter, Latin, Other, U+005A
+U+007B Punctuation: Open punctuation, Common, Other
+U+007C Symbol: Mathematical symbol, Common, Other
+U+007D Punctuation: Close punctuation, Common, Other
+U+007E Symbol: Mathematical symbol, Common, Other
+U+007F Control: Control, Common, Control


 findprop 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
-0080 Control: Control, Common, Control
-0081 Control: Control, Common, Control
-0082 Control: Control, Common, Control
-0083 Control: Control, Common, Control
-0084 Control: Control, Common, Control
-0085 Control: Control, Common, Control
-0086 Control: Control, Common, Control
-0087 Control: Control, Common, Control
-0088 Control: Control, Common, Control
-0089 Control: Control, Common, Control
-008a Control: Control, Common, Control
-008b Control: Control, Common, Control
-008c Control: Control, Common, Control
-008d Control: Control, Common, Control
-008e Control: Control, Common, Control
-008f Control: Control, Common, Control
+U+0080 Control: Control, Common, Control
+U+0081 Control: Control, Common, Control
+U+0082 Control: Control, Common, Control
+U+0083 Control: Control, Common, Control
+U+0084 Control: Control, Common, Control
+U+0085 Control: Control, Common, Control
+U+0086 Control: Control, Common, Control
+U+0087 Control: Control, Common, Control
+U+0088 Control: Control, Common, Control
+U+0089 Control: Control, Common, Control
+U+008A Control: Control, Common, Control
+U+008B Control: Control, Common, Control
+U+008C Control: Control, Common, Control
+U+008D Control: Control, Common, Control
+U+008E Control: Control, Common, Control
+U+008F Control: Control, Common, Control
 findprop 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
-0090 Control: Control, Common, Control
-0091 Control: Control, Common, Control
-0092 Control: Control, Common, Control
-0093 Control: Control, Common, Control
-0094 Control: Control, Common, Control
-0095 Control: Control, Common, Control
-0096 Control: Control, Common, Control
-0097 Control: Control, Common, Control
-0098 Control: Control, Common, Control
-0099 Control: Control, Common, Control
-009a Control: Control, Common, Control
-009b Control: Control, Common, Control
-009c Control: Control, Common, Control
-009d Control: Control, Common, Control
-009e Control: Control, Common, Control
-009f Control: Control, Common, Control
+U+0090 Control: Control, Common, Control
+U+0091 Control: Control, Common, Control
+U+0092 Control: Control, Common, Control
+U+0093 Control: Control, Common, Control
+U+0094 Control: Control, Common, Control
+U+0095 Control: Control, Common, Control
+U+0096 Control: Control, Common, Control
+U+0097 Control: Control, Common, Control
+U+0098 Control: Control, Common, Control
+U+0099 Control: Control, Common, Control
+U+009A Control: Control, Common, Control
+U+009B Control: Control, Common, Control
+U+009C Control: Control, Common, Control
+U+009D Control: Control, Common, Control
+U+009E Control: Control, Common, Control
+U+009F Control: Control, Common, Control
 findprop a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
-00a0 Separator: Space separator, Common, Other
-00a1 Punctuation: Other punctuation, Common, Other
-00a2 Symbol: Currency symbol, Common, Other
-00a3 Symbol: Currency symbol, Common, Other
-00a4 Symbol: Currency symbol, Common, Other
-00a5 Symbol: Currency symbol, Common, Other
-00a6 Symbol: Other symbol, Common, Other
-00a7 Punctuation: Other punctuation, Common, Other
-00a8 Symbol: Modifier symbol, Common, Other
-00a9 Symbol: Other symbol, Common, Extended Pictographic
-00aa Letter: Other letter, Latin, Other
-00ab Punctuation: Initial punctuation, Common, Other
-00ac Symbol: Mathematical symbol, Common, Other
-00ad Control: Format, Common, Control
-00ae Symbol: Other symbol, Common, Extended Pictographic
-00af Symbol: Modifier symbol, Common, Other
+U+00A0 Separator: Space separator, Common, Other
+U+00A1 Punctuation: Other punctuation, Common, Other
+U+00A2 Symbol: Currency symbol, Common, Other
+U+00A3 Symbol: Currency symbol, Common, Other
+U+00A4 Symbol: Currency symbol, Common, Other
+U+00A5 Symbol: Currency symbol, Common, Other
+U+00A6 Symbol: Other symbol, Common, Other
+U+00A7 Punctuation: Other punctuation, Common, Other
+U+00A8 Symbol: Modifier symbol, Common, Other
+U+00A9 Symbol: Other symbol, Common, Extended Pictographic
+U+00AA Letter: Other letter, Latin, Other
+U+00AB Punctuation: Initial punctuation, Common, Other
+U+00AC Symbol: Mathematical symbol, Common, Other
+U+00AD Control: Format, Common, Control
+U+00AE Symbol: Other symbol, Common, Extended Pictographic
+U+00AF Symbol: Modifier symbol, Common, Other
 findprop b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
-00b0 Symbol: Other symbol, Common, Other
-00b1 Symbol: Mathematical symbol, Common, Other
-00b2 Number: Other number, Common, Other
-00b3 Number: Other number, Common, Other
-00b4 Symbol: Modifier symbol, Common, Other
-00b5 Letter: Lower case letter, Common, Other, 03bc, 039c
-00b6 Punctuation: Other punctuation, Common, Other
-00b7 Punctuation: Other punctuation, Common, Other
-00b8 Symbol: Modifier symbol, Common, Other
-00b9 Number: Other number, Common, Other
-00ba Letter: Other letter, Latin, Other
-00bb Punctuation: Final punctuation, Common, Other
-00bc Number: Other number, Common, Other
-00bd Number: Other number, Common, Other
-00be Number: Other number, Common, Other
-00bf Punctuation: Other punctuation, Common, Other
+U+00B0 Symbol: Other symbol, Common, Other
+U+00B1 Symbol: Mathematical symbol, Common, Other
+U+00B2 Number: Other number, Common, Other
+U+00B3 Number: Other number, Common, Other
+U+00B4 Symbol: Modifier symbol, Common, Other
+U+00B5 Letter: Lower case letter, Common, Other, U+03BC, U+039C
+U+00B6 Punctuation: Other punctuation, Common, Other
+U+00B7 Punctuation: Other punctuation, Common, Other
+U+00B8 Symbol: Modifier symbol, Common, Other
+U+00B9 Number: Other number, Common, Other
+U+00BA Letter: Other letter, Latin, Other
+U+00BB Punctuation: Final punctuation, Common, Other
+U+00BC Number: Other number, Common, Other
+U+00BD Number: Other number, Common, Other
+U+00BE Number: Other number, Common, Other
+U+00BF Punctuation: Other punctuation, Common, Other
 findprop c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
-00c0 Letter: Upper case letter, Latin, Other, 00e0
-00c1 Letter: Upper case letter, Latin, Other, 00e1
-00c2 Letter: Upper case letter, Latin, Other, 00e2
-00c3 Letter: Upper case letter, Latin, Other, 00e3
-00c4 Letter: Upper case letter, Latin, Other, 00e4
-00c5 Letter: Upper case letter, Latin, Other, 00e5, 212b
-00c6 Letter: Upper case letter, Latin, Other, 00e6
-00c7 Letter: Upper case letter, Latin, Other, 00e7
-00c8 Letter: Upper case letter, Latin, Other, 00e8
-00c9 Letter: Upper case letter, Latin, Other, 00e9
-00ca Letter: Upper case letter, Latin, Other, 00ea
-00cb Letter: Upper case letter, Latin, Other, 00eb
-00cc Letter: Upper case letter, Latin, Other, 00ec
-00cd Letter: Upper case letter, Latin, Other, 00ed
-00ce Letter: Upper case letter, Latin, Other, 00ee
-00cf Letter: Upper case letter, Latin, Other, 00ef
+U+00C0 Letter: Upper case letter, Latin, Other, U+00E0
+U+00C1 Letter: Upper case letter, Latin, Other, U+00E1
+U+00C2 Letter: Upper case letter, Latin, Other, U+00E2
+U+00C3 Letter: Upper case letter, Latin, Other, U+00E3
+U+00C4 Letter: Upper case letter, Latin, Other, U+00E4
+U+00C5 Letter: Upper case letter, Latin, Other, U+00E5, U+212B
+U+00C6 Letter: Upper case letter, Latin, Other, U+00E6
+U+00C7 Letter: Upper case letter, Latin, Other, U+00E7
+U+00C8 Letter: Upper case letter, Latin, Other, U+00E8
+U+00C9 Letter: Upper case letter, Latin, Other, U+00E9
+U+00CA Letter: Upper case letter, Latin, Other, U+00EA
+U+00CB Letter: Upper case letter, Latin, Other, U+00EB
+U+00CC Letter: Upper case letter, Latin, Other, U+00EC
+U+00CD Letter: Upper case letter, Latin, Other, U+00ED
+U+00CE Letter: Upper case letter, Latin, Other, U+00EE
+U+00CF Letter: Upper case letter, Latin, Other, U+00EF
 findprop d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df
-00d0 Letter: Upper case letter, Latin, Other, 00f0
-00d1 Letter: Upper case letter, Latin, Other, 00f1
-00d2 Letter: Upper case letter, Latin, Other, 00f2
-00d3 Letter: Upper case letter, Latin, Other, 00f3
-00d4 Letter: Upper case letter, Latin, Other, 00f4
-00d5 Letter: Upper case letter, Latin, Other, 00f5
-00d6 Letter: Upper case letter, Latin, Other, 00f6
-00d7 Symbol: Mathematical symbol, Common, Other
-00d8 Letter: Upper case letter, Latin, Other, 00f8
-00d9 Letter: Upper case letter, Latin, Other, 00f9
-00da Letter: Upper case letter, Latin, Other, 00fa
-00db Letter: Upper case letter, Latin, Other, 00fb
-00dc Letter: Upper case letter, Latin, Other, 00fc
-00dd Letter: Upper case letter, Latin, Other, 00fd
-00de Letter: Upper case letter, Latin, Other, 00fe
-00df Letter: Lower case letter, Latin, Other, 1e9e
+U+00D0 Letter: Upper case letter, Latin, Other, U+00F0
+U+00D1 Letter: Upper case letter, Latin, Other, U+00F1
+U+00D2 Letter: Upper case letter, Latin, Other, U+00F2
+U+00D3 Letter: Upper case letter, Latin, Other, U+00F3
+U+00D4 Letter: Upper case letter, Latin, Other, U+00F4
+U+00D5 Letter: Upper case letter, Latin, Other, U+00F5
+U+00D6 Letter: Upper case letter, Latin, Other, U+00F6
+U+00D7 Symbol: Mathematical symbol, Common, Other
+U+00D8 Letter: Upper case letter, Latin, Other, U+00F8
+U+00D9 Letter: Upper case letter, Latin, Other, U+00F9
+U+00DA Letter: Upper case letter, Latin, Other, U+00FA
+U+00DB Letter: Upper case letter, Latin, Other, U+00FB
+U+00DC Letter: Upper case letter, Latin, Other, U+00FC
+U+00DD Letter: Upper case letter, Latin, Other, U+00FD
+U+00DE Letter: Upper case letter, Latin, Other, U+00FE
+U+00DF Letter: Lower case letter, Latin, Other, U+1E9E
 findprop e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef
-00e0 Letter: Lower case letter, Latin, Other, 00c0
-00e1 Letter: Lower case letter, Latin, Other, 00c1
-00e2 Letter: Lower case letter, Latin, Other, 00c2
-00e3 Letter: Lower case letter, Latin, Other, 00c3
-00e4 Letter: Lower case letter, Latin, Other, 00c4
-00e5 Letter: Lower case letter, Latin, Other, 00c5, 212b
-00e6 Letter: Lower case letter, Latin, Other, 00c6
-00e7 Letter: Lower case letter, Latin, Other, 00c7
-00e8 Letter: Lower case letter, Latin, Other, 00c8
-00e9 Letter: Lower case letter, Latin, Other, 00c9
-00ea Letter: Lower case letter, Latin, Other, 00ca
-00eb Letter: Lower case letter, Latin, Other, 00cb
-00ec Letter: Lower case letter, Latin, Other, 00cc
-00ed Letter: Lower case letter, Latin, Other, 00cd
-00ee Letter: Lower case letter, Latin, Other, 00ce
-00ef Letter: Lower case letter, Latin, Other, 00cf
+U+00E0 Letter: Lower case letter, Latin, Other, U+00C0
+U+00E1 Letter: Lower case letter, Latin, Other, U+00C1
+U+00E2 Letter: Lower case letter, Latin, Other, U+00C2
+U+00E3 Letter: Lower case letter, Latin, Other, U+00C3
+U+00E4 Letter: Lower case letter, Latin, Other, U+00C4
+U+00E5 Letter: Lower case letter, Latin, Other, U+00C5, U+212B
+U+00E6 Letter: Lower case letter, Latin, Other, U+00C6
+U+00E7 Letter: Lower case letter, Latin, Other, U+00C7
+U+00E8 Letter: Lower case letter, Latin, Other, U+00C8
+U+00E9 Letter: Lower case letter, Latin, Other, U+00C9
+U+00EA Letter: Lower case letter, Latin, Other, U+00CA
+U+00EB Letter: Lower case letter, Latin, Other, U+00CB
+U+00EC Letter: Lower case letter, Latin, Other, U+00CC
+U+00ED Letter: Lower case letter, Latin, Other, U+00CD
+U+00EE Letter: Lower case letter, Latin, Other, U+00CE
+U+00EF Letter: Lower case letter, Latin, Other, U+00CF
 findprop f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff
-00f0 Letter: Lower case letter, Latin, Other, 00d0
-00f1 Letter: Lower case letter, Latin, Other, 00d1
-00f2 Letter: Lower case letter, Latin, Other, 00d2
-00f3 Letter: Lower case letter, Latin, Other, 00d3
-00f4 Letter: Lower case letter, Latin, Other, 00d4
-00f5 Letter: Lower case letter, Latin, Other, 00d5
-00f6 Letter: Lower case letter, Latin, Other, 00d6
-00f7 Symbol: Mathematical symbol, Common, Other
-00f8 Letter: Lower case letter, Latin, Other, 00d8
-00f9 Letter: Lower case letter, Latin, Other, 00d9
-00fa Letter: Lower case letter, Latin, Other, 00da
-00fb Letter: Lower case letter, Latin, Other, 00db
-00fc Letter: Lower case letter, Latin, Other, 00dc
-00fd Letter: Lower case letter, Latin, Other, 00dd
-00fe Letter: Lower case letter, Latin, Other, 00de
-00ff Letter: Lower case letter, Latin, Other, 0178
+U+00F0 Letter: Lower case letter, Latin, Other, U+00D0
+U+00F1 Letter: Lower case letter, Latin, Other, U+00D1
+U+00F2 Letter: Lower case letter, Latin, Other, U+00D2
+U+00F3 Letter: Lower case letter, Latin, Other, U+00D3
+U+00F4 Letter: Lower case letter, Latin, Other, U+00D4
+U+00F5 Letter: Lower case letter, Latin, Other, U+00D5
+U+00F6 Letter: Lower case letter, Latin, Other, U+00D6
+U+00F7 Symbol: Mathematical symbol, Common, Other
+U+00F8 Letter: Lower case letter, Latin, Other, U+00D8
+U+00F9 Letter: Lower case letter, Latin, Other, U+00D9
+U+00FA Letter: Lower case letter, Latin, Other, U+00DA
+U+00FB Letter: Lower case letter, Latin, Other, U+00DB
+U+00FC Letter: Lower case letter, Latin, Other, U+00DC
+U+00FD Letter: Lower case letter, Latin, Other, U+00DD
+U+00FE Letter: Lower case letter, Latin, Other, U+00DE
+U+00FF Letter: Lower case letter, Latin, Other, U+0178

 findprop 0100 0101 0102 0103 0104 0105 0106
-0100 Letter: Upper case letter, Latin, Other, 0101
-0101 Letter: Lower case letter, Latin, Other, 0100
-0102 Letter: Upper case letter, Latin, Other, 0103
-0103 Letter: Lower case letter, Latin, Other, 0102
-0104 Letter: Upper case letter, Latin, Other, 0105
-0105 Letter: Lower case letter, Latin, Other, 0104
-0106 Letter: Upper case letter, Latin, Other, 0107
+U+0100 Letter: Upper case letter, Latin, Other, U+0101
+U+0101 Letter: Lower case letter, Latin, Other, U+0100
+U+0102 Letter: Upper case letter, Latin, Other, U+0103
+U+0103 Letter: Lower case letter, Latin, Other, U+0102
+U+0104 Letter: Upper case letter, Latin, Other, U+0105
+U+0105 Letter: Lower case letter, Latin, Other, U+0104
+U+0106 Letter: Upper case letter, Latin, Other, U+0107

 findprop ffe0 ffe1 ffe2 ffe3 ffe4 ffe5 ffe6 ffe7
-ffe0 Symbol: Currency symbol, Common, Other
-ffe1 Symbol: Currency symbol, Common, Other
-ffe2 Symbol: Mathematical symbol, Common, Other
-ffe3 Symbol: Modifier symbol, Common, Other
-ffe4 Symbol: Other symbol, Common, Other
-ffe5 Symbol: Currency symbol, Common, Other
-ffe6 Symbol: Currency symbol, Common, Other
-ffe7 Control: Unassigned, Unknown, Other
+U+FFE0 Symbol: Currency symbol, Common, Other
+U+FFE1 Symbol: Currency symbol, Common, Other
+U+FFE2 Symbol: Mathematical symbol, Common, Other
+U+FFE3 Symbol: Modifier symbol, Common, Other
+U+FFE4 Symbol: Other symbol, Common, Other
+U+FFE5 Symbol: Currency symbol, Common, Other
+U+FFE6 Symbol: Currency symbol, Common, Other
+U+FFE7 Control: Unassigned, Unknown, Other
 findprop ffe8 ffe9 ffea ffeb ffec ffed ffee ffef
-ffe8 Symbol: Other symbol, Common, Other
-ffe9 Symbol: Mathematical symbol, Common, Other
-ffea Symbol: Mathematical symbol, Common, Other
-ffeb Symbol: Mathematical symbol, Common, Other
-ffec Symbol: Mathematical symbol, Common, Other
-ffed Symbol: Other symbol, Common, Other
-ffee Symbol: Other symbol, Common, Other
-ffef Control: Unassigned, Unknown, Other
+U+FFE8 Symbol: Other symbol, Common, Other
+U+FFE9 Symbol: Mathematical symbol, Common, Other
+U+FFEA Symbol: Mathematical symbol, Common, Other
+U+FFEB Symbol: Mathematical symbol, Common, Other
+U+FFEC Symbol: Mathematical symbol, Common, Other
+U+FFED Symbol: Other symbol, Common, Other
+U+FFEE Symbol: Other symbol, Common, Other
+U+FFEF Control: Unassigned, Unknown, Other
 findprop fff8 fff9 fffa fffb fffc fffd fffe ffff
-fff8 Control: Unassigned, Unknown, Control
-fff9 Control: Format, Common, Control
-fffa Control: Format, Common, Control
-fffb Control: Format, Common, Control
-fffc Symbol: Other symbol, Common, Other
-fffd Symbol: Other symbol, Common, Other
-fffe Control: Unassigned, Unknown, Other
-ffff Control: Unassigned, Unknown, Other
+U+FFF8 Control: Unassigned, Unknown, Control
+U+FFF9 Control: Format, Common, Control
+U+FFFA Control: Format, Common, Control
+U+FFFB Control: Format, Common, Control
+U+FFFC Symbol: Other symbol, Common, Other
+U+FFFD Symbol: Other symbol, Common, Other
+U+FFFE Control: Unassigned, Unknown, Other
+U+FFFF Control: Unassigned, Unknown, Other
 findprop 10000 10001 e01ef f0000 100000
-10000 Letter: Other letter, Linear_B, Other
-10001 Letter: Other letter, Linear_B, Other
-e01ef Mark: Non-spacing mark, Inherited, Extend
-f0000 Control: Private use, Unknown, Other
-100000 Control: Private use, Unknown, Other
+U+10000 Letter: Other letter, Linear_B, Other
+U+10001 Letter: Other letter, Linear_B, Other
+U+E01EF Mark: Non-spacing mark, Inherited, Extend
+U+F0000 Control: Private use, Unknown, Other
+U+100000 Control: Private use, Unknown, Other

 findprop 1b00 12000 7c0 a840 10900
-1b00 Mark: Non-spacing mark, Balinese, Extend
-12000 Letter: Other letter, Cuneiform, Other
-07c0 Number: Decimal number, Nko, Other
-a840 Letter: Other letter, Phags_Pa, Other
-10900 Letter: Other letter, Phoenician, Other
+U+1B00 Mark: Non-spacing mark, Balinese, Extend
+U+12000 Letter: Other letter, Cuneiform, Other
+U+07C0 Number: Decimal number, Nko, Other
+U+A840 Letter: Other letter, Phags_Pa, Other
+U+10900 Letter: Other letter, Phoenician, Other
 findprop 1d79 a77d
-1d79 Letter: Lower case letter, Latin, Other, a77d
-a77d Letter: Upper case letter, Latin, Other, 1d79
+U+1D79 Letter: Lower case letter, Latin, Other, U+A77D
+U+A77D Letter: Upper case letter, Latin, Other, U+1D79

 findprop 0800 083e a4d0 a4f7 aa80 aadf
-0800 Letter: Other letter, Samaritan, Other
-083e Punctuation: Other punctuation, Samaritan, Other
-a4d0 Letter: Other letter, Lisu, Other
-a4f7 Letter: Other letter, Lisu, Other
-aa80 Letter: Other letter, Tai_Viet, Other
-aadf Punctuation: Other punctuation, Tai_Viet, Other
+U+0800 Letter: Other letter, Samaritan, Other
+U+083E Punctuation: Other punctuation, Samaritan, Other
+U+A4D0 Letter: Other letter, Lisu, Other
+U+A4F7 Letter: Other letter, Lisu, Other
+U+AA80 Letter: Other letter, Tai_Viet, Other
+U+AADF Punctuation: Other punctuation, Tai_Viet, Other
 findprop 10b00 10b35 13000 1342e 10840 10855
-10b00 Letter: Other letter, Avestan, Other
-10b35 Letter: Other letter, Avestan, Other
-13000 Letter: Other letter, Egyptian_Hieroglyphs, Other
-1342e Letter: Other letter, Egyptian_Hieroglyphs, Other
-10840 Letter: Other letter, Imperial_Aramaic, Other
-10855 Letter: Other letter, Imperial_Aramaic, Other
+U+10B00 Letter: Other letter, Avestan, Other
+U+10B35 Letter: Other letter, Avestan, Other
+U+13000 Letter: Other letter, Egyptian_Hieroglyphs, Other
+U+1342E Letter: Other letter, Egyptian_Hieroglyphs, Other
+U+10840 Letter: Other letter, Imperial_Aramaic, Other
+U+10855 Letter: Other letter, Imperial_Aramaic, Other

 findprop 11100 1113c 11680 116c0
-11100 Mark: Non-spacing mark, Chakma, Extend
-1113c Number: Decimal number, Chakma, Other
-11680 Letter: Other letter, Takri, Other
-116c0 Number: Decimal number, Takri, Other
+U+11100 Mark: Non-spacing mark, Chakma, Extend
+U+1113C Number: Decimal number, Chakma, Other
+U+11680 Letter: Other letter, Takri, Other
+U+116C0 Number: Decimal number, Takri, Other

 findprop 0d 0a 0e 0711 1b04 1111 1169 11fe ae4c ad89
-000d Control: Control, Common, CR
-000a Control: Control, Common, LF
-000e Control: Control, Common, Control
-0711 Mark: Non-spacing mark, Syriac, Extend
-1b04 Mark: Spacing mark, Balinese, SpacingMark
-1111 Letter: Other letter, Hangul, Hangul syllable type L
-1169 Letter: Other letter, Hangul, Hangul syllable type V
-11fe Letter: Other letter, Hangul, Hangul syllable type T
-ae4c Letter: Other letter, Hangul, Hangul syllable type LV
-ad89 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+000D Control: Control, Common, CR
+U+000A Control: Control, Common, LF
+U+000E Control: Control, Common, Control
+U+0711 Mark: Non-spacing mark, Syriac, Extend
+U+1B04 Mark: Spacing mark, Balinese, SpacingMark
+U+1111 Letter: Other letter, Hangul, Hangul syllable type L
+U+1169 Letter: Other letter, Hangul, Hangul syllable type V
+U+11FE Letter: Other letter, Hangul, Hangul syllable type T
+U+AE4C Letter: Other letter, Hangul, Hangul syllable type LV
+U+AD89 Letter: Other letter, Hangul, Hangul syllable type LVT

 findprop 118a0 11ac7 16ad0
-118a0 Letter: Upper case letter, Warang_Citi, Other, 118c0
-11ac7 Letter: Other letter, Pau_Cin_Hau, Other
-16ad0 Letter: Other letter, Bassa_Vah, Other
+U+118A0 Letter: Upper case letter, Warang_Citi, Other, U+118C0
+U+11AC7 Letter: Other letter, Pau_Cin_Hau, Other
+U+16AD0 Letter: Other letter, Bassa_Vah, Other

 findprop 11700 14400 108e0 11280 1d800
-11700 Letter: Other letter, Ahom, Other
-14400 Letter: Other letter, Anatolian_Hieroglyphs, Other
-108e0 Letter: Other letter, Hatran, Other
-11280 Letter: Other letter, Multani, Other
-1d800 Symbol: Other symbol, SignWriting, Other
+U+11700 Letter: Other letter, Ahom, Other
+U+14400 Letter: Other letter, Anatolian_Hieroglyphs, Other
+U+108E0 Letter: Other letter, Hatran, Other
+U+11280 Letter: Other letter, Multani, Other
+U+1D800 Symbol: Other symbol, SignWriting, Other

 findprop 11800 1e903 11da9 10d27 11ee0 16e48 10f27 10f30
-11800 Letter: Other letter, Dogra, Other
-1e903 Letter: Upper case letter, Adlam, Other, 1e925
-11da9 Number: Decimal number, Gunjala_Gondi, Other
-10d27 Mark: Non-spacing mark, Hanifi_Rohingya, Extend
-11ee0 Letter: Other letter, Makasar, Other
-16e48 Letter: Upper case letter, Medefaidrin, Other, 16e68
-10f27 Letter: Other letter, Old_Sogdian, Other
-10f30 Letter: Other letter, Sogdian, Other
+U+11800 Letter: Other letter, Dogra, Other
+U+1E903 Letter: Upper case letter, Adlam, Other, U+1E925
+U+11DA9 Number: Decimal number, Gunjala_Gondi, Other
+U+10D27 Mark: Non-spacing mark, Hanifi_Rohingya, Extend
+U+11EE0 Letter: Other letter, Makasar, Other
+U+16E48 Letter: Upper case letter, Medefaidrin, Other, U+16E68
+U+10F27 Letter: Other letter, Old_Sogdian, Other
+U+10F30 Letter: Other letter, Sogdian, Other

 findprop a836 a833 1cf4 20f0 1cd0
-a836 Symbol: Other symbol, Common, Other, [Devanagari, Dogra, Gujarati, Gurmukhi, Khojki, Kaithi, Mahajani, Modi, Khudawadi, Takri, Tirhuta]
-a833 Number: Other number, Common, Other, [Devanagari, Dogra, Gujarati, Gurmukhi, Khojki, Kannada, Kaithi, Mahajani, Modi, Nandinagari, Khudawadi, Takri, Tirhuta]
-1cf4 Mark: Non-spacing mark, Inherited, Extend, [Devanagari, Grantha, Kannada]
-20f0 Mark: Non-spacing mark, Inherited, Extend, [Devanagari, Grantha, Latin]
-1cd0 Mark: Non-spacing mark, Inherited, Extend, [Bengali, Devanagari, Grantha, Kannada]
+U+A836 Symbol: Other symbol, Common, Other, [Devanagari, Dogra, Gujarati, Gurmukhi, Khojki, Kaithi, Mahajani, Modi, Khudawadi, Takri, Tirhuta]
+U+A833 Number: Other number, Common, Other, [Devanagari, Dogra, Gujarati, Gurmukhi, Khojki, Kannada, Kaithi, Mahajani, Modi, Nandinagari, Khudawadi, Takri, Tirhuta]
+U+1CF4 Mark: Non-spacing mark, Inherited, Extend, [Devanagari, Grantha, Kannada]
+U+20F0 Mark: Non-spacing mark, Inherited, Extend, [Devanagari, Grantha, Latin]
+U+1CD0 Mark: Non-spacing mark, Inherited, Extend, [Bengali, Devanagari, Grantha, Kannada]

 findprop 32ff
-32ff Symbol: Other symbol, Common, Other, [Han]
+U+32FF Symbol: Other symbol, Common, Other, [Han]

 findprop 1f16d
-1f16d Symbol: Other symbol, Common, Extended Pictographic
+U+1F16D Symbol: Other symbol, Common, Extended Pictographic

-findprop 10e93 10eaa
-10e93 Letter: Other letter, Yezidi, Other
-10eaa Control: Unassigned, Unknown, Other
+findprop U+10e93 U+10eaa
+U+10E93 Letter: Other letter, Yezidi, Other
+U+10EAA Control: Unassigned, Unknown, Other

Added: code/trunk/maint/ucptestdata/testoutput2
===================================================================
--- code/trunk/maint/ucptestdata/testoutput2                            (rev 0)
+++ code/trunk/maint/ucptestdata/testoutput2    2020-04-01 17:00:33 UTC (rev 1241)
@@ -0,0 +1,188 @@
+find script Han
+U+2E80..U+2E99 Symbol: Other symbol, Han, Other
+U+2E9B..U+2EF3 Symbol: Other symbol, Han, Other
+U+2F00..U+2FD5 Symbol: Other symbol, Han, Other
+        U+3005 Letter: Modifier letter, Han, Other
+        U+3007 Number: Letter number, Han, Other
+U+3021..U+3029 Number: Letter number, Han, Other
+U+3038..U+303A Number: Letter number, Han, Other
+        U+303B Letter: Modifier letter, Han, Other
+U+3400..U+4DBF Letter: Other letter, Han, Other
+U+4E00..U+9FFC Letter: Other letter, Han, Other
+U+F900..U+FA6D Letter: Other letter, Han, Other
+U+FA70..U+FAD9 Letter: Other letter, Han, Other
+U+16FF0..U+16FF1 Mark: Spacing mark, Han, SpacingMark
+U+20000..U+2A6DD Letter: Other letter, Han, Other
+U+2A700..U+2B734 Letter: Other letter, Han, Other
+U+2B740..U+2B81D Letter: Other letter, Han, Other
+U+2B820..U+2CEA1 Letter: Other letter, Han, Other
+U+2CEB0..U+2EBE0 Letter: Other letter, Han, Other
+U+2F800..U+2FA1D Letter: Other letter, Han, Other
+U+30000..U+3134A Letter: Other letter, Han, Other
+find type Pe script Common scriptx Hangul
+U+3009 Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+300B Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+300D Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+300F Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+3011 Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+3015 Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+3017 Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+3019 Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+301B Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+U+301E..U+301F Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana]
+        U+FF63 Punctuation: Close punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana, Yi]
+find type Sk
+U+005E Symbol: Modifier symbol, Common, Other
+U+0060 Symbol: Modifier symbol, Common, Other
+U+00A8 Symbol: Modifier symbol, Common, Other
+U+00AF Symbol: Modifier symbol, Common, Other
+U+00B4 Symbol: Modifier symbol, Common, Other
+U+00B8 Symbol: Modifier symbol, Common, Other
+U+02C2..U+02C5 Symbol: Modifier symbol, Common, Other
+U+02D2..U+02DF Symbol: Modifier symbol, Common, Other
+U+02E5..U+02E9 Symbol: Modifier symbol, Common, Other
+U+02EA..U+02EB Symbol: Modifier symbol, Bopomofo, Other
+        U+02ED Symbol: Modifier symbol, Common, Other
+U+02EF..U+02FF Symbol: Modifier symbol, Common, Other
+        U+0375 Symbol: Modifier symbol, Greek, Other
+        U+0384 Symbol: Modifier symbol, Greek, Other
+        U+0385 Symbol: Modifier symbol, Common, Other
+        U+1FBD Symbol: Modifier symbol, Greek, Other
+U+1FBF..U+1FC1 Symbol: Modifier symbol, Greek, Other
+U+1FCD..U+1FCF Symbol: Modifier symbol, Greek, Other
+U+1FDD..U+1FDF Symbol: Modifier symbol, Greek, Other
+U+1FED..U+1FEF Symbol: Modifier symbol, Greek, Other
+U+1FFD..U+1FFE Symbol: Modifier symbol, Greek, Other
+U+309B..U+309C Symbol: Modifier symbol, Common, Other, [Hiragana, Katakana]
+U+A700..U+A707 Symbol: Modifier symbol, Common, Other, [Han, Latin]
+U+A708..U+A716 Symbol: Modifier symbol, Common, Other
+U+A720..U+A721 Symbol: Modifier symbol, Common, Other
+U+A789..U+A78A Symbol: Modifier symbol, Common, Other
+        U+AB5B Symbol: Modifier symbol, Common, Other
+U+AB6A..U+AB6B Symbol: Modifier symbol, Common, Other
+U+FBB2..U+FBC1 Symbol: Modifier symbol, Arabic, Other
+        U+FF3E Symbol: Modifier symbol, Common, Other
+        U+FF40 Symbol: Modifier symbol, Common, Other
+        U+FFE3 Symbol: Modifier symbol, Common, Other
+U+1F3FB..U+1F3FF Symbol: Modifier symbol, Common, Extend
+find type Pd
+U+002D Punctuation: Dash punctuation, Common, Other
+U+058A Punctuation: Dash punctuation, Armenian, Other
+U+05BE Punctuation: Dash punctuation, Hebrew, Other
+U+1400 Punctuation: Dash punctuation, Canadian_Aboriginal, Other
+U+1806 Punctuation: Dash punctuation, Mongolian, Other
+U+2010..U+2015 Punctuation: Dash punctuation, Common, Other
+        U+2E17 Punctuation: Dash punctuation, Common, Other
+        U+2E1A Punctuation: Dash punctuation, Common, Other
+U+2E3A..U+2E3B Punctuation: Dash punctuation, Common, Other
+        U+2E40 Punctuation: Dash punctuation, Common, Other
+        U+301C Punctuation: Dash punctuation, Common, Other, [Bopomofo, Hangul, Han, Hiragana, Katakana]
+        U+3030 Punctuation: Dash punctuation, Common, Extended Pictographic, [Bopomofo, Hangul, Han, Hiragana, Katakana]
+        U+30A0 Punctuation: Dash punctuation, Common, Other, [Hiragana, Katakana]
+U+FE31..U+FE32 Punctuation: Dash punctuation, Common, Other
+        U+FE58 Punctuation: Dash punctuation, Common, Other
+        U+FE63 Punctuation: Dash punctuation, Common, Other
+        U+FF0D Punctuation: Dash punctuation, Common, Other
+        U+10EAD Punctuation: Dash punctuation, Yezidi, Other
+find gbreak LVT
+U+AC01..U+AC1B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AC1D..U+AC37 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AC39..U+AC53 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AC55..U+AC6F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AC71..U+AC8B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AC8D..U+ACA7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ACA9..U+ACC3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ACC5..U+ACDF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ACE1..U+ACFB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ACFD..U+AD17 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AD19..U+AD33 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AD35..U+AD4F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AD51..U+AD6B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AD6D..U+AD87 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AD89..U+ADA3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ADA5..U+ADBF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ADC1..U+ADDB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ADDD..U+ADF7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+ADF9..U+AE13 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AE15..U+AE2F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AE31..U+AE4B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AE4D..U+AE67 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AE69..U+AE83 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AE85..U+AE9F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AEA1..U+AEBB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AEBD..U+AED7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AED9..U+AEF3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AEF5..U+AF0F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AF11..U+AF2B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AF2D..U+AF47 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AF49..U+AF63 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AF65..U+AF7F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AF81..U+AF9B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AF9D..U+AFB7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AFB9..U+AFD3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AFD5..U+AFEF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+AFF1..U+B00B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B00D..U+B027 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B029..U+B043 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B045..U+B05F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B061..U+B07B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B07D..U+B097 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B099..U+B0B3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B0B5..U+B0CF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B0D1..U+B0EB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B0ED..U+B107 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B109..U+B123 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B125..U+B13F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B141..U+B15B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B15D..U+B177 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B179..U+B193 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B195..U+B1AF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B1B1..U+B1CB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B1CD..U+B1E7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B1E9..U+B203 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B205..U+B21F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B221..U+B23B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B23D..U+B257 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B259..U+B273 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B275..U+B28F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B291..U+B2AB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B2AD..U+B2C7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B2C9..U+B2E3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B2E5..U+B2FF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B301..U+B31B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B31D..U+B337 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B339..U+B353 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B355..U+B36F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B371..U+B38B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B38D..U+B3A7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B3A9..U+B3C3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B3C5..U+B3DF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B3E1..U+B3FB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B3FD..U+B417 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B419..U+B433 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B435..U+B44F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B451..U+B46B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B46D..U+B487 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B489..U+B4A3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B4A5..U+B4BF Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B4C1..U+B4DB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B4DD..U+B4F7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B4F9..U+B513 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B515..U+B52F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B531..U+B54B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B54D..U+B567 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B569..U+B583 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B585..U+B59F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B5A1..U+B5BB Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B5BD..U+B5D7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B5D9..U+B5F3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B5F5..U+B60F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B611..U+B62B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B62D..U+B647 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B649..U+B663 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B665..U+B67F Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B681..U+B69B Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B69D..U+B6B7 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B6B9..U+B6D3 Letter: Other letter, Hangul, Hangul syllable type LVT
+U+B6D5..U+B6EF Letter: Other letter, Hangul, Hangul syllable type LVT
+...


Modified: code/trunk/maint/utf8.c
===================================================================
--- code/trunk/maint/utf8.c    2020-03-25 17:18:33 UTC (rev 1240)
+++ code/trunk/maint/utf8.c    2020-04-01 17:00:33 UTC (rev 1241)
@@ -1,30 +1,47 @@
-/* A test program for converting characters to UTF-8 and vice versa. Note that
-this program conforms to the original definition of UTF-8, which allows
-codepoints up to 7fffffff. The more recent definition limits the validity of
-UTF-8 codepoints to a maximum of 10ffffff.
+/****************************************************
+* PCRE maintainers' helper program: UTF-8 converter *
+****************************************************/


-The arguments are either single codepoint values, written as 0xhhhh, for
-conversion to UTF-8, or sequences of hex values, written without 0x and
-optionally including spaces (but such arguments must be quoted), for conversion
+/* This is a test program for converting character code points to UTF-8 and
+vice versa. Note that this program conforms to the original definition of
+UTF-8, which allows codepoints up to 7fffffff. The more recent definition
+limits the validity of Unicode UTF-8 codepoints to a maximum of 10ffff, and
+forbids the "surrogate" code points. This program now gives warnings for these
+invalid code points.
+
+The arguments are either single code point values written as U+hh.. or 0xhh..
+for conversion to UTF-8, or sequences of hex values, written without 0x and
+optionally including spaces (but such arguments must be quoted), for conversion
from UTF-8 to codepoints. For example:

./utf8 0x1234
-0x00001234 => e1 88 b4
+U+00001234 => e1 88 b4

./utf8 "e1 88 b4"
-0x00001234 <= e1 88 b4
+U+00001234 <= e1 88 b4

-In the second case, a number of characters can be present in one argument:
+In the second case, a number of UTF-8 characters can be present in one
+argument. In other words, each such argument is interpreted (after ignoring
+spaces) as a string of UTF-8 bytes representing a string of characters:

./utf8 "65 e188b4 77"
-0x00000065 <= 65
-0x00001234 <= e1 88 b4
-0x00000077 <= 77
+U+00000065 <= 65
+U+00001234 <= e1 88 b4
+U+00000077 <= 77

-If the option -s is given, the sequence of UTF-bytes is written out between
+If the option -s is given, the sequence of UTF-bytes is written out between
angle brackets at the end of the line. On a UTF-8 terminal, this will show the
-appropriate graphic for the codepoint. */
+appropriate graphic for the code point.

+Errors provoke error messages, but the program carries on with the next
+argument. The return code is always zero.
+
+Philip Hazel
+Original creation date: unknown
+Code extended and tidied to avoid compiler warnings: 26 March 2020
+*/
+
+
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
@@ -41,20 +58,14 @@
*/


-static const int utf8_table1[] = {
- 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};
+static const unsigned int utf8_table1[] = {
+ 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};

static const int utf8_table2[] = {
- 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};
-
+ 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};
+
static const int utf8_table3[] = {
- 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};
-
-static const unsigned char utf8_table4[] = {
- 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
- 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
- 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
- 3,3,3,3,3,3,3,3,4,4,4,4,5,5,6,6 };
+ 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};


 /*************************************************
@@ -61,27 +72,24 @@
 *       Convert character value to UTF-8         *
 *************************************************/


-/* This function takes an integer value in the range 0 - 0x7fffffff
-and encodes it as a UTF-8 character in 1 to 6 bytes.
+/* This function takes an unsigned long integer value in the range 0 -
+0x7fffffff and encodes it as a UTF-8 character in 1 to 6 bytes.

-Arguments:   
-  cvalue     the character value 
+Arguments:
+  cvalue     the character value
   buffer     pointer to buffer for result - at least 6 bytes long
-  
-Returns:     number of characters placed in the buffer
-             -1 if input character is negative  
-             0 if input character is positive but too big (only when
-             int is longer than 32 bits) 
+
+Returns:     number of bytes placed in the buffer
+             0 if input code point is too big
 */


-int
-ord2utf8(int cvalue, unsigned char *buffer)
+static size_t
+ord2utf8(unsigned long int cvalue, unsigned char *buffer)
 {
-register int i, j;
+size_t i, j;
 for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
   if (cvalue <= utf8_table1[i]) break;
 if (i >= sizeof(utf8_table1)/sizeof(int)) return 0;
-if (cvalue < 0) return -1;
 buffer += i;
 for (j = i; j > 0; j--)
  {
@@ -98,33 +106,60 @@
 *            Convert UTF-8 string to value       *
 *************************************************/


-/* This function takes one or more bytes that represents a UTF-8 character,
-and returns the value of the character.
+/* This function takes one or more bytes that represent a UTF-8 character from
+the start of a string of bytes. It returns the value of the character, or the
+offset of a malformation. For an overlong encoding that works but is not the
+correct (shortest) one, the error offset is just after the last byte.

-Argument:  
+Arguments:
   buffer   a pointer to the byte vector
-  vptr     a pointer to an int to receive the value 
+  buffend  a pointer to the end of the buffer
+  vptr     a pointer to a variable to receive the value
+  lenptr   a pointer to a variable to receive the offset when error detected


-Returns:   >  0 => the number of bytes consumed
-           -6 to 0 => malformed UTF-8 character at offset = (-return)
+Returns:   > 0 => the number of bytes consumed
+             0 => invalid UTF-8: first byte missing 0x40 bit
+            -1 => invalid UTF-8: first byte has too many high-order 1-bits
+            -2 => incomplete sequence at end of string
+            -3 => incomplete sequence within string
+            -4 => overlong code sequence
 */


-int
-utf82ord(unsigned char *buffer, int *vptr)
+static int
+utf82ord(unsigned char *buffer, unsigned char *buffend,
+ long unsigned int *vptr, int *lenptr)
{
-int c = *buffer++;
-int d = c;
+unsigned int c = *buffer++;
+unsigned int d = c;
int i, j, s;

-for (i = -1; i < 6; i++)               /* i is number of additional bytes */
+/* Check for an ASCII character, or find the number of additional bytes in a
+multibyte character. */
+
+for (i = -1; i < 6; i++)
   {
   if ((d & 0x80) == 0) break;
   d <<= 1;
   }


-if (i == -1) { *vptr = c; return 1; }  /* ascii character */
-if (i == 0 || i == 6) return 0;        /* invalid UTF-8 */
+switch (i)
+  {
+  case -1:      /* ASCII character; first byte does not have 0x80 bit */
+  *vptr = c;
+  return 1;


+  case 0:       /* First byte has 0x80 but is missing 0x40 bit */
+  *lenptr = 0;
+  return 0;
+
+  case 6:
+  *lenptr = 0;  /* Too many high bits */
+  return -1;
+
+  default:
+  break;
+  }
+
 /* i now has a value in the range 1-5 */


s = 6*i;
@@ -132,32 +167,46 @@

 for (j = 0; j < i; j++)
   {
+  if (buffer >= buffend)
+    {
+    *lenptr = j + 1;
+    return -2;
+    }
   c = *buffer++;
-  if ((c & 0xc0) != 0x80) return -(j+1);
+  if ((c & 0xc0) != 0x80)
+    {
+    *lenptr = j + 1;
+    return -3;
+    }
   s -= 6;
   d |= (c & 0x3f) << s;
   }


-/* Check that encoding was the correct unique one */
+/* Valid UTF-8 syntax */

-for (j = 0; j < sizeof(utf8_table1)/sizeof(int); j++)
+*vptr = d;
+
+/* Check that encoding was the correct one, not overlong */
+
+for (j = 0; j < (int)(sizeof(utf8_table1)/sizeof(int)); j++)
if (d <= utf8_table1[j]) break;
-if (j != i) return -(i+1);
+if (j != i)
+ {
+ *lenptr = i + 1;
+ return -4;
+ }

/* Valid value */

-*vptr = d;
-return i+1;
+return i + 1;
}



-
 /*************************************************
 *                 Main Program                   *
 *************************************************/


-
int
main(int argc, char **argv)
{
@@ -169,85 +218,129 @@
{
show = 1;
i = 2;
- }
+ }

 for (; i < argc; i++)
   {
   char *x = argv[i];
-  if (strncmp(x, "0x", 2) == 0)
+  char *endptr;
+  if (strncmp(x, "0x", 2) == 0 || strncmp(x, "U+", 2) == 0)
     {
-    int j; 
-    int d = strtol(x+2, NULL, 16);
-    int rc = ord2utf8(d, buffer);
-    printf("0x%08x => ", d); 
-    if (rc <= 0) printf("*** Error %d ***", rc); else 
+    size_t rc, j;
+    unsigned long int d = strtoul(x+2, &endptr, 16);
+    if (*endptr != 0)
       {
+      printf("** Invalid hex number %s\n", x);
+      continue;   /* With next argument */
+      }
+    rc = ord2utf8(d, buffer);
+    printf("U+%08lx => ", d);
+    if (rc == 0)
+      printf("** Code point greater than 0x7fffffff cannot be encoded");
+    else
+      {
       for (j = 0; j < rc; j++) printf("%02x ", buffer[j]);
       if (show)
         {
         printf(">");
         for (j = 0; j < rc; j++) printf("%c", buffer[j]);
-        printf("<"); 
-        }  
-      } 
-    printf("\n");   
+        printf("< ");
+        }
+      if (d >= 0xd800 && d <= 0xdfff)
+        printf("** Invalid Unicode (surrogate)");
+      else if (d > 0x10ffff)
+        printf("** Invalid Unicode (greater than U+10ffff)");
+      }
+    printf("\n");
     }
   else
     {
-    int d, rc; 
-    int j = 0;
-    int y = 0; 
+    unsigned char *bptr;
+    unsigned char *buffend;
+    int len = 0;
+    int y = 0;
     int z = 0;
-    unsigned char *bptr;
-       
-    for (;;) 
-      { 
-      while (*x == ' ') x++; 
+
+    for (;;)
+      {
+      while (*x == ' ') x++;
       if (*x == 0 && !z) break;
-      if (!isxdigit(*x)) 
+      if (!isxdigit(*x))
         {
-        printf("Malformed hex string: %s\n", argv[i]);
-        j = -1;
-        break;    
-        } 
+        printf("** Malformed hex string: %s\n", argv[i]);
+        len = -1;
+        break;
+        }
       y = y * 16 + tolower(*x) - ((isdigit(*x))? '0' : 'W');
-      x++; 
+      x++;
       if (z)
-        { 
-        buffer[j++] = y;
+        {
+        buffer[len++] = y;
         y = 0;
         }
-      z ^= 1;     
-      } 
-    buffer[j] = 0;
+      z ^= 1;
+      }
+
+    if (len < 0) continue;  /* With next argument after malformation */
+
     bptr = buffer;
+    buffend = buffer + len;


-    while (*bptr != 0)
-      { 
-      rc = utf82ord(bptr, &d);
-      if (rc > 0) 
+    while (bptr < buffend)
+      {
+      unsigned long int d;
+      int j;
+      int offset;
+      int rc = utf82ord(bptr, buffend, &d, &offset);
+
+      if (rc > 0)
         {
-        printf("0x%08x <= ", d);
+        printf("U+%08lx <= ", d);
         for (j = 0; j < rc; j++) printf("%02x ", bptr[j]);
         if (show)
           {
           printf(">");
           for (j = 0; j < rc; j++) printf("%c", bptr[j]);
-          printf("<"); 
-          }  
+          printf("<");
+          }
         printf("\n");
-        bptr += rc; 
-        } 
-      else 
+        bptr += rc;
+        }
+      else if (rc == -4)
         {
-        printf("Malformed UTF-8 at offset %d <= ", -rc);
-        while (*bptr != 0) printf("%02x ", *bptr++);
-        printf("\n"); 
-        break;  
-        } 
-      }   
-    }       
-  } 
+        printf("U+%08lx <= ", d);
+        for (j = 0; j < offset; j++) printf("%02x ", bptr[j]);
+        printf("** Overlong UTF-8 sequence\n");
+        bptr += offset;
+        }
+      else
+        {
+        switch (rc)
+          {
+          case 0:  printf("** First byte missing 0x40 bit");
+          break;
+
+          case -1: printf("** First byte has too many high-order bits");
+          break;
+
+          case -2: printf("** Incomplete UTF-8 sequence at end of string");
+          break;
+
+          case -3: printf("** Incomplete UTF-8 sequence");
+          break;
+
+          default: printf("** Unexpected return %d from utf82ord()", rc);
+          break;
+          }
+        printf(" at offset %d in string ", offset);
+        while (bptr < buffend) printf("%02x ", *bptr++);
+        printf("\n");
+        break;
+        }
+      }
+    }
+  }
+
 return 0;
 }