[Pcre-svn] [871] code/trunk: Fix issues with UTF-8 in the Pe…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [871] code/trunk: Fix issues with UTF-8 in the Perl checking script.
Revision: 871
          http://vcs.pcre.org/viewvc?view=rev&revision=871
Author:   ph10
Date:     2012-01-14 16:20:44 +0000 (Sat, 14 Jan 2012)


Log Message:
-----------
Fix issues with UTF-8 in the Perl checking script.

Modified Paths:
--------------
    code/trunk/doc/perltest.txt
    code/trunk/perltest.pl


Modified: code/trunk/doc/perltest.txt
===================================================================
--- code/trunk/doc/perltest.txt    2012-01-14 11:23:25 UTC (rev 870)
+++ code/trunk/doc/perltest.txt    2012-01-14 16:20:44 UTC (rev 871)
@@ -28,13 +28,15 @@
 The perltest.pl script can also test UTF-8 features. It recognizes the special
 modifier /8 that pcretest uses to invoke UTF-8 functionality. The testinput4
 and testinput6 files can be fed to perltest to run compatible UTF-8 tests.
-However, it is necessary to add "use utf8;" to the script to make this work
-correctly.
+However, it is necessary to add "use utf8; require Encode" to the script to
+make this work correctly. I have not managed to find a way to handle this 
+automatically.


 The other testinput files are not suitable for feeding to perltest.pl, since
 they make use of the special upper case modifiers and escapes that pcretest
-uses to test some features of PCRE. Some of these files also contains malformed
-regular expressions, in order to check that PCRE diagnoses them correctly.
+uses to test certain features of PCRE. Some of these files also contain
+malformed regular expressions, in order to check that PCRE diagnoses them
+correctly.

 Philip Hazel
 January 2012

Modified: code/trunk/perltest.pl
===================================================================
--- code/trunk/perltest.pl    2012-01-14 11:23:25 UTC (rev 870)
+++ code/trunk/perltest.pl    2012-01-14 16:20:44 UTC (rev 871)
@@ -1,17 +1,19 @@
 #! /usr/bin/env perl


 # Program for testing regular expressions with perl to check that PCRE handles
-# them the same. This is the version that supports /8 for UTF-8 testing. As it
-# stands, it requires at least Perl 5.8 for UTF-8 support. However, it needs to
-# have "use utf8" at the start for running the UTF-8 tests, but *not* for the
-# other tests. The only way I've found for doing this is to cat this line in
-# explicitly in the RunPerlTest script.
+# them the same. This version supports /8 for UTF-8 testing. However, it needs
+# to have "use utf8" at the start for running the UTF-8 tests, but *not* for
+# the other tests. The only way I've found for doing this is to cat this line
+# in explicitly in the RunPerlTest script. I've also used this method to supply
+# "require Encode" for the UTF-8 tests, so that the main test will still run
+# where Encode is not installed.

 # use locale; # With this included, \x0b matches \s!

-# Function for turning a string into a string of printing chars. There are
-# currently problems with UTF-8 strings; this fudges round them.
+# Function for turning a string into a string of printing chars.

+#require Encode;
+
 sub pchars {
 my($t) = "";

@@ -21,10 +23,10 @@
   foreach $c (@p)
     {
     if ($c >= 32 && $c < 127) { $t .= chr $c; }
-      else { $t .= sprintf("\\x{%02x}", $c); }
+      else { $t .= sprintf("\\x{%02x}", $c); 
+      }
     }
   }
-
 else
   {
   foreach $c (split(//, $_[0]))
@@ -192,7 +194,7 @@
       {
       printf $outfile "No match";
       if (defined $REGERROR && $REGERROR != 1)
-        { print $outfile (", mark = $REGERROR"); }
+        { printf $outfile (", mark = %s", &pchars($REGERROR)); }
       printf $outfile "\n";
       }
     else
@@ -214,8 +216,17 @@
           }
         splice(@subs, 0, 18);
         }
+        
+      # It seems that $REGMARK is not marked as UTF-8 even when use utf8 is
+      # set and the input pattern was a UTF-8 string. We can, however, force
+      # it to be so marked.  
+       
       if (defined $REGMARK && $REGMARK != 1)
-        { print $outfile ("MK: $REGMARK\n"); }
+        {
+        $xx = $REGMARK;  
+        $xx = Encode::decode_utf8($xx) if $utf8; 
+        printf $outfile ("MK: %s\n", &pchars($xx)); 
+        }
       }
     }
   }
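
Note on the $REGMARK change: the following is a minimal standalone sketch, not
part of the commit, of why decoding the mark before printing matters. It
assumes the mark arrives as raw UTF-8 bytes. The variable $raw and the helper
show_chars() are hypothetical stand-ins for $REGMARK and the script's pchars();
only Encode::decode_utf8 is taken from the patch itself.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a mark name held as raw UTF-8 bytes:
# the two bytes that encode U+00E9.
my $raw = "\xc3\xa9";

# Decode the bytes into a character string, as the patch does for $REGMARK.
my $decoded = Encode::decode_utf8($raw);

# Simplified, hypothetical printer in the spirit of pchars(): printable ASCII
# is kept as-is, anything else is shown as \x{..} using its code point.
sub show_chars {
    my ($s) = @_;
    return join "", map {
        my $c = ord($_);
        ($c >= 32 && $c < 127) ? chr($c) : sprintf("\\x{%02x}", $c);
    } split //, $s;
}

print show_chars($raw), "\n";      # prints \x{c3}\x{a9}
print show_chars($decoded), "\n";  # prints \x{e9}
```

Without the decode step, each byte of the mark is reported as a separate
character; after decoding, the output shows the single code point.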