[pcre-dev] Fw: [a-z] class in EBCDIC and Perl-MVS status que…

Author: Ze'ev Atlas
Date:  
To: Philip Hazel, Pcre Exim
CC: Karl Williamson
Subject: [pcre-dev] Fw: [a-z] class in EBCDIC and Perl-MVS status question
Hi Philip,

As promised, I posed the question to the Perl-MVS community, and here is the answer I got (see below).  I admit that it was news to me as well, but apparently the Perl-MVS guys went the extra mile to do that.  I am not saying that we must follow Perl to the letter, but you may want to consider implementing the behavior described below in PCRE2.  I may get involved, though it would take me some time (if I can at all), but first we need a decision from you on whether we want to consider it at all.  If we ultimately decide never to do it, we have to mention it in the documentation as a difference from Perl.
Ze'ev Atlas

    ----- Forwarded Message -----
  From: Karl Williamson <public@???>
 To: "Atlas, Ze'Ev" ; "perl-mvs@???" 
 Sent: Friday, June 19, 2015 3:38 PM
 Subject: Re: [a-z] class in EBCDIC and Perl-MVS status question


On 06/18/2015 09:01 AM, Atlas, Ze'Ev wrote:
<snip - some irrelevant material>

>
> 2. The perlre document in perldoc (5.20) states:
>
>  (The following all specify the same class of three characters: [-az] ,
> [az-] , and [a\-z] . All are different from [a-z] , which specifies a
> class containing twenty-six characters, even on EBCDIC-based character
> sets.)
>
> The implication is that Perl somehow recognizes [a-z], treats it as a
> special case in EBCDIC, and ignores the non-letter gaps.  Do I understand
> it correctly, and is it implemented as advertised?
>
> Ze'ev Atlas


Yes, it is implemented as advertised.  If you do want to include the gap
characters, you can instead write [\x81-\xA9].  But when both ends of
the range are literals, like "A", and the range is any subset of [A-Z]
or [a-z], special handling is invoked internally to exclude the gap
characters.
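
The difference between the two range forms can be illustrated outside Perl. What follows is only a rough Python sketch, not Perl's or PCRE2's actual implementation. It assumes that Python's "cp037" codec is an acceptable stand-in for "EBCDIC" (other EBCDIC code pages differ in detail), and the function names native_range, expand_class_range, and gap_characters are made up for this illustration.

```python
import string

# Assumption: Python's "cp037" codec stands in for "EBCDIC" here; other
# EBCDIC code pages place some characters differently.
CODEC = "cp037"


def native_range(lo, hi):
    """Every native code point from lo to hi inclusive, gap characters included."""
    return list(range(lo.encode(CODEC)[0], hi.encode(CODEC)[0] + 1))


def expand_class_range(lo, hi):
    """Hypothetical expansion of a literal range lo-hi inside a character class.

    If both ends are single letters of the same alphabet (a-z or A-Z), only
    the letters between them are returned; otherwise the plain native range
    is used.
    """
    for alphabet in (string.ascii_lowercase, string.ascii_uppercase):
        if lo in alphabet and hi in alphabet:
            letters = alphabet[alphabet.index(lo):alphabet.index(hi) + 1]
            return [ch.encode(CODEC)[0] for ch in letters]
    return native_range(lo, hi)


def gap_characters(lo, hi):
    """Characters in the native range that the letters-only expansion drops."""
    kept = set(expand_class_range(lo, hi))
    return [bytes([cp]).decode(CODEC)
            for cp in native_range(lo, hi) if cp not in kept]
```

Under code page 037, native_range("a", "z") covers the 41 code points from 0x81 to 0xA9, which is what [\x81-\xA9] would match, while expand_class_range("a", "z") returns only the 26 letters. gap_characters("a", "z") lists the 15 non-letters that lie in between.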

The 5.22 EBCDIC documentation has been extensively revised by me to
accurately reflect the actual implementation.  Please file a bug report
on any discrepancies.  There are some known bugs in the EBCDIC version
not present when run on ASCII platforms.  Unfortunately, the
documentation on the web hasn't been properly updated yet to reflect
5.22.  Here's what the new perlebcdic says about known EBCDIC problems:

      *  The "cmp" (and hence "sort") operators do not necessarily give the
          correct results when both operands are UTF-EBCDIC encoded strings
          and there is a mixture of ASCII and/or control characters, along
          with other characters.

      *  Ranges containing "\N{...}" in the "tr///" (and "y///")
          transliteration operators are treated differently than the
          equivalent ranges in regular expression patterns. They should, but
          don't, cause the values in the ranges to all be treated as Unicode
          code points, and not native ones. ("Version 8 Regular Expressions"
          in perlre gives details as to how it should work.)

      *  There are some bugs in the "pack"/"unpack" "U0" template

      *  There are a significant number of test failures in the CPAN modules
          shipped with Perl v5.22. These are only in modules not primarily
          maintained by Perl 5 porters. Some of these are failures in the
          tests only: they don't realize that it is proper to get different
          results on EBCDIC platforms. And some of the failures are real
          bugs. If you compile and do a "make test" on Perl, all tests on
          the "/cpan" directory are skipped.

          In particular, the extensions Unicode::Collate and
          Unicode::Normalize are not supported under EBCDIC; likewise for
          the (now deprecated) encoding pragma.

          Encode partially works.
