Re: [pcre-dev] PCRE on EBCDIC tests

Author: Ze'ev Atlas
Date:  
To: pcre-dev@exim.org
Subject: Re: [pcre-dev] PCRE on EBCDIC tests
I ran pcretest with -C option to get:
PCRE version 8.37 2015-04-28
Compiled with
  EBCDIC code support: LF is 0x25
  8-bit support
  No UTF-8 support
  No Unicode properties support
  No just-in-time compiler support
  Newline sequence is a non-standard value: 0x0015
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Parentheses nest limit = 250
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack

This reminded me of an old conversation we had some time ago:

<snip>
I've been reading various web pages about EBCDIC systems, and they suggest that the NL EBCDIC character (0x15) is used as the equivalent of ASCII LF, though EBCDIC does have its own LF character (0x25) which was mentioned as sometimes used. Ze'ev's experience suggests that 0x15 is used in his environment.

Unicode has a NEL character 0x85, which I guess should be equivalent to EBCDIC's 0x25 when 0x15 is NL, as suggested in
http://unicode.org/standard/reports/tr13/tr13-5.html

I am not sure why the user who contributed the original EBCDIC patch used the name CHAR_NL rather than CHAR_LF for the character represented by '\n', but I guess it was because it is usually the NL character. I think I will change that name, and introduce CHAR_NEL for the other character.

Ze'ev: please can you check that '\n' really is 0x15 in your environment?
<snip>
I wrote a little program that reads my test file into a chunk of memory, prints it, and dumps the memory. You will see below that the equivalent of \n is indeed 21 = 0x15. My conclusion is that we need a way to tell PCRE that \n is not the LF character but the NL character, or some other way to dictate what \n should be.
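For reference, a minimal sketch of this kind of check. It is not the program used for the dump in this message (that program is not shown), and the helper names newline_code and format_hex_row are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Value of '\n' in the execution character set of this build.
   0x0A on ASCII systems; the dump in this message suggests 0x15
   on the EBCDIC build being tested. */
static unsigned int newline_code(void)
{
    return (unsigned int)(unsigned char)'\n';
}

/* Hypothetical helper: write up to 16 bytes of buf as "XX XX ..."
   into out. Returns the number of characters written, or -1 if
   out is too small. */
static int format_hex_row(const unsigned char *buf, size_t n,
                          char *out, size_t outsz)
{
    size_t i, pos = 0;

    assert(buf != NULL && out != NULL);
    if (outsz == 0) return -1;
    out[0] = '\0';
    if (n > 16) n = 16;

    for (i = 0; i < n; i++) {
        /* Each byte needs at most " XX" plus the terminating NUL. */
        if (outsz - pos < 4) return -1;
        pos += (size_t)snprintf(out + pos, outsz - pos,
                                i ? " %02X" : "%02X",
                                (unsigned int)buf[i]);
    }
    return (int)pos;
}
```

Printing newline_code() with %02X, and passing the file buffer through format_hex_row 16 bytes at a time, should give output comparable to the hex column of the dump.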

/(?<=foo\n)¬bar/Im
    foo\x0Fbarbar
    ***Failers
    rhubarb
    barbell
    abc\nbarton

/¬(?<=foo\n)bar/Im
    foo\x15barbar
    ***Failers
    rhubarb
    barbell
    abc\nbarton

/(?<=foo\x0F)¬bar/Im
    foo\x0Fbarbar
    ***Failers
    rhubarb
    barbell
    abc\nbarton

/¬(?<=foo\x15)bar/Im
    foo\x15barbar
    ***Failers
    rhubarb
    barbell
    abc\nbarton

/(?>¬abc)/Im
    abc
    def\nabc
    *** Failers
    defabc

/(?<=ab(c+)d)ef/

/(?<=ab(?<=c+)d)ef/

/(?<=ab(c|de)f)g/

082FF5E0 | 61 4D 6F 4C 7E 86 96 96 E0 95 5D 5F 82 81 99 61 | /(?<=foo\n)¬bar/
082FF5F0 | C9 94 15 40 40 40 40 86 96 96 E0 A7 F0 C6 82 81 | Im     foo\x0Fba
082FF600 | 99 82 81 99 40 15 40 40 40 40 5C 5C 5C C6 81 89 | rbar      ***Fai
082FF610 | 93 85 99 A2 15 40 40 40 40 99 88 A4 82 81 99 82 | lers     rhubarb
082FF620 | 15 40 40 40 40 82 81 99 82 85 93 93 15 40 40 40 |      barbell
082FF630 | 40 81 82 83 E0 95 82 81 99 A3 96 95 15 15 61 5F |  abc\nbarton  /¬
082FF640 | 4D 6F 4C 7E 86 96 96 E0 95 5D 82 81 99 61 C9 94 | (?<=foo\n)bar/Im
082FF650 | 15 40 40 40 40 86 96 96 E0 A7 F1 F5 82 81 99 82 |      foo\x15barb
082FF660 | 81 99 40 15 40 40 40 40 5C 5C 5C C6 81 89 93 85 | ar      ***Faile
082FF670 | 99 A2 15 40 40 40 40 99 88 A4 82 81 99 82 15 40 | rs     rhubarb
082FF680 | 40 40 40 82 81 99 82 85 93 93 15 40 40 40 40 81 |    barbell     a
082FF690 | 82 83 E0 95 82 81 99 A3 96 95 15 15 61 4D 6F 4C | bc\nbarton  /(?<
082FF6A0 | 7E 86 96 96 E0 A7 F0 C6 5D 5F 82 81 99 61 C9 94 | =foo\x0F)¬bar/Im
082FF6B0 | 15 40 40 40 40 86 96 96 E0 A7 F0 C6 82 81 99 82 |      foo\x0Fbarb
082FF6C0 | 81 99 40 15 40 40 40 40 5C 5C 5C C6 81 89 93 85 | ar      ***Faile
082FF6D0 | 99 A2 15 40 40 40 40 99 88 A4 82 81 99 82 15 40 | rs     rhubarb
082FF6E0 | 40 40 40 82 81 99 82 85 93 93 15 40 40 40 40 81 |    barbell     a
082FF6F0 | 82 83 E0 95 82 81 99 A3 96 95 15 15 61 5F 4D 6F | bc\nbarton  /¬(?
082FF700 | 4C 7E 86 96 96 E0 A7 F1 F5 5D 82 81 99 61 C9 94 | <=foo\x15)bar/Im
082FF710 | 15 40 40 40 40 86 96 96 E0 A7 F1 F5 82 81 99 82 |      foo\x15barb
082FF720 | 81 99 40 15 40 40 40 40 5C 5C 5C C6 81 89 93 85 | ar      ***Faile
082FF730 | 99 A2 15 40 40 40 40 99 88 A4 82 81 99 82 15 40 | rs     rhubarb
082FF740 | 40 40 40 82 81 99 82 85 93 93 15 40 40 40 40 81 |    barbell     a
082FF750 | 82 83 E0 95 82 81 99 A3 96 95 15 15 61 4D 6F 6E | bc\nbarton  /(?>
082FF760 | 5F 81 82 83 5D 61 C9 94 15 40 40 40 40 81 82 83 | ¬abc)/Im     abc
082FF770 | 15 40 40 40 40 84 85 86 E0 95 81 82 83 15 40 40 |      def\nabc
082FF780 | 40 40 5C 5C 5C 40 C6 81 89 93 85 99 A2 15 40 40 |   *** Failers
082FF790 | 40 40 84 85 86 81 82 83 15 15 61 4D 6F 4C 7E 81 |   defabc  /(?<=a
082FF7A0 | 82 4D 83 4E 5D 84 5D 85 86 61 15 15 61 4D 6F 4C | b(c+)d)ef/  /(?<
082FF7B0 | 7E 81 82 4D 6F 4C 7E 83 4E 5D 84 5D 85 86 61 15 | =ab(?<=c+)d)ef/
082FF7C0 | 15 61 4D 6F 4C 7E 81 82 4D 83 4F 84 85 5D 86 5D |  /(?<=ab(c|de)f)
082FF7D0 | 87 61 15 __ __ __ __ __ __ __ __ __ __ __ __ __ | g/

Ze'ev Atlas


From: Ze'ev Atlas <zatlas1@???>
To: "pcre-dev@???" <pcre-dev@???>
Sent: Thursday, May 28, 2015 3:31 PM
Subject: Re: PCRE on EBCDIC tests


\n is definitely not defined correctly.
I ran 4 tests.  Note that new line is defined correctly as 21 = 0x15 in EBCDIC
/(?<=foo\n)¬bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
No first char
Need char = 'r'
    foo\x0Fbarbar
No match
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

/¬(?<=foo\n)bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
First char at start or follows newline
Need char = 'r'
    foo\x15barbar
No match
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

/(?<=foo\x0F)¬bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Options: multiline
No first char
Need char = 'r'
    foo\x0Fbarbar
No match
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

/¬(?<=foo\x15)bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Options: multiline
First char at start or follows newline
Need char = 'r'
    foo\x15barbar
 0: bar
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match
In my config.h I have

#ifndef NEWLINE
#define NEWLINE 21
#endif
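To make the mismatch concrete, here is a small sketch that compares the code a pattern's \n stands for (0x25, per the "LF is 0x25" line in the pcretest -C output) with the configured NEWLINE value (21). Both helpers are hypothetical and are not PCRE functions. The decoding rule, where values above 255 pack two characters, is an assumption about how NEWLINE is interpreted and should be checked against the PCRE sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Splits a NEWLINE-style
   value into its character codes. Assumption: values up to 255 are a
   single character and larger values pack two characters as
   (first << 8) | second; negative values (ANY, ANYCRLF) are not
   literal sequences. Returns the sequence length, or 0. */
static size_t decode_newline(int value, unsigned char nl[2])
{
    assert(nl != NULL);
    if (value < 0 || value > 0xFFFF) return 0;
    if (value > 255) {
        nl[0] = (unsigned char)((value >> 8) & 0xFF);
        nl[1] = (unsigned char)(value & 0xFF);
        return 2;
    }
    nl[0] = (unsigned char)value;
    return 1;
}

/* Returns 1 if the code that a pattern's \n compiles to (lf_code) is
   the same single character as the configured newline, else 0. */
static int escape_n_is_newline(int lf_code, int newline_value)
{
    unsigned char nl[2];
    return decode_newline(newline_value, nl) == 1 && nl[0] == lf_code;
}
```

With the numbers from this thread, escape_n_is_newline(0x25, 21) returns 0, which is consistent with (?<=foo\n) failing against foo\x15 while (?<=foo\x15) matches. With an LF code of 0x15 it would return 1.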