Author: ph10
Date:
To: Ze'ev Atlas
CC: Pcre Exim
Subject: Re: [pcre-dev] PCRE on EBCDIC tests

On Thu, 28 May 2015, Ze'ev Atlas wrote:
> PCRE for classic z/OS is now stable and development effort is basically over. Before I begin to tackle PCRE2, I would like to tackle the test suite in a more serious manner regarding the EBCDIC vs. ASCII environment. From previous communication I know that for 8-bit code I need only TESTIN1, TESTIN2, TESTIN11 and TESTIN12. I ran those tests as is, knowing that the results would be different than the results on an ASCII platform. Indeed, there were some expected differences, but also some unexpected ones, or at least some that I do not understand. I will have to ask questions as I scan and try to make sense of those differences.
>
> Two comments before my questions:
> * The character logical not ¬ is the EBCDIC equivalent of the circumflex ^
> * I am somewhat surprised that most tests actually produced the same results.
>
> 1. On TESTOUT11-8 line 284 you have:
> /\x{100}/8BM
> <snip>
> I get:
> /\x{100}/8BM
> Failed: this version of PCRE is compiled without UTF support at offset 0
Indeed. Test 11 requires UTF and UCP support (as it says in the comment
at the top). As you won't have this in the EBCDIC version, you get that
error.
> 2. I guess that the tests with saved patterns on 16 and 32 bits are not really relevant because nobody could produce their equivalent on EBCDIC platforms.
Correct.
> 3. On TESTOUT2, line 518 you have:
> /(?i)[abcd]/IS
> Capturing subpattern count = 0
> Options: caseless
> No first char
> No need char
> Subject length lower bound = 1
> Starting chars: A B C D a b c d
> while I have
> /(?i)[abcd]/IS
> Capturing subpattern count = 0
> Options: caseless
> No first char
> No need char
> Subject length lower bound = 1
> Starting chars: a b c d A B C D
> The small and capital letters switch places. Now, in EBCDIC, the
> capital letters appear after the small letters (e.g. A is 0xC1 while a
> is 0x81) while in ASCII the opposite is true. Would that cause the
> difference? Should it be like that?
Yes, that would cause the difference. The starting characters are listed
in collating sequence.
> 4. I am a bit concerned whether we have the \n defined correctly.
> In TESTOUT2, line 689 you have:
> /(?<=foo\n)¬bar/Im
> Capturing subpattern count = 0
> Max lookbehind = 4
> Contains explicit CR or LF match
> Options: multiline
> No first char
> Need char = 'r'
> foo\nbarbar
> 0: bar
> while I have:
> /(?<=foo\n)¬bar/Im
> Capturing subpattern count = 0
> Max lookbehind = 4
> Contains explicit CR or LF match
> Options: multiline
> No first char
> Need char = 'r'
> foo\nbarbar
> No match
Hmm. Not sure what to make of that. Looks like the interaction between ^
(that is ¬) and \n may not be working properly in multiline mode. Or
maybe the conversion of \n into a newline character in the pcretest
program is not working. Can you try replacing foo\nbarbar with
foo\x{15}barbar
and see if that works? The code in pcretest changes \n into whatever
your C compiler uses for '\n'.
> Similarly TESTOUT2, line 1098
> /word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)?)?)?)?)?)?)?)?)?otherword/I
> Capturing subpattern count = 8
> Contains explicit CR or LF match
> No options
> First char = 'w'
> Need char = 'd'
> while I have:
> /word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)?)?)?)?)?)?)?)?)?otherword/I
> Capturing subpattern count = 8
> No options
> First char = 'w'
> Need char = 'd'
This is because in the ASCII input file the pattern is split over three
lines. The newline characters are taken as part of the data. Perhaps you
lost the newlines when you converted the input files to EBCDIC.