Author: Ze'ev Atlas
Date:
To: Pcre Exim, Philip Hazel
Subject: [pcre-dev] PCRE on EBCDIC tests

PCRE for classic z/OS is now stable, and the development effort is basically over. Before I begin to tackle PCRE2, I would like to approach the test suite more seriously with regard to the EBCDIC vs. ASCII environment. From previous communication I know that for 8-bit code I need only TESTIN1, TESTIN2, TESTIN11 and TESTIN12. I ran those tests as is, knowing that the results would differ from those on an ASCII platform. Indeed, there were some expected differences, but also some that were unexpected, or at least that I do not understand. I will have to ask questions as I scan and try to make sense of those differences.

Two comments before my questions:
* The logical-not character ¬ is the EBCDIC equivalent of the circumflex ^.
* I am somewhat surprised that most tests actually produced the same results.

1. On TESTOUT11-8, line 284, you have:

/\x{100}/8BM
Memory allocation (code space): 10
------------------------------------------------------------------
  0   6 Bra
  3     \x{100}
  6   6 Ket
  9     End
------------------------------------------------------------------

/\x{1000}/8BM
Memory allocation (code space): 11
------------------------------------------------------------------
  0   7 Bra
  3     \x{1000}
  7   7 Ket
 10     End
------------------------------------------------------------------
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

I get:

/\x{100}/8BM
Failed: this version of PCRE is compiled without UTF support at offset 0

/\x{1000}/8BM
Failed: this version of PCRE is compiled without UTF support at offset 0
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

2. I guess that the tests with saved patterns for 16 and 32 bits are not really relevant, because nobody could produce their equivalent on EBCDIC platforms.

3. On TESTOUT2, line 518, you have:

/(?i)[abcd]/IS
Capturing subpattern count = 0
Options: caseless
No first char
No need char
Subject length lower bound = 1
Starting chars: A B C D a b c d

while I have:

/(?i)[abcd]/IS
Capturing subpattern count = 0
Options: caseless
No first char
No need char
Subject length lower bound = 1
Starting chars: a b c d A B C D

The lowercase and uppercase letters switch places. In EBCDIC, the capital letters come after the small letters (e.g. A is 0xC1 while a is 0x81), while in ASCII the opposite is true. Would that cause the difference? Is that how it should be?

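The ordering question above can be checked with a small sketch. It assumes that the "Starting chars" list is printed in ascending code-value order (as it would be if it came from scanning a 256-entry bitmap), and it uses Python's `cp037` codec as a stand-in for the z/OS code page; both are assumptions, and the helper name `starting_chars` is purely illustrative.

```python
def starting_chars(chars, encoding):
    """Return chars sorted by their single-byte code in the given encoding."""
    return sorted(chars, key=lambda c: c.encode(encoding)[0])

letters = "abcdABCD"

# ASCII: 'A' is 0x41 and 'a' is 0x61, so capitals sort first.
ascii_order = starting_chars(letters, "ascii")

# EBCDIC (cp037, assumed representative): 'a' is 0x81 and 'A' is 0xC1,
# so the small letters sort first.
ebcdic_order = starting_chars(letters, "cp037")

print("ASCII :", " ".join(ascii_order))
print("EBCDIC:", " ".join(ebcdic_order))
```

If that assumption about the print order holds, the swapped list is simply the same set of eight characters shown in EBCDIC code order, and would not indicate a matching problem.
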
4. I am a bit concerned about whether we have \n defined correctly. In TESTOUT2, line 689, you have:

/(?<=foo\n)¬bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
No first char
Need char = 'r'
    foo\nbarbar
 0: bar

while I have:

/(?<=foo\n)¬bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
No first char
Need char = 'r'
    foo\nbarbar
No match

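One way such a "No match" could arise is sketched below; it is a possible explanation to test, not a diagnosis. EBCDIC has two newline-like codes, 0x15 (NL) and 0x25 (LF). The toy model assumes that the \n escape produces one of these byte values while the multiline ^ check looks for the build's configured newline byte. The names `build_subject` and `caret_matches` are hypothetical and do not correspond to PCRE functions.

```python
# Candidate EBCDIC newline codes; which one a given build uses is an assumption.
EBCDIC_NL = 0x15
EBCDIC_LF = 0x25

def build_subject(escape_n_value: int) -> bytes:
    """Encode 'foo', then the byte the newline escape produced, then 'bar'."""
    return "foo".encode("cp037") + bytes([escape_n_value]) + "bar".encode("cp037")

def caret_matches(subject: bytes, pos: int, newline: int) -> bool:
    """Toy multiline ^: true at the subject start or right after the newline byte."""
    return pos == 0 or subject[pos - 1] == newline

BAR_POS = 4  # offset of 'bar' in the subject

# The escape and the newline convention agree: ^ can match before 'bar'.
agree = caret_matches(build_subject(EBCDIC_NL), BAR_POS, EBCDIC_NL)

# The escape yields 0x25 but the matcher expects 0x15: ^ cannot match.
disagree = caret_matches(build_subject(EBCDIC_LF), BAR_POS, EBCDIC_NL)

# For reference, Python's cp037 codec maps "\n" to this byte value.
python_lf = "\n".encode("cp037")[0]

print(agree, disagree, hex(python_lf))
```

Comparing the value the \n escape compiles to with the newline value the library was built with would confirm or rule out this kind of mismatch.
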
Similarly, on TESTOUT2, line 1098, you have:

/word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)?)?)?)?)?)?)?)?)?otherword/I
Capturing subpattern count = 8
Contains explicit CR or LF match
No options
First char = 'w'
Need char = 'd'

while I have:

/word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)?)?)?)?)?)?)?)?)?otherword/I
Capturing subpattern count = 8
No options
First char = 'w'
Need char = 'd'