On Wed, 17 Jun 2015, Ze'ev Atlas wrote:
> In developing the EBCDIC test suite, I am using this methodology:
> 1. I am interested only in testinput1, testinput2 and partially in
>    testinput11 and testinput14.
> 2. Compare pcretest results from the standard test suite to the z/OS
>    EBCDIC results (after converting to ASCII - this is all a pain in
>    the neck :)
> 3. Ignore all identical results, and those where the differences are
>    obvious (0x0a vs. 0x15, caret vs. logical not) and the EBCDIC
>    results, though different, match correctly.
> 4. For all remaining issues, I've developed a small test suite that
>    either demonstrates how to code the pattern so that it matches, or
>    works around issues such as a different file naming scheme [for
>    saved regexes], etc.
Seems very reasonable.
> Once done, I will provide an ASCII image of the EBCDIC suite (standard
> EBCDIC output, my input files, and standard output from those files).
> EBCDIC images will be available on my distro, unless you want them
> with your test suite.
No, I think it's probably better just to leave them with your distro.
> I would like to report that testinput1, testinput2 and their
> specialized EBCDIC derivatives work perfectly, except for three issues
> that are not clear to me: a. As far as I can tell, 0xa0 is supposed to
> be a non-breaking space! Am I correct? If so, is there an obvious
> EBCDIC equivalent?
0xa0 is indeed a non-breaking space in 8859 and Unicode. I don't know
what the EBCDIC equivalent is ... a quick Google suggests that it might
be 0x41:
https://en.wikipedia.org/?title=EBCDIC
> b. Could you please guide me to any documentation
> about classes of the form [!xxx!], so I can adjust my tests.
There is nothing special about [!xxx!]; it is just an ordinary character
class. PCRE recognizes [:xxx:] only *within* an outer class, as in
[[:alpha:]]. It gives an error for [.ch.] and [=ch=] (POSIX collating
elements and equivalence classes) because it does not support them. The
doc/pcre2pattern.3 document describes the supported [:xxx:] POSIX class
names.
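
To make the distinction concrete, here is a rough sketch in C of how an
item starting with '[' inside an outer class could be sorted into the
three cases above. This is not PCRE's actual code: the enum and function
names are invented for illustration, and unlike PCRE it does not check
that the name between the colons is a known class name.

```c
/* Hypothetical names, for illustration only; not taken from PCRE. */
typedef enum {
    ITEM_ORDINARY,     /* e.g. "[!xxx!]": just literal characters        */
    ITEM_POSIX_CLASS,  /* e.g. "[:alpha:]": recognized inside a class    */
    ITEM_UNSUPPORTED   /* e.g. "[.ch.]" or "[=ch=]": reported as error   */
} item_kind;

/* p points at a '[' that is already inside an outer [...] class. */
item_kind classify_inner_bracket(const char *p)
{
    char delim;
    const char *q;

    if (p[0] != '[')
        return ITEM_ORDINARY;

    delim = p[1];
    if (delim != ':' && delim != '.' && delim != '=')
        return ITEM_ORDINARY;   /* '!' and everything else is literal */

    /* Look for the matching terminator, such as ":]". */
    for (q = p + 2; *q != '\0'; q++) {
        if (q[0] == delim && q[1] == ']')
            return delim == ':' ? ITEM_POSIX_CLASS : ITEM_UNSUPPORTED;
        if (*q == ']')
            break;              /* the class ended first */
    }
    return ITEM_ORDINARY;
}
```

With this sketch, "[:alpha:]" is classified as a POSIX class, "[.ch.]"
and "[=ch=]" as unsupported, and "[!xxx!]" as ordinary characters.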
> c. What is 0x85 used for?
0x85 is NEL (next line) in Unicode:
http://www.unicode.org/charts/PDF/U0080.pdf
On this page
http://unicode.org/standard/reports/tr13/tr13-5.html
it says "There are two mappings of LF and NEL used by EBCDIC systems.
The first EBCDIC column shows the MVS Open Edition (including CP1047)
mapping of these characters, while the second column shows the CDRA
mapping. This difference arises from the use of LF character as 'New
Line' in ASCII-based Unix environments and in some data transfer
protocols that use the Unix assumptions. The second column is based on
the standardized definitions — both in ASCII and EBCDIC of LF."
The two values are 0x15 and 0x25. One is LF and the other is NEL. PCRE
defaults to LF = 0x15 but puts them the other way round if EBCDIC_NL25
is defined in config.h.
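
The effect of that switch can be pictured roughly as follows. This is
only a sketch of the idea, not a copy of the PCRE source: EBCDIC_NL25 is
the real configuration macro, but the SKETCH_* macros and the two helper
functions are names made up for this example.

```c
/* Define EBCDIC_NL25 (for example with -DEBCDIC_NL25) to swap the values. */
#ifdef EBCDIC_NL25
#define SKETCH_LF   0x25   /* LF and NEL the other way round */
#define SKETCH_NEL  0x15
#else
#define SKETCH_LF   0x15   /* default: LF = 0x15 */
#define SKETCH_NEL  0x25
#endif

/* Hypothetical helpers: return nonzero if c is the configured LF or NEL. */
int sketch_is_lf(unsigned char c)
{
    return c == SKETCH_LF;
}

int sketch_is_nel(unsigned char c)
{
    return c == SKETCH_NEL;
}
```

Compiled without EBCDIC_NL25, sketch_is_lf(0x15) and sketch_is_nel(0x25)
are both true; with it defined, the two roles are exchanged.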
Philip
--
Philip Hazel