Re: [pcre-dev] issues with EBCDIC and pcretest

Author: ph10
Date:
To: Ze'ev Atlas
CC: Pcre Exim

On Wed, 10 Jun 2015, Ze'ev Atlas wrote:

> Hi I am doing now testing for the EBCDIC package. I ignored all obvious differences such as ¬ vs ^ or \x0a vs. \x15 (new line) and I am concentrating on the less obvious ones. There are not too many of those (about 216 in 4 relevant files and I would probably dismiss many of them as irrelevant differences anyway.)
> The issues that I see now may be marginal, but some of them are important. I do not know whether the problem is in PCRE or in pcretest
> Failure #1 (derived from testinput1)/abcd\t\n\r\f\a\e/ abcd\t\n\r\f\a\e9;\$\\?caxyzNo match
> so I skipped the offending \e, just to find that I have a problem in both \a and in \e /abcd\t\n\r\f\a.\371/ abcd\t\n\r\f\a\e9;\$\\?caxyz 0: abcd\x05\x15\x0d\x0c\x07\x1b9
> \a should not be translated into \x07 and \x07 should not be matched. It is \x2f in EBCDIC\x1b is indeed the ESC in EBCDIC and it should have been recognized
> .Failure #2 In testinput1, on lines 159 and 160 there is a character that I cannot really recognize on my ASCII editor and it does not translate well into EBCDIC. Could you please let me know what is it. Is it a UTF-8 character that made it here by mistake?
> Failure #3Please consider:/¬\ca\cA\c[/
> \x01\x01\e;zNo match
> /¬\ca\cA.;/ \x01\x01\e;z 0: \x01\x01\x1b;
> Again \c[ should be ESC and thus should be \x1b according to perlebcdic document. BTW, \c: is not defined in that document, so I ignored test that involve it. Ideally, PCRE should have produced an error message that recognize this as an undefined escape sequence, but I understand why it does not produce such a message, for such marginal usage.
> Failure #4Please consider:/¬[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ a+ Z0+\xf8\n\x1d\x12No match
> /¬[\w][\W][\s][\S][\d][\D]/ a+ Z0+\xf8\n\x1d\x12 0: a+ Z0+
> /¬[\w][\W][\s][\S][\d][\D][\b]/ a+ Z0+\xf8\n\x1d\x12No match
> Why didn't ]b match?
> Failure #5This is a clear pcretest issue/¬[w-C_¬]+$/ wxy_¬ABC 0: wxy_¬ABC *** FailersNo match WXYNo match
> It does not recognize the '*** Failers' string and tries to match it. It is no more then some annoyance, but it may point to a real issue.

Your emails are very difficult to answer because somehow they are losing
newlines by the time they get to me. I quote above exactly what I am
seeing. The output from pcretest is all joined up like this:

> Failure #1 (derived from testinput1)/abcd\t\n\r\f\a\e/ abcd\t\n\r\f\a\e9;\$\\?caxyzNo match

I presume that should really be:

> Failure #1 (derived from testinput1)
> /abcd\t\n\r\f\a\e/
> abcd\t\n\r\f\a\e9;\$\\?caxyz
> No match

The actual test that I have in test 1 is this:

/abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/
    abcd\t\n\r\f\a\e9;\$\\?caxyz
 0: abcd\x09\x0a\x0d\x0c\x07\x1b9;$\?caxyz

However, running your test on my Linux box does give a match. I suspect
it might be the \e that is causing the issue. Ah yes. I see the problem.
In pcretest, which processes escapes in subject lines, \e is being
converted to ASCII escape (0x1b) instead of EBCDIC escape (0x27).
Likewise \a goes to ASCII BEL (0x07) instead of EBCDIC BEL (0x2f). Other
escapes such as \n go to the compiler's value for '\n', so they should
work in either character code.
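
The difference can be written down as a small lookup. This is only an illustrative sketch, not pcretest's code: the table names and the function `subject_escape` are hypothetical, and the byte values are the ones quoted in this message.

```python
# Illustrative sketch only; names are hypothetical, not taken from pcretest.
# Byte values are the ones quoted in this message.
ASCII_ESCAPES = {"e": 0x1B, "a": 0x07}   # ESC, BEL in ASCII
EBCDIC_ESCAPES = {"e": 0x27, "a": 0x2F}  # ESC, BEL in EBCDIC


def subject_escape(letter: str, ebcdic: bool) -> int:
    """Return the byte that \\e or \\a in a subject line should become."""
    table = EBCDIC_ESCAPES if ebcdic else ASCII_ESCAPES
    return table[letter]


if __name__ == "__main__":
    for letter in ("e", "a"):
        print(f"\\{letter}: ASCII 0x{subject_escape(letter, False):02x}, "
              f"EBCDIC 0x{subject_escape(letter, True):02x}")
```

The bug described here amounts to always using the first table, even on an EBCDIC build.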

I will check pcretest to see if there are any other such errors. I see
also that there is a similar bug in pcre_compile.c for \a, but not for
\e. When I have committed the changes I will email you so you can pick
up the revised code.

> .Failure #2
> In testinput1, on lines 159 and 160 there is a character that I cannot
> really recognize on my ASCII editor and it does not translate well
> into EBCDIC. Could you please let me know what is it. Is it a UTF-8
> character that made it here by mistake?

It is not UTF. It is a non-ASCII byte with code value 0x81. The test
checks that a backslash works in front of such a character. The test
that immediately follows is similar. On my system the output looks like
this:

/^\¤/                                          
    ¤                  
 0: \x81

/^ÿ/                            
    ÿ                                     
 0: \xff

> Failure #3
> Please consider:
> /¬\ca\cA\c[/
> \x01\x01\e;z
> No match

> /¬\ca\cA.;/
> \x01\x01\e;z
> 0: \x01\x01\x1b;

> Again \c[ should be ESC and thus should be \x1b according to
> perlebcdic document.

Probably the same \e bug in pcretest. \c[ will produce \x1b in an ASCII
world, but (as documented in pcrepattern.3) \c works differently in an
EBCDIC environment. I would expect \c[ to become 0x6D if [ is 0xAD (as I
have it listed). You can check this kind of thing easily by running
pcretest with the -b option. This is on an ASCII system:

$ ./pcretest -b
PCRE version 8.38-RC1 2015-05-03

re> /\c[/

------------------------------------------------------------------
  0   5 Bra
  3     \x1b
  5   5 Ket
  8     End
------------------------------------------------------------------

data>
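
Both figures (0x1b on ASCII, 0x6D for an EBCDIC [ at 0xAD) can be reproduced with a little arithmetic. This is a sketch that assumes the rule is: convert a lower case letter to upper case, then invert bit 0x40 on an ASCII build, or flip the 0xc0 bits on an EBCDIC build. The function names are hypothetical, and the EBCDIC function takes the (already upper-cased) byte value directly, because that value depends on the code page in use.

```python
def ctrl_ascii(ch: str) -> int:
    """\\cX on an ASCII build: upper-case X, then invert bit 0x40."""
    return ord(ch.upper()) ^ 0x40


def ctrl_ebcdic(byte_value: int) -> int:
    """\\cX on an EBCDIC build, assuming the 0xc0 bits are flipped."""
    return byte_value ^ 0xC0


print(hex(ctrl_ascii("[")))    # 0x1b, as in the -b output above
print(hex(ctrl_ebcdic(0xAD)))  # 0x6d, if [ is 0xAD
print(hex(ctrl_ebcdic(0xC1)))  # 0x1, taking EBCDIC 'A' as 0xC1
```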

> BTW, \c: is not defined in that document, so I ignored test that
> involve it. Ideally, PCRE should have produced an error message that
> recognize this as an undefined escape sequence, but I understand why
> it does not produce such a message, for such marginal usage.

I think \c is now historical and of very limited use. It is as easy to
write \x01 as it is to write \cA, for example. However, it remains in
PCRE for Perl compatibility.

> Failure #4 Please consider:
> /¬[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/
> a+ Z0+\xf8\n\x1d\x12
> No match

> /¬[\w][\W][\s][\S][\d][\D]/
> a+ Z0+\xf8\n\x1d\x12
> 0: a+ Z0+

> /¬[\w][\W][\s][\S][\d][\D][\b]/
> a+ Z0+\xf8\n\x1d\x12
> No match

> Why didn't ]b match?

[\b] matches a backspace character. In ASCII this is 0x08 and the ASCII
output for this test looks like this:

/^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ 
    a+ Z0+\x08\n\x1d\x12
 0: a+ Z0+\x08\x0a\x1d\x12

Note that the data line has \x08 in it; yours has \xf8. EBCDIC backspace
is 0x16. Try changing \xf8 to \x16.
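
As a rough cross-check on an ASCII system, Python's re module (used here only as a stand-in, it is not PCRE) also treats \b inside a character class as backspace rather than a word boundary, so the effect of the wrong data byte is easy to see:

```python
import re

# Inside a character class, \b means backspace (0x08 in ASCII),
# not a word boundary.
backspace_class = re.compile(r"[\b]")

print(bool(backspace_class.fullmatch("\x08")))  # True
print(bool(backspace_class.fullmatch("\xf8")))  # False
```

On an EBCDIC build the matching data byte would be 0x16 instead of 0x08.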

> Failure #5
> This is a clear pcretest issue
> /¬[w-C_¬]+$/
> wxy_¬ABC
> 0: wxy_¬ABC
> *** Failers
> No match
> WXY
> No match

> It does not recognize the '*** Failers' string and tries to match it. It is no more then some annoyance, but it may point to a real issue.

The "*** Failers" string is not special. I put in some tests to make it
easy to see which data is supposed to match and which is not. Some
patterns match it; some do not. In this case, it doesn't match. Your
test, however, looks different to the one in testdata1:

/^[W-c]+$/
    WXY_^abc
 0: WXY_^abc
    *** Failers
No match
    wxy
No match

The case of the letters has been inverted somehow.
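
For comparison, the ASCII form of the test can be replayed with Python's re module (again only a stand-in for PCRE). In ASCII the range W-c runs from 0x57 to 0x63, which takes in the upper case letters W to Z, the punctuation between the two alphabets, and a to c:

```python
import re

# ASCII stand-in for the test above: W (0x57) through c (0x63).
pattern = re.compile(r"^[W-c]+$")

for subject in ("WXY_^abc", "*** Failers", "wxy"):
    m = pattern.search(subject)
    print(subject, "->", m.group(0) if m else "No match")
```

The "*** Failers" line is just another subject string; it fails here because * is outside the range.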

Philip

--
Philip Hazel
