Re: [pcre-dev] PCRE on EBCDIC tests

Author: Ze'ev Atlas
Date:  
To: pcre-dev@exim.org
Subject: Re: [pcre-dev] PCRE on EBCDIC tests
Thank you, Philip.
I should have caught my mistake (wrongly and explicitly defining EBCDIC_NL25), but I got confused.
Once I removed the #define EBCDIC_NL25 from config.h, I got:
PCRE version 8.37 2015-04-28
Compiled with
  EBCDIC code support: LF is 0x15
  8-bit support
  No UTF-8 support
  No Unicode properties support
  No just-in-time compiler support
  Newline sequence is LF
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Parentheses nest limit = 250
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
and the tests worked as advertised:

/(?<=foo\n)¬bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
No first char
Need char = 'r'
    foo\nbarbar
 0: bar
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

/¬(?<=foo\n)bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
First char at start or follows newline
Need char = 'r'
    foo\nbarbar
 0: bar
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

Ze'ev Atlas


From: "ph10@???" <ph10@???>
To: Ze'ev Atlas <zatlas1@???>
Cc: "pcre-dev@???" <pcre-dev@???>
Sent: Friday, May 29, 2015 11:26 AM
Subject: Re: [pcre-dev] PCRE on EBCDIC tests


On Fri, 29 May 2015, Ze'ev Atlas wrote:

> I ran pcretest with -C option to get:
> PCRE version 8.37 2015-04-28
> Compiled with 

...
> LF is 0x25 


That should happen only if EBCDIC_NL25 is defined.

> Newline sequence is a non-standard value: 0x0015 


... which means it is not CR or LF or CRLF.

> Unicode has a NEL character 0x85, which I guess should be equivalent to
> EBCDIC's 0x25 when 0x15 is NL, as suggested in
> http://unicode.org/standard/reports/tr13/tr13-5.html


and in PCRE's README file.

> I am not sure why the user who contributed the original EBCDIC
> patch used the name CHAR_NL rather than CHAR_LF for the character
> represented by '\n', but I guess it was because it is usually the NL
> character. I think I will change that name, and introduce CHAR_NEL for
> the other character.


PCRE defines the names CHAR_NL, CHAR_NEL, CHAR_CR, and CHAR_LF. CHAR_NL
and CHAR_LF are synonyms for the same code point in both ASCII &
EBCDIC. See the pcre_internal.h file.
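
To make the relationship between these names concrete, here is a minimal sketch of how such definitions could be laid out. It is an illustration only and is not copied from pcre_internal.h: the EBCDIC values follow the two settings reported in this thread (LF is 0x15 by default, 0x25 when EBCDIC_NL25 is defined), the ASCII values are the usual 0x0a and 0x85, and the helper describe_newline_chars is a hypothetical name.

```c
#include <stdio.h>

/* Sketch only: mirrors the names discussed in this thread, not the
   actual contents of pcre_internal.h. */
#ifdef EBCDIC
#  ifdef EBCDIC_NL25
#    define CHAR_NL  0x25   /* \n is 0x25 when EBCDIC_NL25 is defined */
#    define CHAR_NEL 0x15
#  else
#    define CHAR_NL  0x15   /* default EBCDIC setting: \n is 0x15 */
#    define CHAR_NEL 0x25
#  endif
#else
#  define CHAR_NL  0x0a     /* ASCII line feed */
#  define CHAR_NEL 0x85     /* next line (NEL) */
#endif
#define CHAR_CR 0x0d        /* same value in ASCII and EBCDIC */
#define CHAR_LF CHAR_NL     /* synonym: both names give one code point */

/* Hypothetical helper: formats the code points selected above into buf
   and returns the snprintf result. */
int describe_newline_chars(char *buf, size_t size)
{
    return snprintf(buf, size, "LF is 0x%02x, NEL is 0x%02x, CR is 0x%02x",
                    (unsigned)CHAR_LF, (unsigned)CHAR_NEL, (unsigned)CHAR_CR);
}
```

Built with -DEBCDIC alone, the helper reports LF as 0x15; adding -DEBCDIC_NL25 swaps the LF and NEL values.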

> I wrote a little program that reads my test file into a chunk of memory,
> prints it and dumps the memory.  You will see below that the equivalent
> of \n is indeed 21=0x15.  My conclusion is that we need a way to tell
> PCRE that \n is not the LF character but the NL character, or
> otherwise to dictate what \n should be.


There are two issues here:

(1) What code point does \n represent?
(2) What code point (or pair of code points) represents newline?

In PCRE, the answer to (1) is the CHAR_LF character, which is either
0x15 or 0x25. The answer to (2) is determined by the value of the
NEWLINE macro. What is the setting of NEWLINE in your config.h file?
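
The two questions can be kept apart in code. The following is a rough sketch, not PCRE's implementation: the hypothetical function classify_newline takes the code point used for \n (question 1) and a NEWLINE setting (question 2) and reports what the setting amounts to. It assumes that CRLF is encoded as (CR << 8) | LF and that -1 and -2 stand for ANY and ANYCRLF, which is how I understand the config.h conventions; check these against your own config.h.

```c
/* Hypothetical names throughout; an illustration of the two questions,
   not code taken from PCRE. */
enum newline_kind {
    NL_CR, NL_LF, NL_CRLF, NL_ANY, NL_ANYCRLF, NL_NONSTANDARD
};

/* lf:      the code point that \n represents (0x0a, 0x15 or 0x25)
   newline: the value of the NEWLINE setting */
enum newline_kind classify_newline(int lf, int newline)
{
    const int cr = 0x0d;                 /* CR is 0x0d in ASCII and EBCDIC */

    if (newline == -1) return NL_ANY;    /* assumed special value */
    if (newline == -2) return NL_ANYCRLF;/* assumed special value */
    if (newline == cr) return NL_CR;
    if (newline == lf) return NL_LF;
    if (newline == ((cr << 8) | lf)) return NL_CRLF;
    return NL_NONSTANDARD;               /* matches none of the above */
}
```

With \n at 0x25 and NEWLINE set to 21 (0x15), classify_newline(0x25, 21) gives NL_NONSTANDARD, which corresponds to the "non-standard value: 0x0015" report quoted above. With \n at 0x15, the same setting gives NL_LF.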

Philip

--
Philip Hazel