Re: [pcre-dev] PCRE on EBCDIC tests

Author: Ze'ev Atlas
Date:
To: pcre-dev@exim.org
Subject: Re: [pcre-dev] PCRE on EBCDIC tests

Thank you, Philip. I should have caught my mistake (wrongly and explicitly defining EBCDIC_NL25), but I got confused.
Once I removed the #define EBCDIC_NL25 from config.h, I got:
PCRE version 8.37 2015-04-28
Compiled with
  EBCDIC code support: LF is 0x15
  8-bit support
  No UTF-8 support
  No Unicode properties support
  No just-in-time compiler support
  Newline sequence is LF
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Parentheses nest limit = 250
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
and the tests worked as advertised:

/(?<=foo\n)¬bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
No first char
Need char = 'r'
    foo\nbarbar
 0: bar
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

/¬(?<=foo\n)bar/Im
Capturing subpattern count = 0
Max lookbehind = 4
Contains explicit CR or LF match
Options: multiline
First char at start or follows newline
Need char = 'r'
    foo\nbarbar
 0: bar
    ***Failers
No match
    rhubarb
No match
    barbell
No match
    abc\nbarton
No match

Ze'ev Atlas

From: "ph10@???" <ph10@???>
To: Ze'ev Atlas <zatlas1@???>
Cc: "pcre-dev@???" <pcre-dev@???>
Sent: Friday, May 29, 2015 11:26 AM
Subject: Re: [pcre-dev] PCRE on EBCDIC tests

On Fri, 29 May 2015, Ze'ev Atlas wrote:

> I ran pcretest with -C option to get:
> PCRE version 8.37 2015-04-28
> Compiled with
...
> LF is 0x25

That should happen only if EBCDIC_NL25 is defined.

> Newline sequence is a non-standard value: 0x0015

... which means it is not CR or LF or CRLF.

> Unicode has a NEL character 0x85, which I guess should be equivalent to
> EBCDIC's 0x25 when 0x15 is NL, as suggested in
> http://unicode.org/standard/reports/tr13/tr13-5.html

and in PCRE's README file.

> I am not sure why the user who contributed the original EBCDIC
> patch used the name CHAR_NL rather than CHAR_LF for the character
> represented by '\n', but I guess it was because it is usually the NL
> character. I think I will change that name, and introduce CHAR_NEL for
> the other character.

PCRE defines the names CHAR_NL, CHAR_NEL, CHAR_CR, and CHAR_LF. CHAR_NL
and CHAR_LF are synonyms for the same code point in both ASCII &
EBCDIC. See the pcre_internal.h file.
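
As a rough illustration of the mapping described above, the following C sketch restates it as two small functions. The names ebcdic_lf and ebcdic_nel are hypothetical helpers, not part of PCRE. The sketch assumes that in an EBCDIC build LF (alias NL) is 0x15 unless EBCDIC_NL25 is defined, in which case it is 0x25, and that NEL takes whichever of the two code points LF does not use. The authoritative definitions are the macros in pcre_internal.h.

```c
/* Hypothetical helpers, not PCRE's API: they restate the EBCDIC code point
   choices discussed in this thread so that they can be checked at run time. */

/* Code point that '\n' (CHAR_LF, synonym CHAR_NL) stands for in an EBCDIC
   build. nl25_defined is non-zero when EBCDIC_NL25 would be defined. */
int ebcdic_lf(int nl25_defined)
{
    return nl25_defined ? 0x25 : 0x15;
}

/* Code point assumed for NEL: whichever of 0x15/0x25 LF did not take. */
int ebcdic_nel(int nl25_defined)
{
    return nl25_defined ? 0x15 : 0x25;
}
```

With EBCDIC_NL25 left undefined, ebcdic_lf(0) gives 0x15, which agrees with the "LF is 0x15" line in the pcretest -C output.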

> I wrote a little program that reads my test file to a chunk of memory,
> prints it and dumps the memory. You will see below that the equivalent
> of \n is indeed 21=0x15. My conclusion is that we need a way to tell
> PCRE that \n is not the LF character, but the NL character or
> otherwise, dictate what \n should be.

There are two issues here:

(1) What code point does \n represent?
(2) What code point (or pair of code points) represents newline?

In PCRE, the answer to (1) is the CHAR_LF character, which is either
0x15 or 0x25. The answer to (2) is determined by the value of the
NEWLINE macro. What is the setting of NEWLINE in your config.h file?
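
To make the interaction between the two issues concrete, here is a minimal C sketch of how a NEWLINE setting might be classified against the LF code point in use. The function newline_name and the constant SKETCH_CR are hypothetical and written for this illustration; they are not taken from PCRE's source. The sketch assumes that CR is 0x0D in both ASCII and EBCDIC, that CRLF is encoded as (CR << 8) | LF, and that -1 and -2 stand for ANY and ANYCRLF.

```c
#include <stdio.h>
#include <stddef.h>

#define SKETCH_CR 0x0D  /* assumption: CR is 0x0D in both ASCII and EBCDIC */

/* Hypothetical helper: describe a NEWLINE setting relative to the code
   point currently used for LF. buf is only written for unknown values. */
const char *newline_name(int newline, int lf, char *buf, size_t buflen)
{
    if (newline == SKETCH_CR) return "CR";
    if (newline == lf) return "LF";
    if (newline == ((SKETCH_CR << 8) | lf)) return "CRLF";
    if (newline == -1) return "ANY";
    if (newline == -2) return "ANYCRLF";
    /* Anything else is reported the way the -C output above reports it. */
    snprintf(buf, buflen, "a non-standard value: 0x%04x", (unsigned)newline);
    return buf;
}
```

Under these assumptions, a NEWLINE of 0x15 combined with LF at 0x25 (EBCDIC_NL25 defined) is reported as a non-standard value, while the same NEWLINE with LF at 0x15 is reported as LF. That matches the before and after behaviour described in this thread.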

Philip

--
Philip Hazel
