I have a system on the build farm (CentOS 5.9) which is failing the
doc build. It fails because of a global environment variable,
PERL_UNICODE, that we set automatically on a class of machines, one of
which is running the farm client. Because the variable is global, it
takes effect every time a perl interpreter starts up, and it changes
how that process sets up some or all of its I/O handles, so that they
expect a particular encoding for the data passing through them.

This causes a problem during the part of the doc build which makes
spec.txt. Normally the "w3m -dump" command generates some non-ASCII
characters (boxes/borders), and its output is piped to the ./Tidytxt
perl script, which converts those characters to plain ASCII. The
problem is that if the PERL_UNICODE environment variable is set, in
any way, shape, or form (even empty!), the data coming in on STDIN is
not treated byte-by-byte, but as Unicode characters. The s/// command
then doesn't match per byte, but per Unicode character.
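
To make the byte-versus-character difference concrete, here is a
minimal Perl sketch. The tidy_corner sub is a hypothetical stand-in for
one Tidytxt substitution (the real script's patterns may differ); it
assumes the pattern is written against the three UTF-8 bytes of U+250C,
one of the box-drawing characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one Tidytxt substitution: match the three
# UTF-8 bytes of U+250C (a box-drawing corner) and replace them with "+".
sub tidy_corner {
    my ($text) = @_;
    $text =~ s/\xe2\x94\x8c/+/g;
    return $text;
}

# What a byte-oriented STDIN delivers: three separate one-byte characters.
my $as_bytes = "\xe2\x94\x8c";

# What a UTF-8 decoding STDIN delivers: a single character, U+250C.
my $as_chars = "\x{250c}";

# prints "bytes: length=3, result=+"
printf "bytes: length=%d, result=%s\n",
    length($as_bytes), tidy_corner($as_bytes);

# prints "chars: length=1, matched=no"
printf "chars: length=%d, matched=%s\n",
    length($as_chars), (tidy_corner($as_chars) eq $as_chars ? "no" : "yes");
```

The same pattern succeeds on the byte string and silently does nothing
on the decoded string, which is the symptom described above.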

In this simple test, I copied and pasted one character from the w3m
-dump output into a text file. You can see from the hexdump that it is
a 3-byte UTF-8 character followed by a newline (hexdump prints
byte-swapped 16-bit words, so the bytes in the file are e2 94 8c 0a):
[farm@ivwm01 doc-docbook]$ cat test1.txt
┌
[farm@ivwm01 doc-docbook]$ cat test1.txt | hexdump
0000000 94e2 0a8c
0000004
[farm@ivwm01 doc-docbook]$ cat test1.txt | ./Tidytxt.orig
┌
[farm@ivwm01 doc-docbook]$ cat test1.txt | ./Tidytxt
+

Here Tidytxt.orig is the unmodified script and Tidytxt is the fixed
one. The fix was to add one line near the beginning of the Tidytxt
script:
--- ../../../exim/doc/doc-docbook/Tidytxt 2013-10-29 15:54:20.000000000 +0000
+++ ./Tidytxt 2014-01-13 16:38:23.000000000 +0000
@@ -11,6 +11,7 @@
# (2) It uses U+25CF as its bullet character.
# (3) It inserts a whole slew of "box drawing" characters round the heading.
+binmode(STDIN, ":encoding(iso-8859-1)");
@lines = <>;
$lastwasblank = 0;
1) Anybody who can assure me that this won't break on old perl
versions? (I'm on 5.8 on this machine).
2) Anybody who can assure me that this won't break on new perl versions?
3) Anybody think of a better way to do this? It doesn't really hurt
the build process; it's just that in a couple of corner cases the
spec.txt file could end up with some non-ASCII characters in it.
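
For question 3, one candidate worth testing is binmode(STDIN, ":raw"),
which perlfunc documents as marking the handle as bytes rather than
Unicode characters. The sketch below is an illustration only, not a
tested change to Tidytxt: it uses an in-memory handle opened with a
:utf8 layer as a stand-in for STDIN under PERL_UNICODE, and both
read_first_line and tidy_corner are hypothetical helpers.

```perl
use strict;
use warnings;

# Read one line from an in-memory handle that starts out with a :utf8
# layer (standing in for STDIN under PERL_UNICODE). If a layer name is
# given, apply it with binmode before reading.
sub read_first_line {
    my ($layer) = @_;
    my $octets = "\xe2\x94\x8c\n";    # UTF-8 bytes of U+250C plus newline
    open(my $in, '<:utf8', \$octets)
        or die "cannot open in-memory handle: $!";
    binmode($in, $layer) if defined $layer;
    my $line = <$in>;
    close($in);
    return $line;
}

# Hypothetical byte-oriented substitution, as a Tidytxt-style filter
# might contain.
sub tidy_corner {
    my ($text) = @_;
    $text =~ s/\xe2\x94\x8c/+/g;
    return $text;
}

# With :raw applied, the line arrives as bytes and the pattern can match.
print "raw input tidied: ", tidy_corner(read_first_line(':raw'));
```

Whether :raw or :encoding(iso-8859-1) is the better choice for the real
script would still need checking on the perl versions in use.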
This is not the machine that I use to build the official releases, but
it could have happened to anybody with just the right combination of
environment/settings.
...Todd