[exim-dev] Status report: documentation conversion

Author: Philip Hazel
Date:  
To: exim-dev
Subject: [exim-dev] Status report: documentation conversion
I have spent a lot more time on converting the filter document to a
"standard" format. This is a report of where I currently stand:

1. The base source is now an Asciidoc file. This uses more-or-less the
default Asciidoc markup, plus a number of additions of my own to make
it a bit richer. I have a custom Asciidoc configuration file that
is used to turn this file into DocBook XML. This seems reasonably
workable, though I have not yet had to deal with figures or indexes.
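
For anyone who wants to see the mechanics, the conversion is a single
asciidoc invocation; a minimal sketch driven from Python (the file
names here are invented, and my real configuration file is of course
more elaborate than a one-liner suggests):

  import subprocess

  # Asciidoc source -> DocBook XML, using the docbook backend and a
  # custom configuration file for the extra markup.
  subprocess.run(
      ["asciidoc", "--backend=docbook", "--conf-file=exim.conf",
       "--out-file=filter.xml", "filter.asciidoc"],
      check=True)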

2. Converting DocBook XML to HTML using xmlto seems to do a reasonable
job, though again, I haven't yet had to deal with figures or indexes.
I have created a private XSL style-sheet that modifies the standard
style to suit my preferences.
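
This step is also just one command; a rough sketch (invented file
names again, and single-file output purely for illustration), using
xmlto's -x option to point at the private stylesheet:

  import subprocess

  # DocBook XML -> HTML, using a private XSL customization layer
  # ("exim-html.xsl" is a placeholder name).
  subprocess.run(
      ["xmlto", "-x", "exim-html.xsl", "html-nochunks", "filter.xml"],
      check=True)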

3. The route for converting DocBook XML to plain text is to turn it into
HTML, and then use the w3m browser to make text (w3m does a much better
job than lynx, IMO). There will have to be some fudging here, because
w3m does not do character conversions and the base file now contains
non-ASCII characters like en-dashes, typographic quotes and apostrophes,
and the "fi" ligature. Also, it is probably a good idea to put quotes
round strings that, in HTML, are rendered in a monospaced font. So I
plan to write a script that pre-processes the HTML before passing it to
w3m.
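
I have not written that script yet, but its shape is clear enough. A
rough Python sketch of the kind of pre-processing I mean (the
character list and the use of <tt> as the monospace marker are
illustrative only, not a specification of the final script):

  import re
  import sys

  # Characters that w3m will not convert for us: map them to plain
  # ASCII equivalents before the HTML is rendered as text.
  REPLACEMENTS = {
      "\u2013": "-",     # en-dash
      "\u2018": "'",     # left single quote
      "\u2019": "'",     # right single quote / apostrophe
      "\u201c": '"',     # left double quote
      "\u201d": '"',     # right double quote
      "\ufb01": "fi",    # "fi" ligature
  }

  html = sys.stdin.read()
  for char, plain in REPLACEMENTS.items():
      html = html.replace(char, plain)

  # Put quotes round strings that the HTML renders in a monospaced
  # font, so that they still stand out in the plain text version.
  html = re.sub(r"<tt>(.*?)</tt>", r'"<tt>\1</tt>"', html,
                flags=re.DOTALL)

  sys.stdout.write(html)

The result would then be piped into something like "w3m -dump -T
text/html" to produce the final text file.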

4. There are problems in turning DocBook XML into PostScript or PDF. The
free technology for doing this appears to be very immature. I have used
xmlto to turn the file into a "formatting objects" (.fo) file, which is
then processed by the "fop" command. Again, I've used a private style
sheet to adjust the default design (reducing the size of fonts for
headings, reducing some of the spacing, and changing the way
cross-references are handled, for instance).

The "fop" command is currently at release 0.20 (at least on my Gentoo
system). Running it produces a lot of errors and "not implemented
yet" warnings (even for a short test file), though it does succeed in
producing output. It is rather slow (it seems to be written in Java).
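
For reference, the whole PostScript pipeline is only two commands; a
sketch (file names invented, and the fop argument style is the one I
believe the 0.20.x releases accept):

  import subprocess

  # DocBook XML -> XSL-FO, using the private FO stylesheet
  # ("exim-fo.xsl" is a placeholder name).
  subprocess.run(["xmlto", "-x", "exim-fo.xsl", "fo", "filter.xml"],
                 check=True)

  # XSL-FO -> PostScript via fop (slow, and noisy with warnings).
  subprocess.run(["fop", "-fo", "filter.fo", "-ps", "filter.ps"],
                 check=True)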

5. The output produced by xmlto->fo->fop->PostScript is lousy in a
number of ways:

  (a) Problem: The hyphenation logic is poor. It seems to hyphenate
      "unintelligently", that is, to hyphenate even when a line is
      pretty "tight" without it, and without considering the look of the
      whole paragraph. The result is that you often get several
      hyphenated lines in succession. Also, I suspect it is using
      algorithmic hyphenation, which IMO gets it wrong too often.


      Solution: I have turned hyphenation off. 


  (b) Problem: The pagination is also unintelligent. It is quite capable
      of putting a section heading as the last line on a page. It also
      generates "widow" and "orphan" lines frequently. The former (the
      last line of a paragraph - often a short line - appearing as the
      first line of a page) looks particularly awful.


      Solution: My custom configuration provides a means of forcing a
      page break, but this is a hack, and it is not nice from a
      maintenance point of view, because these manual breaks have to be
      reviewed for each edition.


      I do not know of a solution to the widows and orphans problem.


  (c) I tried to create tables without any rules (lines) as a means of
      displaying information in a non-monospaced font but in fixed
      positions on the page (this will be needed for the option
      definitions in the main specification). I failed to persuade fop
      not to draw the rules. Maybe I've just missed something here.


  (d) Having carefully set up Asciidoc to turn the letter sequence of
      "f" followed by "i" into the Unicode value for the "fi" ligature
      (because I care about these things), I found that this is not
      recognized by "fop". I believe the fault lies in fop rather than
      xmlto, because the ligature comes out OK in the HTML output. I
      imagine that fop has some incomplete font tables or something,
      because it manages to do the typographic quotes and the dashes OK.


      Solution: I could pre-process the XML to remove the "fi" ligatures
      before building PostScript and PDF (there is a sketch of such a
      step after this list). The output would be readable, but not as
      nice.
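
That pre-processing step would be trivial to write; a rough sketch
(the file names are placeholders):

  # Replace the "fi" ligature (U+FB01) with plain "f" + "i" in the
  # DocBook XML before building PostScript and PDF.
  with open("filter.xml", encoding="utf-8") as f:
      xml = f.read()

  with open("filter-print.xml", "w", encoding="utf-8") as f:
      f.write(xml.replace("\ufb01", "fi"))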


6. I am sorely tempted to write a script to turn the DocBook into a
format that I can process with SGCAL. This isn't as stupid as it
seems: it will keep me happy while I'm still around, and at the same
time the source will be held in a standardized form that can be
processed in other ways in the future.

But first: I will investigate the problem of the main spec file. This
is *much* bigger, and has many more features that need to be dealt
with. I will report back in due course.

7. Lost facilities: It seems pretty certain that we will have to lose
the "change bars" feature: DocBook doesn't seem to have anything
suitable. Also, the HTML will not be in the current frame format,
though perhaps it can be massaged. Whatever, it is unlikely that we
can maintain the distinction between the options index and the
concepts index, as there's only one index facility in DocBook.

8. Texinfo: I have not investigated this yet. I understand there are
HTML->info converters. Let's hope one of them works. :-)

-- 
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book:    http://www.uit.co.uk/exim-book