Tree - source-git/docbook2X - CentOS Git server

source-git / docbook2X

Blame doc/design-notes.xml

Packit	e4b6da
Packit	e4b6da	`<sect1 id="design-notes">`
Packit	e4b6da	`<sect1info>`
Packit	e4b6da	`<abstract role="texinfo-node">`
Packit	e4b6da	`<para>Author’s notes on the grand scheme of docbook2X</para>`
Packit	e4b6da	`</abstract>`
Packit	e4b6da	`</sect1info>`
Packit	e4b6da	`<title>Design notes</title>`
Packit	e4b6da
Packit	e4b6da	`<indexterm><primary>design</primary></indexterm>`
Packit	e4b6da	`<indexterm><primary>history</primary></indexterm>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Lessons learned:`
Packit	e4b6da
Packit	e4b6da	`<itemizedlist>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<indexterm><primary>stream processing</primary></indexterm>`
Packit	e4b6da	`<indexterm><primary>tree processing</primary></indexterm>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Think four times before doing stream-based XML processing, even though it`
Packit	e4b6da	`appears to be more efficient than tree-based.`
Packit	e4b6da	`Stream-based processing is usually more difficult.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`But if you have to do stream-based processing, make sure to use robust,`
Packit	e4b6da	`fairly scalable tools like <classname>XML::Templates</classname>,`
Packit	e4b6da	`<emphasis>not</emphasis> <command>sgmlspl</command>. Of course it cannot`
Packit	e4b6da	`be as pleasant as tree-based XML processing, but examine`
Packit	e4b6da	`&db2x_manxml; and &db2x_texixml;.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Do not use <classname>XML::DOM</classname> directly for stylesheets.`
Packit	e4b6da	`Your “stylesheet” would become seriously unmanageable.`
Packit	e4b6da	`It is also extremely slow for anything but trivial documents.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`At least take a look at some of the XPath modules out there.`
Packit	e4b6da	`Better yet, see if your solution really cannot use XSLT.`
Packit	e4b6da	`A C/C++-based implementation of XSLT can be fast enough`
Packit	e4b6da	`for many tasks.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<indexterm><primary>XSLT extensions</primary></indexterm>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Avoid XSLT extensions whenever possible. I don't think there is`
Packit	e4b6da	`anything wrong with them intrinsically, but it is a headache`
Packit	e4b6da	`to have to compile your own XSLT processor. (libxslt is written`
Packit	e4b6da	`in C, and the extensions must be compiled-in and cannot be loaded`
Packit	e4b6da	`dynamically at runtime.) Not to mention that there seem to be a thousand`
Packit	e4b6da	`different set-ups for different XSLT processors.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<indexterm><primary>Perl</primary></indexterm>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Perl is not as good at XML as it’s hyped to be.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`SAX comes from the Java world, and its port to Perl`
Packit	e4b6da	`(with all the object-orientedness, and without adopting Perl idioms)`
Packit	e4b6da	`is awkward to use.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Another problem is that Perl SAX does not seem to be well-maintained.`
Packit	e4b6da	`The implementations have various bugs; while they can be worked around,`
Packit	e4b6da	`they have been around for such a long time that it does not inspire`
Packit	e4b6da	`confidence that the Perl XML modules are reliable software.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`It also seems that no one else has seriously used Perl SAX`
Packit	e4b6da	`for robust applications. It seems to be unnecessarily hard to do`
Packit	e4b6da	`certain tasks, such as displaying error diagnostics on the`
Packit	e4b6da	`input or processing large documents with complicated structure.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<indexterm><primary>Man-XML</primary></indexterm>`
Packit	e4b6da	`<indexterm><primary>Texi-XML</primary></indexterm>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Do not be afraid to use XML intermediate formats`
Packit	e4b6da	`(e.g. Man-XML and Texi-XML), processed by a program written in a`
Packit	e4b6da	`scripting language, for converting to other markup languages.`
Packit	e4b6da	`The syntax rules of those target languages are made for`
Packit	e4b6da	`authoring by hand, not machine generation; hence a conversion`
Packit	e4b6da	`using tools designed for XML-to-XML conversion`
Packit	e4b6da	`requires jumping through hoops.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`You might think that we could, instead, make a separate module`
Packit	e4b6da	`that abstracts all this complexity`
Packit	e4b6da	`from the rest of the conversion program. For example,`
Packit	e4b6da	`there is nothing stopping an XSLT processor from serializing`
Packit	e4b6da	`the output document as a text document obeying the syntax`
Packit	e4b6da	`rules for man pages or Texinfo documents.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Theoretically you would get the same result,`
Packit	e4b6da	`but it is much harder to implement. It is far easier to write plain`
Packit	e4b6da	`text manipulation code in a scripting language than in Java or C or XSLT.`
Packit	e4b6da	`Also, if the intermediate format is hidden in a Java class or`
Packit	e4b6da	`C API, output errors are harder to see.`
Packit	e4b6da	`With the intermediate-format approach, by contrast, we can`
Packit	e4b6da	`visually examine the textual output of the XSLT processor and fix`
Packit	e4b6da	`the Perl script as we go along.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Some XSLT processors support scripting to go beyond XSLT`
Packit	e4b6da	`functionality, but they are usually not portable, and not`
Packit	e4b6da	`always easy to use.`
Packit	e4b6da	`Therefore, opt to do two-pass processing, with a standalone`
Packit	e4b6da	`script as the second stage. (The first stage using XSLT.)`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Finally, another advantage of using intermediate XML formats`
Packit	e4b6da	`processed by a Perl script is that we can often eliminate the`
Packit	e4b6da	`use of XSLT extensions. In particular, all the way back when XSLT`
Packit	e4b6da	`stylesheets first went into docbook2X, the extensions related to`
Packit	e4b6da	`Texinfo node handling could have been easily moved to the Perl script,`
Packit	e4b6da	`but I didn't realize it! I feel stupid now.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`If I had known this in the very beginning, it would have saved`
Packit	e4b6da	`a lot of development time, and docbook2X would be much more`
Packit	e4b6da	`advanced by now.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Note that even the man-pages stylesheet from the DocBook XSL`
Packit	e4b6da	`distribution essentially does two-pass processing`
Packit	e4b6da	`just the same as the docbook2X solution. That stylesheet`
Packit	e4b6da	`had formerly used one-pass processing, and its authors`
Packit	e4b6da	`probably finally realized what a mess that was.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Design the XML intermediate format to be easy to use from the standpoint`
Packit	e4b6da	`of the conversion tool, and similar to how XML document types work in`
Packit	e4b6da	`general. For example, mark up the paragraphs of a document as elements,`
Packit	e4b6da	`rather than marking the paragraph <emphasis>breaks</emphasis>`
Packit	e4b6da	`(the latter is typical of traditional markup languages, but not of XML).`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`I am quite impressed by some of the things that people make XSLT 1.0 do:`
Packit	e4b6da	`things that I thought were impossible, or at least unworkable`
Packit	e4b6da	`without using a “real” scripting language.`
Packit	e4b6da	`(&db2x_manxml; and &db2x_texixml; fall in the`
Packit	e4b6da	`category of things that can be done in XSLT 1.0 but inelegantly.)`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Internationalize as soon as possible.`
Packit	e4b6da	`That is much easier than adding it in later.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`The same advice applies to the build system.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`I would suggest against using build systems based`
Packit	e4b6da	`on Makefiles or any form of automake.`
Packit	e4b6da	`Of course it is inertia that prevents people from`
Packit	e4b6da	`switching to better build systems. But also`
Packit	e4b6da	`consider that while Makefile-based build systems`
Packit	e4b6da	`can do many of the things newer build systems are capable`
Packit	e4b6da	`of, they often require too many fragile hacks. Developing`
Packit	e4b6da	`these hacks takes time that would be better`
Packit	e4b6da	`spent developing the program itself.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Alas, better build systems such as scons were not available`
Packit	e4b6da	`when docbook2X was at an earlier stage. It’s too late`
Packit	e4b6da	`to switch now.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Writing good documentation takes skill. This manual`
Packit	e4b6da	`has been revised substantially at least four times`
Packit	e4b6da	`<footnote><para>`
Packit	e4b6da	`This number is probably inflated because of the many design`
Packit	e4b6da	`mistakes made in the process.</para></footnote>, with the author`
Packit	e4b6da	`consciously trying to condense information each time.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`Table processing in the pure-XSLT man-pages conversion`
Packit	e4b6da	`is convoluted — it goes through HTML(!) tables as an intermediary.`
Packit	e4b6da	`That is the same way that the DocBook XSL stylesheets implement`
Packit	e4b6da	`it (due to Michael Smith), and I copied the code there`
Packit	e4b6da	`almost verbatim. I did it this way to save myself time and energy`
Packit	e4b6da	`re-implementing table conversion <emphasis>again</emphasis>.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`And Michael Smith says that going through HTML is better,`
Packit	e4b6da	`because some varieties of DocBook allow the HTML table model`
Packit	e4b6da	`in addition to the CALS table model. (I am not convinced`
Packit	e4b6da	`that this is such a good idea, but anyway.)`
Packit	e4b6da	`Then HTML tables in DocBook can be translated to man pages`
Packit	e4b6da	`too without much more effort.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Is this inefficient? Probably. But that’s what you get`
Packit	e4b6da	`if you insist on using pure XSLT. The Perl implementation`
Packit	e4b6da	`of docbook2X`
Packit	e4b6da	`had already supported table conversion for two years by then.`
Packit	e4b6da	`</para>`
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da	`<listitem>`
Packit	e4b6da	`<para>`
Packit	e4b6da	`The design of &utf8trans; is not the best.`
Packit	e4b6da	`It was chosen to simplify implementations while being efficient.`
Packit	e4b6da	`A more general design, while still retaining efficiency, is possible,`
Packit	e4b6da	`which I describe below. Unfortunately,`
Packit	e4b6da	`at this point changing &utf8trans;`
Packit	e4b6da	`would be too disruptive to users, with little gain in functionality.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Instead of working with characters, we should work with byte strings.`
Packit	e4b6da	`This means that, if all input and output is in UTF-8,`
Packit	e4b6da	`with no escape sequences, then UTF-8 decoding or encoding`
Packit	e4b6da	`is not necessary at all. Indeed the program becomes agnostic`
Packit	e4b6da	`to the character set used. As a further benefit,`
Packit	e4b6da	`multi-character matches become possible.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`The translation map will be an unordered list of key-value pairs.`
Packit	e4b6da	`The key and value are both arbitrary-length byte strings,`
Packit	e4b6da	`with an explicit length attached (so null bytes in the input`
Packit	e4b6da	`and output are retained).`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`The program would take the translation map, and transform the input file`
Packit	e4b6da	`by matching the start of input, seen as a sequence of bytes,`
Packit	e4b6da	`against the keys in the translation map, greedily.`
Packit	e4b6da	`(Since the matching is greedy, the translation keys do not`
Packit	e4b6da	`need to be restricted to be prefix-free.)`
Packit	e4b6da	`Once the longest (in byte length) matching key is found,`
Packit	e4b6da	`the corresponding value (another byte string) is substituted`
Packit	e4b6da	`in the output, and processing repeats (until the input is finished).`
Packit	e4b6da	`If, on the other hand, no match is found, the next byte`
Packit	e4b6da	`in the input file is copied as-is, and processing repeats`
Packit	e4b6da	`at the next byte of input.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`Since bytes are 8 bits and the key strings are typically`
Packit	e4b6da	`very short (up to 3`
Packit	e4b6da	`bytes for a Unicode BMP character encoded in UTF-8),`
Packit	e4b6da	`this algorithm can be implemented with radix search.`
Packit	e4b6da	`It would be competitive, in both execution time and space,`
Packit	e4b6da	`with character codepoint hashing and sparse multi-level`
Packit	e4b6da	`arrays, the primary techniques for implementing`
Packit	e4b6da	`Unicode <emphasis>character</emphasis> translation.`
Packit	e4b6da	`(&utf8trans; is implemented using sparse multi-level arrays.)`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`<para>`
Packit	e4b6da	`One could even try to generalize the radix searching further,`
Packit	e4b6da	`so that keys can include wildcard characters, for example.`
Packit	e4b6da	`Taken to the extreme, the design would end up being`
Packit	e4b6da	`a regular expressions processor optimized for matching`
Packit	e4b6da	`many strings with common prefixes.`
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`</listitem>`
Packit	e4b6da
Packit	e4b6da
Packit	e4b6da	`</itemizedlist>`
Packit	e4b6da
Packit	e4b6da	`</para>`
Packit	e4b6da
Packit	e4b6da	`</sect1>`
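The two-pass approach recommended in the notes above (XSLT first, then a standalone script that writes the target markup) can be sketched as follows. This is only an illustration of the second stage, written in Python rather than Perl. The element names `manpage`, `title` and `para` are hypothetical and are not the actual Man-XML vocabulary, and the escaping rules cover only backslashes and leading control characters.

```python
import xml.etree.ElementTree as ET


def roff_escape(text):
    """Escape the characters that roff would otherwise interpret."""
    text = text.replace("\\", "\\e")
    lines = []
    for line in text.split("\n"):
        # A leading period or apostrophe would be read as a roff request,
        # so protect it with the zero-width escape \&.
        if line.startswith((".", "'")):
            line = "\\&" + line
        lines.append(line)
    return "\n".join(lines)


def to_roff(xml_text):
    """Second stage: turn the (hypothetical) intermediate XML into roff text."""
    root = ET.fromstring(xml_text)
    out = []
    for child in root:
        # The intermediate format marks up whole paragraphs, so whitespace
        # inside an element can simply be collapsed here.
        text = " ".join("".join(child.itertext()).split())
        if child.tag == "title":
            out.append(".SH " + roff_escape(text))
        elif child.tag == "para":
            out.append(".PP")
            out.append(roff_escape(text))
    return "\n".join(out) + "\n"


# Stand-in for what the XSLT first stage might produce.
SAMPLE = """<manpage>
  <title>DESCRIPTION</title>
  <para>.profile is read at login.</para>
  <para>Use a \\ to continue a line.</para>
</manpage>"""

print(to_roff(SAMPLE), end="")
```

Because the first stage emits ordinary XML, its output can be inspected by eye, and all of the text-level fiddling (escaping, line-start rules) stays in one small script.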
Packit	e4b6da
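The byte-string translation scheme described in the utf8trans item above can be sketched as follows. This is an illustration under stated assumptions, not the design of utf8trans itself: it is written in Python, it uses a trie of dictionaries as a simple stand-in for a real radix search structure, and the names `build_trie`, `translate` and the sample map are hypothetical.

```python
def build_trie(mapping):
    """Build a byte-indexed trie; each node is [children_dict, value_or_None]."""
    root = [{}, None]
    for key, value in mapping.items():
        if not key:
            raise ValueError("empty keys would never consume input")
        node = root
        for byte in key:
            node = node[0].setdefault(byte, [{}, None])
        node[1] = value
    return root


def translate(data, root):
    """Replace the longest matching key at each position; copy other bytes."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        node = root
        best_len, best_value = 0, None
        i = pos
        # Walk the trie as far as the input allows, remembering the
        # deepest node that ends a key (that is, the longest match).
        while i < len(data) and data[i] in node[0]:
            node = node[0][data[i]]
            i += 1
            if node[1] is not None:
                best_len, best_value = i - pos, node[1]
        if best_value is None:
            out.append(data[pos])  # no key matches here: copy one byte as-is
            pos += 1
        else:
            out += best_value
            pos += best_len
    return bytes(out)


# Hypothetical map. The keys are not prefix-free: b"-" is a prefix of b"--".
MAP = {
    "\u2014".encode("utf-8"): b"\\(em",  # EM DASH, three bytes in UTF-8
    b"-": b"\\-",
    b"--": b"\\(en",
    b"\x00": b"",                         # NUL bytes are valid keys too
}
trie = build_trie(MAP)
print(translate("a\u2014b -- c-d".encode("utf-8"), trie))
```

Keys and values are plain `bytes` objects, which carry their own length, so NUL bytes pass through unharmed and no UTF-8 decoding takes place anywhere in the loop.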
