<sect1 id="design-notes">
<sect1info>
<abstract role="texinfo-node">
<para>Author’s notes on the grand scheme of docbook2X</para>
</abstract>
</sect1info>
<title>Design notes</title>

<indexterm><primary>design</primary></indexterm>
<indexterm><primary>history</primary></indexterm>

<para>
Lessons learned:

<itemizedlist>

<listitem>
<indexterm><primary>stream processing</primary></indexterm>
<indexterm><primary>tree processing</primary></indexterm>
<para>
Think four times before doing stream-based XML processing, even though it
appears to be more efficient than tree-based processing.
Stream-based processing is usually more difficult.
</para>

<para>
But if you have to do stream-based processing, make sure to use robust,
fairly scalable tools like <classname>XML::Templates</classname>,
<emphasis>not</emphasis> <command>sgmlspl</command>. Of course it cannot
be as pleasant as tree-based XML processing, but examine
&db2x_manxml; and &db2x_texixml; to see how it can be done.
</para>
</listitem>

<listitem>
<para>
Do not use <classname>XML::DOM</classname> directly for stylesheets.
Your “stylesheet” would become seriously unmanageable.
It is also extremely slow for anything but trivial documents.
</para>

<para>
At least take a look at some of the XPath modules out there.
Better yet, check whether your solution really cannot use XSLT.
A C/C++-based implementation of XSLT can be fast enough
for many tasks.
</para>
</listitem>

<listitem>
<indexterm><primary>XSLT extensions</primary></indexterm>
<para>
Avoid XSLT extensions whenever possible. I don't think there is
anything wrong with them intrinsically, but it is a headache
to have to compile your own XSLT processor. (libxslt is written
in C, and the extensions must be compiled in and cannot be loaded
dynamically at runtime.) Not to mention that there seem to be a thousand
different set-ups for different XSLT processors.
</para>
</listitem>

<listitem>
<indexterm><primary>Perl</primary></indexterm>
<para>
Perl is not as good at XML as it’s hyped to be.
</para>

<para>
SAX comes from the Java world, and its port to Perl
(with all the object-orientedness, and without adopting Perl idioms)
is awkward to use.
</para>

<para>
Another problem is that Perl SAX does not seem to be well-maintained.
The implementations have various bugs; while they can be worked around,
they have been around for so long that it is hard to be
confident that the Perl XML modules are reliable software.
</para>

<para>
It also seems that no one else has seriously used Perl SAX
for robust applications. It seems to be unnecessarily hard to do
certain tasks, such as displaying error diagnostics on the
input or processing large documents with complicated structure.
</para>
</listitem>

<listitem>
<indexterm><primary>Man-XML</primary></indexterm>
<indexterm><primary>Texi-XML</primary></indexterm>
<para>
Do not be afraid to use XML intermediate formats
(e.g. Man-XML and Texi-XML), processed by a script written in a
scripting language, for converting to other markup languages.
The syntax rules of those target languages are made for
authoring by hand, not machine generation; hence a conversion
that relies only on tools designed for XML-to-XML conversion
requires jumping through hoops.
</para>

<para>
You might think that we could, instead, make a separate module
that abstracts all this complexity
from the rest of the conversion program. For example,
there is nothing stopping an XSLT processor from serializing
the output document as a text document obeying the syntax
rules for man pages or Texinfo documents.
</para>

<para>
Theoretically you would get the same result,
but it is much harder to implement. It is far easier to write plain
text manipulation code in a scripting language than in Java or C or XSLT.
Also, if the intermediate format is hidden in a Java class or
C API, output errors are harder to see.
With the intermediate-format approach, by contrast, we can
visually examine the textual output of the XSLT processor and fix
the Perl script as we go along.
</para>

<para>
Some XSLT processors support scripting to go beyond XSLT
functionality, but such scripting is usually not portable, and not
always easy to use.
Therefore, opt to do two-pass processing, with XSLT as the first
stage and a standalone script as the second stage.
</para>

<para>
Finally, another advantage of using intermediate XML formats
processed by a Perl script is that we can often eliminate the
use of XSLT extensions. In particular, all the way back when XSLT
stylesheets first went into docbook2X, the extensions related to
Texinfo node handling could have been easily moved to the Perl script,
but I didn't realize it! I feel stupid now.
</para>

<para>
If I had known this at the very beginning, it would have saved
a lot of development time, and docbook2X would be much more
advanced by now.
</para>

<para>
Note that even the man-pages stylesheet from the DocBook XSL
distribution essentially does two-pass processing,
just the same as the docbook2X solution. That stylesheet
had formerly used one-pass processing, and its authors
probably finally realized what a mess that was.
</para>
</listitem>

<listitem>
<para>
Design the XML intermediate format to be easy to use from the standpoint
of the conversion tool, and similar to how XML document types work in
general. For example, abstract the paragraphs of a document, rather than the
paragraph <emphasis>breaks</emphasis>
(the latter is typical of traditional markup languages, but not of XML).
</para>

</listitem>

<listitem>
<para>
I am quite impressed by some of the things that people make XSLT 1.0 do:
things that I thought were impossible, or at least unworkable
without using a “real” scripting language.
(&db2x_manxml; and &db2x_texixml; fall in the
category of things that can be done in XSLT 1.0 but inelegantly.)
</para>
</listitem>

<listitem>
<para>
Internationalize as soon as possible.
That is much easier than adding it in later.
</para>

<para>
The same advice applies to the build system.
</para>
</listitem>

<listitem>
<para>
I would advise against using build systems based
on Makefiles or any form of automake.
Of course it is inertia that prevents people from
switching to better build systems. But also
consider that while Makefile-based build systems
can do many of the things newer build systems are capable
of, they often require too many fragile hacks. Developing
these hacks takes too much time that would be better
spent developing the program itself.
</para>

<para>
Alas, better build systems such as scons were not available
when docbook2X was at an earlier stage. It’s too late
to switch now.
</para>
</listitem>

<listitem>
<para>
Writing good documentation takes skill. This manual
has been revised substantially at least four times
<footnote><para>
This number is probably inflated because of the many design
mistakes made in the process.</para></footnote>, with the author
consciously trying to condense information each time.
</para>
</listitem>

<listitem>
<para>
Table processing in the pure-XSLT man-pages conversion
is convoluted: it goes through HTML(!) tables as an intermediary.
That is the same way that the DocBook XSL stylesheets implement
it (due to Michael Smith), and I copied the code from there
almost verbatim. I did it this way to save myself the time and energy
of re-implementing table conversion <emphasis>again</emphasis>.
</para>

<para>
And Michael Smith says that going through HTML is better,
because some varieties of DocBook allow the HTML table model
in addition to the CALS table model. (I am not convinced
that this is such a good idea, but anyway.)
Then HTML tables in DocBook can be translated to man pages
too without much more effort.
</para>

<para>
Is this inefficient? Probably. But that’s what you get
if you insist on using pure XSLT. The Perl implementation
of docbook2X
had already supported table conversion two years earlier.
</para>
</listitem>

<listitem>
<para>
The design of &utf8trans; is not the best.
It was chosen to simplify the implementation while being efficient.
A more general design that still retains efficiency is possible,
and I describe it below. Unfortunately,
at this point changing &utf8trans;
would be too disruptive to users, with little gain in functionality.
</para>

<para>
Instead of working with characters, we should work with byte strings.
This means that, if all input and output is in UTF-8,
with no escape sequences, then UTF-8 decoding or encoding
is not necessary at all. Indeed the program becomes agnostic
to the character set used. Multi-character matches
also become possible.
</para>

<para>
The translation map will be an unordered list of key-value pairs.
The key and value are both arbitrary-length byte strings,
with an explicit length attached (so null bytes in the input
and output are retained).
</para>

<para>
The program would take the translation map, and transform the input file
by matching the start of the remaining input, seen as a sequence of bytes,
against the keys in the translation map, greedily.
(Since the matching is greedy, the translation keys do not
need to be restricted to be prefix-free.)
Once the longest (in byte length) matching key is found,
the corresponding value (another byte string) is substituted
in the output, and processing repeats after the matched bytes
(until the input is finished).
If, on the other hand, no match is found, the next byte
in the input file is copied as-is, and processing repeats
at the next byte of input.
</para>

<para>
Since bytes are 8 bits and the key strings are typically
very short (up to 3 bytes for a Unicode BMP character encoded in UTF-8),
this algorithm can be implemented with radix search.
It would be competitive, in both execution time and space,
with character codepoint hashing and sparse multi-level
arrays, the primary techniques for implementing
Unicode <emphasis>character</emphasis> translation.
(&utf8trans; is implemented using sparse multi-level arrays.)
</para>

<para>
One could even try to generalize the radix searching further,
so that keys can include wildcard characters, for example.
Taken to the extreme, the design would end up being
a regular expression processor optimized for matching
many strings with common prefixes.
</para>

</listitem>

</itemizedlist>
</para>
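<para>
To make the two-pass advice above more concrete, here is a minimal sketch of
what a second-stage script might look like. It is written in Python rather
than Perl purely for illustration, and the intermediate vocabulary it reads
(<literal>manpage</literal>, <literal>section</literal>,
<literal>para</literal>, <literal>b</literal>) is a hypothetical stand-in,
not the real Man-XML format. The point it tries to show is that the
intermediate format carries whole paragraphs, while the quirks of roff
syntax (escapes, control lines, paragraph macros) are handled in ordinary
text manipulation code.
</para>
<programlisting><![CDATA[
```python
import xml.etree.ElementTree as ET


def escape_roff(text):
    """Escape characters that roff would otherwise interpret."""
    return text.replace("\\", "\\e").replace("-", "\\-")


def inline(elem):
    """Flatten the mixed content of an element into one roff text line."""
    parts = [escape_roff(elem.text or "")]
    for child in elem:
        body = inline(child)
        if child.tag == "b":              # hypothetical bold element
            body = "\\fB" + body + "\\fR"
        parts.append(body)
        parts.append(escape_roff(child.tail or ""))
    # Collapse newlines and runs of spaces left over from the XML layout.
    return " ".join("".join(parts).split())


def text_line(text):
    """Protect lines that roff would mistake for control lines."""
    if text.startswith((".", "'")):
        return "\\&" + text
    return text


def to_roff(xml_source):
    """Second pass: turn the (hypothetical) intermediate XML into roff."""
    root = ET.fromstring(xml_source)
    lines = ['.TH "%s" "%s"' % (escape_roff(root.get("name", "")),
                                escape_roff(root.get("section", "")))]
    for section in root.findall("section"):
        lines.append('.SH "%s"' % escape_roff(section.get("title", "")))
        for para in section.findall("para"):
            lines.append(".PP")
            lines.append(text_line(inline(para)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = """<manpage name="demo" section="1">
      <section title="NOTES">
        <para>.profile is read by the <b>demo</b> tool in
          two-pass mode</para>
      </section>
    </manpage>"""
    print(to_roff(sample), end="")
```
]]></programlisting>
<para>
The output of the first (XSLT) stage can be saved and inspected as an
ordinary XML file, which is what makes errors in this second stage
comparatively easy to track down.
</para>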
</sect1>