<sect1 id="design-notes">
<sect1info>
<abstract role="texinfo-node">
  <para>Author’s notes on the grand scheme of docbook2X</para>
</abstract>
</sect1info>
<title>Design notes</title>

<indexterm><primary>design</primary></indexterm>
<indexterm><primary>history</primary></indexterm>

<para>
Lessons learned:

  <itemizedlist>
  
    <listitem>
<indexterm><primary>stream processing</primary></indexterm>
<indexterm><primary>tree processing</primary></indexterm>
      <para>
      Think four times before doing stream-based XML processing, even though it
      appears to be more efficient than tree-based.
      Stream-based processing is usually more difficult.
      </para>
      
      <para>
      But if you have to do stream-based processing, make sure to use robust,
      fairly scalable tools like <classname>XML::Templates</classname>, 
      <emphasis>not</emphasis> <command>sgmlspl</command>.  Of course, it cannot 
      be as pleasant as tree-based XML processing, but examine 
      &db2x_manxml; and &db2x_texixml; for examples of what can be done.
      </para>
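
      <para>
      As a rough illustration of the difference, here is the same small task
      written in both styles: report each section's title and the number of
      paragraphs it contains.  This is only a sketch, in Python rather than Perl,
      using the standard <literal>xml.etree.ElementTree</literal> and
      <literal>xml.sax</literal> modules; the sample document and all function
      and class names are invented for this example.
      </para>

<programlisting language="python"><![CDATA[
import xml.sax
import xml.etree.ElementTree as ET

# Invented sample document, used only for this illustration.
DOC = """<article>
  <sect1><title>Intro</title><para>a</para><para>b</para></sect1>
  <sect1><title>Usage</title><para>c</para></sect1>
</article>"""


def para_counts_tree(text):
    """Tree-based: the whole document is in memory and can be queried directly."""
    root = ET.fromstring(text)
    return [(s.findtext("title"), len(s.findall("para")))
            for s in root.iter("sect1")]


class ParaCounter(xml.sax.ContentHandler):
    """Stream-based: the handler must remember by hand where it is."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.in_title = False
        self.title = []
        self.count = 0

    def startElement(self, name, attrs):
        if name == "sect1":
            # A new section starts: reset the per-section state.
            self.title, self.count = [], 0
        elif name == "title":
            self.in_title = True
        elif name == "para":
            self.count += 1

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self.in_title:
            self.title.append(content)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
        elif name == "sect1":
            self.results.append(("".join(self.title), self.count))


def para_counts_stream(text):
    handler = ParaCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.results
]]></programlisting>

      <para>
      Both functions return the same list for the sample document, but the
      stream-based version needs explicit state for every piece of context it
      cares about, and that bookkeeping grows quickly with real document types.
      </para>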
    </listitem>
    
    <listitem>
      <para>
      Do not use <classname>XML::DOM</classname> directly for stylesheets.
      Your “stylesheet” would become seriously unmanageable.
      It is also extremely slow for anything but trivial documents.
      </para>

      <para>
      At least take a look at some of the XPath modules out there.
      Better yet, see if your solution really cannot use XSLT.
      A C/C++-based implementation of XSLT can be fast enough
      for many tasks.
      </para>
    </listitem>
    
    <listitem>
<indexterm><primary>XSLT extensions</primary></indexterm>
      <para>
      Avoid XSLT extensions whenever possible.  I don't think there is
      anything wrong with them intrinsically, but it is a headache
      to have to compile your own XSLT processor.  (libxslt is written 
      in C, and the extensions must be compiled in and cannot be loaded
      dynamically at runtime.)  Not to mention that there seem to be a thousand
      different set-ups for different XSLT processors.
      </para>
    </listitem>
    
    <listitem>
<indexterm><primary>Perl</primary></indexterm>
      <para>
      Perl is not as good at XML as it’s hyped to be.  
      </para>

      <para>
      SAX comes from the Java world, and its port to Perl
      (with all the object-orientedness, and without adopting Perl idioms)
      is awkward to use.
      </para>

      <para>
      Another problem is that Perl SAX does not seem to be well-maintained.
      The implementations have various bugs.  These can be worked around,
      but they have been around for such a long time that they do not inspire
      confidence that the Perl XML modules are reliable software.
      </para>

      <para>  
      It also seems that no one else has seriously used Perl SAX
      for robust applications.  It seems to be unnecessarily hard to do
      certain tasks, such as displaying error diagnostics on the
      input or processing large documents with complicated structure.
      </para>
    </listitem>
    
    <listitem>
<indexterm><primary>Man-XML</primary></indexterm>
<indexterm><primary>Texi-XML</primary></indexterm>
      <para>
      Do not be afraid to use XML intermediate formats 
      (e.g. Man-XML and Texi-XML), processed with a scripting language,
      for converting to other markup languages.
      The syntax rules of the target markup languages (such as man pages
      and Texinfo) are made for authoring by hand, not machine generation;
      hence a conversion using tools designed for XML-to-XML conversion
      requires jumping through hoops. 
      </para>
    
      <para>
      You might think that we could, instead, make a separate module 
      that abstracts all this complexity
      from the rest of the conversion program.  For example,
      there is nothing stopping an XSLT processor from serializing
      the output document as a text document obeying the syntax
      rules for man pages or Texinfo documents.
      </para>

      <para>
      Theoretically you would get the same result,
      but it is much harder to implement.  It is far easier to write plain 
      text manipulation code in a scripting language than in Java or C or XSLT.
      Also, if the intermediate format is hidden in a Java class or 
      C API, output errors are harder to see.
      With the intermediate-format approach, by contrast, we can
      visually examine the textual output of the XSLT processor and fix
      the Perl script as we go along.
      </para>

      <para>
      Some XSLT processors support scripting to go beyond XSLT
      functionality, but such scripting is usually not portable, and not 
      always easy to use.
      Therefore, opt for two-pass processing, with XSLT as the first
      stage and a standalone script as the second stage.
      </para>

      <para>
      Finally, another advantage of using intermediate XML formats
      processed by a Perl script is that we can often eliminate the
      use of XSLT extensions.  In particular, all the way back when XSLT 
      stylesheets first went into docbook2X, the extensions related to
      Texinfo node handling could have been easily moved to the Perl script,
      but I didn't realize it!  I feel stupid now. 
      </para>

      <para>
      If I had known this in the very beginning, it would have saved 
      a lot of development time, and docbook2X would be much more 
      advanced by now.
      </para>

      <para>
      Note that even the man-pages stylesheet from the DocBook XSL
      distribution essentially does two-pass processing
      just the same as the docbook2X solution.  That stylesheet
      had formerly used one-pass processing, and its authors 
      probably finally realized what a mess that was.
      </para>
    </listitem>

    <listitem>
      <para>
      Design the XML intermediate format to be easy to use from the standpoint
      of the conversion tool, and similar to how XML document types work in
      general.  For example, abstract the paragraphs of a document, rather than
      the <emphasis>breaks</emphasis> between them
      (the latter is typical of traditional markup languages, but not of XML).
      </para>
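
      <para>
      A minimal sketch of why this helps the conversion tool, written in Python
      for illustration.  The element names <literal>doc</literal> and
      <literal>para</literal> are invented here and are not meant to reproduce
      the actual Man-XML vocabulary; escaping of roff control characters is
      also left out.  Because each paragraph is its own element, the script
      alone decides how paragraphs are separated in the output (here with the
      man macro <literal>.PP</literal>).
      </para>

<programlisting language="python"><![CDATA[
import xml.etree.ElementTree as ET

# Hypothetical intermediate format: one element per paragraph.
INTERMEDIATE = (
    "<doc><para>First paragraph.</para>"
    "<para>Second   paragraph.</para></doc>"
)


def to_roff(text):
    """Emit a .PP request followed by the normalized text of each paragraph."""
    lines = []
    for para in ET.fromstring(text).iter("para"):
        lines.append(".PP")
        # Collapse runs of whitespace; empty paragraphs yield an empty line.
        lines.append(" ".join((para.text or "").split()))
    return "\n".join(lines) + "\n"
]]></programlisting>

      <para>
      With a format that recorded only paragraph breaks, the script would
      instead have to infer where each paragraph starts and ends.
      </para>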
    
    </listitem>
    
    <listitem>
      <para>
      I am quite impressed by some of the things that people make XSLT 1.0 do:
      things that I thought were impossible, or at least unworkable,
      without using a “real” scripting language. 
      (&db2x_manxml; and &db2x_texixml; fall in the
      category of things that can be done in XSLT 1.0 but inelegantly.)
      </para>
    </listitem>
    
    <listitem>
      <para>
      Internationalize as soon as possible.  
      That is much easier than adding it in later.
      </para>
      
      <para>
      The same advice applies to the build system.
      </para>
    </listitem>

    <listitem>
      <para>
        I would advise against using build systems based
        on Makefiles or any form of automake.
        Of course it is inertia that prevents people from
        switching to better build systems.  But also
        consider that while Makefile-based build systems 
        can do many of the things newer build systems are capable
        of, they often require too many fragile hacks.  Developing
        these hacks takes too much time that would be better
        spent developing the program itself.
      </para>

      <para>
        Alas, better build systems such as scons were not available
        when docbook2X was at an earlier stage.  It’s too late
        to switch now.
      </para>
    </listitem>

    <listitem>
      <para>
      Writing good documentation takes skill.  This manual
      has been revised substantially at least four times
      <footnote><para>
      This number is probably inflated because of the many design 
      mistakes made in the process.</para></footnote>, with the author
      consciously trying to condense information each time.
      </para>
    </listitem>

    <listitem>
      <para>
        Table processing in the pure-XSLT man-pages conversion
        is convoluted: it goes through HTML(!) tables as an intermediary.
        That is the same way that the DocBook XSL stylesheets implement
        it (due to Michael Smith), and I copied the code there
        almost verbatim.  I did it this way to save myself the time and energy
        of re-implementing table conversion <emphasis>again</emphasis>.
      </para>

      <para>
        And Michael Smith says that going through HTML is better,
        because some varieties of DocBook allow the HTML table model
        in addition to the CALS table model.  (I am not convinced
        that this is such a good idea, but anyway.)
        Then HTML tables in DocBook can be translated to man pages
        too without much more effort.
      </para>

      <para>
        Is this inefficient? Probably.  But that’s what you get
        if you insist on using pure XSLT.  The Perl implementation
        of docbook2X
        already supported table conversion two years earlier.
      </para>
    </listitem>

    <listitem>
      <para>
        The design of &utf8trans; is not the best.
        It was chosen to keep the implementation simple while remaining efficient.
        A more general design that retains this efficiency is possible, 
        and I describe it below.  Unfortunately,
        at this point changing &utf8trans;
        would be too disruptive to users for little gain in functionality.
      </para>

      <para>
        Instead of working with characters, we should work with byte strings.
        This means that, if all input and output is in UTF-8,
        with no escape sequences, then UTF-8 decoding or encoding
        is not necessary at all.  Indeed the program becomes agnostic
        to the character set used.  As a side effect,
        multi-character matches become possible.
      </para>
      
      <para>
        The translation map would be an unordered list of key-value pairs.
        The key and value are both arbitrary-length byte strings,
        with an explicit length attached (so null bytes in the input
        and output are retained).
      </para>

      <para>
        The program would take the translation map, and transform the input file
        by matching the start of input, seen as a sequence of bytes, 
        against the keys in the translation map, greedily.
        (Since the matching is greedy, the translation keys do not
        need to be restricted to be prefix-free.)
        Once the longest (in byte length) matching key is found, 
        the corresponding value (another byte string) is substituted
        in the output, and processing repeats (until the input is finished).
        If, on the other hand, no match is found, the next byte
        in the input file is copied as-is, and processing repeats 
        at the next byte of input.
      </para>

      <para>
        Since bytes are 8 bits and the key strings are typically
        very short (up to 3 
        bytes for a Unicode BMP character encoded in UTF-8),
        this algorithm can be implemented with radix search.
        It would be competitive, in both execution time and space,
        with character codepoint hashing and sparse multi-level
        arrays, the primary techniques for implementing
        Unicode <emphasis>character</emphasis> translation.
        (&utf8trans; is implemented using sparse multi-level arrays.)
      </para>
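
      <para>
        The matching procedure described above can be sketched as follows.
        This is not code from &utf8trans;; it is a small Python illustration
        of the proposed design, and the names <literal>build_trie</literal>
        and <literal>translate</literal> are invented.  It uses a trie of
        nested dictionaries keyed by byte value as a simple stand-in for a
        radix search structure, so it says nothing about the performance
        of a real implementation.
      </para>

<programlisting language="python"><![CDATA[
def build_trie(mapping):
    """Build a byte trie from a dict that maps bytes keys to bytes values."""
    root = {"next": {}, "value": None}
    for key, value in mapping.items():
        if not key:
            raise ValueError("an empty key would never consume any input")
        node = root
        for byte in key:
            node = node["next"].setdefault(byte, {"next": {}, "value": None})
        node["value"] = value
    return root


def translate(data, trie):
    """Replace the longest matching key at each position; copy other bytes."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        node = trie
        best_value, best_end = None, pos
        i = pos
        # Walk the trie as far as the input allows, remembering the
        # longest key seen so far (greedy matching).
        while i < len(data) and data[i] in node["next"]:
            node = node["next"][data[i]]
            i += 1
            if node["value"] is not None:
                best_value, best_end = node["value"], i
        if best_value is None:
            # No key matches here: copy one byte unchanged and move on.
            out.append(data[pos])
            pos += 1
        else:
            out += best_value
            pos = best_end
    return bytes(out)
]]></programlisting>

      <para>
        For example, with the two keys <literal>--</literal> and
        <literal>---</literal> in the map, an input of three hyphens is
        replaced by the value for <literal>---</literal>, even though the
        shorter key is a prefix of the longer one.  Values may be empty
        byte strings, which deletes the matched key from the output.
      </para>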

      <para>
        One could even try to generalize the radix searching further,
        so that keys can include wildcard characters, for example.
        Taken to the extreme, the design would end up being
        a regular expression processor optimized for matching
        many strings with common prefixes.
      </para>

    </listitem>

  </itemizedlist>
  
</para>

</sect1>