Tree - source-git/djvulibre - CentOS Git server

source-git / djvulibre

Files

Commit: df99a1a4b954b33d0c87eaa7846fef87e1b3f0bd
Blob Blame History Raw



<WARNING: THIS DOCUMENT HAS BEEN MADE OBSOLETE 
 BY THE LIZARDTECH SPECIFICATION "DJVU3SPEC.DJVU".>



    This file summarizes the file format changes 
    between DjVu2 and DjVu3.




------------------------------------------------------------
1 - DJVU3 FILE STRUCTURE OVERVIEW
------------------------------------------------------------

    DjVu files are organized according to the ``EA IFF 85'' layout.  Pointers to
    the appropriate reference document are provided in section
    \Ref{IFFByteStream.h}.  IFF files are logically composed of a sequence of
    data \emph{chunks}.  Each chunk comes with a four character \emph{chunk
    identifier} describing the type of the data stored in the chunk.  A few
    special chunk identifiers, for instance #"FORM"#, are reserved for so
    called \emph{composite chunks} containing a sequence of data chunks.  This
    convention effectively provides IFF files with a hierarchical structure.
    Composite chunks are further identified by a \emph{secondary chunk
    identifier}.  For convenience, both identifiers are gathered as an
    extended chunk identifier such as #"FORM:DJVU"#.

    The four octets #0x41,0x54,0x26,0x54# may be inserted in front of the IFF 
    compliant byte stream.  The decoder simply ignores these four octets when
    they are present.  These four octets are not part of the IFF format and
    are not required components of a valid DjVu file.  Certain versions of MSIE
    incorrectly recognize any IFF file as a Microsoft AIFF sound file.  The
    presence of these four octets prevents this incorrect identification.

    The DjVu specification mandates that the decoder should silently
    skip chunks whose identifier is not recognized.  This mechanism
    provides a backward compatible way to extend the initial format by
    allocating new chunk identifiers. 

------------------------------------------------------------
1.1 - DJVU3 IMAGE FILES
------------------------------------------------------------

    \textbf{Photo DjVu Image} --- 

    Photo DjVu Image files are best used for
    encoding photographic images in colors or in shades of gray.  The data
    compression model relies on the IW44 wavelet representation.  This format
    is designed such that the IW44 decoder is able to quickly perform
    progressive rendering of any image segment using only a small amount of
    memory.  Photo DjVu files are composed of a single #"FORM:DJVU"# composite
    chunk.  This composite chunk always begins with one #"INFO"# chunk
    describing the image size and resolution (see \Ref{DjVuInfo.h}).  One or
    more additional #"BG44"# chunks contains the image data encoded with the
    IW44 representation (see \Ref{IW44Image.h}).  The image size specified in
    the #"INFO"# chunk and the image size specified in the IW44 data must be
    equal.

    \textbf{Bilevel DjVu Image} --- 

    Bilevel DjVu Image files are used to compress
    black and white images representing text and simple drawings.   The
    JB2 data compression model uses the soft pattern matching technique, which
    essentially consists of encoding each character by describing how it
    differs from a well chosen already encoded character.  Bilevel DjVu Files
    are composed of a single #"FORM:DJVU"# composite chunk.  This composite
    chunk always begins with one #"INFO"# chunk describing the image size and
    resolution (see \Ref{DjVuInfo.h}).  An additional #"Sjbz"# chunk contains
    the bilevel data encoded with the JB2 representation (see
    \Ref{JB2Image.h}).  The image size specified in the #"INFO"# chunk and the
    image size specified in the JB2 data must be equal.

    \textbf{Compound DjVu Image} --- 

    Compound DjVu Files are an extremely
    efficient way to compress high resolution Compound document images
    containing both pictures and text, such as a page of a magazine.  Compound
    DjVu Files represent the document images using two layers.  The
    \emph{background layer} is used for encoding the pictures and the
    paper texture.
    The \emph{foreground layer} is used for encoding the text and the drawings.
    Compound DjVu Files are composed of a single #"FORM:DJVU"# composite
    chunk.  This composite chunk always begins with one #"INFO"# chunk
    describing the size and the resolution of the image (see \Ref{DjVuInfo}).
    Additional chunks hold the components of either the foreground or the
    background layers.

    The main component of the foreground layer is a bilevel image named the
    \emph{foreground mask}. The pixel size of the foreground mask is equal to
    the size of the DjVu image.  It contains a black-on-white representation
    of the text and the drawings.  This image is encoded by a #"Sjbz"# chunk
    using the JB2 representation.  There may also be a companion chunk
    #"Djbz"# containing a \emph{shape dictionary} that defines bilevel shapes
    referenced by the #"Sjbz"# chunk.

    The \emph{foreground colors} can be encoded according to two models:
    \begin{itemize}
    \item 
      The foreground colors may be encoded using a small color image,
      the \emph{foreground color image}, encoded as a single #"FG44"#
      chunk using the
      IW44 representation (see \Ref{IW44Image.h}).  Such compound DjVu images
      are rendered by painting the foreground color image on top of the
      background color image using the foreground mask as a stencil.  The
      pixel size of the foreground color image is computed by rounding up the
      quotient of the mask size by an integer sub-sampling factor ranging from
      1 to 12.  Most Compound DjVu Images use a foreground color sub-sampling
      factor of 12.  Smaller sub-sampling factors produce very slightly better
      images.
    \item
      The foreground colors may be encoded by specifying one solid color per
      object described by the JB2 encoded mask. These \emph{JB2 colors} are
      color-quantized and stored in a single #"FGbz"# chunk (see.
      \Ref{DjVuPalette.h}).  Such compound DjVu images are rendered by
      painting each foreground object on top of the background color image
      using the solid color specified by the #"FGbz"# chunk.
    \end{itemize}

    The background layer is a color image, \Ref{the background color image}
    ncoded by an arbitrary number of #"BG44"# chunks containing successive
    IW44 refinements (see \Ref{IW44Image.h}).  The size of this image is
    computed by rounding up the quotient of the mask size by an integer
    sub-sampling factor ranging from 1 to 12.  Most Compound DjVu Images use a
    background sub-sampling factor equal to 3.  Smaller sub-sampling factors
    are adequate for images with a very rich paper texture.  Larger
    sub-sampling factors are adequate for images containing no pictures.

    There are no ordering or interleaving constraints on these chunks except
    that (a) the #"INFO"# chunk must appear first, and (b) the successive
    #"BG44"# refinements must appear with their natural order.  The chunk
    order simply affects the progressive rendering of DjVu images on a web
    browser.  

    \textbf{IW44 Image Files} --

    The IW44 Image file format is the native format for the IW44 wavelet
    representation.  These files are deprecated in favor of Photo DjVu
    Images.

    \textbf{Alternative encodings} --- 

    Besides the JB2 and IW44 encoding schemes,
    the DjVu format supports alternative encoding methods for its components.  

    \begin{itemize}
    \item
       The foreground mask may be represented by a single #"Smmr"# chunk
       instead of #"Sjbz"#.  The #"Smmr"# chunk contains a bilevel image
       encoded with the Fax-G4/MMR method.  Although the resulting files
       are typically six times larger, this capability can be useful when
       DjVu is used as a front-end for fax machines and scanners with 
       embedded Fax-G4/MMR capabilities. 
    \item
       The background color image may be represented by a single #"BGjp"#
       chunk instead of several #"BG44"# chunks.  The #"BGjp"# chunk contains
       a JPEG encoded color image.  The resulting files are significantly
       larger and lack the progressivity of the usual DjVu files.  
       This is useful because some scanners have embedded JPEG capabilities.
    \item
       The foreground color image may be represented by a single #"FGjp"#
       chunk instead of a single #"FG44"# chunk.  This is useful because 
       some scanners have embedded JPEG capabilities.
    \end{itemize}

    In addition, the chunk names #"BG2k"# and #"FG2k"# have been reserved for
    encoding the background color image and the foreground color image using
    the forthcoming JPEG-2000 standard.  This capability is not implemented at
    the moment.  The JPEG-2000 standard may even become the preferred encoding
    method for color images in DjVu.  */

    \textbf{Annotations and Textual Information } --

    All types of DjVu images may contain
    annotation chunks.  Annotation chunks are currently used to describe
    hyperlinks, to specify more closely the behavior of the viewers,
    and to hold metadata information.  Annotations are contained in #"ANTa"# 
    or #"ANTz"# chunks.  The #"ANTa"# chunks contain the annotation in 
    plain text. The #"ANTz"# chunks contain the same information compressed 
    with the BZZ encoder (cf. \Ref{BSByteStream.h}).

    All types of DjVu image files may also contain a
    computer readable description of the text appearing on the page.  This
    information is contained by either a #"TXTa"# chunk or #"TXTz"# chunk.
    The #"TXTa"# chunk contains uncompressed data.  The #"TXTz"# chunk
    contains the same data compressed with the \Ref{bzz} compressor
    (cf. \Ref{BSByteStream.h}).  The #"TXTa"# chunks begins by a 24 bit
    integer (most significant byte first) describing the length of the text in
    bytes.  Then come the ISO10646/UTF8 text.  Additional information
    indicates the position of each column/region/paragraph/line/word in the
    document.  More information about the capabilities of the chunk can be
    found in section \Ref{DjVuTXT}.  More information about the encoding of
    textual information can be found in file #"DjVuAnno.cpp"#.  */


------------------------------------------------------------
1.2 - DJVU3 MULTIPAGE DOCUMENTS
------------------------------------------------------------

    The DjVu3 system supports two models for multi-page documents:
    \emph{bundled} multi-page documents and \emph{indirect} multi-page documents.
    
    \textbf{Bundled multi-page documents} --- 

    A \emph{bundled} multi-page DjVu
    document uses a single file to represent the entire document.  This single
    file contains all the pages as well as ancillary information (e.g. the
    page directory, data shared by several pages, thumbnails, etc.).  Using a
    single file format is very convenient for storing documents or for sending
    email attachments.

    A bundled multi-page document is composed of a single #"FORM:DJVM"#
    composite chunk.  This composite chunk always begins with a #"DIRM"# chunk
    containing the document directory (see. \Ref{DjVmDir.h}) which represents
    the list of the \emph{component files} that compose the document.  The
    component files themselves are then encoded as IFF85 composite chunks
    following the #"DIRM"# chunk.

    \begin{itemize}
    \item  
       Component files may be any valid DjVu image (see \Ref{DjVu Image Files})
       or IW44 image (see \Ref{IW44 Image Files}.)  These component files 
       always represent a page of a document.  The corresponding IFF85 chunk ids are 
       #"FORM:DJVU"#, #"FORM:PM44"#, or #"FORM:BM44"#.
    \item 
       Component files may contain shared information indirectly referenced by
       some document pages.  These \emph{shared component files} are always composed
       of a single #"FORM:DJVI"# chunk containing an arbitrary collection of
       chunks. 
    \item
       Thumbnail files contain optional thumbnail images for a few consecutive
       pages of the document.  Thumbnail files consist of a single
       #"FORM:THUM"# composite chunk containing several #"TH44"# chunks
       containing IW44 encoded thumbnail images (see \Ref{IW44Image.h}).  These
       thumbnails always pertain the first few page files following the
       thumbnail file in the document directory.
    \end{itemize}

    \textbf{Including shared information} --- 

    Any DjVu image file contained in a multipage file may contain an #"INCL"#
    chunk containing the ID of a shared component file.  The decoder processes
    the chunks contained in the shared component file as if they were
    contained by the DjVu image file.

    A shared component file is composed of a single #"FORM:DJVI"# potentially
    containing any information otherwise allowed in a DjVu image file (except
    for the #"INFO"# chunk of course).

    There are many benefits associated with storing such shared information in
    separate files.  A well designed browser may keep pre-decoded copies of
    these files in a cache.  This procedure would reduce the size of the data
    transferred over the Internet and also increase the display speed.  The
    multipage DjVu compressor, for instance, identifies similar object shapes
    occuring in several pages.  These shapes are encoded in a shape dictionary
    (chunk #"Djbz"#) placed in a shared component file.  All relevant pages
    include this shared component file.  Although they appear in several
    pages, these shared shapes are encoded only once in the document.

    \textbf{Browsing a multi-page document} --- 

    You can view the pages using the DjVu plugin and a web browser.  When you
    type the URL of a multi-page document, the browser starts downloading the
    whole file, but displays the first page as soon as it is available.  You
    can immediately navigate to other pages using the DjVu toolbar.  Suppose
    however that the document is stored on a remote web server.  You can
    easily access the first page and see that this is not the document you
    wanted.  Although you will never display the other pages the browser is
    transferring data for these pages and is wasting the bandwith of your
    server (and the bandwith of the Internet too).  You could also see the
    summary of the document on the first page and jump to page 100.  But page
    100 cannot be displayed until data for pages 1 to 99 has been received.
    You may have to wait for the transmission of unnecessary page data.  This
    second problem (the unnecessary wait) can be solved using the ``byte
    serving'' options of the HTTP/1.1 protocol.  This option has to be
    supported by the web server, the proxies, the caches and the browser.  We
    are coming there but not quite yet.  Byte serving however does not solve
    the first problem (the waste of bandwith).

    \textbf{Indirect multi-page documents} --- 

    DjVu solves both problem using a
    special multi-page format named the \emph{indirect} model.  An indirect
    multi-page DjVu document is composed of several files.  The main file is
    named the \emph{index file}.  You can browse a document using the URL of
    the index file, just like you do with a bundled multi-page document.  The
    index file however is very small.  It simply contains the document
    directory and the URLs of secondary files containing the page data.  When
    you browse an indirect multi-page document, the browser only accesses data
    for the pages you are viewing.  This can be done at a reasonable speed
    because the browser maintains a cache of pages and sometimes pre-fetches a
    few pages ahead of the current page.  This model uses the web serving
    bandwith much more effectively.  It also eliminates unnecessary delays
    when jumping ahead to pages located anywhere in a long document.

    \textbf{Obsolete Formats} --- 

    The library also supports two other multipage
    formats which are now obsolete.  These formats are technologically
    inferior and should no longer be used. */





------------------------------------------------------------
2 - CHUNK ENCODING
------------------------------------------------------------

    This section describes 
    - the encoding of new chunks introduces with DjVu3
    - the encoding changes of chunks already present in DjVu2



------------------------------------------------------------
2.1 - CHANGES TO JB2 ( "Sjbz" AND "Djbz" CHUNKS )
------------------------------------------------------------

    Two extensions of the JB2 encoding format have been introduced
    with DjVu files version 21.  Both extensions maintain significant
    backward compatibility with previous version of the JB2 format.
    These extensions are described below by reference to the DjVu2 spec
    dated August 1999.  Both extension make use of the unused record 
    type value #9# (cf. ICFDD page 24) which has been renamed
    #REQUIRED_DICT_OR_RESET#.

    \textbf{Shared Shape Dictionaries} --- This extension provides
    support for sharing symbol definitions between the pages of a
    document.  To achieve this objective, the JB2 image data chunk
    must be able to address symbols defined elsewhere by a JB2
    dictionary data chunk shared by all the pages of a document.

    The arithmetically encoded JB2 image data logically consist of a
    sequence of records. The decoder processes these records in
    sequence and maintains a library of symbols which can be addressed
    by the following records.  The first record usually is a ``Start
    Of Image'' record describing the size of the image.

    Starting with version 21, a #REQUIRED_DICT_OR_RESET# (9) record
    type can appear \emph{before} the #START_OF_DATA# (0) record.  The
    record type field is followed by a single number arithmetically
    encoded (cf. ICFDD page 26) using a sixteenth context (cf. ICFDD
    page 25).  This record appears when the JB2 data chunk requires
    symbols encoded in a separate JB2 dictionary data chunk.  The
    number (the \textbf{dictionary size}) indicates how many symbols
    should have been defined by the JB2 dictionary data chunk.  The
    decoder should simply load these symbols in the symbol library and
    proceed as usual.  New symbols potentially defined by the
    subsequent JB2 image data records will therefore be numbered with
    integers greater or equal than the dictionary size.

    The JB2 dictionary data format is a pure subset of the JB2 image
    data format.  The #START_OF_DATA# (0) record always specifies an
    image width of zero and an image height of zero.  The only allowed
    record types are those defining library symbols only
    (#NEW_SYMBOL_LIBRARY_ONLY# (2) and #MATCHED_REFINE_LIBRARY_ONLY#
    (5) cf. ICFDD page 24) followed by a final #END_OF_DATA# (11)
    record.

    The JB2 dictionary data is usually located in an \textbf{Djbz} chunk.
    Each page \textbf{FORM:DJVU} may directly contain a \textbf{Djbz} chunk,
    or may indirectly point to such a chunk using an \textbf{INCL} chunk
    (cf. \Ref{Multipage DjVu documents.}).

    \textbf{Numcoder Reset} --- This extension addresses a problem for
    hardware implementations.  The encoding of numbers (cf. ICFDD page
    26) potentially uses an unbounded number of binary coding
    contexts. These contexts are normally allocated when they are used
    for the first time (cf. ICFDD informative note, page 27).

    Starting with version 21, a #REQUIRED_DICT_OR_RESET# (9) record
    type can appear \emph{after} the #START_OF_DATA# (0) record.  The
    decoder should proceed with the next record after \emph{clearing
    all binary contexts used for coding numbers}.  This operation
    implies that all binary contexts previously allocated for coding
    numbers can be deallocated.
  
    Starting with version 21, the JB2 encoder should insert a
    #REQUIRED_DICT_OR_RESET# record type whenever the number of these
    allocated binary contexts exceeds #20000#.  Only very large
    documents ever reach such a large number of allocated binary
    contexts (e.g large maps).  Hardware implementation however can
    benefit greatly from a hard bound on the total number of binary
    coding contexts.  Old JB2 decoders will treat this record type as
    an #END_OF_DATA# record and cleanly stop decoding (cf. ICFDD page
    30, Image refinement data).




------------------------------------------------------------
2.2 - JB2 COLORS ( "FGbz" CHUNK )
------------------------------------------------------------

    To be documented.

    The #"FGbz"# contains BZZ compressed data 
    (cf. \Ref{BSByteStream.h}).

    The uncompressed data can be decoded using function
    #DjVuPalette::decode# defined in file #"DjVuPalette.cpp"#.




------------------------------------------------------------
2.3 - ANNOTATIONS ( "ANTa" AND "ANTz" CHUNKS )
------------------------------------------------------------

[MAY 19TH, 2005: 
 New annotation types have been defined by Lizardtech.
 The authoritative documentation is now the djvused man page.]


    Annotations are contained in #"ANTa"# 
    or #"ANTz"# chunks.  The #"ANTa"# chunks contain the annotation in 
    plain text. The #"ANTz"# chunks contain the same information compressed 
    with the BZZ encoder (cf. \Ref{BSByteStream.h}).

    The complete annotation text is obtained by concatenating all annotation 
    chunks present in the page.  Pages can share annotations using an INCL
    chunk as explained in section \Ref{Including shared information}.
    A restriction of the current reference library implementation
    limits the number of shared annotation files to one.

    The syntax of the annotation text uses a simple
    parenthesized notation. Erroneous and unrecognized constructs are silently
    ignored.  The following constructs are recognized:

    \begin{description}
    \item[(background <color>)]
       Sets the color of the viewer area surrounding the DjVu image.
       The color argument #color# are always represented using X11 
       syntax \##RRGGBB#. For instance \##000000# is black 
       and \##FFFFFF# is white.

    \item[(zoom <zoom-value>)]
       Sets the initial zoom factor of the image.  Argument #zoom-value# may
       be #stretch#, #one2one#, #width#, #page#, or composed of the letter
       #"d"# followed by a number between #1# and #999# (such as in #d300# for
       instance.)

    \item[(mode <mode-value>)]
       Sets the display mode for the image.  Argument #mode-value# may
       be #color#, #bw#, #fore# or #back#.

    \item[(align <horz-align> <vert-align>)]
       Specifies how the image should be aligned on the viewer surface.
       By default the image is located in the center.  Argument #horz-align#
       may be #left#, #center#, or #right#.  Argument #vert-align# may be
       #top#, #center#, or #bottom#.

    \item[(maparea <url> <comment> <area> <...options...>]
       Defines an hyperlink for the URL specified by argument #url#.

       Argument #url# may have one of the following two forms:
       \begin{verbatim}
            "<href>"
            (url "<href>" "<target>")
       \end{verbatim}
       where #href# is a string representing the URL and #target# is a string
       representing the target frame for the hyperlink (cf. Documentation for
       the HTML tag #<A>#).  Both strings are surrounded with double quotes.
       Argument #comment# is a string surrounded by double quotes.
       This string may be displayed as a tooltip when the user
       moves the mouse over the hyperlink.
       Argument #area# defines the shape of the hyperlink.
       The following options are supported for representing
       rectangle, circle, or polygons.
       \begin{verbatim}
            (rect <xmin> <ymin> <width> <height>)
            (oval <xmin> <ymin> <width> <height>)
            (polygon <x0> <y0> <x1> <y1> ....)
       \end{verbatim}
       All parameters are numbers representing coordinates measured in image
       pixels with the origin set at the bottom left corner of the image.  The
       remaining arguments describe options regarding the hyperlink borders.
       A first set of option define the type of the borders:
       \begin{verbatim}
            (xor)
            (border <color>
            (shadow_in [<thickness>])
            (shadow_out [<thickness>])
            (shadow_ein [<thickness>])
            (shadow_eout [<thickness>])
       \end{verbatim}
       where parameter #color# has syntax \##RRGGBB# (as above) and parameter
       #thickness# is a number from 1 to 32.  The last four border modes are
       only supported with rectangular areas. The border becomes visible when
       the user moves the mouse over the hyperlink.  The border may be made
       always visible by using the following option:
       \begin{verbatim}
            (border-avis)
       \end{verbatim}
       Finally the following option may be used with rectangular areas only.
       The complete area will be hilited using the specified color (specified
       with syntax \##RRGGBB# as usual).
       \begin{verbatim}
            (hilite <color>)
       \end{verbatim}
       This is often used with an empty URL for simply emphasizing a specific
       segment of an image.

    \item[(metadata <...entries...>)]
       Defines multiple metadata entries.

       Each metadata entry has the form
       \begin{verbatim}
          (<key> "<value>")
       \end{verbatim}
       parameter #<key># is a symbolic attribute name such as #year#,
       #booktitle#, #editor#, #author#, and parameter #<value>#
       is a UTF-8 encoded string representing the attribute value.  
       Common C escape sequences are recognized.
       It is suggested to use the same key names as 
       the BibTeX bibliography system.

       Metadata pertaining to the entire document should be placed
       in a shared annotation file (and therefore are seen in all pages).
       Metadata pertaining to a particular page are usually places
       inside an #"ANTz"# chunk in this particular page.

    \end{description}

    



------------------------------------------------------------
2.4 - HIDDEN TEXT ( "TXTa" AND "TXTz" CHUNKS )
------------------------------------------------------------

    To be documented.

    The #"TXTa"# chunk contains uncompressed data.  
    The #"TXTz"# chunk contains BZZ compressed data (cf. \Ref{BSByteStream.h}).

    The uncompressed data can be decoded using function #DjVuText::decode#
    defined in file #"DjVuText.cpp"# Program #djvused# can display the content
    of the text chunk using a lisp syntax, and can create a text chunk from
    this lisp syntax.


------------------------------------------------------------
2.5 - MULTIPAGE DIRECTORY CHUNK ( "DIRM" CHUNK )
------------------------------------------------------------

    Multipage DjVu documents follow the EA
    IFF85 format (cf. \Ref{IFFByteStream.h}.)  A document is composed of a
    #"FORM:DJVM"# whose first chunk is a #"DIRM"# chunk containing the
    \emph{document directory}.  This directory lists all component
    files composing
    the given document, helps to access every component file and identify the
    pages of the document.

    \begin{itemize} 
    \item
         In a \emph{bundled} multipage file, the component files 
         are stored immediately after the #"DIRM"# chunk,
         within the #"FORM:DJVM"# composite chunk.  
    \item
         In an \emph{indirect} multipage file, the component files are 
         stored in different files whose URLs are composed using information 
         stored in the #"DIRM"# chunk.
    \end{itemize} 

    Most of the component files represent pages of a document.  Some files
    however represent data shared by several pages.  The pages refer to these
    supporting files by means of an inclusion chunk (#"INCL"# chunks)
    identifying the supporting file.  Every directory record describes a 
    component file.  Each component file is identified by a small string 
    named the identifier (ID). Each component file also contains a 
    file name and a title.

    Theoretically, IDs are used to uniquely identify each component file in
    #"INCL"# chunks, names are used to compose the the URLs of the component
    files in an indirect multipage DjVu file, and titles are cosmetic names
    possibly displayed when viewing a page of a document.  There are however
    many problems with this scheme, and we \emph{strongly suggest}, with the
    current implementation to always make the file ID, the file name and the
    file title identical.

    \textbf{Variants} --- There are two versions of the #"DIRM"# chunk format.
    The version number is identified by the seven low bits of the first byte
    of the chunk.  Version \textbf{0} is obsolete and should never be used.  This
    section describes version \textbf{1}.  There are two major multipage DjVu
    formats supported: \emph{bundled} and \emph{indirect}.  The #"DIRM"# chunk
    indicates which format is used in the most significant bit of the first
    byte of the chunk.  The document is bundled when this bit is set.
    Otherwise the document is indirect.

    \textbf{Unencoded data} --- 

    The #"DIRM"# chunk is composed some unencoded
    data followed by \Ref{bzz} encoded data.  The unencoded data starts with
    the version byte and a 16 bit integer representing the number of component
    files.  All integers are encoded with the most significant byte first.
    \begin{verbatim}
          BYTE:             Flags/Version:  0x<bundled>0000011
          INT16:            Number of component files.
    \end{verbatim}
    When the document is a bundled document (i.e. the flag #bundled# is set),
    this header is followed by the offsets of each of the component files within
    the #"FORM:DJVM"#.  These offsets allow for random component file access.
    \begin{verbatim}
          INT32:            Offset of first component file.
          INT32:            Offset of second component file.
          ...
          INT32:            Offset of last component file.
    \end{verbatim}

    \textbf{BZZ encoded data} ---

     The rest of the chunk is entirely compressed
    with the BZZ general purpose compressor.  We describe now the data fed
    into (or retrieved from) the BZZ codec (cf. \Ref{BSByteStream}.)  First
    come the sizes and the flags associated with each component file.
    \begin{verbatim}
          INT24:             Size of the first component file.
          INT24:             Size of the second component file.
          ...
          INT24:             Size of the last component file.
          BYTE:              Flag byte for the first component file.
          BYTE:              Flag byte for the second component file.
          ...
          BYTE:              Flag byte for the last component file.
    \end{verbatim}
    The flag bytes have the following format:
    \begin{verbatim}
          0b<hasname><hastitle>000000     for a file included by other files.
          0b<hasname><hastitle>000001     for a file representing a page.
          0b<hasname><hastitle>000010     for a file containing thumbnails.
    \end{verbatim}
    Flag #hasname# is set when the name of the file is different from the file
    ID.  Flag #hastitle# is set when the title of the file is different from
    the file ID.  These flags are used to avoid encoding the same string three
    times.  Then come a sequence of zero terminated strings.  There are one to
    three such strings per component file.  The first string contains the ID
    of the component file.  The second string contains the name of the
    component file.  It is only present when the flag #hasname# is set. The third
    one contains the title of the component file. It is only present when the
    flag #hastitle# is set. The \Ref{bzz} encoding system makes sure that 
    all these strings will be encoded efficiently despite their possible
    redundancies.
    \begin{verbatim}
          ZSTR:     ID of the first component file.
          ZSTR:     Name of the first component file (only if #hasname# is set.)
          ZSTR:     Title of the first component file (only if #hastitle# is set.)
          ... 
          ZSTR:     ID of the last component file.
          ZSTR:     Name of the last component file (only if #hasname# is set.)
          ZSTR:     Title of the last component file (only if #hastitle# is set.)
    \end{verbatim}



------------------------------------------------------------
2.6 - INCLUDES ( "INCL" CHUNK )
------------------------------------------------------------

    The chunks simply contains the ascii encoded ID
    of the included component file.



------------------------------------------------------------
2.7 - THUMBNAILS
------------------------------------------------------------


    Multipage document file optionally can contain thumbnails for some or all
    pages.  These thumbnails are stored into special component files
    containing thumbnails for a number of consecutive pages.

    The thumbnail component file is composed of a single #"FORM:THUM"#
    containing one or more #"TH44"# chunk.  Each #"TH44"# chunk contains one
    IW44 encoded thumbnail image for one page (cf. \Ref{IW44Image.h}).


------------------------------------------------------------
2.8 - OUTLINES/BOOKMARKS
------------------------------------------------------------

[MAY 19th, 2005
 Multipage files (FORM:DJVM) can contain an 
 additional chunk "NAVM" located after the "DIRM" chunk.
 The NAVM chunk contains outlines and bookmarks.
 See the files libdjvu/DjVmNav.h and libdjvu.DjVmNav.cpp]
source-git / djvulibre

Source Code

Files