.\" Copyright (c) 2003-2007 Tim Kientzle
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd January 26, 2011
.Dt LIBARCHIVE_INTERNALS 3
.Os
.Sh NAME
.Nm libarchive_internals
.Nd description of libarchive internal interfaces
.Sh OVERVIEW
The
.Nm libarchive
library provides a flexible interface for reading and writing
streaming archive files such as tar and cpio.
Internally, it follows a modular layered design that should
make it easy to add new archive and compression formats.
.Sh GENERAL ARCHITECTURE
Externally, libarchive exposes most operations through an
opaque, object-style interface.
The
.Xr archive_entry 3
objects store information about a single filesystem object.
The rest of the library provides facilities to write
.Xr archive_entry 3
objects to archive files,
read them from archive files,
and write them to disk.
(There are plans to add a facility to read
.Xr archive_entry 3
objects from disk as well.)
.Pp
The read and write APIs each have four layers: a public API
layer, a format layer that understands the archive file format,
a compression layer, and an I/O layer.
The I/O layer is completely exposed to clients who can replace
it entirely with their own functions.
.Pp
In order to provide as much consistency as possible for clients,
some public functions are virtualized.
Eventually, it should be possible for clients to open
an archive or disk writer, and then use a single set of
code to select and write entries, regardless of the target.
.Sh READ ARCHITECTURE
From the outside, clients use the
.Xr archive_read 3
API to manipulate an
.Nm archive
object to read entries and bodies from an archive stream.
Internally, the
.Nm archive
object is cast to an
.Nm archive_read
object, which holds all read-specific data.
The API has four layers.
The lowest layer is the I/O layer.
This layer can be overridden by clients, but most clients use
the packaged I/O callbacks provided, for example, by
.Xr archive_read_open_memory 3
and
.Xr archive_read_open_fd 3 .
The compression layer calls the I/O layer to
read bytes and decompresses them for the format layer.
The format layer unpacks a stream of uncompressed bytes and
creates
.Nm archive_entry
objects from the incoming data.
The API layer tracks overall state
(for example, it prevents clients from reading data before reading a header)
and invokes the format and compression layer operations
through registered function pointers.
In particular, the API layer drives the format-detection process:
When opening the archive, it reads an initial block of data
and offers it to each registered compression handler.
The one with the highest bid is initialized with the first block.
Similarly, the format handlers are polled to see which handler
is the best for each archive.
(Prior to 2.4.0, the format bidders were invoked for each
entry, but this design hindered error recovery.)
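.Pp
As a rough illustration of the bidding step described above,
the following sketch offers one block of data to a table of
bidders and keeps the one with the highest bid.
The
.Vt struct bidder
type and the
.Fn choose_bidder ,
.Fn bid_gzip ,
and
.Fn bid_none
functions are hypothetical names invented for this example;
they are not libarchive interfaces, and the real internal
structures differ.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler record; the real libarchive structures differ. */
struct bidder {
	const char *name;
	int (*bid)(const unsigned char *buf, size_t len);
};

/* Simplified example bidder: checks only the two gzip magic bytes. */
static int
bid_gzip(const unsigned char *buf, size_t len)
{
	if (len < 2 || buf[0] != 0x1f || buf[1] != 0x8b)
		return (0);
	return (16);	/* Sixteen bits were verified. */
}

/* Assumed fallback bidder: accepts anything with a weak bid. */
static int
bid_none(const unsigned char *buf, size_t len)
{
	(void)buf;
	(void)len;
	return (1);
}

/* Offer the first block to every bidder and return the highest bidder. */
static const struct bidder *
choose_bidder(const struct bidder *b, size_t n,
    const unsigned char *buf, size_t len)
{
	const struct bidder *best = NULL;
	int best_bid = 0;
	size_t i;

	assert(b != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int bid = b[i].bid(buf, len);

		if (bid > best_bid) {
			best_bid = bid;
			best = &b[i];
		}
	}
	return (best);	/* NULL if every bid was zero. */
}
```
.Ed
.Pp
In this sketch a
.Dv NULL
result means that no bidder recognized the data.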
.Ss I/O Layer and Client Callbacks
The read API goes to some lengths to be nice to clients.
As a result, there are few restrictions on the behavior of
the client callbacks.
.Pp
The client read callback is expected to provide a block
of data on each call.
A zero-length return does indicate end of file, but otherwise
blocks may be as small as one byte or as large as the entire file.
In particular, blocks may be of different sizes.
.Pp
The client skip callback returns the number of bytes actually
skipped, which may be much smaller than the skip requested.
The only requirement is that the skip not be larger.
In particular, clients are allowed to return zero for any
skip that they don't want to handle.
The skip callback must never be invoked with a negative value.
.Pp
Keep in mind that not all clients are reading from disk:
clients reading from networks may provide different-sized
blocks on every request and cannot skip at all;
advanced clients may use
.Xr mmap 2
to read the entire file into memory at once and return the
entire file to libarchive as a single block;
other clients may begin asynchronous I/O operations for the
next block on each request.
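.Pp
The callback contract above can be sketched with a small
in-memory client.
This is only an illustration under simplifying assumptions:
.Vt struct mem_source ,
.Fn mem_read ,
and
.Fn mem_skip
are hypothetical names, and their signatures are deliberately
simpler than the callback types that libarchive declares.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical client state: an in-memory source served in uneven blocks. */
struct mem_source {
	const unsigned char *data;
	size_t size;
	size_t pos;
	size_t next_block;	/* Size of the next block; must be nonzero. */
};

/*
 * Read callback: point *buff at the next block and return its length.
 * A return of zero signals end of file.
 */
static long
mem_read(struct mem_source *s, const void **buff)
{
	size_t avail, n;

	assert(s != NULL && buff != NULL && s->next_block > 0);
	avail = s->size - s->pos;
	n = s->next_block < avail ? s->next_block : avail;
	*buff = s->data + s->pos;
	s->pos += n;
	s->next_block = s->next_block * 2 + 1;	/* Blocks vary in size. */
	return ((long)n);
}

/*
 * Skip callback: skip at most the requested number of bytes and
 * report how many were actually skipped.  This one declines skips
 * shorter than 4 bytes (an arbitrary threshold chosen for the
 * example) by returning zero.
 */
static long
mem_skip(struct mem_source *s, long request)
{
	size_t avail;

	assert(s != NULL && request >= 0);
	if (request < 4)
		return (0);
	avail = s->size - s->pos;
	if ((size_t)request > avail)
		request = (long)avail;
	s->pos += (size_t)request;
	return (request);
}
```
.Ed
.Pp
The skip function never reports more than was requested,
which is the one requirement stated above.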
.Ss Decompression Layer
The decompression layer not only handles decompression,
it also buffers data so that the format handlers see a
much nicer I/O model.
The decompression API is a two-stage peek/consume model.
A
.Fn read_ahead
request specifies a minimum read amount;
the decompression layer must provide a pointer to at least
that much data.
If more data is immediately available, it should return more:
the format layer handles bulk data reads by asking for a minimum
of one byte and then copying as much data as is available.
.Pp
A subsequent call to the
.Fn consume
function advances the read pointer.
Note that data returned from a
.Fn read_ahead
call is guaranteed to remain in place until
the next call to
.Fn read_ahead .
Intervening calls to
.Fn consume
should not cause the data to move.
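.Pp
A minimal sketch of the peek/consume model, assuming the
decompressed data already sits in one contiguous buffer.
The names
.Vt struct peek_buf ,
.Fn peek_read_ahead ,
and
.Fn peek_consume
are hypothetical and do not match the internal function
signatures; a real decompressor would also refill its buffer.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffered source; not libarchive's actual structure. */
struct peek_buf {
	const unsigned char *data;	/* Decompressed bytes. */
	size_t size;
	size_t pos;			/* Current read pointer. */
};

/*
 * Peek: return a pointer to at least min bytes without advancing,
 * or NULL if fewer than min bytes remain.  *avail reports how many
 * bytes are actually available, which may exceed min.
 */
static const void *
peek_read_ahead(struct peek_buf *b, size_t min, size_t *avail)
{
	assert(b != NULL && avail != NULL);
	*avail = b->size - b->pos;
	if (*avail < min)
		return (NULL);
	return (b->data + b->pos);
}

/* Consume: advance the read pointer; the bytes themselves stay put. */
static void
peek_consume(struct peek_buf *b, size_t n)
{
	assert(b != NULL && n <= b->size - b->pos);
	b->pos += n;
}
```
.Ed
.Pp
Because consuming only moves the read pointer, a pointer
obtained from an earlier peek stays valid until the next peek.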
.Pp
Skip requests must always be handled exactly.
Decompression handlers that cannot seek forward should
not register a skip handler;
the API layer fills in a generic skip handler that reads and discards data.
.Pp
A decompression handler has a specific lifecycle:
.Bl -tag -compact -width indent
.It Registration/Configuration
When the client invokes the public support function,
the decompression handler invokes the internal
.Fn __archive_read_register_compression
function to provide bid and initialization functions.
This function returns
.Cm NULL
on error or else a pointer to a
.Cm struct decompressor_t .
This structure contains a
.Va void * config
slot that can be used for storing any customization information.
.It Bid
The bid function is invoked with a pointer and size of a block of data.
The decompressor can access its config data
through the
.Va decompressor
element of the
.Cm archive_read
object.
The bid function is otherwise stateless.
In particular, it must not perform any I/O operations.
.Pp
The value returned by the bid function indicates its suitability
for handling this data stream.
A bid of zero will ensure that this decompressor is never invoked.
Return zero if magic number checks fail.
Otherwise, your initial implementation should return the number of bits
actually checked.
For example, if you verify two full bytes and three bits of another
byte, bid 19.
Note that the initial block may be very short;
be careful to only inspect the data you are given.
(The current decompressors require two bytes for correct bidding.)
.It Initialize
The winning bidder will have its init function called.
This function should initialize the remaining slots of the
.Va struct decompressor_t
object pointed to by the
.Va decompressor
element of the
.Va archive_read
object.
In particular, it should allocate any working data it needs
in the
.Va data
slot of that structure.
The init function is called with the block of data that
was used for tasting.
At this point, the decompressor is responsible for all I/O
requests to the client callbacks.
The decompressor is free to read more data as and when
necessary.
.It Satisfy I/O requests
The format handler will invoke the
.Va read_ahead ,
.Va consume ,
and
.Va skip
functions as needed.
.It Finish
The finish method is called only once when the archive is closed.
It should release anything stored in the
.Va data
and
.Va config
slots of the
.Va decompressor
object.
It should not invoke the client close callback.
.El
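.Pp
The bid arithmetic from the lifecycle above (two full bytes
plus three bits gives a bid of 19) can be sketched as follows.
The signature bytes and the
.Fn example_bid
function are made up for this example and do not describe any
real compression format or libarchive bidder.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical signature: bytes 0xAB 0xCD, then a byte whose top
 * three bits must be 101.  None of these values belong to a real
 * format.  The function is stateless and performs no I/O.
 */
static int
example_bid(const unsigned char *buf, size_t len)
{
	int bits = 0;

	assert(buf != NULL || len == 0);
	if (len < 3)
		return (0);	/* Too short; never read past len. */
	if (buf[0] != 0xAB)
		return (0);
	bits += 8;
	if (buf[1] != 0xCD)
		return (0);
	bits += 8;
	if ((buf[2] & 0xE0) != 0xA0)
		return (0);
	bits += 3;
	return (bits);		/* 8 + 8 + 3 = 19 */
}
```
.Ed
.Pp
Returning the number of verified bits lets a bidder that checks
a longer signature outbid one that checks a shorter signature.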
.Ss Format Layer
The read formats have a similar lifecycle to the decompression handlers:
.Bl -tag -compact -width indent
.It Registration
Allocate your private data and initialize your pointers.
.It Bid
Formats bid by invoking the
.Fn read_ahead
decompression method but not calling the
.Fn consume
method.
This allows each bidder to look ahead in the input stream.
Bidders should not look further ahead than necessary, as long
look aheads put pressure on the decompression layer to buffer
lots of data.
Most formats only require a few hundred bytes of look ahead;
look aheads of a few kilobytes are reasonable.
(The ISO9660 reader sometimes looks ahead by 48k, which
should be considered an upper limit.)
.It Read header
The header read is usually the most complex part of any format.
There are a few strategies worth mentioning:
For formats such as tar or cpio, reading and parsing the header is
straightforward since headers alternate with data.
For formats that store all header data at the beginning of the file,
the first header read request may have to read all headers into
memory and store that data, sorted by the location of the file
data.
Subsequent header read requests will skip forward to the
beginning of the file data and return the corresponding header.
.It Read Data
The read data interface supports sparse files; this requires that
each call return a block of data specifying the file offset and
size.
This may require you to carefully track the location so that you
can return accurate file offsets for each read.
Remember that the decompressor will return as much data as it has.
Generally, you will want to request one byte,
examine the return value to see how much data is available, and
possibly trim that to the amount you can use.
You should invoke
.Fn consume
for each block just before you return it.
.It Skip All Data
The skip data call should skip over all file data and trailing padding.
This is called automatically by the API layer just before each
header read.
It is also called in response to the client calling the public
.Fn data_skip
function.
.It Cleanup
On cleanup, the format should release all of its allocated memory.
.El
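.Pp
The read data step above can be sketched as follows, assuming
a flat in-memory stream in place of the decompression layer.
The
.Vt struct stream
and
.Vt struct entry_state
types and the
.Fn read_data_block
function are hypothetical; they only illustrate trimming the
available data to the current entry and tracking the file offset.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the decompression layer and format state. */
struct stream {
	const unsigned char *data;
	size_t size;
	size_t pos;
};

struct entry_state {
	size_t remaining;	/* Body bytes of this entry not yet returned. */
	long long offset;	/* File offset of the next block. */
};

/*
 * Return the next block of the current entry: 1 with *buff, *size and
 * *offset filled in, or 0 at the end of the entry (or of the input).
 */
static int
read_data_block(struct stream *s, struct entry_state *e,
    const void **buff, size_t *size, long long *offset)
{
	size_t avail;

	assert(s != NULL && e != NULL);
	assert(buff != NULL && size != NULL && offset != NULL);
	avail = s->size - s->pos;	/* See how much data is there. */
	if (e->remaining == 0 || avail == 0)
		return (0);
	if (avail > e->remaining)
		avail = e->remaining;	/* Trim to what this entry can use. */
	*buff = s->data + s->pos;
	*size = avail;
	*offset = e->offset;
	s->pos += avail;		/* Consume just before returning. */
	e->remaining -= avail;
	e->offset += (long long)avail;
	return (1);
}
```
.Ed
.Pp
Starting
.Va offset
at a nonzero value is how this sketch would represent a leading
hole in a sparse file.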
.Ss API Layer
XXX to do XXX
.Sh WRITE ARCHITECTURE
The write API has a similar set of four layers:
an API layer, a format layer, a compression layer, and an I/O layer.
The registration here is much simpler because only
one format and one compression can be registered at a time.
.Ss I/O Layer and Client Callbacks
XXX To be written XXX
.Ss Compression Layer
XXX To be written XXX
.Ss Format Layer
XXX To be written XXX
.Ss API Layer
XXX To be written XXX
.Sh WRITE_DISK ARCHITECTURE
The write_disk API is intended to look just like the write API
to clients.
Since it does not handle multiple formats or compression, it
is not layered internally.
.Sh GENERAL SERVICES
The
.Nm archive_read ,
.Nm archive_write ,
and
.Nm archive_write_disk
objects all contain an initial
.Nm archive
object which provides common support for a set of standard services.
(Recall that ANSI/ISO C90 guarantees that you can cast freely between
a pointer to a structure and a pointer to the first element of that
structure.)
The
.Nm archive
object has a magic value that indicates which API this object
is associated with,
slots for storing error information,
and function pointers for virtualized API functions.
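.Pp
The first-member cast and the magic value check can be sketched
like this.
All names and values here
.Pq Vt struct ex_archive , Vt struct ex_archive_read , Fn ex_as_read , Dv EX_READ_MAGIC , Dv EX_WRITE_MAGIC
are hypothetical and chosen for the example; the actual
structure layouts and magic numbers in libarchive differ.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical magic values and layouts; libarchive's real ones differ. */
#define EX_READ_MAGIC	0x52454144U
#define EX_WRITE_MAGIC	0x57524954U

struct ex_archive {
	unsigned int magic;	/* Identifies the API that owns the object. */
	int error_number;	/* Slot for error information. */
};

struct ex_archive_read {
	struct ex_archive archive;	/* Must be the first member. */
	size_t bytes_read;		/* Read-specific data. */
};

/* Convert the public handle to the read object, checking the magic value. */
static struct ex_archive_read *
ex_as_read(struct ex_archive *a)
{
	assert(a != NULL);
	if (a->magic != EX_READ_MAGIC) {
		a->error_number = -1;	/* Wrong kind of object. */
		return (NULL);
	}
	return ((struct ex_archive_read *)a);
}
```
.Ed
.Pp
The cast is valid only because the common object is the first
member of the containing structure.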
.Sh MISCELLANEOUS NOTES
Connecting existing archiving libraries into libarchive is generally
quite difficult.
In particular, many existing libraries strongly assume that you
are reading from a file; they seek forwards and backwards as necessary
to locate various pieces of information.
In contrast, libarchive never seeks backwards in its input, which
sometimes requires very different approaches.
.Pp
For example, libarchive's ISO9660 support operates very differently
from most ISO9660 readers.
The libarchive support utilizes a work-queue design that
keeps a list of known entries sorted by their location in the input.
Whenever libarchive's ISO9660 implementation is asked for the next
header, it checks this list to find the next item on the disk.
Directories are parsed when they are encountered and new
items are added to the list.
This design relies heavily on the ISO9660 image being optimized so that
directories always occur earlier on the disk than the files they
describe.
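.Pp
A work queue of this kind can be sketched as a list kept in
ascending order of input position.
The
.Vt struct pending
type and the
.Fn queue_add
and
.Fn queue_next
functions are hypothetical; this is not the data structure
used by the ISO9660 reader, only an illustration of the idea.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pending-entry record, ordered by position in the input. */
struct pending {
	long long offset;	/* Where the entry's data starts. */
	struct pending *next;
};

/* Insert an entry, keeping the list sorted by ascending offset. */
static void
queue_add(struct pending **head, struct pending *p)
{
	assert(head != NULL && p != NULL);
	while (*head != NULL && (*head)->offset <= p->offset)
		head = &(*head)->next;
	p->next = *head;
	*head = p;
}

/* Remove and return the entry that occurs earliest in the input. */
static struct pending *
queue_next(struct pending **head)
{
	struct pending *p;

	assert(head != NULL);
	p = *head;
	if (p != NULL)
		*head = p->next;
	return (p);
}
```
.Ed
.Pp
Taking entries in offset order means the reader only ever has
to move forward through its input.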
.Pp
Depending on the specific format, such approaches may not be possible.
The ZIP format specification, for example, allows archivers to store
key information only at the end of the file.
In theory, it is possible to create ZIP archives that cannot
be read without seeking.
Fortunately, such archives are very rare, and libarchive can read
most ZIP archives, though it cannot always extract as much information
as a dedicated ZIP program.
.Sh SEE ALSO
.Xr archive_entry 3 ,
.Xr archive_read 3 ,
.Xr archive_write 3 ,
.Xr archive_write_disk 3 ,
.Xr libarchive 3
.Sh HISTORY
The
.Nm libarchive
library first appeared in
.Fx 5.3 .
.Sh AUTHORS
.An -nosplit
The
.Nm libarchive
library was written by
.An Tim Kientzle Aq kientzle@acm.org .