=head1 SAX for Perl

=head2 What is SAX?

SAX (Simple API for XML) is a common parser interface for XML
parsers. It allows application writers to write applications that use
XML parsers, but are independent of which parser is actually used.

This document describes a version of SAX used by Perl modules. The
original version of SAX, for Java, is described at
<http://www.megginson.com/SAX/>.

There are two basic interfaces in the Perl version of SAX, the parser
interface and the handler interface. The parser interface creates new
parser instances, initiates parsing, and provides additional
information to handlers on request. The handler interface is used to
receive parse events from the parser.

=head2 Deviations from the Java version

=over 4

=item *

Takes parameters to `C<new()>' instead of using `set*' calls.

=item *

Allows a default Handler parameter to be used for all handlers.

=item *

No base classes are implemented. Instead, parsers dynamically check
the handlers for what methods they support.

=item *

The AttributeList, InputSource, and SAXException classes have been
replaced by anonymous hashes.

=item *

Handlers are passed a hash containing properties as an argument in
place of positional arguments.

=item *

`C<parse()>' returns the value returned by calling the
`C<end_document()>' handler.

=item *

Method names have been converted to lower-case with underscores.
Parameters are all mixed case with initial upper-case.

=back
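
The deviations listed above can be illustrated with a minimal sketch.
The package C<My::Handler> and the C<dispatch()> helper are
hypothetical names invented for this example; C<dispatch()> only stands
in for what a real parser might do internally when it checks a handler
for supported methods.

```perl
use strict;
use warnings;

# Hypothetical handler: implements only the events it cares about.
package My::Handler;

sub new { my ($class) = @_; return bless { names => [] }, $class }

# Each event arrives as a single hash reference, not positional arguments.
sub start_element {
    my ($self, $event) = @_;
    push @{ $self->{names} }, $event->{Name};
    return;
}

# The value returned here is what a parser's parse() would hand back.
sub end_document {
    my ($self, $event) = @_;
    return scalar @{ $self->{names} };
}

package main;

# Hypothetical stand-in for a parser's event dispatch: call the method
# only if the handler supports it, otherwise ignore the event.
sub dispatch {
    my ($handler, $method, $event) = @_;
    return unless $handler->can($method);
    return $handler->$method($event);
}

my $handler = My::Handler->new;
dispatch($handler, 'start_document', {});
dispatch($handler, 'start_element', { Name => 'doc', Attributes => {} });
dispatch($handler, 'characters', { Data => 'ignored' });
dispatch($handler, 'end_element', { Name => 'doc' });
my $result = dispatch($handler, 'end_document', {});
print "elements seen: $result\n";
```

Because the handler defines neither C<start_document()> nor
C<characters()>, those events are skipped without any base class
being involved.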

=head1 Parser Interface

SAX parsers are reusable but not re-entrant: the application may reuse
a parser object (possibly with a different input source) once the
first parse has completed successfully, but it may not invoke the
`C<parse()>' method recursively within a parse.

Parser objects contain the following options. A new or different
handler option may be provided in the middle of a parse, and the SAX
parser must begin using the new handler immediately. The `C<Locale>'
option must not be changed in the middle of a parse. If an
application does not provide a handler for a particular set of events,
those events will be silently ignored unless otherwise stated. If an
`C<EntityResolver>' is not provided, the parser will resolve system
identifiers and open connections to entities itself.

    Handler          default handler to receive events
    DocumentHandler  handler to receive document events
    DTDHandler       handler to receive DTD events
    ErrorHandler     handler to receive error events
    EntityResolver   handler to resolve entities
    Locale           locale to provide localisation for errors

If no handlers are provided then all events will be silently ignored,
except for `C<fatal_error()>' which will cause a `C<die()>' to be
called after calling `C<end_document()>'.

All handler methods are called with a single hash argument containing
the parameters for that method. `C<new()>' methods can be called with
a hash or a list of key-value pairs containing the parameters.

All SAX parsers must implement this basic interface: it allows
applications to provide handlers for different types of events and to
initiate a parse from a URI, a byte stream, or a character stream.
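
The option handling described above might be sketched as follows.
C<My::Parser>, C<effective_options()>, and C<handler_for()> are
hypothetical names, plain strings stand in for handler objects, and
the rule that a specific handler option wins over the default
C<Handler> is an assumption of this sketch rather than something this
document states.

```perl
use strict;
use warnings;

# Hypothetical parser skeleton; only option handling is shown.
package My::Parser;

# Accept either a single hash reference or a flat list of key-value pairs.
sub new {
    my $class = shift;
    my %options = (@_ == 1 && ref($_[0]) eq 'HASH') ? %{ $_[0] } : @_;
    return bless { %options }, $class;
}

# Options given to parse() would override those given to new().
sub effective_options {
    my $self = shift;
    my %call = (@_ == 1 && ref($_[0]) eq 'HASH') ? %{ $_[0] } : @_;
    return { %$self, %call };
}

# Pick the handler for one interface, falling back to the default
# Handler (assumed precedence).
sub handler_for {
    my ($class, $options, $interface) = @_;
    return $options->{$interface} // $options->{Handler};
}

package main;

# Strings are placeholders for real handler objects.
my $parser  = My::Parser->new(Handler => 'default', ErrorHandler => 'errors');
my $options = $parser->effective_options({ ErrorHandler => 'strict' });

my $doc_handler = My::Parser->handler_for($options, 'DocumentHandler');
my $err_handler = My::Parser->handler_for($options, 'ErrorHandler');
print "DocumentHandler: $doc_handler\n";
print "ErrorHandler: $err_handler\n";
```

Merging the two option sets into a fresh hash leaves the options given
to C<new()> untouched, so they still apply to later parses.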

=over 4

=item new( I<OPTIONS> )

Creates a Parser that will be used to parse XML sources. Any
parameters passed to `C<new()>' will be used for subsequent parses.
I<OPTIONS> may be a list of key, value pairs or a hash.

=item parse( I<OPTIONS> )

Parse an XML document.

The application can use this method to instruct the SAX parser to
begin parsing an XML document from any valid input source (a character
stream, a byte stream, or a URI). I<OPTIONS> may be a list of key,
value pairs or a hash. I<OPTIONS> passed to `C<parse()>' override
options given when the parser instance was created with `C<new()>'.

Applications may not invoke this method while a parse is in progress
(they should create a new Parser instead for each additional XML
document). Once a parse is complete, an application may reuse the same
Parser object, possibly with a different input source.

`C<parse()>' returns the result of calling the handler method
`C<end_document()>'.

A `C<Source>' parameter must have been provided to either the
`C<parse()>' or `C<new()>' methods. The `C<Source>' parameter is a
hash containing the following parameters:

=over 4

=item PublicId

The public identifier for this input source.

The public identifier is always optional: if the application writer
includes one, it will be provided as part of the location information.

=item SystemId

The system identifier for this input source.

The system identifier is optional if there is a byte stream, a
character stream, or a string, but it is still useful to provide one,
since the application can use it to resolve relative URIs and can
include it in error messages and warnings (the parser will attempt to
open a connection to the URI only if there is no byte stream,
character stream, or string specified).

If the application knows the character encoding of the object pointed
to by the system identifier, it can provide the encoding using the
`C<Encoding>' parameter.

If the system ID is a URL, it must be fully resolved.

=item String

A scalar value containing XML text to be parsed.

The SAX parser will ignore this if there is also a byte or character
stream, but it will use a string in preference to opening a URI
connection.

=item ByteStream

The byte stream (file handle) for this input source.

The SAX parser will ignore this if there is also a character stream
specified, but it will use a byte stream in preference to opening a
URI connection itself or using `C<String>'.

If the application knows the character encoding of the byte stream, it
should set it with the `C<Encoding>' parameter.

=item CharacterStream

FOR FUTURE USE ONLY -- Perl does not currently support any character
streams; use only the `C<ByteStream>', `C<SystemId>', or `C<String>'
parameters.

The character stream (file handle) for this input source.

If there is a character stream specified, the SAX parser will ignore
any byte stream and will not attempt to open a URI connection to the
system identifier.

=item Encoding

The character encoding, if known.

The encoding must be a string acceptable for an XML encoding
declaration (see section 4.3.3 of the XML 1.0 recommendation).

This parameter has no effect when the application provides a character
stream.

=back

=back
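
The precedence among the `C<Source>' parameters described above can be
summarised in a short sketch. C<choose_input()> is a hypothetical
helper, not part of any parser, and dying when no input is given is an
assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which part of a Source hash a parser would
# read from, following the precedence described in the item list.
sub choose_input {
    my ($source) = @_;
    for my $key (qw(CharacterStream ByteStream String SystemId)) {
        return $key if defined $source->{$key};
    }
    # Assumed behaviour: a Source without any input is an error.
    die "Source needs a CharacterStream, ByteStream, String, or SystemId\n";
}

my $xml = '<doc>hello</doc>';

# An in-memory file handle stands in for a real byte stream.
open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";

my $from_stream = choose_input({
    ByteStream => $fh,
    String     => $xml,
    SystemId   => 'http://example.com/doc.xml',
    Encoding   => 'UTF-8',
});
my $from_string = choose_input({ String => $xml, SystemId => 'doc.xml' });
my $from_uri    = choose_input({ SystemId => 'http://example.com/doc.xml' });

print "$from_stream $from_string $from_uri\n";
close $fh;
```

Supplying `C<SystemId>' alongside a stream or string remains useful,
since it can still appear in location information and error messages.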

=head2 Locator

Interface for associating a SAX event with a document location.

If a SAX parser provides location information to the SAX application,
it does so by implementing the following methods and then calling the
`C<set_document_locator()>' handler method. The handler can use the
object to obtain the location of any other document handler event in
the XML source document.

Note that the results returned by the object will be valid only during
the scope of each document handler method: the application will
receive unpredictable results if it attempts to use the locator at any
other time.

SAX parsers are not required to supply a locator, but they are very
strongly encouraged to do so.

=over 4

=item location()

Return the location information for the current event.

Returns a hash containing the following parameters:

    ColumnNumber  The column number, or undef if none is available.
    LineNumber    The line number, or undef if none is available.
    PublicId      A string containing the public identifier, or undef if
                  none is available.
    SystemId      A string containing the system identifier, or undef if
                  none is available.

=back
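
A handler might use a locator along the following lines. Both packages
are hypothetical, the locator here holds a fixed position instead of
tracking a real parse, and returning a hash reference from
C<location()> is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical locator: a real parser would update the position itself.
package My::Locator;

my @fields = qw(ColumnNumber LineNumber PublicId SystemId);

sub new { my ($class, %pos) = @_; return bless { %pos }, $class }

# Copy the four documented fields; missing ones come back as undef.
sub location {
    my ($self) = @_;
    my %loc;
    @loc{@fields} = @{$self}{@fields};
    return \%loc;
}

# Hypothetical handler that stores the locator and consults it per event.
package My::Handler;

sub new { my ($class) = @_; return bless { seen => [] }, $class }

sub set_document_locator {
    my ($self, $event) = @_;
    $self->{locator} = $event->{Locator};
    return;
}

sub start_element {
    my ($self, $event) = @_;
    # Read the location now; it is only valid during this call.
    my $loc  = $self->{locator} ? $self->{locator}->location : {};
    my $line = defined $loc->{LineNumber} ? $loc->{LineNumber} : '?';
    push @{ $self->{seen} }, "$event->{Name} at line $line";
    return;
}

package main;

my $locator = My::Locator->new(LineNumber => 3, ColumnNumber => 7,
                               SystemId => 'doc.xml');
my $handler = My::Handler->new;
$handler->set_document_locator({ Locator => $locator });
$handler->start_element({ Name => 'para', Attributes => {} });
print "$_\n" for @{ $handler->{seen} };
```

The handler copies the line number out of the location during the
event instead of keeping the location for later, in line with the
validity rule above.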

=head1 Handler Interfaces

SAX handler methods are grouped into four interfaces: the document
handler for receiving normal document events, the DTD handler for
receiving notation and unparsed entity events, the error handler for
receiving errors and warnings, and the entity resolver for redirecting
external system identifiers.

The application may choose to implement all of the interfaces in one
package or each in a separate package, as long as the objects provided
as parameters to the parser provide the matching interface.

Parsers may implement additional methods in each of these categories;
refer to the parser documentation for further information.

All handlers are called with a single hash argument containing the
parameters for that handler.

Application writers who do not want to implement the entire interface
can leave those methods undefined. Events whose handler methods are
undefined will be ignored unless otherwise stated.

=head2 DocumentHandler

This is the main interface that most SAX applications implement: if
the application needs to be informed of basic parsing events, it
implements this interface and provides an instance to the SAX parser
using the `C<DocumentHandler>' parameter. The parser uses the instance
to report basic document-related events like the start and end of
elements and character data.

The order of events in this interface is very important, and mirrors
the order of information in the document itself. For example, all of
an element's content (character data, processing instructions, and/or
subelements) will appear, in order, between the `C<start_element()>'
event and the corresponding `C<end_element()>' event.

The application can find the location of any event using the Locator
interface supplied by the Parser through the
`C<set_document_locator()>' method.

=over 4

=item set_document_locator( { Locator => $locator } )

Receive an object for locating the origin of SAX document events.

SAX parsers are strongly encouraged (though not absolutely required)
to supply a locator: if a parser does so, it must supply the locator
to the application by invoking this method before invoking any of the
other methods in the DocumentHandler interface.

The locator allows the application to determine the end position of
any document-related event, even if the parser is not reporting an
error. Typically, the application will use this information for
reporting its own errors (such as character content that does not
match an application's business rules). The information returned by
the locator is probably not sufficient for use with a search engine.

Note that the locator will return correct information only during the
invocation of the events in this interface. The application should not
attempt to use it at any other time.

Parameters:

    Locator  An object that can return the location of any SAX document
             event.

=item start_document( { } )

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other
methods in this interface or in DTDHandler (except for
`C<set_document_locator()>').

=item end_document( { } )

Receive notification of the end of a document. No parameters are
passed for the end of a document.

The SAX parser will invoke this method only once, and it will be the
last method invoked during the parse. The parser shall not invoke
this method until it has either abandoned parsing (because of an
unrecoverable error) or reached the end of input.

The value returned by calling `C<end_document()>' will be the value
returned by `C<parse()>'.

=item start_element( { Name => $name, Attributes => $attributes } )

Receive notification of the beginning of an element.

The Parser will invoke this method at the beginning of every element
in the XML document; there will be a corresponding `C<end_element()>'
event for every `C<start_element()>' event (even when the element is
empty). All of the element's content will be reported, in order,
before the corresponding `C<end_element()>' event.

If the element name has a namespace prefix, the prefix will still be
attached. Note that the attribute list provided will contain only
attributes with explicit values (specified or defaulted): #IMPLIED
attributes will be omitted.

Parameters:

    Name        The element type name.
    Attributes  The attributes attached to the element, if any.

=item end_element( { Name => $name } )

Receive notification of the end of an element.

The SAX parser will invoke this method at the end of every element in
the XML document; there will be a corresponding `C<start_element()>'
event for every `C<end_element()>' event (even when the element is
empty).

If the element name has a namespace prefix, the prefix will still be
attached to the name.

Parameters:

    Name  The element type name.

=item characters( { Data => $characters } )

Receive notification of character data.

The Parser will call this method to report each chunk of character
data. SAX parsers may return all contiguous character data in a
single chunk, or they may split it into several chunks; however, all
of the characters in any single event must come from the same external
entity, so that the Locator provides useful information.

Note that some parsers will report whitespace using the
`C<ignorable_whitespace()>' method rather than this one (validating
parsers must do so).

Parameters:

    Data  The characters from the XML document.

=item ignorable_whitespace( { Data => $whitespace } )

Receive notification of ignorable whitespace in element content.

Validating Parsers must use this method to report each chunk of
ignorable whitespace (see the W3C XML 1.0 recommendation, section
2.10): non-validating parsers may also use this method if they are
capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or
they may split it into several chunks; however, all of the characters
in any single event must come from the same external entity, so that
the Locator provides useful information.

The application must not attempt to read from the array outside of the
specified range.

Parameters:

    Data  The characters from the XML document.

=item processing_instruction( { Target => $target, Data => $data } )

Receive notification of a processing instruction.

The Parser will invoke this method once for each processing
instruction found: note that processing instructions may occur before
or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section
2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

Parameters:

    Target  The processing instruction target.
    Data    The processing instruction data, if any.

=back
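
As a minimal sketch of the DocumentHandler events above, the following
handler collects the text of C<title> elements. The package name
C<Title::Collector> and the element names are invented for this
example, nested C<title> elements are not handled, and the event list
is simulated by hand rather than produced by a real parser.

```perl
use strict;
use warnings;

# Hypothetical handler: collects the text of every <title> element.
package Title::Collector;

sub new {
    my ($class) = @_;
    return bless { titles => [], in_title => 0, buffer => '' }, $class;
}

sub start_element {
    my ($self, $event) = @_;
    if ($event->{Name} eq 'title') {
        $self->{in_title} = 1;
        $self->{buffer}   = '';
    }
    return;
}

# Character data may arrive in several chunks, so append rather than assign.
sub characters {
    my ($self, $event) = @_;
    $self->{buffer} .= $event->{Data} if $self->{in_title};
    return;
}

sub end_element {
    my ($self, $event) = @_;
    if ($event->{Name} eq 'title') {
        push @{ $self->{titles} }, $self->{buffer};
        $self->{in_title} = 0;
    }
    return;
}

# The return value would become the return value of the parser's parse().
sub end_document {
    my ($self, $event) = @_;
    return $self->{titles};
}

package main;

# Simulated event stream standing in for a real parser.
my $handler = Title::Collector->new;
my @events = (
    [ start_document => {} ],
    [ start_element  => { Name => 'book',  Attributes => {} } ],
    [ start_element  => { Name => 'title', Attributes => {} } ],
    [ characters     => { Data => 'Perl ' } ],
    [ characters     => { Data => 'and XML' } ],
    [ end_element    => { Name => 'title' } ],
    [ end_element    => { Name => 'book' } ],
    [ end_document   => {} ],
);
my $titles;
for my $event (@events) {
    my ($method, $args) = @$event;
    next unless $handler->can($method);
    $titles = $handler->$method($args);
}
print "$_\n" for @$titles;
```

Appending in C<characters()> matters because a parser is free to split
one run of text into several events.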

=head2 ErrorHandler

Basic interface for SAX error handlers.

If a SAX application needs to implement customized error handling, it
must implement this interface and then provide an instance to the SAX
parser using the parser's `C<ErrorHandler>' parameter. The parser
will then report all errors and warnings through this interface.

The parser shall use this interface instead of throwing an exception:
it is up to the application whether to throw an exception for
different types of errors and warnings. Note, however, that there is
no requirement that the parser continue to provide useful information
after a call to `C<fatal_error()>' (in other words, a SAX driver class
could catch an exception and report a fatal error).

All error handlers receive the following I<PARAMS>. The
`C<PublicId>', `C<SystemId>', `C<LineNumber>', and `C<ColumnNumber>'
are provided only if the parser has that information available.

    Message       The error or warning message, or undef to use the
                  message from the EvalError parameter.
    PublicId      The public identifier of the entity that generated the
                  error or warning.
    SystemId      The system identifier of the entity that generated the
                  error or warning.
    LineNumber    The line number of the end of the text that caused the
                  error or warning.
    ColumnNumber  The column number of the end of the text that caused
                  the error or warning.
    EvalError     The error value returned from a lower level interface.

Application writers who do not want to implement the entire interface
can leave those methods undefined. If they are not defined,
`C<warning()>' and `C<error()>' events will be ignored, and a
`C<fatal_error()>' event will terminate processing (going straight to
`C<end_document()>').

=over 4

=item warning( { I<PARAMS> } )

Receive notification of a warning.

SAX parsers will use this method to report conditions that are not
errors or fatal errors as defined by the XML 1.0 recommendation. The
default behaviour is to take no action.

The SAX parser must continue to provide normal parsing events after
invoking this method: it should still be possible for the application
to process the document through to the end.

=item error( { I<PARAMS> } )

Receive notification of a recoverable error.

This corresponds to the definition of "error" in section 1.2 of the
W3C XML 1.0 Recommendation. For example, a validating parser would use
this callback to report the violation of a validity constraint. The
default behaviour is to take no action.

The SAX parser must continue to provide normal parsing events after
invoking this method: it should still be possible for the application
to process the document through to the end. If the application cannot
do so, then the parser should report a fatal error even if the XML 1.0
recommendation does not require it to do so.

=item fatal_error( { I<PARAMS> } )

Receive notification of a non-recoverable error.

This corresponds to the definition of "fatal error" in section 1.2 of
the W3C XML 1.0 Recommendation. For example, a parser would use this
callback to report the violation of a well-formedness constraint.

The application must assume that the document is unusable after the
parser has invoked this method, and should continue (if at all) only
for the sake of collecting additional error messages: in fact, SAX
parsers are free to stop reporting any other events once this method
has been invoked.

=back
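
An error handler built on the I<PARAMS> above might look like the
following sketch. C<My::ErrorHandler> and C<format_event()> are
hypothetical, the message layout is arbitrary, and dying inside
C<fatal_error()> is one possible application policy, not a requirement
of the interface.

```perl
use strict;
use warnings;

# Hypothetical error handler: logs warnings and errors, dies on fatal ones.
package My::ErrorHandler;

sub new { my ($class) = @_; return bless { log => [] }, $class }

# Build one message line from the parameters that are actually present.
sub format_event {
    my ($self, $level, $event) = @_;
    # Fall back to EvalError when Message is undef.
    my $message = defined $event->{Message} ? $event->{Message}
                                            : $event->{EvalError};
    $message = 'unknown problem' unless defined $message;
    my $where = defined $event->{SystemId} ? $event->{SystemId} : '(unknown)';
    $where .= ":$event->{LineNumber}"   if defined $event->{LineNumber};
    $where .= ":$event->{ColumnNumber}" if defined $event->{ColumnNumber};
    return "$level: $where: $message";
}

sub warning {
    my ($self, $event) = @_;
    push @{ $self->{log} }, $self->format_event('warning', $event);
    return;
}

sub error {
    my ($self, $event) = @_;
    push @{ $self->{log} }, $self->format_event('error', $event);
    return;
}

# Raising an exception here is this application's choice.
sub fatal_error {
    my ($self, $event) = @_;
    die $self->format_event('fatal', $event) . "\n";
}

package main;

my $errors = My::ErrorHandler->new;
$errors->warning({ Message => 'unused notation', SystemId => 'doc.xml',
                   LineNumber => 4, ColumnNumber => 12 });
$errors->error({ Message => undef, EvalError => 'invalid content',
                 LineNumber => 9 });

my $ok = eval {
    $errors->fatal_error({ Message => 'mismatched tag',
                           SystemId => 'doc.xml', LineNumber => 20 });
    1;
};
my $fatal = $ok ? '' : $@;
print "$_\n" for @{ $errors->{log} };
print $fatal;
```

Since the location parameters are optional, the formatter checks each
one with C<defined> before using it.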

=head2 DTDHandler

Receive notification of basic DTD-related events.

If a SAX application needs information about notations and unparsed
entities, then the application implements this interface and provides
an instance to the SAX parser using the parser's `C<DTDHandler>'
parameter. The parser uses the instance to report notation and
unparsed entity declarations to the application.

The SAX parser may report these events in any order, regardless of the
order in which the notations and unparsed entities were declared;
however, all DTD events must be reported after the document handler's
`C<start_document()>' event, and before the first `C<start_element()>'
event.

It is up to the application to store the information for future use
(perhaps in a hash table or object tree). If the application
encounters attributes of type "NOTATION", "ENTITY", or "ENTITIES", it
can use the information that it obtained through this interface to
find the entity and/or notation corresponding to the attribute
value.

Application writers who do not want to implement the entire interface
can leave those methods undefined. Events whose handler methods are
undefined will be ignored.

=over 4

=item notation_decl( { I<PARAMS> } )

Receive notification of a notation declaration event.

It is up to the application to record the notation for later
reference, if necessary.

If a system identifier is present, and it is a URL, the SAX parser
must resolve it fully before passing it to the application.

I<PARAMS>:

    Name      The notation name.
    PublicId  The notation's public identifier, or undef if none was given.
    SystemId  The notation's system identifier, or undef if none was given.

=item unparsed_entity_decl( { I<PARAMS> } )

Receive notification of an unparsed entity declaration event.

Note that the notation name corresponds to a notation reported by the
`C<notation_decl()>' event. It is up to the application to record the
entity for later reference, if necessary.

If the system identifier is a URL, the parser must resolve it fully
before passing it to the application.

I<PARAMS>:

    Name          The unparsed entity's name.
    PublicId      The entity's public identifier, or undef if none was given.
    SystemId      The entity's system identifier (it must always have one).
    NotationName  The name of the associated notation.

=back
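
Storing the DTD events "in a hash table", as suggested above, could be
sketched like this. C<My::DTDHandler> and C<lookup_entity()> are
hypothetical names, and the entity and notation values are made up for
the example.

```perl
use strict;
use warnings;

# Hypothetical DTD handler: stores declarations in two lookup hashes.
package My::DTDHandler;

sub new {
    my ($class) = @_;
    return bless { notations => {}, entities => {} }, $class;
}

sub notation_decl {
    my ($self, $event) = @_;
    $self->{notations}{ $event->{Name} } = {
        PublicId => $event->{PublicId},
        SystemId => $event->{SystemId},
    };
    return;
}

sub unparsed_entity_decl {
    my ($self, $event) = @_;
    $self->{entities}{ $event->{Name} } = {
        PublicId     => $event->{PublicId},
        SystemId     => $event->{SystemId},
        NotationName => $event->{NotationName},
    };
    return;
}

# Resolve the value of an ENTITY attribute to its entity and notation.
# Declarations may arrive in any order, so the notation is looked up
# only when it is needed.
sub lookup_entity {
    my ($self, $name) = @_;
    my $entity = $self->{entities}{$name} or return;
    return ($entity, $self->{notations}{ $entity->{NotationName} });
}

package main;

my $dtd = My::DTDHandler->new;
# The entity is reported before its notation to show order independence.
$dtd->unparsed_entity_decl({ Name => 'logo', PublicId => undef,
                             SystemId => 'http://example.com/logo.gif',
                             NotationName => 'gif' });
$dtd->notation_decl({ Name => 'gif', PublicId => undef,
                      SystemId => 'http://example.com/viewers/gif' });

my ($entity, $notation) = $dtd->lookup_entity('logo');
print "$entity->{SystemId} is handled by $notation->{SystemId}\n";
```

A document handler could call C<lookup_entity()> from
C<start_element()> when it meets an attribute it knows to be of type
"ENTITY".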
|
Packit |
d27c7e |
=head2 EntityResolver

Basic interface for resolving entities.

If a SAX application needs to implement customized handling for
external entities, it must implement this interface and register an
instance with the SAX parser using the parser's `C<EntityResolver>'
parameter.

The parser will then allow the application to intercept any external
entities (including the external DTD subset and external parameter
entities, if any) before including them.

Many SAX applications will not need to implement this interface, but
it is especially useful for applications that build XML documents
from databases or other specialized input sources, or for applications
that use URI types other than URLs.

The application can also use this interface to redirect system
identifiers to local URIs or to look up replacements in a catalog
(possibly by using the public identifier).

=over 4

=item resolve_entity( { PublicId => $public_id, SystemId => $system_id } )

Allow the application to resolve external entities.

The parser calls this method before opening any external entity
except the top-level document entity. This includes the external DTD
subset, external entities referenced within the DTD, and external
entities referenced within the document element. The application may
request that the parser resolve the entity itself, that it use an
alternative URI, or that it use an entirely different input source.

Application writers can use this method to redirect external system
identifiers to secure and/or local URIs, to look up public identifiers
in a catalog, or to read an entity from a database or other input
source (including, for example, a dialog box).

If the system identifier is a URL, the SAX parser must resolve it
fully before reporting it to the application.

Parameters:

    PublicId         The public identifier of the external entity being
                     referenced, or undef if none was supplied.
    SystemId         The system identifier of the external entity being
                     referenced.

`C<resolve_entity()>' returns undef to request that the parser open a
regular URI connection to the system identifier, or returns a hash
containing the same parameters as the `C<Source>' parameter to
Parser's `C<parse()>' method, summarized here:

    PublicId         The public identifier of the external entity being
                     referenced, or undef if none was supplied.
    SystemId         The system identifier of the external entity being
                     referenced.
    String           String containing XML text.
    ByteStream       An open file handle.
    CharacterStream  An open file handle.
    Encoding         The character encoding, if known.

See Parser's `C<parse()>' method for complete details on how these
parameters interact.

=back

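
The three possible outcomes of `C<resolve_entity()>' can be sketched
as follows. This is an illustration under stated assumptions, not a
reference implementation: the class name `C<MyResolver>', its
`C<catalog>' and `C<texts>' constructor arguments, the `C<db:>'
scheme, and all sample identifiers are hypothetical. Only the method
name, the parameter keys, and the keys of the returned hash come from
the description above.

```perl
package MyResolver;   # hypothetical class name, not part of Perl SAX
use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    return bless {
        catalog => $args{catalog} || {},   # public id => local system id
        texts   => $args{texts}   || {},   # system id => XML text
    }, $class;
}

sub resolve_entity {
    my ($self, $params) = @_;
    my $public_id = $params->{PublicId};   # may be undef
    my $system_id = $params->{SystemId};

    # 1. Known public identifier: redirect to an alternative URI.
    if (defined $public_id and exists $self->{catalog}{$public_id}) {
        return {
            PublicId => $public_id,
            SystemId => $self->{catalog}{$public_id},
        };
    }

    # 2. Entity text held by the application: supply it as a String.
    if (defined $system_id and exists $self->{texts}{$system_id}) {
        return {
            SystemId => $system_id,
            String   => $self->{texts}{$system_id},
        };
    }

    # 3. Otherwise return undef (in scalar context) so that the parser
    #    opens the system identifier itself.
    return;
}

package main;

my $resolver = MyResolver->new(
    catalog => {
        '-//EXAMPLE//DTD Sample 1.0//EN' => 'file:///usr/share/xml/sample.dtd',
    },
    texts => {
        'db:chapter1' => '<para>Text loaded from a database.</para>',
    },
);

# Simulate the calls a parser would make.
my $local = $resolver->resolve_entity({
    PublicId => '-//EXAMPLE//DTD Sample 1.0//EN',
    SystemId => 'http://example.com/sample.dtd',
});
print "redirected to $local->{SystemId}\n";

my $from_db = $resolver->resolve_entity({
    PublicId => undef,
    SystemId => 'db:chapter1',
});
print "text supplied directly: $from_db->{String}\n";

my $default = $resolver->resolve_entity({
    PublicId => undef,
    SystemId => 'http://example.com/other.ent',
});
print "parser resolves the entity itself\n" unless defined $default;
```

An instance of such a class would be given to the parser through its
`C<EntityResolver>' parameter.
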
=head1 Contributors

SAX <http://www.megginson.com/SAX/> was developed collaboratively by
the members of the XML-DEV mailing list. Please see the ``SAX History
and Contributors'' page for the people who did the real work behind
SAX. Much of the content of this document was copied from the SAX 1.0
Java Implementation documentation.

The SAX for Python specification was helpful in creating this
specification.
<http://www.stud.ifi.uio.no/~larsga/download/python/xml/saxlib.html>

Thanks to the following people who contributed to Perl SAX.

    Eduard (Enno) Derksen
    Ken MacLeod
    Eric Prud'hommeaux
    Larry Wall