<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>
<body bgcolor="#fffacd" text="#000000">
Libxml2 XmlTextReader Interface tutorial

This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.

This tutorial presents the key points of this API, along with working
examples using both C and the Python bindings.

Table of contents:

Introduction: why a new API
Walking a simple tree
Extracting information for the current node
Extracting information for the attributes
Validating a document
Entities substitution
Relax-NG Validation
Mixing the reader and tree or XPath operations

Introduction: why a new API

The main libxml2 API is tree based: the parsing operation loads the
complete document in memory and exposes it as a tree of nodes all available
at the same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a
<a href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early
<a href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex and not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively hard.

The
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
API from C# provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the DOM Core specification. It provides a far more standard,
easy to use and powerful API than the existing SAX. Moreover, integrating
extension features based on the tree seems relatively easy.

In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface for handling large documents than the existing SAX
version.
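
To make the contrast between the two streaming models concrete, here is a minimal sketch that uses only the Python standard library rather than libxml2: xml.sax stands in for the callback (push) model and xml.dom.pulldom for a cursor-like (pull) model. The sample document and the names NameCollector, names_push and names_pull are made up for this illustration.

```python
import xml.sax
from xml.dom import pulldom

DOC = "<doc><a/><b>some text</b></doc>"


class NameCollector(xml.sax.ContentHandler):
    """Push model: the parser drives and calls back into this handler."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def names_push(text):
    handler = NameCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def names_pull(text, limit=None):
    """Pull model: our own loop drives and may stop whenever it wants."""
    names = []
    for event, node in pulldom.parseString(text):
        if event == pulldom.START_ELEMENT:
            names.append(node.tagName)
            if limit is not None and len(names) >= limit:
                break  # stopping early needs no cooperation from a handler
    return names


print(names_push(DOC))
print(names_pull(DOC, limit=2))
```

In the pull variant the calling code owns the loop, which is the property the reader API described in this tutorial relies on.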

Walking a simple tree

Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:

prepare a reader context operating on some input
run a loop iterating over all nodes in the document
free up the reader context

Here is a basic C sample doing this:

#include <stdio.h>
#include <libxml/xmlreader.h>

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* Returns 0 on success, -1 if the file cannot be opened or parsed. */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    return 0;
}

A few things to notice:

the include file needed: libxml/xmlreader.h
the creation of the reader using a filename
the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop
that a negative return means a parsing error
how xmlFreeTextReader() should be used to free up the resources used by
    the reader.

Here is similar code in Python for exactly the same processing:

import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except Exception:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))

The only things worth adding are that the
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader</a>
is abstracted as a class like in C# with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.
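
The return-code contract of the loop can be isolated in a small helper. The sketch below is an illustration, not part of the bindings: stream() is a made-up name, and ScriptedReader is a hypothetical stand-in that only replays a list of Read() results so the helper can be tried without a document. It assumes only that the reader object has a Read() method returning 1, 0 or a negative value, as described above.

```python
def stream(reader, process):
    """Call process(reader) for every node; return True on a clean end of document."""
    ret = reader.Read()
    while ret == 1:            # 1: the reader is positioned on a new node
        process(reader)
        ret = reader.Read()
    return ret == 0            # 0: end of document, negative: parsing error


class ScriptedReader:
    """Hypothetical stand-in that replays a fixed list of Read() results."""

    def __init__(self, results):
        self._results = iter(results)

    def Read(self):
        return next(self._results)


seen = []
ok = stream(ScriptedReader([1, 1, 0]), seen.append)
print(ok, len(seen))
```

With a real reader one would pass the object returned by libxml2.newTextReaderFilename() and the processNode() function instead of the stand-ins.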

Extracting information for the current node

So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query the properties of that node.
Each Property is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is xmlTextReaderProperty; if the return
type is an xmlChar * string then it must be deallocated with xmlFree() to
avoid leaks. For the Python interface, there is a Property method on the
reader class that can be called on the instance. The list of the properties
is based on the
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader</a> class set of properties and methods:

NodeType: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.
Name: the
    <a href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (Prefix:)LocalName.
LocalName: the
    <a href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.
Prefix: a shorthand reference to the
    <a href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.
NamespaceUri: the URI defining the
    <a href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.
BaseUri: the base URI of the node. See the
    <a href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.
Depth: the depth of the node in the tree, starting at 0 for the
    root node.
HasAttributes: whether the node has attributes.
HasValue: whether the node can have a text value.
Value: provides the text value of the node if present.
IsDefault: whether an Attribute node was generated from the
    default value defined in the DTD or schema (not supported yet).
XmlLang: the
    <a href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.
IsEmptyElement: checks whether the current node is empty. This is a
    bit bizarre in the sense that <a/> will be considered
    empty while <a></a> will not.
AttributeCount: provides the number of attributes of the
    current node.
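
When reading traces it can help to turn the numeric NodeType codes into labels. The table below is a small convenience sketch built only from the codes listed above; NODE_TYPES and describe_node_type are made-up names and are not part of the libxml2 bindings.

```python
# Codes taken from the NodeType description above.
NODE_TYPES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def describe_node_type(code):
    """Return a readable label, or a placeholder for codes not listed here."""
    return NODE_TYPES.get(code, "unknown (%d)" % code)


for code in (1, 3, 15):
    print(code, describe_node_type(code))
```

With a real reader the argument would typically be reader.NodeType().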

Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:

def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))

and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.

For the minimal document "<doc/>" we get:

0 1 doc 1

Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<doc></doc>" instead leads to:

0 1 doc 0
0 15 doc 0

The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:

<doc><a/><b>some text</b>
<c/></doc>

We modify the processNode() function to also report the node Value:

def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))

The result of the test is:

0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None

There are a few things to note:

the increase of the depth value (first column) as children nodes are
    explored
the text node child of the b element, of type 3, and its content
the text node containing the line return between elements b and c
that elements have the Value None (or NULL in C)
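
The depth bookkeeping can be reproduced by hand, which may help when interpreting the first column. The sketch below uses xml.dom.pulldom from the Python standard library as a stand-in for the reader, so it is an analogy rather than libxml2 behaviour: pulldom reports an end event for <a/> as well and has no equivalent of IsEmptyElement. The walk() helper is a made-up name.

```python
from xml.dom import pulldom

DOC = "<doc><a/><b>some text</b>\n<c/></doc>"


def walk(text):
    """Return (depth, kind, name-or-data) rows in document order."""
    rows = []
    depth = 0
    for event, node in pulldom.parseString(text):
        if event == pulldom.START_ELEMENT:
            rows.append((depth, "start", node.tagName))
            depth += 1                      # children live one level deeper
        elif event == pulldom.END_ELEMENT:
            depth -= 1                      # back to the level of the element
            rows.append((depth, "end", node.tagName))
        elif event == pulldom.CHARACTERS:
            rows.append((depth, "text", node.data))
    return rows


for row in walk(DOC):
    print(row)
```

As in the trace above, the text inside b sits one level below b, while the line return between b and c sits at the same level as both elements.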

The equivalent routine for processNode() as used by
xmllint --stream --debug is the following and can be found in
the xmllint.c module in the source distribution:

static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    /* both strings are allocated copies and must be released with xmlFree() */
    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    /* nodes without a text value (e.g. elements) return NULL */
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}

Extracting information for the attributes

The previous examples don't indicate how attributes are processed. The
simple test "<doc a="b"/>" provides the following
result:

0 1 doc 1 None

This shows that attribute nodes are not traversed by default. The
HasAttributes property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of
operations are possible:

to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node
to directly query the element node for the attribute value

In both cases the attribute can be designated either by its position in the
list of attributes (MoveToAttributeNo or GetAttributeNo) or
by its name (and namespace):

GetAttributeNo(no): provides the value of the attribute with
    the specified index no relative to the containing element.
GetAttribute(name): provides the value of the attribute with
    the specified qualified name.
GetAttributeNs(localName, namespaceURI): provides the value of the
    attribute with the specified local name and namespace URI.
MoveToAttributeNo(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.
MoveToAttribute(name): moves the position of the current
    instance to the attribute with the specified qualified name.
MoveToAttributeNs(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.
MoveToFirstAttribute: moves the position of the current
    instance to the first attribute associated with the current node.
MoveToNextAttribute: moves the position of the current
    instance to the next attribute associated with the current node.
MoveToElement: moves the position of the current instance to
    the node that contains the current Attribute node.
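
One way to combine the Move operations is to visit every attribute and then put the cursor back on the element. The helper below is a sketch under that assumption: collect_attributes is a made-up name, it relies only on the MoveToNextAttribute(), MoveToElement(), Name() and Value() methods listed above, and FakeElementReader is a hypothetical stand-in so that the logic can be exercised without libxml2.

```python
def collect_attributes(reader):
    """Return [(name, value), ...] for the element the reader is positioned on."""
    attrs = []
    while reader.MoveToNextAttribute() == 1:   # stop on 0 (no more) or -1 (error)
        attrs.append((reader.Name(), reader.Value()))
    if attrs:
        reader.MoveToElement()                 # put the cursor back on the element
    return attrs


class FakeElementReader:
    """Hypothetical stand-in exposing only the four methods used above."""

    def __init__(self, name, attrs):
        self._name, self._attrs, self._pos = name, attrs, -1

    def MoveToNextAttribute(self):
        if self._pos + 1 < len(self._attrs):
            self._pos += 1
            return 1
        return 0

    def MoveToElement(self):
        moved = 1 if self._pos >= 0 else 0
        self._pos = -1
        return moved

    def Name(self):
        return self._name if self._pos < 0 else self._attrs[self._pos][0]

    def Value(self):
        return None if self._pos < 0 else self._attrs[self._pos][1]


reader = FakeElementReader("doc", [("a", "b")])
print(collect_attributes(reader), reader.Name())
```

Returning to the element keeps later queries such as Name() or IsEmptyElement() about the element itself rather than its last attribute.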

After modifying the processNode() function to show attributes:

def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))

The output for the same input document reflects the attribute:

0 1 doc 1 None
-- 1 2 (a) [b]

There are a couple of things to note on the attribute processing:

Their depth is the one of the carrying element plus one.
Namespace declarations are seen as attributes, as in DOM.
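
The DOM behaviour referred to in the last point can be observed with xml.dom.minidom from the Python standard library. This is only an illustration of the DOM convention, not of the reader itself; the namespace URI urn:example is made up.

```python
from xml.dom import minidom

# One ordinary attribute and one namespace declaration on the same element.
doc = minidom.parseString('<doc xmlns:x="urn:example" a="b"/>')
names = sorted(doc.documentElement.attributes.keys())
print(names)
```

Both the plain attribute and the xmlns:x declaration show up in the attribute map, which is the convention the reader follows when it walks attributes.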

Validating a document

The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. A few options are
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:

XML_PARSER_LOADDTD: force loading the DTD (without validating)
XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)
XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)
XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.
More settings might be added; those were the ones available at the 2.5.0
    release.

The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:

import libxml2

def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))

This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:

ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);

and a return value of 0 indicates success.
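
Since the return value signals whether a property was accepted, it can be worth checking it when several properties are switched on at once. The helper below is a sketch, not part of the bindings: enable_parser_props is a made-up name, it assumes the Python SetParserProp() method hands back the same 0-on-success code as the C call, and RecordingReader is a hypothetical stand-in used only to exercise it.

```python
def enable_parser_props(reader, props):
    """Switch on each property; return the ones the reader refused."""
    refused = []
    for prop in props:
        if reader.SetParserProp(prop, 1) != 0:   # 0 means success
            refused.append(prop)
    return refused


class RecordingReader:
    """Hypothetical stand-in: accepts known properties, rejects the rest."""

    def __init__(self, known):
        self._known = set(known)
        self.enabled = {}

    def SetParserProp(self, prop, value):
        if prop not in self._known:
            return -1
        self.enabled[prop] = value
        return 0

    def GetParserProp(self, prop):
        return self.enabled.get(prop, 0)


fake = RecordingReader(["LOADDTD", "VALIDATE"])
print(enable_parser_props(fake, ["LOADDTD", "VALIDATE", "BOGUS"]))
```

With the real bindings the list would hold constants such as libxml2.PARSER_VALIDATE, and the call has to happen before the first Read().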

Entities substitution

By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:

reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES, 1)

Relax-NG Validation

Introduced in version 2.5.7

Libxml2 can now validate the document being read with the xmlReader against
Relax-NG schemas. While the Relax NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top level element content is not expressible
as a regexp, only chunks of the document need to be expanded while
validating.

The steps to do so are:

create a reader working on a document as usual
before any call to Read(), associate it with a Relax NG schema, either the
    preparsed schema or the URL of the schema to use
errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, as for DTDs.

Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:

rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")

See reader6.py in the sources or documentation for a complete
example.
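
The two checks in the example answer different questions: the Read() result says whether the document could be parsed, while IsValid() says whether it conforms to the schema. A small sketch that keeps the two outcomes apart is shown here; validation_status is a made-up helper name and StubReader is a hypothetical stand-in for a reader that already has a schema attached.

```python
def validation_status(reader):
    """Drain the reader, then report 'parse error', 'invalid' or 'valid'."""
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        return "parse error"       # validity is not meaningful in this case
    return "valid" if reader.IsValid() == 1 else "invalid"


class StubReader:
    """Hypothetical stand-in replaying Read() results and a validity flag."""

    def __init__(self, results, valid):
        self._results = iter(results)
        self._valid = valid

    def Read(self):
        return next(self._results)

    def IsValid(self):
        return self._valid


print(validation_status(StubReader([1, 1, 0], 1)))
print(validation_status(StubReader([1, 0], 0)))
print(validation_status(StubReader([1, -1], 0)))
```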

Mixing the reader and tree or XPath operations

Introduced in version 2.5.7

While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry
for the "Dragon" compiler book from the XML 1.0 recommendation:

import libxml2

f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res = ""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break

Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed, the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely
and proceeding to the successor. Next() returns 0 if the document end is
reached.
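
The same expand-on-demand idea exists in the Python standard library, which makes it possible to try the pattern without libxml2. The sketch below uses xml.dom.pulldom and its expandNode() call as an analogue of Expand(); it is not the libxml2 API, and the sample data and the find_bibl name are made up for this illustration.

```python
from xml.dom import pulldom

# Made-up sample data, not the XML 1.0 recommendation used above.
SAMPLE = (
    "<refs>"
    "<bibl id='Knuth'><title>TAOCP</title></bibl>"
    "<bibl id='Aho'><title>Compilers</title></bibl>"
    "</refs>"
)


def find_bibl(text, wanted_id):
    """Stream through the document and build a tree only for the wanted entry."""
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "bibl":
            if node.getAttribute("id") == wanted_id:
                events.expandNode(node)    # pull in the subtree of this element only
                return node.toxml()
    return None


print(find_bibl(SAMPLE, "Aho"))
```

Entries that do not match are never expanded, so only the subtree of interest is turned into a tree, which mirrors the intent of the reader5.py example.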

Daniel Veillard
</body>
</html>