Tree - source-git/libxml2 - CentOS Git server

source-git / libxml2

Blame doc/xmlreader.html


Packit	423ecb
Packit	423ecb	`"http://www.w3.org/TR/html4/loose.dtd">`
Packit	423ecb	`<html>`
Packit	423ecb	`<head>`
Packit	423ecb	`<meta http-equiv="Content-Type" content="text/html">`
Packit	423ecb	`<style type="text/css"></style>`
Packit	423ecb
Packit	423ecb	`TD {font-family: Verdana,Arial,Helvetica}`
Packit	423ecb	`BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}`
Packit	423ecb	`H1 {font-family: Verdana,Arial,Helvetica}`
Packit	423ecb	`H2 {font-family: Verdana,Arial,Helvetica}`
Packit	423ecb	`H3 {font-family: Verdana,Arial,Helvetica}`
Packit	423ecb	`A:link, A:visited, A:active { text-decoration: underline }`
Packit	423ecb	`</style>`
Packit	423ecb	`-->`
Packit	423ecb	`<title>Libxml2 XmlTextReader Interface tutorial</title>`
Packit	423ecb	`</head>`
Packit	423ecb
Packit	423ecb	`<body bgcolor="#fffacd" text="#000000">`
Packit	423ecb	`Libxml2 XmlTextReader Interface tutorial`
Packit	423ecb
Packit	423ecb
Packit	423ecb
Packit	423ecb	`This document describes the use of the XmlTextReader streaming API added`
Packit	423ecb	`to libxml2 in version 2.5.0. This API is closely modeled after the`
Packit	423ecb	`XmlTextReader (http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html)`
Packit	423ecb	`and`
Packit	423ecb	`XmlReader (http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html)`
Packit	423ecb	`classes of the C# language.`
Packit	423ecb
Packit	423ecb	`This tutorial will present the key points of this API, and working`
Packit	423ecb	`examples using both C and the Python bindings:`
Packit	423ecb
Packit	423ecb	`Table of contents:`
Packit	423ecb
Packit	423ecb	`Introduction: why a new API`
Packit	423ecb	`Walking a simple tree`
Packit	423ecb	`Extracting information for the current node`
Packit	423ecb	`Extracting information for the attributes`
Packit	423ecb	`Validating a document`
Packit	423ecb	`Entities substitution`
Packit	423ecb	`Relax-NG Validation`
Packit	423ecb	`Mixing the reader and tree or XPath operations`
Packit	423ecb
Packit	423ecb
Packit	423ecb
Packit	423ecb
Packit	423ecb	`Introduction: why a new API`
Packit	423ecb
Packit	423ecb	`Libxml2's main API is`
Packit	423ecb	`tree based: the parsing operation results in a document loaded`
Packit	423ecb	`completely in memory and exposed as a tree of nodes all available at the`
Packit	423ecb	`same time. This is very simple and quite powerful, but has the major`
Packit	423ecb	`limitation that the size of the document that can be handled is limited by`
Packit	423ecb	`the size of the memory available. Libxml2 also provides a`
Packit	423ecb	`SAX (http://www.saxproject.org/) based API, but that version was`
Packit	423ecb	`designed upon one of the early`
Packit	423ecb	`expat (http://www.jclark.com/xml/expat.html) versions of SAX, and SAX is`
Packit	423ecb	`also not formally defined for C. SAX basically works by registering callbacks`
Packit	423ecb	`which are called directly by the parser as it progresses through the document`
Packit	423ecb	`stream. The problem is that this programming model is relatively complex,`
Packit	423ecb	`not well standardized, cannot provide validation directly, and makes entity,`
Packit	423ecb	`namespace and base processing relatively hard.`
Packit	423ecb
Packit	423ecb	`The`
Packit	423ecb	`XmlTextReader (http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html)`
Packit	423ecb	`API from C# provides a far simpler programming model. The API acts as a`
Packit	423ecb	`cursor going forward on the document stream and stopping at each node on the`
Packit	423ecb	`way. The user's code keeps control of the progress and simply calls a`
Packit	423ecb	`Read() function repeatedly to progress to each node in sequence in document`
Packit	423ecb	`order. There is direct support for namespaces, xml:base and entity handling, and`
Packit	423ecb	`adding DTD validation on top of it was relatively simple. This API is really`
Packit	423ecb	`close to the DOM Core`
Packit	423ecb	`specification. This provides a far more standard, easy to use and powerful`
Packit	423ecb	`API than the existing SAX. Moreover, integrating extension features based on`
Packit	423ecb	`the tree seems relatively easy.`
Packit	423ecb
Packit	423ecb	`In a nutshell the XmlTextReader API provides a simpler, more standard and`
Packit	423ecb	`more extensible interface to handle large documents than the existing SAX`
Packit	423ecb	`version.`
Packit	423ecb
Packit	423ecb	`Walking a simple tree`
Packit	423ecb
Packit	423ecb	`Basically the XmlTextReader API is a forward-only tree-walking interface.`
Packit	423ecb	`The basic steps are:`
Packit	423ecb
Packit	423ecb	`prepare a reader context operating on some input`
Packit	423ecb	`run a loop iterating over all nodes in the document`
Packit	423ecb	`free up the reader context`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`Here is a basic C sample doing this:`
Packit	423ecb	`#include <stdio.h>`
Packit	423ecb	`#include <libxml/xmlreader.h>`
Packit	423ecb
Packit	423ecb	`void processNode(xmlTextReaderPtr reader) {`
Packit	423ecb	`    /* handling of a node in the tree */`
Packit	423ecb	`}`
Packit	423ecb
Packit	423ecb	`int streamFile(char *filename) {`
Packit	423ecb	`    xmlTextReaderPtr reader;`
Packit	423ecb	`    int ret;`
Packit	423ecb
Packit	423ecb	`    reader = xmlNewTextReaderFilename(filename);`
Packit	423ecb	`    if (reader != NULL) {`
Packit	423ecb	`        ret = xmlTextReaderRead(reader);`
Packit	423ecb	`        while (ret == 1) {`
Packit	423ecb	`            processNode(reader);`
Packit	423ecb	`            ret = xmlTextReaderRead(reader);`
Packit	423ecb	`        }`
Packit	423ecb	`        xmlFreeTextReader(reader);`
Packit	423ecb	`        if (ret != 0) {`
Packit	423ecb	`            printf("%s : failed to parse\n", filename);`
Packit	423ecb	`        }`
Packit	423ecb	`    } else {`
Packit	423ecb	`        printf("Unable to open %s\n", filename);`
Packit	423ecb	`        ret = -1;`
Packit	423ecb	`    }`
Packit	423ecb	`    return ret; /* 0 on success, negative on failure */`
Packit	423ecb	`}`
Packit	423ecb
Packit	423ecb	`A few things to notice:`
Packit	423ecb
Packit	423ecb	`the include file needed : libxml/xmlreader.h`
Packit	423ecb	`the creation of the reader using a filename`
Packit	423ecb	`the repeated call to xmlTextReaderRead() and how any return value`
Packit	423ecb	`different from 1 should stop the loop`
Packit	423ecb	`that a negative return means a parsing error`
Packit	423ecb	`how xmlFreeTextReader() should be used to free up the resources used by`
Packit	423ecb	`the reader.`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`Here is similar code in Python for exactly the same processing:`
Packit	423ecb	`import libxml2`
Packit	423ecb
Packit	423ecb	`def processNode(reader):`
Packit	423ecb	`    pass`
Packit	423ecb
Packit	423ecb	`def streamFile(filename):`
Packit	423ecb	`    try:`
Packit	423ecb	`        reader = libxml2.newTextReaderFilename(filename)`
Packit	423ecb	`    except:`
Packit	423ecb	`        print("unable to open %s" % (filename))`
Packit	423ecb	`        return`
Packit	423ecb
Packit	423ecb	`    ret = reader.Read()`
Packit	423ecb	`    while ret == 1:`
Packit	423ecb	`        processNode(reader)`
Packit	423ecb	`        ret = reader.Read()`
Packit	423ecb
Packit	423ecb	`    if ret != 0:`
Packit	423ecb	`        print("%s : failed to parse" % (filename))`
Packit	423ecb
Packit	423ecb	`The only things worth adding are that the`
Packit	423ecb	`xmlTextReader (http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html)`
Packit	423ecb	`is abstracted as a class like in C# with the same method names (but the`
Packit	423ecb	`properties are currently accessed with methods) and that one doesn't need to`
Packit	423ecb	`free the reader at the end of the processing. It will get garbage collected`
Packit	423ecb	`once all references have disappeared.`
Packit	423ecb
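The read loop and its return-code convention (1 for a node, 0 for end of document, negative for an error) can be wrapped once and reused. The sketch below is only an illustration: `iter_nodes` is a hypothetical helper, not part of the libxml2 bindings, and it assumes nothing about the reader beyond the `Read()` method described above.

```python
def iter_nodes(reader):
    """Yield the reader once per node, in document order.

    Hypothetical helper: 'reader' is any object whose Read() method
    returns 1 for a node, 0 at end of document and a negative value
    on a parsing error, as described in the text.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        # any value other than 0 after the loop means the parse failed
        raise ValueError("failed to parse (Read() returned %d)" % ret)
```

With a reader obtained from `libxml2.newTextReaderFilename(filename)`, the loop body then becomes `for node in iter_nodes(reader): processNode(node)`.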
Packit	423ecb	`Extracting information for the current node`
Packit	423ecb
Packit	423ecb	`So far the example code did not indicate how information was extracted`
Packit	423ecb	`from the reader. It was abstracted as a call to the processNode() routine,`
Packit	423ecb	`with the reader as the argument. At each invocation, the parser is stopped on`
Packit	423ecb	`a given node and the reader can be used to query that node's properties. Each`
Packit	423ecb	`Property is available at the C level as a function taking a single`
Packit	423ecb	`xmlTextReaderPtr argument whose name is`
Packit	423ecb	`xmlTextReaderProperty; if the return type is an`
Packit	423ecb	`xmlChar * string then it must be deallocated with`
Packit	423ecb	`xmlFree() to avoid leaks. For the Python interface, there is a`
Packit	423ecb	`Property method to the reader class that can be called on the`
Packit	423ecb	`instance. The list of the properties is based on the`
Packit	423ecb	`C# XmlTextReader class (http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html)`
Packit	423ecb	`set of properties and methods:`
Packit	423ecb
Packit	423ecb	`NodeType: The node type, 1 for start element, 15 for end of`
Packit	423ecb	`element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for`
Packit	423ecb	`entity references, 6 for entity declarations, 7 for PIs, 8 for comments,`
Packit	423ecb	`9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document`
Packit	423ecb	`fragment and 12 for notation nodes.`
Packit	423ecb	`Name: the qualified name`
Packit	423ecb	`(http://www.w3.org/TR/REC-xml-names/#ns-qualnames)`
Packit	423ecb	`of the node, equal to (Prefix:)LocalName.`
Packit	423ecb	`LocalName: the local name`
Packit	423ecb	`(http://www.w3.org/TR/REC-xml-names/#NT-LocalPart) of`
Packit	423ecb	`the node.`
Packit	423ecb	`Prefix: a shorthand reference to the namespace`
Packit	423ecb	`(http://www.w3.org/TR/REC-xml-names/) associated with`
Packit	423ecb	`the node.`
Packit	423ecb	`NamespaceUri: the URI defining the namespace`
Packit	423ecb	`(http://www.w3.org/TR/REC-xml-names/) associated with`
Packit	423ecb	`the node.`
Packit	423ecb	`BaseUri: the base URI of the node. See the`
Packit	423ecb	`XML Base W3C specification (http://www.w3.org/TR/xmlbase/).`
Packit	423ecb	`Depth: the depth of the node in the tree, starts at 0 for the`
Packit	423ecb	`root node.`
Packit	423ecb	`HasAttributes: whether the node has attributes.`
Packit	423ecb	`HasValue: whether the node can have a text value.`
Packit	423ecb	`Value: provides the text value of the node if present.`
Packit	423ecb	`IsDefault: whether an Attribute node was generated from the`
Packit	423ecb	`default value defined in the DTD or schema (not supported`
Packit	423ecb	`yet).`
Packit	423ecb	`XmlLang: the xml:lang scope`
Packit	423ecb	`(http://www.w3.org/TR/REC-xml#sec-lang-tag)`
Packit	423ecb	`within which the node resides.`
Packit	423ecb	`IsEmptyElement: check if the current node is empty; this is a`
Packit	423ecb	`bit bizarre in the sense that <a/> will be considered`
Packit	423ecb	`empty while <a></a> will not.`
Packit	423ecb	`AttributeCount: provides the number of attributes of the`
Packit	423ecb	`current node.`
Packit	423ecb
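The numeric NodeType codes are hard to read in debug output, so a small lookup table can help. The sketch below is built only from the values given in the list above; `NODE_TYPE_NAMES` and `node_type_name` are hypothetical names, not part of the libxml2 bindings, and codes not mentioned in the list are reported as unknown rather than guessed.

```python
# Hypothetical lookup table, limited to the codes listed in this tutorial.
NODE_TYPE_NAMES = {
    1: "start element",
    2: "attribute",
    3: "text",
    4: "CData section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "end of element",
}

def node_type_name(code):
    """Return a readable label for a NodeType() value."""
    return NODE_TYPE_NAMES.get(code, "unknown (%d)" % code)
```

A call such as `node_type_name(reader.NodeType())` can then replace the bare number in the print statements used in this section.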
Packit	423ecb
Packit	423ecb	`Let's look first at a small example to put this in practice by redefining`
Packit	423ecb	`the processNode() function in the Python example:`
Packit	423ecb	`def processNode(reader):`
Packit	423ecb	`    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),`
Packit	423ecb	`                           reader.Name(), reader.IsEmptyElement()))`
Packit	423ecb
Packit	423ecb	`and look at the result of calling streamFile("tst.xml") for various`
Packit	423ecb	`content of the XML test file.`
Packit	423ecb
Packit	423ecb	`For the minimal document "<doc/>" we get:`
Packit	423ecb	`0 1 doc 1`
Packit	423ecb
Packit	423ecb	`Only one node is found, its depth is 0, type 1 indicates an element start,`
Packit	423ecb	`of name "doc" and it is empty. Trying now with`
Packit	423ecb	`"<doc></doc>" instead leads to:`
Packit	423ecb	`0 1 doc 0`
Packit	423ecb	`0 15 doc 0`
Packit	423ecb
Packit	423ecb	`The document root node is not flagged as empty anymore and both a start`
Packit	423ecb	`and an end of element are detected. The following document shows how`
Packit	423ecb	`character data are reported:`
Packit	423ecb	`<doc><a/><b>some text</b>`
Packit	423ecb	`<c/></doc>`
Packit	423ecb
Packit	423ecb	`We modify the processNode() function to also report the node Value:`
Packit	423ecb	`def processNode(reader):`
Packit	423ecb	`    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),`
Packit	423ecb	`                              reader.Name(), reader.IsEmptyElement(),`
Packit	423ecb	`                              reader.Value()))`
Packit	423ecb
Packit	423ecb	`The result of the test is:`
Packit	423ecb	`0 1 doc 0 None`
Packit	423ecb	`1 1 a 1 None`
Packit	423ecb	`1 1 b 0 None`
Packit	423ecb	`2 3 #text 0 some text`
Packit	423ecb	`1 15 b 0 None`
Packit	423ecb	`1 3 #text 0`
Packit	423ecb
Packit	423ecb	`1 1 c 1 None`
Packit	423ecb	`0 15 doc 0 None`
Packit	423ecb
Packit	423ecb	`There are a few things to note:`
Packit	423ecb
Packit	423ecb	`the increase of the depth value (first row) as children nodes are`
Packit	423ecb	`explored`
Packit	423ecb	`the text node child of the b element, of type 3 and its content`
Packit	423ecb	`the text node containing the line return between elements b and c`
Packit	423ecb	`that elements have the Value None (or NULL in C)`
Packit	423ecb
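The Depth value makes it easy to render the stream as an indented outline, which shows the nesting noted above more directly than a column of numbers. The following is a sketch only: `outline_line` is a hypothetical helper, and it relies solely on the Depth(), NodeType(), Name(), IsEmptyElement() and Value() methods already used in this section.

```python
def outline_line(reader):
    """Return one line describing the current node, indented by its depth.

    Hypothetical helper; 'reader' is expected to offer the Depth(),
    NodeType(), Name(), IsEmptyElement() and Value() methods shown above.
    """
    indent = "  " * reader.Depth()
    kind = reader.NodeType()
    if kind == 1:  # start of element
        if reader.IsEmptyElement():
            return "%s<%s/>" % (indent, reader.Name())
        return "%s<%s>" % (indent, reader.Name())
    if kind == 15:  # end of element
        return "%s</%s>" % (indent, reader.Name())
    # other node types: show the name and the value, quoted with repr()
    # so that whitespace-only text nodes remain visible
    return "%s%s %r" % (indent, reader.Name(), reader.Value())
```

Calling `print(outline_line(reader))` from processNode() would print one such line per node.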
Packit	423ecb
Packit	423ecb	`The equivalent routine for processNode() as used by`
Packit	423ecb	`xmllint --stream --debug is the following and can be found in`
Packit	423ecb	`the xmllint.c module in the source distribution:`
Packit	423ecb	`static void processNode(xmlTextReaderPtr reader) {`
Packit	423ecb	`    xmlChar *name, *value;`
Packit	423ecb
Packit	423ecb	`    name = xmlTextReaderName(reader);`
Packit	423ecb	`    if (name == NULL)`
Packit	423ecb	`        name = xmlStrdup(BAD_CAST "--");`
Packit	423ecb	`    value = xmlTextReaderValue(reader);`
Packit	423ecb
Packit	423ecb	`    printf("%d %d %s %d",`
Packit	423ecb	`           xmlTextReaderDepth(reader),`
Packit	423ecb	`           xmlTextReaderNodeType(reader),`
Packit	423ecb	`           name,`
Packit	423ecb	`           xmlTextReaderIsEmptyElement(reader));`
Packit	423ecb	`    xmlFree(name);`
Packit	423ecb	`    if (value == NULL)`
Packit	423ecb	`        printf("\n");`
Packit	423ecb	`    else {`
Packit	423ecb	`        printf(" %s\n", value);`
Packit	423ecb	`        xmlFree(value);`
Packit	423ecb	`    }`
Packit	423ecb	`}`
Packit	423ecb
Packit	423ecb	`Extracting information for the attributes`
Packit	423ecb
Packit	423ecb	`The previous examples don't indicate how attributes are processed. The`
Packit	423ecb	`simple test "<doc a="b"/>" provides the following`
Packit	423ecb	`result:`
Packit	423ecb	`0 1 doc 1 None`
Packit	423ecb
Packit	423ecb	`This shows that attribute nodes are not traversed by default. The`
Packit	423ecb	`HasAttributes property allows detecting their presence. To check`
Packit	423ecb	`their content the API has special instructions. Basically two kinds of operations`
Packit	423ecb	`are possible:`
Packit	423ecb
Packit	423ecb	`to move the reader to the attribute nodes of the current element, in`
Packit	423ecb	`which case the cursor is positioned on the attribute node`
Packit	423ecb	`to directly query the element node for the attribute value`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`In both cases the attribute can be designated either by its position in the`
Packit	423ecb	`list of attributes (MoveToAttributeNo or GetAttributeNo) or`
Packit	423ecb	`by its name (and namespace):`
Packit	423ecb
Packit	423ecb	`GetAttributeNo(no): provides the value of the attribute with`
Packit	423ecb	`the specified index no relative to the containing element.`
Packit	423ecb	`GetAttribute(name): provides the value of the attribute with`
Packit	423ecb	`the specified qualified name.`
Packit	423ecb	`GetAttributeNs(localName, namespaceURI): provides the value of the`
Packit	423ecb	`attribute with the specified local name and namespace URI.`
Packit	423ecb	`MoveToAttributeNo(no): moves the position of the current`
Packit	423ecb	`instance to the attribute with the specified index relative to the`
Packit	423ecb	`containing element.`
Packit	423ecb	`MoveToAttribute(name): moves the position of the current`
Packit	423ecb	`instance to the attribute with the specified qualified name.`
Packit	423ecb	`MoveToAttributeNs(localName, namespaceURI): moves the position`
Packit	423ecb	`of the current instance to the attribute with the specified local name`
Packit	423ecb	`and namespace URI.`
Packit	423ecb	`MoveToFirstAttribute: moves the position of the current`
Packit	423ecb	`instance to the first attribute associated with the current node.`
Packit	423ecb	`MoveToNextAttribute: moves the position of the current`
Packit	423ecb	`instance to the next attribute associated with the current node.`
Packit	423ecb	`MoveToElement: moves the position of the current instance to`
Packit	423ecb	`the node that contains the current Attribute node.`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`After modifying the processNode() function to show attributes:`
Packit	423ecb	`def processNode(reader):`
Packit	423ecb	`    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),`
Packit	423ecb	`                              reader.Name(), reader.IsEmptyElement(),`
Packit	423ecb	`                              reader.Value()))`
Packit	423ecb	`    if reader.NodeType() == 1: # Element`
Packit	423ecb	`        while reader.MoveToNextAttribute():`
Packit	423ecb	`            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),`
Packit	423ecb	`                                          reader.Name(), reader.Value()))`
Packit	423ecb
Packit	423ecb	`The output for the same input document reflects the attribute:`
Packit	423ecb	`0 1 doc 1 None`
Packit	423ecb	`-- 1 2 (a) [b]`
Packit	423ecb
Packit	423ecb	`There are a couple of things to note on the attribute processing:`
Packit	423ecb
Packit	423ecb	`Their depth is the one of the carrying element plus one.`
Packit	423ecb	`Namespace declarations are seen as attributes, as in DOM.`
Packit	423ecb
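The cursor-moving operations can be combined to gather all attributes of an element and then return to the element itself. The sketch below is one possible arrangement, not the tutorial's own code: `element_attributes` is a hypothetical helper, and it compares the return values against 1 explicitly on the assumption that the methods may also report errors with a negative value, which Python would otherwise treat as true.

```python
def element_attributes(reader):
    """Return {qualified name: value} for the attributes of the current element.

    Hypothetical helper built on HasAttributes(), MoveToFirstAttribute(),
    MoveToNextAttribute() and MoveToElement(). Namespace declarations show
    up as ordinary entries, since the reader reports them as attributes.
    """
    attrs = {}
    if reader.NodeType() != 1 or reader.HasAttributes() != 1:
        return attrs
    ret = reader.MoveToFirstAttribute()
    while ret == 1:
        attrs[reader.Name()] = reader.Value()
        ret = reader.MoveToNextAttribute()
    # put the cursor back on the element that carries the attributes
    reader.MoveToElement()
    return attrs
```

For the test document `<doc a="b"/>` used above, this would be expected to give a dictionary with the single entry `a` mapped to `b`.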
Packit	423ecb
Packit	423ecb	`Validating a document`
Packit	423ecb
Packit	423ecb	`The libxml2 implementation adds some extra features on top of the XmlTextReader`
Packit	423ecb	`API. The main one is the ability to DTD validate the parsed document`
Packit	423ecb	`progressively. This is simply the activation of the associated feature of the`
Packit	423ecb	`parser used by the reader structure. There are a few options available,`
Packit	423ecb	`defined as the enum xmlParserProperties in the libxml/xmlreader.h header`
Packit	423ecb	`file:`
Packit	423ecb
Packit	423ecb	`XML_PARSER_LOADDTD: force loading the DTD (without validating)`
Packit	423ecb	`XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies`
Packit	423ecb	`loading the DTD)`
Packit	423ecb	`XML_PARSER_VALIDATE: activate DTD validation (this also implies loading`
Packit	423ecb	`the DTD)`
Packit	423ecb	`XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity`
Packit	423ecb	`reference nodes are not generated and are replaced by their expanded`
Packit	423ecb	`content.`
Packit	423ecb	`more settings might be added; these were the ones available at the 2.5.0`
Packit	423ecb	`release.`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`The GetParserProp() and SetParserProp() methods can then be used to get`
Packit	423ecb	`and set the values of those parser properties of the reader. For example:`
Packit	423ecb	`def parseAndValidate(file):`
Packit	423ecb	`    reader = libxml2.newTextReaderFilename(file)`
Packit	423ecb	`    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)`
Packit	423ecb	`    ret = reader.Read()`
Packit	423ecb	`    while ret == 1:`
Packit	423ecb	`        ret = reader.Read()`
Packit	423ecb	`    if ret != 0:`
Packit	423ecb	`        print("Error parsing and validating %s" % (file))`
Packit	423ecb
Packit	423ecb	`This routine will parse and validate the file. Error messages can be`
Packit	423ecb	`captured by registering an error handler. See python/tests/reader2.py for`
Packit	423ecb	`more complete Python examples. At the C level the equivalent call to activate`
Packit	423ecb	`the validation feature is just:`
Packit	423ecb	`ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);`
Packit	423ecb
Packit	423ecb	`and a return value of 0 indicates success.`
Packit	423ecb
Packit	423ecb	`Entities substitution`
Packit	423ecb
Packit	423ecb	`By default the xmlReader will report entities as such and not replace them`
Packit	423ecb	`with their content. This default behaviour can however be overridden using:`
Packit	423ecb
Packit	423ecb	`reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)`
Packit	423ecb
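Since SetParserProp() reports success with a return value of 0, it can be worth checking that value instead of ignoring it, in particular when several properties are set together. The sketch below is an illustration under that assumption; `set_parser_props` is a hypothetical helper, not part of the bindings.

```python
def set_parser_props(reader, props):
    """Set several parser properties, failing loudly if one is refused.

    Hypothetical helper: 'props' is a sequence of (property, value) pairs,
    for instance (libxml2.PARSER_VALIDATE, 1). A SetParserProp() return
    value of 0 is taken to mean success, as stated in the text.
    """
    for prop, value in props:
        if reader.SetParserProp(prop, value) != 0:
            raise RuntimeError("could not set parser property %r" % (prop,))
```

A typical call, made before the first Read(), might look like `set_parser_props(reader, [(libxml2.PARSER_VALIDATE, 1), (libxml2.PARSER_SUBST_ENTITIES, 1)])`.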
Packit	423ecb	`Relax-NG Validation`
Packit	423ecb
Packit	423ecb	`Introduced in version 2.5.7`
Packit	423ecb
Packit	423ecb	`Libxml2 can now validate the document being read by the xmlReader against`
Packit	423ecb	`Relax-NG schemas. While the Relax NG validator can't always work in a`
Packit	423ecb	`streamable mode, only subsets which cannot be reduced to regular expressions`
Packit	423ecb	`need to have their subtree expanded for validation. In practice this means`
Packit	423ecb	`that, unless the schema for the top-level element content cannot be expressed`
Packit	423ecb	`as a regexp, only chunks of the document need to be parsed while`
Packit	423ecb	`validating.`
Packit	423ecb
Packit	423ecb	`The steps to do so are:`
Packit	423ecb
Packit	423ecb	`create a reader working on a document as usual`
Packit	423ecb	`before any call to Read(), associate it with a Relax NG schema, either the`
Packit	423ecb	`preparsed schema or the URL of the schema to use`
Packit	423ecb	`errors will be reported the usual way, and the validity status can be`
Packit	423ecb	`obtained using the IsValid() interface of the reader, as for DTDs.`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`Example, assuming the reader has already been created and that the schema`
Packit	423ecb	`string contains the Relax-NG schema:`
Packit	423ecb	`rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))`
Packit	423ecb	`rngs = rngp.relaxNGParse()`
Packit	423ecb	`reader.RelaxNGSetSchema(rngs)`
Packit	423ecb	`ret = reader.Read()`
Packit	423ecb	`while ret == 1:`
Packit	423ecb	`    ret = reader.Read()`
Packit	423ecb	`if ret != 0:`
Packit	423ecb	`    print("Error parsing the document")`
Packit	423ecb	`if reader.IsValid() != 1:`
Packit	423ecb	`    print("Document failed to validate")`
Packit	423ecb
Packit	423ecb
Packit	423ecb	`See reader6.py in the sources or documentation for a complete`
Packit	423ecb	`example.`
Packit	423ecb
Packit	423ecb	`Mixing the reader and tree or XPath operations`
Packit	423ecb
Packit	423ecb	`Introduced in version 2.5.7`
Packit	423ecb
Packit	423ecb	`While the reader is a streaming interface, its underlying implementation`
Packit	423ecb	`is based on the DOM builder of libxml2. As a result it is relatively simple`
Packit	423ecb	`to mix operations based on both models under some constraints. To do so the`
Packit	423ecb	`reader has an Expand() operation which grows the subtree under the`
Packit	423ecb	`current node. It returns a pointer to a standard node which can be`
Packit	423ecb	`manipulated in the usual ways. The node will get all its ancestors and the`
Packit	423ecb	`full subtree available. Usual operations like XPath queries can be used on`
Packit	423ecb	`that reduced view of the document. Here is an example extracted from`
Packit	423ecb	`reader5.py in the sources, which extracts and prints the bibliography entry for the`
Packit	423ecb	`"Dragon" compiler book from the XML 1.0 recommendation:`
Packit	423ecb	`f = open('../../test/valid/REC-xml-19980210.xml')`
Packit	423ecb	`input = libxml2.inputBuffer(f)`
Packit	423ecb	`reader = input.newTextReader("REC")`
Packit	423ecb	`res=""`
Packit	423ecb	`while reader.Read():`
Packit	423ecb	`    while reader.Name() == 'bibl':`
Packit	423ecb	`        node = reader.Expand()            # expand the subtree`
Packit	423ecb	`        if node.xpathEval("@id = 'Aho'"): # use XPath on it`
Packit	423ecb	`            res = res + node.serialize()`
Packit	423ecb	`        if reader.Next() != 1:            # skip the subtree`
Packit	423ecb	`            break`
Packit	423ecb
Packit	423ecb	`Note, however, that the node instance returned by the Expand() call is only`
Packit	423ecb	`valid until the next Read() operation. The Expand() operation does not`
Packit	423ecb	`affect the Read() ones; however, once processed the full subtree is usually`
Packit	423ecb	`not useful anymore, and the Next() operation allows skipping it completely and`
Packit	423ecb	`proceeding to the successor, or returns 0 if the document end is reached.`
Packit	423ecb
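The Expand()/Next() pattern can be written as a single loop that advances with Next() after an expanded element and with Read() otherwise. This is a sketch under the constraints just described, not the code from reader5.py: `map_subtrees` and its `visit` callback are hypothetical names, and the callback is expected to return plain data (for example a serialized string), because the expanded node must not be kept once the reader moves on.

```python
def map_subtrees(reader, name, visit):
    """Apply visit() to the expanded subtree of every element called 'name'.

    Hypothetical helper using the Read(), Next(), NodeType(), Name() and
    Expand() methods described above. Results returned by visit() are
    collected; the expanded nodes themselves are never stored.
    """
    results = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == name:
            node = reader.Expand()      # grow the subtree under the element
            results.append(visit(node))
            ret = reader.Next()         # skip the subtree just processed
        else:
            ret = reader.Read()
    if ret != 0:
        raise ValueError("failed to parse the document")
    return results
```

For the bibliography example, `visit` could be a function that returns `node.serialize()` when `node.xpathEval("@id = 'Aho'")` is true and `None` otherwise.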
Packit	423ecb	`Daniel Veillard`
Packit	423ecb
Packit	423ecb	$Id$
Packit	423ecb
Packit	423ecb
Packit	423ecb	`</body>`
Packit	423ecb	`</html>`
