<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>
<body bgcolor="#fffacd" text="#000000">

Libxml2 XmlTextReader Interface tutorial

This document describes the use of the XmlTextReader streaming API added to
libxml2 in version 2.5.0. This API is closely modeled after the
XmlTextReader (http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html)
and XmlReader (http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html)
classes of the C# language.

This tutorial presents the key points of this API, along with working
examples using both C and the Python bindings.

Table of contents:

  • Introduction: why a new API
  • Walking a simple tree
  • Extracting information for the current node
  • Extracting information for the attributes
  • Validating a document
  • Entities substitution
  • Relax-NG Validation
  • Mixing the reader and tree or XPath operations

Introduction: why a new API

The main libxml2 API is tree based: the parsing operation loads the
document completely in memory and exposes it as a tree of nodes, all
available at the same time. This is very simple and quite powerful, but has
the major limitation that the size of the document that can be handled is
bounded by the amount of memory available. Libxml2 also provides a SAX
(http://www.saxproject.org/) based API, but that version was designed upon
one of the early expat (http://www.jclark.com/xml/expat.html) versions of
SAX, and SAX is not formally defined for C. SAX basically works by
registering callbacks which are called directly by the parser as it
progresses through the document stream. The problem is that this
programming model is relatively complex, is not well standardized, cannot
provide validation directly, and makes entity, namespace and base
processing relatively hard.

The XmlTextReader
(http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) API from C#
provides a far simpler programming model. The API acts as a cursor going
forward on the document stream and stopping at each node on the way. The
user's code keeps control of the progress and simply calls a Read()
function repeatedly to progress to each node in sequence, in document
order. There is direct support for namespaces, xml:base and entity
handling, and adding DTD validation on top of it was relatively simple.
This API is really close to the DOM Core specification. It provides a far
more standard, easy to use and powerful API than the existing SAX.
Moreover, integrating extension features based on the tree seems relatively
easy.

In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.

Walking a simple tree

Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:

  1. prepare a reader context operating on some input
  2. run a loop iterating over all nodes in the document
  3. free up the reader context

Here is a basic C sample doing this:

    #include <stdio.h>
    #include <libxml/xmlreader.h>

    void processNode(xmlTextReaderPtr reader) {
        /* handling of a node in the tree */
    }

    /* returns 0 on success, -1 if the file cannot be opened or parsed */
    int streamFile(const char *filename) {
        xmlTextReaderPtr reader;
        int ret;

        reader = xmlNewTextReaderFilename(filename);
        if (reader != NULL) {
            ret = xmlTextReaderRead(reader);
            while (ret == 1) {
                processNode(reader);
                ret = xmlTextReaderRead(reader);
            }
            xmlFreeTextReader(reader);
            if (ret != 0) {
                printf("%s : failed to parse\n", filename);
            }
        } else {
            printf("Unable to open %s\n", filename);
            ret = -1;
        }
        return ret;
    }

A few things to notice:

  • the include file needed: libxml/xmlreader.h
  • the creation of the reader using a filename
  • the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop
  • that a negative return means a parsing error
  • how xmlFreeTextReader() should be used to free up the resources used by
    the reader.

Here is similar code in Python for exactly the same processing:

    import libxml2

    def processNode(reader):
        pass

    def streamFile(filename):
        try:
            reader = libxml2.newTextReaderFilename(filename)
        except:
            print("unable to open %s" % (filename))
            return

        ret = reader.Read()
        while ret == 1:
            processNode(reader)
            ret = reader.Read()

        if ret != 0:
            print("%s : failed to parse" % (filename))

The only things worth adding are that the xmlTextReader
(http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) is abstracted
as a class like in C#, with the same method names (but the properties are
currently accessed with methods), and that one doesn't need to free the
reader at the end of the processing. It will get garbage collected once all
references have disappeared.

Extracting information for the current node

So far the example code did not indicate how information was extracted from
the reader. It was abstracted as a call to the processNode() routine, with
the reader as the argument. At each invocation, the parser is stopped on a
given node and the reader can be used to query that node's properties. Each
Property is available at the C level as a function named
xmlTextReaderProperty taking a single xmlTextReaderPtr argument; if the
return type is an xmlChar * string then it must be deallocated with
xmlFree() to avoid leaks. For the Python interface, there is a Property
method on the reader class that can be called on the instance. The list of
the properties is based on the C# XmlTextReader
(http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) class set of
properties and methods:

  • NodeType: the node type, 1 for start element, 15 for end of element, 2
    for attributes, 3 for text nodes, 4 for CData sections, 5 for entity
    references, 6 for entity declarations, 7 for PIs, 8 for comments, 9 for
    the document nodes, 10 for DTD/Doctype nodes, 11 for document fragment
    and 12 for notation nodes.
  • Name: the qualified name
    (http://www.w3.org/TR/REC-xml-names/#ns-qualnames) of the node, equal to
    (Prefix:)LocalName.
  • LocalName: the local name
    (http://www.w3.org/TR/REC-xml-names/#NT-LocalPart) of the node.
  • Prefix: a shorthand reference to the namespace
    (http://www.w3.org/TR/REC-xml-names/) associated with the node.
  • NamespaceUri: the URI defining the namespace
    (http://www.w3.org/TR/REC-xml-names/) associated with the node.
  • BaseUri: the base URI of the node. See the XML Base W3C specification
    (http://www.w3.org/TR/xmlbase/).
  • Depth: the depth of the node in the tree, starts at 0 for the root
    node.
  • HasAttributes: whether the node has attributes.
  • HasValue: whether the node can have a text value.
  • Value: provides the text value of the node if present.
  • IsDefault: whether an Attribute node was generated from the default
    value defined in the DTD or schema (not supported yet).
  • XmlLang: the xml:lang (http://www.w3.org/TR/REC-xml#sec-lang-tag) scope
    within which the node resides.
  • IsEmptyElement: check if the current node is empty. This is a bit
    bizarre in the sense that <a/> will be considered empty while <a></a>
    will not.
  • AttributeCount: provides the number of attributes of the current node.
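
The numeric NodeType values are easier to read in debugging output once they
are mapped to labels. The following is a minimal Python sketch of such a
lookup. The table NODE_TYPE_NAMES, the helper node_type_name() and the label
strings are made up for this illustration and are not part of the bindings.
Values 0, 13, 14, 16 and 17 are not in the list above; they are taken from
the xmlReaderTypes enum found in current versions of libxml/xmlreader.h.

```python
# Node type codes reported by NodeType(). The labels are illustrative
# strings chosen for this sketch, not identifiers exported by libxml2.
NODE_TYPE_NAMES = {
    0: "none",
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    13: "whitespace",              # not in the list above
    14: "significant whitespace",  # not in the list above
    15: "element end",
    16: "entity end",              # not in the list above
    17: "XML declaration",         # not in the list above
}


def node_type_name(code):
    """Return a readable label for a NodeType() value."""
    return NODE_TYPE_NAMES.get(code, "unknown (%d)" % code)


if __name__ == "__main__":
    for code in (1, 3, 15, 42):
        print("%d -> %s" % (code, node_type_name(code)))
```

Such a helper can be called from processNode() as
node_type_name(reader.NodeType()) when the raw numbers become hard to follow.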

Let's look first at a small example to put this in practice, by redefining
the processNode() function in the Python example:

    def processNode(reader):
        print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                               reader.Name(), reader.IsEmptyElement()))

and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.

For the minimal document "<doc/>" we get:

    0 1 doc 1

Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with "<doc></doc>" instead
leads to:

    0 1 doc 0
    0 15 doc 0

The document root node is not flagged as empty anymore and both a start and
an end of element are detected. The following document shows how character
data are reported:

    <doc><a/><b>some text</b>
    <c/></doc>

We modify the processNode() function to also report the node Value:

    def processNode(reader):
        print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                                  reader.Name(), reader.IsEmptyElement(),
                                  reader.Value()))

The result of the test is:

    0 1 doc 0 None
    1 1 a 1 None
    1 1 b 0 None
    2 3 #text 0 some text
    1 15 b 0 None
    1 3 #text 0

    1 1 c 1 None
    0 15 doc 0 None

There are a few things to note:

  • the increase of the depth value (first column) as children nodes are
    explored
  • the text node child of the b element, of type 3, and its content
  • the text node containing the line return between elements b and c
  • that elements have the Value None (or NULL in C)

The equivalent routine for processNode() as used by xmllint --stream --debug
is the following and can be found in the xmllint.c module in the source
distribution:

    static void processNode(xmlTextReaderPtr reader) {
        xmlChar *name, *value;

        name = xmlTextReaderName(reader);
        if (name == NULL)
            name = xmlStrdup(BAD_CAST "--");
        value = xmlTextReaderValue(reader);

        printf("%d %d %s %d",
                xmlTextReaderDepth(reader),
                xmlTextReaderNodeType(reader),
                name,
                xmlTextReaderIsEmptyElement(reader));
        xmlFree(name);
        if (value == NULL)
            printf("\n");
        else {
            printf(" %s\n", value);
            xmlFree(value);
        }
    }

Extracting information for the attributes

The previous examples don't indicate how attributes are processed. The
simple test "<doc a="b"/>" provides the following result:

    0 1 doc 1 None

This shows that attribute nodes are not traversed by default. The
HasAttributes property allows detecting their presence. To check their
content the API has special instructions. Basically two kinds of operations
are possible:

  1. to move the reader to the attribute nodes of the current element, in
     which case the cursor is positioned on the attribute node
  2. to directly query the element node for the attribute value

In both cases the attribute can be designated either by its position in the
list of attributes (MoveToAttributeNo or GetAttributeNo) or by its name
(and namespace):

  • GetAttributeNo(no): provides the value of the attribute with the
    specified index no relative to the containing element.
  • GetAttribute(name): provides the value of the attribute with the
    specified qualified name.
  • GetAttributeNs(localName, namespaceURI): provides the value of the
    attribute with the specified local name and namespace URI.
  • MoveToAttributeNo(no): moves the position of the current instance to
    the attribute with the specified index relative to the containing
    element.
  • MoveToAttribute(name): moves the position of the current instance to
    the attribute with the specified qualified name.
  • MoveToAttributeNs(localName, namespaceURI): moves the position of the
    current instance to the attribute with the specified local name and
    namespace URI.
  • MoveToFirstAttribute: moves the position of the current instance to the
    first attribute associated with the current node.
  • MoveToNextAttribute: moves the position of the current instance to the
    next attribute associated with the current node.
  • MoveToElement: moves the position of the current instance to the node
    that contains the current Attribute node.

After modifying the processNode() function to show attributes:

    def processNode(reader):
        print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                                  reader.Name(), reader.IsEmptyElement(),
                                  reader.Value()))
        if reader.NodeType() == 1: # Element
            while reader.MoveToNextAttribute():
                print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                              reader.Name(), reader.Value()))

The output for the same input document reflects the attribute:

    0 1 doc 1 None
    -- 1 2 (a) [b]

There are a couple of things to note on the attribute processing:

  • Their depth is the one of the carrying element plus one.
  • Namespace declarations are seen as attributes, as in DOM.

Validating a document

The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the
libxml/xmlreader.h header file:

  • XML_PARSER_LOADDTD: force loading the DTD (without validating)
  • XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)
  • XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)
  • XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.
  • more settings might be added, those were the ones available at the
    2.5.0 release...

The GetParserProp() and SetParserProp() methods can then be used to get and
set the values of those parser properties of the reader. For example:

    def parseAndValidate(file):
        reader = libxml2.newTextReaderFilename(file)
        reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
        ret = reader.Read()
        while ret == 1:
            ret = reader.Read()
        if ret != 0:
            print("Error parsing and validating %s" % (file))

This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to
activate the validation feature is just:

    ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);

and a return value of 0 indicates success.

Entities substitution

By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:

    reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES, 1)

Relax-NG Validation

Introduced in version 2.5.7

Libxml2 can now validate the document being read with the xmlReader against
Relax-NG schemas. While the Relax NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular
expressions need to have their subtree expanded for validation. In practice
it means that, unless the schema for the top level element content is not
expressible as a regexp, only chunks of the document need to be parsed
while validating.

The steps to do so are:

  • create a reader working on a document as usual
  • before any call to read, associate it with a Relax NG schema, either
    the preparsed schema or the URL of the schema to use
  • errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, like for DTDs.

Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:

    rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
    rngs = rngp.relaxNGParse()
    reader.RelaxNGSetSchema(rngs)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing the document")
    if reader.IsValid() != 1:
        print("Document failed to validate")

See reader6.py in the sources or documentation for a complete example.

Mixing the reader and tree or XPath operations

Introduced in version 2.5.7

While the reader is a streaming interface, its underlying implementation is
based on the DOM builder of libxml2. As a result it is relatively simple to
mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the current
node. It returns a pointer to a standard node which can be manipulated in
the usual ways. The node will get all its ancestors and the full subtree
available. Usual operations like XPath queries can be used on that reduced
view of the document. Here is an example extracted from reader5.py in the
sources which extracts and prints the bibliography for the "Dragon"
compiler book from the XML 1.0 recommendation:

    f = open('../../test/valid/REC-xml-19980210.xml')
    input = libxml2.inputBuffer(f)
    reader = input.newTextReader("REC")
    res = ""
    while reader.Read() == 1:                 # stop at the end or on error
        while reader.Name() == 'bibl':
            node = reader.Expand()            # expand the subtree
            if node.xpathEval("@id = 'Aho'"): # use XPath on it
                res = res + node.serialize()
            if reader.Next() != 1:            # skip the subtree
                break

Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed, the full subtree is
usually not useful anymore, and the Next() operation allows skipping it
completely and proceeding to the successor, or returns 0 if the document
end is reached.

Daniel Veillard

$Id$

</body>
</html>