==============
lxml.objectify
==============

:Authors:
  Stefan Behnel, Holger Joukl

lxml supports an alternative API similar to the Amara_ bindery or
gnosis.xml.objectify_ through a `custom Element implementation`_.  The main idea
is to hide the usage of XML behind normal Python objects, sometimes referred
to as data-binding.  It allows you to use XML as if you were dealing with a
normal Python object hierarchy.

Children of an XML element are accessed as normal object attributes.  If
there are multiple children with the same name, slicing and indexing can be
used.  Python data types are extracted from XML content automatically and made
available to the normal Python operators.

.. contents::
..
   1  The lxml.objectify API
     1.1  Element access through object attributes
     1.2  Creating objectify trees
     1.3  Tree generation with the E-factory
     1.4  Namespace handling
   2  Asserting a Schema
   3  ObjectPath
   4  Python data types
     4.1  Recursive tree dump
     4.2  Recursive string representation of elements
   5  How data types are matched
     5.1  Type annotations
     5.2  XML Schema datatype annotation
     5.3  The DataElement factory
     5.4  Defining additional data classes
     5.5  Advanced element class lookup
   6  What is different from lxml.etree?

.. _Amara: http://uche.ogbuji.net/tech/4suite/amara/
.. _gnosis.xml.objectify: http://gnosis.cx/download/
.. _`benchmark page`: performance.html#lxml-objectify
.. _`custom Element implementation`: element_classes.html

To set up and use ``objectify``, you need both the ``lxml.etree``
module and ``lxml.objectify``:

.. sourcecode:: pycon

    >>> from lxml import etree
    >>> from lxml import objectify

The objectify API is very different from the ElementTree API.  If it
is used, it should not be mixed with other element implementations
(such as trees parsed with ``lxml.etree``), to avoid non-obvious
behaviour.

The `benchmark page`_ has some hints on performance optimisation of
code using lxml.objectify.

To make the doctests in this document look a little nicer, we also use
this:

.. sourcecode:: pycon

    >>> import lxml.usedoctest

Imported from within a doctest, this relieves us from caring about the exact
formatting of XML output.

..
    >>> try: from StringIO import StringIO
    ... except ImportError:
    ...     from io import BytesIO # Python 3
    ...     def StringIO(s):
    ...         if isinstance(s, str): s = s.encode('UTF-8')
    ...         return BytesIO(s)

..
  >>> import sys
  >>> from lxml import etree as _etree
  >>> if sys.version_info[0] >= 3:
  ...   class etree_mock(object):
  ...     def __getattr__(self, name): return getattr(_etree, name)
  ...     def tostring(self, *args, **kwargs):
  ...       s = _etree.tostring(*args, **kwargs)
  ...       if isinstance(s, bytes) and bytes([10]) in s: s = s.decode("utf-8") # CR
  ...       if s[-1] == '\n': s = s[:-1]
  ...       return s
  ... else:
  ...   class etree_mock(object):
  ...     def __getattr__(self, name): return getattr(_etree, name)
  ...     def tostring(self, *args, **kwargs):
  ...       s = _etree.tostring(*args, **kwargs)
  ...       if s[-1] == '\n': s = s[:-1]
  ...       return s
  >>> etree = etree_mock()


The lxml.objectify API
======================

In ``lxml.objectify``, element trees provide an API that models the behaviour
of normal Python object trees as closely as possible.


Element access through object attributes
----------------------------------------

The main idea behind the ``objectify`` API is to hide XML element access
behind the usual object attribute access pattern.  Asking an element for an
attribute will return the sequence of children with corresponding tag names:

.. sourcecode:: pycon

    >>> root = objectify.Element("root")
    >>> b = objectify.SubElement(root, "b")
    >>> print(root.b[0].tag)
    b
    >>> root.index(root.b[0])
    0
    >>> b = objectify.SubElement(root, "b")
    >>> print(root.b[0].tag)
    b
    >>> print(root.b[1].tag)
    b
    >>> root.index(root.b[1])
    1

For convenience, you can omit the index '0' to access the first child:

.. sourcecode:: pycon

    >>> print(root.b.tag)
    b
    >>> root.index(root.b)
    0
    >>> del root.b

Iteration and slicing also obey the requested tag:

.. sourcecode:: pycon

    >>> x1 = objectify.SubElement(root, "x")
    >>> x2 = objectify.SubElement(root, "x")
    >>> x3 = objectify.SubElement(root, "x")

    >>> [ el.tag for el in root.x ]
    ['x', 'x', 'x']

    >>> [ el.tag for el in root.x[1:3] ]
    ['x', 'x']

    >>> [ el.tag for el in root.x[-1:] ]
    ['x']

    >>> del root.x[1:2]
    >>> [ el.tag for el in root.x ]
    ['x', 'x']

If you want to iterate over all children or need to provide a specific
namespace for the tag, use the ``iterchildren()`` method.  Like the other
methods for iteration, it supports an optional tag keyword argument:

.. sourcecode:: pycon

    >>> [ el.tag for el in root.iterchildren() ]
    ['b', 'x', 'x']

    >>> [ el.tag for el in root.iterchildren(tag='b') ]
    ['b']

    >>> [ el.tag for el in root.b ]
    ['b']
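
The namespace case mentioned above can be sketched as follows.  The document
and its namespace URIs are made up for illustration; the point is only that
``iterchildren()`` takes a fully qualified tag name, whereas attribute access
looks in the namespace of the parent:

.. sourcecode:: python

    from lxml import objectify

    # Hypothetical document: two children in the parent's namespace,
    # one child in a different namespace.
    root = objectify.fromstring(
        '<root xmlns="http://ns/" xmlns:o="http://other/">'
        '<b/><o:c/><b/>'
        '</root>')

    # Attribute access only finds children in the parent's namespace.
    ns_children = [el.tag for el in root.b]

    # iterchildren() accepts a tag that names its namespace explicitly.
    other_children = [el.tag
                      for el in root.iterchildren(tag="{http://other/}c")]

    # Without a tag, all children are returned in document order.
    all_children = [el.tag for el in root.iterchildren()]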

XML attributes are accessed as in the normal ElementTree API:

.. sourcecode:: pycon

    >>> c = objectify.SubElement(root, "c", myattr="someval")
    >>> print(root.c.get("myattr"))
    someval

    >>> root.c.set("c", "oh-oh")
    >>> print(root.c.get("c"))
    oh-oh

In addition to the normal ElementTree API for appending elements to trees,
subtrees can also be added by assigning them to object attributes.  In this
case, the subtree is automatically deep copied and the tag name of its root is
updated to match the attribute name:

.. sourcecode:: pycon

    >>> el = objectify.Element("yet_another_child")
    >>> root.new_child = el
    >>> print(root.new_child.tag)
    new_child
    >>> print(el.tag)
    yet_another_child

    >>> root.y = [ objectify.Element("y"), objectify.Element("y") ]
    >>> [ el.tag for el in root.y ]
    ['y', 'y']

The latter is a short form for operations on the full slice:

.. sourcecode:: pycon

    >>> root.y[:] = [ objectify.Element("y") ]
    >>> [ el.tag for el in root.y ]
    ['y']

You can also replace children that way:

.. sourcecode:: pycon

    >>> child1 = objectify.SubElement(root, "child")
    >>> child2 = objectify.SubElement(root, "child")
    >>> child3 = objectify.SubElement(root, "child")

    >>> el = objectify.Element("new_child")
    >>> subel = objectify.SubElement(el, "sub")

    >>> root.child = el
    >>> print(root.child.sub.tag)
    sub

    >>> root.child[2] = el
    >>> print(root.child[2].sub.tag)
    sub

Note that special care must be taken when changing the tag name of an element:

.. sourcecode:: pycon

    >>> print(root.b.tag)
    b
    >>> root.b.tag = "notB"
    >>> root.b
    Traceback (most recent call last):
      ...
    AttributeError: no such child: b
    >>> print(root.notB.tag)
    notB


Creating objectify trees
------------------------

As with ``lxml.etree``, you can either create an ``objectify`` tree by
parsing an XML document or by building one from scratch.  To parse a
document, just use the ``parse()`` or ``fromstring()`` functions of
the module:

.. sourcecode:: pycon

    >>> fileobject = StringIO('<test/>')

    >>> tree = objectify.parse(fileobject)
    >>> print(isinstance(tree.getroot(), objectify.ObjectifiedElement))
    True

    >>> root = objectify.fromstring('<test/>')
    >>> print(isinstance(root, objectify.ObjectifiedElement))
    True

To build a new tree in memory, ``objectify`` replicates the standard
factory function ``Element()`` from ``lxml.etree``:

.. sourcecode:: pycon

    >>> obj_el = objectify.Element("new")
    >>> print(isinstance(obj_el, objectify.ObjectifiedElement))
    True

After creating such an Element, you can use the `usual API`_ of
lxml.etree to add SubElements to the tree:

.. sourcecode:: pycon

    >>> child = objectify.SubElement(obj_el, "newchild", attr="value")

.. _`usual API`: tutorial.html#the-element-class

New subelements will automatically inherit the objectify behaviour
from their tree.  However, all independent elements that you create
through the ``Element()`` factory of lxml.etree (instead of objectify)
will not support the ``objectify`` API by themselves:

.. sourcecode:: pycon

    >>> subel = objectify.SubElement(obj_el, "sub")
    >>> print(isinstance(subel, objectify.ObjectifiedElement))
    True

    >>> independent_el = etree.Element("new")
    >>> print(isinstance(independent_el, objectify.ObjectifiedElement))
    False


Tree generation with the E-factory
----------------------------------

To simplify the generation of trees even further, you can use the E-factory:

.. sourcecode:: pycon

    >>> E = objectify.E
    >>> root = E.root(
    ...   E.a(5),
    ...   E.b(6.21),
    ...   E.c(True),
    ...   E.d("how", tell="me")
    ... )

    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns:py="http://codespeak.net/lxml/objectify/pytype">
      <a py:pytype="int">5</a>
      <b py:pytype="float">6.21</b>
      <c py:pytype="bool">true</c>
      <d py:pytype="str" tell="me">how</d>
    </root>

This allows you to define the vocabulary of a specific XML language as tags:

.. sourcecode:: pycon

    >>> ROOT = objectify.E.root
    >>> TITLE = objectify.E.title
    >>> HOWMANY = getattr(objectify.E, "how-many")

    >>> root = ROOT(
    ...   TITLE("The title"),
    ...   HOWMANY(5)
    ... )

    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns:py="http://codespeak.net/lxml/objectify/pytype">
      <title py:pytype="str">The title</title>
      <how-many py:pytype="int">5</how-many>
    </root>

``objectify.E`` is an instance of ``objectify.ElementMaker``.  By default, it
creates pytype annotated Elements without a namespace.  You can switch off the
pytype annotation by passing False to the ``annotate`` keyword argument of the
constructor.  You can also pass a default namespace and an ``nsmap``:

.. sourcecode:: pycon

    >>> myE = objectify.ElementMaker(annotate=False,
    ...           namespace="http://my/ns", nsmap={None : "http://my/ns"})

    >>> root = myE.root( myE.someint(2) )

    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns="http://my/ns">
      <someint>2</someint>
    </root>


Namespace handling
------------------

During tag lookups, namespaces are handled mostly behind the scenes.
If you access a child of an Element without specifying a namespace,
the lookup will use the namespace of the parent:

.. sourcecode:: pycon

    >>> root = objectify.Element("{http://ns/}root")
    >>> b = objectify.SubElement(root, "{http://ns/}b")
    >>> c = objectify.SubElement(root, "{http://other/}c")

    >>> print(root.b.tag)
    {http://ns/}b

Note that the ``SubElement()`` factory of ``lxml.etree`` does not
inherit any namespaces when creating a new subelement.  Element
creation must be explicit about the namespace, and is simplified
through the E-factory as described above.
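
A minimal sketch of the difference, using made-up tag names: a child created
without a namespace is not found by the namespace-inheriting lookup of its
parent, while an ``ElementMaker`` configured with a default namespace keeps
creation and lookup consistent:

.. sourcecode:: python

    from lxml import objectify

    root = objectify.Element("{http://ns/}root")

    # SubElement() takes the tag literally, so this child has no namespace.
    plain = objectify.SubElement(root, "b")
    plain_tag = plain.tag

    # The attribute lookup searches for {http://ns/}b and finds nothing.
    found_by_lookup = hasattr(root, "b")

    # An ElementMaker with a default namespace creates namespaced children.
    E = objectify.ElementMaker(annotate=False, namespace="http://ns/",
                               nsmap={None: "http://ns/"})
    ns_root = E.root(E.b("text"))
    ns_tag = ns_root.b.tag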

Lookups, however, inherit namespaces implicitly:

.. sourcecode:: pycon

    >>> print(root.b.tag)
    {http://ns/}b

    >>> print(root.c)
    Traceback (most recent call last):
        ...
    AttributeError: no such child: {http://ns/}c

To access an element in a different namespace than its parent, you can
use ``getattr()``:

.. sourcecode:: pycon

    >>> c = getattr(root, "{http://other/}c")
    >>> print(c.tag)
    {http://other/}c

For convenience, there is also a quick way through item access:

.. sourcecode:: pycon

    >>> c = root["{http://other/}c"]
    >>> print(c.tag)
    {http://other/}c

The same approach must be used to access children with tag names that are not
valid Python identifiers:

.. sourcecode:: pycon

    >>> el = objectify.SubElement(root, "{http://ns/}tag-name")
    >>> print(root["tag-name"].tag)
    {http://ns/}tag-name

    >>> new_el = objectify.Element("{http://ns/}new-element")
    >>> el = objectify.SubElement(new_el, "{http://ns/}child")
    >>> el = objectify.SubElement(new_el, "{http://ns/}child")
    >>> el = objectify.SubElement(new_el, "{http://ns/}child")

    >>> root["tag-name"] = [ new_el, new_el ]
    >>> print(len(root["tag-name"]))
    2
    >>> print(root["tag-name"].tag)
    {http://ns/}tag-name

    >>> print(len(root["tag-name"].child))
    3
    >>> print(root["tag-name"].child.tag)
    {http://ns/}child
    >>> print(root["tag-name"][1].child.tag)
    {http://ns/}child

or for names that have a special meaning in lxml.objectify:

.. sourcecode:: pycon

    >>> root = objectify.XML("<root><text>TEXT</text></root>")

    >>> print(root.text.text)
    Traceback (most recent call last):
      ...
    AttributeError: 'NoneType' object has no attribute 'text'

    >>> print(root["text"].text)
    TEXT


Asserting a Schema
==================

When dealing with XML documents from different sources, you will often
require them to follow a common schema.  In lxml.objectify, this
directly translates to enforcing a specific object tree, i.e. expected
object attributes are ensured to be there and to have the expected
type.  This can easily be achieved through XML Schema validation at
parse time.  Also see the `documentation on validation`_ on this
topic.

.. _`documentation on validation`: validation.html

First of all, we need a parser that knows our schema, so let's say we
parse the schema from a file-like object (or file or filename):

.. sourcecode:: pycon

    >>> f = StringIO('''\
    ...   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    ...     <xsd:element name="a" type="AType"/>
    ...     <xsd:complexType name="AType">
    ...       <xsd:sequence>
    ...         <xsd:element name="b" type="xsd:string" />
    ...       </xsd:sequence>
    ...     </xsd:complexType>
    ...   </xsd:schema>
    ... ''')
    >>> schema = etree.XMLSchema(file=f)

When creating the validating parser, we must make sure it `returns
objectify trees`_.  This is best done with the ``makeparser()``
function:

.. sourcecode:: pycon

    >>> parser = objectify.makeparser(schema = schema)

.. _`returns objectify trees`: #advanced-element-class-lookup

Now we can use it to parse a valid document:

.. sourcecode:: pycon

    >>> xml = "<a><b>test</b></a>"
    >>> a = objectify.fromstring(xml, parser)

    >>> print(a.b)
    test

Or an invalid document:

.. sourcecode:: pycon

    >>> xml = b"<a><b>test</b><c/></a>"
    >>> a = objectify.fromstring(xml, parser)  # doctest: +ELLIPSIS
    Traceback (most recent call last):
    lxml.etree.XMLSyntaxError: Element 'c': This element is not expected...

Note that the same works for parse-time DTD validation, except that
DTDs do not support any data types by design.
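
As a rough sketch of the DTD case, the following assumes a document that
carries its DTD as an internal subset and relies on the ``dtd_validation``
option of the lxml parser, which ``makeparser()`` passes through.  The DTD
and element names are invented for the example:

.. sourcecode:: python

    from lxml import etree, objectify

    # makeparser() accepts the keyword arguments of etree.XMLParser().
    parser = objectify.makeparser(dtd_validation=True)

    # Hypothetical DTD: <a> must contain exactly one <b> with text content.
    DOCTYPE = b'<!DOCTYPE a [<!ELEMENT a (b)> <!ELEMENT b (#PCDATA)>]>'

    # A valid document parses into an objectify tree as usual.
    a = objectify.fromstring(DOCTYPE + b'<a><b>test</b></a>', parser)
    b_text = a.b.text

    # An invalid document is rejected at parse time.
    try:
        objectify.fromstring(DOCTYPE + b'<a><b>test</b><c/></a>', parser)
        rejected = False
    except etree.XMLSyntaxError:
        rejected = True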


ObjectPath
==========

For both convenience and speed, objectify supports its own path language,
represented by the ``ObjectPath`` class:

.. sourcecode:: pycon

    >>> root = objectify.Element("{http://ns/}root")
    >>> b1 = objectify.SubElement(root, "{http://ns/}b")
    >>> c  = objectify.SubElement(b1,   "{http://ns/}c")
    >>> b2 = objectify.SubElement(root, "{http://ns/}b")
    >>> d  = objectify.SubElement(root, "{http://other/}d")

    >>> path = objectify.ObjectPath("root.b.c")
    >>> print(path)
    root.b.c
    >>> path.hasattr(root)
    True
    >>> print(path.find(root).tag)
    {http://ns/}c

    >>> find = objectify.ObjectPath("root.b.c")
    >>> print(find(root).tag)
    {http://ns/}c

    >>> find = objectify.ObjectPath("root.{http://other/}d")
    >>> print(find(root).tag)
    {http://other/}d

    >>> find = objectify.ObjectPath("root.{not}there")
    >>> print(find(root).tag)
    Traceback (most recent call last):
      ...
    AttributeError: no such child: {not}there

    >>> find = objectify.ObjectPath("{not}there")
    >>> print(find(root).tag)
    Traceback (most recent call last):
      ...
    ValueError: root element does not match: need {not}there, got {http://ns/}root

    >>> find = objectify.ObjectPath("root.b[1]")
    >>> print(find(root).tag)
    {http://ns/}b

    >>> find = objectify.ObjectPath("root.{http://ns/}b[1]")
    >>> print(find(root).tag)
    {http://ns/}b

Apart from strings, ObjectPath also accepts lists of path segments:

.. sourcecode:: pycon

    >>> find = objectify.ObjectPath(['root', 'b', 'c'])
    >>> print(find(root).tag)
    {http://ns/}c

    >>> find = objectify.ObjectPath(['root', '{http://ns/}b[1]'])
    >>> print(find(root).tag)
    {http://ns/}b

You can also use relative paths starting with a '.' to ignore the actual root
element and only inherit its namespace:

.. sourcecode:: pycon

    >>> find = objectify.ObjectPath(".b[1]")
    >>> print(find(root).tag)
    {http://ns/}b

    >>> find = objectify.ObjectPath(['', 'b[1]'])
    >>> print(find(root).tag)
    {http://ns/}b

    >>> find = objectify.ObjectPath(".unknown[1]")
    >>> print(find(root).tag)
    Traceback (most recent call last):
      ...
    AttributeError: no such child: {http://ns/}unknown

    >>> find = objectify.ObjectPath(".{http://other/}unknown[1]")
    >>> print(find(root).tag)
    Traceback (most recent call last):
      ...
    AttributeError: no such child: {http://other/}unknown

For convenience, a single dot represents the empty ObjectPath (identity):

.. sourcecode:: pycon

    >>> find = objectify.ObjectPath(".")
    >>> print(find(root).tag)
    {http://ns/}root

ObjectPath objects can be used to manipulate trees:

.. sourcecode:: pycon

    >>> root = objectify.Element("{http://ns/}root")

    >>> path = objectify.ObjectPath(".some.child.{http://other/}unknown")
    >>> path.hasattr(root)
    False
    >>> path.find(root)
    Traceback (most recent call last):
      ...
    AttributeError: no such child: {http://ns/}some

    >>> path.setattr(root, "my value") # creates children as necessary
    >>> path.hasattr(root)
    True
    >>> print(path.find(root).text)
    my value
    >>> print(root.some.child["{http://other/}unknown"].text)
    my value

    >>> print(len( path.find(root) ))
    1
    >>> path.addattr(root, "my new value")
    >>> print(len( path.find(root) ))
    2
    >>> [ el.text for el in path.find(root) ]
    ['my value', 'my new value']

As with attribute assignment, ``setattr()`` accepts lists:

.. sourcecode:: pycon

    >>> path.setattr(root, ["v1", "v2", "v3"])
    >>> [ el.text for el in path.find(root) ]
    ['v1', 'v2', 'v3']


Note, however, that indexing is only supported in this context if the children
exist.  Indexing non-existent children will not extend or create a list of
such children, but will raise an exception:

.. sourcecode:: pycon

    >>> path = objectify.ObjectPath(".{non}existing[1]")
    >>> path.setattr(root, "my value")
    Traceback (most recent call last):
      ...
    TypeError: creating indexed path attributes is not supported

It is worth noting that ObjectPath does not depend on the ``objectify`` module
or the ObjectifiedElement implementation.  It can also be used in combination
with Elements from the normal lxml.etree API.
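
For illustration, here is a small sketch that applies an ``ObjectPath`` to a
tree parsed with plain ``lxml.etree``.  The sample document is an assumption
made for this example:

.. sourcecode:: python

    from lxml import etree, objectify

    # A plain lxml.etree tree, parsed without the objectify parser.
    root = etree.XML("<root><b><c>first</c></b><b><c>second</c></b></root>")

    path = objectify.ObjectPath("root.b.c")
    exists = path.hasattr(root)

    # The result is an ordinary etree element, not an objectify element.
    c = path.find(root)
    is_objectified = isinstance(c, objectify.ObjectifiedElement)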


Python data types
=================

The objectify module knows about Python data types and tries its best to let
element content behave like them.  For example, they support the normal math
operators:

.. sourcecode:: pycon

    >>> root = objectify.fromstring(
    ...             "<root><a>5</a><b>11</b><c>true</c><d>hoi</d></root>")
    >>> root.a + root.b
    16
    >>> root.a += root.b
    >>> print(root.a)
    16

    >>> root.a = 2
    >>> print(root.a + 2)
    4
    >>> print(1 + root.a)
    3

    >>> print(root.c)
    True
    >>> root.c = False
    >>> if not root.c:
    ...     print("false!")
    false!

    >>> print(root.d + " test !")
    hoi test !
    >>> root.d = "%s - %s"
    >>> print(root.d % (1234, 12345))
    1234 - 12345

However, data elements continue to provide the objectify API.  This means that
sequence operations such as ``len()``, slicing and indexing (e.g. of strings)
cannot behave like those of the corresponding Python types.  Like all other
tree elements, they show the normal slicing behaviour of objectify elements:

.. sourcecode:: pycon

    >>> root = objectify.fromstring("<root><a>test</a><b>toast</b></root>")
    >>> print(root.a + ' me') # behaves like a string, right?
    test me
    >>> len(root.a) # but there's only one 'a' element!
    1
    >>> [ a.tag for a in root.a ]
    ['a']
    >>> print(root.a[0].tag)
    a

    >>> print(root.a)
    test
    >>> [ str(a) for a in root.a[:1] ]
    ['test']

If you need to run sequence operations on data types, you must ask the API for
the *real* Python value.  The string value is always available through the
normal ElementTree ``.text`` attribute.  Additionally, all data classes
provide a ``.pyval`` attribute that returns the value as plain Python type:

.. sourcecode:: pycon

    >>> root = objectify.fromstring("<root><a>test</a><b>5</b></root>")
    >>> root.a.text
    'test'
    >>> root.a.pyval
    'test'

    >>> root.b.text
    '5'
    >>> root.b.pyval
    5
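
The following sketch contrasts sequence operations on the element with the
same operations on its plain Python value, using a throwaway sample document:

.. sourcecode:: python

    from lxml import objectify

    root = objectify.fromstring("<root><a>test</a></root>")

    # On the element, len() counts the 'a' siblings.
    element_count = len(root.a)

    # On the plain value, the usual string operations apply.
    char_count = len(root.a.pyval)
    prefix = root.a.pyval[:2]
    upper = root.a.text.upper()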

Note, however, that both attributes are read-only in objectify.  If you want
to change values, just assign them directly to the attribute:

.. sourcecode:: pycon

    >>> root.a.text  = "25"
    Traceback (most recent call last):
      ...
    TypeError: attribute 'text' of 'StringElement' objects is not writable

    >>> root.a.pyval = 25
    Traceback (most recent call last):
      ...
    TypeError: attribute 'pyval' of 'StringElement' objects is not writable

    >>> root.a = 25
    >>> print(root.a)
    25
    >>> print(root.a.pyval)
    25

In other words, ``objectify`` data elements behave like immutable Python
types.  You can replace them, but not modify them.


Recursive tree dump
-------------------

To see the data types that are currently used, you can call the module level
``dump()`` function that returns a recursive string representation for
elements:

.. sourcecode:: pycon

    >>> root = objectify.fromstring("""
    ... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    ...   <a attr1="foo" attr2="bar">1</a>
    ...   <a>1.2</a>
    ...   <b>1</b>
    ...   <b>true</b>
    ...   <c>what?</c>
    ...   <d xsi:nil="true"/>
    ... </root>
    ... """)

    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 1 [IntElement]
          * attr1 = 'foo'
          * attr2 = 'bar'
        a = 1.2 [FloatElement]
        b = 1 [IntElement]
        b = True [BoolElement]
        c = 'what?' [StringElement]
        d = None [NoneElement]
          * xsi:nil = 'true'

You can freely switch between different types for the same child:

.. sourcecode:: pycon

    >>> root = objectify.fromstring("<root><a>5</a></root>")
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 5 [IntElement]

    >>> root.a = 'nice string!'
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 'nice string!' [StringElement]
          * py:pytype = 'str'

    >>> root.a = True
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = True [BoolElement]
          * py:pytype = 'bool'

    >>> root.a = [1, 2, 3]
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 1 [IntElement]
          * py:pytype = 'int'
        a = 2 [IntElement]
          * py:pytype = 'int'
        a = 3 [IntElement]
          * py:pytype = 'int'

    >>> root.a = (1, 2, 3)
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 1 [IntElement]
          * py:pytype = 'int'
        a = 2 [IntElement]
          * py:pytype = 'int'
        a = 3 [IntElement]
          * py:pytype = 'int'


Recursive string representation of elements
-------------------------------------------

Normally, elements use the standard string representation for str() that is
provided by lxml.etree.  You can enable a pretty-print representation for
objectify elements like this:

.. sourcecode:: pycon

    >>> objectify.enable_recursive_str()

    >>> root = objectify.fromstring("""
    ... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    ...   <a attr1="foo" attr2="bar">1</a>
    ...   <a>1.2</a>
    ...   <b>1</b>
    ...   <b>true</b>
    ...   <c>what?</c>
    ...   <d xsi:nil="true"/>
    ... </root>
    ... """)

    >>> print(str(root))
    root = None [ObjectifiedElement]
        a = 1 [IntElement]
          * attr1 = 'foo'
          * attr2 = 'bar'
        a = 1.2 [FloatElement]
        b = 1 [IntElement]
        b = True [BoolElement]
        c = 'what?' [StringElement]
        d = None [NoneElement]
          * xsi:nil = 'true'

This behaviour can be switched off in the same way:

.. sourcecode:: pycon

    >>> objectify.enable_recursive_str(False)


How data types are matched
==========================

Objectify uses two different types of Elements.  Structural Elements (or tree
Elements) represent the object tree structure.  Data Elements represent the
data containers at the leaves.  You can explicitly create tree Elements with
the ``objectify.Element()`` factory and data Elements with the
``objectify.DataElement()`` factory.

When Element objects are created, lxml.objectify must determine which
implementation class to use for them.  This is relatively easy for tree
Elements and less so for data Elements.  The algorithm is as follows:

1. If an element has children, use the default tree class.

2. If an element is defined as xsi:nil, use the NoneElement class.

3. If a "Python type hint" attribute is given, use this to determine the element
   class, see below.
 
4. If an XML Schema xsi:type hint is given, use this to determine the element
   class, see below.

5. Try to determine the element class from the text content type by trial and
   error.

6. If the element is a root node then use the default tree class.

7. Otherwise, use the default class for empty data classes.
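
The steps above can be observed on a small document.  The element names below
are chosen for this sketch only; each child is meant to trigger a different
rule of the lookup:

.. sourcecode:: python

    from lxml import objectify

    root = objectify.fromstring('''\
    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema"
          xmlns:py="http://codespeak.net/lxml/objectify/pytype">
      <tree><leaf>1</leaf></tree>
      <nil xsi:nil="true"/>
      <hinted py:pytype="str">5</hinted>
      <typed xsi:type="xsd:string">5</typed>
      <guessed>5</guessed>
      <empty/>
    </root>''')

    # Map each child tag to the name of the class chosen for it.
    classes = {child.tag: type(child).__name__
               for child in root.iterchildren()}

    # classes["tree"]    -> has children, so the tree class is used
    # classes["nil"]     -> xsi:nil wins over everything else
    # classes["hinted"]  -> the pytype hint overrides the guess from "5"
    # classes["typed"]   -> the xsi:type hint overrides the guess from "5"
    # classes["guessed"] -> no hint, so the text content decides
    # classes["empty"]   -> no content, so the empty data class is used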

You can change the default classes for tree Elements and empty data Elements
at setup time.  The ``ObjectifyElementClassLookup()`` call accepts two keyword
arguments, ``tree_class`` and ``empty_data_class``, that determine the Element
classes used in these cases.  By default, ``tree_class`` is a class called
``ObjectifiedElement`` and ``empty_data_class`` is a ``StringElement``.
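
As a sketch of such a setup, the following assumes an application that wants
empty leaf elements to read as ``None`` rather than as an empty string.  The
choice of ``NoneElement`` is an assumption made for this example, not a
recommendation:

.. sourcecode:: python

    from lxml import etree, objectify

    # Use NoneElement instead of the default StringElement for empty leaves.
    lookup = objectify.ObjectifyElementClassLookup(
        empty_data_class=objectify.NoneElement)

    parser = etree.XMLParser(remove_blank_text=True)
    parser.set_element_class_lookup(lookup)

    root = objectify.fromstring("<root><a/><b>5</b></root>", parser)

    empty_class = type(root.a).__name__
    empty_value = root.a.pyval

    # Elements with content are still matched by the normal rules.
    int_value = root.b.pyval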


Type annotations
----------------

The "type hint" mechanism uses an XML attribute defined as
``lxml.objectify.PYTYPE_ATTRIBUTE``.  It may contain any of the following
string values: int, long, float, bool, str, unicode, NoneType:

.. sourcecode:: pycon

    >>> print(objectify.PYTYPE_ATTRIBUTE)
    {http://codespeak.net/lxml/objectify/pytype}pytype
    >>> ns, name = objectify.PYTYPE_ATTRIBUTE[1:].split('}')

    >>> root = objectify.fromstring("""\
    ... <root xmlns:py='%s'>
    ...   <a py:pytype='str'>5</a>
    ...   <b py:pytype='int'>5</b>
    ...   <c py:pytype='NoneType' />
    ... </root>
    ... """ % ns)

    >>> print(root.a + 10)
    510
    >>> print(root.b + 10)
    15
    >>> print(root.c)
    None

Note that you can change the name and namespace used for this
attribute through the ``set_pytype_attribute_tag(tag)`` module
function, in case your application ever needs to.  There is also a
utility function ``annotate()`` that recursively generates this
attribute for the elements of a tree:

.. sourcecode:: pycon

    >>> root = objectify.fromstring("<root><a>test</a><b>5</b></root>")
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 'test' [StringElement]
        b = 5 [IntElement]

    >>> objectify.annotate(root)

    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 'test' [StringElement]
          * py:pytype = 'str'
        b = 5 [IntElement]
          * py:pytype = 'int'
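
Changing the attribute through ``set_pytype_attribute_tag()`` might look like
the sketch below.  The replacement attribute name is hypothetical, and since
the setting is global to the module, the example restores the default
afterwards:

.. sourcecode:: python

    from lxml import objectify

    # Hypothetical attribute name, used only for this illustration.
    objectify.set_pytype_attribute_tag("{http://example.org/types}kind")
    try:
        root = objectify.fromstring(
            '<root xmlns:t="http://example.org/types">'
            '<a t:kind="str">5</a>'
            '</root>')
        # The custom attribute is now honoured as the type hint.
        hinted_class = type(root.a).__name__
    finally:
        # Calling the function without an argument restores the default.
        objectify.set_pytype_attribute_tag()

    default_tag = objectify.PYTYPE_ATTRIBUTE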


XML Schema datatype annotation
------------------------------

A second way of specifying data type information uses XML Schema types as
element annotations.  Objectify knows those that can be mapped to normal
Python types:

.. sourcecode:: pycon

    >>> root = objectify.fromstring('''\
    ...    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    ...          xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    ...      <d xsi:type="xsd:double">5</d>
    ...      <i xsi:type="xsd:int"   >5</i>
    ...      <s xsi:type="xsd:string">5</s>
    ...    </root>
    ...    ''')
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        d = 5.0 [FloatElement]
          * xsi:type = 'xsd:double'
        i = 5 [IntElement]
          * xsi:type = 'xsd:int'
        s = '5' [StringElement]
          * xsi:type = 'xsd:string'

Again, there is a utility function ``xsiannotate()`` that recursively 
generates the "xsi:type" attribute for the elements of a tree:

.. sourcecode:: pycon

    >>> root = objectify.fromstring('''\
    ...    <root><a>test</a><b>5</b><c>true</c></root>
    ...    ''')
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 'test' [StringElement]
        b = 5 [IntElement]
        c = True [BoolElement]

    >>> objectify.xsiannotate(root)

    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        a = 'test' [StringElement]
          * xsi:type = 'xsd:string'
        b = 5 [IntElement]
          * xsi:type = 'xsd:integer'
        c = True [BoolElement]
          * xsi:type = 'xsd:boolean'

Note, however, that ``xsiannotate()`` will always use the first XML Schema
datatype that is defined for any given Python type, see also
`Defining additional data classes`_.

The utility function ``deannotate()`` can be used to get rid of 'py:pytype'
and/or 'xsi:type' information:

.. sourcecode:: pycon

    >>> root = objectify.fromstring('''\
    ... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    ...       xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    ...   <d xsi:type="xsd:double">5</d>
    ...   <i xsi:type="xsd:int"   >5</i>
    ...   <s xsi:type="xsd:string">5</s>
    ... </root>''')
    >>> objectify.annotate(root)
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        d = 5.0 [FloatElement]
          * xsi:type = 'xsd:double'
          * py:pytype = 'float'
        i = 5 [IntElement]
          * xsi:type = 'xsd:int'
          * py:pytype = 'int'
        s = '5' [StringElement]
          * xsi:type = 'xsd:string'
          * py:pytype = 'str'
    >>> objectify.deannotate(root)
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        d = 5 [IntElement]
        i = 5 [IntElement]
        s = 5 [IntElement]

You can control which type attributes should be de-annotated with the keyword
arguments 'pytype' (default: True) and 'xsi' (default: True).
``deannotate()`` can also remove 'xsi:nil' attributes by setting 'xsi_nil=True'
(default: False):

.. sourcecode:: pycon

    >>> root = objectify.fromstring('''\
    ... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    ...       xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    ...   <d xsi:type="xsd:double">5</d>
    ...   <i xsi:type="xsd:int"   >5</i>
    ...   <s xsi:type="xsd:string">5</s>
    ...   <n xsi:nil="true"/>
    ... </root>''')
    >>> objectify.annotate(root)
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        d = 5.0 [FloatElement]
          * xsi:type = 'xsd:double'
          * py:pytype = 'float'
        i = 5 [IntElement]
          * xsi:type = 'xsd:int'
          * py:pytype = 'int'
        s = '5' [StringElement]
          * xsi:type = 'xsd:string'
          * py:pytype = 'str'
        n = None [NoneElement]
          * xsi:nil = 'true'
          * py:pytype = 'NoneType'
    >>> objectify.deannotate(root, xsi_nil=True)
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        d = 5 [IntElement]
        i = 5 [IntElement]
        s = 5 [IntElement]
        n = u'' [StringElement]

Note that ``deannotate()`` does not remove the namespace declarations
of the ``pytype`` namespace by default.  To remove them as well, and
to generally clean up the namespace declarations in the document
(usually when done with the whole processing), pass the option
``cleanup_namespaces=True``.  This option is new in lxml 2.3.2.  In
older versions, use the function ``lxml.etree.cleanup_namespaces()``
instead.
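
As a rough sketch of the difference, the following compares the serialised
result with and without ``cleanup_namespaces=True``.  The helper
``build_tree()`` is a hypothetical name used only for this example:

.. sourcecode:: python

    from lxml import etree, objectify

    def build_tree():
        # objectify.Element() declares the py, xsd and xsi namespaces by default.
        root = objectify.Element("root")
        root.x = 5
        return root

    # Default: type attributes are removed, namespace declarations remain.
    kept = build_tree()
    objectify.deannotate(kept)
    kept_xml = etree.tostring(kept).decode()

    # With cleanup: unused namespace declarations are removed as well.
    cleaned = build_tree()
    objectify.deannotate(cleaned, cleanup_namespaces=True)
    cleaned_xml = etree.tostring(cleaned).decode()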


The DataElement factory
-----------------------

For convenience, the ``DataElement()`` factory creates an Element with a
Python value in one step.  You can pass the required Python type name or the
XSI type name:

.. sourcecode:: pycon

    >>> root = objectify.Element("root")
    >>> root.x = objectify.DataElement(5, _pytype="int")
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        x = 5 [IntElement]
          * py:pytype = 'int'

    >>> root.x = objectify.DataElement(5, _pytype="str", myattr="someval")
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        x = '5' [StringElement]
          * myattr = 'someval'
          * py:pytype = 'str'

    >>> root.x = objectify.DataElement(5, _xsi="integer")
    >>> print(objectify.dump(root))
    root = None [ObjectifiedElement]
        x = 5 [IntElement]
          * py:pytype = 'int'
          * xsi:type = 'xsd:integer'
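
The elements created this way behave like the Python values they wrap, even
before they are attached to a tree.  A small sketch, with variable names
chosen for illustration:

.. sourcecode:: python

    from lxml import objectify

    count = objectify.DataElement(5, _pytype="int")
    label = objectify.DataElement(5, _pytype="str")

    # Numeric data elements take part in normal arithmetic.
    count_plus_one = count + 1

    # The pyval property returns the plain Python value of a data element.
    count_value = count.pyval
    label_value = label.pyval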

XML Schema types reside in the XML Schema namespace, so ``DataElement()``
tries to prefix the xsi:type attribute value correctly for you:

.. sourcecode:: pycon

    >>> root = objectify.Element("root")
    >>> root.s = objectify.DataElement(5, _xsi="string")

    >>> objectify.deannotate(root, xsi=False)
    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <s xsi:type="xsd:string">5</s>
    </root>

``DataElement()`` uses a default nsmap to set these prefixes:

.. sourcecode:: pycon

    >>> el = objectify.DataElement('5', _xsi='string')
    >>> namespaces = list(el.nsmap.items())
    >>> namespaces.sort()
    >>> for prefix, namespace in namespaces:
    ...     print("%s - %s" % (prefix, namespace))
    py - http://codespeak.net/lxml/objectify/pytype
    xsd - http://www.w3.org/2001/XMLSchema
    xsi - http://www.w3.org/2001/XMLSchema-instance

    >>> print(el.get("{http://www.w3.org/2001/XMLSchema-instance}type"))
    xsd:string

While you can set custom namespace prefixes, it is necessary to provide valid
namespace information if you choose to do so:

.. sourcecode:: pycon

    >>> el = objectify.DataElement('5', _xsi='foo:string',
    ...          nsmap={'foo': 'http://www.w3.org/2001/XMLSchema'})
    >>> namespaces = list(el.nsmap.items())
    >>> namespaces.sort()
    >>> for prefix, namespace in namespaces:
    ...     print("%s - %s" % (prefix, namespace))
    foo - http://www.w3.org/2001/XMLSchema
    py - http://codespeak.net/lxml/objectify/pytype
    xsi - http://www.w3.org/2001/XMLSchema-instance

    >>> print(el.get("{http://www.w3.org/2001/XMLSchema-instance}type"))
    foo:string

Note how lxml chose a default prefix for the XML Schema Instance
namespace.  We can override it as in the following example:

.. sourcecode:: pycon

    >>> el = objectify.DataElement('5', _xsi='foo:string',
    ...          nsmap={'foo': 'http://www.w3.org/2001/XMLSchema',
    ...                 'myxsi': 'http://www.w3.org/2001/XMLSchema-instance'})
    >>> namespaces = list(el.nsmap.items())
    >>> namespaces.sort()
    >>> for prefix, namespace in namespaces:
    ...     print("%s - %s" % (prefix, namespace))
    foo - http://www.w3.org/2001/XMLSchema
    myxsi - http://www.w3.org/2001/XMLSchema-instance
    py - http://codespeak.net/lxml/objectify/pytype

    >>> print(el.get("{http://www.w3.org/2001/XMLSchema-instance}type"))
    foo:string

Care must be taken if different namespace prefixes have been used for the same
namespace.  Namespace information gets merged to avoid duplicate definitions
when adding a new sub-element to a tree, but this mechanism does not adapt the
prefixes of attribute values:

.. sourcecode:: pycon

    >>> root = objectify.fromstring("""<root xmlns:schema="http://www.w3.org/2001/XMLSchema"/>""")
    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns:schema="http://www.w3.org/2001/XMLSchema"/>

    >>> s = objectify.DataElement("17", _xsi="string")
    >>> print(etree.tostring(s, pretty_print=True))
    <value xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="str" xsi:type="xsd:string">17</value>

    >>> root.s = s
    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns:schema="http://www.w3.org/2001/XMLSchema">
      <s xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="str" xsi:type="xsd:string">17</s>
    </root>

It is your responsibility to fix the prefixes of attribute values if you
choose to deviate from the standard prefixes.  A convenient way to do this for
xsi:type attributes is to use the ``xsiannotate()`` utility:

.. sourcecode:: pycon

    >>> objectify.xsiannotate(root)
    >>> print(etree.tostring(root, pretty_print=True))
    <root xmlns:schema="http://www.w3.org/2001/XMLSchema">
      <s xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="str" xsi:type="schema:string">17</s>
    </root>

Of course, it is discouraged to use different prefixes for one and the same
namespace when building up an objectify tree.
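
One way to stay consistent, sketched here under the assumption that you know
the prefix the target tree already uses, is to pass that prefix to
``DataElement()`` through ``nsmap``.  The constant names are example-local:

.. sourcecode:: python

    from lxml import objectify

    XSD_NS = "http://www.w3.org/2001/XMLSchema"
    XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

    root = objectify.fromstring('<root xmlns:schema="%s"/>' % XSD_NS)

    # Reuse the prefix that the tree already declares for the XSD namespace.
    s = objectify.DataElement("17", _xsi="schema:string",
                              nsmap={"schema": XSD_NS})
    root.s = s

    # The attribute value already matches the prefix declared on the root.
    xsi_type = root.s.get(XSI_TYPE)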


Defining additional data classes
--------------------------------

You can plug additional data classes into objectify that will be used in
exactly the same way as the predefined types.  Data classes can either inherit
from ``ObjectifiedDataElement`` directly or from one of the specialised
classes like ``NumberElement`` or ``BoolElement``.  The numeric types require
an initial call to the NumberElement method ``self._setValueParser(function)``
to set their type conversion function (string -> numeric Python type).  This
call should be placed into the element ``_init()`` method.

The registration of data classes uses the ``PyType`` class:

.. sourcecode:: pycon

    >>> class ChristmasDate(objectify.ObjectifiedDataElement):
    ...     def call_santa(self):
    ...         print("Ho ho ho!")

    >>> def checkChristmasDate(date_string):
    ...     if not date_string.startswith('24.12.'):
    ...         raise ValueError # or TypeError

    >>> xmas_type = objectify.PyType('date', checkChristmasDate, ChristmasDate)

The PyType constructor takes a string type name, an (optional) callable type
check and the custom data class.  If a type check is provided, it must accept a
string as argument and raise ValueError or TypeError if it cannot handle the
string value.

PyTypes are used if an element carries a ``py:pytype`` attribute denoting its
data type or, in the absence of such an attribute, if the given type check
callable does not raise a ValueError/TypeError exception when applied to the
element text.

If you want, you can also register this type under an XML Schema type name:

.. sourcecode:: pycon

    >>> xmas_type.xmlSchemaTypes = ("date",)

XML Schema types will be considered if the element has an ``xsi:type``
attribute that specifies its data type.  The line above binds the XSD type
``date`` to the newly defined Python type.  Note that this must be done before
the next step, which is to register the type.  Then you can use it:

.. sourcecode:: pycon

    >>> xmas_type.register()

    >>> root = objectify.fromstring(
    ...             "<root><a>24.12.2000</a><b>12.24.2000</b></root>")
    >>> root.a.call_santa()
    Ho ho ho!
    >>> root.b.call_santa()
    Traceback (most recent call last):
      ...
    AttributeError: no such child: call_santa
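
The numeric base classes mentioned at the start of this section follow the
same registration pattern.  The sketch below is an illustration only: the
``percent`` type name, ``parse_percent()`` and ``PercentElement`` are
hypothetical and not part of lxml.  It sets the value parser in ``_init()``
and reuses the same function as the type check:

.. sourcecode:: python

    from lxml import objectify

    def parse_percent(text):
        # Hypothetical format: a number followed by '%', e.g. "12.5%".
        if not text.endswith('%'):
            raise ValueError("not a percentage: %r" % text)
        return float(text[:-1]) / 100.0

    class PercentElement(objectify.NumberElement):
        def _init(self):
            # Conversion function: string -> numeric Python value.
            self._setValueParser(parse_percent)

    percent_type = objectify.PyType('percent', parse_percent, PercentElement)
    percent_type.register()
    try:
        root = objectify.fromstring("<root><rate>12.5%</rate><n>3</n></root>")
        rate_is_custom = isinstance(root.rate, PercentElement)
        rate_value = root.rate.pyval     # parsed through parse_percent()
        scaled = root.rate * 200         # arithmetic uses the parsed value
        n_is_int = isinstance(root.n, objectify.IntElement)
    finally:
        # Keep the global type registry clean after the example.
        percent_type.unregister()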

If you need to specify dependencies between the type check functions, you can
pass a sequence of type names through the ``before`` and ``after`` keyword
arguments of the ``register()`` method.  The PyType will then try to register
itself before or after the respective types, as long as they are currently
registered.  Note that this only impacts the currently registered types at the
time of registration.  Types that are registered later on will not care about
the dependencies of already registered types.
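
As a sketch of why the ordering can matter, consider a value such as
``01234`` that the built-in ``int`` type would otherwise claim.  The
``postcode`` type, ``check_postcode()`` and ``PostcodeElement`` are
hypothetical names for this example; registering the type before ``int``
lets its check run first:

.. sourcecode:: python

    from lxml import objectify

    class PostcodeElement(objectify.ObjectifiedDataElement):
        """Keeps five-digit codes as text so leading zeros survive."""

    def check_postcode(text):
        if len(text) != 5 or not text.isdigit():
            raise ValueError("not a five-digit code")

    postcode_type = objectify.PyType('postcode', check_postcode, PostcodeElement)

    # Ask for this check to be tried before the built-in 'int' check.
    postcode_type.register(before=('int',))
    try:
        root = objectify.fromstring(
            "<root><postcode>01234</postcode><n>42</n></root>")
        postcode_is_custom = isinstance(root.postcode, PostcodeElement)
        postcode_text = root.postcode.text
        n_is_int = isinstance(root.n, objectify.IntElement)
    finally:
        postcode_type.unregister()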

If you provide XML Schema type information, this will override the type check
function defined above:

.. sourcecode:: pycon

    >>> root = objectify.fromstring('''\
    ...    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    ...      <a xsi:type="date">12.24.2000</a>
    ...    </root>
    ...    ''')
    >>> print(root.a)
    12.24.2000
    >>> root.a.call_santa()
    Ho ho ho!

To unregister a type, call its ``unregister()`` method:

.. sourcecode:: pycon

    >>> root.a.call_santa()
    Ho ho ho!
    >>> xmas_type.unregister()
    >>> root.a.call_santa()
    Traceback (most recent call last):
      ...
    AttributeError: no such child: call_santa

Be aware, though, that this does not immediately apply to elements to which
there already is a Python reference.  Their Python class will only be changed
after all references are gone and the Python object is garbage collected.


Advanced element class lookup
-----------------------------

In some cases, the normal data class setup is not enough.  Being based
on ``lxml.etree``, however, ``lxml.objectify`` supports very
fine-grained control over the Element classes used in a tree.  All you
have to do is configure a different `class lookup`_ mechanism (or
write one yourself).

.. _`class lookup`: element_classes.html

The first step for the setup is to create a new parser that builds
objectify documents.  The objectify API is meant for data-centric XML
(as opposed to document XML with mixed content).  Therefore, we
configure the parser to let it remove whitespace-only text from the
parsed document if it is not enclosed by an XML element.  Note that
this alters the document infoset, so if you consider the removed
spaces as data in your specific use case, you should go with a normal
parser and just set the element class lookup.  Most applications,
however, will work fine with the following setup:

.. sourcecode:: pycon

    >>> parser = objectify.makeparser(remove_blank_text=True)

Internally, this is equivalent to:

.. sourcecode:: pycon

    >>> parser = etree.XMLParser(remove_blank_text=True)

    >>> lookup = objectify.ObjectifyElementClassLookup()
    >>> parser.set_element_class_lookup(lookup)
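
To illustrate the trade-off described above, the following sketch parses the
same input once with the objectify parser and once with a normal parser that
only has the class lookup replaced.  The variable names are chosen for this
example:

.. sourcecode:: python

    from lxml import etree, objectify

    XML = "<root>\n  <a>1</a>\n  <b>text</b>\n</root>"

    # Objectify parser: whitespace-only text between elements is dropped.
    compact_parser = objectify.makeparser(remove_blank_text=True)
    compact = objectify.fromstring(XML, compact_parser)

    # Normal parser with only the class lookup set: whitespace is kept.
    plain_parser = etree.XMLParser()
    plain_parser.set_element_class_lookup(
        objectify.ObjectifyElementClassLookup())
    preserved = etree.fromstring(XML, plain_parser)

    compact_xml = etree.tostring(compact).decode()
    preserved_xml = etree.tostring(preserved).decode()

Both trees support the objectify API; they differ only in whether the
whitespace between the elements is still part of the document.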

If you want to change the lookup scheme, say, to get additional
support for `namespace specific classes`_, you can register the
objectify lookup as a fallback of the namespace lookup.  In this case,
however, you have to take care that the namespace classes inherit from
``objectify.ObjectifiedElement``, not only from the normal
``lxml.etree.ElementBase``, so that they support the ``objectify``
API.  The above setup code then becomes:

.. sourcecode:: pycon

    >>> lookup = etree.ElementNamespaceClassLookup(
    ...                   objectify.ObjectifyElementClassLookup() )
    >>> parser.set_element_class_lookup(lookup)

.. _`namespace specific classes`: element_classes.html#namespace-class-lookup

See the documentation on `class lookup`_ schemes for more information.
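
A minimal sketch of the namespace-specific setup described above follows.
The namespace URI, the ``ItemElement`` class and the document layout are
assumptions made for this example.  Note that the custom class inherits from
``objectify.ObjectifiedElement`` so that attribute access to children keeps
working:

.. sourcecode:: python

    from lxml import etree, objectify

    NS = "http://example.org/shop"   # hypothetical namespace

    class ItemElement(objectify.ObjectifiedElement):
        def total(self):
            # Children are still reachable through the objectify API.
            return self.price * self.quantity

    parser = etree.XMLParser(remove_blank_text=True)
    lookup = etree.ElementNamespaceClassLookup(
        objectify.ObjectifyElementClassLookup())
    parser.set_element_class_lookup(lookup)

    # Map the tag 'item' in this namespace to the custom class.
    lookup.get_namespace(NS)["item"] = ItemElement

    root = objectify.fromstring(
        '<order xmlns="%s"><item><price>2.5</price>'
        '<quantity>4</quantity></item></order>' % NS,
        parser)

    item = root.item
    item_total = item.total()

Elements without a registered namespace class, such as ``price`` here, fall
through to the objectify lookup and get the usual data classes.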


What is different from lxml.etree?
==================================

Such a different Element API has some side effects on the normal behaviour of
the rest of the API.

* ``len(<element>)`` returns the sibling count, not the number of children of
  ``<element>``.  You can retrieve the number of children with the
  ``countchildren()`` method.

* Iteration over elements does not yield the children, but the siblings.  You
  can access all children with the ``iterchildren()`` method on elements or
  retrieve a list by calling the ``getchildren()`` method.

* The ``find()``, ``findall()`` and ``findtext()`` methods require a different
  implementation based on ETXPath.  In ``lxml.etree``, they use a Python
  implementation based on the original iteration scheme.  As a consequence,
  the objectify versions may not be 100% backwards compatible, but in return
  they support any XPath expression.
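
The first two differences can be seen in a short sketch.  The document and
variable names are chosen for illustration:

.. sourcecode:: python

    from lxml import objectify

    root = objectify.fromstring("<root><a>1</a><a>2</a><b>3</b></root>")

    # len() counts the element and its same-named siblings.
    sibling_count = len(root.a)

    # countchildren() counts the children of the element.
    child_count = root.countchildren()

    # Iteration walks the siblings, iterchildren() walks the children.
    sibling_values = [el.pyval for el in root.a]
    child_tags = [el.tag for el in root.iterchildren()]
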