==============================
How to read the source of lxml
==============================

:Author:
  Stefan Behnel

.. meta::
  :description: How to read and work on the source code of lxml
  :keywords: lxml, XML, Cython, source code, develop, comprehend, understand

This document describes how to read the source code of lxml_ and how
to start working on it. You might also be interested in the companion
document that describes `how to build lxml from sources`_.

.. _lxml: http://lxml.de/
.. _`how to build lxml from sources`: build.html
.. _`ReStructured Text`: http://docutils.sourceforge.net/rst.html
.. _epydoc: http://epydoc.sourceforge.net/
.. _docutils: http://docutils.sourceforge.net/
.. _`C-level API`: capi.html

.. contents::
..
   1  What is Cython?
   2  Where to start?
     2.1  Concepts
     2.2  The documentation
   3  lxml.etree
   4  lxml.objectify
   5  lxml.html


What is Cython?
===============

.. _Cython: http://cython.org/
.. _Pyrex: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/

Cython_ is the language that lxml is written in. It is a very
Python-like language that was specifically designed for writing Python
extension modules.

Cython (or actually its predecessor Pyrex_ at the time) was chosen as
an implementation language for lxml because it makes it very easy to
interface with both the Python world and external C code. Cython
generates all the necessary glue code for the Python API, including
Python types, calling conventions and reference counting. On the
other side of the table, calling into C code requires no more than
declaring the signature of the function and maybe some variables as
being C types, pointers or structs, and then calling it. The rest of
the code is just plain Python code.

The Cython language is so close to Python that the Cython compiler can
actually compile many, many Python programs to C without major
modifications. But the real speed gains of a C compilation come from
type annotations that were added to the language and that allow Cython
to generate very efficient C code.

Even if you are not familiar with Cython, you should keep in mind that
a slow implementation of a feature is better than none. So, if you
want to contribute and have an idea what code you want to write, feel
free to start with a pure Python implementation. Chances are, if you
get the change officially accepted and integrated, others will take
the time to optimise it so that it runs fast in Cython.


Where to start?
===============

First of all, read `how to build lxml from sources`_ to learn how to
retrieve the source code from the GitHub repository and how to
build it. The source code lives in the subdirectory ``src`` of the
checkout.

The main extension modules in lxml are ``lxml.etree`` and
``lxml.objectify``. All main modules have the file extension
``.pyx``, which shows the descent from Pyrex. As usual in Python,
the main files start with a short description and a couple of imports.
Cython distinguishes between the run-time ``import`` statement (as
known from Python) and the compile-time ``cimport`` statement, which
imports C declarations, either from external libraries or from other
Cython modules.


Concepts
--------

lxml's tree API is based on proxy objects. That means that every
Element object (or rather ``_Element`` object) is a proxy for a
libxml2 node structure. The class declaration is (mainly)::

    cdef class _Element:
        cdef _Document _doc
        cdef xmlNode* _c_node

It is a naming convention that C variables and C level class members
that are passed into libxml2 start with the prefix ``c_`` (commonly
libxml2 struct pointers), and that C level class members are prefixed
with an underscore. So you will often see names like ``c_doc`` for an
``xmlDoc*`` variable (or ``c_node`` for an ``xmlNode*``), or the above
``_c_node`` for a class member that points to an ``xmlNode`` struct
(or ``_c_doc`` for an ``xmlDoc*``).

It is important to know that every proxy in lxml has a factory
function that properly sets up C level members. Proxy objects must
*never* be instantiated outside of that factory. For example, to
instantiate an ``_Element`` object or its subclasses, you must always
call its factory function::

    cdef xmlNode* c_node
    cdef _Document doc
    cdef _Element element
    ...
    element = _elementFactory(doc, c_node)

Good places to see how this factory is used are the Element methods
``getparent()``, ``getnext()`` and ``getprevious()``.


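
The proxy-and-factory idea can be sketched in plain Python. This is
only an illustration and not lxml's implementation: ``Node``,
``ElementProxy`` and ``element_factory`` are hypothetical names, a
Python object stands in for the libxml2 ``xmlNode`` struct, and the
reuse of an existing proxy is an assumption made for the sketch::

    class Node:
        """Stand-in for a C level tree node (hypothetical, not libxml2)."""
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.proxy = None  # back-reference to the Python proxy, if any


    class ElementProxy:
        """Proxy object; its members are set up by the factory only."""
        _doc = None
        _node = None

        @property
        def tag(self):
            return self._node.name

        def getparent(self):
            # Navigation never builds proxies itself, it asks the factory.
            parent = self._node.parent
            if parent is None:
                return None
            return element_factory(self._doc, parent)


    def element_factory(doc, node):
        """Return the proxy for ``node``, creating it on first access."""
        if node.proxy is not None:
            return node.proxy
        proxy = ElementProxy.__new__(ElementProxy)
        proxy._doc = doc
        proxy._node = node
        node.proxy = proxy
        return proxy


    root = Node("root")
    child = Node("child", parent=root)
    doc = object()  # placeholder for a document object

    element = element_factory(doc, child)
    parent = element.getparent()

Because all proxies come from one function, there is a single place
that guarantees the document and node members are always set together.
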
The documentation
-----------------

An important part of lxml is the documentation that lives in the
``doc`` directory. It describes a large part of the API and comprises
a lot of example code in the form of doctests.

The documentation is written in the `ReStructured Text`_ format, a
very powerful text markup language that looks almost like plain text.
It is part of the docutils_ package.

The project web site of lxml_ is completely generated from these text
documents. Even the side menu is just collected from the table of
contents that the ReST processor writes into each HTML page.
Obviously, we use lxml for this.

The easiest way to generate the HTML pages is by calling::

    make html

This will call the script ``doc/mkhtml.py`` to run the ReST processor
on the files. After generating an HTML page the script parses it back
in to build the side menu, and injects the complete menu into each
page at the very end.

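
The parse-back step can be sketched as follows. This is a simplified
illustration and not the code in ``doc/mkhtml.py``: the sample page,
the ``contents`` class name, the page name ``page.html`` and the
``collect_menu`` helper are all assumptions, and the import falls back
to the standard library's ElementTree where lxml is not installed::

    try:
        from lxml import etree
    except ImportError:  # compatible API for the calls used below
        import xml.etree.ElementTree as etree

    # Made-up, well-formed sample of a generated page.
    PAGE = """\
    <html><body>
    <div class="contents">
    <ul>
    <li><a href="#what-is-cython">What is Cython?</a></li>
    <li><a href="#where-to-start">Where to start?</a></li>
    </ul>
    </div>
    <p>Page text.</p>
    </body></html>"""


    def collect_menu(page_source, page_name):
        """Collect (title, link) pairs from the table of contents."""
        root = etree.fromstring(page_source)
        entries = []
        for div in root.iter("div"):
            if "contents" not in (div.get("class") or "").split():
                continue
            for link in div.iter("a"):
                entries.append((link.text, page_name + link.get("href")))
        return entries


    menu = collect_menu(PAGE, "page.html")

A real menu builder would run this over every generated page and then
write the combined entries back into each of them.
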
Running the ``make`` command will also generate the API documentation
if you have epydoc_ installed. The epydoc package will import and
introspect the extension modules and also introspect and parse the
Python modules of lxml. The aggregated information will then be
written out into an HTML documentation site.


lxml.etree
==========

The main module, ``lxml.etree``, is in the file `lxml.etree.pyx
<https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx>`_. It
implements the main functions and types of the ElementTree API, as
well as all the factory functions for proxies. It is the best place
to start if you want to find out how a specific feature is
implemented.

At the very end of the file, it contains a series of ``include``
statements that merge the rest of the implementation into the
generated C code. Yes, you read right: no importing, no source file
namespacing, just plain good old include and a huge C code result of
more than 100,000 lines that we throw right into the C compiler.

The main include files are:

apihelpers.pxi
    Private C helper functions. Except for the factory functions,
    most of the little functions that are used all over the place are
    defined here. This includes things like reading out the text
    content of a libxml2 tree node, checking input from the API level,
    creating a new Element node or handling attribute values. If you
    want to work on the lxml code, you should keep these functions in
    the back of your head, as they will definitely make your life
    easier.

classlookup.pxi
    Element class lookup mechanisms. The main API and engines for
    those who want to define custom Element classes and inject them
    into lxml.

docloader.pxi
    Support for custom document loaders. Base class and registry for
    custom document resolvers.

extensions.pxi
    Infrastructure for extension functions in XPath/XSLT, including
    XPath value conversion and function registration.

iterparse.pxi
    Incremental XML parsing. An iterator class that builds iterparse
    events while parsing.

nsclasses.pxi
    Namespace implementation and registry. The registry and engine
    for Element classes that use the ElementNamespaceClassLookup
    scheme.

parser.pxi
    Parsers for XML and HTML. This is the main parser engine. It's
    the reason why you can parse a document from various sources in
    two lines of Python code. It's definitely not the right place to
    start reading lxml's source code.

parsertarget.pxi
    An ElementTree compatible parser target implementation based on
    the SAX2 interface of libxml2.

proxy.pxi
    Very low-level functions for memory allocation/deallocation
    and Element proxy handling. Ignoring this at the beginning
    will save your head from exploding.

public-api.pxi
    The set of C functions that are exported to other extension
    modules at the C level. For example, ``lxml.objectify`` makes use
    of these. See the `C-level API`_ documentation.

readonlytree.pxi
    A separate read-only implementation of the Element API. This is
    used in places where non-intrusive access to a tree is required,
    such as the ``PythonElementClassLookup`` or XSLT extension
    elements.

saxparser.pxi
    SAX-like parser interfaces as known from ElementTree's TreeBuilder.

serializer.pxi
    XML output functions. Basically everything that creates byte
    sequences from XML trees.

xinclude.pxi
    XInclude support.

xmlerror.pxi
    Error log handling. All error messages that libxml2 generates
    internally walk through the code in this file to end up in lxml's
    Python level error logs.

    At the end of the file, you will find a long list of named error
    codes. It is generated from the libxml2 HTML documentation (using
    lxml, of course). See the script ``update-error-constants.py``
    for this.

xmlid.pxi
    XMLID and IDDict, a dictionary-like way to find Elements by their
    XML-ID attribute.

xpath.pxi
    XPath evaluators.

xslt.pxi
    XSL transformations, including the ``XSLT`` class, document lookup
    handling and access control.

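
As a usage-level illustration of the incremental parsing that
``iterparse.pxi`` implements, the following sketch walks over parse
events. It only exercises the public ``iterparse()`` function and says
nothing about the internals of the include file. The sample document
is made up, and the import falls back to the standard library's
ElementTree, which provides the same call::

    from io import BytesIO

    try:
        from lxml import etree
    except ImportError:  # same call signature for this example
        import xml.etree.ElementTree as etree

    XML = b"<root><a>1</a><b>2</b></root>"

    # Collect (event, tag) pairs in the order the parser reports them.
    events = []
    for event, element in etree.iterparse(BytesIO(XML),
                                          events=("start", "end")):
        events.append((event, element.tag))

Each element is reported once when its start tag is seen and once when
it is complete, so ``root`` appears first and last in ``events``.
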
The different schema languages (DTD, RelaxNG, XML Schema and
Schematron) are implemented in the following include files:

* dtd.pxi
* relaxng.pxi
* schematron.pxi
* xmlschema.pxi


Python modules
==============

The ``lxml`` package also contains a number of pure Python modules:

builder.py
    The E-factory and the ElementBuilder class. These provide a
    simple interface to XML tree generation.

cssselect.py
    A CSS selector implementation based on XPath. The main class is
    called ``CSSSelector``.

doctestcompare.py
    A relaxed comparison scheme for XML/HTML markup in doctest.

ElementInclude.py
    XInclude-like document inclusion, compatible with ElementTree.

_elementpath.py
    XPath-like path language, compatible with ElementTree.

sax.py
    SAX2 compatible interfaces to copy lxml trees from/to SAX compatible
    tools.

usedoctest.py
    Wrapper module for ``doctestcompare.py`` that simplifies its usage
    from inside a doctest.


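
The path language implemented in ``_elementpath.py`` is what backs
calls such as ``find()`` and ``findall()``. A minimal sketch, using a
made-up document and falling back to the standard library's
ElementTree, which supports the same expressions::

    try:
        from lxml import etree
    except ImportError:  # ElementTree accepts the same path expressions
        import xml.etree.ElementTree as etree

    root = etree.fromstring(
        "<doc>"
        "<sec id='a'><p>one</p><p>two</p></sec>"
        "<sec id='b'><p>three</p></sec>"
        "</doc>"
    )

    first = root.find("sec/p")              # first match only
    paragraphs = root.findall(".//p")       # all descendants named p
    in_b = root.findall("sec[@id='b']/p")   # attribute predicate

    texts = [p.text for p in paragraphs]

Only this small ElementTree-compatible subset is handled here. Full
XPath expressions go through the ``xpath()`` method of lxml instead.
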
lxml.objectify
==============

A Cython-implemented extension module that uses the public C-API of
lxml.etree. It provides a Python object-like interface to XML trees.
The implementation resides in the file `lxml.objectify.pyx
<https://github.com/lxml/lxml/blob/master/src/lxml/lxml.objectify.pyx>`_.


lxml.html
=========

A specialised toolkit for HTML handling, based on lxml.etree. This is
implemented in pure Python.