==============================
How to read the source of lxml
==============================

:Author:
  Stefan Behnel

.. meta::
  :description: How to read and work on the source code of lxml
  :keywords: lxml, XML, Cython, source code, develop, comprehend, understand

This document describes how to read the source code of lxml_ and how
to start working on it.  You might also be interested in the companion
document that describes `how to build lxml from sources`_.

.. _lxml: http://lxml.de/
.. _`how to build lxml from sources`: build.html
.. _`ReStructured Text`: http://docutils.sourceforge.net/rst.html
.. _epydoc: http://epydoc.sourceforge.net/
.. _docutils: http://docutils.sourceforge.net/
.. _`C-level API`: capi.html

.. contents::
..
   1  What is Cython?
   2  Where to start?
     2.1  Concepts
     2.2  The documentation
   3  lxml.etree
   4  lxml.objectify
   5  lxml.html


What is Cython?
===============

.. _Cython: http://cython.org/
.. _Pyrex: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/

Cython_ is the language that lxml is written in.  It is a very
Python-like language that was specifically designed for writing Python
extension modules.

Cython (or rather its predecessor Pyrex_ at the time) was chosen as
the implementation language for lxml because it makes it very easy to
interface with both the Python world and external C code.  Cython
generates all the necessary glue code for the Python API, including
Python types, calling conventions and reference counting.  On the
other side of the table, calling into C code takes no more than
declaring the signature of the function, and maybe some variables as
C types, pointers or structs, and then calling it.  The rest of the
code is just plain Python code.

The Cython language is so close to Python that the Cython compiler can
actually compile many, many Python programs to C without major
modifications.  But the real speed gains of a C compilation come from
type annotations that were added to the language and that allow Cython
to generate very efficient C code.

Even if you are not familiar with Cython, you should keep in mind that
a slow implementation of a feature is better than none.  So, if you
want to contribute and have an idea of what code you want to write, feel
free to start with a pure Python implementation.  Chances are, if you
get the change officially accepted and integrated, others will take
the time to optimise it so that it runs fast in Cython.


Where to start?
===============

First of all, read `how to build lxml from sources`_ to learn how to
retrieve the source code from the GitHub repository and how to
build it.  The source code lives in the subdirectory ``src`` of the
checkout.

The main extension modules in lxml are ``lxml.etree`` and
``lxml.objectify``.  All main modules have the file extension
``.pyx``, which shows the descent from Pyrex.  As usual in Python,
the main files start with a short description and a couple of imports.
Cython distinguishes between the run-time ``import`` statement (as
known from Python) and the compile-time ``cimport`` statement, which
imports C declarations, either from external libraries or from other
Cython modules.


Concepts
--------

lxml's tree API is based on proxy objects.  That means, every Element
object (or rather ``_Element`` object) is a proxy for a libxml2 node
structure.  The class declaration is (mainly)::

    cdef class _Element:
        cdef _Document _doc
        cdef xmlNode* _c_node

It is a naming convention that C variables and C level class members
that are passed into libxml2 are prefixed with ``c_`` (commonly
libxml2 struct pointers), and that C level class members are prefixed
with an underscore.  So you will often see names like ``c_doc`` for an
``xmlDoc*`` variable (or ``c_node`` for an ``xmlNode*``), or the above
``_c_node`` for a class member that points to an ``xmlNode`` struct
(or ``_c_doc`` for an ``xmlDoc*``).

It is important to know that every proxy in lxml has a factory
function that properly sets up C level members.  Proxy objects must
*never* be instantiated outside of that factory.  For example, to
instantiate an ``_Element`` object or its subclasses, you must always
call its factory function::

    cdef xmlNode* c_node
    cdef _Document doc
    cdef _Element element
    ...
    element = _elementFactory(doc, c_node)

Good places to see how this factory is used are the Element methods
``getparent()``, ``getnext()`` and ``getprevious()``.
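
The proxy-plus-factory idea can be sketched in plain Python.  This is
only an illustration and not lxml's implementation: the names ``Node``,
``ElementProxy`` and ``element_factory`` are made up, ``Node`` merely
stands in for a libxml2 node struct, and the rule that the factory hands
back an already existing proxy for a node is an assumption made for the
sake of the example.

```python
class Node:
    """Hypothetical stand-in for a C-level tree node (not a libxml2 struct)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.proxy = None  # back-reference to the Python proxy, if one exists


class ElementProxy:
    """Python-level proxy; only element_factory() is supposed to create it."""

    def __init__(self):
        self._doc = None
        self._node = None

    def getparent(self):
        # Navigation never builds a proxy directly, it goes through the factory.
        parent = self._node.parent
        if parent is None:
            return None
        return element_factory(self._doc, parent)


def element_factory(doc, node):
    """Return the proxy for ``node``, creating and wiring it up if needed."""
    if node.proxy is not None:
        return node.proxy  # assumption: at most one live proxy per node
    element = ElementProxy()
    element._doc = doc
    element._node = node
    node.proxy = element
    return element


doc = object()  # placeholder for a document object
root_node = Node("root")
child_node = Node("child", parent=root_node)

child = element_factory(doc, child_node)
root = child.getparent()
```

Because all setup lives in one function, every proxy that code like
``getparent()`` returns is guaranteed to have its members initialised,
which is the point of the rule stated above.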


The documentation
-----------------

An important part of lxml is the documentation that lives in the
``doc`` directory.  It describes a large part of the API and comprises
a lot of example code in the form of doctests.

The documentation is written in the `ReStructured Text`_ format, a
very powerful text markup language that looks almost like plain text.
It is part of the docutils_ package.

The project web site of lxml_ is completely generated from these text
documents.  Even the side menu is just collected from the table of
contents that the ReST processor writes into each HTML page.
Obviously, we use lxml for this.

The easiest way to generate the HTML pages is by calling::

    make html

This will call the script ``doc/mkhtml.py`` to run the ReST processor
on the files.  After generating an HTML page, the script parses it back
in to build the side menu, and injects the complete menu into each
page at the very end.
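
Collecting menu entries from a generated page could look roughly like
the following sketch.  It does not reproduce ``doc/mkhtml.py``: the
sample page, the ``contents`` class name on the table of contents and
the function ``collect_menu`` are assumptions for illustration.  The
code sticks to the ElementTree-compatible API, so it falls back to the
standard library if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # ElementTree-compatible fallback
    import xml.etree.ElementTree as etree

# Made-up, well-formed sample of what a generated page might contain.
PAGE = b"""<html><body>
<div class="contents topic">
<ul>
<li><a href="#what-is-cython">What is Cython?</a></li>
<li><a href="#where-to-start">Where to start?</a></li>
</ul>
</div>
<div class="section"><h1>What is Cython?</h1></div>
</body></html>"""


def collect_menu(page_bytes, page_name):
    """Return (title, target) pairs found in the page's table of contents."""
    root = etree.fromstring(page_bytes)
    entries = []
    for div in root.iter("div"):
        classes = (div.get("class") or "").split()
        if "contents" not in classes:
            continue  # only the table of contents is of interest
        for link in div.iter("a"):
            entries.append((link.text, page_name + link.get("href")))
    return entries


menu = collect_menu(PAGE, "lxml-source-howto.html")
```

Running this over every page and concatenating the results would give
the raw material for a site-wide menu.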

Running the ``make`` command will also generate the API documentation
if you have epydoc_ installed.  The epydoc package will import and
introspect the extension modules and also introspect and parse the
Python modules of lxml.  The aggregated information will then be
written out into an HTML documentation site.


lxml.etree
==========

The main module, ``lxml.etree``, is in the file `lxml.etree.pyx
<https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx>`_.  It
implements the main functions and types of the ElementTree API, as
well as all the factory functions for proxies.  It is the best place
to start if you want to find out how a specific feature is
implemented.

At the very end of the file, it contains a series of ``include``
statements that merge the rest of the implementation into the
generated C code.  Yes, you read that right: no importing, no source
file namespacing, just plain good old include and a huge C code result
of more than 100,000 lines that we throw right into the C compiler.

The main include files are:

apihelpers.pxi
    Private C helper functions.  Except for the factory functions,
    most of the little functions that are used all over the place are
    defined here.  This includes things like reading out the text
    content of a libxml2 tree node, checking input from the API level,
    creating a new Element node or handling attribute values.  If you
    want to work on the lxml code, you should keep these functions in
    the back of your mind, as they will definitely make your life
    easier.

classlookup.pxi
    Element class lookup mechanisms.  The main API and engines for
    those who want to define custom Element classes and inject them
    into lxml.

docloader.pxi
    Support for custom document loaders.  Base class and registry for
    custom document resolvers.

extensions.pxi
    Infrastructure for extension functions in XPath/XSLT, including
    XPath value conversion and function registration.

iterparse.pxi
    Incremental XML parsing.  An iterator class that builds iterparse
    events while parsing.

nsclasses.pxi
    Namespace implementation and registry.  The registry and engine
    for Element classes that use the ElementNamespaceClassLookup
    scheme.

parser.pxi
    Parsers for XML and HTML.  This is the main parser engine.  It's
    the reason why you can parse a document from various sources in
    two lines of Python code.  It's definitely not the right place to
    start reading lxml's source code.

parsertarget.pxi
    An ElementTree compatible parser target implementation based on
    the SAX2 interface of libxml2.

proxy.pxi
    Very low-level functions for memory allocation/deallocation
    and Element proxy handling.  Ignoring this at the beginning
    will save your head from exploding.

public-api.pxi
    The set of C functions that are exported to other extension
    modules at the C level.  For example, ``lxml.objectify`` makes use
    of these.  See the `C-level API`_ documentation.

readonlytree.pxi
    A separate read-only implementation of the Element API.  This is
    used in places where non-intrusive access to a tree is required,
    such as the ``PythonElementClassLookup`` or XSLT extension
    elements.

saxparser.pxi
    SAX-like parser interfaces as known from ElementTree's TreeBuilder.

serializer.pxi
    XML output functions.  Basically everything that creates byte
    sequences from XML trees.

xinclude.pxi
    XInclude support.

xmlerror.pxi
    Error log handling.  All error messages that libxml2 generates
    internally walk through the code in this file to end up in lxml's
    Python level error logs.

    At the end of the file, you will find a long list of named error
    codes.  It is generated from the libxml2 HTML documentation (using
    lxml, of course).  See the script ``update-error-constants.py``
    for this.

xmlid.pxi
    XMLID and IDDict, a dictionary-like way to find Elements by their
    XML-ID attribute.

xpath.pxi
    XPath evaluators.

xslt.pxi
    XSL transformations, including the ``XSLT`` class, document lookup
    handling and access control.

The different schema languages (DTD, RelaxNG, XML Schema and
Schematron) are implemented in the following include files:

* dtd.pxi
* relaxng.pxi
* schematron.pxi
* xmlschema.pxi
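
From the Python side, the parser, iterparse and serializer code listed
above is reached through a handful of calls.  The following is a
minimal sketch that sticks to the ElementTree-compatible subset of the
API, so it also runs against the standard library if lxml is not
installed.  The sample document and variable names are made up.

```python
import io

try:
    from lxml import etree
except ImportError:  # ElementTree-compatible fallback
    import xml.etree.ElementTree as etree

XML = b"<root><a>1</a><b>2</b></root>"

# Parsing from two different sources: a byte string and a file-like object.
root_from_string = etree.fromstring(XML)
tree_from_file = etree.parse(io.BytesIO(XML))

# Incremental parsing: events are delivered while the input is being read.
events = [
    (event, element.tag)
    for event, element in etree.iterparse(io.BytesIO(XML),
                                          events=("start", "end"))
]

# Serialization: turn the tree back into a byte sequence.
serialized = etree.tostring(tree_from_file.getroot())
```

Tracing one of these calls down from ``lxml.etree.pyx`` into the
include files is a reasonable way to get a first overview of the code.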


Python modules
==============

The ``lxml`` package also contains a number of pure Python modules:

builder.py
    The E-factory and the ElementBuilder class.  These provide a
    simple interface to XML tree generation.

cssselect.py
    A CSS selector implementation based on XPath.  The main class is
    called ``CSSSelector``.

doctestcompare.py
    A relaxed comparison scheme for XML/HTML markup in doctest.

ElementInclude.py
    XInclude-like document inclusion, compatible with ElementTree.

_elementpath.py
    XPath-like path language, compatible with ElementTree.

sax.py
    SAX2 compatible interfaces to copy lxml trees from/to SAX compatible
    tools.

usedoctest.py
    Wrapper module for ``doctestcompare.py`` that simplifies its usage
    from inside a doctest.
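
As an example of what these modules provide, the path language from
``_elementpath.py`` is what sits behind the ElementTree-style
``find()``, ``findall()`` and ``findtext()`` methods.  A small sketch
with a made-up document, again limited to the ElementTree-compatible
subset so that it falls back to the standard library if lxml is not
installed:

```python
try:
    from lxml import etree
except ImportError:  # ElementTree-compatible fallback
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    b"<root>"
    b"<item id='1'><name>a</name></item>"
    b"<item id='2'><name>b</name></item>"
    b"</root>"
)

# Child steps separated by "/" select nested elements.
names = [element.text for element in root.findall("item/name")]

# A predicate in square brackets filters on an attribute value.
second = root.find("item[@id='2']")

# findtext() returns the text of the first match.
first_name = root.findtext("item/name")
```

This path language is deliberately smaller than full XPath, which lxml
exposes separately through the evaluators in ``xpath.pxi``.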


lxml.objectify
==============

A Cython-implemented extension module that uses the public C-API of
lxml.etree.  It provides a Python object-like interface to XML trees.
The implementation resides in the file `lxml.objectify.pyx
<https://github.com/lxml/lxml/blob/master/src/lxml/lxml.objectify.pyx>`_.


lxml.html
=========

A specialised toolkit for HTML handling, based on lxml.etree.  This is
implemented in pure Python.