<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.12: http://docutils.sourceforge.net/" />
<meta name="version" content="S5 1.1" />
<title>Implementing XML languages with lxml</title>
<style type="text/css">
/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 7614 2013-02-21 15:55:51Z milde $
:Copyright: This stylesheet has been placed in the public domain.
Default cascading style sheet for the HTML output of Docutils.
See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/
/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
border: 0 }
table.borderless td, table.borderless th {
/* Override padding for "table.docutils td" with "! important".
The right padding separates the table cells. */
padding: 0 0.5em 0 0 ! important }
.first {
/* Override more specific margin styles with "! important". */
margin-top: 0 ! important }
.last, .with-subtitle {
margin-bottom: 0 ! important }
.hidden {
display: none }
a.toc-backref {
text-decoration: none ;
color: black }
blockquote.epigraph {
margin: 2em 5em ; }
dl.docutils dd {
margin-bottom: 0.5em }
object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
overflow: hidden;
}
/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
font-weight: bold }
*/
div.abstract {
margin: 2em 5em }
div.abstract p.topic-title {
font-weight: bold ;
text-align: center }
div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
margin: 2em ;
border: medium outset ;
padding: 1em }
div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
font-weight: bold ;
font-family: sans-serif }
div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title, .code .error {
color: red ;
font-weight: bold ;
font-family: sans-serif }
/* Uncomment (and remove this text!) to get reduced vertical space in
compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
margin-bottom: 0.5em }
div.compound .compound-last, div.compound .compound-middle {
margin-top: 0.5em }
*/
div.dedication {
margin: 2em 5em ;
text-align: center ;
font-style: italic }
div.dedication p.topic-title {
font-weight: bold ;
font-style: normal }
div.figure {
margin-left: 2em ;
margin-right: 2em }
div.footer, div.header {
clear: both;
font-size: smaller }
div.line-block {
display: block ;
margin-top: 1em ;
margin-bottom: 1em }
div.line-block div.line-block {
margin-top: 0 ;
margin-bottom: 0 ;
margin-left: 1.5em }
div.sidebar {
margin: 0 0 0.5em 1em ;
border: medium outset ;
padding: 1em ;
background-color: #ffffee ;
width: 40% ;
float: right ;
clear: right }
div.sidebar p.rubric {
font-family: sans-serif ;
font-size: medium }
div.system-messages {
margin: 5em }
div.system-messages h1 {
color: red }
div.system-message {
border: medium outset ;
padding: 1em }
div.system-message p.system-message-title {
color: red ;
font-weight: bold }
div.topic {
margin: 2em }
h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
margin-top: 0.4em }
h1.title {
text-align: center }
h2.subtitle {
text-align: center }
hr.docutils {
width: 75% }
img.align-left, .figure.align-left, object.align-left {
clear: left ;
float: left ;
margin-right: 1em }
img.align-right, .figure.align-right, object.align-right {
clear: right ;
float: right ;
margin-left: 1em }
img.align-center, .figure.align-center, object.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
.align-left {
text-align: left }
.align-center {
clear: both ;
text-align: center }
.align-right {
text-align: right }
/* reset inner alignment in figures */
div.align-right {
text-align: inherit }
/* div.align-center * { */
/* text-align: left } */
ol.simple, ul.simple {
margin-bottom: 1em }
ol.arabic {
list-style: decimal }
ol.loweralpha {
list-style: lower-alpha }
ol.upperalpha {
list-style: upper-alpha }
ol.lowerroman {
list-style: lower-roman }
ol.upperroman {
list-style: upper-roman }
p.attribution {
text-align: right ;
margin-left: 50% }
p.caption {
font-style: italic }
p.credits {
font-style: italic ;
font-size: smaller }
p.label {
white-space: nowrap }
p.rubric {
font-weight: bold ;
font-size: larger ;
color: maroon ;
text-align: center }
p.sidebar-title {
font-family: sans-serif ;
font-weight: bold ;
font-size: larger }
p.sidebar-subtitle {
font-family: sans-serif ;
font-weight: bold }
p.topic-title {
font-weight: bold }
pre.address {
margin-bottom: 0 ;
margin-top: 0 ;
font: inherit }
pre.literal-block, pre.doctest-block, pre.math, pre.code {
margin-left: 2em ;
margin-right: 2em }
pre.code .ln { color: grey; } /* line numbers */
pre.code, code { background-color: #eeeeee }
pre.code .comment, code .comment { color: #5C6576 }
pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold }
pre.code .literal.string, code .literal.string { color: #0C5404 }
pre.code .name.builtin, code .name.builtin { color: #352B84 }
pre.code .deleted, code .deleted { background-color: #DEB0A1}
pre.code .inserted, code .inserted { background-color: #A3D289}
span.classifier {
font-family: sans-serif ;
font-style: oblique }
span.classifier-delimiter {
font-family: sans-serif ;
font-weight: bold }
span.interpreted {
font-family: sans-serif }
span.option {
white-space: nowrap }
span.pre {
white-space: pre }
span.problematic {
color: red }
span.section-subtitle {
/* font-size relative to parent (h1..h6 element) */
font-size: 80% }
table.citation {
border-left: solid 1px gray;
margin-left: 1px }
table.docinfo {
margin: 2em 4em }
table.docutils {
margin-top: 0.5em ;
margin-bottom: 0.5em }
table.footnote {
border-left: solid 1px black;
margin-left: 1px }
table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
padding-left: 0.5em ;
padding-right: 0.5em ;
vertical-align: top }
table.docutils th.field-name, table.docinfo th.docinfo-name {
font-weight: bold ;
text-align: left ;
white-space: nowrap ;
padding-left: 0 }
/* "booktabs" style (no vertical lines) */
table.docutils.booktabs {
border: 0px;
border-top: 2px solid;
border-bottom: 2px solid;
border-collapse: collapse;
}
table.docutils.booktabs * {
border: 0px;
}
table.docutils.booktabs th {
border-bottom: thin solid;
text-align: left;
}
h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
font-size: 100% }
ul.auto-toc {
list-style-type: none }
</style>
<!-- configuration parameters -->
<meta name="defaultView" content="slideshow" />
<meta name="controlVis" content="hidden" />
<!-- style sheet links -->
<script src="ui/default/slides.js" type="text/javascript"></script>
<link rel="stylesheet" href="ui/default/slides.css"
type="text/css" media="projection" id="slideProj" />
<link rel="stylesheet" href="ui/default/outline.css"
type="text/css" media="screen" id="outlineStyle" />
<link rel="stylesheet" href="ui/default/print.css"
type="text/css" media="print" id="slidePrint" />
<link rel="stylesheet" href="ui/default/opera.css"
type="text/css" media="projection" id="operaFix" />
</head>
<body>
<div class="layout">
<div id="controls"></div>
<div id="currentSlide"></div>
<div id="header">
</div>
<div id="footer">
<h1>Implementing XML languages with lxml</h1>
<h2>Dr. Stefan Behnel, EuroPython 2008, Vilnius/Lietuva</h2>
</div>
</div>
<div class="presentation">
<div class="slide" id="slide0">
<h1 class="title">Implementing XML languages with lxml</h1>
<h2 class="subtitle" id="dr-stefan-behnel">Dr. Stefan Behnel</h2>
<p class="center"><a class="reference external" href="http://codespeak.net/lxml/">http://codespeak.net/lxml/</a></p>
<p class="center"><a class="reference external" href="mailto:lxml-dev@codespeak.net">lxml-dev@codespeak.net</a></p>
<img alt="tagpython.png" class="center" src="tagpython.png" />
</div>
<div class="slide" id="what-is-an-xml-language">
<h1>What is an »XML language«?</h1>
<ul class="simple">
<li>a language in XML notation</li>
<li>aka »XML dialect«<ul>
<li>except that it's not a dialect</li>
</ul>
</li>
<li>Examples:<ul>
<li>XML Schema</li>
<li>Atom/RSS</li>
<li>(X)HTML</li>
<li>Open Document Format</li>
<li>SOAP</li>
<li>... add your own here</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="popular-mistakes-to-avoid-1">
<h1>Popular mistakes to avoid (1)</h1>
<p>"That's easy, I can use regular expressions!"</p>
<p class="incremental center">No, you can't.</p>
</div>
<div class="slide" id="popular-mistakes-to-avoid-2">
<h1>Popular mistakes to avoid (2)</h1>
<p>"This is tree data, I'll take the DOM!"</p>
</div>
<div class="slide" id="id1">
<h1>Popular mistakes to avoid (2)</h1>
<p>"This is tree data, I'll take the DOM!"</p>
<ul class="simple">
<li>DOM is ubiquitous, but it's as complicated as Java</li>
<li>uglify your application with tons of DOM code to<ul>
<li>walk over non-element nodes to find the data you need</li>
<li>convert text content to other data types</li>
<li>modify the XML tree in memory</li>
</ul>
</li>
</ul>
<p>=> write verbose, redundant, hard-to-maintain code</p>
</div>
<div class="slide" id="popular-mistakes-to-avoid-3">
<h1>Popular mistakes to avoid (3)</h1>
<p>"SAX is <em>so</em> fast and consumes <em>no</em> memory!"</p>
</div>
<div class="slide" id="id2">
<h1>Popular mistakes to avoid (3)</h1>
<p>"SAX is <em>so</em> fast and consumes <em>no</em> memory!"</p>
<ul class="simple">
<li>but <em>writing</em> SAX code is <em>not</em> fast!</li>
<li>write error-prone, state-keeping SAX code to<ul>
<li>figure out where you are</li>
<li>find the sections you need</li>
<li>convert text content to other data types</li>
<li>copy the XML data into custom data classes</li>
<li>... and don't forget the way back into XML!</li>
</ul>
</li>
</ul>
<p>=> write confusing state-machine code</p>
<p>=> debugging into existence</p>
</div>
<div class="slide" id="working-with-xml">
<h1>Working with XML</h1>
<blockquote>
<p><strong>Getting XML work done</strong></p>
<p>(instead of getting time wasted)</p>
</blockquote>
</div>
<div class="slide" id="how-can-you-work-with-xml">
<h1>How can you work with XML?</h1>
<ul class="simple">
<li>Preparation:<ul>
<li>Implement usable data classes as an abstraction layer</li>
<li>Implement a mapping from XML to the data classes</li>
<li>Implement a mapping from the data classes to XML</li>
</ul>
</li>
<li>Workflow:<ul>
<li>parse XML data</li>
<li>map XML data to data classes</li>
<li>work with data classes</li>
<li>map data classes to XML</li>
<li>serialise XML</li>
</ul>
</li>
</ul>
<ul class="incremental simple">
<li>Approach:<ul>
<li>get rid of XML and do everything in your own code</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="what-if-you-could-simplify-this">
<h1>What if you could simplify this?</h1>
<ul class="simple">
<li>Preparation:<ul>
<li>Extend usable XML API classes into an abstraction layer</li>
</ul>
</li>
<li>Workflow:<ul>
<li>parse XML data into XML API classes</li>
<li>work with XML API classes</li>
<li>serialise XML</li>
</ul>
</li>
</ul>
<ul class="incremental simple">
<li>Approach:<ul>
<li>cover only the quirks of XML and make it work <em>for</em> you</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="id3">
<h1>What if you could simplify this ...</h1>
<ul class="simple">
<li>... without sacrificing usability or flexibility?</li>
<li>... using a high-speed, full-featured, pythonic XML toolkit?</li>
<li>... with the power of XPath, XSLT and XML validation?</li>
</ul>
<p class="incremental center">... then »lxml« is your friend!</p>
</div>
<div class="slide" id="overview">
<h1>Overview</h1>
<ul class="simple">
<li>What is lxml?<ul>
<li>what &amp; who</li>
</ul>
</li>
<li>How do you use it?<ul>
<li>Lesson 0: quick API overview<ul>
<li>ElementTree concepts and lxml features</li>
</ul>
</li>
<li>Lesson 1: parse XML<ul>
<li>how to get XML data into memory</li>
</ul>
</li>
<li>Lesson 2: generate XML<ul>
<li>how to write an XML generator for a language</li>
</ul>
</li>
<li>Lesson 3: working with XML trees made easy<ul>
<li>how to write an XML API for a language</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="what-is-lxml">
<h1>What is lxml?</h1>
<ul class="simple">
<li>a fast, full-featured toolkit for XML and HTML handling<ul>
<li><a class="reference external" href="http://codespeak.net/lxml/">http://codespeak.net/lxml/</a></li>
<li><a class="reference external" href="mailto:lxml-dev@codespeak.net">lxml-dev@codespeak.net</a></li>
</ul>
</li>
<li>based on and inspired by<ul>
<li>the C libraries libxml2 and libxslt (by Daniel Veillard)</li>
<li>the ElementTree API (by Fredrik Lundh)</li>
<li>the Cython compiler (by Robert Bradshaw, Greg Ewing &amp; me)</li>
<li>the Python language (by Guido &amp; [<em>paste Misc/ACKS here</em>])</li>
<li>user feedback, ideas and patches (by you!)<ul>
<li>keep doing that, we love you all!</li>
</ul>
</li>
</ul>
</li>
<li>maintained and (in major parts) written by myself<ul>
<li>initial design and implementation by Martijn Faassen</li>
<li>extensive HTML API and tools by Ian Bicking</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="what-do-you-get-for-your-money">
<h1>What do you get for your money?</h1>
<ul class="simple">
<li>many tools in one:<ul>
<li>Generic, ElementTree compatible XML API: <strong>lxml.etree</strong><ul>
<li>but faster for many tasks and much more feature-rich</li>
</ul>
</li>
<li>Special tool set for HTML handling: <strong>lxml.html</strong></li>
<li>Special API for pythonic data binding: <strong>lxml.objectify</strong></li>
<li>General purpose path languages: XPath and CSS selectors</li>
<li>Validation: DTD, XML Schema, RelaxNG, Schematron</li>
<li>XSLT, XInclude, C14N, ...</li>
<li>Fast tree iteration, event-driven parsing, ...</li>
</ul>
</li>
<li>it's free, but it's worth every €-Cent!<ul>
<li>what users say:<ul>
<li>»no qualification, I would recommend lxml for just about any
HTML task«</li>
<li>»THE tool [...] for newbies and experienced developers«</li>
<li>»you can do pretty much anything with an intuitive API«</li>
<li>»lxml takes all the pain out of XML«</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="lesson-0-a-quick-overview">
<h1>Lesson 0: a quick overview</h1>
<blockquote>
<p>why <strong>»lxml takes all the pain out of XML«</strong></p>
<p>(a quick overview of lxml features and ElementTree concepts)</p>
</blockquote>
<!-- >>> from lxml import etree, cssselect, html
>>> some_xml_data = "<root><speech class='dialog'><p>So be it!</p></speech><p>stuff</p></root>"
>>> some_html_data = "<p>Just a quick note<br>next line</p>"
>>> xml_tree = etree.XML(some_xml_data)
>>> html_tree = html.fragment_fromstring(some_html_data) -->
</div>
<div class="slide" id="namespaces-in-elementtree">
<h1>Namespaces in ElementTree</h1>
<ul>
<li><p class="first">uses Clark notation:</p>
<ul class="simple">
<li>wrap namespace URI in <tt class="docutils literal"><span class="pre">{...}</span></tt></li>
<li>append the tag name</li>
</ul>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">tag</span> <span class="o">=</span> <span class="s2">"{http://www.w3.org/the/namespace}tagname"</span>
<span class="gp">>>> </span><span class="n">element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">Element</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
</pre></div>
</li>
<li><p class="first">no prefixes!</p>
</li>
<li><p class="first">a single, self-contained tag identifier</p>
</li>
</ul>
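<p>A minimal sketch of what Clark notation buys you, assuming <tt class="docutils literal">lxml.etree</tt> is installed (it falls back to the standard library's ElementTree, which shares this part of the API). The <tt class="docutils literal">clark()</tt> helper and the choice of the Atom namespace URI are illustrative only:</p>

```python
try:
    from lxml import etree                     # preferred, if installed
except ImportError:
    import xml.etree.ElementTree as etree      # stdlib fallback, same API here

ATOM_NS = "http://www.w3.org/2005/Atom"        # namespace URI chosen for illustration


def clark(namespace_uri, local_name):
    """Hypothetical helper: build a tag name as {namespace-uri}local-name."""
    return "{%s}%s" % (namespace_uri, local_name)


# Build a small tree; no prefix is needed anywhere.
feed = etree.Element(clark(ATOM_NS, "feed"))
entry = etree.SubElement(feed, clark(ATOM_NS, "entry"))

# Parse two equivalent documents that declare the namespace differently.
with_prefix = etree.fromstring(
    '<a:feed xmlns:a="http://www.w3.org/2005/Atom"><a:entry/></a:feed>')
default_ns = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry/></feed>')

# The tag names are identical, whatever prefix the document used.
built_tags = [el.tag for el in feed.iter()]
prefixed_tags = [el.tag for el in with_prefix.iter()]
default_tags = [el.tag for el in default_ns.iter()]
```

<p>All three tag lists compare equal, so code that matches on tags does not have to care about prefixes.</p>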
</div>
<div class="slide" id="text-content-in-elementtree">
<h1>Text content in ElementTree</h1>
<ul>
<li><p class="first">uses <tt class="docutils literal">.text</tt> and <tt class="docutils literal">.tail</tt> attributes:</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">div</span> <span class="o">=</span> <span class="n">html</span><span class="o">.</span><span class="n">fragment_fromstring</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"&lt;div&gt;&lt;p&gt;a paragraph&lt;br&gt;split in two&lt;/p&gt; parts&lt;/div&gt;"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">p</span> <span class="o">=</span> <span class="n">div</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">br</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">p</span><span class="o">.</span><span class="n">text</span>
<span class="go">'a paragraph'</span>
<span class="gp">>>> </span><span class="n">br</span><span class="o">.</span><span class="n">text</span>
<span class="gp">>>> </span><span class="n">br</span><span class="o">.</span><span class="n">tail</span>
<span class="go">'split in two'</span>
<span class="gp">>>> </span><span class="n">p</span><span class="o">.</span><span class="n">tail</span>
<span class="go">' parts'</span>
</pre></div>
</li>
<li><p class="first">no text nodes!</p>
<ul class="simple">
<li>simplifies tree traversal a lot</li>
<li>simplifies many XML algorithms</li>
</ul>
</li>
</ul>
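<p>To illustrate why the absence of text nodes simplifies traversal, here is a rough sketch that collects all text below an element using only <tt class="docutils literal">.text</tt> and <tt class="docutils literal">.tail</tt>. The <tt class="docutils literal">iter_text()</tt> helper is hypothetical (ElementTree already offers <tt class="docutils literal">itertext()</tt> for this), and the input is an XML variant of the fragment above:</p>

```python
try:
    from lxml import etree                     # preferred, if installed
except ImportError:
    import xml.etree.ElementTree as etree      # stdlib fallback, same API here


def iter_text(element):
    """Yield all text inside ``element`` in document order."""
    if element.text:                # text before the first child
        yield element.text
    for child in element:
        for chunk in iter_text(child):
            yield chunk
        if child.tail:              # text that follows the child's end tag
            yield child.tail


div = etree.fromstring(
    "<div><p>a paragraph<br/>split in two</p> parts</div>")
collected = "".join(iter_text(div))
```

<p>The recursion only ever visits elements; the text comes along as plain string attributes.</p>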
</div>
<div class="slide" id="attributes-in-elementtree">
<h1>Attributes in ElementTree</h1>
<ul>
<li><p class="first">uses <tt class="docutils literal">.get()</tt> and <tt class="docutils literal">.set()</tt> methods:</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span>
<span class="gp">... </span> <span class="s1">'&lt;root a="the value" b="of an" c="attribute"/&gt;'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)</span>
<span class="go">'the value'</span>
<span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span> <span class="s2">"THE value"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)</span>
<span class="go">'THE value'</span>
</pre></div>
</li>
<li><p class="first">or the <tt class="docutils literal">.attrib</tt> dictionary property:</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">d</span> <span class="o">=</span> <span class="n">root</span><span class="o">.</span><span class="n">attrib</span>
<span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
<span class="go">['a', 'b', 'c']</span>
<span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">values</span><span class="p">()))</span>
<span class="go">['THE value', 'attribute', 'of an']</span>
</pre></div>
</li>
</ul>
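<p>A few more dictionary-style operations, sketched under the assumption that <tt class="docutils literal">.get()</tt> and <tt class="docutils literal">.attrib</tt> behave like their <tt class="docutils literal">dict</tt> counterparts (which holds for lxml.etree and the standard library's ElementTree alike). The attribute name <tt class="docutils literal">d</tt> is made up for the example:</p>

```python
try:
    from lxml import etree                     # preferred, if installed
except ImportError:
    import xml.etree.ElementTree as etree      # stdlib fallback, same API here

root = etree.fromstring('<root a="the value" b="of an" c="attribute"/>')

missing = root.get("d")              # a missing attribute yields None
fallback = root.get("d", "n/a")      # ... or an explicit default value

root.attrib["d"] = "new"             # same effect as root.set("d", "new")
del root.attrib["b"]                 # remove an attribute

remaining = sorted(root.attrib.items())
```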
</div>
<div class="slide" id="tree-iteration-in-lxml-etree-1">
<h1>Tree iteration in lxml.etree (1)</h1>
<!-- >>> import collections -->
<div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">collections</span>
<span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"&lt;root&gt; &lt;a&gt;&lt;b/&gt;&lt;b/&gt;&lt;/a&gt; &lt;c&gt;&lt;d/&gt;&lt;e&gt;&lt;f/&gt;&lt;/e&gt;&lt;g/&gt;&lt;/c&gt; &lt;/root&gt;"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">child</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">root</span><span class="p">])</span> <span class="c1"># children</span>
<span class="go">['a', 'c']</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">el</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">root</span><span class="o">.</span><span class="n">iter</span><span class="p">()])</span> <span class="c1"># self and descendants</span>
<span class="go">['root', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">el</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">root</span><span class="o">.</span><span class="n">iterdescendants</span><span class="p">()])</span>
<span class="go">['a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']</span>
<span class="gp">>>> </span><span class="k">def</span> <span class="nf">iter_breadth_first</span><span class="p">(</span><span class="n">root</span><span class="p">):</span>
<span class="gp">... </span> <span class="n">bfs_queue</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">deque</span><span class="p">([</span><span class="n">root</span><span class="p">])</span>
<span class="gp">... </span> <span class="k">while</span> <span class="n">bfs_queue</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">el</span> <span class="o">=</span> <span class="n">bfs_queue</span><span class="o">.</span><span class="n">popleft</span><span class="p">()</span> <span class="c1"># pop next element</span>
<span class="gp">... </span> <span class="n">bfs_queue</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">el</span><span class="p">)</span> <span class="c1"># append its children</span>
<span class="gp">... </span> <span class="k">yield</span> <span class="n">el</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">el</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">iter_breadth_first</span><span class="p">(</span><span class="n">root</span><span class="p">)])</span>
<span class="go">['root', 'a', 'c', 'b', 'b', 'd', 'e', 'g', 'f']</span>
</pre></div>
</div>
<div class="slide" id="tree-iteration-in-lxml-etree-2">
<h1>Tree iteration in lxml.etree (2)</h1>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"&lt;root&gt; &lt;a&gt;&lt;b/&gt;&lt;b/&gt;&lt;/a&gt; &lt;c&gt;&lt;d/&gt;&lt;e&gt;&lt;f/&gt;&lt;/e&gt;&lt;g/&gt;&lt;/c&gt; &lt;/root&gt;"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">tree_walker</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">iterwalk</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">events</span><span class="o">=</span><span class="p">(</span><span class="s1">'start'</span><span class="p">,</span> <span class="s1">'end'</span><span class="p">))</span>
<span class="gp">>>> </span><span class="k">for</span> <span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">element</span><span class="p">)</span> <span class="ow">in</span> <span class="n">tree_walker</span><span class="p">:</span>
<span class="gp">... </span> <span class="k">print</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="s2"> (</span><span class="si">%s</span><span class="s2">)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">element</span><span class="o">.</span><span class="n">tag</span><span class="p">,</span> <span class="n">event</span><span class="p">))</span>
<span class="go">root (start)</span>
<span class="go">a (start)</span>
<span class="go">b (start)</span>
<span class="go">b (end)</span>
<span class="go">b (start)</span>
<span class="go">b (end)</span>
<span class="go">a (end)</span>
<span class="go">c (start)</span>
<span class="go">d (start)</span>
<span class="go">d (end)</span>
<span class="go">e (start)</span>
<span class="go">f (start)</span>
<span class="go">f (end)</span>
<span class="go">e (end)</span>
<span class="go">g (start)</span>
<span class="go">g (end)</span>
<span class="go">c (end)</span>
<span class="go">root (end)</span>
</pre></div>
</div>
<div class="slide" id="path-languages-in-lxml">
<h1>Path languages in lxml</h1>
<div class="highlight"><pre><span class="nt">&lt;root&gt;</span>
<span class="nt">&lt;speech</span> <span class="na">class=</span><span class="s">'dialog'</span><span class="nt">&gt;&lt;p&gt;</span>So be it!<span class="nt">&lt;/p&gt;&lt;/speech&gt;</span>
<span class="nt">&lt;p&gt;</span>stuff<span class="nt">&lt;/p&gt;</span>
<span class="nt">&lt;/root&gt;</span>
</pre></div>
<ul>
<li><p class="first">search it with XPath</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">find_paragraphs</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XPath</span><span class="p">(</span><span class="s2">"//p"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">paragraphs</span> <span class="o">=</span> <span class="n">find_paragraphs</span><span class="p">(</span><span class="n">xml_tree</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span> <span class="n">p</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">paragraphs</span> <span class="p">])</span>
<span class="go">['So be it!', 'stuff']</span>
</pre></div>
</li>
<li><p class="first">search it with CSS selectors</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">find_dialogs</span> <span class="o">=</span> <span class="n">cssselect</span><span class="o">.</span><span class="n">CSSSelector</span><span class="p">(</span><span class="s2">"speech.dialog p"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">paragraphs</span> <span class="o">=</span> <span class="n">find_dialogs</span><span class="p">(</span><span class="n">xml_tree</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span> <span class="n">p</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">paragraphs</span> <span class="p">])</span>
<span class="go">['So be it!']</span>
</pre></div>
</li>
</ul>
</div>
<div class="slide" id="summary-of-lesson-0">
<h1>Summary of lesson 0</h1>
<ul class="simple">
<li>lxml comes with various tools<ul>
<li>that aim to hide the quirks of XML</li>
<li>that simplify finding and handling data</li>
<li>that make XML a pythonic tool by itself</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="lesson-1-parsing-xml-html">
<h1>Lesson 1: parsing XML/HTML</h1>
<blockquote>
<p><strong>The input side</strong></p>
<p>(a quick overview)</p>
</blockquote>
</div>
<div class="slide" id="parsing-xml-and-html-from">
<h1>Parsing XML and HTML from ...</h1>
<ul class="simple">
<li>strings: <tt class="docutils literal">fromstring(xml_data)</tt><ul>
<li>byte strings, but also unicode strings</li>
</ul>
</li>
<li>filenames: <tt class="docutils literal">parse(filename)</tt></li>
<li>HTTP/FTP URLs: <tt class="docutils literal">parse(url)</tt></li>
<li>file objects: <tt class="docutils literal">parse(f)</tt><ul>
<li><tt class="docutils literal">f = open(filename, 'rb')</tt> !</li>
</ul>
</li>
<li>file-like objects: <tt class="docutils literal">parse(f)</tt><ul>
<li>only need a <tt class="docutils literal">f.read(size)</tt> method</li>
</ul>
</li>
<li>data chunks: <tt class="docutils literal">parser.feed(xml_chunk)</tt><ul>
<li><tt class="docutils literal">result = parser.close()</tt></li>
</ul>
</li>
</ul>
<p class="small right">(parsing from strings and filenames/URLs frees the GIL)</p>
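<p>The input sources listed above can be sketched side by side as follows. This assumes <tt class="docutils literal">lxml.etree</tt> or, as a fallback, the standard library's ElementTree; the sample data and the chunk boundaries are arbitrary, and an in-memory <tt class="docutils literal">io.BytesIO</tt> stands in for a real file:</p>

```python
import io

try:
    from lxml import etree                     # preferred, if installed
except ImportError:
    import xml.etree.ElementTree as etree      # stdlib fallback, same API here

xml_data = b"<root><child>text</child></root>"  # sample data for illustration

# 1) from a (byte) string
root_from_string = etree.fromstring(xml_data)

# 2) from a file-like object: anything that provides read(size)
tree = etree.parse(io.BytesIO(xml_data))
root_from_file = tree.getroot()

# 3) from data chunks, e.g. as they arrive from a network connection
parser = etree.XMLParser()
for chunk in (b"<root><chi", b"ld>text</child>", b"</root>"):
    parser.feed(chunk)
root_from_chunks = parser.close()              # returns the root element

roots = [root_from_string, root_from_file, root_from_chunks]
```

<p>Note that <tt class="docutils literal">parse()</tt> returns a tree object, whereas <tt class="docutils literal">fromstring()</tt> and <tt class="docutils literal">parser.close()</tt> return the root element.</p>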
</div>
<div class="slide" id="example-parsing-from-a-string">
<h1>Example: parsing from a string</h1>
<ul>
<li><p class="first">using the <tt class="docutils literal">fromstring()</tt> function:</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">some_xml_data</span><span class="p">)</span>
</pre></div>
</li>
<li><p class="first">using the <tt class="docutils literal">fromstring()</tt> function with a specific parser:</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">parser</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">HTMLParser</span><span class="p">(</span><span class="n">remove_comments</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">some_html_data</span><span class="p">,</span> <span class="n">parser</span><span class="p">)</span>
</pre></div>
</li>
<li><p class="first">or the <tt class="docutils literal">XML()</tt> and <tt class="docutils literal">HTML()</tt> aliases for literals in code:</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XML</span><span class="p">(</span><span class="s2">"&lt;root&gt;&lt;child/&gt;&lt;/root&gt;"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">HTML</span><span class="p">(</span><span class="s2">"&lt;p&gt;some&lt;br&gt;paragraph&lt;/p&gt;"</span><span class="p">)</span>
</pre></div>
</li>
</ul>
</div>
<div class="slide" id="parsing-xml-into">
<h1>Parsing XML into ...</h1>
<ul class="simple">
<li>a tree in memory<ul>
<li><tt class="docutils literal">parse()</tt> and <tt class="docutils literal">fromstring()</tt> functions</li>
</ul>
</li>
<li>a tree in memory, but step-by-step with a generator<ul>
<li><tt class="docutils literal">iterparse()</tt> generates <tt class="docutils literal">(start/end, element)</tt> events</li>
<li>tree can be cleaned up to save space</li>
</ul>
</li>
<li>SAX-like callbacks without building a tree<ul>
<li><tt class="docutils literal">parse()</tt> and <tt class="docutils literal">fromstring()</tt> functions</li>
<li>pass a <tt class="docutils literal">target</tt> object into the parser</li>
</ul>
</li>
</ul>
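<p>A minimal sketch of the last two options, assuming a small in-memory document; the sample data and the <tt class="docutils literal">TagCounter</tt> class are made up for illustration:</p>

```python
from io import BytesIO
from lxml import etree

xml_data = b"<root><item>one</item><item>two</item><other/></root>"

# Step-by-step parsing: handle each "item" element as soon as its end tag
# has been parsed, then clear it so the partial tree does not keep growing.
texts = []
for event, element in etree.iterparse(BytesIO(xml_data),
                                      events=("end",), tag="item"):
    texts.append(element.text)
    element.clear()


# SAX-like parsing: the parser calls methods on a target object and
# returns whatever the target's close() method returns; no tree is built.
class TagCounter:
    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        self.count += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        count, self.count = self.count, 0
        return count


parser = etree.XMLParser(target=TagCounter())
tag_count = etree.fromstring(xml_data, parser)
```

<p>Here <tt class="docutils literal">texts</tt> collects the text of both items, and <tt class="docutils literal">tag_count</tt> is the number of start tags the target saw.</p>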
</div>
<div class="slide" id="summary-of-lesson-1">
<h1>Summary of lesson 1</h1>
<ul class="simple">
<li>parsing XML/HTML in lxml is mostly straightforward<ul>
<li>simple functions that do the job</li>
</ul>
</li>
<li>advanced use cases are pretty simple<ul>
<li>event-driven parsing using <tt class="docutils literal">iterparse()</tt></li>
<li>special parser configuration with keyword arguments<ul>
<li>configuration is generally local to a parser</li>
</ul>
</li>
</ul>
</li>
<li>BTW: parsing is <em>very</em> fast, as is serialising<ul>
<li>don't hesitate to do parse-serialise-parse cycles</li>
</ul>
</li>
</ul>
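<p>As a small illustration of parser-local configuration and a parse-serialise-parse cycle (the sample document is an assumption, not taken from the talk):</p>

```python
from lxml import etree

xml_text = b"<root>\n  <child>text</child>\n</root>"

# Keyword arguments configure this one parser object only.
compact_parser = etree.XMLParser(remove_blank_text=True)

default_root = etree.fromstring(xml_text)                  # default parser
compact_root = etree.fromstring(xml_text, compact_parser)  # configured parser

default_bytes = etree.tostring(default_root)
compact_bytes = etree.tostring(compact_root)

# A parse-serialise-parse cycle gives back an equivalent tree.
reparsed = etree.fromstring(compact_bytes)
```

<p>The default parser keeps the whitespace-only text nodes, while the configured parser drops them; other parsers are unaffected by that setting.</p>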
</div>
<div class="slide" id="lesson-2-generating-xml">
<h1>Lesson 2: generating XML</h1>
<blockquote>
<p><strong>The output side</strong></p>
<p>(and how to make it safe and simple)</p>
</blockquote>
</div>
<div class="slide" id="the-example-language-atom">
<h1>The example language: Atom</h1>
<p>The Atom XML format</p>
<ul class="simple">
<li>Namespace: <a class="reference external" href="http://www.w3.org/2005/Atom">http://www.w3.org/2005/Atom</a></li>
<li>IETF standard (RFC 4287) derived from RSS and friends</li>
<li>Atom feeds describe news entries and annotated links<ul>
<li>a <tt class="docutils literal">feed</tt> contains zero or more <tt class="docutils literal">entry</tt> elements</li>
<li>an <tt class="docutils literal">entry</tt> contains <tt class="docutils literal">author</tt>, <tt class="docutils literal">link</tt>, <tt class="docutils literal">summary</tt> and/or <tt class="docutils literal">content</tt></li>
</ul>
</li>
</ul>
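<p>To make the structure concrete, here is a hand-written, minimal feed (illustrative content only) and a namespace-aware lookup of its entries:</p>

```python
from lxml import etree

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"

# A made-up feed for illustration; real feeds carry more metadata.
feed = etree.XML(b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>First entry</title>
    <link href="http://example.org/1"/>
    <summary>Some text.</summary>
  </entry>
</feed>""")

# Qualified names use the "{namespace}localname" notation.
entries = feed.findall("{%s}entry" % ATOM_NAMESPACE)
entry_titles = [entry.findtext("{%s}title" % ATOM_NAMESPACE)
                for entry in entries]
```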
</div>
<div class="slide" id="example-generate-xml-1">
<h1>Example: generate XML (1)</h1>
<p>The ElementMaker (or <em>E-factory</em>)</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">lxml.builder</span> <span class="kn">import</span> <span class="n">ElementMaker</span>
<span class="gp">>>> </span><span class="n">A</span> <span class="o">=</span> <span class="n">ElementMaker</span><span class="p">(</span><span class="n">namespace</span><span class="o">=</span><span class="s2">"http://www.w3.org/2005/Atom"</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">nsmap</span><span class="o">=</span><span class="p">{</span><span class="bp">None</span> <span class="p">:</span> <span class="s2">"http://www.w3.org/2005/Atom"</span><span class="p">})</span>
</pre></div>
<div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span>
<span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span>
<span class="gp">... </span> <span class="p">)</span>
<span class="gp">... </span><span class="p">)</span>
</pre></div>
</div><div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">lxml.etree</span> <span class="kn">import</span> <span class="n">tostring</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">(</span> <span class="n">tostring</span><span class="p">(</span><span class="n">atom</span><span class="p">,</span> <span class="n">pretty_print</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="p">)</span>
</pre></div>
</div></div>
<div class="slide" id="example-generate-xml-2">
<h1>Example: generate XML (2)</h1>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span>
<span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span>
<span class="gp">... </span> <span class="p">)</span>
<span class="gp">... </span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span class="nt"><feed</span> <span class="na">xmlns=</span><span class="s">"http://www.w3.org/2005/Atom"</span><span class="nt">></span>
<span class="nt"><author></span>
<span class="nt"><name></span>Stefan Behnel<span class="nt"></name></span>
<span class="nt"></author></span>
<span class="nt"><entry></span>
<span class="nt"><title></span>News from lxml<span class="nt"></title></span>
<span class="nt"><link</span> <span class="na">href=</span><span class="s">"http://codespeak.net/lxml/"</span><span class="nt">/></span>
<span class="nt"><summary</span> <span class="na">type=</span><span class="s">"html"</span><span class="nt">></span>See what's <span class="ni">&lt;</span>b<span class="ni">&gt;</span>fun<span class="ni">&lt;</span>/b<span class="ni">&gt;</span>
about lxml...<span class="nt"></summary></span>
<span class="nt"></entry></span>
<span class="nt"></feed></span>
</pre></div>
</div>
<div class="slide" id="be-careful-what-you-type">
<h1>Be careful what you type!</h1>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">titel</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span>
<span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span>
<span class="gp">... </span> <span class="p">)</span>
<span class="gp">... </span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span class="nt"><feed</span> <span class="na">xmlns=</span><span class="s">"http://www.w3.org/2005/Atom"</span><span class="nt">></span>
<span class="nt"><author></span>
<span class="nt"><name></span>Stefan Behnel<span class="nt"></name></span>
<span class="nt"></author></span>
<span class="nt"><entry></span>
<span class="nt"><titel></span>News from lxml<span class="nt"></titel></span>
<span class="nt"><link</span> <span class="na">href=</span><span class="s">"http://codespeak.net/lxml/"</span><span class="nt">/></span>
<span class="nt"><summary</span> <span class="na">type=</span><span class="s">"html"</span><span class="nt">></span>See what's <span class="ni">&lt;</span>b<span class="ni">&gt;</span>fun<span class="ni">&lt;</span>/b<span class="ni">&gt;</span>
about lxml...<span class="nt"></summary></span>
<span class="nt"></entry></span>
<span class="nt"></feed></span>
</pre></div>
</div>
<div class="slide" id="want-more-type-safety">
<h1>Want more 'type safety'?</h1>
<p>Write an XML generator <em>module</em> instead:</p>
<div class="highlight"><pre><span class="c1"># atomgen.py</span>
<span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">etree</span>
<span class="kn">from</span> <span class="nn">lxml.builder</span> <span class="kn">import</span> <span class="n">ElementMaker</span>
<span class="n">ATOM_NAMESPACE</span> <span class="o">=</span> <span class="s2">"http://www.w3.org/2005/Atom"</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">ElementMaker</span><span class="p">(</span><span class="n">namespace</span><span class="o">=</span><span class="n">ATOM_NAMESPACE</span><span class="p">,</span>
                 <span class="n">nsmap</span><span class="o">=</span><span class="p">{</span><span class="bp">None</span> <span class="p">:</span> <span class="n">ATOM_NAMESPACE</span><span class="p">})</span>
<span class="n">feed</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span>
<span class="n">entry</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">title</span>
<span class="c1"># ... and so on and so forth ...</span>
<span class="c1"># plus a little validation function: isvalid()</span>
<span class="n">isvalid</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">RelaxNG</span><span class="p">(</span><span class="nb">file</span><span class="o">=</span><span class="s2">"atom.rng"</span><span class="p">)</span>
</pre></div>
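<p>The validator does not need a schema file to be tried out. The sketch below builds a RelaxNG validator from an inline <em>toy</em> schema (an assumption for illustration, far smaller than the real Atom schema) and shows that it catches the misspelled element:</p>

```python
from lxml import etree
from lxml.builder import ElementMaker

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"
A = ElementMaker(namespace=ATOM_NAMESPACE, nsmap={None: ATOM_NAMESPACE})

# Toy schema for illustration only: NOT the real Atom schema.
# The "ns" attribute is inherited by the nested element patterns.
TOY_SCHEMA = etree.XML(b"""
<element name="feed" ns="http://www.w3.org/2005/Atom"
         xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="id"><text/></element>
  <zeroOrMore>
    <element name="entry">
      <element name="title"><text/></element>
    </element>
  </zeroOrMore>
</element>
""")
isvalid = etree.RelaxNG(TOY_SCHEMA)

good = A.feed(A.id("urn:example:feed"), A.entry(A.title("News from lxml")))
bad = A.feed(A.id("urn:example:feed"), A.entry(A.titel("News from lxml")))

good_result = isvalid(good)   # the schema accepts this tree
bad_result = isvalid(bad)     # "titel" is not allowed by the schema
```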
</div>
<div class="slide" id="the-atom-generator-module">
<h1>The Atom generator module</h1>
<!-- >>> import sys
>>> sys.path.insert(0, "ep2008") -->
<div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">atomgen</span> <span class="kn">as</span> <span class="nn">A</span>
<span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span>
<span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span>
<span class="gp">... </span> <span class="p">)</span>
<span class="gp">... </span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">A</span><span class="o">.</span><span class="n">isvalid</span><span class="p">(</span><span class="n">atom</span><span class="p">)</span> <span class="c1"># ok, forgot the IDs => invalid XML ...</span>
<span class="go">False</span>
<span class="gp">>>> </span><span class="n">title</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">titel</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">)</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="c">...</span>
<span class="gr">AttributeError</span>: <span class="n">'module' object has no attribute 'titel'</span>
</pre></div>
</div>
<div class="slide" id="mixing-languages-1">
<h1>Mixing languages (1)</h1>
<p>Atom can embed <em>serialised</em> HTML</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">lxml.html.builder</span> <span class="kn">as</span> <span class="nn">h</span>
<span class="gp">>>> </span><span class="n">html_fragment</span> <span class="o">=</span> <span class="n">h</span><span class="o">.</span><span class="n">DIV</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"this is some</span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">h</span><span class="o">.</span><span class="n">A</span><span class="p">(</span><span class="s2">"HTML"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="s2">"http://w3.org/MarkUp/"</span><span class="p">),</span>
<span class="gp">... </span> <span class="s2">"</span><span class="se">\n</span><span class="s2">content"</span><span class="p">)</span>
</pre></div>
<div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">serialised_html</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">html_fragment</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"html"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">summary</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="n">serialised_html</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">)</span>
</pre></div>
</div><div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">summary</span><span class="p">))</span>
<span class="go"><summary xmlns="http://www.w3.org/2005/Atom" type="html"></span>
<span class="go"> &lt;div&gt;this is some</span>
<span class="go"> &lt;a href="http://w3.org/MarkUp/"&gt;HTML&lt;/a&gt;</span>
<span class="go"> content&lt;/div&gt;</span>
<span class="go"></summary></span>
</pre></div>
</div></div>
<div class="slide" id="mixing-languages-2">
<h1>Mixing languages (2)</h1>
<p>Atom can also embed non-escaped XHTML</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">copy</span> <span class="kn">import</span> <span class="n">deepcopy</span>
<span class="gp">>>> </span><span class="n">xhtml_fragment</span> <span class="o">=</span> <span class="n">deepcopy</span><span class="p">(</span><span class="n">html_fragment</span><span class="p">)</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">lxml.html</span> <span class="kn">import</span> <span class="n">html_to_xhtml</span>
<span class="gp">>>> </span><span class="n">html_to_xhtml</span><span class="p">(</span><span class="n">xhtml_fragment</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">summary</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="n">xhtml_fragment</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"xhtml"</span><span class="p">)</span>
</pre></div>
<div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="n">pretty_print</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="go"><summary xmlns="http://www.w3.org/2005/Atom" type="xhtml"></span>
<span class="go"> <html:div xmlns:html="http://www.w3.org/1999/xhtml">this is some</span>
<span class="go"> <html:a href="http://w3.org/MarkUp/">HTML</html:a></span>
<span class="go"> content</html:div></span>
<span class="go"></summary></span>
</pre></div>
</div></div>
<div class="slide" id="summary-of-lesson-2">
<h1>Summary of lesson 2</h1>
<ul class="simple">
<li>generating XML is easy<ul>
<li>use the ElementMaker</li>
</ul>
</li>
<li>wrap it in a module that provides<ul>
<li>the target namespace</li>
<li>an ElementMaker name for each language element</li>
<li>a validator</li>
<li>maybe additional helper functions</li>
</ul>
</li>
<li>mixing languages is easy<ul>
<li>define a generator module for each</li>
</ul>
</li>
</ul>
<p>... this is all you need for the <em>output</em> side of XML languages</p>
</div>
<div class="slide" id="lesson-3-designing-xml-apis">
<h1>Lesson 3: Designing XML APIs</h1>
<blockquote>
<p><strong>The Element API</strong></p>
<p>(and how to make it the way <em>you</em> want)</p>
</blockquote>
</div>
<div class="slide" id="trees-in-c-and-in-python">
<h1>Trees in C and in Python</h1>
<ul class="simple">
<li>Trees have two representations:<ul>
<li>a plain, complete, low-level C tree provided by libxml2</li>
<li>a set of Python Element proxies, each representing one element</li>
</ul>
</li>
<li>Proxies are created on-the-fly:<ul>
<li>lxml creates an Element object for a C node on request</li>
<li>proxies are garbage collected when going out of scope</li>
<li>XML trees are garbage collected when deleting the last proxy</li>
</ul>
</li>
</ul>
<img alt="ep2008/proxies.png" class="center" src="ep2008/proxies.png" />
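<p>A small sketch of what this means in practice (the helper function is hypothetical): keeping a single proxy alive is enough to keep the whole C tree alive, and lxml creates further proxies on request.</p>

```python
from lxml import etree


def first_child():
    root = etree.XML("<root><a/><b/></root>")
    # Only the proxy for the first child leaves this function; the proxy
    # for the root element goes out of scope when the function returns.
    return root[0]


child = first_child()

# The C tree is still alive, so fresh proxies can be created for its nodes.
parent = child.getparent()
sibling = child.getnext()
```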
</div>
<div class="slide" id="mapping-python-classes-to-nodes">
<h1>Mapping Python classes to nodes</h1>
<ul class="simple">
<li>Proxies can be assigned to XML nodes <em>by user code</em><ul>
<li>lxml tells you about a node, you return a class</li>
</ul>
</li>
</ul>
</div>
<div class="slide" id="example-a-simple-element-class-1">
<h1>Example: a simple Element class (1)</h1>
<ul>
<li><p class="first">define a subclass of ElementBase</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="k">class</span> <span class="nc">HonkElement</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">ElementBase</span><span class="p">):</span>
<span class="gp">... </span>    <span class="nd">@property</span>
<span class="gp">... </span>    <span class="k">def</span> <span class="nf">honking</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="gp">... </span>        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'honking'</span><span class="p">)</span> <span class="o">==</span> <span class="s1">'true'</span>
</pre></div>
</li>
<li><p class="first">let it replace the default Element class</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">lookup</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">ElementDefaultClassLookup</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">element</span><span class="o">=</span><span class="n">HonkElement</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">parser</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XMLParser</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">set_element_class_lookup</span><span class="p">(</span><span class="n">lookup</span><span class="p">)</span>
</pre></div>
</li>
</ul>
</div>
<div class="slide" id="example-a-simple-element-class-2">
<h1>Example: a simple Element class (2)</h1>
<ul>
<li><p class="first">use the new Element class</p>
<div class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XML</span><span class="p">(</span><span class="s1">'<root><honk honking="true"/></root>'</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">parser</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">honking</span>
<span class="go">False</span>
<span class="gp">>>> </span><span class="n">root</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">honking</span>
<span class="go">True</span>
</pre></div>
</li>
</ul>
</div>
<div class="slide" id="id4">
<h1>Mapping Python classes to nodes</h1>
<ul class="simple">
<li>The Element class lookup<ul>
<li>lxml tells you about a node, you return a class</li>
<li>no restrictions on lookup algorithm</li>
<li>each parser can use a different class lookup scheme</li>
<li>lookup schemes can be chained through fallbacks</li>
</ul>
</li>
<li>Classes can be selected based on<ul>
<li>the node type (element, comment or processing instruction)<ul>
<li><tt class="docutils literal">ElementDefaultClassLookup()</tt></li>
</ul>
</li>
<li>the namespaced node name<ul>
<li><tt class="docutils literal">CustomElementClassLookup()</tt> + a fallback</li>
<li><tt class="docutils literal">ElementNamespaceClassLookup()</tt> + a fallback</li>
</ul>
</li>
<li>the value of an attribute (e.g. <tt class="docutils literal">id</tt> or <tt class="docutils literal">class</tt>)<ul>
<li><tt class="docutils literal">AttributeBasedElementClassLookup()</tt> + a fallback</li>
</ul>
</li>
<li>read-only inspection of the tree<ul>
<li><tt class="docutils literal">PythonElementClassLookup()</tt> + a fallback</li>
</ul>
</li>
</ul>
</li>
</ul>
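<p>As one possible illustration of chaining, the sketch below selects a class by the value of the <tt class="docutils literal">class</tt> attribute and falls back to a default class otherwise. The element classes and the sample document are made up for this example:</p>

```python
from lxml import etree


class PlainElement(etree.ElementBase):
    kind = "plain"


class WarningElement(etree.ElementBase):
    kind = "warning"


# Fallback: every element becomes a PlainElement ...
fallback = etree.ElementDefaultClassLookup(element=PlainElement)
# ... unless its "class" attribute selects a more specific class.
lookup = etree.AttributeBasedElementClassLookup(
    "class", {"warning": WarningElement}, fallback)

parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.XML('<doc><p class="warning">hot</p><p>cold</p></doc>', parser)
kinds = [element.kind for element in root]
```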
</div>
<div class="slide" id="designing-an-atom-api">
<h1>Designing an Atom API</h1>
<ul>
<li><p class="first">a feed is a container for entries</p>
<div class="highlight"><pre><span class="c1"># atom.py</span>
<span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">etree</span>

<span class="n">ATOM_NAMESPACE</span> <span class="o">=</span> <span class="s2">"http://www.w3.org/2005/Atom"</span>
<span class="n">_ATOM_NS</span> <span class="o">=</span> <span class="s2">"{</span><span class="si">%s</span><span class="s2">}"</span> <span class="o">%</span> <span class="n">ATOM_NAMESPACE</span>

<span class="k">class</span> <span class="nc">FeedElement</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">ElementBase</span><span class="p">):</span>
    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">entries</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"entry"</span><span class="p">)</span>
</pre></div>
</li>
<li><p class="first">it also has a couple of meta-data children, e.g. <tt class="docutils literal">title</tt></p>
<div class="highlight"><pre><span class="k">class</span> <span class="nc">FeedElement</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">ElementBase</span><span class="p">):</span>
    <span class="c1"># ...</span>
    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">title</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s2">"return the title or None"</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"title"</span><span class="p">)</span>
</pre></div>
</li>
</ul>
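<p>Put together and hooked into a parser, such a class might be used as sketched below. The children of an Atom feed live in the Atom namespace, so the lookups here use qualified names. The sample feed is illustrative only:</p>

```python
from lxml import etree

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"
_ATOM_NS = "{%s}" % ATOM_NAMESPACE


class FeedElement(etree.ElementBase):
    @property
    def entries(self):
        return self.findall(_ATOM_NS + "entry")

    @property
    def title(self):
        "return the title element or None"
        return self.find(_ATOM_NS + "title")


# Map the Atom "feed" tag to the class and install the lookup in a parser.
lookup = etree.ElementNamespaceClassLookup()
lookup.get_namespace(ATOM_NAMESPACE)["feed"] = FeedElement

parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

feed = etree.XML(b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry><title>Atom-Powered Robots Run Amok</title></entry>
</feed>""", parser)

feed_title = feed.title.text
entry_count = len(feed.entries)
unqualified = feed.find("title")   # no namespace given: finds nothing
```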
</div>
<div class="slide" id="consider-lxml-objectify">
<h1>Consider lxml.objectify</h1>
<ul class="simple">
<li>ready-to-use, generic Python object API for XML</li>
</ul>
<div class="highlight"><pre><span class="o">>>></span> <span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">objectify</span>
<span class="o">>>></span> <span class="n">feed</span> <span class="o">=</span> <span class="n">objectify</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s2">"atom-example.xml"</span><span class="p">)</span><span class="o">.</span><span class="n">getroot</span><span class="p">()</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">feed</span><span class="o">.</span><span class="n">title</span><span class="p">)</span>
<span class="n">Example</span> <span class="n">Feed</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">([</span><span class="n">entry</span><span class="o">.</span><span class="n">title</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">])</span>
<span class="p">[</span><span class="s1">'Atom-Powered Robots Run Amok'</span><span class="p">]</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">title</span><span class="p">)</span>
<span class="n">Atom</span><span class="o">-</span><span class="n">Powered</span> <span class="n">Robots</span> <span class="n">Run</span> <span class="n">Amok</span>
</pre></div>
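<p>The same kind of access can be tried without a file on disk. This sketch parses an inline feed (the second entry is invented for illustration) with <tt class="docutils literal">objectify.fromstring()</tt>:</p>

```python
from lxml import objectify

feed = objectify.fromstring(b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry><title>Atom-Powered Robots Run Amok</title></entry>
  <entry><title>Second entry</title></entry>
</feed>""")

# Attribute access looks up children in the namespace of the parent.
feed_title = feed.title.text
# Iterating over a child yields all siblings with the same tag.
entry_titles = [entry.title.text for entry in feed.entry]
entry_count = len(feed.entry)
```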
</div>
<div class="slide" id="still-room-for-more-convenience">
<h1>Still room for more convenience</h1>
<div class="highlight"><pre><span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">chain</span>
<span class="k">class</span> <span class="nc">FeedElement</span><span class="p">(</span><span class="n">objectify</span><span class="o">.</span><span class="n">ObjectifiedElement</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">addIDs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s2">"initialise the IDs of feed and entries"</span>
        <span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">chain</span><span class="p">([</span><span class="bp">self</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">entry</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">element</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"id"</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                <span class="nb">id</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">SubElement</span><span class="p">(</span><span class="n">element</span><span class="p">,</span> <span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"id"</span><span class="p">)</span>
                <span class="nb">id</span><span class="o">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">make_guid</span><span class="p">()</span>
</pre></div>
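<p>A self-contained variant of this idea, sketched on top of plain <tt class="docutils literal">etree.ElementBase</tt> rather than objectify to keep it short. The <tt class="docutils literal">make_guid()</tt> helper is not defined in the talk; the UUID-based version here is an assumption:</p>

```python
import uuid
from itertools import chain
from lxml import etree

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"
_ATOM_NS = "{%s}" % ATOM_NAMESPACE


def make_guid():
    # Hypothetical helper: a random UUID written as a URN.
    return uuid.uuid4().urn


class FeedElement(etree.ElementBase):
    def addIDs(self):
        "initialise the IDs of feed and entries"
        for element in chain([self], self.findall(_ATOM_NS + "entry")):
            if element.find(_ATOM_NS + "id") is None:
                id_element = etree.SubElement(element, _ATOM_NS + "id")
                id_element.text = make_guid()


lookup = etree.ElementNamespaceClassLookup()
lookup.get_namespace(ATOM_NAMESPACE)["feed"] = FeedElement
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

feed = etree.XML(
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b'<entry><id>urn:example:existing</id></entry><entry/></feed>',
    parser)
feed.addIDs()

# One ID for the feed itself, then one per entry, in document order.
ids = [element.findtext(_ATOM_NS + "id")
       for element in chain([feed], feed.findall(_ATOM_NS + "entry"))]
```

<p>Existing IDs are left alone; only the feed and the entries that lack an ID receive a new one.</p>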
</div>
<div class="slide" id="incremental-api-design">
<h1>Incremental API design</h1>
<ul class="simple">
<li>choose an XML API to start with<ul>
<li>lxml.etree is general purpose</li>
<li>lxml.objectify is nice for document-style XML</li>
</ul>
</li>
<li>fix Elements that really need some API sugar<ul>
<li>dict-mappings to children with specific content/attributes</li>
<li>properties for specially typed attributes or child values</li>
<li>simplified access to varying content types of an element</li>
<li>shortcuts for unnecessarily deep subtrees</li>
</ul>
</li>
<li>ignore what works well enough with the Element API<ul>
<li>lists of homogeneous children -> Element iteration</li>
<li>string attributes -> .get()/.set()</li>
</ul>
</li>
<li>let the API grow at your fingertips<ul>
<li>play with it and test use cases</li>
<li>avoid "I want because I can" feature explosion!</li>
</ul>
</li>
</ul>
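<p>As an example of "API sugar" for a specially typed child value, the following sketch exposes the Atom <tt class="docutils literal">updated</tt> child of an entry as a <tt class="docutils literal">datetime</tt>. The class name and the fixed timestamp format (UTC, no fractional seconds) are assumptions made for illustration:</p>

```python
from datetime import datetime
from lxml import etree

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"
_ATOM_NS = "{%s}" % ATOM_NAMESPACE
# Assumption: timestamps are in UTC and carry no fractional seconds.
_TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


class EntryElement(etree.ElementBase):
    @property
    def updated(self):
        "the 'updated' child as a datetime, or None if it is missing"
        text = self.findtext(_ATOM_NS + "updated")
        if text is None:
            return None
        return datetime.strptime(text, _TIMESTAMP_FORMAT)

    @updated.setter
    def updated(self, value):
        child = self.find(_ATOM_NS + "updated")
        if child is None:
            child = etree.SubElement(self, _ATOM_NS + "updated")
        child.text = value.strftime(_TIMESTAMP_FORMAT)


lookup = etree.ElementNamespaceClassLookup()
lookup.get_namespace(ATOM_NAMESPACE)["entry"] = EntryElement
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

entry = etree.XML(
    b'<entry xmlns="http://www.w3.org/2005/Atom">'
    b'<updated>2003-12-13T18:30:02Z</updated></entry>', parser)

parsed_timestamp = entry.updated
entry.updated = datetime(2008, 7, 7, 12, 0, 0)
stored_text = entry.findtext(_ATOM_NS + "updated")
```

<p>Plain string attributes and lists of children need no such wrapper; the Element API already covers them.</p>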
</div>
<div class="slide" id="setting-up-the-element-mapping">
<h1>Setting up the Element mapping</h1>
<p>Atom has a namespace => leave the mapping to lxml</p>
<div class="highlight"><pre><span class="c1"># ...</span>
<span class="n">_atom_lookup</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">ElementNamespaceClassLookup</span><span class="p">(</span>
    <span class="n">objectify</span><span class="o">.</span><span class="n">ObjectifyElementClassLookup</span><span class="p">())</span>
<span class="c1"># map the classes to tag names</span>
<span class="n">ns</span> <span class="o">=</span> <span class="n">_atom_lookup</span><span class="o">.</span><span class="n">get_namespace</span><span class="p">(</span><span class="n">ATOM_NAMESPACE</span><span class="p">)</span>
<span class="n">ns</span><span class="p">[</span><span class="s2">"feed"</span><span class="p">]</span> <span class="o">=</span> <span class="n">FeedElement</span>
<span class="n">ns</span><span class="p">[</span><span class="s2">"entry"</span><span class="p">]</span> <span class="o">=</span> <span class="n">EntryElement</span>
<span class="c1"># ... and so on</span>
<span class="c1"># or use ns.update(vars()) with appropriate class names</span>
<span class="c1"># create a parser that does some whitespace cleanup</span>
<span class="n">atom_parser</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XMLParser</span><span class="p">(</span><span class="n">remove_blank_text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># make it use our Atom classes</span>
<span class="n">atom_parser</span><span class="o">.</span><span class="n">set_element_class_lookup</span><span class="p">(</span><span class="n">_atom_lookup</span><span class="p">)</span>
<span class="c1"># and help users in using our parser setup</span>
<span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="nb">input</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">etree</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">atom_parser</span><span class="p">)</span>
</pre></div>
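<p>The same setup can be tried in isolation. The following is a sketch under assumptions, not the module from the talk: the two classes and their properties are stand-ins, <tt class="docutils literal">ATOM_NAMESPACE</tt> is assumed to be the Atom namespace URI from RFC 4287, and lxml's default lookup replaces the objectify fallback to keep the example small.</p>

```python
from lxml import etree

# assumed value: the Atom 1.0 namespace URI
ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"


class FeedElement(etree.ElementBase):
    """Stand-in class; the property name is made up for this sketch."""

    @property
    def entries(self):
        return self.findall("{%s}entry" % ATOM_NAMESPACE)


class EntryElement(etree.ElementBase):
    """Stand-in class; the property name is made up for this sketch."""

    @property
    def title_text(self):
        return self.findtext("{%s}title" % ATOM_NAMESPACE)


# same pattern as on the slide, but falling back to lxml's default lookup
lookup = etree.ElementNamespaceClassLookup()
ns = lookup.get_namespace(ATOM_NAMESPACE)
ns["feed"] = FeedElement
ns["entry"] = EntryElement

parser = etree.XMLParser(remove_blank_text=True)
parser.set_element_class_lookup(lookup)

# build a tiny sample feed in memory and serialise it,
# so that there is some input to parse
sample = etree.Element("{%s}feed" % ATOM_NAMESPACE)
sample_entry = etree.SubElement(sample, "{%s}entry" % ATOM_NAMESPACE)
etree.SubElement(sample_entry, "{%s}title" % ATOM_NAMESPACE).text = "Hello"
data = etree.tostring(sample)

# elements parsed with this parser come back as our classes
feed = etree.fromstring(data, parser)
print(type(feed).__name__)                       # FeedElement
print([e.title_text for e in feed.entries])      # ['Hello']
```

<p>Tags without a registered class, such as <tt class="docutils literal">title</tt> here, simply fall through to the fallback lookup and come back as ordinary elements.</p>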
</div>
<div class="slide" id="using-your-new-atom-api">
<h1>Using your new Atom API</h1>
<div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">atom</span>
<span class="gp">>>> </span><span class="n">feed</span> <span class="o">=</span> <span class="n">atom</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s2">"ep2008/atom-example.xml"</span><span class="p">)</span><span class="o">.</span><span class="n">getroot</span><span class="p">()</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">))</span>
<span class="go">1</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">entry</span><span class="o">.</span><span class="n">title</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">])</span>
<span class="go">['Atom-Powered Robots Run Amok']</span>
<span class="gp">>>> </span><span class="n">link_tag</span> <span class="o">=</span> <span class="s2">"{</span><span class="si">%s</span><span class="s2">}link"</span> <span class="o">%</span> <span class="n">atom</span><span class="o">.</span><span class="n">ATOM_NAMESPACE</span>
<span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"href"</span><span class="p">)</span> <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">feed</span><span class="o">.</span><span class="n">iter</span><span class="p">(</span><span class="n">link_tag</span><span class="p">)])</span>
<span class="go">['http://example.org/', 'http://example.org/2003/12/13/atom03']</span>
</pre></div>
</div>
<div class="slide" id="summary-of-lesson-3">
<h1>Summary of lesson 3</h1>
<p>To implement an XML API ...</p>
<ol class="arabic simple">
<li>start off with lxml's Element API<ul>
<li>or take a look at the object API of lxml.objectify</li>
</ul>
</li>
<li>specialise it into a set of custom Element classes</li>
<li>map them to XML tags using one of the lookup schemes</li>
<li>improve the API incrementally while using it<ul>
<li>discover inconveniences and smooth them out</li>
<li>avoid putting work into things that already work</li>
</ul>
</li>
</ol>
</div>
<div class="slide" id="conclusion">
<h1>Conclusion</h1>
<p>lxml ...</p>
<ul class="simple">
<li>provides a convenient set of tools for XML and HTML<ul>
<li>parsing</li>
<li>generating</li>
<li>working with in-memory trees</li>
</ul>
</li>
<li>follows Python idioms wherever possible<ul>
<li>highly extensible through wrapping and subclassing</li>
<li>callable objects for XPath, CSS selectors, XSLT, schemas</li>
<li>iteration for tree traversal (even while parsing)</li>
<li>list-/dict-like APIs, properties, keyword arguments, ...</li>
</ul>
</li>
<li>makes extension and specialisation easy<ul>
<li>write a special XML generator module in trivial code</li>
<li>write your own XML API incrementally on-the-fly</li>
</ul>
</li>
</ul>
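<p>Two of the idioms named above, callable XPath objects and iteration for tree traversal, can be sketched in a few lines. The sample tree is invented for the illustration, and the namespace URI is assumed to be the Atom one from RFC 4287.</p>

```python
from lxml import etree

# assumed value: the Atom 1.0 namespace URI
ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"

# a small made-up tree to work on
feed = etree.Element("{%s}feed" % ATOM_NAMESPACE)
etree.SubElement(feed, "{%s}link" % ATOM_NAMESPACE,
                 href="http://example.org/")
entry = etree.SubElement(feed, "{%s}entry" % ATOM_NAMESPACE)
etree.SubElement(entry, "{%s}link" % ATOM_NAMESPACE,
                 href="http://example.org/2003/12/13/atom03")

# an XPath expression is compiled once and then called like a function
find_hrefs = etree.XPath("//a:link/@href", namespaces={"a": ATOM_NAMESPACE})
hrefs = find_hrefs(feed)
print(hrefs)
# ['http://example.org/', 'http://example.org/2003/12/13/atom03']

# tree traversal is plain iteration, in document order
tags = [etree.QName(el).localname for el in feed.iter()]
print(tags)
# ['feed', 'link', 'entry', 'link']
```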
</div>
</div>
</body>
</html>