This tests autolink_html, which turns bare URLs in HTML text into links::
>>> from lxml.html import usedoctest
>>> from lxml.html.clean import autolink_html
>>> print(autolink_html('''
... <div>Link here: http://test.com/foo.html.</div>
... '''))
<div>Link here: <a href="http://test.com/foo.html">http://test.com/foo.html</a>.</div>
>>> print(autolink_html('''
... <div>Mail me at mailto:ianb@test.com or http://myhome.com</div>
... '''))
<div>Mail me at <a href="mailto:ianb@test.com">ianb@test.com</a>
or <a href="http://myhome.com">http://myhome.com</a></div>
>>> print(autolink_html('''
... <div>The <b>great</b> thing is the http://link.com links <i>and</i>
... the http://foobar.com links.</div>'''))
<div>The <b>great</b> thing is the <a href="http://link.com">http://link.com</a> links <i>and</i>
the <a href="http://foobar.com">http://foobar.com</a> links.</div>
>>> print(autolink_html('''
... <div>Link: <http://foobar.com></div>'''))
<div>Link: <<a href="http://foobar.com">http://foobar.com</a>></div>
>>> print(autolink_html('''
... <div>Link: (http://foobar.com)</div>'''))
<div>Link: (<a href="http://foobar.com">http://foobar.com</a>)</div>
Parentheses are tricky; we'll do our best::
>>> print(autolink_html('''
... <div>(Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))</div>
... '''))
<div>(Link: <a href="http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)">http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)</a>)</div>
>>> print(autolink_html('''
... <div>... a link: http://foo.com)</div>
... '''))
<div>... a link: <a href="http://foo.com">http://foo.com</a>)</div>
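The trailing-punctuation and parenthesis handling shown above can be approximated with a small heuristic. What follows is a minimal plain-text sketch using only the standard library, not lxml's actual implementation: autolink_text(), split_url(), and the URL pattern are hypothetical names and assumptions chosen for illustration. It trims sentence punctuation from the end of a match and drops a closing parenthesis only when the URL contains more closing than opening ones.

```python
import html
import re

# Assumption: a deliberately loose pattern for bare http(s) URLs in plain text.
_URL_RE = re.compile(r'https?://[^\s<>]+')
_TRAILING_PUNCT = '.,;:!?'


def split_url(candidate):
    """Split a regex match into (url, trailing_text).

    Hypothetical helper: strips sentence punctuation, and strips a final ')'
    only while the URL has more ')' than '(' (an unbalanced closer).
    """
    url = candidate
    while url:
        last = url[-1]
        if last in _TRAILING_PUNCT:
            url = url[:-1]
        elif last == ')' and url.count(')') > url.count('('):
            url = url[:-1]
        else:
            break
    return url, candidate[len(url):]


def autolink_text(text):
    """Wrap bare URLs found in plain text in <a> elements."""
    def repl(match):
        url, tail = split_url(match.group(0))
        escaped = html.escape(url)  # keep '&' and quotes safe in the attribute
        return '<a href="%s">%s</a>%s' % (escaped, escaped, tail)
    return _URL_RE.sub(repl, text)


print(autolink_text('(Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))'))
print(autolink_text('... a link: http://foo.com)'))
```

Under this rule the Wikipedia URL keeps its balanced (Central_Point_Software) suffix, while the stray closing parenthesis after http://foo.com stays outside the link. The sketch works on plain text only; it does not parse HTML or skip any elements.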
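Some cases are deliberately not linked: localhost URLs, and URLs inside
<textarea>, <a>, or <code> elements or inside elements with class "nolink"::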
>>> print(autolink_html('''
... <div>A link to http://localhost/foo/bar won't, but a link to
... http://test.com will</div>'''))
<div>A link to http://localhost/foo/bar won't, but a link to
<a href="http://test.com">http://test.com</a> will</div>
>>> print(autolink_html('''
... <div>A link in <textarea>http://test.com</textarea></div>'''))
<div>A link in <textarea>http://test.com</textarea></div>
>>> print(autolink_html('''
... <div>A link in <a href="http://foo.com">http://bar.com</a></div>'''))
<div>A link in <a href="http://foo.com">http://bar.com</a></div>
>>> print(autolink_html('''
... <div>A link in <code>http://foo.com</code> or
... <span class="nolink">http://bar.com</span></div>'''))
<div>A link in <code>http://foo.com</code> or
<span class="nolink">http://bar.com</span></div>
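The skip rules in these examples can be pictured as a filter applied while walking the document. The sketch below is an illustration built on the standard library's html.parser, not lxml's code: the LinkCandidates class, the link_candidates() function, and the three AVOID_* sets are hypothetical, and the sets contain only what the examples above demonstrate. It reports which URLs would be eligible for linking, without rewriting the markup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed skip rules, inferred only from the examples above.
AVOID_ELEMENTS = {'textarea', 'a', 'code'}
AVOID_CLASSES = {'nolink'}
AVOID_HOSTS = {'localhost'}

# Assumption: a loose pattern; trailing punctuation is not trimmed here.
URL_RE = re.compile(r'https?://[^\s<>]+')


def _host(url):
    """Return the URL's host name, or None if the URL cannot be parsed."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


class LinkCandidates(HTMLParser):
    """Collect URLs from text that sits outside every skipped element."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of (tag, skipped) for currently open elements
        self.urls = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        skipped = tag in AVOID_ELEMENTS or bool(AVOID_CLASSES.intersection(classes))
        self._open.append((tag, skipped))

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; ignore stray end tags.
        if any(open_tag == tag for open_tag, _ in self._open):
            while self._open.pop()[0] != tag:
                pass

    def handle_data(self, data):
        # Text nested anywhere below a skipped element is ignored.
        if any(skipped for _, skipped in self._open):
            return
        for match in URL_RE.finditer(data):
            url = match.group(0)
            host = _host(url)
            if host is not None and host not in AVOID_HOSTS:
                self.urls.append(url)


def link_candidates(html_text):
    parser = LinkCandidates()
    parser.feed(html_text)
    parser.close()
    return parser.urls


print(link_candidates("<div>A link to http://localhost/foo/bar won't, "
                      "but a link to http://test.com will</div>"))
```

Keeping the "skipped" flag on a stack means a URL nested several levels below a skipped element is still ignored, which matches the intent of the class-based example.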
There's also a word-wrapping function, which should probably be run
after autolink::
>>> from lxml.html.clean import word_break_html
>>> def pascii(s):
...     print(s.encode('ascii', 'xmlcharrefreplace').decode('ascii'))
>>> pascii(word_break_html('''
... <div>Hey you
... 12345678901234567890123456789012345678901234567890</div>'''))
<div>Hey you
1234567890123456789012345678901234567890&#8203;1234567890</div>
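The break in the output above is a zero-width space (U+200B), which pascii() renders as a character reference. As a rough illustration of the idea, here is a plain-text sketch using only the standard library; break_long_words() is a hypothetical helper rather than lxml's implementation, and its width of 40 is an assumption read off the example output.

```python
import re

ZERO_WIDTH_SPACE = '\u200b'


def break_long_words(text, width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert break_character into every whitespace-free run longer than width.

    Hypothetical sketch: operates on plain text and always cuts at fixed
    offsets; it does not look for "nicer" break points.
    """
    def split_word(match):
        word = match.group(0)
        chunks = [word[i:i + width] for i in range(0, len(word), width)]
        return break_character.join(chunks)

    # Only runs of more than `width` non-whitespace characters are touched.
    return re.sub(r'\S{%d,}' % (width + 1), split_word, text)


broken = break_long_words('Hey you\n' + '1234567890' * 5)
print(broken.encode('ascii', 'xmlcharrefreplace').decode('ascii'))
```

Because the break character is invisible, the text looks unchanged to a reader, but a browser is free to wrap the long run at that point.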
Not everything gets broken up; text inside <code> and attribute values
are left alone::
>>> pascii(word_break_html('''
... <div>Hey you
... <code>12345678901234567890123456789012345678901234567890</code></div>'''))
<div>Hey you
<code>12345678901234567890123456789012345678901234567890</code></div>
>>> pascii(word_break_html('''
... <a href="12345678901234567890123456789012345678901234567890">text</a>'''))
<a href="12345678901234567890123456789012345678901234567890">text</a>