Blob Blame History Raw
This tests autolink::

    >>> from lxml.html import usedoctest
    >>> from lxml.html.clean import autolink_html
    >>> print(autolink_html('''
    ... <div>Link here: http://test.com/foo.html.</div>
    ... '''))
    <div>Link here: <a href="http://test.com/foo.html">http://test.com/foo.html</a>.</div>
    >>> print(autolink_html('''
    ... <div>Mail me at mailto:ianb@test.com or http://myhome.com</div>
    ... '''))
    <div>Mail me at <a href="mailto:ianb@test.com">ianb@test.com</a>
    or <a href="http://myhome.com">http://myhome.com</a></div>
    >>> print(autolink_html('''
    ... <div>The <b>great</b> thing is the http://link.com links <i>and</i>
    ... the http://foobar.com links.</div>'''))
    <div>The <b>great</b> thing is the <a href="http://link.com">http://link.com</a> links <i>and</i>
    the <a href="http://foobar.com">http://foobar.com</a> links.</div>
    >>> print(autolink_html('''
    ... <div>Link: &lt;http://foobar.com&gt;</div>'''))
    <div>Link: &lt;<a href="http://foobar.com">http://foobar.com</a>&gt;</div>
    >>> print(autolink_html('''
    ... <div>Link: (http://foobar.com)</div>'''))
    <div>Link: (<a href="http://foobar.com">http://foobar.com</a>)</div>

Parenthesis are tricky, we'll do our best::

    >>> print(autolink_html('''
    ... <div>(Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))</div>
    ... '''))
    <div>(Link: <a href="http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)">http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)</a>)</div>
    >>> print(autolink_html('''
    ... <div>... a link: http://foo.com)</div>
    ... '''))
    <div>... a link: <a href="http://foo.com">http://foo.com</a>)</div>

Some cases that won't be caught (on purpose)::

    >>> print(autolink_html('''
    ... <div>A link to http://localhost/foo/bar won't, but a link to
    ...  http://test.com will</div>'''))
    <div>A link to http://localhost/foo/bar won't, but a link to
    <a href="http://test.com">http://test.com</a> will</div>
    >>> print(autolink_html('''
    ... <div>A link in <textarea>http://test.com</textarea></div>'''))
    <div>A link in <textarea>http://test.com</textarea></div>
    >>> print(autolink_html('''
    ... <div>A link in <a href="http://foo.com">http://bar.com</a></div>'''))
    <div>A link in <a href="http://foo.com">http://bar.com</a></div>
    >>> print(autolink_html('''
    ... <div>A link in <code>http://foo.com</code> or
    ... <span class="nolink">http://bar.com</span></div>'''))
    <div>A link in <code>http://foo.com</code> or
    <span class="nolink">http://bar.com</span></div>

There's also a word wrapping function, that should probably be run
after autolink::

    >>> from lxml.html.clean import word_break_html
    >>> def pascii(s):
    ...     print(s.encode('ascii', 'xmlcharrefreplace').decode('ascii'))
    >>> pascii(word_break_html( u'''
    ... <div>Hey you
    ... 12345678901234567890123456789012345678901234567890</div>'''))
    <div>Hey you
    1234567890123456789012345678901234567890&#8203;1234567890</div>

Not everything is broken:

    >>> pascii(word_break_html('''
    ... <div>Hey you
    ... <code>12345678901234567890123456789012345678901234567890</code></div>'''))
    <div>Hey you
    <code>12345678901234567890123456789012345678901234567890</code></div>
    >>> pascii(word_break_html('''
    ... <a href="12345678901234567890123456789012345678901234567890">text</a>'''))
    <a href="12345678901234567890123456789012345678901234567890">text</a>