lxml.html.diff does HTML comparisons.  These are word-based
comparisons.

First, a handy function for normalizing whitespace and doing word
wrapping::

    >>> import re, textwrap
    >>> def pwrapped(text):
    ...     text = re.sub(r'[ \n\t\r]+', ' ', text)
    ...     text = textwrap.fill(text)
    ...     print(text)

    >>> def pdiff(text1, text2):
    ...     pwrapped(htmldiff(text1, text2))

Example::

    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> html1 = '<p>This is some test text with some changes and some same stuff</p>'
    >>> html2 = '''<p>This is some test textual writing with some changed stuff
    ... and some same stuff</p>'''
    >>> pdiff(html1, html2)
    <p>This is some test <ins>textual writing with some changed
    stuff</ins> <del>text with some changes</del> and some same stuff</p>
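The word-level ins/del behavior shown above can be sketched with the
stdlib's ``difflib``; this is a minimal illustration of the idea only,
not lxml's implementation (``htmldiff`` additionally parses and
preserves the markup):

```python
import difflib

def word_diff(old_text, new_text):
    # Word-level diff in the spirit of htmldiff: matched words pass
    # through, insertions get <ins>, deletions get <del>.
    # (Illustrative sketch only; real htmldiff also handles tags.)
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            out.extend(old_words[i1:i2])
        else:
            if op in ('replace', 'insert'):
                out.append('<ins>%s</ins>' % ' '.join(new_words[j1:j2]))
            if op in ('replace', 'delete'):
                out.append('<del>%s</del>' % ' '.join(old_words[i1:i2]))
    return ' '.join(out)
```

For plain text this already reproduces the shape of the outputs in the
tests below, e.g. ``word_diff('Hi guys !!', 'Hi guy !!')`` yields
``Hi <ins>guy</ins> <del>guys</del> !!``.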
Style tags are largely ignored in terms of differences, though markup
is not eliminated::

    >>> html1 = '<p>Hi <i>you guys</i></p>'
    >>> html2 = '<p>Hi <i>you</i> guys</p>'
    >>> pdiff(html1, html2)
    <p>Hi <i>you</i> guys</p>
    >>> pdiff('text', '<p>text</p>')
    <p>text</p>
    >>> pdiff('<i>Hi guys</i> !!', '<b>Hi guy</b> !!')
    <b>Hi <ins>guy</ins> <del>guys</del></b> !!
    >>> pdiff('Hi', 'Hi')
    Hi
    >>> pdiff('<i>A B</i> C', '<i>A</i> C')
    <i>A <del>B</del></i> C
    >>> pdiff('<i>A B</i> C', '<i>B</i> C')
    <i><del>A</del> B</i> C
    >>> pdiff('<p></p>', '<p></p>')
    <p></p>
    >>> pdiff('<p>Hi</p>', '<p>Bye</p>')
    <p><ins>Bye</ins></p> <p><del>Hi</del></p>
    >>> pdiff('<p>Hi Guy</p>', '<p>Bye Guy</p>')
    <p><ins>Bye</ins> <del>Hi</del> Guy</p>
    >>> pdiff('<p>Hey there</p>', '')
    <p><del>Hey there</del></p>
Movement between paragraphs is ignored, as tag-based changes are
generally ignored::

    >>> pdiff('<p>Hello</p> <p>World</p>', '<p>Hello World</p>')
    <p>Hello World</p>
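The reason paragraph splits and mergers produce no markers: once the
tags are set aside, the word sequences are identical. A rough sketch
of that comparison (a plain regex, not lxml's tokenizer):

```python
import re

def words_only(html):
    # Replace every tag with a space and compare word content only --
    # this is why tag-based changes above yield no <ins>/<del>.
    # (Hypothetical helper for illustration.)
    return re.sub(r'<[^>]+>', ' ', html).split()
```

With this, ``words_only('<p>Hello</p> <p>World</p>')`` and
``words_only('<p>Hello World</p>')`` are both ``['Hello', 'World']``,
so there is nothing to mark as changed.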
As a special case, changing the href of a link is displayed, and
images are treated like words:

    >>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://google.com">search</a>')
    <a href="http://google.com">search <ins>Link: http://google.com</ins>
    <del>Link: http://yahoo.com</del></a>
    >>> pdiff('<p>Print this <img src="print.gif"></p>', '<p>Print this</p>')
    <p>Print this <del><img src="print.gif"></del></p>
    >>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://yahoo.com">search</a>')
    <a href="http://yahoo.com">search</a>

Images may sometimes not have 'src' attributes:

    >>> pdiff('<img src="tease"> test <img>', '<img src="tease"> test <img>')
    <img src="tease"> test <img>

A test of empty elements:

    >>> pdiff('some <br> text', 'some <br> test')
    some <br> <ins>test</ins> <del>text</del>
Whitespace is generally ignored for the diff but preserved during the diff:

    >>> print(htmldiff('<p>first\nsecond\nthird</p>', '<p>  first\n second\nthird</p>'))
    <p>first second third</p>

    >>> print(htmldiff('<pre>first\nsecond\nthird</pre>', '<pre>first\nsecond\nthird</pre>'))
    <pre>first
    second
    third</pre>
    >>> print(htmldiff('<pre>first\nsecond</pre>', '<pre>first\nsecond\n third</pre>'))
    <pre>first
    second
     <ins>third</ins> </pre>
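The exemption for ``<pre>`` content can be sketched as a whitespace
normalizer that collapses runs of whitespace everywhere except inside
``<pre>`` blocks; this is a hypothetical helper illustrating the
"ignored for the diff but preserved" behavior above, not lxml's code:

```python
import re

def normalize_outside_pre(html):
    # Split on <pre>...</pre> blocks (kept via the capture group),
    # collapse whitespace runs only in the non-<pre> parts.
    parts = re.split(r'(<pre>.*?</pre>)', html, flags=re.DOTALL)
    out = []
    for part in parts:
        if part.startswith('<pre>'):
            out.append(part)  # keep <pre> content verbatim
        else:
            out.append(re.sub(r'[ \n\t\r]+', ' ', part))
    return ''.join(out)
```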
The sixteen combinations:

First "insert start" (del start/middle/end/none):

    >>> pdiff('A B C', 'D B C')
    <ins>D</ins> <del>A</del> B C
    >>> pdiff('A B C', 'D A C')
    <ins>D</ins> A <del>B</del> C
    >>> pdiff('A B C', 'D A B')
    <ins>D</ins> A B <del>C</del>
    >>> pdiff('A B C', 'D A B C')
    <ins>D</ins> A B C

Next, "insert middle" (del start/middle/end/none):

    >>> pdiff('A B C', 'D B C')
    <ins>D</ins> <del>A</del> B C
    >>> pdiff('A B C', 'A D C')
    A <ins>D</ins> <del>B</del> C
    >>> pdiff('A B C', 'A D B')
    A <ins>D</ins> B <del>C</del>

This one case hits the threshold of our insensitive matching:

    >>> pdiff('A B C', 'A D B C')
    <ins>A D</ins> <del>A</del> B C

Then "insert end" (del start/middle/end/none):

    >>> pdiff('A B C', 'B C D')
    <del>A</del> B C <ins>D</ins>
    >>> pdiff('A B C', 'A C D')
    A <del>B</del> C <ins>D</ins>
    >>> pdiff('A B C', 'A B D')
    A B <ins>D</ins> <del>C</del>
    >>> pdiff('A B C', 'A B C D')
    A B C <ins>D</ins>

Then no insert (del start/middle/end):

    >>> pdiff('A B C', 'B C')
    <del>A</del> B C
    >>> pdiff('A B C', 'A C')
    A <del>B</del> C
    >>> pdiff('A B C', 'A B')
    A B <del>C</del>

    >>> pdiff('A <p>hey there how are you?</p>', 'A')
    A <p><del>hey there how are you?</del></p>
Testing a larger document, to make sure there are not weird
unnecessary parallels found:

    >>> pdiff('''
    ... <p>This is a test document with many words in it that goes on
    ... for a while and doesn't have anything to do with the next
    ... document that we match this against</p>''', '''
    ... <p>This is another document with few similarities to the preceding
    ... one, but enough that it may have overlap that could turn into
    ... a confusing series of deletes and inserts.</p>
    ... ''')
    <p><ins>This is another document with few similarities to the
    preceding one, but enough that it may have overlap that could turn
    into a confusing series of deletes and inserts.</ins></p>
    <p><del>This is a test document with many words in it that goes on
    for a while and doesn't have anything to do with the next document
    that we match this against</del></p>
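A quick way to see why a sensible diff should treat these two
documents as one big insert plus one big delete is to probe their
word-level similarity; a sketch using ``difflib`` on paraphrased
versions of the two paragraphs above (illustration only):

```python
import difflib

# Two documents that share common filler words but no real content.
old = ("This is a test document with many words in it that goes on "
       "for a while and doesn't have anything to do with the next "
       "document that we match this against").split()
new = ("This is another document with few similarities to the preceding "
       "one, but enough that it may have overlap that could turn into "
       "a confusing series of deletes and inserts.").split()

# A low ratio means the scattered shared words should not be stitched
# into a confusing interleaving of tiny matches.
ratio = difflib.SequenceMatcher(a=old, b=new).ratio()
```

The low similarity score is the signal a diff engine can use to prefer
a wholesale replacement over many small ins/del fragments.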
Annotation of content can also be done, where every bit of content is
marked up with information about where it came from.  First, some
setup; note that html_annotate is called with a sequence of documents
and the annotation associated with that document.  We'll just use
indexes, but you could use author or timestamp information.

    >>> def markup(text, annotation):
    ...     return '<span version="%s">%s</span>' % (annotation, text)
    >>> def panno(*docs):
    ...     pwrapped(html_annotate([(doc, index) for index, doc in enumerate(docs)],
    ...                            markup=markup))

Now, a sequence of documents:

    >>> panno('Hello cruel world', 'Hi cruel world', 'Hi world')
    <span version="1">Hi</span> <span version="0">world</span>
    >>> panno('A similar document', 'A similar document',
    ...       'A similar document here')
    <span version="0">A similar document</span> <span
    version="2">here</span>
    >>> panno('<p>P1 para</p><p>P2 para</p>', '<p>P1 para</p><p>P3 foo</p>')
    <p><span version="0">P1 para</span></p> <p><span version="1">P3
    foo</span></p>
    >>> panno('Hello<p>There World</p>', 'Hello<p>There Town</p>')
    <span version="0">Hello</span> <p><span version="0">There</span> <span
    version="1">Town</span></p>
    >>> panno('<p>Hello</p>There World', '<p>Hello</p>There Town')
    <p><span version="0">Hello</span></p> <span version="0">There</span>
    <span version="1">Town</span>
    >>> panno('<p>Hello</p><p>There World</p>',
    ...       '<p>Hello</p><p>There Town</p>')
    <p><span version="0">Hello</span></p> <p><span
    version="0">There</span> <span version="1">Town</span></p>
    >>> panno('<p>Hi You</p>',
    ...       '<p>Hi You</p>',
    ...       '<p>Hi You</p>')
    <p><span version="0">Hi You</span></p>
    >>> panno('<p>Hey</p>',
    ...       '<p>Hey</p>')
    <p><span version="0">Hey</span></p>
    >>> panno('<p>Hey You</p>',
    ...       '<p>Hey Guy</p>')
    <p><span version="0">Hey</span> <span version="1">Guy</span></p>
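The annotation idea -- attribute each surviving word to the earliest
version that introduced it -- can be sketched for plain,
space-separated text with ``difflib``; this is a hypothetical helper
illustrating what ``html_annotate`` does, without the HTML handling:

```python
import difflib

def annotate_words(docs):
    # Start with every word of the first document tagged version 0,
    # then fold in each later version: words that survive keep their
    # old annotation, new words get the current version number.
    annotated = [(word, 0) for word in docs[0].split()]
    for version, doc in enumerate(docs[1:], 1):
        new_words = doc.split()
        old_words = [w for w, _ in annotated]
        matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
        result = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                result.extend(annotated[i1:i2])   # keep old annotation
            elif op in ('replace', 'insert'):
                result.extend((w, version) for w in new_words[j1:j2])
        annotated = result
    return annotated
```

On the first sequence above, this attributes ``Hi`` to version 1 and
``world`` to version 0, matching the ``panno`` output.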
Internals
---------

Some utility functions::

    >>> from lxml.html.diff import fixup_ins_del_tags, split_unbalanced, split_trailing_whitespace
    >>> def pfixup(text):
    ...     print(fixup_ins_del_tags(text).strip())

    >>> pfixup('<ins><p>some text <b>and more text</b> and more</p></ins>')
    <p><ins>some text <b>and more text</b> and more</ins></p>
    >>> pfixup('<p><ins>Hi!</ins> you</p>')
    <p><ins>Hi!</ins> you</p>
    >>> pfixup('<div>Some text <ins>and <p>more text</p></ins> </div>')
    <div>Some text <ins>and </ins><p><ins>more text</ins></p> </div>
    >>> pfixup('''<ins>
    ... <table><tr><td>One table</td><td>More stuff</td></tr></table></ins>''')
    <table><tr><td><ins>One table</ins></td><td><ins>More stuff</ins></td></tr></table>

Testing split_unbalanced::

    >>> split_unbalanced(['<a href="blah">', 'hey', '</a>'])
    ([], ['<a href="blah">', 'hey', '</a>'], [])
    >>> split_unbalanced(['<a href="blah">', 'hey'])
    (['<a href="blah">'], ['hey'], [])
    >>> split_unbalanced(['Hey', '</i>', 'You', '</b>'])
    ([], ['Hey', 'You'], ['</i>', '</b>'])
    >>> split_unbalanced(['So', '</i>', 'Hi', '<b>', 'There', '</b>'])
    ([], ['So', 'Hi', '<b>', 'There', '</b>'], ['</i>'])
    >>> split_unbalanced(['So', '</i>', 'Hi', '<i>', 'There'])
    (['<i>'], ['So', 'Hi', 'There'], ['</i>'])

Testing split_trailing_whitespace::

    >>> split_trailing_whitespace('test\n\n')
    ('test', '\n\n')
    >>> split_trailing_whitespace(' test\n ')
    (' test', '\n ')
    >>> split_trailing_whitespace('test')
    ('test', '')
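The documented behavior of ``split_trailing_whitespace`` is simple
enough to re-state as code; a minimal re-implementation matching the
expected outputs above (illustrative; the real helper lives in
lxml.html.diff):

```python
def split_trailing_whitespace(word):
    # Split a string into (content, trailing whitespace); leading
    # whitespace is deliberately left attached to the content.
    stripped = word.rstrip()
    return stripped, word[len(stripped):]
```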