lxml.html.diff does HTML comparisons.  These are word-based
comparisons.

First, a handy function for normalizing whitespace and doing word wrapping::

    >>> import re, textwrap
    >>> def pwrapped(text):
    ...     text = re.sub(r'[ \n\t\r]+', ' ', text)
    ...     text = textwrap.fill(text)
    ...     print(text)
    >>> def pdiff(text1, text2):
    ...     pwrapped(htmldiff(text1, text2))

Example::

    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> html1 = 'This is some test text with some changes and some same stuff'
    >>> html2 = '''This is some test textual writing with some changed stuff
    ... and some same stuff'''
    >>> pdiff(html1, html2)
    This is some test textual writing with some changed stuff
    text with some changes and some same stuff
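
To make "word-based" concrete, here is a minimal sketch that diffs two plain-text strings word by word using the standard library's ``difflib``. It is not lxml's implementation and it ignores markup entirely; the name ``word_diff`` is hypothetical. Insertions are emitted before deletions, which is the order seen in the example output above.

```python
import difflib


def word_diff(old, new):
    """Return a word-level diff of two plain-text strings.

    Hypothetical helper for illustration only; it does not parse HTML.
    """
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.extend(new_words[j1:j2])
            continue
        # For 'replace', 'insert' and 'delete': new words first, then old words.
        if j2 > j1:
            out.append('<ins>' + ' '.join(new_words[j1:j2]) + '</ins>')
        if i2 > i1:
            out.append('<del>' + ' '.join(old_words[i1:i2]) + '</del>')
    return ' '.join(out)


print(word_diff('Hi Guy', 'Bye Guy'))  # <ins>Bye</ins> <del>Hi</del> Guy
```

Splitting on whitespace before comparing is what makes the result insensitive to line breaks and repeated spaces, much as ``pwrapped`` normalizes whitespace before printing.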

    >>> html1 = 'Hi you guys'
    >>> html2 = 'Hi you guys'
    >>> pdiff(html1, html2)
    Hi you guys
    >>> pdiff('text', 'text')
    text
    >>> pdiff('Hi guys !!', 'Hi guy !!')
    Hi guy
    >>> pdiff('Hi', 'Bye')
    Bye Hi
    >>> pdiff('Hi Guy', 'Bye Guy')
    Bye Hi Guy
    >>> pdiff('Hey there', '')
    Hey there
    >>> pdiff('Hello World', 'Hello World')
    Hello World

As a special case, changing the href of a link is displayed, and
images are treated like words:

    >>> pdiff('search', 'search')
    search Link: http://google.com
    >>> pdiff('Print this', 'Print this')
    Print this
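
One way to get this effect is to turn link targets and images into tokens of their own before comparing, so that a changed href or a swapped image shows up as a changed "word". The sketch below does this with the standard library's ``html.parser``. It is an assumption about how such behaviour could be achieved, not a description of lxml's internals; ``LinkTokenizer``, ``tokenize``, the ``[img:...]`` token format, and the URLs ``http://example.com`` and ``icon.png`` are all hypothetical.

```python
from html.parser import HTMLParser


class LinkTokenizer(HTMLParser):
    """Collect words, adding a 'Link: URL' token after each link's text."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._hrefs = []  # hrefs of the <a> elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._hrefs.append(dict(attrs).get('href'))
        elif tag == 'img':
            # An image becomes one word-like token keyed by its src.
            self.tokens.append('[img:%s]' % dict(attrs).get('src'))

    def handle_endtag(self, tag):
        if tag == 'a' and self._hrefs:
            href = self._hrefs.pop()
            if href:
                self.tokens.append('Link: ' + href)

    def handle_data(self, data):
        self.tokens.extend(data.split())


def tokenize(html):
    parser = LinkTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


old = tokenize('<a href="http://example.com">search</a>')
new = tokenize('<a href="http://google.com">search</a>')
print(old)  # ['search', 'Link: http://example.com']
print(new)  # ['search', 'Link: http://google.com']
```

Because the link text is identical and only the trailing ``Link:`` token differs, a word-level comparison of the two token lists reports the href change and nothing else.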

    >>> print(htmldiff('first\nsecond\nthird', 'first\n second\nthird'))
    first second third
    >>> print(htmldiff('first\nsecond\nthird', 'first\nsecond\nthird'))
    first second third
    >>> print(htmldiff('first\nsecond', 'first\nsecond\n third'))
    first second third

The sixteen combinations::

First "insert start" (del start/middle/end/none):

    >>> pdiff('A B C', 'D B C D

    >>> pdiff('hey there how are you?', 'A')
    A hey there how are you?
    >>> pdiff('''This is a test document with many words in it that goes on
    ... for a while and doesn't have anything do to with the next
    ... document that we match this against''', '''
    ... This is another document with few similarities to the preceding
    ... one, but enough that it may have overlap that could turn into
    ... a confusing series of deletes and inserts.
    ... ''')
    This is another document with few similarities to the preceding one, but enough that it may have overlap that could turn into a confusing series of deletes and inserts.
    This is a test document with many words in it that goes on for
    a while and doesn't have anything do to with the next document that we
    match this against
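
In the example above the two documents share only a few scattered words, and the result shows the new text as a whole followed by the old text as a whole, not an interleaving of small inserts and deletes. One way to obtain that kind of output is to ignore very short matching runs when aligning the word lists. The following is a sketch of that idea using ``difflib``; it is not claimed to be how lxml does it, and ``CoarseMatcher``, ``coarse_word_diff`` and the threshold of two words are assumptions made for illustration.

```python
import difflib


class CoarseMatcher(difflib.SequenceMatcher):
    """SequenceMatcher that discards matching runs shorter than min_size."""

    min_size = 2  # assumed threshold, chosen only for illustration

    def get_matching_blocks(self):
        blocks = super().get_matching_blocks()
        # Keep the zero-length sentinel block that terminates the list.
        return [b for b in blocks if b.size >= self.min_size or b.size == 0]


def coarse_word_diff(old, new):
    """Word-level diff of plain text that ignores isolated one-word matches."""
    old_words, new_words = old.split(), new.split()
    matcher = CoarseMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.extend(new_words[j1:j2])
            continue
        if j2 > j1:
            out.append('<ins>' + ' '.join(new_words[j1:j2]) + '</ins>')
        if i2 > i1:
            out.append('<del>' + ' '.join(old_words[i1:i2]) + '</del>')
    return ' '.join(out)


print(coarse_word_diff('the cat sat on a mat', 'a dog ran on the hill'))
# <ins>a dog ran on the hill</ins> <del>the cat sat on a mat</del>
```

The two sentences share the words "the", "on" and "a", but only as isolated one-word matches, so every match is dropped and the diff collapses to one insertion and one deletion. A longer shared run, such as three unchanged words in a row, would still be kept as common text.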

    >>> pdiff('P1 para P2 para', 'P1 para P3 foo')
    P1 para P3 foo

    >>> panno('HelloThere World', 'HelloThere Town')
    HelloThere Town
    >>> panno('Hello There World', 'Hello There Town')
    Hello There Town
    >>> panno('Hello There World', 'Hello There Town')
    Hello There Town
    >>> panno('Hi You',
    ...       'Hi You',
    ...       'Hi You')
    Hi You
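
The annotation examples above pass successive versions of a document, oldest first. Assuming the goal of annotation is to attribute each word of the final text to the version that introduced it, the bookkeeping can be sketched with ``difflib`` as follows. This is a plain-text illustration only, not the ``html_annotate`` implementation; ``annotate_words`` and the ``v1``/``v2`` labels are hypothetical.

```python
import difflib


def annotate_words(versions):
    """Attribute each word of the newest text to a version label.

    versions: list of (text, label) pairs, ordered oldest first.
    Returns a list of (word, label) pairs for the last text.
    """
    annotated = []
    for text, label in versions:
        words = text.split()
        old_words = [word for word, _ in annotated]
        matcher = difflib.SequenceMatcher(a=old_words, b=words, autojunk=False)
        current = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                # Unchanged words keep the label they already carry.
                current.extend(annotated[i1:i2])
            else:
                # Inserted or replaced words belong to this version;
                # a pure deletion contributes nothing (j1 == j2).
                current.extend((word, label) for word in words[j1:j2])
        annotated = current
    return annotated


print(annotate_words([('Hello There World', 'v1'),
                      ('Hello There Town', 'v2')]))
# [('Hello', 'v1'), ('There', 'v1'), ('Town', 'v2')]
```

Carrying the labels forward from one version to the next means a word that survives several revisions unchanged stays attributed to the version where it first appeared.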

    >>> panno('',
    ...       '')
    >>> panno('',
    ...       '')

Internals
---------

Some utility functions::

    >>> from lxml.html.diff import fixup_ins_del_tags, split_unbalanced, split_trailing_whitespace
    >>> def pfixup(text):
    ...     print(fixup_ins_del_tags(text).strip())
    >>> pfixup('some text and more text and more')
    some text and more text and more
    >>> pfixup('Hi! you')
    Hi! you
    >>> pfixup('more text')
    more text
    >>> pfixup('One table | More stuff |')
    One table | More stuff |