Re: [lxml-dev] Difference between xhtml etrees

Hi,

please CC the list on replies.

D wrote:
2009/6/11 Stefan Behnel:
D wrote:
I have two xhtml documents which I would like to compare. They are available as etrees. Ideally I would like to have a resulting tree, where the appropriate changes are marked with ins and del tags. I don't need anything fancy like a detection of moves.
I had a look at lxml.html.diff http://codespeak.net/lxml/lxmlhtml.html#html-diff but it operates on html strings only, and not on my parsed tree.
Did you try passing the root elements of the trees?
Passing the root objects was a good idea, it generates the difference the way I want it. I just can't manage to get the data back into xhtml. Maybe you could have a look:
Here is my code:

    def expandFiles(filename):
        """open the file named filename, return an etree"""
        document = "".join(open(filename).readlines())
        px = lxml.etree.XMLParser(load_dtd=True, no_network=False)
        px.feed(document)
        rx=px.close()
        docx=lxml.etree.ElementTree(rx)
        return docx
Note that "load_dtd" does not imply validation, just that a DTD will be loaded if referenced. Also, it is a *lot* more efficient to do this:

    parser = lxml.etree.XMLParser(load_dtd=True, no_network=False)

    def expandFiles(filename):
        """open the file named filename, return an etree"""
        return lxml.etree.parse(filename, parser)

... and I'd actually rename the function (or drop it completely).
    r1=expandFiles(r"1.xhtml")
    r2=expandFiles(r"2.xhtml")
    diff = lxml.html.diff.htmldiff(r1.getroot(),r2.getroot())
    # diff is now an html fragment, parse it
    pdiff = lxml.html.document_fromstring(diff)
    lxml.html.html_to_xhtml(pdiff)
    pe = lxml.etree.ElementTree(pdiff)
So far, so good.
    # this gives me an xhtml file that is parsed without errors by firefox, but does not contain any markup
    # it looks like this in firefox: {http://www.w3.org/1999/xhtml}meta> Resist SPR3012 Preparation{http://www.w3.org
Not sure how this can happen. I'll give it a try later today.
    # in addition, all character entities appear in the form &gt; and not as they should: &#62;
Would you have a 'real' example here?
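[Editorial aside on the entity question. The exact entity text is damaged in the archive, so this assumes the complaint is that numeric character references such as `&#62;` come back as named ones such as `&gt;`. If so, the two spellings are equivalent to an XML parser, and the serializer chooses its own spelling on output. A minimal sketch, using the standard library's xml.etree.ElementTree as a stand-in for lxml.etree:]

```python
import xml.etree.ElementTree as ET

# Two spellings of the same character: a numeric and a named reference.
numeric = ET.fromstring("<p>a &#62; b</p>")
named = ET.fromstring("<p>a &gt; b</p>")

# After parsing, both trees hold the plain character, not the reference.
same_text = numeric.text == named.text == "a > b"

# On output the serializer picks its own spelling, whatever the input used.
serialized = ET.tostring(numeric, encoding="unicode")
print(same_text, serialized)
```

The original spelling is not stored in the tree, so it is not expected to survive a parse and serialize round trip.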
I can't manage to transform pdiff into the same form that r1 and r2 are in.
I am sure this is due to a basic misunderstanding of lxml on my part; maybe you can see right away what I am doing wrong?
Not directly, no. Maybe others have an idea?

Stefan
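
[Editorial aside. One plausible reading of the `{http://www.w3.org/1999/xhtml}meta>` debris: ElementTree-style APIs, lxml.etree included, store a namespaced tag name as `{uri}name`. If code formats such a name directly into a start tag, the result is not valid markup, and an HTML parser will likely treat it as text. Whether htmldiff does exactly this is an assumption, not something confirmed in this thread. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree; `strip_namespace` is a hypothetical helper, not an lxml function.]

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_namespace(root, namespace=XHTML_NS):
    """Hypothetical helper: drop one namespace from every tag, in place."""
    prefix = "{%s}" % namespace
    for el in root.iter():
        # Comments and processing instructions can have non-string tags.
        if isinstance(el.tag, str) and el.tag.startswith(prefix):
            el.tag = el.tag[len(prefix):]
    return root


doc = ET.fromstring(
    '<html xmlns="%s"><body><p>Resist</p></body></html>' % XHTML_NS
)
p = doc[0][0]

# Namespaced tags are stored in "{uri}name" form.
qualified = p.tag
# Formatting that name straight into a start tag is not valid markup.
naive_tag = "<%s>" % p.tag

strip_namespace(doc)
plain = p.tag
print(qualified, naive_tag, plain)
```

Stripping the namespace before diffing and re-adding it afterwards with html_to_xhtml may be enough to avoid the problem. lxml.html also offers xhtml_to_html as the counterpart of the html_to_xhtml call used above, which is worth checking before writing a helper like this one.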

Hi All,
please CC the list on replies. I am sorry, I pressed the wrong button.
Note that "load_dtd" does not imply validation, just that a DTD will be loaded if referenced.

Unfortunately my original xhtml is very non-conforming. (I am planning to migrate a laboratory notebook that was unfortunately written in Word. The plan is to copy from Word, paste into Kompozer, then parse the result, get rid of all the Word-specific stuff and validate later. This is necessary because each experiment is composed of many smaller descriptions which will be put together into one big file. Unfortunately Word 2007 still cannot handle a master document that contains other documents.)

I made a running example and attached three small files. The code finds the difference between the two files r1.xhtml and r2.xhtml. The output is written to the file rdiff.xhtml. This file does not display correctly in Firefox.

Please note that the output diff is not totally correct. r1 reads "Leave some solvent in the bowl." and r2 "Leave some solvent in the bowl and heat." The code marks:

    <html:ins>bowl and heat.END{http://www.w3.org/1999/xhtml}p> {http://www.w3.org/1999/xhtml}p> Previous Versions: {http://www.w3.org/1999/xhtml}b>{http://www.w3.org/1999/xhtml}p></html:ins>

as inserted, i.e. "bowl and heat." instead of "and heat".

best
Daniel

    import lxml.etree
    import lxml.html
    import lxml.html.diff

    def minimalExample():
        # files contain entities like
        # often r contains illegal attributes (start , type in ol),
        # not DTD conforming element content (br),
        # and illegally nested paragraphs (p in p, p in b)
        parser = lxml.etree.XMLParser(load_dtd=True, dtd_validation=True, no_network=False)
        r1 = lxml.etree.parse("r1.xhtml", parser)
        r2 = lxml.etree.parse("r2.xhtml", parser)
        diff = lxml.html.diff.htmldiff(r1.getroot(), r2.getroot())
        # diff is an html fragment as a string, so parse it again
        pdiff = lxml.html.document_fromstring(diff)
        lxml.html.html_to_xhtml(pdiff)
        pe = lxml.etree.ElementTree(pdiff)
        pe.write("rdiff.xhtml", pretty_print=True)
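
[Editorial aside on the "bowl and heat." marking. If the diff compares whitespace-separated tokens, which is an assumption about htmldiff's internals rather than something stated in this thread, then "bowl." and "bowl" are different tokens because the full stop is attached to the word. The whole tail then counts as changed. A minimal sketch of that effect using the standard library's difflib; `word_diff` is a hypothetical helper and not part of lxml.]

```python
import difflib


def word_diff(old, new):
    """Hypothetical helper: mark word-level changes with <ins>/<del>."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        else:
            # "replace" yields both parts, "delete"/"insert" only one.
            if i2 > i1:
                out.append("<del>%s</del>" % " ".join(a[i1:i2]))
            if j2 > j1:
                out.append("<ins>%s</ins>" % " ".join(b[j1:j2]))
    return " ".join(out)


result = word_diff(
    "Leave some solvent in the bowl.",
    "Leave some solvent in the bowl and heat.",
)
print(result)
# Leave some solvent in the <del>bowl.</del> <ins>bowl and heat.</ins>
```

Under this tokenization the insertion covers "bowl and heat." rather than "and heat", matching the report above. Splitting punctuation into separate tokens would narrow the marked span, at the cost of a more complex tokenizer.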
participants (2)
- D
- Stefan Behnel