New subject: [lxml-dev] Difference between xhtml etrees

June 16, 2009

Hi,

please CC the list on replies.

D wrote:
...
2009/6/11 Stefan Behnel:
...
D wrote:
...
I have two xhtml documents which I would like to compare. They are
available as etrees.
Ideally I would like to have a resulting tree, where the appropriate
changes are marked with ins and del tags. I don't need anything fancy
like a detection of moves.
I had a look at lxml.html.diff
http://codespeak.net/lxml/lxmlhtml.html#html-diff
but it operates on html strings only, and not on my parsed tree.
Did you try passing the root elements of the trees?
Passing the root objects was a good idea, it generates the
difference the way I want it. I just don't manage to get the data back
to xhtml. Maybe you could have a look.
Here is my code:
def expandFiles(filename):
    """open the file named filename, return an etree"""
    document = "".join(open(filename).readlines())
    px = lxml.etree.XMLParser(load_dtd=True, no_network=False)
    px.feed(document)
    rx = px.close()
    docx = lxml.etree.ElementTree(rx)
    return docx
Note that "load_dtd" does not imply validation, just that a DTD will be
loaded if referenced.
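
To illustrate the difference, here is a minimal sketch. The inline DTD and the
sample document are made up for this example and are not from the files
discussed in this thread; "dtd_validation" is the separate parser option that
switches validation on.

```python
from lxml import etree

# Hypothetical document: the internal DTD allows only <a/> inside <root>,
# but the content uses <b/>, so the document is well-formed yet invalid.
DOC = b"""<!DOCTYPE root [
  <!ELEMENT root (a)>
  <!ELEMENT a EMPTY>
]>
<root><b/></root>"""

# load_dtd=True loads the DTD but does not check the document against it.
loading_parser = etree.XMLParser(load_dtd=True)
root = etree.fromstring(DOC, loading_parser)

# dtd_validation=True makes the parser reject the invalid document.
validating_parser = etree.XMLParser(dtd_validation=True)
try:
    etree.fromstring(DOC, validating_parser)
    rejected = False
except etree.XMLSyntaxError:
    rejected = True
```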

Also, it is a *lot* more efficient to do this:

  parser = lxml.etree.XMLParser(load_dtd=True, no_network=False)

  def expandFiles(filename):
      """open the file named filename, return an etree"""
      return lxml.etree.parse(filename, parser)

... and I'd actually rename the function (or drop it completely).
...
r1=expandFiles(r"1.xhtml")
r2=expandFiles(r"2.xhtml")
diff = lxml.html.diff.htmldiff(r1.getroot(),r2.getroot())
# diff is now an html fragment, parse it
pdiff = lxml.html.document_fromstring(diff)
lxml.html.html_to_xhtml(pdiff)
pe = lxml.etree.ElementTree(pdiff)
So far, so good.
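
For reference, the steps above can be sketched end to end as follows. This is
only a sketch under assumptions: the two inline documents stand in for 1.xhtml
and 2.xhtml, and serialising with method="xml" is one way to get namespaced
markup back out, not necessarily the fix for the problem reported below.

```python
import lxml.etree
import lxml.html
import lxml.html.diff

# Hypothetical stand-ins for the two input files.
OLD = "<html><body><p>Here is some text.</p></body></html>"
NEW = "<html><body><p>Here is a lot of text.</p></body></html>"

old_root = lxml.html.document_fromstring(OLD)
new_root = lxml.html.document_fromstring(NEW)

# htmldiff() accepts parsed elements as well as strings and returns
# an HTML string with <ins> and <del> markup.
diff = lxml.html.diff.htmldiff(old_root, new_root)

# Parse the diff again and move every element into the XHTML namespace.
pdiff = lxml.html.document_fromstring(diff)
lxml.html.html_to_xhtml(pdiff)

# Serialise as XML so that the namespace is written out as a declaration,
# then parse the result again to check that it is well-formed.
xhtml_bytes = lxml.etree.tostring(pdiff, method="xml")
reparsed = lxml.etree.fromstring(xhtml_bytes)
```
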
...
# this gives me an xhtml file that is parsed without errors by firefox,
# but does not contain any markup
# it looks like this in firefox: {http://www.w3.org/1999/xhtml}meta>
# Resist SPR3012 Preparation{http://www.w3.org
Not sure how this can happen. I'll give it a try later today.
...
# in addition, all character entities appear in the form &gt; and not
# like they should: &#62;
Would you have a 'real' example here?
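
As a small made-up illustration (not taken from the documents in question): to
an XML parser, a numeric reference such as &#62; and the named form &gt; denote
the same character, and the serialiser picks its own spelling on output.

```python
from lxml import etree

# Hypothetical input that spells ">" as a numeric character reference.
root = etree.fromstring(b"<p>a &#62; b</p>")

# After parsing, the text holds the plain character.
text = root.text

# On output, the serialiser escapes it using the named form.
serialized = etree.tostring(root)
```
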
...
I don't manage to transform pdiff to the same form r1 and r2 are in.
I am sure this is due to a basic misunderstanding of lxml, maybe you
directly see what I am doing wrong?
Not directly, no. Maybe others have an idea?

Stefan
