Re: [lxml-dev] Building an ESI tag with lxml

29 Jan 2010


      Stefan Behnel wrote:
...
Martin Aspeli, 29.01.2010 10:10:
...
here's a pseudo-doctest that illustrates the problem:
First, we create a simple document. We use the HTML parser here, because
we don't necessarily trust the input being 100% valid XHTML, even though
the doctype says so.
I think that's the main problem. If you parse XHTML using the HTML parser,
you lose information due to the fact that namespaces are not well-defined
for HTML. I'd *always* try with the XML parser first.
This code is being used in a post-processing step for output from Plone. 
Performance is important, so trial-and-error like this is probably 
undesirable. And even then, this would need to work for documents parsed 
with the HTML parser. The output being transformed could include 
not-quite-well-formed XHTML from content-managed pages. That's the 
attraction of lxml in the first place - it can deal with somewhat-crap 
output. ;)
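
For illustration, the "XML parser first" suggestion could look roughly like
the sketch below. The helper name parse_xml_first is hypothetical and not
part of lxml, and the sketch assumes the input is a str without an XML
encoding declaration. On input that is not well-formed the document gets
parsed twice, which is the trial-and-error cost mentioned above.

```python
from lxml import etree, html


def parse_xml_first(text):
    """Hypothetical helper: strict XML parse first, lenient HTML parse as fallback."""
    try:
        # Namespace-aware parse; raises if the input is not well-formed XML.
        return etree.fromstring(text), "xml"
    except etree.XMLSyntaxError:
        # Tolerant parse for sloppy markup; namespaces are not handled as in XML.
        return html.fromstring(text), "html"


well_formed = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>'
sloppy = "<html><body><p>unclosed<br></body></html>"

for sample in (well_formed, sloppy):
    root, parser_used = parse_xml_first(sample)
    print(parser_used, root.tag)
```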
...
However, according to your last comment, it seems you have tried the XML
parser already...
I just re-confirmed it. If the whole thing is parsed with 
etree.fromstring (and lxml.html is not used anywhere) it still doesn't 
close.
...
...
...
...
...
from lxml import etree, html
doc = """\
...<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...<html xmlns:esi="http://www.edge-delivery.org/esi/1.0"
xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
...<body>
...<p><img class="mceItem mceTile" src="foo.png"
alt="./target.html" /></p>
...</body>
...</html>
... """
inputTree = html.fromstring(doc)
We are going to replace the <img /> tag with an <esi:include /> tag. We
find it via an XPath:
...
...
...
placeholderXPath = etree.XPath(
    "//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")
Perfect use case for lxml.cssselect. :)
Well, I got the XPath from css2xpath.appspot.com which uses the same 
algorithm I think.
...
...
...
...
...
matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*.
Cool, thanks.
...
Personally, I'd love to have it return an iterable, but libxml2 doesn't
easily give you that. IIRC, there's some limited support for this (it works
for certain patterns), but that would need some serious wrapping effort
with non-trivial memory management.
Changing it in a future release may be risky if it returns a list now.
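
A small sketch of what "XPath returns a list" means in practice. The sample
markup and the variable names are made up for illustration; only etree.XPath
and the expression come from the thread.

```python
from lxml import etree

# Same class-matching expression as in the thread, compiled once for reuse.
find_tiles = etree.XPath(
    "//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")

# Made-up sample input with one matching and one non-matching image.
root = etree.fromstring(
    '<body><p><img class="mceItem mceTile" alt="a"/></p>'
    '<p><img class="other" alt="b"/></p></body>')

matched = find_tiles(root)  # already a list, so list(...) is not needed

# Guard against an empty result before indexing.
first = matched[0] if matched else None
print(len(matched), first.get("alt") if first is not None else None)
```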
...
...
...
...
...
matchedNode = matched[0]
We then create the ESI node. At this point, it's nice and self-closing.
Note that we use the etree.tostring() method, since we want XHTML output.
...
...
...
nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
esiNode.set('src', matchedNode.get('alt'))
...
...
...
print etree.tostring(esiNode)
<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0"
src="./target.html"/>
Ok so far.
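
For anyone re-running this step on Python 3, a minimal sketch of the
stand-alone element follows. The constant and variable names are
illustrative, and the src value is hard-coded instead of being read from the
matched node. Note that etree.tostring() returns bytes there unless
encoding="unicode" is passed.

```python
from lxml import etree

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

# Clark notation ("{namespace}localname") names the element; nsmap picks the prefix.
esi_node = etree.Element("{%s}include" % ESI_NS, nsmap={"esi": ESI_NS})
esi_node.set("src", "./target.html")

# The element is not attached to any parsed document yet.
standalone = etree.tostring(esi_node)
print(standalone)
```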
...
Now we connect it to the parent:
...
...
...
matchedNode.getparent().replace(matchedNode, esiNode)
At this point it's all over:
...
...
...
print etree.tostring(esiNode)
<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0"
src="./target.html"></esi:include>
Ah, this is because it is now part of an HTML document, so the HTML
semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?)
that provided 'better' support for HTML serialisation by taking into
account the document context. Looks like this strikes here.
I just looked it up in the sources, recent 2.7.x versions of libxml2 have
added a way to override this behaviour again, but lxml doesn't do this yet.
IIRC, it wasn't trivial at the time - I think it required going through a
different serialisation function or something.
Makes sense, sorta, but I would've thought this was a matter for 
serialisation, not parsing? Even when parsing as HTML, I'm using 
etree.tostring() to serialise.
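
A stopgap that is not proposed anywhere in this thread would be to
post-process the serialised text and collapse empty esi:* elements back into
self-closing form. The sketch below is only an assumption about what such a
workaround might look like: the function name is hypothetical, it works on
str (for example the result of etree.tostring(tree, encoding="unicode")),
and it only handles the "esi" prefix.

```python
import re

# Matches <esi:NAME attrs></esi:NAME> with nothing between the two tags.
_EMPTY_ESI = re.compile(r"<(esi:[A-Za-z][\w.-]*)((?:\s[^<>]*?)?)\s*></\1\s*>")


def collapse_empty_esi(markup):
    """Hypothetical helper: rewrite empty esi:* elements as self-closing tags."""
    return _EMPTY_ESI.sub(r"<\1\2/>", markup)


serialised = ('<p><esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" '
              'src="./target.html"></esi:include></p>')
print(collapse_empty_esi(serialised))
```

Elements with content are left alone, since the pattern requires the closing
tag to follow the opening tag directly.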
...
...
And sure enough:
...
...
...
print etree.tostring(inputTree)
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:esi="http://www.edge-delivery.org/esi/1.0"
xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
xml:lang="en"><body>
  <p><esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0"
src="./target.html"></esi:include></p>
</body></html>
It's also interesting to note that this suddenly has the xmlns
declaration twice.
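
To check both observations on a current installation, the steps above can be
condensed into one Python 3 script, sketched below. The variable names are
illustrative. What "after" and the full document look like depends on the
lxml and libxml2 versions in use, so no expected output is given for them.

```python
from lxml import etree, html

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

doc = """\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:esi="http://www.edge-delivery.org/esi/1.0"
 xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<body>
<p><img class="mceItem mceTile" src="foo.png"
 alt="./target.html" /></p>
</body>
</html>
"""

input_tree = html.fromstring(doc)
img = input_tree.xpath(
    "//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")[0]

esi_node = etree.Element("{%s}include" % ESI_NS, nsmap={"esi": ESI_NS})
esi_node.set("src", img.get("alt"))

before = etree.tostring(esi_node)        # element on its own
img.getparent().replace(img, esi_node)   # attach it to the parsed document
after = etree.tostring(esi_node)         # same element, now inside the tree

print(before)
print(after)
print(etree.tostring(input_tree))
```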
... namespaces in HTML ...
Yeah, yeah. XHTML. ;-)
...
...
Any ideas would be highly welcome. I've tried to play with different
ways to construct the ESI tag, and different placements for the
placeholder (e.g. outside the <p> tag), but it's all the same. It also
doesn't seem to make any difference whether I parse with
etree.fromstring() or html.fromstring() (in the real code I'm actually
feeding an HTMLParser).
It *should* make a difference, but from your example I can see that it
doesn't. No idea why. I'll have a closer look later.
I appreciate it!

Martin

-- 
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book