Stefan Behnel wrote:
Martin Aspeli, 29.01.2010 10:10:
here's a pseudo-doctest that illustrates the problem:
First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.
I think that's the main problem. If you parse XHTML using the HTML parser, you lose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first.
This code is being used in a post-processing step for output from Plone. Performance is important, so trial-and-error like this is probably undesirable. And even then, this would need to work for documents parsed with the HTML parser. The output being transformed could include not-quite-well-formed XHTML from content-managed pages. That's the attraction of lxml in the first place - it can deal with somewhat-crap output. ;)
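A minimal sketch of the "XML parser first, HTML parser as fallback" idea discussed above, written for Python 3. The helper name `parse_markup` is hypothetical and not part of lxml; the extra parse attempt on malformed input is exactly the cost the performance concern refers to.

```python
from lxml import etree, html


def parse_markup(text):
    """Try the strict XML parser first; fall back to the lenient HTML parser.

    Returns a (root_element, parser_name) tuple. The helper itself is an
    illustration only, not an lxml API.
    """
    try:
        return etree.fromstring(text), "xml"
    except etree.XMLSyntaxError:
        # Not well-formed XML: let the HTML parser recover what it can.
        return html.fromstring(text), "html"


well_formed = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>'
malformed = "<html><body><p>unclosed<br></body></html>"

xml_root, xml_kind = parse_markup(well_formed)
html_root, html_kind = parse_markup(malformed)
print(xml_kind, xml_root.tag)
print(html_kind, html_root.tag)
```

With the XML parser the root tag keeps its namespace in Clark notation; with the HTML parser the namespace information is not preserved in the tag name.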
However, according to your last comment, it seems you have tried the XML parser already...
I just re-confirmed it. If the whole thing is parsed with etree.fromstring (and lxml.html is not used anywhere) it still doesn't close.
>>> from lxml import etree, html
>>> doc = """\
... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
... <body>
... <p><img class="mceItem mceTile" src="foo.png" alt="./target.html" /></p>
... </body>
... </html>
... """
>>> inputTree = html.fromstring(doc)
We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:
>>> placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")
Perfect use case for lxml.cssselect. :)
Well, I got the XPath from css2xpath.appspot.com which uses the same algorithm I think.
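To illustrate what the XPath above actually selects, here is a small sketch (Python 3). The sample markup is made up for the illustration and is not taken from the thread. Padding the normalized `@class` value with spaces means ` mceTile ` only matches a whole class token, which is the same effect as the CSS selector `img.mceTile`.

```python
from lxml import etree

# The expression quoted in the thread: it matches <img> elements whose
# class attribute contains "mceTile" as a complete, space-separated token.
find_tiles = etree.XPath(
    "//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]"
)

# Hypothetical sample markup, for illustration only.
sample = etree.fromstring(
    "<div>"
    '<img class="mceItem mceTile" src="a.png"/>'
    '<img class="mceTileWide" src="b.png"/>'
    '<img class="  mceTile  " src="c.png"/>'
    "</div>"
)

matches = [img.get("src") for img in find_tiles(sample)]
print(matches)  # ['a.png', 'c.png']
```

The second image is skipped because `mceTileWide` merely starts with the class name; a plain `contains(@class, 'mceTile')` would have matched it.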
>>> matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*.
Cool, thanks.
Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management.
Changing it in a future release may be risky if it returns a list now.
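A short sketch (Python 3) of the return types involved, assuming a small made-up document: a node-set expression comes back as a plain Python list, so the extra `list(...)` call is not needed, while XPath number and string expressions come back as a float and a string.

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
root = etree.fromstring("<body><p>one</p><p>two</p></body>")

paragraphs = etree.XPath("//p")(root)          # node-set -> list of elements
no_match = etree.XPath("//img")(root)          # no hits  -> empty list
count = etree.XPath("count(//p)")(root)        # number   -> float
first_text = etree.XPath("string(//p)")(root)  # string   -> str subclass

print(type(paragraphs).__name__, len(paragraphs))
print(no_match)
print(count)
print(first_text)
```

Because the result is already a list, indexing (`paragraphs[0]`) and `len()` work directly on it.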
>>> matchedNode = matched[0]
We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.
>>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
>>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>>> esiNode.set('src', matchedNode.get('alt'))

>>> print etree.tostring(esiNode)
<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>
Ok so far.
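As a side note on the element construction step, the same node can be built with `etree.QName` instead of formatting the `{namespace}name` string by hand. This is only an alternative spelling (Python 3), not something proposed in the thread; the `src` value is hard-coded here so the snippet runs on its own.

```python
from lxml import etree

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

# QName assembles the "{namespace}localname" tag for us.
tag = etree.QName(ESI_NS, "include")

# The nsmap only controls which prefix is used when serialising.
esi_node = etree.Element(tag, nsmap={"esi": ESI_NS})
esi_node.set("src", "./target.html")

serialized = etree.tostring(esi_node).decode()
print(esi_node.tag)
print(esi_node.prefix)
print(serialized)
```

As a standalone element, outside any parsed document, it serialises in the self-closing form shown above.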
Now we connect it to the parent:
>>> matchedNode.getparent().replace(matchedNode, esiNode)
At this point it's all over:
>>> print etree.tostring(esiNode)
<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>
Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here.
I just looked it up in the sources: recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something.
Makes sense, sorta, but I would've thought this was a matter for serialisation, not parsing? Even when parsing as HTML, I'm using etree.tostring() to serialise.
And sure enough:
>>> print etree.tostring(inputTree)
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body>
<p><esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include></p>
</body></html>
It's also interesting to note that this suddenly has the xmlns declaration twice.
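For anyone wanting to re-check this on a current setup, the steps above can be condensed into one Python 3 script. This is a sketch only: the helper names `is_self_closing` and `reproduce` are made up for the illustration, and the script reports what it observes instead of assuming the result, since the serialisation after insertion may depend on the libxml2 and lxml versions in use.

```python
from lxml import etree, html

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

DOC = """\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<body>
<p><img class="mceItem mceTile" src="foo.png" alt="./target.html" /></p>
</body>
</html>
"""


def is_self_closing(element):
    """Return True if the element serialises as <tag .../> (hypothetical helper)."""
    data = etree.tostring(element, with_tail=False)
    return data.rstrip().endswith(b"/>")


def reproduce(doc=DOC):
    """Replay the steps from the thread; return (before, after) flags."""
    tree = html.fromstring(doc)
    find = etree.XPath(
        "//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]"
    )
    img = find(tree)[0]

    esi = etree.Element("{%s}include" % ESI_NS, nsmap={"esi": ESI_NS})
    esi.set("src", img.get("alt"))

    before = is_self_closing(esi)      # standalone element
    img.getparent().replace(img, esi)  # now part of the parsed document
    after = is_self_closing(esi)       # may differ, as reported above
    return before, after


before, after = reproduce()
print("self-closing before insertion:", before)
print("self-closing after insertion: ", after)
```

Swapping `html.fromstring` for `etree.fromstring` in `reproduce` would require a namespace-aware XPath, because the XML parser puts `img` in the XHTML namespace.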
... namespaces in HTML ...
Yeah, yeah. XHTML. ;-)
Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the <p> tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser).
It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later.
I appreciate it!

Martin

--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book