[lxml-dev] Building an ESI tag with lxml
Hi,

I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document. The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

    <esi:include src="http://..." />

The code I used looks like this:

    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    root.nsmap.update(nsmap)  # root is the <html /> element
    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', url)
    placeholder.addnext(esiNode)  # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.
- The <esi:include /> tag is not self-closing when rendered with html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Thus:

    <esi:include src="http://..." xmlns:esi="http://www.edge-delivery.org/esi/1.0"></esi:include>

What can I do to push the namespace declaration up to the top node ('root') and make the tag self-closing?

Martin
--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
Hi Martin, Martin Aspeli, 28.01.2010 14:52:
I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.
First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).
Updating the nsmap property has no effect. I've updated the docstring appropriately.
As I said, namespaces in HTML...

To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g.

    new_root = etree.Element('html', nsmap=nsmap)
    new_root[:] = root[:]  # or copy.deepcopy(root)[:]

I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though.

I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting).
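A minimal sketch of the root-swap approach described above, assuming a small stand-in page. The sample markup and the `src` URL are made up for illustration; only `etree.Element(..., nsmap=...)`, slice assignment and `etree.tostring()` are used.

```python
from lxml import etree, html

ESI_NS = "http://www.edge-delivery.org/esi/1.0"
nsmap = {"esi": ESI_NS}

# Stand-in for the real page (hypothetical markup), parsed with the HTML parser.
root = html.fromstring("<html lang='en'><body><p>hello</p></body></html>")

# New root that carries the prefix declaration; copy the old root's attributes.
new_root = etree.Element("html", nsmap=nsmap)
new_root.attrib.update(root.attrib.items())
new_root[:] = root[:]  # moves the children; the old root is left empty

# An element created below the new root can reuse the inherited declaration.
body = new_root.find("body")
esi_node = etree.SubElement(body, "{%s}include" % ESI_NS)
esi_node.set("src", "http://example.com/tile")  # hypothetical URL

out = etree.tostring(new_root)
print(out.decode())
```

Note that the new root lives in a fresh document, so a DOCTYPE attached to the original tree is not carried over.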
Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

1) fix Varnish
2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
3) close the tag through byte string substitution *after* serialisation

If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

Stefan
Stefan Behnel wrote:
Ok, thanks.
How (in)efficient is this?
Ok.
Agree. I'd be happy to pass something to the serialiser about namespaces.
Is there no way to make it aware of it? Seems this should be configurable (or monkey-patch-able) somewhere...
1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
It probably would. What's that look like?
3) close the tag through byte string substitution *after* serialisation
Yipes. If I do that, I'll just do the entire tag through such a substitution to be honest and not use lxml at all.
That sounds pretty bad for performance. :(

Martin
2010/1/29 Martin Aspeli <optilude+lists@gmail.com>:
FWIW, the only way I've found to get good xhtml output from html parsing is with an xsl like the following...

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
          media-type="text/html" encoding="utf-8"
          doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
          doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>

This triggers the xml output mode to produce valid xhtml. If et.docinfo.public_id and et.docinfo.system_url could be set somehow then I'm sure it would work without the transform. (The relevant code is at the top of libxml2/xmlsave.c - basically so long as you have one of the xhtml public ids or system urls you'll get the right output).

Laurence
Martin Aspeli, 29.01.2010 01:00:
It's about linear in the number of elements in your tree, plus the number of direct children for the move operation. Maybe not the most efficient thing to do, but usually pretty fast. Certainly a lot faster than you could ever get your own hand-rolled serialiser in Python, for instance. You can compare the absolute numbers on this page: http://codespeak.net/lxml/performance.html#parsing-and-serialising http://codespeak.net/lxml/performance.html#merging-different-sources
I know, that's how ET's serialiser works. Can't work for lxml, though. The serialiser in libxml2 can only write out what is there. It could work for a doctype, though. Support for passing that verbatim into the serialiser would be a nice feature.
Monkey-patching isn't all that easy in libxml2, though... Not that it can't work for C code, it's just not that portable - nor particularly safe... ;-)
Depends on your input. If it's HTML, there's an html_to_xhtml() function in lxml.html that can do the conversion for you. And the serialiser can always be chosen using the 'method' argument (that's basically the difference between lxml.etree.tostring() and lxml.html.tostring()).
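As a rough illustration of the two points above, the sketch below serialises one tree with the 'html' method, then moves its tags into the XHTML namespace with `lxml.html.html_to_xhtml()` and serialises it with the 'xml' method. The sample markup is an assumption made up for the example.

```python
from lxml import etree, html

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical input page, parsed with the HTML parser.
doc = html.fromstring("<html><body><p>one<br>two</p></body></html>")

# Same tree, HTML output rules: 'method' picks the serialiser.
as_html = etree.tostring(doc, method="html")

# Move every tag into the XHTML namespace (modifies the tree in place),
# then serialise with the XML rules.
html.html_to_xhtml(doc)
as_xml = etree.tostring(doc, method="xml")

print(as_html.decode())
print(as_xml.decode())
```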
It's rather safe, though. The exact string to replace would be "></esi:include>", which won't appear that easily in your content. Doing the parsing and replacing manually on the input is a lot more fragile.
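The substitution suggested here can be sketched in a few lines of plain Python. The helper name `self_close_esi` and the sample bytes are hypothetical; the only thing taken from the thread is the exact string to replace.

```python
def self_close_esi(serialised: bytes) -> bytes:
    """Rewrite '<esi:include ...></esi:include>' as '<esi:include ... />'."""
    return serialised.replace(b"></esi:include>", b" />")


# Hypothetical serialiser output containing an open/close tag pair.
page = b'<p>x<esi:include src="http://example.com/tile"></esi:include></p>'
fixed = self_close_esi(page)
print(fixed.decode())
```

This runs on the serialised bytes only, so the tree manipulation itself still happens in lxml.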
Don't underestimate the speed of a tool that was made for the job. Stefan
Stefan Behnel wrote:
I've just tried this with serialization using lxml.etree.tostring instead of lxml.html.tostring. Unfortunately, I'm still getting an open-close tag pair instead of a self-closed tag. Any idea what I may be doing wrong?

    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', tileHref)
    esiNode.text = None
    tilePlaceholderNode.addnext(esiNode)
    toRemove.append(tilePlaceholderNode)

Output:

    <p class="discreet">...<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="http://..."></esi:include></p>

Martin
Martin Aspeli, 29.01.2010 03:26:
Setting the .text to None is redundant as this is a new element. Otherwise, doing that should be enough to erase all text content.
tilePlaceholderNode.addnext(esiNode) toRemove.append(tilePlaceholderNode)
I guess I would have used parent.replace(old,new) here.
Works for me. Could you send me a complete code snippet where it doesn't work for you? Stefan
Stefan Behnel wrote:
It was an act of desperation. :)
I didn't do this for two reasons:

1. In some cases (though not here) I'm replacing one placeholder with multiple nodes.
2. This code appears within a loop that's manipulating the tree for each of multiple elements matched with an XPath expression. I thought deleting a node mid-iteration would cause problems.
How much work are you willing to put in? :-) I can give you a Plone buildout that will set up everything and talk you through the steps to reproduce. It's not very hard, it just requires a few steps. I won't bother explaining it if you don't have half an hour to chase it down, though. :)

Martin
Martin Aspeli, 29.01.2010 09:15:
I know. ;)
Another nice feature: support a sequence as replacement. :) Although that requirement is basically satisfied with slice replacements, so I guess that won't make it in for now.
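A slice replacement of the kind mentioned here might look like the sketch below. The element names are invented for the example, and it assumes the placeholder has no tail text worth keeping, since the tail is removed along with the element.

```python
from lxml import etree

# Hypothetical tree: <placeholder/> is to be swapped for two new elements.
parent = etree.fromstring("<div><a/><placeholder/><b/></div>")
placeholder = parent.find("placeholder")

replacements = [etree.Element("x"), etree.Element("y")]

# Replace the one-element slice holding the placeholder with the new nodes.
i = parent.index(placeholder)
parent[i:i + 1] = replacements

print([child.tag for child in parent])
```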
XPath returns a list of nodes, so you are no longer iterating over the tree structure in this case. Ripping stuff out should be absolutely safe here.
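To illustrate why this is safe, here is a small sketch that evaluates an XPath once and then replaces every match inside the loop. The markup, the `placeholder` class and the `span` replacement are assumptions made for the example.

```python
from lxml import etree, html

# Hypothetical fragment with two placeholder images.
doc = html.fromstring(
    '<div><p><img class="placeholder"></p>'
    '<p><img class="placeholder"></p></div>'
)
find_placeholders = etree.XPath("//img[@class='placeholder']")

# The result is a plain Python list, built before the loop starts.
matched = find_placeholders(doc)

for node in matched:
    marker = etree.Element("span")  # hypothetical replacement element
    node.getparent().replace(node, marker)

print(html.tostring(doc).decode())
```

The loop walks the list, not the tree, so replacing or removing the matched nodes does not disturb the iteration.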
LOL! :)

"You know, I have this huge pile of code here, but it's really easy to set up and then all you have to do is a tiny bit of debugging. It's easy! It really is! I can't believe you don't want to feel the fun to try it!"

Honestly, could you try to come up with a little example that injects namespaced XML content into a small HTML page, and that shows that the XML serialiser behaves unexpectedly? Shouldn't be hard to write...

Stefan
Stefan Behnel wrote:
Could you elaborate an example?
Cool! Less code.
That's why I asked. :-p
See my other mail. I got a minimal example that's bombing out for me.

Martin
Stefan Behnel wrote:
Works for me. Could you send me a complete code snippet where it doesn't work for you?
Okay, here's a pseudo-doctest that illustrates the problem: First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.
We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:
We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.
Now we connect it to the parent:
matchedNode.getparent().replace(matchedNode, esiNode)
At this point it's all over:
And sure enough:
It's also interesting to note that this suddenly has the xmlns declaration twice.

Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the <p> tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser). As soon as I insert it into the parent tree, the tag stops self-closing.

Martin
Martin Aspeli, 29.01.2010 10:10:
I think that's the main problem. If you parse XHTML using the HTML parser, you lose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first. However, according to your last comment, it seems you have tried the XML parser already...
Perfect use case for lxml.cssselect. :)
matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*. Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management.
Ok so far.
Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here. I just looked it up in the sources, recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something.
... namespaces in HTML ...
It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later. Stefan
Stefan Behnel wrote:
This code is being used in a post-processing step for output from Plone. Performance is important, so trial-and-error like this is probably undesirable. And even then, this would need to work for documents parsed with the HTML parser. The output being transformed could include not-quite-well-formed XHTML from content-managed pages. That's the attraction of lxml in the first place - it can deal with somewhat-crap output. ;)
However, according to your last comment, it seems you have tried the XML parser already...
I just re-confirmed it. If the whole thing is parsed with etree.fromstring (and lxml.html is not used anywhere) it still doesn't close.
Well, I got the XPath from css2xpath.appspot.com which uses the same algorithm I think.
matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*.
Cool, thanks.
Changing it in a future release may be risky if it returns a list now.
Makes sense, sorta, but I would've thought this was a matter for serialisation, not parsing? Even when parsing as HTML, I'm using etree.tostring() to serialise.
Yeah, yeah. XHTML. ;-)
I appreciate it!

Martin
Martin Aspeli, 29.01.2010 11:51:
Obviously. It would rather become a new method on the XPath class, like xpath.iterfind(el).
I read through the libxml2 sources a bit more. It's not confusing HTML at all, it's even smarter than I thought. It looks at the *doctype* of the document that is being serialised and then applies special XHTML formatting rules. :o) http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137 Stefan
Stefan Behnel wrote:
But... XHTML says empty tags can self-close as far as I know. And even then, this is in a different namespace.
My C fu is weak. Any hints in there I'm missing?

Martin
Martin Aspeli, 29.01.2010 14:27:
Sure. I just pointed you to the code that formats the output. Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.
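A sketch of that workaround, assuming the tree was built without a DOCTYPE (for example because it was stripped before parsing). The sample markup and URL are invented, and the DOCTYPE string is the XHTML 1.0 Transitional one quoted elsewhere in this thread.

```python
from lxml import etree

DOCTYPE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)

# Assumption: this document carries no DOCTYPE, so the serialiser has
# nothing that would switch it to its XHTML formatting rules.
root = etree.fromstring(
    b'<html xmlns="http://www.w3.org/1999/xhtml" '
    b'xmlns:esi="http://www.edge-delivery.org/esi/1.0">'
    b'<body><p><esi:include src="http://example.com/tile"/></p></body></html>'
)

serialised = etree.tostring(root)

# Put the DOCTYPE back as plain bytes once serialisation is done.
page = DOCTYPE + b"\n" + serialised
print(page.decode())
```

The trade-off is that every other empty element is then written with the plain XML rules as well, not the XHTML ones.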
My C fu is weak. Any hints in there I'm missing?
This, for example:

http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451

The rule that bites you here is in line 1452. If the element uses a namespace prefix, it will not become self-closing. I have no idea about the reasoning behind such a rule, but if you are interested, I'd go straight to the libxml2 mailing list and ask.

There's also line 1414

http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414

which emits a default namespace declaration for the XHTML namespace regardless of the existing declarations. There is certainly room left for enhancements.

IIRC, the XHTML formatting is rather new, may have been added in the 2.7 line. You'll have a good chance of being heard if you propose some sensible improvements to it.

Stefan
Stefan Behnel wrote:
Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.
That's going to be pretty tricky, but I guess we can try. I wonder what the side-effect of this may be, though. Presumably, the DOCTYPE detection is there for a reason.
Good to know. I'm not sure I know how to formulate the needed changes except by re-stating the problem I'm having here, though. It'd probably help if I understood the purpose of the special formatting better. I naively thought that XHTML = XML and wouldn't need any magic. :)

Martin
Martin Aspeli, 29.01.2010 16:09:
:) I think it's because I complained about one of the early 2.7.x versions breaking lxml's serialisation completely, so Daniel eventually added some "do what I mean" work-around to call the right functions in absence of a specific configuration (which lxml can't pass as the API it uses doesn't allow it ...)

Don't expect everything in libxml2 to be well designed from the ground up. It was grown over years and has become a crucial part of the GNU/GNOME/... infrastructure. It naturally carries quite a bit of backwards compatibility with it, in both API and functionality. It certainly has its edges. Discussing new stuff to move it in the right direction is almost always worth it.
I wouldn't call that naive. Just go and ask. Stefan
This seems to be a limitation of the xml serializer when it detects xhtml :(

    $ xsltproc --version
    Using libxml 20703-SVN3827, libxslt 10124-SVN1494 and libexslt 813
    xsltproc was compiled against libxml 20703, libxslt 10124 and libexslt 813
    libxslt 10124 was compiled against libxml 20703
    libexslt 813 was compiled against libxml 20703

    $ cat in.html
    <html></html>

    $ cat test.xsl
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:esi="http://www.edge-delivery.org/esi/1.0"
        xmlns="http://www.w3.org/1999/xhtml">
      <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
          media-type="text/html" encoding="utf-8"
          doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
          doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
      <xsl:template match="/">
        <html><body><esi:include src="foo"/></body></html>
      </xsl:template>
    </xsl:stylesheet>

    $ xsltproc test.xsl in.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0"><body><esi:include src="foo"></esi:include></body></html>

xsltproc is using the xml parser here. We need xhtml mode or you end up with elements like <br/> (no space) which confuse some browsers.

Laurence
Stefan Behnel wrote:
I tried this (self-closing tag issue notwithstanding), like so:

    root = tree.getroot()
    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    newRoot = etree.Element('html', nsmap=nsmap)
    newRoot.attrib.update(root.attrib.items())
    newRoot[:] = copy.deepcopy(root)[:]
    tree._setroot(newRoot)

Unfortunately, I've now lost the doctype. :( The head of the page looks like:

    <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>

Intriguingly, the <esi:include /> tag now self-closes. :-) However, Firefox is showing an empty page.

Martin
Martin Aspeli, 29.01.2010 04:38:
You can also create the element using the parser:

    newRoot = etree.XML('''
    <!DOCTYPE ...>
    <html xmlns:esi="http://www.edge-delivery.org/esi/1.0"
          xmlns="http://www.w3.org/1999/xhtml" lang="en"/>''')

Sadly, doctype setting isn't currently as easy as it could be...
Intriguingly, the <esi:include /> tag now self-closes. :-) However, Firefox is showing an empty page.
May or may not be due to the missing doctype. Stefan
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Hi Martin, Martin Aspeli, 28.01.2010 14:52:
I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.
First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).
Updating the nsmap property has no effect. I've updated the docstring appropriately.
As I said, namespaces in HTML... To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g. new_root = etree.Element('html', nsmap=nsmap) new_root[:] = root[:] # or copy.deepcopy(root)[:] I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though. I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting).
Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here: 1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) 3) close the tag through byte string substitution *after* serialisation If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy). Stefan
data:image/s3,"s3://crabby-images/aa9ec/aa9ec767ed8c595307427579bb56a63f8f61657e" alt=""
Stefan Behnel wrote:
Ok, thanks.
How (in)efficient is this?
Ok.
Agree. I'd be happy to pass something to the serialiser about namespaces.
Is there no way to make it aware of it? Seems this should be configurable (or monkey-patch-able) somewhere...
1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
It probably would. What's that look like?
3) close the tag through byte string substitution *after* serialisation
Yipes. If I do that, I'll just do the entire tag through such a substitution to be honest and not use lxml at all.
That sounds pretty bad for performance. :( Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
data:image/s3,"s3://crabby-images/ab69b/ab69beddc1396be52e2c3fc5bdf95de6cc0e575c" alt=""
2010/1/29 Martin Aspeli <optilude+lists@gmail.com>:
FWIW, the only way I've found to get good xhtml output from html parsing is with an xsl like the following... <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" indent="no" omit-xml-declaration="yes" media-type="text/html" encoding="utf-8" doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/> <xsl:template match="/"> <xsl:copy-of select="."/> </xsl:template> </xsl:stylesheet> This triggers the xml output mode to produce valid xhtml. If et.docinfo.public_id and et.docinfo.system_url could be set somehow then I'm sure it would work without the transform. (The relevant code is at the top of libxml2/xmlsave.c - basically so long as you have one of the xhtml public ids or system urls you'll get the right output). Laurence
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Martin Aspeli, 29.01.2010 01:00:
It's about linear in the number of elements in your tree, plus the number of direct children for the move operation. Maybe not the most efficient thing to do, but usually pretty fast. Certainly a lot faster than you could ever get your own hand-rolled serialiser in Python, for instance. You can compare the absolute numbers on this page: http://codespeak.net/lxml/performance.html#parsing-and-serialising http://codespeak.net/lxml/performance.html#merging-different-sources
I know, that's how ET's serialiser works. Can't work for lxml, though. The serialiser in libxml2 can only write out what is there. It could work for a doctype, though. Support for passing that verbatimly into the serialiser would be a nice feature.
Monkey-patching isn't all that easy in libxml2, though... Not that it can't work for C code, it's just not that portable - nor particularly safe... ;-)
Depends on your input. If it's HTML, there's an html_to_xhtml() function in lxml.html that can do the conversion for you. And the serialiser can always be chosen using the 'method' argument (that's basically the difference between lxml.etree.tostring() and lxml.html.tostring()).
It's rather safe, though. The exact string to replace would be "></esi:include>", which won't appear that easily in your content. Doing the parsing and replacing manually on the input is a lot more fragile.
Don't underestimate the speed of a tool that was made for the job. Stefan
data:image/s3,"s3://crabby-images/aa9ec/aa9ec767ed8c595307427579bb56a63f8f61657e" alt=""
Stefan Behnel wrote:
I've just tried this with serialization using lxml.etree.tostring instead of lxml.html.tostring. Unfortunately, I'm still getting an open-close tag pair instead of a self-closed tag. Any idea what I may be doing wrong? esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', tileHref) esiNode.text = None tilePlaceholderNode.addnext(esiNode) toRemove.append(tilePlaceholderNode) Output: <p class="discreet">...<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="http://..."></esi:include></p> Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Martin Aspeli, 29.01.2010 03:26:
Setting the .text to None is redundant as this is a new element. Otherwise, doing that should be enough to erase all text content.
tilePlaceholderNode.addnext(esiNode) toRemove.append(tilePlaceholderNode)
I guess I would have used parent.replace(old,new) here.
Works for me. Could you send me a complete code snippet where it doesn't work for you? Stefan
data:image/s3,"s3://crabby-images/aa9ec/aa9ec767ed8c595307427579bb56a63f8f61657e" alt=""
Stefan Behnel wrote:
It was an act of desperation. :)
I didn't do this for two reasons: 1. In some cases (though not here) I'm replacing one placeholder with multiple nodes. 2. This code appears within a loop that's manipulating the tree for each of multiple elements matched with an XPath expression. I thought deleting a node mid-iteration would cause problems.
How much work are you willing to put in? :-) I can give you a Plone buildout that will set up everything and talk you through the steps to reproduce. It's not very hard, it just requires a few steps. I won't bother explaining it if you don't have half an hour to chase it down, though. :) Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Martin Aspeli, 29.01.2010 09:15:
I know. ;)
Another nice feature: support a sequence as replacement. :) Although that requirement is basically satisfied with slice replacements, so I guess that won't make it in for now.
XPath returns a list of nodes, so you are no longer iterating over the tree structure in this case. Ripping stuff out should be absolutely safe here.
LOL! :) "You know, I have this huge pile of code here, but it's really easy to set up and then all you have to do is a tiny bit of debugging. It's easy! It really is! I can't believe you don't want to feel the fun to try it!" Honestly, could you try to come up with a little example that injects namespaced XML content into a small HTML page, and that shows that the XML serialiser behaves unexpected? Shouldn't be hard to write... Stefan
data:image/s3,"s3://crabby-images/aa9ec/aa9ec767ed8c595307427579bb56a63f8f61657e" alt=""
Stefan Behnel wrote:
Could you elaborate an example?
Cool! Less code.
That's why I asked. :-p
See my other mail. I got a minimal example that's bombing out for me. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
data:image/s3,"s3://crabby-images/aa9ec/aa9ec767ed8c595307427579bb56a63f8f61657e" alt=""
Stefan Behnel wrote:
Works for me. Could you send me a complete code snippet where it doesn't work for you?
Okay, here's a pseudo-doctest that illustrates the problem: First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.
We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:
We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.
Now we connect it to the parent:
matchedNode.getparent().replace(matchedNode, esiNode)
At this point it's all over:
And sure enough:
It's also interesting to note that this suddenly has the xmlns declaration twice. Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the <p> tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser). As soon as I insert it into the parent tree, the tag stops self closing. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Martin Aspeli, 29.01.2010 10:10:
I think that's the main problem. If you parse XHTML using the HTML parser, you loose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first. However, according to your last comment, it seems you have tried the XML parser already...
Perfect use case for lxml.cssselect. :)
matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*. Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management.
Ok so far.
Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here. I just looked it up in the sources, recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something.
... namespaces in HTML ...
It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later. Stefan
Stefan Behnel wrote:
This code is being used in a post-processing step for output from Plone. Performance is important, so trial-and-error like this is probably undesirable. And even then, this would need to work for documents parsed with the HTML parser. The output being transformed could include not-quite-well-formed XHTML from content-managed pages. That's the attraction of lxml in the first place - it can deal with somewhat-crap output. ;)
However, according to your last comment, it seems you have tried the XML parser already...
I just re-confirmed it. If the whole thing is parsed with etree.fromstring (and lxml.html is not used anywhere) it still doesn't close.
Well, I got the XPath from css2xpath.appspot.com which uses the same algorithm I think.
matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*.
Cool, thanks.
Changing it in a future release may be risky if it returns a list now.
Makes sense, sorta, but I would've thought this was a matter for serialisation, not parsing? Even when parsing as HTML, I'm using etree.tostring() to serialise.
Yeah, yeah. XHTML. ;-)
I appreciate it! Martin
Martin Aspeli, 29.01.2010 11:51:
Obviously. It would rather become a new method on the XPath class, like xpath.iterfind(el).
I read through the libxml2 sources a bit more. It's not confusing HTML at all, it's even smarter than I thought. It looks at the *doctype* of the document that is being serialised and then applies special XHTML formatting rules. :o) http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137 Stefan
Stefan Behnel wrote:
But... XHTML says empty tags can self-close as far as I know. And even then, this is in a different namespace.
My C fu is weak. Any hints in there I'm missing? Martin
Martin Aspeli, 29.01.2010 14:27:
Sure. I just pointed you to the code that formats the output. Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.
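One possible reading of this suggestion, as a sketch only: remember the DOCTYPE string, serialise from a document that has no DOCTYPE attached, and prepend the string afterwards. The input page, the `example.invalid` URL and the helper name `serialise_without_doctype` are assumptions made for illustration. The sketch relies on the observation reported in this thread that a root created with `etree.Element()` no longer carries the doctype.

```python
from lxml import etree

ESI_NS = "http://www.edge-delivery.org/esi/1.0"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical input page; the default parser does not fetch the DTD.
PAGE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en">'
    '<body><p id="placeholder"/></body></html>'
)


def serialise_without_doctype(tree):
    """Serialise from a doctype-free document, then prepend the doctype.

    Note: this moves the children out of the original root.
    """
    root = tree.getroot()
    doctype = tree.docinfo.doctype  # the '<!DOCTYPE ...>' string, or ''

    # A fresh root lives in a fresh document without a DOCTYPE.
    nsmap = dict(root.nsmap)
    nsmap["esi"] = ESI_NS
    new_root = etree.Element(root.tag, attrib=dict(root.attrib), nsmap=nsmap)
    new_root.text = root.text
    new_root[:] = root[:]  # moves the child elements over

    markup = etree.tostring(new_root, encoding="unicode")
    return doctype + "\n" + markup if doctype else markup


tree = etree.fromstring(PAGE).getroottree()

# Swap the placeholder for an ESI include element.
placeholder = tree.getroot().find(".//{%s}p" % XHTML_NS)
esi_node = etree.Element("{%s}include" % ESI_NS, nsmap={"esi": ESI_NS})
esi_node.set("src", "http://example.invalid/fragment")
placeholder.addnext(esi_node)
placeholder.getparent().remove(placeholder)

output = serialise_without_doctype(tree)
print(output)
```

Whether skipping the XHTML-specific formatting is acceptable for the rest of the page (for example how `<br/>` is written) would still need checking against the target browsers.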
My C fu is weak. Any hints in there I'm missing?
This, for example: http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451 The rule that bites you here is in line 1452. If the element uses a namespace prefix, it will not become self-closing. I have no idea about the reasoning behind such a rule, but if you are interested, I'd go straight to the libxml2 mailing list and ask. There's also line 1414 http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414 which emits a default namespace declaration for the XHTML namespace regardless of the existing declarations. Certainly space left for enhancements. IIRC, the XHTML formatting is rather new, may have been added in the 2.7 line. You'll have a good chance of being heard if you propose some sensible improvements to it. Stefan
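To see whether the installed libxml2 applies this rule, the same markup can be serialised once with and once without an XHTML doctype. This is a small reproduction sketch, not a statement about what any particular version prints: the markup is made up, and the result for the doctype case depends on the libxml2 version that lxml is built against.

```python
from lxml import etree

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)
# Hypothetical page with one prefixed and one unprefixed empty element.
MARKUP = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="%s">'
    '<body><esi:include src="foo"/><br/></body></html>' % ESI_NS
)

# Same content, parsed with the XML parser both times.
with_doctype = etree.tostring(
    etree.fromstring(DOCTYPE + MARKUP), encoding="unicode"
)
without_doctype = etree.tostring(
    etree.fromstring(MARKUP), encoding="unicode"
)

print("libxml2 version:", etree.LIBXML_VERSION)
print("with doctype:   ", with_doctype)
print("without doctype:", without_doctype)
```

Comparing how `esi:include` and `br` come out in the two lines shows whether the doctype-triggered formatting treats prefixed elements differently on a given installation.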
Stefan Behnel wrote:
Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.
That's going to be pretty tricky, but I guess we can try. I wonder what the side-effect of this may be, though. Presumably, the DOCTYPE detection is there for a reason.
Good to know. I'm not sure I know how to formulate the needed changes except by re-stating the problem I'm having here, though. It'd probably help if I understood the purpose of the special formatting better. I naively thought that XHTML = XML and wouldn't need any magic. :) Martin
Martin Aspeli, 29.01.2010 16:09:
:) I think it's because I complained about one of the early 2.7.x versions breaking lxml's serialisation completely, so Daniel eventually added some "do what I mean" work-around to call the right functions in absence of a specific configuration (which lxml can't pass as the API it uses doesn't allow it ...) Don't expect everything in libxml2 to be well designed from the ground up. It was grown over years and has become a crucial part of the GNU/GNOME/... infrastructure. It naturally carries quite a bit of backwards compatibility with it, in both API and functionality. It certainly has its edges. Discussing new stuff to move it into the right directions is almost always worth it.
I wouldn't call that naive. Just go and ask. Stefan
This seems to be a limitation of the xml serializer when it detects xhtml :(

    $ xsltproc --version
    Using libxml 20703-SVN3827, libxslt 10124-SVN1494 and libexslt 813
    xsltproc was compiled against libxml 20703, libxslt 10124 and libexslt 813
    libxslt 10124 was compiled against libxml 20703
    libexslt 813 was compiled against libxml 20703

    $ cat in.html
    <html></html>

    $ cat test.xsl
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:esi="http://www.edge-delivery.org/esi/1.0"
        xmlns="http://www.w3.org/1999/xhtml">
      <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
          media-type="text/html" encoding="utf-8"
          doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
          doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
      <xsl:template match="/">
        <html><body><esi:include src="foo"/></body></html>
      </xsl:template>
    </xsl:stylesheet>

    $ xsltproc test.xsl in.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0"><body><esi:include src="foo"></esi:include></body></html>

xsltproc is using the xml parser here. We need xhtml mode or you end up with elements like <br/> (no space) which confuse some browsers.

Laurence
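The same experiment can be run from Python through `lxml.etree.XSLT`, which may be handier than xsltproc when testing against the libxml2/libxslt versions an application actually uses. This is a sketch that reuses the stylesheet above; the exact serialised form of the `esi:include` element is left unasserted because it depends on the library versions.

```python
from lxml import etree

# Same stylesheet as in the xsltproc session above.
STYLESHEET = '''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:esi="http://www.edge-delivery.org/esi/1.0"
    xmlns="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
      media-type="text/html" encoding="utf-8"
      doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
  <xsl:template match="/">
    <html><body><esi:include src="foo"/></body></html>
  </xsl:template>
</xsl:stylesheet>
'''

transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(etree.fromstring("<html></html>"))

# str() serialises the result according to the xsl:output settings.
serialised = str(result)

print("libxml2:", etree.LIBXML_VERSION, "libxslt:", etree.LIBXSLT_VERSION)
print(serialised)
```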
Stefan Behnel wrote:
I tried this (self-closing tag issue notwithstanding), like so:

    root = tree.getroot()
    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    newRoot = etree.Element('html', nsmap=nsmap)
    newRoot.attrib.update(root.attrib.items())
    newRoot[:] = copy.deepcopy(root)[:]
    tree._setroot(newRoot)

Unfortunately, I've now lost the doctype. :( The head of the page looks like:

    <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>

Intriguingly, the <esi:include /> tag now self-closes. :-) However, Firefox is showing an empty page. Martin
Martin Aspeli, 29.01.2010 04:38:
You can also create the element using the parser:

    newRoot = etree.XML('''
    <!DOCTYPE ...>
    <html xmlns:esi="http://www.edge-delivery.org/esi/1.0"
          xmlns="http://www.w3.org/1999/xhtml" lang="en"/>''')

Sadly, doctype setting isn't currently as easy as it could be...
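A runnable variant of this idea, as a sketch: the XHTML 1.0 Transitional doctype is filled in here purely as an assumption (the original leaves it elided), and the small `old_root` document is a made-up stand-in for the parsed page.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Parse an empty root that already carries the doctype and both namespace
# declarations. The concrete doctype is an assumption for illustration.
new_root = etree.XML(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html xmlns:esi="http://www.edge-delivery.org/esi/1.0" '
    'xmlns="http://www.w3.org/1999/xhtml" lang="en"/>'
)

# Hypothetical stand-in for the root of the already parsed page.
old_root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>'
)

# Move the existing children under the new root.
new_root[:] = old_root[:]

# Serialising the ElementTree (not just the element) includes the doctype.
output = etree.tostring(new_root.getroottree(), encoding="unicode")
print(output)
```

Because the doctype is part of the document again, the XHTML-specific formatting discussed earlier in the thread may apply to this output.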
Intriguingly, the <esi:include /> tag now self-closes. :-) However, Firefox is showing an empty page.
May or may not be due to the missing doctype. Stefan
participants (3)
- Laurence Rowe
- Martin Aspeli
- Stefan Behnel