Mailman 3 [lxml-dev] Building an ESI tag with lxml - lxml - The Python XML Toolkit


[lxml-dev] Building an ESI tag with lxml


Martin Aspeli

Jan. 28, 2010

8:52 a.m.

Hi,

I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.

The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

    <esi:include src="http://..." />

The code I used looks like this:

    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    root.nsmap.update(nsmap) # root is the <html /> element
    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', url)
    placeholder.addnext(esiNode) # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.
- The <esi:include /> tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Thus:

    <esi:include src="http://..." xmlns:esi="http://www.edge-delivery.org/esi/1.0"></esi:include>

What can I do to push the namespace declaration up to the top node ('root') and make the tag self-closing?

Martin

--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book


Stefan Behnel

January 2010

11:04 a.m.

Hi Martin,

Martin Aspeli, 28.01.2010 14:52:

...

I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.

First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).

...

The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

<esi:include src="http://..." />

The code I used looks like this:

    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    root.nsmap.update(nsmap) # root is the <html /> element

Updating the nsmap property has no effect. I've updated the docstring appropriately.

...

    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', url)
    placeholder.addnext(esiNode) # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.

As I said, namespaces in HTML...

To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g.

    new_root = etree.Element('html', nsmap=nsmap)
    new_root[:] = root[:] # or copy.deepcopy(root)[:]

I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though.

I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting).
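[Editor's sketch] The new-root approach described above could look roughly like this in current Python. The input document, the `lang` attribute and the variable names are made up for illustration; copying the root's attributes is an extra step that the slice assignment does not do for you.

```python
from lxml import etree, html

ESI_NS = "http://www.edge-delivery.org/esi/1.0"
nsmap = {"esi": ESI_NS}

# Hypothetical input, parsed with the HTML parser as in the thread.
root = html.fromstring(
    '<html lang="en"><head><title>t</title></head>'
    "<body><p>placeholder</p></body></html>"
)

# New root element that carries the namespace declaration from the start.
new_root = etree.Element("html", nsmap=nsmap)
# Slice assignment only moves children, so copy the attributes explicitly.
new_root.attrib.update(root.attrib)
# Move (not copy) all children of the old root to the new one.
new_root[:] = root[:]

serialised = etree.tostring(new_root)
```

After this, the old `root` is left without children, so any later processing should continue on `new_root`.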

...

- The <esi:include /> tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

1) fix Varnish
2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
3) close the tag through byte string substitution *after* serialisation

If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

Stefan

Martin Aspeli

7 p.m.

Stefan Behnel wrote:

...

Hi Martin,

Martin Aspeli, 28.01.2010 14:52:

...
I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.

First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).

...
The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

<esi:include src="http://..." />

The code I used looks like this:

    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    root.nsmap.update(nsmap) # root is the <html /> element

Updating the nsmap property has no effect. I've updated the docstring appropriately.

Ok, thanks.

...

...
    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', url)
    placeholder.addnext(esiNode) # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.

As I said, namespaces in HTML...

To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g.

    new_root = etree.Element('html', nsmap=nsmap)
    new_root[:] = root[:] # or copy.deepcopy(root)[:]

How (in)efficient is this?

...

I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though.

Ok.

...

I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting).

Agree. I'd be happy to pass something to the serialiser about namespaces.

...

...
- The <esi:include /> tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

Is there no way to make it aware of it? Seems this should be configurable (or monkey-patch-able) somewhere...

...

1) fix Varnish
2) serialise to XHTML instead of HTML (assuming that Varnish supports that)

It probably would. What's that look like?

...

3) close the tag through byte string substitution *after* serialisation

Yipes. If I do that, I'll just do the entire tag through such a substitution to be honest and not use lxml at all.

...

If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

That sounds pretty bad for performance. :(

Martin

Laurence Rowe

8:47 p.m.

2010/1/29 Martin Aspeli <optilude+lists@gmail.com>:

...


FWIW, the only way I've found to get good xhtml output from html parsing is with an xsl like the following...

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
          media-type="text/html" encoding="utf-8"
          doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
          doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>

This triggers the xml output mode to produce valid xhtml. If et.docinfo.public_id and et.docinfo.system_url could be set somehow then I'm sure it would work without the transform. (The relevant code is at the top of libxml2/xmlsave.c - basically so long as you have one of the xhtml public ids or system urls you'll get the right output).

Laurence
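[Editor's sketch] One way to apply such a stylesheet from Python is lxml's `etree.XSLT`. The input document below is a made-up example, and the exact formatting of the result depends on the installed libxml2/libxslt versions, so no output is shown.

```python
from lxml import etree, html

# The stylesheet from the message above, unchanged apart from layout.
XSL = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
      media-type="text/html" encoding="utf-8"
      doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(XSL))

# Hypothetical input, parsed with the HTML parser.
doc = html.fromstring("<html><body><p>one<br>two</p></body></html>")

# str() serialises the result according to the xsl:output settings.
xhtml_text = str(transform(doc))
```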

Stefan Behnel

3:59 a.m.

Martin Aspeli, 29.01.2010 01:00:

...

It's about linear in the number of elements in your tree, plus the number of direct children for the move operation. Maybe not the most efficient thing to do, but usually pretty fast. Certainly a lot faster than you could ever get your own hand-rolled serialiser in Python, for instance. You can compare the absolute numbers on this page:

http://codespeak.net/lxml/performance.html#parsing-and-serialising
http://codespeak.net/lxml/performance.html#merging-different-sources

...

I know, that's how ET's serialiser works. Can't work for lxml, though. The serialiser in libxml2 can only write out what is there. It could work for a doctype, though. Support for passing that verbatim into the serialiser would be a nice feature.

...

Monkey-patching isn't all that easy in libxml2, though... Not that it can't work for C code, it's just not that portable - nor particularly safe... ;-)

...

Depends on your input. If it's HTML, there's an html_to_xhtml() function in lxml.html that can do the conversion for you. And the serialiser can always be chosen using the 'method' argument (that's basically the difference between lxml.etree.tostring() and lxml.html.tostring()).
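[Editor's sketch] The difference between the two serialisers, and the effect of `html_to_xhtml()`, can be seen on a small made-up document. The variable names are illustrative only.

```python
from lxml import etree, html

# Hypothetical input, parsed with the HTML parser.
doc = html.fromstring("<html><body><p>one<br>two</p></body></html>")

# lxml.html.tostring() defaults to method="html" ...
as_html = html.tostring(doc)
# ... but the XML serialiser can be selected via the 'method' argument,
as_xml = html.tostring(doc, method="xml")
# which is what lxml.etree.tostring() uses by default.
also_xml = etree.tostring(doc)

# Move all tags into the XHTML namespace, in place.
html.html_to_xhtml(doc)
as_xhtml = etree.tostring(doc)
```

The HTML serialiser writes `<br>` while the XML serialiser writes `<br/>`; after `html_to_xhtml()` every element tag carries the XHTML namespace.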

...

It's rather safe, though. The exact string to replace would be "></esi:include>", which won't appear that easily in your content. Doing the parsing and replacing manually on the input is a lot more fragile.
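[Editor's sketch] The substitution on the exact string mentioned above needs nothing beyond `bytes.replace()`. The function name and the sample page are made up for illustration.

```python
def self_close_esi_includes(serialised: bytes) -> bytes:
    """Rewrite '<esi:include ...></esi:include>' as '<esi:include .../>'."""
    return serialised.replace(b"></esi:include>", b"/>")


# Hypothetical serialiser output containing an open/close tag pair.
page = (
    b"<html><body>"
    b'<esi:include src="./target.html"></esi:include>'
    b"</body></html>"
)
fixed = self_close_esi_includes(page)
```

This only touches empty `esi:include` elements, since the pattern requires the closing tag to follow the start tag immediately.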

...

Don't underestimate the speed of a tool that was made for the job. Stefan

Stefan Behnel

10:27 a.m.

[replying to myself] Stefan Behnel, 29.01.2010 09:59:

...

The serialiser in libxml2 can only write out what is there.

It could work for a doctype, though. Support for passing that verbatim into the serialiser would be a nice feature.

There we go: https://codespeak.net/viewvc/?view=rev&revision=70976

    >>> xml = '<!DOCTYPE root>\n<root/>'
    >>> tree = etree.parse(StringIO(xml))

    >>> print(etree.tostring(tree))
    <!DOCTYPE root>
    <root/>

    >>> print(etree.tostring(tree,
    ...     doctype='<!DOCTYPE root SYSTEM "/tmp/test.dtd">'))
    <!DOCTYPE root SYSTEM "/tmp/test.dtd">
    <root/>
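[Editor's sketch] In current Python 3, where `etree.tostring()` returns bytes by default, the same `doctype` argument can be exercised like this. The variable names are illustrative.

```python
from io import BytesIO

from lxml import etree

tree = etree.parse(BytesIO(b"<!DOCTYPE root>\n<root/>"))

# Without the argument, the parsed DOCTYPE is written out as it is.
default_output = etree.tostring(tree)

# With the argument, the given declaration is written instead.
overridden = etree.tostring(
    tree, doctype='<!DOCTYPE root SYSTEM "/tmp/test.dtd">'
)
```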

Stefan

Martin Aspeli

9:26 p.m.

Stefan Behnel wrote:

...

I've just tried this with serialization using lxml.etree.tostring instead of lxml.html.tostring. Unfortunately, I'm still getting an open-close tag pair instead of a self-closed tag. Any idea what I may be doing wrong?

    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', tileHref)
    esiNode.text = None
    tilePlaceholderNode.addnext(esiNode)
    toRemove.append(tilePlaceholderNode)

Output:

    ...<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="http://..."></esi:include>

Martin

Stefan Behnel

2:47 a.m.

Martin Aspeli, 29.01.2010 03:26:

...

Stefan Behnel wrote:

...
Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

1) fix Varnish
2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
3) close the tag through byte string substitution *after* serialisation

If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

I've just tried this with serialization using lxml.etree.tostring instead of lxml.html.tostring. Unfortunately, I'm still getting an open-close tag pair instead of a self-closed tag. Any idea what I may be doing wrong?

    esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    esiNode.set('src', tileHref)
    esiNode.text = None

Setting the .text to None is redundant as this is a new element. Otherwise, doing that should be enough to erase all text content.

...

    tilePlaceholderNode.addnext(esiNode)
    toRemove.append(tilePlaceholderNode)

I guess I would have used parent.replace(old,new) here.

...

Output:

...<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="http://..."></esi:include>

Works for me. Could you send me a complete code snippet where it doesn't work for you? Stefan

Martin Aspeli

3:15 a.m.

Stefan Behnel wrote:

...

It was an act of desperation. :)

...

I didn't do this for two reasons:

1. In some cases (though not here) I'm replacing one placeholder with multiple nodes.
2. This code appears within a loop that's manipulating the tree for each of multiple elements matched with an XPath expression. I thought deleting a node mid-iteration would cause problems.

...

How much work are you willing to put in? :-) I can give you a Plone buildout that will set up everything and talk you through the steps to reproduce. It's not very hard, it just requires a few steps. I won't bother explaining it if you don't have half an hour to chase it down, though. :)

Martin

Stefan Behnel

4:18 a.m.

Martin Aspeli, 29.01.2010 09:15:

...

I know. ;)

...

Another nice feature: support a sequence as replacement. :) Although that requirement is basically satisfied with slice replacements, so I guess that won't make it in for now.
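[Editor's sketch] A slice replacement that swaps one placeholder for several nodes might look like this. The element names are made up for illustration.

```python
from lxml import etree

# Hypothetical tree with a single placeholder between two siblings.
parent = etree.fromstring("<div><a/><placeholder/><b/></div>")
placeholder = parent.find("placeholder")

# The nodes that should take the placeholder's position.
replacements = [etree.Element("x"), etree.Element("y")]

# Replace the one-element slice holding the placeholder with the new nodes.
index = parent.index(placeholder)
parent[index:index + 1] = replacements
```

The placeholder is removed from the tree and the replacement nodes appear at its former position, in order.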

...

XPath returns a list of nodes, so you are no longer iterating over the tree structure in this case. Ripping stuff out should be absolutely safe here.
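[Editor's sketch] Because the XPath result is a plain Python list, the tree can be modified while looping over it. The document and tag names below are made up for illustration.

```python
from lxml import etree

# Hypothetical document with two placeholders.
doc = etree.fromstring(
    '<body><img class="mceTile" alt="a.html"/>'
    '<p>text</p><img class="mceTile" alt="b.html"/></body>'
)

# xpath() returns a list that is independent of later tree changes.
matches = doc.xpath("//img[@class='mceTile']")

for img in matches:
    marker = etree.Element("include")
    marker.set("src", img.get("alt"))
    # Replacing the node inside the loop does not disturb the iteration.
    img.getparent().replace(img, marker)
```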

...

LOL! :)

"You know, I have this huge pile of code here, but it's really easy to set up and then all you have to do is a tiny bit of debugging. It's easy! It really is! I can't believe you don't want to feel the fun to try it!"

Honestly, could you try to come up with a little example that injects namespaced XML content into a small HTML page, and that shows that the XML serialiser behaves unexpectedly? Shouldn't be hard to write...

Stefan

Martin Aspeli

4:22 a.m.

Stefan Behnel wrote:

...

Could you elaborate an example?

...

Cool! Less code.

...

That's why I asked. :-p

...

See my other mail. I got a minimal example that's bombing out for me.

Martin

Martin Aspeli

4:10 a.m.

Stefan Behnel wrote:

...

Works for me. Could you send me a complete code snippet where it doesn't work for you?

Okay, here's a pseudo-doctest that illustrates the problem:

First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.

    >>> from lxml import etree, html
    >>> doc = """\
    ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ... <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    ... <body>
    ... <img class="mceItem mceTile" src="foo.png" alt="./target.html" />
    ... </body>
    ... </html>
    ... """
    >>> inputTree = html.fromstring(doc)

We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:

    >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")
    >>> matched = list(placeholderXPath(inputTree))
    >>> matchedNode = matched[0]

We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.

    >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    >>> esiNode.set('src', matchedNode.get('alt'))

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>

Now we connect it to the parent:

    >>> matchedNode.getparent().replace(matchedNode, esiNode)

At this point it's all over:

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>

And sure enough:

    >>> print etree.tostring(inputTree)
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body>
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>
    </body></html>

It's also interesting to note that this suddenly has the xmlns declaration twice.

Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser). As soon as I insert it into the parent tree, the tag stops self closing.

Martin

Stefan Behnel

5:08 a.m.

Martin Aspeli, 29.01.2010 10:10:

...

here's a pseudo-doctest that illustrates the problem:

First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.

I think that's the main problem. If you parse XHTML using the HTML parser, you lose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first. However, according to your last comment, it seems you have tried the XML parser already...

    >>> from lxml import etree, html
    >>> doc = """\
    ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ... <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    ... <body>
    ... <img class="mceItem mceTile" src="foo.png" alt="./target.html" />
    ... </body>
    ... </html>
    ... """
    >>> inputTree = html.fromstring(doc)

We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:

    >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")

Perfect use case for lxml.cssselect. :)

...

    >>> matched = list(placeholderXPath(inputTree))

As I said, XPath returns a *list*. Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management.

...

    >>> matchedNode = matched[0]

We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.

    >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    >>> esiNode.set('src', matchedNode.get('alt'))

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>

Ok so far.

...

Now we connect it to the parent:

    >>> matchedNode.getparent().replace(matchedNode, esiNode)

At this point it's all over:

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>

Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here. I just looked it up in the sources, recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something.

...

And sure enough:

    >>> print etree.tostring(inputTree)
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body>
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>
    </body></html>

It's also interesting to note that this suddenly has the xmlns declaration twice.

... namespaces in HTML ...

...

Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser).

It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later. Stefan

Martin Aspeli

5:51 a.m.

Stefan Behnel wrote:

...

Martin Aspeli, 29.01.2010 10:10:

...
here's a pseudo-doctest that illustrates the problem:

First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.

I think that's the main problem. If you parse XHTML using the HTML parser, you lose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first.

This code is being used in a post-processing step for output from Plone. Performance is important, so trial-and-error like this is probably undesirable. And even then, this would need to work for documents parsed with the HTML parser. The output being transformed could include not-quite-well-formed XHTML from content-managed pages. That's the attraction of lxml in the first place - it can deal with somewhat-crap output. ;)

...

However, according to your last comment, it seems you have tried the XML parser already...

I just re-confirmed it. If the whole thing is parsed with etree.fromstring (and lxml.html is not used anywhere) it still doesn't close.

...

    >>> from lxml import etree, html
    >>> doc = """\
    ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ... <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    ... <body>
    ... <img class="mceItem mceTile" src="foo.png" alt="./target.html" />
    ... </body>
    ... </html>
    ... """
    >>> inputTree = html.fromstring(doc)

We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:

    >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")

Perfect use case for lxml.cssselect. :)

Well, I got the XPath from css2xpath.appspot.com which uses the same algorithm I think.

...

    >>> matched = list(placeholderXPath(inputTree))

As I said, XPath returns a *list*.

Cool, thanks.

...

Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management.

Changing it in a future release may be risky if it returns a list now.

...

    >>> matchedNode = matched[0]

We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.

    >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    >>> esiNode.set('src', matchedNode.get('alt'))

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>

Ok so far.

...
Now we connect it to the parent:

    >>> matchedNode.getparent().replace(matchedNode, esiNode)

At this point it's all over:

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>

Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here.

I just looked it up in the sources, recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something.

Makes sense, sorta, but I would've thought this was a matter for serialisation, not parsing? Even when parsing as HTML, I'm using etree.tostring() to serialise.

...

...
And sure enough:

    >>> print etree.tostring(inputTree)
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body>
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>
    </body></html>

It's also interesting to note that this suddenly has the xmlns declaration twice.

... namespaces in HTML ...

Yeah, yeah. XHTML. ;-)

...

...
Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser).

It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later.

I appreciate it!

Martin

Stefan Behnel

7:16 a.m.

Martin Aspeli, 29.01.2010 11:51:

...

Obviously. It would rather become a new method on the XPath class, like xpath.iterfind(el).

...

I read through the libxml2 sources a bit more. It's not confusing HTML at all, it's even smarter than I thought. It looks at the *doctype* of the document that is being serialised and then applies special XHTML formatting rules. :o)

http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137

Stefan

Martin Aspeli

8:27 a.m.

Stefan Behnel wrote:

...

But... XHTML says empty tags can self-close as far as I know. And even then, this is in a different namespace.

...

http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137

My C fu is weak. Any hints in there I'm missing?

Martin

Stefan Behnel

9:32 a.m.

Martin Aspeli, 29.01.2010 14:27:

...

Sure. I just pointed you to the code that formats the output. Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.
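[Editor's sketch] Keeping the DOCTYPE out of the tree could be done by splitting it off before parsing and prepending it after serialisation. The regular expression is a simplification that assumes a DOCTYPE without an internal subset, and the sample page, which is well-formed and parsed with the XML parser here, is made up for illustration.

```python
import re

from lxml import etree

# Simplifying assumption: the DOCTYPE contains no internal subset ("[...]").
DOCTYPE_RE = re.compile(rb"\s*(<!DOCTYPE[^>]*>)\s*", re.IGNORECASE)


def split_doctype(markup: bytes):
    """Return (doctype, rest); doctype is b'' if the markup has none."""
    match = DOCTYPE_RE.match(markup)
    if match is None:
        return b"", markup
    return match.group(1), markup[match.end():]


# Hypothetical well-formed XHTML page that already contains the ESI tag.
page = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml" '
    b'xmlns:esi="http://www.edge-delivery.org/esi/1.0">'
    b'<body><esi:include src="./target.html"/></body></html>'
)

doctype, body = split_doctype(page)
# The tree never sees the XHTML DOCTYPE, so the plain XML rules apply.
root = etree.fromstring(body)
output = doctype + b"\n" + etree.tostring(root)
```

Since the serialiser finds no XHTML doctype on the document, the empty prefixed element is written in its self-closing form.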

...

...
http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137

My C fu is weak. Any hints in there I'm missing?

This, for example:

http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451

The rule that bites you here is in line 1452. If the element uses a namespace prefix, it will not become self-closing. I have no idea about the reasoning behind such a rule, but if you are interested, I'd go straight to the libxml2 mailing list and ask.

There's also line 1414

http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414

which emits a default namespace declaration for the XHTML namespace regardless of the existing declarations. Certainly space left for enhancements.

IIRC, the XHTML formatting is rather new, may have been added in the 2.7 line. You'll have a good chance of being heard if you propose some sensible improvements to it.

Stefan

Martin Aspeli

10:09 a.m.

Stefan Behnel wrote:

...

Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.

That's going to be pretty tricky, but I guess we can try. I wonder what the side-effect of this may be, though. Presumably, the DOCTYPE detection is there for a reason.

...

Good to know. I'm not sure I know how to formulate the needed changes except by re-stating the problem I'm having here, though. It'd probably help if I understood the purpose of the special formatting better. I naively thought that XHTML = XML and wouldn't need any magic. :) Martin

Stefan Behnel

10:45 a.m.

Martin Aspeli, 29.01.2010 16:09:

...

:) I think it's because I complained about one of the early 2.7.x versions breaking lxml's serialisation completely, so Daniel eventually added some "do what I mean" work-around to call the right functions in absence of a specific configuration (which lxml can't pass as the API it uses doesn't allow it ...) Don't expect everything in libxml2 to be well designed from the ground up. It was grown over years and has become a crucial part of the GNU/GNOME/... infrastructure. It naturally carries quite a bit of backwards compatibility with it, in both API and functionality. It certainly has its edges. Discussing new stuff to move it into the right directions is almost always worth it.

...

I wouldn't call that naive. Just go and ask. Stefan

Laurence Rowe

February 2010

9:56 a.m.

This seems to be a limitation of the xml serializer when it detects xhtml :(

    $ xsltproc --version
    Using libxml 20703-SVN3827, libxslt 10124-SVN1494 and libexslt 813
    xsltproc was compiled against libxml 20703, libxslt 10124 and libexslt 813
    libxslt 10124 was compiled against libxml 20703
    libexslt 813 was compiled against libxml 20703

    $ cat in.html
    <html></html>

    $ cat test.xsl
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:esi="http://www.edge-delivery.org/esi/1.0"
        xmlns="http://www.w3.org/1999/xhtml">
      <xsl:output method="xml" indent="no" omit-xml-declaration="yes"
          media-type="text/html" encoding="utf-8"
          doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
          doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
      <xsl:template match="/">
        <html><body><esi:include src="foo"/></body></html>
      </xsl:template>
    </xsl:stylesheet>

    $ xsltproc test.xsl in.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0"><body><esi:include src="foo"></esi:include></body></html>

xsltproc is using the xml parser here. We need xhtml mode or you end up with elements like (no space) which confuse some browsers.

Laurence

Martin Aspeli

January 2010

10:38 p.m.

Stefan Behnel wrote:

...

I tried this (self-closing tag issue notwithstanding), like so:

    root = tree.getroot()
    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    newRoot = etree.Element('html', nsmap=nsmap)
    newRoot.attrib.update(root.attrib.items())
    newRoot[:] = copy.deepcopy(root)[:]
    tree._setroot(newRoot)

Unfortunately, I've now lost the doctype. :( The head of the page looks like:

    <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>

Intriguingly, the <esi:include /> tag now self-closes. :-) However, Firefox is showing an empty page.

Martin

Stefan Behnel

2:36 a.m.

Martin Aspeli, 29.01.2010 04:38:

...

You can also create the element using the parser:

    newRoot = etree.XML('''
    <!DOCTYPE ...>
    <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en"/>''')

Sadly, doctype setting isn't currently as easy as it could be...

...

Intriguingly, the <esi:include /> tag now self-closes. :-) However, Firefox is showing an empty page.

May or may not be due to the missing doctype. Stefan
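The parser-based approach could be combined with the child-moving step from earlier in the thread roughly as follows. This is a sketch, not a tested recipe for the real application: the sample document is made up, and the XHTML 1.0 Transitional DOCTYPE is assumed only because it is the one used in the doctest elsewhere in the thread.

```python
from lxml import etree, html

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

# Stand-in for a document that came out of the HTML parser.
old_root = html.fromstring(
    '<html lang="en"><head><title>t</title></head><body><p>x</p></body></html>'
)

# Let the XML parser build an empty root that carries both the DOCTYPE
# and the namespace declarations.
new_root = etree.XML(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns:esi="%s" xmlns="http://www.w3.org/1999/xhtml"/>' % ESI_NS
)

# Copy the attributes, then move (not copy) the children across.
new_root.attrib.update(old_root.attrib.items())
new_root[:] = old_root[:]

new_tree = new_root.getroottree()
```

Note that the moved children keep the (empty) namespace they were parsed with, so only the new root element itself is in the XHTML namespace as far as the tree model is concerned. Since the DOCTYPE is back in the tree, the XHTML-specific serialisation discussed elsewhere in the thread may apply again.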

Stefan Behnel

January 2010

11:04 a.m.

Hi Martin, Martin Aspeli, 28.01.2010 14:52:

...

I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.

First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).

...

The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

<esi:include src="http://..." />

The code I used looks like this:

nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} root.nsmap.update(nsmap) # root is the <html /> element

Updating the nsmap property has no effect. I've updated the docstring appropriately.

...

esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', url) placeholder.addnext(esiNode) # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.

...

- The <esi:include /> tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Martin Aspeli

7 p.m.

Stefan Behnel wrote:

...

Hi Martin,

Martin Aspeli, 28.01.2010 14:52:

...
I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.

First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).

...
The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

<esi:include src="http://..." />

The code I used looks like this:

nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} root.nsmap.update(nsmap) # root is the <html /> element

Updating the nsmap property has no effect. I've updated the docstring appropriately.

Ok, thanks.

...

...
esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', url) placeholder.addnext(esiNode) # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.

As I said, namespaces in HTML...

To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g.

new_root = etree.Element('html', nsmap=nsmap) new_root[:] = root[:] # or copy.deepcopy(root)[:]

How (in)efficient is this?

...

I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though.

Ok.

...

I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting).

Agree. I'd be happy to pass something to the serialiser about namespaces.
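For readers on a more recent lxml: to the best of my knowledge, the idea described above later landed as a `top_nsmap` keyword on `etree.cleanup_namespaces()` (lxml 3.5 or so; treat the exact version as an assumption). A minimal sketch on an XML-parsed tree, with made-up sample markup:

```python
from lxml import etree

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

# The declaration starts out on the <esi:include/> element itself.
root = etree.fromstring(
    '<html><body>'
    '<esi:include xmlns:esi="%s" src="./target.html"/>'
    '</body></html>' % ESI_NS
)

# Ask lxml to declare the prefix on the top-level element and clean up
# the declarations further down the tree.
etree.cleanup_namespaces(root, top_nsmap={"esi": ESI_NS})

serialised = etree.tostring(root)
```

This only addresses where the declaration ends up. Whether it helps for a tree produced by the HTML parser is a separate question, given the caveats about namespaces in HTML raised above.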

...

...
- The <esi:include /> tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

Is there no way to make it aware of it? Seems this should be configurable (or monkey-patch-able) somewhere...

...

1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)

It probably would. What's that look like?

...

3) close the tag through byte string substitution *after* serialisation

Yipes. If I do that, I'll just do the entire tag through such a substitution to be honest and not use lxml at all.

...

If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

That sounds pretty bad for performance. :( Martin
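For reference, option 3 (closing the tag through byte string substitution after serialisation) could look something like the sketch below. The function name `self_close_esi` and the regular expression are illustrative assumptions, not anything lxml provides, and the pattern assumes that attribute values in the serialised output contain no raw `<` or `>` characters.

```python
import re

# Matches an empty element in the "esi" prefix, e.g.
#   <esi:include src="..."></esi:include>
# Group 1 is the tag name, group 2 the (optional) attribute text.
EMPTY_ESI_RE = re.compile(rb"<(esi:[A-Za-z]+)((?:\s[^<>]*?)?)\s*></\1\s*>")


def self_close_esi(serialised):
    """Rewrite empty <esi:...></esi:...> pairs as self-closing tags."""
    return EMPTY_ESI_RE.sub(rb"<\1\2/>", serialised)


sample = b'<body><esi:include src="http://example.invalid/tile"></esi:include></body>'
fixed = self_close_esi(sample)
```

Because the substitution only touches empty elements with the `esi:` prefix, the rest of the HTML output is passed through unchanged, which keeps the post-processing step narrower than building the whole tag by string substitution.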

Laurence Rowe

8:47 p.m.

2010/1/29 Martin Aspeli <optilude+lists@gmail.com>:

...

Stefan Behnel wrote:

...
Hi Martin,

Martin Aspeli, 28.01.2010 14:52:

...
I'm trying to use lxml to conditionally insert an <esi:include /> tag into an HTML document.

First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared).

...
The document is parsed with the HTML parser and manipulated in various ways. At one point, I search for a node ('placeholder') and want to replace it with something that renders to:

<esi:include src="http://..." />

The code I used looks like this:

nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} root.nsmap.update(nsmap) # root is the <html /> element

Updating the nsmap property has no effect. I've updated the docstring appropriately.

Ok, thanks.

...
...
esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', url) placeholder.addnext(esiNode) # placeholder is later removed

There are two problems with this:

- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML root. Varnish doesn't like this apparently.

As I said, namespaces in HTML...

To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g.

new_root = etree.Element('html', nsmap=nsmap) new_root[:] = root[:] # or copy.deepcopy(root)[:]

How (in)efficient is this?

...
I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though.

Ok.

...
I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting).

Agree. I'd be happy to pass something to the serialiser about namespaces.

...
...
- The <esi:include /> tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less.

Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

Is there no way to make it aware of it? Seems this should be configurable (or monkey-patch-able) somewhere...

...
1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)

It probably would. What's that look like?

...
3) close the tag through byte string substitution *after* serialisation

Yipes. If I do that, I'll just do the entire tag through such a substitution to be honest and not use lxml at all.

...
If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

That sounds pretty bad for performance. :(

Martin

Stefan Behnel

3:59 a.m.

Martin Aspeli, 29.01.2010 01:00:

...

Monkey-patching isn't all that easy in libxml2, though... Not that it can't work for C code, it's just not that portable - nor particularly safe... ;-)

...

Don't underestimate the speed of a tool that was made for the job. Stefan

Stefan Behnel

10:27 a.m.

[replying to myself] Stefan Behnel, 29.01.2010 09:59:

...

The serialiser in libxml2 can only write out what is there.

It could work for a doctype, though. Support for passing that verbatim into the serialiser would be a nice feature.

There we go: https://codespeak.net/viewvc/?view=rev&revision=70976

    >>> xml = '<!DOCTYPE root>\n<root/>'
    >>> tree = etree.parse(StringIO(xml))

    >>> print(etree.tostring(tree))
    <!DOCTYPE root>
    <root/>

    >>> print(etree.tostring(tree,
    ...     doctype='<!DOCTYPE root SYSTEM "/tmp/test.dtd">'))
    <!DOCTYPE root SYSTEM "/tmp/test.dtd">
    <root/>

Stefan

Martin Aspeli

9:26 p.m.

Stefan Behnel wrote:

...

Stefan Behnel

January 2010

2:47 a.m.

Martin Aspeli, 29.01.2010 03:26:

...

Stefan Behnel wrote:

...
Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here:

1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) 3) close the tag through byte string substitution *after* serialisation

If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy).

I've just tried this with serialization using lxml.etree.tostring instead of lxml.html.tostring. Unfortunately, I'm still getting an open-close tag pair instead of a self-closed tag. Any idea what I may be doing wrong?

esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', tileHref) esiNode.text = None

Setting the .text to None is redundant as this is a new element. Otherwise, doing that should be enough to erase all text content.

...

tilePlaceholderNode.addnext(esiNode) toRemove.append(tilePlaceholderNode)

I guess I would have used parent.replace(old,new) here.

...

Output:

...<esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="http://..."></esi:include>

Works for me. Could you send me a complete code snippet where it doesn't work for you? Stefan

Martin Aspeli

3:15 a.m.

Stefan Behnel wrote:

...

It was an act of desperation. :)

...

Stefan Behnel

4:18 a.m.

Martin Aspeli, 29.01.2010 09:15:

...

I know. ;)

...

Another nice feature: support a sequence as replacement. :) Although that requirement is basically satisfied with slice replacements, so I guess that won't make it in for now.

...

XPath returns a list of nodes, so you are no longer iterating over the tree structure in this case. Ripping stuff out should be absolutely safe here.

...
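The two points above (slice replacement as a way to substitute a sequence of elements, and XPath results being a plain list that is safe to modify the tree from) can be illustrated together. The markup and element names below are made up for the example.

```python
from lxml import etree

root = etree.fromstring(
    "<body><p>before</p><img class='tile'/><img class='tile'/><p>after</p></body>"
)

# xpath() hands back a list, i.e. a snapshot of the matches, so the
# tree can be restructured inside the loop.
for placeholder in root.xpath("//img[@class='tile']"):
    parent = placeholder.getparent()
    index = parent.index(placeholder)

    first = etree.Element("span")
    first.text = "a"
    second = etree.Element("span")
    second.text = "b"

    # Replace the single placeholder by a sequence of two elements.
    parent[index:index + 1] = [first, second]

tags = [child.tag for child in root]
```

For a one-to-one swap, `parent.replace(old, new)` as mentioned earlier in the thread is the shorter spelling.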

Martin Aspeli

4:22 a.m.

Stefan Behnel wrote:

...

Could you elaborate an example?

...

Cool! Less code.

...

That's why I asked. :-p

...

Martin Aspeli

4:10 a.m.

Stefan Behnel wrote:

...

Works for me. Could you send me a complete code snippet where it doesn't work for you?

...

    >>> from lxml import etree, html
    >>> doc = """\
    ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ... <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    ... <body>
    ... <img class="mceItem mceTile" src="foo.png" alt="./target.html" />
    ... </body>
    ... </html>
    ... """
    >>> inputTree = html.fromstring(doc)

We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:

    >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")
    >>> matched = list(placeholderXPath(inputTree))
    >>> matchedNode = matched[0]

We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.

    >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
    >>> esiNode.set('src', matchedNode.get('alt'))

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>

Now we connect it to the parent:

    >>> matchedNode.getparent().replace(matchedNode, esiNode)

At this point it's all over:

    >>> print etree.tostring(esiNode)
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>

And sure enough:

    >>> print etree.tostring(inputTree)
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body>
    <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>
    </body></html>

Stefan Behnel

5:08 a.m.

Martin Aspeli, 29.01.2010 10:10:

...

here's a pseudo-doctest that illustrates the problem:

First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.

...

...
...
...
from lxml import etree, html doc = """\ ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> ... <html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> ... <body> ... <img class="mceItem mceTile" src="foo.png" alt="./target.html" /> ... </body> ... </html> ... """ inputTree = html.fromstring(doc)

We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:

...
...
...
placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")

Perfect use case for lxml.cssselect. :)

...

...
...
...
matched = list(placeholderXPath(inputTree))

...

...
...
...
matchedNode = matched[0]

We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.

...
...
...
nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', matchedNode.get('alt'))

...
...
...
print etree.tostring(esiNode) <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>

Ok so far.

...

Now we connect it to the parent:

...
...
...
matchedNode.getparent().replace(matchedNode, esiNode)

At this point it's all over:

...
...
...
print etree.tostring(esiNode) <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>

...

And sure enough:

...
...
...
print etree.tostring(inputTree) <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body> <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include> </body></html>

It's also interesting to note that this suddenly has the xmlns declaration twice.

... namespaces in HTML ...

...

Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser).

It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later. Stefan

Martin Aspeli

January 2010

5:51 a.m.

Stefan Behnel wrote:

...

Martin Aspeli, 29.01.2010 10:10:

...
here's a pseudo-doctest that illustrates the problem:

First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so.

I think that's the main problem. If you parse XHTML using the HTML parser, you lose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first.

...

However, according to your last comment, it seems you have tried the XML parser already...

I just re-confirmed it. If the whole thing is parsed with etree.fromstring (and lxml.html is not used anywhere) it still doesn't close.

...

...
...
...
...
from lxml import etree, html doc = """\ ...<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> ...<html xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> ...<body> ...<img class="mceItem mceTile" src="foo.png" alt="./target.html" /> ...</body> ...</html> ... """ inputTree = html.fromstring(doc)

We are going to replace the <img /> tag with an <esi:include /> tag. We find it via an XPath:

...
...
...
placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")

Perfect use case for lxml.cssselect. :)

Well, I got the XPath from css2xpath.appspot.com which uses the same algorithm I think.

...

...
...
...
...
matched = list(placeholderXPath(inputTree))

As I said, XPath returns a *list*.

Cool, thanks.

...

Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management.

Changing it in a future release may be risky if it returns a list now.

...

...
...
...
...
matchedNode = matched[0]

We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output.

...
...
...
nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', matchedNode.get('alt'))

...
...
...
print etree.tostring(esiNode) <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"/>

Ok so far.

...
Now we connect it to the parent:

...
...
...
matchedNode.getparent().replace(matchedNode, esiNode)

At this point it's all over:

...
...
...
print etree.tostring(esiNode) <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include>

Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here.

I just looked it up in the sources, recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something.

Makes sense, sorta, but I would've thought this was a matter for serialisation, not parsing? Even when parsing as HTML, I'm using etree.tostring() to serialise.

...

...
And sure enough:

...
...
...
print etree.tostring(inputTree) <html xmlns="http://www.w3.org/1999/xhtml" xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xml:lang="en"><body> <esi:include xmlns:esi="http://www.edge-delivery.org/esi/1.0" src="./target.html"></esi:include> </body></html>

It's also interesting to note that this suddenly has the xmlns declaration twice.

... namespaces in HTML ...

Yeah, yeah. XHTML. ;-)

...

...
Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser).

It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later.

I appreciate it! Martin

Stefan Behnel

7:16 a.m.

Martin Aspeli, 29.01.2010 11:51:

...

Obviously. It would rather become a new method on the XPath class, like xpath.iterfind(el).

...

Martin Aspeli

10:09 a.m.

Stefan Behnel wrote:

...

Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation.

That's going to be pretty tricky, but I guess we can try. I wonder what the side-effect of this may be, though. Presumably, the DOCTYPE detection is there for a reason.

...