How to produce explicitly closed form of HTML tags
Hi, I am using lxml's XSLT support to transform XML to HTML. The XSLT specifies the following xsl:output directive: <xsl:output method="html" version="4.0" encoding="UTF-8" media-type="text/html" doctype-public="-//W3C//DTD HTML 4.01//EN" standalone="yes" omit-xml-declaration="yes" indent="no"/> I serialize the output element tree to a string using the etree.tostring() function: xslt_tree = etree.parse(xslt_url) xslt_transform = etree.XSLT(xslt_tree.getroot()) out_tree = xslt_transform(in_root_elem) fp.write(etree.tostring(out_tree)) This creates self-closing HTML tags, for example: <script type="text/javascript" src="tocgen.js"/> This hurts, because some browsers (including IE8 and FF10) do not support self-closing tags for some tags when the doctype is HTML. The script tag is a particularly nasty one because not recognizing that it is actually closed causes the whole rest of the document to be interpreted as (invalid) Javascript. But doing that with other tags may create problems as well. What works on all browsers I tested, is the explicitly closed form: <script type="text/javascript" src="tocgen.js"></script> See this thread for discussion: http://stackoverflow.com/questions/69913/why-dont-self-closing-script-tags-w... BTW, Xalan produces the explicitly closed form for the same scenario. (Just to mention it as one other data point, not that I would consider going back to Xalan - lxml is much much faster and by comparing lxml and Xalan output beyond differences such as self-closing tags I found already one difference that seems to be a Xalan bug). -> Is there a way to get lxml's HTML serialization support to produce the explicitly closed form ? The lxml version I am using is lxml 2.3 with libxml2 statically linked, for Python 2.6, on Windows (lxml-2.3.win32-py2.6.exe <http://pypi.python.org/packages/2.6/l/lxml/lxml-2.3.win32-py2.6.exe#md5=4135...>). -> Is there a later version that has a Windows installer and libxml2 statically linked ? (Other people need to be able to set that up for my stuff to work, and this way it is real simple). Andy
Hi,
-> Is there a way to get lxml's HTML serialization support to produce the explicitly closed form ?
I actually found a solution to this problem, by specifying more parameters to the etree.tostring() function: fp.write(etree.tostring(out_tree, encoding="utf-8", method="html", xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype='<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">')) However, this means that the output parameters specified in the XSLT become meaningless, because they are overwritten in the Python program. -> Is there a way to let lxml honor the output parameters specified in the XSLT ? It seems to me that would be a good default behavior. -> Also, is there a way to get any HTML comments in the element tree to be serialized ? The tostring() parameter with_comments is documented to apply to the C14N output method only, and does not cause the HTML comments to be created when set to True. Andy
Le 07/06/2012 12:47, Andreas Maier a écrit :
I actually found a solution to this problem, by specifying more parameters to the etree.tostring() function:
fp.write(etree.tostring(... method="html"))
Alternatively, html5lib can serialize lxml trees: http://code.google.com/p/html5lib/wiki/UserDocumentation#Serialization_of_St... https://github.com/Kozea/docutils-html5-writer/blob/master/docutils_html5/__... Regards, -- Simon Sapin
Andreas Maier, 07.06.2012 12:08:
I am using lxml's XSLT support to transform XML to HTML. The XSLT specifies the following xsl:output directive:
<xsl:output method="html" version="4.0" encoding="UTF-8" media-type="text/html" doctype-public="-//W3C//DTD HTML 4.01//EN" standalone="yes" omit-xml-declaration="yes" indent="no"/>
I serialize the output element tree to a string using the etree.tostring() function:
xslt_tree = etree.parse(xslt_url) xslt_transform = etree.XSLT(xslt_tree.getroot()) out_tree = xslt_transform(in_root_elem) fp.write(etree.tostring(out_tree))
This creates self-closing HTML tags, for example:
<script type="text/javascript" src="tocgen.js"/>
This hurts, because some browsers (including IE8 and FF10) do not support self-closing tags for some tags when the doctype is HTML. The script tag is a particularly nasty one because not recognizing that it is actually closed causes the whole rest of the document to be interpreted as (invalid) Javascript. But doing that with other tags may create problems as well.
What works on all browsers I tested, is the explicitly closed form:
<script type="text/javascript" src="tocgen.js"></script>
You can put a space character into the script tag. Browsers will happily ignore it. Stefan
On 7 June 2012 12:47, Andreas Maier <andreas.r.maier@gmx.de> wrote:
Hi,
-> Is there a way to get lxml's HTML serialization support to produce the explicitly closed form ?
I actually found a solution to this problem, by specifying more parameters to the etree.tostring() function:
fp.write(etree.tostring(out_tree, encoding="utf-8", method="html", xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype='<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">'))
However, this means that the output parameters specified in the XSLT become meaningless, because they are overwritten in the Python program.
-> Is there a way to let lxml honor the output parameters specified in the XSLT ? It seems to me that would be a good default behavior.
Yes, an XSLTResultTree has a __str__ method which respects the <xsl:output> options set, so you should be using: fp.write(str(out_tree)) see: http://lxml.de/FAQ.html#what-is-the-difference-between-str-xslt-doc-and-xslt... When no <xsl:output method="..."> is specified then the XSLT spec says method="html" should be used when the root element is <html>. That will invoked the HTMLSerializer (lxml.html.tostring) and give you HTML output (<br> instead of <br />). If you need HTML compatible XHTML output (i.e. <br /> rather than <br> along with no self-closing script tags) then you need to specify <xsl:output method="xml"> along with an XHTML 1.0 doctype.
-> Also, is there a way to get any HTML comments in the element tree to be serialized ? The tostring() parameter with_comments is documented to apply to the C14N output method only, and does not cause the HTML comments to be created when set to True.
To generate comment elements make sure that your XSLT is not inadvertently stripping them out, that probably means you are missing the identity template somewhere. Laurence
Hi Laurence. Thanks, will try it out when I'm back from vacation. On XML comments: The XSLT produces them explicitly with xsl:comment, and they are generated with xalan and saxon. Thanks, Andy
participants (5)
-
Andreas Maier -
andreasmaier50@googlemail.com -
Laurence Rowe -
Simon Sapin -
Stefan Behnel