How to pass xmlSaveOption flags to etree.tostring?

I need to serialize XHTML 1.1 without minimized empty elements. Specifically, I want to produce EPUB 2 files, which require XHTML 1.1. Some user agents, like epubreader for Firefox, load the XHTML files from disk, in which case Firefox uses the HTML parser and makes a big mess of minimized elements. So I wonder if there is a way to pass the xmlSaveOption flags of libxml2 to etree.tostring(). If there is no such way, I humbly request that the feature be added.

Regards -- Marcello Perathoner webmaster@gutenberg.org

Marcello Perathoner, 30.03.2013 17:10:
> I need to serialize XHTML 1.1 without minimized empty elements.

Do you mean something like "<tag/>" instead of "<tag></tag>"? That would be the XML_SAVE_NO_EMPTY option then, I guess?
It should detect the XHTML namespace though, and then parse it correctly. I wonder why it's a problem on your side. Could you present the code that you use for serialising, and preferably also a tiny example snippet of the output? Especially the XML declaration and the root element?
> So I wonder if there is a way to pass the option flags in libxml xmlSaveOption to etree.tostring ().

Not currently. The xmlsave API of libxml2 is not being used in lxml. IIRC, it was added in a later version of libxml2 than lxml currently supports. Should be possible to switch optionally at C compile time, though. Patches welcome.

Stefan

On 03/31/2013 06:06 PM, Stefan Behnel wrote:
> Do you mean something like "<tag/>" instead of "<tag></tag>"? That would be the XML_SAVE_NO_EMPTY option then, I guess?
The other way around: I want <tag></tag> and not <tag/>. That is what the option does.
--- test.py ---
from lxml import etree

root = etree.fromstring ("""
<html>
  <body>
    <p>
      <span style="color: red"></span>black
    </p>
  </body>
</html>
""")

XHTML11_DOCTYPE = "<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' \
'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>"

print (etree.tostring (
    root,
    method = 'xml',
    xml_declaration = True,
    doctype = XHTML11_DOCTYPE,
    encoding = 'utf-8',
    pretty_print = True))
------

Output is:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html>
  <body>
    <p>
      <span style="color: red"/>black
    </p>
  </body>
</html>

To illustrate the problem do:

$ python test.py > test.html
$ firefox test.html
$ chromium test.html

When browsers open a file from disk, they use the HTML parser, even if the file explicitly declares itself to be xml. The HTML parser makes a mess of many things but especially of the <tag/> forms.

The question is: how can I write an xml file containing XHTML 1.1 without the troubling <tag/> form?

Regards -- Marcello Perathoner webmaster@gutenberg.org

Marcello Perathoner, 31.03.2013 23:35:
Works for me when I 1) rename the file extension to .xhtml and 2) use the correct XHTML namespace inside of the file, i.e.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>
      <span style="color: red"/>black
    </p>
  </body>
</html>

Stefan

On 04/01/2013 07:59 AM, Stefan Behnel wrote:
Unfortunately, renaming the file(s) is not an option because it will break software from other vendors, e.g. kindlegen by Amazon. I'm not looking for a workaround either. I already have a workaround that works just fine: insert a unique character into all empty elements, serialize, then strip the unique character again. It's ugly and slow, but it works.

IMO if libxml has those options lxml should have them too. I'm sure the libxml people implemented them for a reason.

Regards -- Marcello Perathoner webmaster@gutenberg.org
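The workaround described in this message might look roughly like the sketch below. It is only an illustration, not the code actually used: the sentinel character U+E000 and the helper name serialize_unminimized are hypothetical choices, and the sketch assumes the sentinel never occurs in the real document.

```python
try:
    from lxml import etree
except ImportError:
    # The stdlib ElementTree offers the same calls used in this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical sentinel: a private-use character assumed absent from the input.
SENTINEL = "\ue000"


def serialize_unminimized(root):
    """Serialize root so that empty elements are written as <tag></tag>."""
    if SENTINEL in etree.tostring(root, encoding="unicode"):
        raise ValueError("sentinel character already occurs in the document")

    # Give every childless, textless element a dummy text node.
    marked = []
    for el in root.iter("*"):
        if len(el) == 0 and not el.text:
            el.text = SENTINEL
            marked.append(el)

    text = etree.tostring(root, encoding="unicode")

    # Restore the tree so that the caller's document is left unchanged.
    for el in marked:
        el.text = None

    # Strip the dummy character from the serialized result.
    return text.replace(SENTINEL, "")


root = etree.fromstring('<p><span style="color: red"/>black</p>')
print(serialize_unminimized(root))
```

Note that this expands every empty element, including ones such as br that an HTML parser expects in the short form, so a real implementation would probably want to skip those.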

Marcello Perathoner, 01.04.2013 15:18:
Ok, but you should still fix your namespace issue.
You can just replace None text values in the tree with an empty string; that's much simpler.

> IMO if libxml has those options lxml should have them too. I'm sure the libxml people implemented them for a reason.

Oh, I'm not questioning that. In fact, I'd be happy to conditionally use the xmlSave*() API of libxml2 for versions that support it (i.e. 2.7.2 and later) and implement a method="xhtml" serialisation with it.

The code that executes the serialisation is here:

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L235

I take patches, but please make sure they work with older libxml2 versions. To that end, add appropriate C preprocessor #defines to etree_defs.h to make it compile (it has various examples already), and use the LIBXML_VERSION macro to see which API to use. Most of the xmlsave API is declared in tree.pxd already.

Please create a pull request on github once you have something to show, so that I can easily review it.

Stefan
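The suggestion in this message to replace None text values with an empty string could be sketched as follows. It relies on lxml writing an element that carries an (empty) text node in the long form; the helper name expand_empty_elements is hypothetical and not part of lxml.

```python
from lxml import etree


def expand_empty_elements(root):
    """Give childless, textless elements an empty text node, in place."""
    # iter("*") visits elements only and skips comments and
    # processing instructions.
    for el in root.iter("*"):
        if len(el) == 0 and el.text is None:
            el.text = ""
    return root


root = etree.fromstring('<p><span style="color: red"/>black</p>')
expand_empty_elements(root)
result = etree.tostring(root, encoding="unicode")
print(result)
```

Unlike the sentinel-character approach, this needs no post-processing of the serialized string, but it does modify the tree, and it also expands elements such as br unless they are filtered out first.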
