no xml_declaration for unicode?
Hi, I'm seeing strange behavior when I try to use the 'xml_declaration' parameter with the unicode encoding:
    from lxml import etree
    root = etree.Element("root")
    res = etree.tostring(root, encoding="unicode", xml_declaration=True)
raises:

    Traceback (most recent call last):
      File "<pyshell#1>", line 4, in <module>
        res = etree.tostring(root, encoding="unicode", xml_declaration=True)
      File "lxml.etree.pyx", line 2838, in lxml.etree.tostring (src/lxml/lxml.etree.c:53452)
    ValueError: Serialisation to unicode must not request an XML declaration

I don't understand why it is not allowed to control the XML declaration if I'm using unicode.

My environment is lxml 2.3.0 on CPython 2.7.1 (WinXP 32bit).

Thanks in advance,
Daniel
What's the use case?

An XML declaration in unicode serialisation doesn't make much sense. In lxml, xml_declaration=True basically controls the header encoding information(*). If you serialize to unicode, what would that be? The only sane answer might be the encoding used by Python internally to represent unicode objects, but that may differ between Python interpreters, making this a potential portability problem.

(*) At a superficial glance, lxml does not currently support setting an XML declaration with a version attribute only, probably because libxml2 implements XML 1.0.

Holger
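To make the behaviour discussed here concrete, the following is a minimal sketch of the three cases: serialising to bytes with a declaration, serialising to unicode without one, and the rejected combination. It assumes a reasonably recent lxml on Python 3; the variable names are only illustrative.

```python
from lxml import etree

root = etree.Element("root")

# Serialising to bytes: the declaration can name the byte encoding.
as_bytes = etree.tostring(root, encoding="utf-8", xml_declaration=True)

# Serialising to unicode: the result is text, and no declaration is written.
as_text = etree.tostring(root, encoding="unicode")

# Asking for both at once is rejected with a ValueError.
error = None
try:
    etree.tostring(root, encoding="unicode", xml_declaration=True)
except ValueError as exc:
    error = exc
```

Here `as_bytes` is a byte string that starts with the declaration, `as_text` is just the serialised element, and `error` holds the exception from the original report.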
Hi Holger,

there is no special use case, and it's technically not required at this point. I simply prefer a formal declaration for my XML files and haven't understood why I can't do that that way.

If the encoding parameter is 'utf-8', 'xml_declaration' is allowed and works as expected. What is not really straightforward is the returned Python 2.x 'str' type. Why can I use 'xml_declaration' if the encoding parameter is 'utf-8', but if I choose 'unicode' it doesn't work? That's a little bit confusing...

Daniel
why I can't do that that way. If the encoding parameter is 'utf-8' the 'xml_declaration' is allowed and works as expected. But not really straight forward is the returned python 2.x 'str' type. Why can I use 'xml_declaration' if the encoding parameter is 'utf-8', but if I choose 'unicode' it doesn't work?
It's maybe debatable whether an XML declaration *without* an encoding attribute could/should be produced for unicode serialisation, for the sake of not having to make a distinction for different encoding attributes. I really can't see why you'd ever want to mix e.g. 'utf-8' and 'unicode', though.

unicode is *not* an encoding. Strictly speaking, the use of 'unicode' as the encoding attribute is slightly misleading imho, and indeed there is still the deprecated tounicode() function (at least in lxml 2.2.6, not sure about 2.3).

It fits best for my brain to think of unicode not as an encoding but as a logical concept. A Uni*code* is first of all just a code point that represents a character as a number. Now, when you serialize your XML document you encode it to bytes - which happen to be represented by the str type in Python < 3.x. Of course, to decode this bytes buffer you'll have to know which encoding it was encoded in, so the encoding information in the XML declaration is a helper for your parser to correctly interpret encoded byte buffers.

Serializing to unicode, on the other hand, does not produce a bytes buffer but a unicode object. There is no need or sense to decode this.

Holger
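The distinction between code points and encoded bytes can be illustrated without lxml at all. This is a small sketch in plain Python (the sample string is made up for illustration): the same text yields different byte buffers depending on the encoding, which is exactly the information an encoding declaration carries.

```python
# A text object: a sequence of code points, with no byte encoding attached.
text = u"<root>\u00e4</root>"

# Encoding turns the code points into bytes; the result depends on the encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# Decoding only recovers the text if the right encoding is known.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_latin1 = latin1_bytes.decode("iso-8859-1")
```

The two byte buffers differ, yet both decode back to the same text when the matching encoding is used; the text object itself needs no such hint.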
info@damiro.net, 17.03.2011 16:26:
If the encoding parameter is 'utf-8' the 'xml_declaration' is allowed and works as expected. But not really straight forward is the returned python 2.x 'str' type.
Remember that XML is specified as a sequence of bytes. UTF-8 is an encoding, so it returns bytes (which in Py2 are represented as the 'str' type).

Stefan
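If a text object that still carries a declaration is really wanted, one possible workaround (a sketch, not something recommended anywhere in this thread) is to serialise to bytes first and decode afterwards. The caveat is that the declaration then only stays truthful if the text is later written out as UTF-8 again.

```python
from lxml import etree

root = etree.Element("root")

# Serialise to UTF-8 bytes, including the declaration.
data = etree.tostring(root, encoding="utf-8", xml_declaration=True)

# Decode with the same encoding to get a text object.
# Assumption: whoever writes `text` out later must encode it as UTF-8,
# otherwise the declared encoding no longer matches the actual bytes.
text = data.decode("utf-8")
```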
jholg@gmx.de, 17.03.2011 13:16:
XML declaration in unicode serialisation doesn't make much sense.
There's one exception: the "standalone" declaration. Everything else would have to stick to the default values anyway (i.e. no "encoding" declaration, "version" 1.0).
(*) At a superficial glance lxml does currently not support setting an XML declaration with version attribute only, probably because libxml2 implements XML 1.0.
Actually, lxml writes out the declaration itself, so this can be enabled quite easily.

Stefan
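For reference, the "standalone" declaration mentioned above can be requested when serialising to bytes. This sketch assumes an lxml version whose tostring() accepts the `standalone` keyword (current releases do; it has not been checked against 2.3.0). How `standalone` interacts with unicode serialisation is not shown here.

```python
from lxml import etree

root = etree.Element("root")

# Request a standalone declaration together with a byte encoding.
out = etree.tostring(root, encoding="utf-8", standalone=True)
```

With a current lxml, `out` begins with an XML declaration that includes a standalone pseudo-attribute, followed by the serialised element.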
participants (3): info@damiro.net, jholg@gmx.de, Stefan Behnel