lxml.etree.CDATA - different unicode strings map to same XML - is this expected behaviour?

[I originally wrote this for StackOverflow, hence all the backticks; I've left them in to delimit strings.]

Not sure if this is a bug or me failing to understand `CDATA`.

I have some Python 2 `unicode` objects I want to include in the text of my XML document; I've been asked to put them into `CDATA`, so I used `element.text = lxml.etree.CDATA(mytext)`.

However, `þ` (repr: `u'\xfe'`) and `&#254;` (repr: `u'&#254;'`) seem to produce the same XML output, despite being very different strings: `<root><![CDATA[&#254;]]></root>`, which lxml parses as `&#254;`.

Can someone confirm it's a bug, or correct my wrongheadedness about using CDATA?

BTW: `element.text = mytext` (i.e. no CDATA involved) works as I'd expect.

    >>> lxml.etree.__version__
    u'3.2.4'

_test code:_

    import lxml.etree
    import sys

    literal=sys.argv[1].decode('utf-8')
    print repr(literal)
    literalroot=lxml.etree.fromstring("<root></root>")
    literalroot.text = lxml.etree.CDATA(literal)
    xml = lxml.etree.tostring(literalroot)
    print xml
    print literal, lxml.etree.fromstring(xml).text

    $ python test.py "þ"
    u'\xfe'
    <root><![CDATA[&#254;]]></root>
    þ &#254;

    $ python test.py "&#254;"
    u'&#254;'
    <root><![CDATA[&#254;]]></root>
    &#254; &#254;

Dave McKee, 19.11.2013 19:12:
I assume that you want to include the text content of these objects and not the objects themselves.
> I've been asked to put them into `CDATA`
Just for fun? Or was there an actual reason for that? I wouldn't do that. It just makes things unnecessarily complicated. CDATA has a very limited range of usefulness. Storing text is not in that range.
It's because you are serialising to an ASCII-only representation of the document, which must then escape the non-ASCII content. It might be a bug that the escaping doesn't result in an error, because it's clearly irreversible. OTOH, CDATA shouldn't be used lightly. You can pass encoding='utf8' into tostring() to serialise to UTF-8 instead.
> BTW: `element.text = mytext` (i.e. no CDATA involved) works as I'd expect.
Seems both simpler and more correct to me.
Stefan
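
The two suggestions above (plain text instead of CDATA, or a UTF-8 serialisation) can be sketched as follows. This is only an illustration under a few assumptions: the helper name `roundtrip` is made up for this example, the code is written to run on Python 3 rather than the Python 2 of the original post, and `encoding=None` is assumed to mean lxml's default ASCII output. The ASCII-plus-CDATA case from the original report is deliberately not asserted here, since its behaviour may depend on the lxml and libxml2 versions in use.

```python
from lxml import etree


def roundtrip(text, use_cdata, encoding=None):
    """Serialise <root> containing `text`, then parse the result back.

    `roundtrip` is a hypothetical helper for this sketch, not lxml API.
    encoding=None keeps lxml's default (ASCII) serialisation.
    """
    root = etree.Element("root")
    root.text = etree.CDATA(text) if use_cdata else text
    xml = etree.tostring(root, encoding=encoding)
    return xml, etree.fromstring(xml).text


thorn = u"\xfe"        # the single character "thorn"
reference = u"&#254;"  # six characters that only look like a character reference

# Plain text: '&' is escaped on output, so both strings survive ASCII output.
plain_results = [roundtrip(t, use_cdata=False) for t in (thorn, reference)]

# CDATA with UTF-8 output: no character references are needed for thorn.
cdata_results = [roundtrip(t, use_cdata=True, encoding="utf-8")
                 for t in (thorn, reference)]

for xml, parsed in plain_results + cdata_results:
    print(repr(xml), repr(parsed))
```

In both variants the two input strings serialise to different bytes and parse back to the original text, which is the property the ASCII CDATA output in the report lacks.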

participants (2): Dave McKee, Stefan Behnel