Dave McKee, 19.11.2013 19:12:
Not sure if this is a bug or me failing to understand `CDATA`.
I have some Python 2 `unicode` objects I want to include in the text of my XML document.
I assume that you want to include the text content of these objects and not the objects themselves.
I've been asked to put them into `CDATA`
Just for fun? Or was there an actual reason for that? I wouldn't do that. It just makes things unnecessarily complicated. CDATA has a very limited range of usefulness. Storing text is not in that range.
so I used `element.text = lxml.etree.CDATA(mytext)`. However, `þ` (repr: `u'\xfe'`) and `&#254;` (repr: `u'&#254;'`) seem to produce the same XML output, despite being very different strings:
`<root><![CDATA[&#254;]]></root>`
which lxml parses as `&#254;`.
Can someone confirm it's a bug, or correct my wrongheadedness about using CDATA?
It's because you are serialising to an ASCII-only representation of the document, which must then escape the non-ASCII content. It might be a bug that the escaping doesn't result in an error, because it's clearly irreversible. On the other hand, CDATA shouldn't be used lightly. You can pass `encoding='utf8'` into `tostring()` to serialise to UTF-8 instead.
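A minimal sketch of the difference, assuming a reasonably recent lxml; the variable names are made up for illustration. A CDATA section cannot carry character references, so anything like `&#254;` written inside it by an ASCII serialisation comes back as literal text, whereas the UTF-8 serialisation writes the character itself and survives a round trip:

```python
from lxml import etree

root = etree.Element("root")
root.text = etree.CDATA(u"\xfe")  # the single character U+00FE

# Default serialisation is ASCII-only, so the non-ASCII character cannot be
# written as-is. Inside CDATA an escape is just literal text for a parser.
ascii_bytes = etree.tostring(root)

# Serialising to UTF-8 writes the character itself as two bytes.
utf8_bytes = etree.tostring(root, encoding="utf-8")

# strip_cdata=False keeps CDATA sections as such when parsing;
# it does not change the text content that comes back.
parser = etree.XMLParser(strip_cdata=False)
from_ascii = etree.fromstring(ascii_bytes, parser).text
from_utf8 = etree.fromstring(utf8_bytes, parser).text

print(repr(ascii_bytes), repr(from_ascii))
print(repr(utf8_bytes), repr(from_utf8))
```

Comparing `from_ascii` and `from_utf8` against the original string shows which serialisation is reversible.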
BTW: `element.text = mytext` (i.e. no CDATA involved) works as I'd expect.
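For comparison, a small sketch of the plain-text case (names are illustrative only). Without CDATA, the default ASCII serialisation writes a real character reference, which any XML parser resolves back to the original character:

```python
from lxml import etree

root = etree.Element("root")
root.text = u"\xfe"  # plain text, no CDATA wrapper

# ASCII serialisation escapes the character as a numeric character reference.
serialised = etree.tostring(root)  # b'<root>&#254;</root>'

# Parsing resolves the reference, so the round trip is lossless.
round_tripped = etree.fromstring(serialised).text

print(repr(serialised), repr(round_tripped))
```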
Seems both simpler and more correct to me.

Stefan