[I originally wrote this for StackOverflow, hence all the backticks; I've left them in to delimit strings.]
Not sure if this is a bug or me failing to understand `CDATA`.
I have some Python 2 `unicode` objects I want to include in the text of my XML document; I've been asked to put them into `CDATA`, so I used `element.text = lxml.etree.CDATA(mytext)`. However, `þ` (repr: `u'\xfe'`) and `&#254;` (repr: `u'&#254;'`) seem to produce the same XML output, despite being very different strings:
`<root><![CDATA[&#254;]]></root>`
which lxml parses as `&#254;`.
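To illustrate why identical output matters here, this is a minimal sketch of how a parser treats character references inside and outside `CDATA`. It uses the standard library's `xml.etree.ElementTree` rather than lxml (my substitution, so it says nothing about lxml's serializer), and the variable names are only for illustration:

```python
# Stdlib-only sketch: xml.etree.ElementTree stands in for lxml here.
import xml.etree.ElementTree as ET

# In ordinary text, the parser expands a character reference...
plain = ET.fromstring('<root>&#254;</root>').text

# ...but inside CDATA the same six characters are kept literally.
cdata = ET.fromstring('<root><![CDATA[&#254;]]></root>').text

print(repr(plain))
print(repr(cdata))
```

So once a non-ASCII character inside `CDATA` has been written out as a character reference, the parser has no way to turn it back into the original character.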
Can someone confirm it's a bug, or correct my wrongheadedness about using CDATA?
BTW: `element.text = mytext` (i.e. no CDATA involved) works as I'd expect.
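For comparison, here is a rough sketch of what I mean by "works as I'd expect" for plain text. It again uses the standard library's `xml.etree.ElementTree` instead of lxml (an assumption on my part that the two behave alike for ordinary text), and `serialize_text` is a hypothetical helper, not part of either library:

```python
# Stdlib-only sketch: xml.etree.ElementTree stands in for lxml here.
import xml.etree.ElementTree as ET

def serialize_text(value):
    """Hypothetical helper: put value in <root> as plain text and serialize."""
    root = ET.Element('root')
    root.text = value
    return ET.tostring(root)  # ElementTree defaults to US-ASCII output

thorn = serialize_text(u'\xfe')        # the single character U+00FE
lookalike = serialize_text(u'&#254;')  # six ASCII characters

# The ampersand in the second string is escaped, so the outputs differ
# and each one parses back to the string it came from.
print(thorn)
print(lookalike)
```

Because `&` is escaped as `&amp;` in ordinary text, the two inputs stay distinguishable after serialization, which is exactly what no longer holds in the `CDATA` case.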
>>> lxml.etree.__version__
u'3.2.4'
_test code:_
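import sys
import lxml.etree
# Python 2: argv holds byte strings, so decode to unicode first.
literal = sys.argv[1].decode('utf-8')
print repr(literal)
# Wrap the text in CDATA and serialize (tostring defaults to ASCII output).
literalroot = lxml.etree.fromstring("<root></root>")
literalroot.text = lxml.etree.CDATA(literal)
xml = lxml.etree.tostring(literalroot)
print xml
# Round trip: the original text next to what lxml parses back.
print literal, lxml.etree.fromstring(xml).text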
$ python test.py "þ"
u'\xfe'
<root><![CDATA[&#254;]]></root>
þ &#254;
$ python test.py "&#254;"
u'&#254;'
<root><![CDATA[&#254;]]></root>
&#254; &#254;