Mailman 3 lxml.etree.CDATA - different unicode strings map to same XML - is this expected behaviour? - lxml - The Python XML Toolkit

Nov. 19, 2013

[I originally wrote this for StackOverflow, hence all the backticks; I've
left them in to delimit strings.]

Not sure if this is a bug or me failing to understand `CDATA`.

I have some Python 2 `unicode` objects I want to include in the text of my
XML document; I've been asked to put them into `CDATA`, so I used
`element.text = lxml.etree.CDATA(mytext)`.  However, `þ` (repr: `u'\xfe'`)
and `&#254;` (repr: `u'&#254;'`) seem to produce the same XML output,
despite being very different strings:

`<root><![CDATA[&#254;]]></root>`

which lxml parses as `&#254;`.

Can someone confirm it's a bug, or correct my wrongheadedness about using
CDATA?
BTW: `element.text = mytext` (i.e. no CDATA involved) works as I'd expect.
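
To illustrate why the plain-text case stays unambiguous, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml and is written for Python 3, so it is an analogy rather than a reproduction of the lxml run; `roundtrip_plain_text` is a hypothetical helper name. In ordinary element text the serializer escapes `&`, so the two strings end up as different XML.

```python
import xml.etree.ElementTree as ET


def roundtrip_plain_text(text):
    """Store text as ordinary element text, serialize to ASCII, parse it back."""
    root = ET.Element("root")
    root.text = text
    xml = ET.tostring(root)  # bytes; ElementTree defaults to us-ascii here
    return xml, ET.fromstring(xml).text


for s in ("\xfe", "&#254;"):
    xml, parsed = roundtrip_plain_text(s)
    # The literal ampersand is written as &amp;, so the outputs differ
    # and each string survives the round trip unchanged.
    print(repr(s), xml, parsed == s)
```

Inside a CDATA section no such escaping is available, since the content is taken literally by the parser.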

    >>> lxml.etree.__version__
    u'3.2.4'

_test code:_

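    import sys
    import lxml.etree

    # Python 2: the command-line argument is a byte string; decode to unicode
    literal = sys.argv[1].decode('utf-8')
    print repr(literal)

    literalroot = lxml.etree.fromstring("<root></root>")
    literalroot.text = lxml.etree.CDATA(literal)
    # tostring() serializes to ASCII by default
    xml = lxml.etree.tostring(literalroot)
    print xml
    # original string, followed by what parsing the XML gives back
    print literal, lxml.etree.fromstring(xml).text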

    $ python test.py "þ"
    u'\xfe'
    <root><![CDATA[&#254;]]></root>
    þ &#254;

    $ python test.py "&#254;"
    u'&#254;'
    <root><![CDATA[&#254;]]></root>
    &#254; &#254;
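
The collision can be modelled without lxml. The sketch below is a model, not lxml's actual code path. It assumes the serializer writes CDATA content unescaped and then substitutes a numeric character reference for anything the ASCII output encoding cannot represent. `cdata_document` is a hypothetical helper, and the code is Python 3 using only the standard library.

```python
import xml.etree.ElementTree as ET


def cdata_document(text, encoding="ascii"):
    """Wrap text in a CDATA section and encode the document.

    Simplification: assumes text does not contain "]]>".
    Characters the target encoding cannot represent become numeric
    character references, which a parser will NOT decode inside CDATA.
    """
    doc = "<root><![CDATA[" + text + "]]></root>"
    return doc.encode(encoding, "xmlcharrefreplace")


ascii_a = cdata_document("\xfe")
ascii_b = cdata_document("&#254;")
print(ascii_a == ascii_b)           # both inputs yield the same bytes
print(ET.fromstring(ascii_a).text)  # the reference comes back as literal text

# With an encoding that can represent the character, no substitution
# happens and the two inputs stay distinct.
utf8_a = cdata_document("\xfe", "utf-8")
utf8_b = cdata_document("&#254;", "utf-8")
print(utf8_a != utf8_b, ET.fromstring(utf8_a).text == "\xfe")
```

If lxml follows this model, passing an explicit `encoding='utf-8'` to `lxml.etree.tostring` may be enough to keep the two strings apart, though that is untested here.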

Participants (2): Dave McKee, Stefan Behnel