ElementTree, XML and Unicode -- C0 Controls
Sébastien Boisgérault
Sebastien.Boisgerault at gmail.com
Mon Dec 11 10:24:43 EST 2006
Hi all,
The Unicode code points in the U+0000-U+001F range --
except tab, newline and carriage return -- are not legal
XML 1.0 characters.
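To make the rule concrete, here is a minimal sketch of a check for those code points. The name `find_illegal_c0` is hypothetical and not part of ElementTree, and the pattern covers only the C0 range discussed here; XML 1.0 also excludes a few other code points (surrogates, U+FFFE, U+FFFF) that this sketch ignores.

```python
import re

# C0 controls other than tab (U+0009), newline (U+000A) and
# carriage return (U+000D). Only this range is covered; this is an
# illustrative helper, not an ElementTree API.
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_c0(text):
    """Return the index of the first illegal C0 control in text, or -1."""
    match = _ILLEGAL_C0.search(text)
    return match.start() if match else -1
```

A string such as "a\tb\n" passes the check, while any string containing U+0000 is reported at the position of that character.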
Attempts to serialize and then deserialize such strings
with ElementTree fail at the parsing step:
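>>> # Assumed import (Python 2.5+); the standalone package uses elementtree.ElementTree
>>> from xml.etree.ElementTree import Element, tostring, fromstring
>>> elt = Element("root", char=u"\u0000")
>>> xml = tostring(elt)
>>> xml
'<root char="\x00" />'
>>> fromstring(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12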
Good! But I was expecting a failure *earlier*, in
the "tostring" function -- I basically assumed that
ElementTree would refuse to generate an XML
fragment that is not well-formed.
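For illustration, the earlier failure I had in mind could be approximated by a wrapper around "tostring". This is only a sketch under assumptions: `strict_tostring` is a hypothetical name, not an ElementTree function, it is written for a current Python 3 `xml.etree.ElementTree`, and it checks text, tails and attribute values but not tag or attribute names.

```python
import re
from xml.etree.ElementTree import Element, tostring

# C0 controls that XML 1.0 forbids (everything below U+0020 except
# tab, newline and carriage return).
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strict_tostring(element):
    """Serialize element, raising ValueError on an illegal C0 control.

    Hypothetical helper: walks the tree and inspects text, tail and
    attribute values before delegating to tostring().
    """
    for node in element.iter():
        values = [node.text, node.tail]
        values.extend(node.attrib.values())
        for value in values:
            if value is not None and _ILLEGAL_C0.search(value):
                raise ValueError(
                    "illegal XML 1.0 character in <%s>: %r" % (node.tag, value)
                )
    return tostring(element)
```

With this wrapper, `strict_tostring(Element("root", char="\x00"))` raises ValueError before any output is produced, at the cost of one extra pass over every string in the tree.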
Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
the search for invalid Unicode code points being
too expensive?
Cheers,
SB