[Tutor] Encoding and XML troubles
William O'Higgins Witteman
hmm at woolgathering.cx
Wed Nov 8 17:06:04 CET 2006
Thanks for the help thus far. To recap: when parsing XML, ElementTree
is barfing on extended characters.
1. Yes, most XML is written by monkeys, or the programs written by such
monkeys - tough beans, I cannot make my input XML any cleaner without
pre-processing - I am not generating it.
2. The documentation suggests that the default encoding of ElementTree is
US-ASCII, which is not going to be sufficient. My XML is explicitly
setting its encoding to 8859-1, and the XML is actually well-formed(!).
3. I muddied the waters by talking about the encoding of the Python
source listing, sorry.
EXAMPLES:
Vanilla (this works fine):
#!/usr/bin/python
from elementtree import ElementTree as etree
eg = """<seuss><fish>red</fish><fish>blue</fish></seuss>"""
xml = etree.fromstring(eg)
If I change the example string to this:
<seuss><fish>red</fish><fish>blué</fish></seuss>
I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 32
Okay, the default encoding for my program (and thus my example string)
is US-ASCII, so I'll use 8859-1 instead, adding this line:
# coding: iso-8859-1
I get the same error. Just for laughs I'll change the encoding to
utf-8. Oops, I get the same error.
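A likely explanation, offered as a sketch rather than a diagnosis: the "# coding:" line only tells Python how to read the source file, while the parser looks at the bytes it is handed and, absent an XML declaration, assumes UTF-8. The example string above has no declaration, so a Latin-1 byte for the accented character is rejected. The code below assumes the standard-library xml.etree.ElementTree on Python 3 (not the standalone elementtree package used above), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET  # assumption: stdlib module, Python 3

body = "<seuss><fish>red</fish><fish>blu\u00e9</fish></seuss>"

# Latin-1 bytes with no XML declaration: the parser assumes UTF-8,
# and the lone byte 0xE9 is not valid UTF-8.
raw_undeclared = body.encode("iso-8859-1")
try:
    ET.fromstring(raw_undeclared)
    undeclared_error = None
except ET.ParseError as err:
    undeclared_error = err

# The same bytes, preceded by a declaration naming their encoding.
raw_declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + raw_undeclared
root = ET.fromstring(raw_declared)
last_fish = root.findall("fish")[-1].text  # decoded to a Unicode string

print("without declaration:", undeclared_error)
print("with declaration:", repr(last_fish))
```

If this reading is right, the real input files (which do carry the 8859-1 declaration) should parse when read as raw bytes, and only the hand-typed test string is failing.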
Has anyone had any luck getting ElementTree to deal with extended
characters? If not, has anyone got a suggestion for how to pre-process
the text in the XML so it won't barf? Thanks.
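On the pre-processing question, two options can be sketched for input whose bytes are known to be ISO-8859-1 but which cannot be relied on to say so. This assumes the standard-library xml.etree.ElementTree on Python 3; the names raw, root_a and root_b are hypothetical, and neither option is claimed to be what the elementtree package of the time supported.

```python
import xml.etree.ElementTree as ET  # assumption: stdlib module, Python 3

# Stand-in for bytes read from a file: Latin-1, no XML declaration.
raw = "<seuss><fish>red</fish><fish>blu\u00e9</fish></seuss>".encode("iso-8859-1")

# Option A: tell the parser which encoding to use, overriding
# whatever the document does or does not declare.
parser = ET.XMLParser(encoding="iso-8859-1")
root_a = ET.fromstring(raw, parser=parser)

# Option B: decode the bytes yourself and parse the resulting text.
root_b = ET.fromstring(raw.decode("iso-8859-1"))

for root in (root_a, root_b):
    print([fish.text for fish in root.findall("fish")])
```

Option A leaves the input untouched; Option B is the more general pre-processing step, since the decode call is also the place to handle or replace bytes that turn out not to be valid in the assumed encoding.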
--
yours,
William