[Tutor] Encoding and XML troubles

Wed Nov 8 17:06:04 CET 2006

Thanks for the help thus far.  To recap: when parsing XML, ElementTree
is barfing on extended characters.

1. Yes, most XML is written by monkeys, or the programs written by such
monkeys - tough beans, I cannot make my input XML any cleaner without
pre-processing - I am not generating it.

2. The documentation suggests that the default encoding of ElementTree is
US-ASCII, which is not going to be sufficient.  My XML is explicitly
setting its encoding to 8859-1, and the XML is actually well-formed(!).

3. I muddied the waters by talking about Python code listing encoding,
sorry.
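
For reference on point 2, here is a minimal sketch of the case that I
would expect to work: the document arrives as bytes, carries an XML
declaration naming iso-8859-1, and really is encoded that way.  It
assumes the standard-library xml.etree.ElementTree module rather than
the standalone elementtree package, and the sample document is made up
for illustration.

```python
import xml.etree.ElementTree as etree

# Bytes as they might arrive from a file: the declaration names the
# encoding, and the payload is actually encoded that way.
doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'
).encode("iso-8859-1")

root = etree.fromstring(doc)

# Element text comes back as decoded text, not raw bytes.
fish = [f.text for f in root.findall("fish")]
```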

EXAMPLES:

Vanilla (this works fine):
#!/usr/bin/python

from elementtree import ElementTree as etree

eg = """<seuss><fish>red</fish><fish>blue</fish></seuss>"""

xml = etree.fromstring(eg)

If I change the example string to this:
<seuss><fish>red</fish><fish>blué</fish></seuss>

I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
column 32

Okay, the default encoding for my program (and thus my example string) 
is US-ASCII, so I'll use 8859-1 instead, adding this line:
# coding: iso-8859-1

I get the same error.  Just for laughs I'll change the encoding to
utf-8.  Oops, I get the same error.
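
My best guess at what is going on, as a sketch: the coding line only
tells Python how to read the source file, while the parser decides on
its own how to read the bytes it is handed.  With no XML declaration it
assumes UTF-8, so a lone 0xE9 byte is an invalid token.  The snippet
below assumes the standard-library xml.etree.ElementTree and writes the
character as an escape so the source file's encoding does not matter.

```python
import xml.etree.ElementTree as etree

text = "<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>"

# Same characters, two different byte sequences.
latin1_bytes = text.encode("iso-8859-1")  # e-acute is the single byte 0xE9
utf8_bytes = text.encode("utf-8")         # e-acute is the bytes 0xC3 0xA9

# No XML declaration, so the parser assumes UTF-8 and rejects 0xE9.
try:
    etree.fromstring(latin1_bytes)
    latin1_ok = True
except etree.ParseError:
    latin1_ok = False

# The UTF-8 bytes parse without complaint.
root = etree.fromstring(utf8_bytes)
last_fish = root.findall("fish")[-1].text
```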

Has anyone had any luck getting ElementTree to deal with extended
characters?  If not, has anyone got a suggestion for how to pre-process
the text in the XML so it won't barf?  Thanks.
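
One pre-processing idea, offered as a sketch and not a definitive fix:
if the bytes carry no XML declaration but the real encoding is known,
decode them and re-encode as UTF-8 before parsing.  The helper name
parse_with_known_encoding is hypothetical, and the code again assumes
the standard-library xml.etree.ElementTree.

```python
import xml.etree.ElementTree as etree

def parse_with_known_encoding(raw, encoding="iso-8859-1"):
    """Parse declaration-less XML bytes whose real encoding is known."""
    # Decode with the real encoding, then hand the parser UTF-8, which
    # is what it assumes when no XML declaration is present.
    return etree.fromstring(raw.decode(encoding).encode("utf-8"))

raw = "<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>".encode("iso-8859-1")
root = parse_with_known_encoding(raw)
names = [f.text for f in root.findall("fish")]
```

This only suits input without a declaration; a document that already
declares its encoding correctly should not need it.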
-- 

yours,

William