[Tutor] Encoding and XML troubles

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Nov 8 18:01:27 CET 2006


> Vanilla (this works fine):
> #!/usr/bin/python
>
> from elementtree import ElementTree as etree
>
> eg = """<seuss><fish>red</fish><fish>blue</fish></seuss>"""
>
> xml = etree.fromstring(eg)
>
> If I change the example string to this:
> <seuss><fish>red</fish><fish>blué</fish></seuss>
>
> I get the following error:
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
> column 32)


According to:

     http://mail.python.org/pipermail/xml-sig/2006-May/011513.html

the XML content must itself declare what encoding it uses.  For example:

#####################################################################
>>> text = """<?xml version='1.0' encoding='utf-8'?>
<p>\xed\x95\x98\xeb\xa3\xa8\xeb\x8f\x99\xec\x95\x88 
IDLE\xea\xb0\x80\xec\xa7\x80\xea\xb3\xa0 \xeb\x86\x80\xea\xb8\xb0</p>
"""
#####################################################################

Note that the encoding declaration must be at the top of the document.


Then it's ok to use fromstring() on it:

##################################################
>>> import elementtree.ElementTree
>>> doc = elementtree.ElementTree.fromstring(text)
>>> doc.text
u'\ud558\ub8e8\ub3d9\uc548\nIDLE\uac00\uc9c0\uace0 \ub180\uae30'
##################################################

If I use the wrong encoding declaration, or if I'm missing the declaration 
altogether, then yes, I see the same errors that you see.


> Okay, the default encoding for my program (and thus my example string) 
> is US-ASCII, so I'll use 8859-1 instead, adding this line: # coding: 
> iso-8859-1
>
> I get the same error.  Just for laughs I'll change the encoding to 
> utf-8.  Oops, I get the same error.


The XML encoding has to be explicitly declared as part of the XML 
document text.  It's the difference between:

###########################################################################
>>> text = '<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'
>>> import elementtree.ElementTree
>>> elementtree.ElementTree.fromstring(text)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 960, in XML
     parser.feed(text)
   File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 1242, in feed
     self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 32
##########################################################################

and:

##########################################################
>>> text = '''<?xml version="1.0" encoding="iso-8859-1"?>
... <seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'''
>>> doc = elementtree.ElementTree.fromstring(text)
##########################################################

which does work.
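
For anyone trying this on a current Python, here is a rough sketch of the same comparison.  It assumes Python 3 byte strings and the standard library's xml.etree.ElementTree rather than the standalone elementtree package used above.  The try_parse() helper is just a name made up for this illustration.  Without a declaration the parser assumes UTF-8, and a lone \xe9 byte is not valid UTF-8, which is why the error points at that column.

```python
import xml.etree.ElementTree as ET

# The same document as ISO-8859-1 bytes: 0xE9 is "é" in that encoding.
BODY = b"<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>"


def try_parse(raw):
    """Return the text of the last <fish>, or the parser's error message."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as err:
        return "ParseError: %s" % err
    return root.findall("fish")[-1].text


# No declaration: the parser falls back to UTF-8 and rejects the 0xE9 byte.
no_decl = try_parse(BODY)

# Declaration that matches the actual bytes: parses fine.
right_decl = try_parse(b'<?xml version="1.0" encoding="iso-8859-1"?>\n' + BODY)

# Declaration that does not match the bytes: rejected again.
wrong_decl = try_parse(b'<?xml version="1.0" encoding="utf-8"?>\n' + BODY)

for label, result in [("none", no_decl), ("iso-8859-1", right_decl), ("utf-8", wrong_decl)]:
    print(label, repr(result))
```

Only the middle case should come back with the fish's name; the other two should report a parse error.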

If you're dealing with XML content, make sure that your XML documents have 
that encoding declaration, or else you're bound to run into these kinds of 
errors.
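
If the documents come from somewhere you don't control and you already know their encoding, one possible workaround is to add the declaration yourself before parsing.  This is only a minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree; ensure_declaration() is a hypothetical helper, not part of any library, and its check for an existing declaration is deliberately crude.

```python
import xml.etree.ElementTree as ET


def ensure_declaration(raw, encoding):
    """Prepend an XML declaration naming `encoding` unless one is present.

    Hypothetical helper: `raw` must be bytes that really are in `encoding`.
    """
    if raw.startswith(b"<?xml"):
        # Rough check only: trust whatever the document already declares.
        return raw
    header = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    return header.encode("ascii") + raw


# Bytes in ISO-8859-1 with no declaration of their own.
latin1_bytes = "<seuss><fish>blu\xe9</fish></seuss>".encode("iso-8859-1")

doc = ET.fromstring(ensure_declaration(latin1_bytes, "iso-8859-1"))
print(repr(doc.find("fish").text))
```

This only helps when the encoding you pass in is the one the bytes are actually in; guessing wrong gives either the same parse error or silently wrong characters.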

Good luck!

