problem parsing utf-8 encoded xml - minidom

ashmir.d at ashmir.d at
Fri Jul 4 07:50:29 CEST 2008

I am trying to parse an xml file using the minidom parser.

from xml.dom import minidom
xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)

The parser is failing on this line:

<mrcb245-c>Heinrich Kèufner, Norbert Nedopil, Heinz Schèoch (Hrsg.).</

This is the error message I get:

Traceback (most recent call last):
  File "", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\", line 207, in
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21

It seems to me that the parser is having an issue with the 'è'
character. I have even tried the following to make sure the file is
read as UTF-8:

from xml.dom import minidom
import codecs

xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename,"r","utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)

However, this doesn't work either, and I get the following error:

Traceback (most recent call last):
  File "", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\", line 618, in read
  File "C:\Python25\lib\", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data

I'm assuming here that it is failing at the same place...
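
One way to check that assumption is to decode the raw bytes directly and
look at where the decoder gives up. The following is only a minimal
sketch: it assumes the whole file fits in memory and Python 2.6 or later
(for the "except ... as" syntax), and the helper name find_invalid_utf8
is hypothetical, not part of any library.

```python
def find_invalid_utf8(data, context=20):
    """Return (start, end, snippet) for the first undecodable byte run.

    `data` must be the raw bytes of the file, read in binary mode.
    Returns None if the data decodes cleanly as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start and err.end are byte offsets into `data`.
        lo = max(0, err.start - context)
        hi = min(len(data), err.end + context)
        return err.start, err.end, data[lo:hi]
    return None


# Usage, with the file name from the post:
# fh = open("sample.xml", "rb")
# print(repr(find_invalid_utf8(fh.read())))
# fh.close()
```

Printing the repr of the snippet shows the raw byte values around the
failure. If a single byte such as \xe8 appears where an accented letter
is expected, the file is probably stored in some other encoding (for
example Latin-1) rather than UTF-8, whatever its XML declaration says.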

Can someone please point me in the right direction?

