problem parsing utf-8 encoded xml - minidom

ashmir.d at gmail.com
Fri Jul 4 03:28:27 EDT 2008


On Jul 4, 2:36 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > The parser is failing on this line:
>
> > <mrcb245-c>Heinrich Kèufner, Norbert Nedopil, Heinz Schèoch (Hrsg.).</
> > mrcb245-c>
>
> If it is literally this line, it's no surprise: there must not be a line
> break between the slash and the closing element name.
>
> However, since you are getting the error in a different column, it's
> indeed more likely that there is a problem with the encoding.
>
> Given that the Python UTF-8 codec refuses the data, most likely, the
> data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
> need to prefix the XML document with a proper XML declaration, such
> as
>
> <?xml version="1.0" encoding="iso-8859-1"?>
>
> Alternatively, make sure that the file is really encoded in UTF-8.
>
> Regards,
> Martin


There is no line break in the xml file. It was just a formatting issue
on this forum.

However, you were right about the encoding: the file is not UTF-8.
The XML file is autogenerated by a different script, so that is
probably where it goes wrong.
The parser works fine if I change the first line to
<?xml version="1.0" encoding="iso-8859-1"?>
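
For anyone who finds this thread later, here is a minimal sketch of the
failure and the fix. It assumes Python 3 and a made-up one-element
document: the element name comes from the line quoted above, but the
text content and the helper name parse_or_none are only for
illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample data: Latin-1 bytes with no XML declaration.
# Without a declaration the parser assumes UTF-8.
latin1_body = "<mrcb245-c>Heinz Sch\u00f6ch</mrcb245-c>".encode("iso-8859-1")

# The same bytes, prefixed with a declaration naming the real encoding.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_body


def parse_or_none(data):
    """Return a minidom Document, or None if expat rejects the input."""
    try:
        return minidom.parseString(data)
    except ExpatError:
        return None


# The lone 0xF6 byte is not valid UTF-8, so this parse is rejected.
print(parse_or_none(latin1_body))

# With the declaration, the text is decoded as Latin-1.
doc = parse_or_none(declared)
print(doc.documentElement.firstChild.data)
```

The other option Martin mentioned, making the file really UTF-8, would
mean re-encoding the data (or fixing the generating script) rather than
changing the declaration.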

Thank you very much



More information about the Python-list mailing list