ASCII decoding error with xml.dom.minidom

Martin von Loewis loewis at informatik.hu-berlin.de
Sat Jun 16 14:44:41 EDT 2001


gustafl at algonet.se (Gustaf Liljegren) writes:

> Still got a problem with encoding/decoding errors when working with 
> xml.dom.minidom. I have run into something I didn't ask for. The DOM module 
> continues to output everything as Unicode strings, even if the file is a 
> typical 'plain text' XML file with an ISO 8859-1 encoding attribute in the 
> XML declaration!

XML files are never plain text; all XML files are Unicode. See the XML
recommendation for details. The file may be represented in some
encoding; a DOM implementation is required to present the contents as
Unicode. See the DOM recommendation for details.
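
A minimal sketch of that behaviour, assuming a current Python 3
interpreter (where str is the Unicode string type) and a made-up
sample document: the bytes on disk are ISO 8859-1, but what the DOM
hands back is Unicode.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, stored as ISO 8859-1 bytes.
data = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<name>\u00c5sa</name>'
).encode("iso-8859-1")

# The parser reads the encoding declaration and decodes the bytes.
doc = parseString(data)

# The text node holds a Unicode string, not Latin-1 bytes.
text = doc.documentElement.firstChild.data
```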

> The input data comes from two directions: one XML file, where the
> input takes the form of Unicode strings as described above, and a
> mailbox file, in Latin 1. Content from these two sources should be
> mixed together in an XML output file.

My guess is that you put byte strings into the DOM tree. You should
not do that; instead, you should convert all strings to Unicode before
putting them into the tree. You can get away with putting byte strings
into the tree only when all their bytes are below 128, i.e. pure ASCII.
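
As a rough illustration, assuming Python 3 and an invented mailbox
value (raw_subject and the element names are hypothetical), the
conversion could look like this: decode the Latin-1 bytes first, then
build the node from the resulting Unicode string.

```python
from xml.dom.minidom import parseString

# Hypothetical byte string, as it might be read from a Latin-1 mailbox.
raw_subject = b"Sm\xf6rg\xe5sbord"

# Decode to Unicode before the value goes anywhere near the DOM.
subject = raw_subject.decode("iso-8859-1")

doc = parseString("<mail/>")
node = doc.createElement("subject")
node.appendChild(doc.createTextNode(subject))
doc.documentElement.appendChild(node)

# Serialising the element gives a Unicode string again.
xml_text = doc.documentElement.toxml()
```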

> Ideally, I'd like the output XML file in Latin 1. I wonder if there's an 
> easy way to decode everything in the DOM object to Latin 1, so that this 
> won't happen?

No, that's not possible. Currently, toxml returns a Unicode string;
it is then the caller's responsibility to encode this as UTF-8 (as
toxml will not have put an encoding declaration into the document).
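
A sketch of that division of labour, assuming Python 3 and a made-up
one-element document: the element is serialised to a Unicode string,
and the caller encodes it. UTF-8 is the safe choice here because a
document without an encoding declaration is read as UTF-8 by default.

```python
from xml.dom.minidom import parseString

# Hypothetical one-element document containing a non-ASCII character.
doc = parseString("<name>\u00c5sa</name>")

# Serialise the element; the result is a Unicode string.
unicode_xml = doc.documentElement.toxml()

# The caller picks the byte encoding when writing the output.
utf8_bytes = unicode_xml.encode("utf-8")

# Parsing the bytes back recovers the same text.
round_trip = parseString(utf8_bytes).documentElement.firstChild.data
```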

toxml should probably be extended to support various output
encodings. Even if it does, the DOM tree still must contain Unicode
strings only.
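
For comparison, the minidom shipped with current Python 3 releases
does accept an encoding argument to toxml. The sketch below assumes
such an interpreter and an invented sample document; note that the
tree itself still holds Unicode, and only the serialised bytes are
Latin-1.

```python
from xml.dom.minidom import parseString

# Hypothetical document; the tree stores the text as Unicode.
doc = parseString("<name>\u00c5sa</name>")
tree_text = doc.documentElement.firstChild.data

# Asking for an output encoding yields bytes, with a matching
# encoding declaration written into the XML declaration.
latin1_bytes = doc.toxml(encoding="iso-8859-1")
```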

Regards,
Martin



