minidom and unicode errors
abhimanyu.seth at gmail.com
Tue Mar 7 07:24:39 CET 2006
On 3/7/06, Abhimanyu Seth <abhimanyu.seth at gmail.com> wrote:
> On 3/7/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> > Abhimanyu Seth wrote:
> > I'm trying to parse and modify an XML document using the xml.dom.minidom module
> > and Python 2.4.2
> > >> from xml.dom import minidom
> > >> dom = minidom.parse ("c:/test.txt")
> > If the XML file contains a non-ASCII character, then I get a parse error.
> > I have the following line in my xml file:
> > <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception
> > lautet: %1.</target>
> > ExpatError: not well-formed (invalid token): line 8, column 27
> > If I remove the ö character, then it works fine. I'm guessing this has to do
> > with the default encoding, which is ASCII. I guess I can change the default encoding
> > by modifying a file on my machine that the interpreter reads while starting up,
> > but then how do I get my program to work on different machines?
> The default encoding for XML is UTF-8. If you're using any other encoding
> in your XML file, you have to specify that in the file itself, by putting an
> <?xml?> construct at the top of the file, e.g.
> <?xml version="1.0" encoding="ISO-8859-1"?>
> ... rest of XML file follows ...
> > Also, while writing such a special character to the file, I get an error:
> > >> document.writexml (file (myFile, "w"), encoding='utf-8')
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position
> > 16: ordinal not in range(128)
> Not sure; maybe you've added byte strings (encoded strings instead of Unicode
> strings) to the document, or maybe there's a bug in minidom. What happens if
> you remove the encoding argument? If you still get the same error after
> that, make sure you use only Unicode strings when you add stuff to the document.
> Hope this helps!
> I've specified utf-8 in the xml header
> <?xml version="1.0" encoding="utf-8"?>
> In writexml (), even without specifying the encoding, I get the same
> error. That's why I tried manually specifying the encoding.
> But I managed to find a workaround.
> I got some clues from http://evanjones.ca/python-utf8.html
> According to the site,
> import codecs
> fileObj = codecs.open( "someFile", "r", "utf-8" )
> u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
> should return me a unicode string. But I still get an error.
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 407-410:
> invalid data
> I can't figure out why! Why can't it parse the ö character as Unicode?
> >> f = codecs.open ("c:/test.txt", "r", "latin-1")
> >> dom = minidom.parseString (codecs.encode (f.read(), "utf-8"))
> works. But then I dunno if this will work for Chinese or other Unicode characters.
> How do I make my code read Unicode files?
> Also, while writing the xml file, I now use codecs.open ()
> >> document.writexml (codecs.open (mFile, "w", "utf-8"), encoding="utf-8")
> IMHO, writexml should be taking care of this, instead of me having to use
> codecs. I guess this is a bug.
Actually, it doesn't work. I don't get any errors, but it doesn't write the
special characters. It's converted them to some gibberish.
ö has become Ã¶.
Now I'm stumped!
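
The symptoms in this thread are consistent with a file whose bytes are
Latin-1 while its header declares utf-8, and with correct UTF-8 output being
displayed by a viewer that assumes Latin-1. That reading is an assumption,
not something the messages above confirm. Here is a minimal sketch of both
effects, written for Python 3 rather than the Python 2.4.2 used in the
thread; the helper names build_xml and parses are hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

TEXT = "Exception beim Löschen des Audit-Moduls aufgetreten."


def build_xml(declared, actual):
    """Return XML bytes whose header says `declared` but whose bytes use `actual`."""
    doc = '<?xml version="1.0" encoding="%s"?><target>%s</target>' % (declared, TEXT)
    return doc.encode(actual)


def parses(data):
    """Report whether minidom accepts the given bytes."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False


# Header claims utf-8 but the ö is a single Latin-1 byte: expat rejects it.
mismatched = parses(build_xml("utf-8", "latin-1"))

# Header and bytes agree, so the same text parses.
matched = parses(build_xml("ISO-8859-1", "latin-1"))

# Serialising to UTF-8 gives correct bytes (ö becomes the two bytes C3 B6).
dom = minidom.parseString(build_xml("ISO-8859-1", "latin-1"))
utf8_bytes = dom.toxml(encoding="utf-8")

# A viewer that assumes Latin-1 shows those two bytes as two characters.
as_seen_in_latin1_viewer = utf8_bytes.decode("latin-1")
```

If this is what happened, the written file may already be valid UTF-8, and
reopening it in an editor set to UTF-8 would show the ö intact. The parse
step still needs the header to match the real byte encoding of the file.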