On 3/7/06, <b class="gmail_sendername">Fredrik Lundh</b> <<a href="mailto:fredrik@pythonware.com">fredrik@pythonware.com</a>> wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Abhimanyu Seth wrote:<br><br>> I'm trying to parse and modify an XML document using xml.dom.minidom module<br>> and Python 2.4.2<br>><br>> >> from xml.dom import minidom<br>> >> dom = minidom.parse

 ("c:/test.txt")<br>><br>> If the xml file contains a non-ascii character, then i get a parse error.<br>> I have the following line in my xml file:<br>> <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception Stack

<br>> lautet: %1.</target><br>> ExpatError: not well-formed (invalid token): line 8, column 27<br>><br>> If I remove the ö character, then it works fine. I'm guessing this has to do<br>> with the default encoding which is ascii. I guess i can change the encoding

<br>> by modifying a file on my machine that the interpretter reads while loading,<br>> but then how do I get my program to work on different machines?<br><br>the default encoding for XML is UTF-8.  If you're using any other encoding

<br>in your XML file, you have to specify that in the file itself, by putting an<br><?xml?> construct at the top of the file.  e.g.<br><br>    <?xml version="1.0" encoding="ISO-8859-1"?><br>

    ... rest of XML file follows ...<br><br>> Also, while writing such a special character to the file, I get an error.<br>> >> document.writexml (file (myFile, "w"), encoding='utf-8')<br>><br>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position

<br>> 16: ordinal not in range(128)<br><br>not sure; maybe you've added byte strings (encoded strings instead of Unicode<br>strings) to the document, or maybe there's a bug in minidom.  What happens if<br>you remove the encoding argument?  If you still get the same error after doing

<br>that, make sure you use only Unicode strings when you add stuff to the document.<br><br>hope this helps!<br><br></F><br><br><br><br><br><br>--<br><a href="http://mail.python.org/mailman/listinfo/python-list">http://mail.python.org/mailman/listinfo/python-list

</a> </blockquote></div> I've specified UTF-8 in the XML header:<br><?xml version="1.0" encoding="utf-8"?><br>In writexml (), even without specifying the encoding, I get the same error. That's why I tried manually specifying the encoding.

<br><br>But I managed to find a workaround.<br>I got some clues from <a href="http://evanjones.ca/python-utf8.html">http://evanjones.ca/python-utf8.html</a><br><br>According to the site,<br><pre>import codecs<br>fileObj = codecs.open( "someFile", "r", "utf-8" )<br>u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file</pre>should give me a Unicode string. But I still get an error:<br>

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 407-410: invalid data<br><br>I can't figure out why! Why can't it parse the ö character as Unicode?<br><br>Anyway, <br>>> f = codecs.open ("c:/test.txt", "r", "latin-1")

<br>>> dom = minidom.parseString (codecs.encode (f.read(), "utf-8"))<br><br>works. But then I don't know whether this will work for Chinese or other Unicode characters.<br>How do I make my code read Unicode files?<br>
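<br>A possible explanation, sketched here in Python 3 syntax (an assumption; the session above is Python 2.4): reading the file as latin-1 works and reading it as utf-8 fails, which suggests the bytes on disk are Latin-1 even though the header says utf-8. In Latin-1, ö is the single byte 0xF6, which is not valid UTF-8 on its own. The sample strings and the helper names parse_bytes and target_text are made up for illustration. The idea is to hand minidom the raw bytes and let the parser apply whatever encoding the header declares, so the same code should cope with Chinese text as long as the header matches how the file was really saved.<br>

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample data: similar documents saved in two different encodings.
TEXT = "Exception beim L\u00f6schen des Audit-Moduls aufgetreten."  # \u00f6 is ö
latin1_body = ("<target>%s</target>" % TEXT).encode("latin-1")
utf8_body = ("<target>%s \u4e2d\u6587</target>" % TEXT).encode("utf-8")  # plus two Chinese characters


def parse_bytes(declared, body):
    """Parse raw bytes, letting the XML parser apply the declared encoding."""
    header = ('<?xml version="1.0" encoding="%s"?>' % declared).encode("ascii")
    return minidom.parseString(header + body)


def target_text(dom):
    """Return the text inside the root element."""
    return dom.documentElement.firstChild.data


# Header says utf-8 but the bytes are Latin-1: the lone 0xF6 byte is rejected.
try:
    parse_bytes("utf-8", latin1_body)
    mismatch_failed = False
except ExpatError:
    mismatch_failed = True

# Header matches the bytes: both documents parse without any codecs step.
latin1_text = target_text(parse_bytes("ISO-8859-1", latin1_body))
utf8_text = target_text(parse_bytes("utf-8", utf8_body))
```

If this guess is right, either re-saving the file as real UTF-8 or changing the header to ISO-8859-1 should let minidom.parse ("c:/test.txt") work directly, without going through codecs first.<br>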

Also, while writing the XML file, I now use codecs.open ():<br>>> document.writexml (codecs.open (mFile, "w", "utf-8"), encoding="utf-8")<br>IMHO, writexml should be taking care of this, instead of me having to use codecs. I guess this is a bug.
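<br><br>As far as I can tell from the minidom documentation, the encoding argument of writexml only sets the encoding named in the XML header; the writer object still has to do the actual encoding, which would explain why codecs.open is needed. A minimal sketch of two ways to write the document, in Python 3 syntax (an assumption, since the session above is Python 2.4). The sample element is made up, and io.BytesIO stands in for a real file.<br>

```python
import io
from xml.dom import minidom

# Hypothetical in-memory document containing a non-ASCII character (ö).
dom = minidom.parseString("<target>L\u00f6schen</target>")

# Option 1: toxml(encoding=...) returns encoded bytes with a matching header;
# these can be written to a file opened in binary mode.
data = dom.toxml(encoding="utf-8")

# Option 2: writexml() emits text, so give it a writer that encodes to UTF-8.
# io.BytesIO stands in for a real file opened in binary mode.
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="utf-8")
dom.writexml(writer, encoding="utf-8")
writer.flush()
written = raw.getvalue()

# Reading the bytes back should give the original text.
round_trip = minidom.parseString(data).documentElement.firstChild.data
```

Either way, the point is the same as with codecs.open: the header and the bytes that reach the disk have to agree on the encoding.<br>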

<br><br>-- <br>Regards,<br>Abhimanyu