[XML-SIG] XML Unicode and UTF-8
n.youngman at ntlworld.com
Thu Aug 5 13:03:17 CEST 2004
>
> From: "Martin v. Löwis" <martin at v.loewis.de>
> Date: 2004/08/05 Thu AM 10:41:59 GMT
> To: n.youngman at ntlworld.com
> CC: xml-sig at python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8
<SNIP>
> State all the information that you have, preferably in the form:
> 1. this is what I did
> 2. this is what happened
> 3. this is what I expected to happen instead.
Well, I was trying to state the problem and not impose my own preconceptions of how it should be done, but if you want to go straight into debugging that's fine with me.
First Pass:
segment_tag.appendChild( charset_tag )
unicode_tag = doc.createElement( 'unicode' )
unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
segment_tag.appendChild( unicode_tag )
This inserts binary data into the segment/unicode tag.
Saving with
XMLFILE = open( filename, "w" )
xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
XMLFILE.close()
This leaves binary data in the document. I have assumed that this was raw Unicode; maybe that's a flawed assumption?
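One thing worth checking here is whether segment[0] is a byte string rather than decoded text. The following is only a minimal sketch of decoding before the text node is created, not the code from storemail.py. It assumes that segment is a (raw bytes, charset) pair, that the charset is iso-8859-1, and that a modern Python 3 interpreter is used rather than the Python 2.3 shown in the tracebacks below. The helper name build_segment and the "message" root element are hypothetical.

```python
from xml.dom import minidom


def build_segment(doc, segment):
    """Build a <segment> element from a hypothetical (raw_bytes, charset) pair."""
    raw_bytes, charset = segment
    segment_tag = doc.createElement("segment")

    charset_tag = doc.createElement("charset")
    charset_tag.appendChild(doc.createTextNode(charset))
    segment_tag.appendChild(charset_tag)

    # Decode first, so the text node holds characters rather than raw bytes.
    text = raw_bytes.decode(charset)
    unicode_tag = doc.createElement("unicode")
    unicode_tag.appendChild(doc.createTextNode(text))
    segment_tag.appendChild(unicode_tag)
    return segment_tag


doc = minidom.Document()
doc.appendChild(doc.createElement("message"))
# 0xee is the byte reported in the tracebacks; iso-8859-1 is an assumed charset.
doc.documentElement.appendChild(build_segment(doc, (b"\xeele", "iso-8859-1")))

# With an encoding argument, toxml() returns UTF-8 encoded bytes.
utf8_xml = doc.documentElement.toxml("utf-8")
```

With decoded text in the node, the serializer has real characters to encode, so nothing needs to be guessed about the bytes at save time.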
Second Pass:
Save with
XMLFILE = open( filename, "w" )
XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
XMLFILE.close()
results in:
Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 135, in save_message
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 48, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 62, in toprettyxml
    self.writexml(writer, "", indent, newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)
I hoped this would convert everything to UTF-8 and save it. The appearance of an ASCII codec was a complete surprise to me.
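A possible explanation, offered as an assumption rather than a confirmed diagnosis: in Python 2, asking a codec to encode a byte string first converts it to Unicode with the default codec, which is ASCII, and that implicit step is where a byte such as 0xee would fail. The sketch below makes that decode step explicit so that the same message can be reproduced. It is written for Python 3, where the conversion is never implicit.

```python
raw = b"\xee"  # the byte named in the traceback

try:
    # Python 2 would perform this decode implicitly before encoding to UTF-8.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

If this is what is happening, the UTF-8 encoder is never reached; the failure occurs while the bytes are being turned into text.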
Third Pass:
XMLFILE = codecs.open( filename, "w", "utf-8" )
xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
XMLFILE.close()
produces
Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 137, in save_message
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)
Again, I hoped this would convert everything to UTF-8 and save it. The appearance of an ASCII codec was a complete surprise to me.
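The intent of this attempt, letting the file object do the UTF-8 encoding while writexml only handles text, can be sketched as follows. This is an illustration under assumptions, not a fix verified against storemail.py: it uses Python 3, a temporary file with the hypothetical name message.xml, and a text node that already holds decoded characters.

```python
import os
import tempfile
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("unicode")
root.appendChild(doc.createTextNode("\u00eele"))  # already-decoded text
doc.appendChild(root)

path = os.path.join(tempfile.mkdtemp(), "message.xml")

# The file object performs the UTF-8 encoding; writexml only ever sees text.
with open(path, "w", encoding="utf-8") as xml_file:
    doc.documentElement.writexml(xml_file, indent="", addindent="", newl="")

# Read the raw bytes back to see what was actually stored on disk.
with open(path, "rb") as xml_file:
    saved = xml_file.read()
```

The bytes on disk can then be parsed again with minidom.parseString(saved) to check that the character survived the round trip.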
I won't bore you with other combinations, which I didn't expect to work. They didn't.
Neil Youngman