[XML-SIG] XML Unicode and UTF-8

n.youngman at ntlworld.com n.youngman at ntlworld.com
Thu Aug 5 13:03:17 CEST 2004


> 
> From: "Martin v. Löwis" <martin at v.loewis.de>
> Date: 2004/08/05 Thu AM 10:41:59 GMT
> To: n.youngman at ntlworld.com
> CC: xml-sig at python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8

<SNIP>

> State all the information that you have, preferably in the form:
> 1. this is what I did
> 2. this is what happened
> 3. this is what I expected to happen instead.

Well, I was trying to state the problem without imposing my own preconceptions of how it should be done, but if you want to go straight into debugging, that's fine with me.

First Pass:

                segment_tag.appendChild( charset_tag )
                unicode_tag = doc.createElement( 'unicode' )
                unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
                segment_tag.appendChild( unicode_tag )

This inserts binary data into the segment/unicode tag.

Saving with 

    XMLFILE = open( filename, "w" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

This leaves binary data in the document. I have assumed that this was raw Unicode; maybe that's a flawed assumption?
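One way to test that assumption is to check whether the value is bytes or text before it goes into the DOM. The sketch below is written for current Python 3, not the Python 2.3 used above, and assumes `segment` is a (bytes, charset) pair of the kind `email.header.decode_header` can return. The sample value and the helper `segment_to_text` are hypothetical, not taken from storemail.py.

```python
from xml.dom.minidom import Document


def segment_to_text(raw, charset):
    """Return text for a header segment, decoding bytes with the stated charset."""
    if isinstance(raw, bytes):
        # Hypothetical fallback: treat segments with no charset as ASCII.
        return raw.decode(charset or "ascii", errors="replace")
    return raw


# Hypothetical segment: the byte 0xEE is U+00EE in ISO-8859-1.
segment = (b"\xee", "iso-8859-1")

doc = Document()
unicode_tag = doc.createElement("unicode")
# The text node receives decoded text rather than raw bytes.
unicode_tag.appendChild(doc.createTextNode(segment_to_text(*segment)))
doc.appendChild(unicode_tag)
```

If the segment really is bytes in some legacy charset, decoding it before calling createTextNode keeps the tree free of raw binary data.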

Second Pass:

Save with
    XMLFILE = open( filename, "w" )
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
    XMLFILE.close()

results in:

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 135, in save_message
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 48, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 62, in toprettyxml
    self.writexml(writer, "", indent, newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

I hoped this would convert everything to UTF-8 and save it. The appearance of an ASCII codec was a complete surprise to me.
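The ASCII codec may be less mysterious than it looks. In Python 2, handing a byte string to a Unicode encoder first decodes it implicitly with the default ASCII codec, and that decode step is what fails, which would explain why the error is a UnicodeDecodeError rather than an encode error. Python 3 dropped the implicit step, so this sketch performs it explicitly to reproduce the message. The byte value comes from the traceback; the ISO-8859-1 charset is an assumption for illustration.

```python
raw = b"\xee"  # the offending byte reported in the traceback

try:
    # Python 2 performed this decode implicitly before encoding to UTF-8.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the segment's real charset (assumed ISO-8859-1 here) succeeds,
# and the resulting text then encodes to UTF-8 without trouble.
utf8_bytes = raw.decode("iso-8859-1").encode("utf-8")
```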

Third Pass:

    XMLFILE = codecs.open( filename, "w", "utf-8" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

produces

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 137, in save_message
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

Again, I hoped this would convert everything to UTF-8 and save it, and again the appearance of an ASCII codec was a complete surprise to me.
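For comparison, here is a minimal sketch of a save path that round-trips non-ASCII text, assuming the tree holds text rather than raw bytes. It targets current Python 3, not Python 2.3; the element name, sample character, and temporary file path are hypothetical.

```python
import os
import tempfile
from xml.dom.minidom import Document, parse

doc = Document()
root = doc.createElement("segment")
root.appendChild(doc.createTextNode("\u00ee"))  # text, not bytes
doc.appendChild(root)

path = os.path.join(tempfile.mkdtemp(), "message.xml")  # hypothetical location

# toxml(encoding=...) returns bytes, so the file is opened in binary mode.
with open(path, "wb") as xml_file:
    xml_file.write(doc.toxml(encoding="utf-8"))

# Read the raw bytes back, then parse the file to recover the text.
with open(path, "rb") as xml_file:
    saved = xml_file.read()

reloaded = parse(path).documentElement.firstChild.data
```

Serializing the whole document rather than only documentElement also writes an XML declaration naming the encoding, which helps a later parser read the file back correctly.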

I won't bore you with other combinations, which I didn't expect to work. They didn't.

Neil Youngman
