[XML-SIG] XML Unicode and UTF-8
"Martin v. Löwis"
martin at v.loewis.de
Thu Aug 5 13:35:18 CEST 2004
n.youngman at ntlworld.com wrote:
> First Pass:
> segment_tag.appendChild( charset_tag )
> unicode_tag = doc.createElement( 'unicode' )
> unicode_tag.appendChild( doc.createTextNode( segment ) )
> segment_tag.appendChild( unicode_tag )
> Inserts binary data into the segment/unicode tag
What is segment here? In XML, there is no notion of "binary data".
> Leaves binary data in the document. I have assumed that this was raw
> Unicode; maybe that's a flawed assumption?
There is nothing that could be called "raw Unicode", either. Again,
XML does not support binary data.
> consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in
> I hoped this would convert everything to UTF-8 and save it. The
> appearance of an ASCII codec was a complete surprise to me.
You can only encode Unicode objects. Since you apparently have put
a byte string object (<type 'str'>) into the DOM tree, Python first
has to convert the byte string into a Unicode string before it
can encode that Unicode string as UTF-8. For that conversion, it uses
the system default encoding, which is us-ascii.
Now, the byte string contains the byte '\xee', which is not supported
in us-ascii, so that conversion fails.
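The failing conversion can be reproduced in isolation. This is only a sketch in Python 3 syntax, where bytes plays the role of the byte string discussed here and the decode step has to be written out explicitly. The one-byte value of segment and the Latin-1 guess are assumptions for illustration, not taken from the original program.

```python
# Python 3 syntax: bytes stands in for the Python 2 byte string (<type 'str'>).
segment = b"\xee"  # hypothetical one-byte segment, matching the byte in the traceback

error = None
try:
    # Roughly what the implicit conversion attempts with a us-ascii default encoding.
    segment.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were really written in succeeds.
# Latin-1 is only an assumption here; in Latin-1, byte 0xEE is U+00EE.
text = segment.decode("latin-1")
```

The point of the sketch is that the ASCII codec appears during decoding, before any UTF-8 encoding takes place.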
> 3rd pass:
> XMLFILE = codecs.open( filename, "w", "utf-8" )
> xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
> XMLFILE.close()
> Traceback (most recent call last): File "./storemail.py", line 347,
The problem is that your DOM tree is already ill-formed. You should
not put binary data into a DOM tree.
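A minimal sketch of a tree that holds only Unicode text and is encoded to UTF-8 at serialization time, written in Python 3 syntax with xml.dom.minidom. The sample bytes and the iso-8859-1 charset are hypothetical; the real charset would have to come from the mail segment itself.

```python
from xml.dom.minidom import getDOMImplementation, parseString

# Hypothetical input: raw bytes of a mail segment plus its declared charset.
segment = b"caf\xe9"
charset = "iso-8859-1"

doc = getDOMImplementation().createDocument(None, "segment", None)
unicode_tag = doc.createElement("unicode")
# Decode first, so the text node holds characters rather than bytes.
unicode_tag.appendChild(doc.createTextNode(segment.decode(charset)))
doc.documentElement.appendChild(unicode_tag)

# The output encoding is chosen only when the finished tree is serialized.
xml_bytes = doc.toxml(encoding="utf-8")

# Parsing the result back shows that the text survived the round trip.
round_trip = parseString(xml_bytes).documentElement.firstChild.firstChild.data
```

With the tree built this way, the serializer never has to guess an encoding for the text nodes.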
> I missed out pass 4:
> Create the node with
> unicode_tag.appendChild( doc.createTextNode(
> segment.encode( "utf-8") ) )
Same issue: Apparently, segment is a byte string, but you can only
encode Unicode strings. *If* segment is a UTF-8 encoded byte string,
you should decode it to a Unicode string first and pass that to
createTextNode.
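The decoding step can be sketched as follows, in Python 3 syntax, where segment.decode("utf-8") does the job that unicode(segment, "utf-8") does in Python 2. The sample value of segment is hypothetical, and the sketch assumes the bytes really are UTF-8.

```python
from xml.dom.minidom import getDOMImplementation

doc = getDOMImplementation().createDocument(None, "segment", None)
unicode_tag = doc.createElement("unicode")

# Hypothetical UTF-8 encoded byte string: the two bytes encode U+00EE.
segment = b"\xc3\xae"

# Decode the bytes to a Unicode string before handing them to the DOM.
unicode_tag.appendChild(doc.createTextNode(segment.decode("utf-8")))
node_text = unicode_tag.firstChild.data
```

If the bytes are in some other encoding, the same call applies with that encoding's name in place of "utf-8".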