[XML-SIG] XML Unicode and UTF-8

"Martin v. Löwis" martin at v.loewis.de
Thu Aug 5 13:35:18 CEST 2004


n.youngman at ntlworld.com wrote:
> First Pass:
> 
> segment_tag.appendChild( charset_tag )
> unicode_tag = doc.createElement( 'unicode' )
> unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
> segment_tag.appendChild( unicode_tag )
> 
> Inserts binary data into the segment/unicode tag

What is segment[0] here? In XML, there is no notion of "binary data".

> Leaves binary data in the document. I have assumed that this was raw
> Unicode, maybe that's a flawed assumption?

There is nothing that could be called "raw Unicode", either. Again,
XML does not support binary data.

> consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)
> 
> I hoped this would convert everything to UTF-8 and save it. The
> appearance of an ASCII codec was a complete surprise to me.

You can only encode Unicode objects. Since you apparently have put
a byte string object (<type 'str'>) into the DOM tree, Python needs to
convert the byte string into a Unicode string first, before it
can encode the Unicode string as UTF-8. For that implicit conversion,
it uses the system default encoding, which is us-ascii.

Now, the byte string contains the byte '\xee', which lies outside the
ASCII range (0x00 to 0x7f), so that conversion fails.
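
To see that failure in isolation, here is a minimal sketch of the
decoding step. It is written for Python 3, where the conversion has to
be spelled out; the Python 2 discussed in this thread performs it
implicitly. The name raw is made up for the illustration.

```python
# Python 3 sketch of the decode step that Python 2 performed implicitly.
raw = b"\xee"  # hypothetical byte string holding the offending byte

try:
    raw.decode("ascii")  # ASCII only covers bytes 0x00..0x7f
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

The message names the 'ascii' codec even though UTF-8 was requested for
the output, because the error happens while decoding, before any
encoding takes place.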

> 3rd pass:
> 
> XMLFILE = codecs.open( filename, "w", "utf-8" )
> xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
> XMLFILE.close()
> 
> produces
> 
> Traceback (most recent call last): File "./storemail.py", line 347,

The problem is that your DOM tree is already ill-formed. You should
not put binary data into a DOM tree.
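
As a sketch of what a well-formed tree looks like: the DOM holds
characters, and bytes only appear once the tree is serialized in some
encoding. The element name follows the quoted code; the character
U+00EE and the use of toxml() on Python 3 are assumptions chosen for
illustration, since the thread does not say what segment[0] really
contains.

```python
from xml.dom import minidom

# Build a tiny document whose text node holds a character, not a byte.
doc = minidom.Document()
unicode_tag = doc.createElement("unicode")
doc.appendChild(unicode_tag)
unicode_tag.appendChild(doc.createTextNode("\u00ee"))  # LATIN SMALL LETTER I WITH CIRCUMFLEX

# The encoding is chosen only when the tree is written out.
as_utf8 = doc.toxml(encoding="utf-8")
as_latin1 = doc.toxml(encoding="iso-8859-1")

print(as_utf8)
print(as_latin1)
```

The same tree yields two different byte sequences, which is why the
tree itself should never contain already-encoded data.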

> I missed out pass 4:
>
> Create the node with
>
>   unicode_tag.appendChild( doc.createTextNode(
>       segment[0].encode( "utf-8") ) )

Same issue: Apparently, segment[0] is a byte string, but you can only
encode Unicode strings. *If* segment[0] is a UTF-8 encoded byte string,
you should write

    segment[0].decode( "utf-8")
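
Put together, a minimal sketch of the whole sequence might look like
this. It assumes Python 3, assumes that segment[0] really is UTF-8
encoded, and uses the built-in open() with an encoding argument in
place of codecs.open(); the sample bytes and the temporary file name
are made up for the illustration.

```python
import os
import tempfile
from xml.dom import minidom

segment = [b"caf\xc3\xa9"]  # hypothetical: UTF-8 bytes for "cafe" with e-acute

doc = minidom.Document()
segment_tag = doc.createElement("segment")
doc.appendChild(segment_tag)

# Decode first, so that the text node holds characters rather than bytes.
unicode_tag = doc.createElement("unicode")
unicode_tag.appendChild(doc.createTextNode(segment[0].decode("utf-8")))
segment_tag.appendChild(unicode_tag)

# Encode exactly once, when the tree is written to the file.
path = os.path.join(tempfile.mkdtemp(), "out.xml")  # hypothetical file name
with open(path, "w", encoding="utf-8") as xmlfile:
    doc.documentElement.writexml(xmlfile, indent="", addindent="", newl="")

with open(path, "rb") as f:
    written = f.read()
print(written)
```

If the bytes are in some other encoding, pass that encoding's name to
decode() instead; the output encoding can stay UTF-8 either way.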

Regards,
Martin

