[XML-SIG] XML Unicode and UTF-8

Neil Youngman n.youngman at ntlworld.com
Sat Aug 7 08:48:18 CEST 2004


On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> Paul Boddie wrote:
> > Do this instead:
> >
> >       utext = segment[0].decode( segment[1] )
>
> The resulting Unicode object may contain characters which are not allowed
> in XML, and thus the text may not be serializable (at least not in a way
> that would produce well-formed XML).
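
To illustrate the point above, here is a minimal sketch in present-day Python 3 (not the Python 2 of this thread) that checks decoded text against the XML 1.0 "Char" production. The helper name find_illegal_xml_chars and the sample bytes are made up for illustration.

```python
import re

# Matches any character outside the XML 1.0 "Char" production. Allowed are
# tab, LF, CR, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return the characters in text that XML 1.0 does not permit."""
    return _ILLEGAL_XML_CHARS.findall(text)


# Hypothetical input: the bytes decode without error, but two of the
# resulting characters are control characters that XML 1.0 forbids.
utext = b"ok\x00\x08".decode("iso-8859-1")
bad = find_illegal_xml_chars(utext)
```

Decoding succeeds here, yet bad is non-empty, so encoding the text as UTF-8 would not by itself make the output well-formed.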

Yes, but it's being written out through a UTF-8 codec to a file which
specifies 'charset="utf-8"'. AIUI the Python UTF-8 codec can detect that it's
got a Unicode string and convert it to UTF-8 with no further programmer
intervention.

Of course a week ago, Python was just another buzzword to me, so I could be 
wrong.
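
A minimal sketch of that approach, written for present-day Python 3 rather than the Python 2 of this thread. The contents of segment, the temporary output path and the <text> element are assumptions made up for illustration.

```python
import os
import tempfile

# Hypothetical (bytes, encoding-name) pair standing in for `segment`.
segment = (b"caf\xe9", "iso-8859-1")

# Decode the raw bytes to a Unicode string using their declared encoding.
utext = segment[0].decode(segment[1])

# Write through a UTF-8 codec; the file object encodes each string on write.
path = os.path.join(tempfile.mkdtemp(), "out.xml")
with open(path, "w", encoding="utf-8") as out:
    out.write('<?xml version="1.0" encoding="utf-8"?>\n')
    out.write("<text>" + utext + "</text>\n")

# Read the file back as bytes to see what was actually stored.
with open(path, "rb") as f:
    raw = f.read()
```

The e-acute that was one byte in ISO-8859-1 ends up as two bytes in the file. Note that this sketch does not escape markup characters such as & or <, which real text would need.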

> To embed arbitrary bytes in XML, the usual advice is to first convert the
> bytes into a character sequence that is permitted in XML. Base64 is a
> popular and easily implemented option, albeit inefficient. The article at
> http://www.javaworld.com/javaworld/javatips/jw-javatip117-p2.html suggests
> that a custom Huffman implementation is nearly 1:1. I've mapped bytes into
> the Private Use Area of Unicode before, too, although that's definitely not
> efficient.
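
For reference, the Base64 option mentioned above might look like the sketch below, using Python 3's standard base64 module. The <data> element name and its encoding attribute are hypothetical, not part of any schema discussed in this thread.

```python
import base64

# Every possible byte value, including ones that are not legal XML characters.
raw = bytes(range(256))

# Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', all of which are
# safe as XML character data without escaping.
encoded = base64.b64encode(raw).decode("ascii")

# Hypothetical wrapper element for the encoded payload.
element = '<data encoding="base64">' + encoded + "</data>"

# The reader reverses the step to recover the original bytes.
decoded = base64.b64decode(encoded)
```

The cost is size: every 3 input bytes become 4 output characters, which is the inefficiency referred to above.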

All neat ideas, but as I want UTF-8 encoding, they would just add an 
unnecessary layer of complexity.

Thanks for trying to help, but I think I've got what I need.

Neil Youngman


