[XML-SIG] XML Unicode and UTF-8

Sat Aug 7 19:59:43 CEST 2004

Neil Youngman wrote:
> On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> > The resulting Unicode object may contain characters which are not allowed
> > in XML, and thus the text may not be serializable (at least not in a way 
> > that would produce well-formed XML).
> 
> Yes, but it's being written out through a UTF-8 codec 

Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML 
places restrictions on what characters can be in the *decoded* (Unicode) 
version of the document. The encoded version of the document is just an 
alternative representation of the Unicode one.

In Python's notation, each character in the document must be one of:
\t  (tab)
\n  (linefeed)
\r  (carriage return)
\u0020-\ud7ff
\ue000-\ufffd
\u10000-\u10ffff

You are not allowed to have any other characters in your document, not even
by reference (e.g., you can't write &#0; to represent \u0000).

So let's say you have 256 bytes of binary data, just byte values 0-255:

>>> bytestring = ''.join(map(chr,range(256)))

How do you put this into your document? You have to make it be Unicode,
so you could try

>>> ustring = unicode(bytestring)

but that would give you an error because by default it's going to assume
bytestring is ascii (actually, what is returned by sys.getdefaultencoding(),
I think), whereas you've got bytes higher than \x7f.

You could try

>>> ustring = unicode(bytestring, 'utf-8')

but you will get errors because the bytes aren't valid UTF-8 sequences.
They're valid iso-8859-1, though, (iso-8859-1 allows any byte value) so
you could do

>>> ustring = unicode(bytestring, 'iso-8859-1')

and now you've got u'\u0000\u0001\u0002...\u00fe\u00ff'.

Note that some of those characters are not allowed in XML. The DOM 
implementations will accept them, because they don't check for illegal 
characters.

>>> from xml.dom.minidom import parseString
>>> doc = parseString('<data/>')
>>> t = doc.createTextNode(ustring)
>>> doc.childNodes[0].appendChild(t)

They'll even blindly serialize them for you.

>>> xmlstring = doc.toxml('utf-8')
>>> xmlustring = doc.toxml()

In all 3 cases (doc, xmlstring, xmlustring), illegal characters are in the XML.
Want proof?

>>> doc2 = parseString(xmlstring)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/local/lib/python2.3/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/local/lib/python2.3/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 6

If you try these examples yourself and go looking at the variables created, 
take note that Python's representation of Unicode strings uses '\x00'-'\xff' 
for '\u0000-\u00ff'. It's just a cosmetic thing; if the string is Unicode, 
everything in it is Unicode characters, not bytes.

-Mike