XML can't read Unicode shock horror. News at 11.

Martin von Loewis loewis at informatik.hu-berlin.de
Thu Nov 1 07:01:36 EST 2001


Dale Strickland-Clark <dale at riverhall.NOTHANKS.co.uk> writes:

> That's not much good if my XML document happens to start with:
> 
> <?xml version="1.0" encoding="UTF-16"?>
> 
> To quote from the O'Reilly book, "XML In A Nutshell" p71: "An XML
> parser is required to handle the UTF-16 and UTF-8 encodings or
> Unicode." And I expect similar is stated in the XML DOM spec if I had
> time to look for it.

This is getting interesting. Suppose you have a Unicode string:

u'<?xml version="1.0" encoding="utf-16"?><foo/>'

How do you want your XML processor to process that? In particular,
what do you think it is supposed to do with the encoding declaration?
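
To make the question concrete, here is a minimal sketch, assuming a
modern Python 3 (where str is the Unicode string type). The variable
names are only illustrative. It shows that the declaration describes a
byte encoding, while the string itself is already decoded and can be
serialized either way:

```python
# The document as a Unicode string: a sequence of code points, no bytes yet.
text = '<?xml version="1.0" encoding="utf-16"?><foo/>'

# The same text serialized in two different ways.
as_utf16 = text.encode("utf-16")  # Python prepends a byte order mark here
as_utf8 = text.encode("utf-8")

# Both byte strings decode back to the identical Unicode string...
assert as_utf16.decode("utf-16") == text
assert as_utf8.decode("utf-8") == text

# ...but the encoding declaration is accurate for only one of them.
assert as_utf16 != as_utf8

# The text is pure ASCII, so the sizes differ in a predictable way:
# one byte per character in UTF-8, two per character plus a 2-byte BOM
# in UTF-16.
assert len(as_utf8) == len(text)
assert len(as_utf16) == 2 * len(text) + 2
```

Once the text is held as a Unicode string, the declaration no longer
tells the processor anything it can act on; that is the point of the
question.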

As for the XML spec, it merely says

# All XML processors must accept the UTF-8 and UTF-16 encodings of
# 10646; the mechanisms for signaling which of the two is in use, or
# for bringing other encodings into play, are discussed later...

So there is no mention of "or Unicode", or anything equivalent.
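
What the spec does require can be tried on byte input. This is a
rough sketch, assuming a modern Python 3 and its bundled
xml.etree.ElementTree module; the helper name root_tag is made up for
illustration. It feeds the same document to the parser as UTF-8 bytes
and as UTF-16 bytes:

```python
import xml.etree.ElementTree as ET


def root_tag(document_bytes):
    """Parse an encoded XML document and return the root element's tag."""
    return ET.fromstring(document_bytes).tag


# The same document, serialized in the two encodings every processor
# must accept. Each declaration matches the bytes that follow it.
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><foo/>'.encode("utf-8")
utf16_doc = '<?xml version="1.0" encoding="UTF-16"?><foo/>'.encode("utf-16")

# The byte streams differ, yet both parse to the same root element.
assert utf8_doc != utf16_doc
assert root_tag(utf8_doc) == "foo"
assert root_tag(utf16_doc) == "foo"
```

In both cases the input is an encoded byte stream, which is the only
kind of input the quoted requirement talks about.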

The DOM specification does not recognize the existence of XML parsers
(at least not in DOM Level 2 or earlier), so it doesn't say anything on
the subject.

Regards,
Martin


