[XML-SIG] XML and Unicode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 23 May 2001 22:01:50 +0200


> How does one detect the charset used in an XML document from a SAX2
> parser (PyXML 0.6.5)?

That is not supported in SAX. The underlying parser may expose this
information, but that is of course parser-dependent.
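
As an illustration of the parser-dependent route, here is a minimal
sketch that bypasses SAX and asks Expat directly for the declared
encoding. It assumes the pyexpat bindings shipped with the current
Python 3 standard library (not PyXML 0.6.5), and the helper name
declared_encoding is hypothetical.

```python
import xml.parsers.expat


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None.

    Hypothetical helper: it relies on Expat's XmlDeclHandler callback,
    which is specific to Expat and not part of the SAX API.
    """
    found = {}

    def on_xml_decl(version, encoding, standalone):
        # Expat passes None for encoding if the declaration omits it.
        found["encoding"] = encoding

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(data, True)  # True marks this as the final chunk
    return found.get("encoding")


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\xe9</doc>'
    print(declared_encoding(doc))
```

Note that this only reports what the declaration says; a document with
no declaration yields None here, even though the parser still settles
on an encoding internally.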

> Also, if I have an XML document encoded ISO-8859-1 (and properly
> identified), should I have a reasonable expectation that the output
> of a SAX processor, post- .encode('utf-8'), should be correct if
> viewed in a Web browser with UTF-8 selected as a character encoding?

Not necessarily. If the document was an HTML document, and if it
has a

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

line, then the browser has to decide whether it believes the XML
declaration or the Content-Type. It would normally use the Content-Type,
which would be incorrect once the output has been re-encoded as UTF-8.

If there is no incorrect character set information in the output
document, then a receiver should display it properly.
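
One way to keep the character set information consistent is to let the
serializer write the declaration itself. The following is a sketch
only, assuming the xml.sax module and XMLGenerator from the current
Python 3 standard library rather than PyXML 0.6.5; the function name
reencode_to_utf8 is hypothetical. It does nothing about a META element
inside the content, which would still have to be corrected separately.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


def reencode_to_utf8(source: bytes) -> bytes:
    """Parse an XML document and re-serialize it as UTF-8.

    Hypothetical helper: the parser decodes the input according to its
    own XML declaration, and XMLGenerator writes a new declaration that
    matches the bytes it emits.
    """
    out = io.BytesIO()
    # XMLGenerator is a ContentHandler; encoding controls both the
    # declaration it writes and the byte encoding of the output.
    handler = XMLGenerator(out, encoding="utf-8")
    xml.sax.parseString(source, handler)
    return out.getvalue()


if __name__ == "__main__":
    latin1_doc = (
        '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
    ).encode("iso-8859-1")
    print(reencode_to_utf8(latin1_doc))
```

Calling .encode('utf-8') on the strings collected from characters()
gives the same bytes for the text itself; the point of going through
the generator is only that the declaration is rewritten along with it.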

Of course, whether a Web browser can "correctly" display arbitrary XML
documents is a different question.

Regards,
Martin