[XML-SIG] XML and Unicode

M.-A. Lemburg mal@lemburg.com
Wed, 23 May 2001 00:38:34 +0200

Mark Nottingham wrote:
> How does one detect the charset used in an XML document from a SAX2
> parser (PyXML 0.6.5)?
> Also, if I have an XML document encoded ISO-8859-1 (and properly
> identified), should I have a reasonable expectation that the output
> of a SAX processor, post- .encode('utf-8'), should be correct if
> viewed in a Web browser with UTF-8 selected as a character encoding?

This should work...
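
A minimal sketch of that round trip, assuming a current Python 3 and
the standard library's xml.sax rather than PyXML 0.6.5 (the handler
class name and the sample document are made up for illustration):

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser hands over Unicode text here, not ISO-8859-1 bytes.
        self.chunks.append(content)


# Sample document: "café" stored as ISO-8859-1 bytes, with the encoding
# properly declared in the XML declaration.
document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\u00e9</doc>'
).encode("iso-8859-1")

handler = TextCollector()
xml.sax.parseString(document, handler)

# Join the chunks (the parser may deliver text in several pieces),
# then encode the Unicode result as UTF-8 for output.
text = "".join(handler.chunks)
utf8_bytes = text.encode("utf-8")
```

The e-acute is the single byte 0xE9 in the input and the two bytes
0xC3 0xA9 in utf8_bytes, which is what a browser set to UTF-8 expects.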

> In other words, is the post-parse unicode string a neutral
> representation of the 8859-x string, which can then be encoded as
> utf-8?

Unicode is encoding-neutral in the sense that it provides
space for the characters of most scripts. If the parser returns
Unicode, then you can encode it as UTF-8 and have the original
contents of the attribute/element represented as a UTF-8 string.
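
To illustrate the "neutral" part without a parser, here is a small
sketch (Python 3 assumed; the sample bytes and variable names are
illustrative only). The same text decoded from two different source
encodings yields the same Unicode string, and only the final encode
step decides which bytes come out:

```python
# The same word, "café", stored in two different byte encodings.
latin1_bytes = b"caf\xe9"       # ISO-8859-1: e-acute is one byte
utf8_source = b"caf\xc3\xa9"    # UTF-8: e-acute is two bytes

# Decoding removes the source encoding; both give the same Unicode string.
from_latin1 = latin1_bytes.decode("iso-8859-1")
from_utf8 = utf8_source.decode("utf-8")
same_text = from_latin1 == from_utf8

# Encoding the Unicode string picks the output representation.
encoded = from_latin1.encode("utf-8")
```

If the output still contains a bare 0xE9 byte, the string was most
likely never decoded to Unicode in the first place, or it was written
out without the .encode('utf-8') step.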

> Or, is it in the charset of the original XML document (my
> testing seems to indicate the latter - what was a 8859 character in
> the original text does not successfully come out the other side)?
> (Sorry if this is obtuse - just getting into i18n, and Python docs
> are thin on the ground)

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/