[XML-SIG] XML and Unicode

Mark Nottingham mnot@mnot.net
Tue, 22 May 2001 15:06:41 -0700


How does one detect the charset used in an XML document from a SAX2
parser (PyXML 0.6.5)?

Also, if I have an XML document encoded as ISO-8859-1 (and properly
identified as such), can I reasonably expect the output of a SAX
processor, after .encode('utf-8'), to display correctly in a Web
browser with UTF-8 selected as the character encoding? In other
words, is the post-parse unicode string a neutral representation of
the 8859-x string, which can then be encoded as utf-8? Or is it still
in the charset of the original XML document? My testing seems to
indicate the latter: what was an 8859 character in the original text
does not successfully come out the other side.
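
For illustration, the round trip in question can be sketched as
follows. This is only a minimal sketch: it assumes a current Python 3
with the standard-library xml.sax module rather than PyXML 0.6.5, and
the TextCollector class, the parse_text helper and the sample document
are hypothetical names made up for the example. Under those
assumptions, the text handed to characters() is a charset-neutral
unicode string, so encoding it as UTF-8 should give the UTF-8 bytes
for the same characters.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # content arrives as a unicode string, already decoded by the
        # parser according to the document's declared encoding.
        self.chunks.append(content)


def parse_text(xml_bytes):
    """Parse raw XML bytes and return the concatenated text content."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# Sample document (an assumption for the example): "cafe" with an
# e-acute, stored as the single byte 0xE9 in ISO-8859-1.
source = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

text = parse_text(source)         # unicode string, no charset attached
utf8_bytes = text.encode("utf-8")  # e-acute becomes the bytes 0xC3 0xA9
```

If the UTF-8 output looks wrong in a browser, one thing worth checking
is whether the bytes were decoded or encoded a second time somewhere
between the parser and the page.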

(Sorry if this is obtuse - just getting into i18n, and Python docs
are thin on the ground)

-- 
Mark Nottingham
http://www.mnot.net/