[XML-SIG] XML and Unicode
Wed, 23 May 2001 08:46:25 -0700
It's the em dash in the middle. If true, this behaviour would be a
bug, no? Is there any kind of workaround possible (such as detecting
the encoding of the XML file outside of the parser and .encode()ing

On Wed, May 23, 2001 at 09:38:14AM +0200, M.-A. Lemburg wrote:
> Mark Nottingham wrote:
> > OK, so I'm not getting something then. The attached test script (and
> > data file) is the problem pared down - if u'string' is a neutral
> > encoding, and .encode('utf-8') generates a utf-8 encoded string of
> > that encoding, then the utf-8.html output file should display
> > correctly; however, it doesn't, while the latin-1 output does
> > (because the input is latin-1).
> > It seems like the XML parser isn't converting the ISO-8859-1 to
> > Unicode; does this make sense?
> That's a possibility (even though I don't see any funny characters
> in your example XML file); looking through the pyexpat.c code
> it seems as if the parser assumes that the XML file is encoded
> as UTF-8 -- at least all Unicode conversions are done using UTF-8.
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> Company & Consulting: http://www.egenix.com/
> Python Software: http://www.lemburg.com/python/
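
The workaround asked about at the top of this message (detect the encoding
outside the parser, then hand the parser UTF-8) could look roughly like the
sketch below. It is only an illustration, not code from the thread: the helper
names sniff_xml_encoding and transcode_to_utf8 and the sample document are
made up, it is written for current Python 3 rather than the 2001-era
interpreter under discussion, and it assumes an ASCII-compatible encoding
named in the XML declaration (no BOM or UTF-16 handling).

```python
import re
import xml.dom.minidom

# Matches an XML declaration at the very start of the data and captures the
# encoding name. Only works for ASCII-compatible encodings (an assumption).
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL_RE.match(data)
    return match.group(1).decode("ascii") if match else default


def transcode_to_utf8(data):
    """Decode with the declared encoding and re-encode the text as UTF-8."""
    text = data.decode(sniff_xml_encoding(data))
    # Drop the old declaration so it no longer names the original encoding;
    # without a declaration an XML parser treats the bytes as UTF-8.
    text = re.sub(r'^<\?xml[^>]*\?>', '', text, count=1)
    return text.encode("utf-8")


# Hypothetical Latin-1 input containing one non-ASCII character (e-acute).
sample = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'.encode(
    "iso-8859-1"
)

utf8_bytes = transcode_to_utf8(sample)
doc = xml.dom.minidom.parseString(utf8_bytes)
value = doc.documentElement.firstChild.data  # text content as a str
```

The idea is that the parser only ever sees UTF-8, so it does not matter
whether it honours the encoding declaration. As far as I know, the expat
shipped with current Python releases reads an ISO-8859-1 declaration
correctly on its own, so a step like this should only be needed where the
parser really does assume UTF-8.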