[XML-SIG] XML and Unicode

Mark Nottingham mnot@mnot.net
Wed, 23 May 2001 13:44:23 -0700


Thanks. If that's the case, what's happening here (see test script)?
The source text, when written directly to HTML and identified as
ISO-8859-1, displays correctly. When it is parsed by pyexpat, and the
resulting unicode string is passed through .encode('UTF-8') and
included in HTML identified as UTF-8, it does not display correctly.
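
For reference, here is a minimal sketch of the pipeline described above.
It is not the test script itself. It assumes a current Python, where
pyexpat is exposed as xml.parsers.expat, and the sample document is
hypothetical: one that really is ISO-8859-1, for which the round trip
to UTF-8 should behave as expected.

```python
import xml.parsers.expat

# Hypothetical sample: a document that really is ISO-8859-1
# (byte 0xE9 is e-acute in that encoding).
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

chunks = []
parser = xml.parsers.expat.ParserCreate()
# Character data may arrive in several pieces, so collect and join.
parser.CharacterDataHandler = chunks.append
parser.Parse(doc, True)

text = "".join(chunks)       # a Unicode string, whatever the input encoding was
utf8 = text.encode("UTF-8")  # bytes suitable for a page identified as UTF-8
```

If this works for genuine ISO-8859-1 input, then any difference seen
with the real document would point at the bytes in that document rather
than at the parse-then-encode steps.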

I'm not sure I understand your previous message - no one has suggested
that it's Windows CP 1252 (although I may have missed messages), and
I'm not sure what you mean by 'consider the document as ISO-8859-1';
I'm feeding a document into an XML parser with encoding="ISO-8859-1",
and getting unicode strings out of it. What mechanism do I have to
consider it as having a particular encoding, beyond the XML
declaration? I've been given the impression that unicode strings are

Cheers & thanks,

On Wed, May 23, 2001 at 10:15:06PM +0200, Martin v. Loewis wrote:
> > > That's a possibility (even though I don't see any funny
> > > characters in your example XML file); looking through the
> > > pyexpat.c code it seems as if the parser assumes that the XML
> > > file is encoded as UTF-8 -- at least all Unicode conversions
> > > are done using UTF-8.
> > > 
> > It's the em dash in the middle. If true, this behaviour would be
> > a bug, no?
> It would be a bug, but pyexpat works correctly. expat indeed does
> guarantee that all text is UTF-8, because it converts the file from
> any input encoding to UTF-8 before passing it to the application.
> Regards,
> Martin

On Wed, May 23, 2001 at 10:04:11PM +0200, Martin v. Loewis wrote:
> > It seems like the XML parser isn't converting the ISO-8859-1 to
> > Unicode; does this make sense?
> As others have explained, your document is really Windows CP 1252,
> not ISO 8859 1 encoded.
> If you consider the document as ISO-8859-1, then the parser *will*
> convert it correctly.
> Regards,
> Martin
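
The difference between the two encodings can be sketched with the
codecs alone. This assumes, as suggested in the quoted messages, that
the em dash is stored as the single byte 0x97; the sample bytes are
hypothetical. That byte is an em dash in CP 1252 but a C1 control
character in ISO-8859-1. Browsers commonly treat a page identified as
ISO-8859-1 as if it were CP 1252, which would explain why the raw text
appears to display correctly.

```python
raw = b"a\x97b"  # hypothetical bytes; 0x97 is where the em dash sits

as_latin1 = raw.decode("iso-8859-1")  # 0x97 -> U+0097, an invisible C1 control
as_cp1252 = raw.decode("cp1252")      # 0x97 -> U+2014, the em dash

# Re-encoding each interpretation as UTF-8 gives different bytes.
utf8_from_latin1 = as_latin1.encode("utf-8")  # b"a\xc2\x97b"
utf8_from_cp1252 = as_cp1252.encode("utf-8")  # b"a\xe2\x80\x94b"
```

Under that assumption the parser is faithfully producing U+0097, and
only the CP 1252 interpretation yields a UTF-8 em dash.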

Mark Nottingham