[Expat-discuss] Expat treating ISO-8859-1 char strangely?

Fred L. Drake, Jr. fdrake at acm.org
Fri Jul 25 12:44:52 EDT 2003


Stuart Powers writes:
 > Hi, we're new to this mailing list, and we were wondering if anyone here could help us with a problem we're having.
 >  
 > Our XML file (with encoding set to ISO-8859-1) contains the following string:
 >  
 > "Kickin’ it Dash style"
 >  
 > The apostrophe, we're pretty sure, is a character from the
 > ISO-8859-1 character set. (We got this string for testing by
 > copying and pasting from
 > http://www.zeldman.com/daily/0703b.shtml#anil .)
 >  
 > We're using XML::DOM (which uses XML::DOM::Parser, which supposedly uses Expat) to parse this XML file, and when we send the parsed data to a browser (via HTTP), it comes out like this:
 >  
 > " KickinÂ’ it Dash style"

Other than the space prepended at the beginning, that's the UTF-8 I'd
expect.

 > That is how Mozilla displays it when it is set to read character
 > encoding ISO-8859-1. When set to read UTF-8, it simply displays
 > "Kickin#146; it Dash style".

I don't know why Mozilla would display it like that.  That's a Mozilla
issue.

 > We would sort of understand it if Expat simply took our ISO-8859-1
 > character and copied it directly (byte by byte), or if it somehow
 > converted it to UTF-8 and we got a UTF-8 character, but it appears
 > that it's doing neither - it's sending us bytes which don't seem to
 > be a valid character in either character set.

This is definitely a display issue.  ISO-8859-1 character 0222 (0x92)
converts to the UTF-8 sequence 0302 0222 (0xc2 0x92).  So Expat is
doing the right thing.
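
As a rough illustration of that conversion, here is a minimal Python
sketch (Python is used only for convenience; it is not what XML::DOM or
Expat run internally).  It decodes the single byte 0x92 as ISO-8859-1 and
re-encodes it as UTF-8.  The last step assumes the viewer interprets the
bytes as Windows-1252, which many browsers do for pages labelled
ISO-8859-1; that is an assumption, not something confirmed in this thread.

```python
# The byte from the input file, written in octal as in the message above.
latin1_byte = bytes([0o222])                 # same as 0x92

# ISO-8859-1 maps each byte value to the code point with the same number.
text = latin1_byte.decode("iso-8859-1")      # U+0092

# Re-encode that code point as UTF-8.
utf8_bytes = text.encode("utf-8")
print([oct(b) for b in utf8_bytes])          # octal values of the UTF-8 bytes
print([hex(b) for b in utf8_bytes])          # the same bytes in hex

# Assumption: the viewer decodes the UTF-8 bytes as Windows-1252.
shown = utf8_bytes.decode("cp1252")
print(shown)
```

Under that assumption the two UTF-8 bytes are rendered as two separate
characters, which matches the garbled output quoted above.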


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at Zope Corporation


