parsing an xml document with funky ascii characters

Mon Feb 4 01:45:01 EST 2002

ayinger1 at pacbell.net (andrew) writes:

> I am using sax parser in python 2.1.
> How do I deal with xml documents with characters like 'ä'?

Depends on how this character is encoded.

> 
> I have tried:
> 	- setting encoding="ISO-8859-1 in the xml doc itself

If you use that encoding, this is the right way (assuming you put it
into the xml header, and assuming you added the closing double-quote).
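As a minimal sketch of the declaration approach: this is written for Python 3, where str is already Unicode, rather than the Python 2.1 discussed here, and the TextCollector class and the sample document are made up for illustration.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser reports text as Unicode, whatever the input encoding was.
        self.chunks.append(content)


# Sample document: the header declares ISO-8859-1 and the bytes really are
# ISO-8859-1, so the a-umlaut is stored as the single byte 0xE4.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><word>\u00e4</word>'.encode("iso-8859-1")

handler = TextCollector()
xml.sax.parseString(doc, handler)
text = "".join(handler.chunks)  # Unicode text, independent of the input bytes
```

The declared encoding only tells the parser how to read the bytes; what the handler receives is the Unicode character U+00E4.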

> 	- setting the InputSource encoding via:
> source.setEncoding('ISO-8859-1')

That has no effect, I believe (although it probably should).

> 	- escaping the character in the doc: ('\x84')

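That will give you the literal four-character string '\x84' in the
output, since XML has no backslash escapes (a character reference such
as &#xE4; would be the XML way). BTW, in what encoding is LATIN SMALL
LETTER A WITH DIAERESIS encoded as \x84? Certainly not in iso-8859-1.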

> 	- and, finally, encoding the parsed strings that have this
> character: myString.encode("ISO-8859-1")

That also has no effect, and should not have. Invoking .encode
converts the string to Unicode if it is not Unicode already, and then
encodes it as Latin-1 again. Unless you get a UnicodeError, this will
give you the original string back.
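The round trip can be sketched as follows, assuming Python 3 (where str and bytes are separate types, unlike in Python 2.1); the variable names are only illustrative.

```python
text = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# Encoding as Latin-1 yields one byte; decoding it again restores the text.
latin1_bytes = text.encode("iso-8859-1")
round_trip = latin1_bytes.decode("iso-8859-1")

# A character outside Latin-1 cannot be encoded and raises a UnicodeError
# (more precisely its subclass UnicodeEncodeError).
try:
    "\u03a3".encode("iso-8859-1")  # GREEK CAPITAL LETTER SIGMA
    sigma_failed = False
except UnicodeError:
    sigma_failed = True
```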

> What I have found is that the default parser (appears to be expat,
> retrieved from sax.make_parser) seems to store every element as
> unicode strings.  

That is proper behaviour. XML document pieces are always reported as
Unicode strings to the application, as per the XML spec.

> It appears to store them incorrectly (so, 'ä' appears in the unicode
> string as '\xe4' instead of '\x84').

If you are using ISO-8859-1, then \xe4 *is* LATIN SMALL LETTER A WITH
DIAERESIS (more precisely, it is \u00e4).

> The result is that if I try to encode the unicode string that i get
> back from the parser, the character in question incorrectly appears
> as 'E' (sum).

Depends on how you are encoding it. If you encode it as ISO-8859-1,
you get the byte \xe4, which does *not* mean 'E (sum)', but indeed
means LATIN SMALL LETTER A WITH DIAERESIS.

That gives a clue as to what encoding you are using, though: In IBM
code page 437, we indeed have the assignments

<U00E4>     /x84         LATIN SMALL LETTER A WITH DIAERESIS
<U03A3>     /xe4         GREEK CAPITAL LETTER SIGMA

(where the latter indeed looks somewhat like an 'E (sum)').
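These two assignments, and the resulting mix-up, can be checked with Python's own cp437 codec. This is a sketch assuming Python 3; the variable names are made up for illustration.

```python
import unicodedata

# How CP437 interprets the two bytes in question.
a_umlaut = b"\x84".decode("cp437")
sigma = b"\xe4".decode("cp437")

# The mix-up: text encoded as ISO-8859-1, then displayed as if it were CP437.
latin1_byte = "\u00e4".encode("iso-8859-1")
misread = latin1_byte.decode("cp437")

names = (unicodedata.name(a_umlaut), unicodedata.name(sigma))
```

Here misread comes out as the Greek sigma, which is what an OEM-code-page console would show for Latin-1 output.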

> Any ideas?  Am I doing something wrong here?

I'm still not sure what you are trying to achieve, and why you think
anything is not working properly. Your main problem seems to be that
you don't know what encoding you are using. 

Since you get \xe4 reported from the parser, it appears that the input
encoding is indeed ISO-8859-1. It is somewhat surprising, though, that
you then expect that the output encoding is CP437. Perhaps you are
viewing the output in a Windows command.com/cmd.exe window, which
displays bytes using the OEM code page (often CP437)? Try redirecting
the output into a file, and view the file with Notepad, which uses the
ANSI code page.

If you really need to output as CP437, invoke .encode("cp437") on the
Unicode strings that the parser is giving you before printing them.
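A minimal sketch of that last step, assuming Python 3: the helper names to_cp437 and print_cp437 are hypothetical, and the errors="replace" argument is an addition of this sketch (not part of the advice above) so that characters missing from CP437 become '?' instead of raising an error.

```python
import sys


def to_cp437(text):
    """Hypothetical helper: encode Unicode text as CP437 bytes."""
    # errors="replace" is an assumption of this sketch; drop it to get a
    # UnicodeEncodeError for characters CP437 cannot represent.
    return text.encode("cp437", errors="replace")


def print_cp437(text):
    """Write the CP437 bytes to standard output, bypassing text encoding."""
    sys.stdout.buffer.write(to_cp437(text) + b"\n")
```

For example, to_cp437 applied to the a-umlaut returns the single byte 0x84, which is the value a CP437 console expects for that character.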

If you want to learn more about these things, please read the
Microsoft documentation on the "OEM" and "ANSI" code pages.

Regards,
Martin