parsing an xml document with funky ascii characters
ayinger1 at pacbell.net
Mon Feb 4 02:14:40 CET 2002
I am using the SAX parser in Python 2.1.
How do I deal with XML documents that contain characters like 'ä'?
I have tried:
- setting encoding="ISO-8859-1" in the XML doc itself
- setting the InputSource encoding via:
- escaping the character in the doc: ('\x84')
- and, finally, encoding the parsed strings that have this character:
What I have found is that the default parser (it appears to be expat,
as returned by sax.make_parser) seems to store all element text as
unicode strings. It appears to store them incorrectly (so, 'ä'
appears in the unicode string as '\xe4' instead of '\x84'). The
result is that if I try to encode the unicode string that I get back
from the parser, the character in question incorrectly appears as 'E'.
Any ideas? Am I doing something wrong here?
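
For reference, here is a minimal sketch of the situation described above. It is written for Python 3, not the Python 2.1 used in this post. The TextCollector class, the parse_text helper and the <word> element are made up for illustration. It assumes the '\x84' comes from a DOS code page such as cp437, where 'ä' is byte 0x84; the post does not say which editor or console was involved, so that is only a guess.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that just gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser hands character data over as unicode text.
        self.chunks.append(content)


def parse_text(xml_bytes):
    """Parse an XML byte string and return all of its character data."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# A document declared as ISO-8859-1, where 'ä' is the single byte 0xE4.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><word>\u00e4</word>'.encode(
    "iso-8859-1"
)

text = parse_text(doc)
print(hex(ord(text)))              # code point of the parsed character
print(text.encode("iso-8859-1"))   # bytes for a Latin-1 file or terminal
print(text.encode("cp437"))        # bytes for a cp437 (DOS) console
```

Under those assumptions the parsed character is U+00E4, which is the Unicode code point for 'ä'. The byte 0x84 only appears once the string is encoded to cp437, so the encoding chosen for output has to match whatever will display the bytes.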