[Chicago] xml encodings, umlauts, ouch
Kumar McMillan
kumar.mcmillan at gmail.com
Thu Apr 16 04:37:25 CEST 2009
Dear gurus,
I'm trying to parse this XML file that has an escaped hex sequence of
<realname>Jesper Dahlb&#xC3;&#xA4;ck</realname> -- in other words,
after parsing with lxml, u'Jesper Dahlb\xc3\xa4ck'. I know this is
supposed to be an a with an umlaut, Jesper Dahlbäck. BUT shouldn't
that be <realname>Jesper Dahlb&#xE4;ck</realname>? That is, hex e4 /
decimal 228 / a with umlaut (http://www.tony-franks.co.uk/UTF-8.htm ).
Then again, the more I think about it, I have no idea what encoding
these escaped XML byte sequences are supposed to be in. The XML file,
of course, doesn't specify an encoding.
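
A minimal sketch of the two readings, for anyone following along. It
assumes Python 3 and uses the standard library's xml.etree.ElementTree as
a stand-in for lxml; the variable names are made up for illustration.
Numeric character references in XML name Unicode code points, not bytes in
the file's encoding, so &#xC3;&#xA4; parses to the two characters U+00C3
and U+00A4, which look like the UTF-8 bytes of U+00E4 written out one
byte per reference:

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: the form found in the file, and the form I expected.
doubly_escaped = "<realname>Jesper Dahlb&#xC3;&#xA4;ck</realname>"
expected_form = "<realname>Jesper Dahlb&#xE4;ck</realname>"

# Each character reference becomes one Unicode code point after parsing.
parsed_doubly = ET.fromstring(doubly_escaped).text   # contains U+00C3, U+00A4
parsed_expected = ET.fromstring(expected_form).text  # contains U+00E4

# Possible repair, assuming the references really are UTF-8 bytes:
# map each code point back to a single byte (Latin-1), then decode as UTF-8.
repaired = parsed_doubly.encode("latin-1").decode("utf-8")

print(repr(parsed_doubly))
print(repr(parsed_expected))
print(repaired == parsed_expected)
```

If that reading is right, the Latin-1 round trip recovers the intended
name. It is a workaround for this one value, not a general fix for the
file.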
Or, is this some composite format that is trying to say a + umlaut?
When I run chardet.detect() on the byte string it tells me EUC-KR
(Korean) but that doesn't seem right.
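
One way to check the composite idea, sketched in Python 3 with only the
standard library (the variable names are illustrative). A decomposed
a + umlaut would be the letter a followed by the combining diaeresis
U+0308, which is a different sequence from C3 A4; those two values match
the UTF-8 encoding of the precomposed character instead:

```python
import unicodedata

composed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# Canonical decomposition (NFD) splits it into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", composed)
decomposed_points = [hex(ord(ch)) for ch in decomposed]

# UTF-8 encoding of the precomposed character, as a list of hex byte values.
utf8_bytes = [hex(b) for b in composed.encode("utf-8")]

print(decomposed_points)
print(utf8_bytes)
```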
Kumar