[Chicago] xml encodings, umlauts, ouch

Kumar McMillan kumar.mcmillan at gmail.com
Thu Apr 16 04:37:25 CEST 2009


Dear gurus,

I'm trying to parse an XML file that contains a name escaped as hex
character references: <realname>Jesper Dahlb&#xC3;&#xA4;ck</realname>.
After parsing with lxml I get u'Jesper Dahlb\xc3\xa4ck'.  I know this is
supposed to be an a with an umlaut, Jesper Dahlbäck.  BUT shouldn't
that be <realname>Jesper Dahlb&#xE4;ck</realname>?  That is, hex e4 /
decimal 228 / a with umlaut (http://www.tony-franks.co.uk/UTF-8.htm).
Then again, the more I think about it, I have no idea what encoding
these escaped XML byte sequences are supposed to be in.  The XML file,
of course, doesn't specify an encoding.
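
For what it's worth, here is a minimal sketch of what I'm seeing. It
uses the standard library's xml.etree.ElementTree instead of lxml and
Python 3 string syntax (both assumptions, just to keep it
self-contained), and the variable names are only illustrative.  It also
tries one guess: that whoever produced the file escaped each UTF-8 byte
as its own character reference.  That is a guess about the producer,
not something I've confirmed.

```python
import xml.etree.ElementTree as ET

# Sample element copied from the file.
doc = "<realname>Jesper Dahlb&#xC3;&#xA4;ck</realname>"
parsed = ET.fromstring(doc).text

# As far as I can tell, each character reference names a Unicode code
# point, so the parser yields U+00C3 and U+00A4, not a single U+00E4.
print([hex(ord(c)) for c in parsed[-4:-2]])

# Guess: the two code points are really the UTF-8 bytes C3 A4.
# Latin-1 maps code points 0-255 to the same byte values, so encoding
# with it recovers those bytes, which can then be decoded as UTF-8.
repaired = parsed.encode("latin-1").decode("utf-8")
print(repaired == "Jesper Dahlb\u00e4ck")
```

If the guess holds, the round trip gives back the name with a single
a-umlaut; if the file mixes properly escaped names with these, the
encode or decode step could fail or mangle them, so I wouldn't apply it
blindly.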

Or, is this some composite format that is trying to say a + umlaut?
When I run chardet.detect() on the byte string it tells me EUC-KR
(Korean) but that doesn't seem right.
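
To test the "a + umlaut" idea, here is a small sketch using the
standard library's unicodedata module (Python 3 syntax assumed).  It
decomposes the a-umlaut into its base letter plus combining mark so the
result can be compared with the C3 / A4 pair from the file.

```python
import unicodedata

composed = "\u00e4"  # a with umlaut as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits the character into a base letter plus a combining mark.
print([hex(ord(c)) for c in decomposed])

# Check whether either decomposed code point matches the pair in the file.
overlap = set(decomposed) & {"\u00c3", "\u00a4"}
print(overlap)
```

The decomposed form is a plain a followed by U+0308 (combining
diaeresis), so a composite spelling would not look like C3 A4.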

Kumar

