[Chicago] xml encodings, umlauts, ouch

Aaron Lav asl2 at pobox.com
Thu Apr 16 05:41:40 CEST 2009


On Wed, Apr 15, 2009 at 09:37:25PM -0500, Kumar McMillan wrote:
> Dear gurus,
> 
> I'm trying to parse this XML file that has an escaped hex sequence of
> <realname>Jesper Dahlb&#xC3;&#xA4;ck</realname> -- in other words,
> after parsing with lxml, u'Jesper Dahlb\xc3\xa4ck'.  I know this is
> supposed to be an a with an umlaut, Jesper Dahlbäck.  BUT shouldn't
> that be <realname>Jesper Dahlb&#xE4;ck</realname> ?? that is, hex e4 /

Yes, it should be &#xE4;.  The string you're seeing is the result of
character-reference-encoding each byte of the utf-8 encoding of that
character ('\xc3\xa4'.decode('utf-8') is u'\xe4').
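
To see the difference concretely, here is a small sketch.  It assumes
Python 3 syntax and the standard library's ElementTree rather than lxml,
so take it as an illustration and not as exactly what your parser does:

```python
import xml.etree.ElementTree as ET

# What the file contains: one character reference per UTF-8 byte.
wrong = ET.fromstring("<realname>Jesper Dahlb&#xC3;&#xA4;ck</realname>").text
# What it should contain: one reference for the code point U+00E4.
right = ET.fromstring("<realname>Jesper Dahlb&#xE4;ck</realname>").text

print(ascii(wrong))  # 'Jesper Dahlb\xc3\xa4ck'
print(ascii(right))  # 'Jesper Dahlb\xe4ck'
```

The first result has two characters (U+00C3, U+00A4) where the second
has the single character U+00E4.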

They're doing the wrong thing, since character references are supposed
to be the value of the unicode code point (in this case, 0xe4), see
http://www.w3.org/TR/REC-xml/#NT-CharRef, not an encoded version of
some byte string which represents the code point in some
transformation format.  (This is a little subtle if you're working in
UTF-16 and you have surrogates: you need to decode them before
character-reference-encoding.)
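
A rough sketch of the right order of operations, assuming Python 3 (where
iterating over a str yields code points).  The helper name char_refs is
made up for this example, and it does not escape & or <, so it is not a
complete XML serializer:

```python
def char_refs(text):
    """Replace every non-ASCII character with a hex character reference."""
    return "".join(c if ord(c) < 128 else "&#x%X;" % ord(c) for c in text)

print(char_refs("Dahlb\u00e4ck"))   # Dahlb&#xE4;ck

# U+1D11E (musical G clef) is stored in UTF-16 as the surrogate pair
# D834 DD1E, but the reference must name the code point itself.
clef = b"\xd8\x34\xdd\x1e".decode("utf-16-be")
print(char_refs(clef))              # &#x1D11E;
```

Decoding first is what keeps the surrogate pair from turning into two
bogus references.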


>  Then again, the more I think about it, I have no idea what encoding
> these escaped XML byte sequences are supposed to be in.  The xml file
> of course doesn't specify an encoding.

If it doesn't specify an encoding, then it ought to be utf-8 (or UTF-16
with a byte order mark), in the absence of an HTTP / other external
encoding declaration.  (http://www.w3.org/TR/REC-xml/#NT-EncodingDecl)
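
A quick way to check that default, sketched with Python 3 and the standard
library's ElementTree (an assumption on my part; lxml should behave
similarly but I have not shown it here):

```python
import xml.etree.ElementTree as ET

# No XML declaration and no BOM: the parser assumes UTF-8.
utf8_doc = "<realname>Jesper Dahlb\u00e4ck</realname>".encode("utf-8")
print(ascii(ET.fromstring(utf8_doc).text))   # 'Jesper Dahlb\xe4ck'

# The same text stored as Latin-1 without a declaration is not valid UTF-8.
latin1_doc = "<realname>Jesper Dahlb\u00e4ck</realname>".encode("latin-1")
try:
    ET.fromstring(latin1_doc)
    latin1_ok = True
except ET.ParseError:
    latin1_ok = False
print(latin1_ok)                             # False

# With a declaration, the Latin-1 bytes parse fine.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_doc
print(ascii(ET.fromstring(declared).text))   # 'Jesper Dahlb\xe4ck'
```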

See http://intertwingly.net/slides/2005/etcon/72.html for some discussion
of what happens when the encoding declarations conflict.  

> Or, is this some composite format that is trying to say a + umlaut?

You're thinking of "Combining Diacritical Marks", http://unicode.org/charts/PDF/U0300.pdf (the umlaut is at 0x308).
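
For comparison, this is what the combining form looks like, sketched in
Python 3 with the standard unicodedata module:

```python
import unicodedata

decomposed = "a\u0308"          # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 2 1
print(composed == "\u00e4")             # True
print(unicodedata.name("\u0308"))       # COMBINING DIAERESIS
```

So a composite a + umlaut would show up as 'a' followed by U+0308, not
as \xc3\xa4.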

If you can persuade the person who's generating the XML to do
character reference encoding before UTF-8 encoding, then that would be
the best thing.  Otherwise (e.g. if they're storing utf-8 as bytes and
building xml via string manipulation), you'll have to add some hackish
preprocessing or postprocessing.
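
One possible postprocessing hack, as a sketch only (Python 3; the function
name repair_byte_refs is invented here).  It assumes every character in
the parsed string really stands for one byte of UTF-8, and leaves the
string alone when that assumption visibly fails:

```python
def repair_byte_refs(text):
    """Undo 'one character reference per UTF-8 byte' damage, if possible."""
    try:
        # Map each character U+0000..U+00FF back to a byte, then decode.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not all characters fit in a byte, or the bytes are not UTF-8:
        # assume the string was already correct.
        return text

print(ascii(repair_byte_refs("Jesper Dahlb\xc3\xa4ck")))  # 'Jesper Dahlb\xe4ck'
print(ascii(repair_byte_refs("Jesper Dahlb\xe4ck")))      # 'Jesper Dahlb\xe4ck'
```

The catch is that a correct string which happens to look like UTF-8
bytes (say, a literal U+00C3 followed by U+00A4) would be altered too, so
this is only safe if you know the whole feed is broken the same way.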

     Aaron "u'\u05e9'" Lav (asl2 at pobox.com)

