Kumar McMillan wrote:
> Dear gurus,
> I'm trying to parse this XML file that has an escaped hex sequence of
> <realname>Jesper Dahlb&#xC3;&#xA4;ck</realname> -- in other words,
> after parsing with lxml, u'Jesper Dahlb\xc3\xa4ck'.  I know this is
> supposed to be an a with an umlaut, Jesper Dahlbäck.  BUT shouldn't
> that be <realname>Jesper Dahlb&#xE4;ck</realname> ?? that is, hex e4 /
> decimal 228 / a with umlaut (http://www.tony-franks.co.uk/UTF-8.htm ).
>  Then again, the more I think about it, I have no idea what encoding
> these escaped XML byte sequences are supposed to be in.  The xml file
> of course doesn't specify an encoding.
> Or, is this some composite format that is trying to say a + umlaut?
> When I run chardet.detect() on the byte string it tells me EUC-KR
> (Korean) but that doesn't seem right.
> Kumar

Others have already responded, but I would like to see if I can make it clearer.

Unicode, or its predecessor, was originally intended to express 
diacritical marks as add-ons, but there was tremendous pressure to 
include the standard set of Latin characters with diacritical marks 
as well. So an a-umlaut can be coded either as a single precomposed 
code point, or as an 'a' followed by a combining diaeresis (U+0308).
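
To make the two spellings concrete, here is a small sketch using the standard unicodedata module. It assumes Python 3 (where every str is Unicode); the variable names are only for illustration.

```python
import unicodedata

precomposed = "\u00e4"   # single code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # two code points: 'a' + COMBINING DIAERESIS

# The two spellings look alike on screen but are different code point
# sequences, so a plain comparison says they differ.
same_as_written = precomposed == decomposed

# NFC composes 'a' + combining mark into the single code point;
# NFD splits the single code point back into base letter + mark.
composed_form = unicodedata.normalize("NFC", decomposed)
decomposed_form = unicodedata.normalize("NFD", precomposed)
```

Normalizing both sides to the same form before comparing is the usual way to treat the two spellings as equal.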

The code point for a-umlaut is U+00E4. That is just a number in an 
abstract code space, and not an encoding of any sort. Encoded in UTF-8 
it is the two bytes C3 A4 (hexadecimal), or 195 164 (decimal). As an 
HTML character reference it is &#xe4; or &#228;, which uses the code 
point (and not any UTF encoding). I have found that my browser is 
perfectly content displaying both the raw UTF-8 characters and the HTML 
escaped forms. I don't know XML, but I assume that the escape form is 
the same as in HTML.
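
The difference between the code point, its UTF-8 bytes, and the escaped form can be checked in a few lines. This is only a sketch, assuming Python 3 and its standard html module; the variable names are made up for the example.

```python
import html

a_umlaut = "\u00e4"

# The code point is just a number: 0xE4 hexadecimal, 228 decimal.
code_point = ord(a_umlaut)

# UTF-8 is one way of turning that number into bytes: C3 A4.
utf8_bytes = a_umlaut.encode("utf-8")

# A numeric character reference is built from the code point itself,
# not from the bytes of any encoding.
hex_reference = "&#x{:x};".format(code_point)
dec_reference = "&#{};".format(code_point)

# Unescaping either reference gives back the single character.
from_hex = html.unescape(hex_reference)
from_dec = html.unescape(dec_reference)
```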

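So it appears that whatever created the data took the two UTF-8 bytes 
and wrote each one out as its own XML character reference, which is 
dead wrong: a parser reads &#xC3;&#xA4; as the two code points U+00C3 
and U+00A4 rather than as a-umlaut, and it tends to confuse HTML 
horribly.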

It should be &#xe4;
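
If the file cannot be fixed at the source, the damage can often be undone after parsing. The sketch below assumes Python 3 and uses the standard library's xml.etree.ElementTree in place of lxml; repair_utf8_in_references is a hypothetical helper name, not part of any library. It relies on the fact that Latin-1 maps code points U+0000 through U+00FF straight back to the bytes 00 through FF.

```python
import xml.etree.ElementTree as ET

broken_xml = "<realname>Jesper Dahlb&#xC3;&#xA4;ck</realname>"

# The parser does what it was told: two references, two code points.
parsed = ET.fromstring(broken_xml).text   # 'Jesper Dahlb\xc3\xa4ck'


def repair_utf8_in_references(text):
    """Hypothetical helper: undo UTF-8 bytes that were escaped one by one.

    Encoding as Latin-1 turns each code point below U+0100 back into
    the byte it came from; decoding those bytes as UTF-8 then recovers
    the intended characters. Text that does not fit this pattern is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = repair_utf8_in_references(parsed)
```

This is a heuristic, not a general cure: it leaves correctly escaped text such as 'Jesper Dahlb\xe4ck' alone only because those bytes happen not to be valid UTF-8, so it is best applied to data already known to be damaged in this way.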


