[XML-SIG] utf8 conversion issue

Mike Brown mike@skew.org
Tue, 4 Jun 2002 14:03:03 -0600 (MDT)

Matt Zipay wrote:
> I recently discovered in PyXML 0.5.2 (more than a bit behind, I know) 
> that xml.unicode.utf8_iso.code_to_utf8() is returning incorrect values. 
> For example, the name "t&#xF8;rv&#xE5;ld" does not convert properly; it 
> *should* be "tørvåld" - the 'o' with a stroke, the 'a' with a ring 
> above. However, it gets mangled into "t\xc3\xb8rv\xc3\xa5ld". I took a 
> look at code_to_utf8() and noticed that it in turn calls utf8chr(), 
> which does a comparison to see if the ordinal passed in is <128. 
> Shouldn't it be <256??? Has anyone else wondered about or experienced this?
> Also, I noticed that the line doing the actual conversion reads "return 
> chr(0xc0 | (c>>6)) + chr(0x80 | (c & 0x3f))" where c is the ordinal. I 
> almost hesitate to ask, but why is it even necessary to bit-or and 
> -shift? Especially when this seems to yield incorrect results? Am I just 
> missing something?
> Any input is greatly appreciated.

Either I'm missing what the problem is, or you're missing the fact that the
results are correct. :)

Read up on UTF-8. It uses 1 byte for the ASCII range (U+0000 through U+007F) 
only, and it's a direct mapping (U+0020 --> byte x20). It uses 2 bytes for 
U+0080 through U+07FF, which is why the comparison is <128 and not <256.

Bytes C3 B8 are exactly what you want. That is, &#xF8; means Unicode character
number F8 (in hex): LATIN SMALL LETTER O WITH STROKE. In UTF-8 this character
is represented as 2 bytes: C3 B8.
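
To make the arithmetic concrete, here is a minimal sketch of the one- and
two-byte cases. It is written for Python 3, which is an assumption on my part
(the PyXML code in question is Python 2), and encode_up_to_two_bytes is a
hypothetical name, not a function from PyXML. The two-byte branch uses the
same bit-or and shift expression quoted above.

```python
def encode_up_to_two_bytes(code_point):
    """Encode a code point below U+0800 as UTF-8 (hypothetical helper)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # ASCII range: one byte, direct mapping.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx.
        lead = 0xC0 | (code_point >> 6)     # top 5 bits of the code point
        trail = 0x80 | (code_point & 0x3F)  # low 6 bits of the code point
        return bytes([lead, trail])
    raise ValueError("3- and 4-byte sequences are not covered by this sketch")


print(encode_up_to_two_bytes(0xF8).hex())  # c3b8
print(encode_up_to_two_bytes(0xE5).hex())  # c3a5
```

The shifting is needed because the 8 significant bits of F8 do not fit in the
6 payload bits of a single continuation byte, so they are split across the
lead byte and the trailing byte.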

"\xc3\xb8" is Python's unambiguous way of representing an object of type
string that consists of bytes C3 B8. Python's non-Unicode string objects are
just byte buffers, and you don't know what encoding they actually use (i.e.,
you don't know how those bytes map to Unicode characters). 
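
A small illustration of that ambiguity, again assuming Python 3 (where the
byte buffer type is called bytes): the same two bytes decode to different
characters depending on which encoding you claim they use. The variable names
are only for illustration.

```python
raw = b"\xc3\xb8"  # just two bytes; nothing here records an encoding

# Interpreted as UTF-8: one character, U+00F8.
as_utf8 = raw.decode("utf-8")

# Interpreted as ISO-8859-1: two characters, U+00C3 and U+00B8.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), [hex(ord(ch)) for ch in as_utf8])      # 1 ['0xf8']
print(len(as_latin1), [hex(ord(ch)) for ch in as_latin1])  # 2 ['0xc3', '0xb8']
```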

The string actually has the right bytes in it. When you print the string,
you're serializing those bytes to an output device. Depending on how you do
it, you'll either write the raw bytes to the device, or Python will do some
escaping for you, if it thinks you might need to keep the output bytes all in
the ASCII range.
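
As a rough sketch of those two outcomes, assuming Python 3 and using a
hypothetical helper write_raw: repr() produces the escaped, ASCII-only form,
while writing to a binary stream emits the bytes untouched.

```python
import io

data = b"t\xc3\xb8rv\xc3\xa5ld"  # UTF-8 bytes for the name in question

# The escaped form: what an interactive session shows for the object.
escaped = repr(data)
print(escaped)  # b't\xc3\xb8rv\xc3\xa5ld'


def write_raw(stream, payload):
    """Write bytes to a binary stream with no escaping (hypothetical helper)."""
    stream.write(payload)


# io.BytesIO stands in for an output device here; sys.stdout.buffer would
# send the same raw bytes to the terminal.
device = io.BytesIO()
write_raw(device, data)
print(device.getvalue() == data)  # True
```

Whether the raw bytes then display as the intended letters depends on the
encoding the output device expects.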

   - Mike
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/