[XML-SIG] utf8 conversion issue

Matt Zipay mzipay@ag.com
Tue, 04 Jun 2002 14:55:23 -0400


I recently discovered in PyXML 0.5.2 (more than a bit behind, I know) 
that xml.unicode.utf8_iso.code_to_utf8() is returning incorrect values. 
For example, the name "tørvåld" does not convert properly; it 
*should* keep its special characters - the 'o' with a stroke, the 'a' 
with a ring above. However, it gets mangled into 
"t\xc3\xb8rv\xc3\xa5ld". I took a 
look at code_to_utf8() and noticed that it in turn calls utf8chr(), 
which does a comparison to see if the ordinal passed in is <128. 
Shouldn't it be <256??? Has anyone else wondered about or experienced this?
Also, I noticed that the line doing the actual conversion reads "return 
chr(0xc0 | (c>>6)) + chr(0x80 | (c & 0x3f))" where c is the ordinal. I 
almost hesitate to ask, but why is it even necessary to bit-or and 
-shift? Especially when this seems to yield incorrect results? Am I just 
missing something?
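
To make the question concrete, here is a minimal sketch of what the quoted line appears to compute, written for a current Python 3 interpreter rather than the Python of PyXML 0.5.2. The name utf8chr_sketch is hypothetical and is not the PyXML function; the sketch assumes only the <128 test and the two-byte expression quoted above, and it stops at code points below 0x800. The shift and the mask split the ordinal into the upper bits and the lower six bits, which is how UTF-8 lays out a two-byte sequence.

```python
def utf8chr_sketch(c):
    """Encode one code point below 0x800 as UTF-8 bytes (hypothetical helper)."""
    if c < 0x80:
        # ASCII range: a single byte, unchanged.
        return bytes([c])
    if c < 0x800:
        # Two-byte form 110xxxxx 10xxxxxx: the bits above the low six go
        # in the lead byte, the low six bits go in the continuation byte.
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    raise ValueError("sketch only covers code points below 0x800")


name = "t\u00f8rv\u00e5ld"  # o with a stroke, a with a ring above
encoded = b"".join(utf8chr_sketch(ord(ch)) for ch in name)
print(encoded)  # b't\xc3\xb8rv\xc3\xa5ld'
```

Under these assumptions the result is byte-for-byte the same as name.encode("utf-8") in Python 3, so the bytes \xc3\xb8 and \xc3\xa5 are the two-byte UTF-8 forms of the two accented letters, and the <128 test marks where single-byte ASCII ends.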
Any input is greatly appreciated.