[XML-SIG] utf8 conversion issue
Martin v. Loewis
martin@v.loewis.de
04 Jun 2002 21:59:17 +0200
Matt Zipay <mzipay@ag.com> writes:
> I recently discovered in PyXML 0.5.2 (more than a bit behind, I know)
> that xml.unicode.utf8_iso.code_to_utf9() is returning incorrect
> values.
Do you mean _to_utf8 here? If not, what exactly is utf9?
> For example, the name "tørvåld" does not convert properly;
> it *should* be "torvald" - the 'o' with a stroke, the 'a' with a
> ring above.
This is certainly different from "torvald".
> However, it gets mangled into "t\xc3\xb8rv\xc3\xa5ld".
That is correct. In UTF-8, LATIN SMALL LETTER O WITH STROKE (U+00F8)
is \xc3\xb8, LATIN SMALL LETTER A WITH RING ABOVE (U+00E5) is \xc3\xa5.
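As a quick check, the same bytes come out of the built-in codec in a current Python 3 interpreter. This is only a sketch for verification; it does not use the PyXML 0.5.2 code under discussion, and the variable names are made up for illustration.

```python
import unicodedata

# The name from the report, written with explicit escapes so the
# source file's own encoding cannot interfere.
name = "t\u00f8rv\u00e5ld"

# Encode with Python's built-in UTF-8 codec (not xml.unicode.utf8_iso).
encoded = name.encode("utf-8")
print(encoded)

# Official Unicode names of the two non-ASCII characters.
o_name = unicodedata.name("\u00f8")
a_name = unicodedata.name("\u00e5")
print(o_name)
print(a_name)
```

Each of the two non-ASCII characters takes two bytes, so the seven-character name becomes nine bytes.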
> I took a look at code_to_utf8() and noticed that it in turn calls
> utf8chr(), which does a comparison to see if the ordinal passed in
> is <128. Shouldn't it be <256???
No. If every character below 256 stood for itself (as a single
byte), how precisely would you encode characters from 256 upward?
In UTF-8, code points from 128 through 2047 are encoded as two bytes.
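To make the thresholds concrete, here is a small sketch in current Python 3 that maps a code point to its UTF-8 length. The function name utf8_length is hypothetical and is not part of PyXML; the built-in codec is used only as a cross-check.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a code point (hypothetical helper)."""
    if code_point < 0x80:       # 0..127: ASCII, one byte, stands for itself
        return 1
    if code_point < 0x800:      # 128..2047: two bytes
        return 2
    if code_point < 0x10000:    # 2048..65535: three bytes
        return 3
    return 4                    # up to U+10FFFF: four bytes

# Cross-check a few boundary values against the built-in codec.
for cp in (0x7F, 0x80, 0xF8, 0xFF, 0x7FF, 0x800):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

Note that 0xF8 and 0xFF both fall in the two-byte range, which is why the comparison has to be against 128 rather than 256.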
> Has anyone else wondered about or experienced this?
Not with this specific implementation of UTF-8, but certainly with
other implementations of UTF-8.
> Also, I noticed that the line doing the actual conversion reads
> "return chr(0xc0 | (c>>6)) + chr(0x80 | (c & 0x3f))" where c is the
> ordinal. I almost hesitate to ask, but why is it even necessary to
> bit-or and -shift?
Because that's how UTF-8 is defined.
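The two-byte form is 110xxxxx 10xxxxxx: the lead byte carries the upper five bits of the code point and the continuation byte the lower six. The sketch below restates the quoted expression in current Python 3, where bytes are built with bytes() rather than chr(); the name utf8_two_byte is hypothetical and this is not the PyXML source.

```python
def utf8_two_byte(c):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= c < 0x800:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0xC0 | (c >> 6)      # 110xxxxx: marker bits plus the upper 5 bits
    trail = 0x80 | (c & 0x3F)   # 10xxxxxx: marker bits plus the lower 6 bits
    return bytes([lead, trail])

# The two characters from the report.
o_stroke = utf8_two_byte(0xF8)  # LATIN SMALL LETTER O WITH STROKE
a_ring = utf8_two_byte(0xE5)    # LATIN SMALL LETTER A WITH RING ABOVE
print(o_stroke, a_ring)
```

For 0xF8 the shift gives 3, so the lead byte is 0xC0 | 3 = 0xC3, and the low six bits are 0x38, so the continuation byte is 0x80 | 0x38 = 0xB8, which matches the reported output.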
> Especially when this seems to yield incorrect results? Am I just
> missing something?
Most definitely.
Regards,
Martin