[XML-SIG] utf8 conversion issue

Martin v. Loewis martin@v.loewis.de
04 Jun 2002 21:59:17 +0200


Matt Zipay <mzipay@ag.com> writes:

> I recently discovered in PyXML 0.5.2 (more than a bit behind, I know)
> that xml.unicode.utf8_iso.code_to_utf9() is returning incorrect
> values. 

Do you mean _to_utf8 here? If not, what exactly is utf9?

> For example, the name "tørvåld" does not convert properly;
> it *should* be "torvald" - the 'o' with a stroke, the 'a' with a
> ring above.

This is certainly different from "torvald".

> However, it gets mangled into "t\xc3\xb8rv\xc3\xa5ld".

That is correct. In UTF-8, LATIN SMALL LETTER O WITH STROKE
is \xc3\xb8, LATIN SMALL LETTER A WITH RING ABOVE is \xc3\xa5.
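This can be checked against an independent UTF-8 implementation. The sketch below assumes a modern Python 3 interpreter and its built-in codec, not the PyXML 0.5.2 code under discussion; the variable names are illustrative only.

```python
# Assumption: Python 3, where str is Unicode and encode() returns bytes.
name = "t\u00f8rv\u00e5ld"  # U+00F8 is o with stroke, U+00E5 is a with ring above

encoded = name.encode("utf-8")

# Each of the two non-ASCII characters becomes two bytes; ASCII stays as-is.
print(encoded)       # b't\xc3\xb8rv\xc3\xa5ld'
print(len(name))     # 7 characters
print(len(encoded))  # 9 bytes

# Decoding the bytes gives back the original name, so nothing was mangled.
assert encoded.decode("utf-8") == name
```

The byte string only looks wrong when it is displayed as if it were Latin-1 text.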

> I took a look at code_to_utf8() and noticed that it in turn calls
> utf8chr(), which does a comparison to see if the ordinal passed in
> is <128. Shouldn't it be <256???

No. If all characters below 256 stood for themselves (as single
bytes), how precisely would you encode characters from 256 upward?

In UTF-8, characters from 128 through 2047 are encoded as two bytes.
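The boundaries can be illustrated with a short sketch, again assuming Python 3's built-in codec rather than the PyXML code; `utf8_length` is a hypothetical helper introduced only for this example.

```python
def utf8_length(code_point):
    """Return how many bytes Python's UTF-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))

# Code points below 128 are single bytes (plain ASCII).
print(utf8_length(0x7F))   # 1
# From 128 (0x80) through 2047 (0x7FF), two bytes are needed.
print(utf8_length(0x80))   # 2
print(utf8_length(0xFF))   # 2
print(utf8_length(0x7FF))  # 2
# At 2048 (0x800) the encoding grows to three bytes.
print(utf8_length(0x800))  # 3
```

Byte values 0x80 to 0xFF never stand for a character on their own in UTF-8; they only appear as parts of multi-byte sequences, which is why the test has to be <128 and not <256.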

> Has anyone else wondered about or experienced this?

Not with this specific implementation of UTF-8, but certainly with
other implementations of UTF-8.

> Also, I noticed that the line doing the actual conversion reads
> "return chr(0xc0 | (c>>6)) + chr(0x80 | (c & 0x3f))" where c is the
> ordinal. I almost hesitate to ask, but why is it even necessary to
> bit-or and -shift? 

Because that's how UTF-8 is defined.
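To make the arithmetic concrete, here is a minimal sketch of the two-byte case using the same expression as the quoted line. It is written for Python 3 (returning bytes rather than a str), and `utf8_two_byte` is a hypothetical name, not the PyXML function.

```python
def utf8_two_byte(c):
    """Encode a code point in the two-byte range using shifts and masks."""
    if not 0x80 <= c <= 0x7FF:
        raise ValueError("only code points 0x80..0x7FF take two bytes")
    # Lead byte: bit pattern 110xxxxx carrying the top 5 of the 11 bits.
    lead = 0xC0 | (c >> 6)
    # Continuation byte: bit pattern 10xxxxxx carrying the low 6 bits.
    cont = 0x80 | (c & 0x3F)
    return bytes([lead, cont])

# U+00F8: 0xF8 >> 6 == 0x03 and 0xF8 & 0x3F == 0x38,
# so the bytes are 0xC0 | 0x03 == 0xC3 and 0x80 | 0x38 == 0xB8.
print(utf8_two_byte(0xF8))  # b'\xc3\xb8'
print(utf8_two_byte(0xE5))  # b'\xc3\xa5'

# The result agrees with Python's own codec.
assert utf8_two_byte(0xF8) == "\u00f8".encode("utf-8")
```

The fixed high bits (110 and 10) are what let a decoder tell lead bytes, continuation bytes, and plain ASCII apart.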

> Especially when this seems to yield incorrect results? Am I just
> missing something?

Most definitely.

Regards,
Martin