Tim Peters email@example.com wrote:
... I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal).
Then you don't want Unicode at all, Moshe. All the official encoding schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of Unicode not yet having assigned a character to this position, it's that the standard explicitly makes this sequence illegal and guarantees it will always be illegal!)
in context, I think what Moshe meant was that with a straight character code mapping, any 8-bit string can always be mapped to a unicode string and back again.
given a byte array "b":
    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)
again, this is no different from casting an integer to a long integer and back again (imagine having to do that at the bits-and-bytes level!).
and again, the internal unicode encoding used by the unicode string type itself, or when serializing that string type, has nothing to do with that.
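for illustration only, here is a minimal sketch of the same round trip in a current Python 3. it assumes that "latin-1" can stand in for the hypothetical "default" mapping above, since that codec maps byte value n straight to code point n; the variable names are made up for the example.

```python
# assumption: "latin-1" stands in for the hypothetical "default" mapping,
# because it maps byte value n directly to code point n.
b = bytes(range(256))  # every possible 8-bit value

# bytes -> unicode string -> bytes, with the ordinals unchanged
u = b.decode("latin-1")
assert [ord(ch) for ch in u] == list(b)
assert u.encode("latin-1") == b

# serializing the string is a separate step; whichever encoding is used,
# decoding it again gives back the same code points
for codec in ("utf-8", "utf-16-le"):
    assert u.encode(codec).decode(codec) == u

# by contrast, interpreting the raw bytes as UTF-8 is not always legal
try:
    b.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

the first two asserts are the character code mapping; the loop is the serialization step, which never changes what the string contains; the last part shows the kind of illegal byte sequence Tim is referring to.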