different encodings for unicode() and u''.encode(), bug?

"Martin v. Löwis" martin at v.loewis.de
Sat Jan 12 18:19:05 EST 2008


> What I'd like to understand better is the "compatibility hierarchy" of
> known encodings, in the positive sense that if a string decodes
> successfully with encoding A, then it is also possible that it will
> encode with encodings B, C; and in the negative sense that if a
> string fails to decode with encoding A, then for sure it will also
> fail to decode with encodings B, C. Any ideas if such an analysis of
> the relationships between encodings exists?

Most certainly. You'll have to learn a lot about many encodings though
to really understand the relationships.

Many encodings X are "ASCII supersets", in the sense that if you have
only characters in the ASCII set, the encoding of the string in ASCII
is the same as the encoding of the string in X. ISO-8859-X, ISO-2022-X,
koi8-x, and UTF-8 fall in this category.
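A quick way to check the "ASCII superset" property is to encode a
pure-ASCII string with several codecs and compare the bytes. This is
only a sketch: it assumes Python 3 (where str is the Unicode type),
and the codec names are the ones Python ships, picked as
representatives of the families named above.

```python
# Sketch, assuming Python 3. Codec names are Python's own aliases,
# chosen here as examples of the families mentioned in the text.
text = "plain ASCII text"
ascii_bytes = text.encode("ascii")

supersets = ("iso-8859-1", "iso-8859-15", "koi8-r", "utf-8", "iso2022_jp")
encoded = {name: text.encode(name) for name in supersets}

for name, raw in encoded.items():
    # For ASCII-only input, each codec yields the same bytes as ASCII.
    print(name, raw == ascii_bytes)
```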

Other encodings are "ASCII supersets" only in the sense that they
include all characters of ASCII, but encode them differently. EBCDIC
and UCS-2/4, UTF-16/32 fall in that category.
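To illustrate the difference, the following sketch encodes a single
ASCII character with an EBCDIC code page and with UTF-16/32. It
assumes Python 3; cp500 is just one of the EBCDIC code pages Python
provides, and the explicit -le/-be variants are used so that no
byte-order mark is emitted.

```python
# Sketch, assuming Python 3. cp500 is one EBCDIC code page among several.
text = "A"
ascii_bytes = text.encode("ascii")       # the single byte 0x41
ebcdic_bytes = text.encode("cp500")      # same character, different byte
utf16_bytes = text.encode("utf-16-le")   # two bytes per code unit
utf32_bytes = text.encode("utf-32-be")   # four bytes per code point

for name, raw in (("cp500", ebcdic_bytes),
                  ("utf-16-le", utf16_bytes),
                  ("utf-32-be", utf32_bytes)):
    # The character is representable, but not as the ASCII byte sequence.
    print(name, raw.hex(), raw != ascii_bytes, raw.decode(name) == text)
```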

Some encodings are 7-bit, so that data in them also decodes as ASCII
(producing mojibake if the input wasn't ASCII). ISO-2022-X is an example.
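As a sketch of the 7-bit case (assuming Python 3 and its iso2022_jp
codec as the ISO-2022-X representative): Japanese text encoded this
way contains only bytes below 0x80, so decoding it as ASCII succeeds
but yields escape sequences and unrelated characters.

```python
# Sketch, assuming Python 3 and the iso2022_jp codec.
text = "日本語"
raw = text.encode("iso2022_jp")

# Every byte has the high bit clear, so the data looks like ASCII.
seven_bit = all(b < 0x80 for b in raw)

# Decoding as ASCII raises no error, but the result is garbage.
garbage = raw.decode("ascii")

print(seven_bit, repr(garbage), garbage == text)
```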

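Some encodings are 8-bit, so that they can decode arbitrary bytes
(again producing mojibake if the input wasn't in that encoding).
ISO-8859-X are examples (apart from the few parts that leave some
byte values unassigned), as are some of the EBCDIC encodings, and
koi8-x. Also, most input will successfully (but meaninglessly) decode
as UTF-16 if the number of bytes is even, as long as it contains no
unpaired surrogates. UTF-32 is stricter: the length must be a multiple
of four and every value must be a valid code point.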
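The 8-bit and UTF-16 behaviour can be checked along these lines. This
is a sketch assuming Python 3; decodes() is a hypothetical helper
written for this example, and the three 8-bit codecs are ones whose
Python tables define all 256 byte values. Note that Python's UTF-16
decoder also rejects unpaired surrogates, so an even length alone is
not quite sufficient.

```python
# Sketch, assuming Python 3. decodes() is a helper made up for this example.
def decodes(data, encoding):
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

all_bytes = bytes(range(256))

# These 8-bit codecs assign a character to every possible byte value.
eight_bit = {name: decodes(all_bytes, name)
             for name in ("iso-8859-1", "koi8-r", "cp500")}

# ASCII text of even length decodes as UTF-16, but meaninglessly.
even_ok = decodes(b"spam", "utf-16-le")
meaningless = b"spam".decode("utf-16-le")   # two unrelated characters

# Odd length fails, as does a surrogate that is not part of a pair.
odd_ok = decodes(b"spam!", "utf-16-le")
lone_surrogate_ok = decodes(b"\x00\xd8\x41\x00", "utf-16-le")

print(eight_bit, even_ok, repr(meaningless), odd_ok, lone_surrogate_ok)
```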

HTH,
Martin


