[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 05:28:57 CEST 2014

Glenn Linderman writes:

 > Some bytes may decode into characters without needing to be
 > smuggled... maybe not in text-protocols like email, but in the
 > general case. So then some of the bytes that should be interpreted
 > as binary data are not in a disjoint set from characters.

True, but irrelevant.  The point is that whoever chose the codec is
responsible for getting it right: not only for picking the right
encoding, but also for the assumption that the input data was pure
encoded text.  The rest of the program can then assume that choice was
made correctly, and process text as text.  The program cannot be
blamed for assuming that the person who chose the codec knew what they
were about, and so characters can be *assumed* to have been decoded
from bytes representing characters.
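
To make both halves of that concrete, here is a minimal sketch (the
variable names are mine, purely illustrative).  It shows that binary
data can decode "successfully" under a codec such as latin-1, and that
the surrogateescape error handler smuggles undecodable bytes through
as lone surrogates that round-trip back to the original bytes.  Either
way, the decision is made once, where the codec is chosen.

```python
# Four bytes that are really binary data, not encoded text.
blob = bytes([0x00, 0x41, 0xE9, 0xFF])

# latin-1 maps every byte to a character, so decoding never fails,
# even though this input was never text in the first place.
as_latin1 = blob.decode("latin-1")

# UTF-8 cannot decode 0xE9 and 0xFF here; surrogateescape smuggles
# them through as the lone surrogates U+DCE9 and U+DCFF.
as_utf8 = blob.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
round_trip = as_utf8.encode("utf-8", errors="surrogateescape")
```

Past this point the rest of the program only sees str objects; whether
they mean what the programmer thinks they mean depends entirely on the
codec choice made above.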

This was not true in Python 2, where it was common practice to
represent text internally in its encoded form, implicitly assuming
that only one encoding would be encountered in each invocation of the
program.  That assumption never held, and with the spread of the
Internet and then the WWW, it became a major issue.  And that's why we
invented Python 3: to let text be text, without the encumbrance of
always being aware of encodings, converting when different encodings
collide, etc.
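
A small sketch of the collision, assuming two inputs that arrive in
different encodings (the sample strings and names are made up for
illustration).  Keeping the encoded bytes and joining them, as was
common in Python 2, yields data that no single codec decodes
correctly; decoding each input at the boundary yields plain text.

```python
# Two inputs arriving in different encodings.
name_latin1 = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
name_utf8 = "na\u00efve".encode("utf-8")      # b'na\xc3\xafve'

# Python 2 habit: keep the encoded bytes and join them directly.
mixed = name_latin1 + b" " + name_utf8

# The result is not valid UTF-8 ...
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ... and decoding it as latin-1 "works" but garbles the UTF-8 part.
garbled = mixed.decode("latin-1")

# Python 3 habit: decode each input with its own codec at the
# boundary, then work with text only.
text = name_latin1.decode("latin-1") + " " + name_utf8.decode("utf-8")
```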