[docs] [issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

Fri May 20 04:01:11 EDT 2016

Marc-Andre Lemburg added the comment:

Ben, the methods on strings and Unicode objects in Python 2.x are direct interfaces to the underlying codecs. The codecs can handle any number of input and output types, so there are some which only work on 8-bit strings (bytes) and others which take Unicode as input.

As a result, you sometimes see errors caused by the implicit conversion of an 8-bit string to Unicode (in cases where the codec expects Unicode input).

As an example, take the UTF-8 codec. It expects Unicode input when encoding, so when you pass in an 8-bit string, Python will first convert it to Unicode using the default encoding (which is normally set to 'ascii') and then apply the codec operation.

When the 8-bit string is plain ASCII this works great. If not, chances are high that you'll run into a Unicode error.
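The two-step behaviour described above can be sketched roughly as follows. This is an emulation written for Python 3, not the interpreter's actual code: `py2_style_encode` is a hypothetical helper name, and it only assumes that Python 2's `str.encode()` decodes with the default encoding before calling the requested codec.

```python
def py2_style_encode(data, encoding, default_encoding="ascii"):
    """Mimic Python 2's str.encode(): implicit decode, then the real encode."""
    # Step 1 (implicit in Python 2): 8-bit string -> Unicode via the default encoding.
    text = data.decode(default_encoding)
    # Step 2: the codec operation that was actually requested.
    return text.encode(encoding)


# Plain ASCII survives the implicit conversion.
ascii_result = py2_style_encode(b"hello", "utf-8")

# Non-ASCII bytes (here the UTF-8 form of U+00E9) fail in step 1.
try:
    py2_style_encode(b"caf\xc3\xa9", "utf-8")
    non_ascii_error = None
except UnicodeDecodeError as exc:
    # An *encode* request ends in a *decode* error, which is the confusing part.
    non_ascii_error = exc
```

The failure comes from the hidden first step, so the error names the default encoding ('ascii') rather than the codec that was asked for.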

Now, in Python 2.x you can change the default encoding, either to make this work by assuming that all your 8-bit strings are UTF-8 (set it to 'utf-8' in sitecustomize.py), or to disable the automatic conversion altogether by setting it to 'undefined', a codec created specifically for this purpose. The latter raises an exception on any attempt to convert an 8-bit string to Unicode, similar to what Python 3 does, except that the error type is different.
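As a hedged illustration of those two choices, the sketch below emulates the implicit conversion in Python 3. `implicit_to_unicode` is a hypothetical stand-in for Python 2's automatic coercion; the second case uses the standard library's `undefined` codec, which refuses every conversion.

```python
import codecs


def implicit_to_unicode(data, default_encoding):
    """Stand-in for Python 2's automatic 8-bit string -> Unicode coercion."""
    return codecs.decode(data, default_encoding)


utf8_bytes = b"caf\xc3\xa9"

# Choice 1: default encoding 'utf-8', so UTF-8 byte strings convert cleanly.
as_text = implicit_to_unicode(utf8_bytes, "utf-8")

# Choice 2: the 'undefined' codec rejects the conversion, even for plain ASCII.
try:
    implicit_to_unicode(b"plain ascii", "undefined")
    refused = False
except UnicodeError:
    refused = True
```

The first choice trades correctness for convenience if any 8-bit string is not actually UTF-8; the second makes every implicit conversion fail loudly, so mixed str/unicode code has to convert explicitly.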

Hope that helps.

----------
nosy: +lemburg

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________