[docs] [issue26369] doc for unicode.decode and str.encode is unnecessarily confusing

Feb. 16, 2016

      New submission from Ben Spiller:

It's well known that lots of people struggle writing correct programs using non-ascii strings in python 2.x, but I think one of the main reasons for this could be very easily fixed with a small addition to the documentation for str.encode and unicode.decode, which is currently quite vague. 

The decode/encode methods really make most sense when called on a unicode string i.e. unicode.encode() to produce a byte string, or on a byte string e.g. str.decode() to produce a unicode object from a byte string. 

However, the additional presence of the opposite methods str.encode() and unicode.decode() is quite confusing, and a frequent source of errors - e.g. calling str.encode('utf-8') first DECODES the str object (which might already be in utf8) to a unicode string **using the default encoding of "ascii"** (!) before ENCODING to a utf-8 byte str as requested, which of course will fail at the first stage with the classic error "UnicodeDecodeError: 'ascii' codec can't decode byte" if there are any non-ascii chars present. It's unfortunate that this initial decode/encode stage ignores both the "encoding" argument (used only for the subsequent encode/decode) and the "errors" argument (commonly used when the programmer is happy with a best-effort conversion e.g. for logging purposes).

Anyway, given this behaviour, a lot of time would be saved by a simple sentence on the doc for str.encode()/unicode.decode() essentially warning people that those methods aren't that useful and they probably really intended to use str.decode()/unicode.encode() - the current doc gives absolutely no clue about this extra stage which ignores the input arguments and sues 'ascii' and 'strict'. It might also be worth stating in the documentation that the pattern (u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases where you unavoidably have to deal with both kinds of input, string calling str.encode is such a bad idea. 

In an ideal world I'd love to see the implementation of str.encode/unicode.decode changed to be more useful (i.e. instead of using ascii, it would be more logical and useful to use the passed-in encoding to perform the initial decode/encode, and the apss-in 'errors' value). I wasn't sure if that change would be accepted so for now I'm proposing better documentation of the existing behaviour as a second-best.

----------
assignee: docs@python
components: Documentation
messages: 260359
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: doc for unicode.decode and str.encode is unnecessarily confusing
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________