[New-bugs-announce] [issue24019] str/unicode encoding kwarg causes exceptions

Tue Apr 21 06:57:04 CEST 2015

New submission from Mahmoud Hashemi:

The encoding keyword argument to the Python 3 str() and Python 2 unicode() constructors is excessively constraining to the practical use of these core types.

Looking at common usage, both these constructors' primary mode is to convert various objects into text:

>>> str(2)
'2'

But adding an encoding yields:

>>> str(2, encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to str: need bytes, bytearray or buffer-like object, int found

While the error message is fine for an experienced developer, I would like to raise the question: is it necessary at all? Even harmlessly getting a str from a str is punished, but leaving off encoding is fine again:

>>> str('hi', encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported
>>> str('hi')
'hi'

Merging and simplifying the two modes of these constructors would yield much more predictable results for experienced and beginning Pythonists alike. Basically, the encoding argument should be ignored if the argument is already a unicode/str instance, or if it is a non-string object. It should only be consulted if the primary argument is a bytestring. Bytestrings already have a .decode() method on them, another, obscurer version of it isn't necessary.

Furthermore, despite the core nature and widespread usage of these types, changing this behavior should break very little existing code and understanding. unicode() and str() will simply behave as expected more often, returning text versions of the arguments passed to them. 

Appendix: To demonstrate the expected behavior of the proposed unicode/str, here is a code snippet we've employed to sanely and safely get a text version of an arbitrary object:

def to_unicode(obj, encoding='utf8', errors='strict'):
    # the encoding default should look at sys's value
    try:
        return unicode(obj)
    except UnicodeDecodeError:
        return unicode(obj, encoding=encoding, errors=errors)

After many years of writing Python and teaching it to developers of all experience levels, I firmly believe that this is the right interaction pattern for Python's core text type. I'm also happy to expand on this issue, turn it into a PEP, or submit a patch if there is interest.

----------
components: Unicode
messages: 241699
nosy: ezio.melotti, haypo, mahmoud
priority: normal
severity: normal
status: open
title: str/unicode encoding kwarg causes exceptions
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue24019>
_______________________________________