UTF-8 Encoding Error

Steve D'Aprano steve+python at pearwood.info
Thu Dec 29 20:46:03 EST 2016


On Sun, 25 Dec 2016 04:50 pm, Grady Martin wrote:

> On 2016-12-22 22:38, subhabangalore at gmail.com wrote:
>>I am getting the error:
>>UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15:
>>invalid start byte
> 
> The following is a reflex of mine, whenever I encounter Python 2 Unicode
> errors:
> 
> import sys
> reload(sys)
> sys.setdefaultencoding('utf8')


This is a BAD idea, and doing it by "reflex" without very careful thought is
just cargo-cult programming. You should not thoughtlessly change the
default encoding without knowing what you are doing -- and if you know what
you are doing, you won't change it at all.

The Python interpreter *deliberately* removes setdefaultencoding at startup
(the site module deletes it from sys), which is why you need the reload(sys)
hack to get it back. Changing the default encoding can break the
interpreter, and it is NEVER what you actually need. If you think you want
it because it fixes "Unicode errors", all you are doing is covering up bugs
in your code.
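
For the error quoted above, the real fix is to decode the bytes explicitly,
using the encoding the data is actually in. Here is a minimal sketch. The
sample bytes and the helper name to_text are made up for illustration, and
Windows-1252 is only a guess on my part: byte 0x96 is an en dash in that
encoding, but the original poster never said where the data came from.

```python
# -*- coding: utf-8 -*-
# Sketch only: the sample bytes and the cp1252 guess are assumptions,
# not taken from the original poster's data.

def to_text(raw, encoding="cp1252"):
    """Decode a byte string explicitly instead of relying on the default."""
    return raw.decode(encoding)

raw = b"pages 10\x9620"       # hypothetical input containing byte 0x96

try:
    raw.decode("utf-8")       # 0x96 cannot start a UTF-8 sequence
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

text = to_text(raw)           # explicit decode; 0x96 becomes U+2013 (en dash)
```

If the guess about the encoding is wrong, the decode will either fail or
give you mojibake, and either way you find out that you need to ask where
the bytes came from. That is the bug setdefaultencoding hides.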

Here is some background on why setdefaultencoding exists, and why it is
dangerous:

https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

If you have set the Python 2 default encoding to anything but ASCII, you are
now running a broken system with subtle bugs, including in data structures
as fundamental as dicts.

The standard behaviour:

py> d = {u'café': 1}
py> for key in d:
...     print key == 'caf\xc3\xa9'
...
False


As we should expect: the key in the dict, u'café', is *not* the same as the
byte-string 'caf\xc3\xa9'. But watch how we can break dictionaries by
changing the default encoding:

py> import sys
py> reload(sys)
<module 'sys' (built-in)>
py> sys.setdefaultencoding('utf-8')  # don't do this
py> for key in d:
...     print key == 'caf\xc3\xa9'
...
True


So Python now thinks that 'caf\xc3\xa9' is a key. Or does it?

py> d['caf\xc3\xa9']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'caf\xc3\xa9'

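By changing the default encoding, we now have something which is both a key
and not a key of the dict at the same time. The == comparison succeeds
because Python 2 implicitly decodes the byte string using the new default
encoding, but the lookup fails because the byte string and the Unicode
string still hash to different values.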
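
The way to avoid this whole class of bug is to keep dict keys as text and
decode incoming bytes explicitly before using them. A minimal sketch,
assuming the bytes really are UTF-8; the name raw_key is hypothetical and
stands in for data read from a file or socket:

```python
# -*- coding: utf-8 -*-
# Sketch only: raw_key stands in for bytes arriving from outside the program.

d = {u"caf\u00e9": 1}             # text keys, as in the session above
raw_key = b"caf\xc3\xa9"          # the same word encoded as UTF-8 bytes

key = raw_key.decode("utf-8")     # convert once, at the boundary
value = d[key]                    # lookup now uses text on both sides
```

No change to the default encoding is needed, and the same code runs
unchanged on Python 3.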



> A relevant Stack Exchange thread awaits you here:
> 
> http://stackoverflow.com/a/21190382/2230956

And that's why I don't trust StackOverflow. It's not bad for answering
simple questions, but once the question becomes more complex the quality of
accepted answers goes down the toilet. The highest-voted answer is *wrong*
and *dangerous*.

And then there's this comment:

    Until this moment I was forced to include "# -- coding: utf-8 --" at 
    the begining of each document. This is way much easier and works as
    charm

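I have no words for how wrong that is. The coding declaration tells the
compiler how to decode the source file; the default encoding governs
implicit conversions between byte strings and Unicode strings at run time.
One is not a substitute for the other. And this comment: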

    ty, this worked for my problem with python throwing UnicodeDecodeError
    on var = u"""vary large string"""

No it did not. There is no possible way that Python will throw that
exception on assignment to a Unicode string literal.

It is posts like this that demonstrate how untrustworthy StackOverflow can
be.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


