[Python-ideas] Why decode()/encode() name is harmful
Andrew Barnert
abarnert at yahoo.com
Fri May 29 21:57:16 CEST 2015
On May 29, 2015, at 01:56, anatoly techtonik <techtonik at gmail.com> wrote:
>
> First, let me start with The Curse of Knowledge
> https://en.wikipedia.org/wiki/Curse_of_knowledge
> which can be summarized as:
>
> "Once you get something, it becomes hard
> to think how it was to be without it".
>
> I assume that all of you know the difference between
> decode() and encode(), so you're cursed and
> therefore think that getting it right is just a
> matter of reading documentation, experience, and
> time. But quite a lot of time has passed, Python 2
> is still there, and Python 3, which is all unicode
> at the core (and which is great for people who
> finally get it), is not as popular. So remember that
> you are biased towards (or against) the
> decode/unicode way of thinking.
>
>
> Now imagine a person who has a text file. The
> person needs to process it with Python. That
> person is probably a journalist and doesn't know
> anything that "any developer should know about
> unicode". In Python 2 he just copy-pastes regular
> expressions to match the letters and is happy. In
> Python 3 he needs to *convert* that text to unicode.
No, he doesn't. In Python 3, unless he goes out of his way to open the file in binary mode, or to use binary string literals for his regexps, that text is unicode from the moment his code sees it. So he doesn't have to read the docs.
Python 3 was deliberately designed to make it easier to never have to use bytes internally, so 80% of the users never even have to think about bytes (even at the cost of sometimes making things harder for the more advanced coders who need to write the low-level stuff like network protocol handlers and can't avoid bytes).
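To make that concrete, here's a minimal sketch of what that 80% case looks like (the filename "article.txt" is hypothetical): open() in text mode hands you already-decoded str objects, and regexes just work on them.

    import re

    # Text mode: Python decodes for you, so f.read() returns str, not bytes.
    with open("article.txt", encoding="utf-8") as f:
        text = f.read()

    # \w matches Unicode word characters by default in Python 3,
    # so non-ASCII letters need no special handling.
    words = re.findall(r"\w+", text)

    # Bytes only show up if you explicitly ask for binary mode:
    with open("article.txt", "rb") as f:
        raw = f.read()  # bytes; only now would decode() enter the picture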
Now, all those things _are_ still problems for people who use Python 2. But the only way to fix that is to get those people--and, even more importantly, new people--using Python 3. Which means not introducing any new radical inconsistencies between Python 2 and 3 (or 4) for no good reason--or, of course, between Python 3.5 and 3.6 (or 4.0).
> Then he tries to read the documentation, and it
> already starts to bring conflict into his mind. It
> tells him to "decode" the text.
Where in the documentation does it ever tell you to decode text? If you're inventing fictitious documentation that would confuse people if it existed--but it doesn't exist--then you can just as well claim that int is confusing because it tells you to truncate your integers even though integers are already truncated. Yes, that would be confusing--which is why the docs don't say that.
> I don't know about you,
> but when I'm told to decode the text, I
> assume that it is encrypted, because I watched a
> few spy movies, including ones with Sherlock
> Holmes and Stierlitz.
If you open Shift-JIS text as if it were Latin-1 and see a mess of mojibake, it doesn't seem that surprising to be told that you need to decode it properly.
If you open UTF-8 text as if it were UTF-8, and Python has already decoded it for you under the covers, you never have to think about it, so there's no opportunity to be surprised.
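Here's a short sketch of that first case (the sample string is arbitrary):

    # Japanese text encoded as Shift-JIS: b'\x93\xfa\x96{\x8c\xea'
    data = "日本語".encode("shift_jis")

    # Latin-1 maps every byte to some character, so this never raises --
    # it just silently produces mojibake.
    print(data.decode("latin-1"))

    # Decoding with the right codec recovers the original text.
    print(data.decode("shift_jis"))  # -> 日本語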
> But the text looks legit to me:
> I can clearly see and read it, and now you say that
> I need to decode it. You're basically ruining my
> world right here. No wonder I will resist. I'm
> probably stressed, have a lot of stuff to do, and you
> are trying to load me with all these abstract
> concepts that conflict with what I know. No way!
> Unless I have a really strong motivation (or a
> scientific background) there is no chance I will get
> this stuff right on that day. I will probably
> repeat the exercise and after a few tries will get
> the output right, but there is no chance I will
> remember any of it the next day.
That's a good point. That's exactly why you see people add random calls to str, unicode, encode, and decode to their Python 2 code until it seems to do the right thing on their one test input, and then freak out when it doesn't work on their second test input and go post a confused mess on StackOverflow or Python-list asking someone to solve it for them.
What's the solution? Make it as unlikely as possible that you'll run into the problem in the first place by nearly forcing you to deal in Unicode all the way through your script, and, when you do need to deal with manual encoding and decoding, make the almost-certainly-wrong nonsensical code impossible to write by not having bytes.encode or str.decode or automatic conversions between the two types. Of course that's a backward-incompatible change, and maybe a radical-enough one that it'll take half a decade for the ecosystem to catch up to the point where most users can benefit from it. Which makes it a good thing that Python started that process half a decade ago. So now, to anyone who runs into that confusion, there's an answer: just upgrade from 2.7 to 3.4, undo all the changes you introduced trying to solve this problem incorrectly, and your original code just works.
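To illustrate that last point (a sketch, not anyone's real code): in Python 3 each conversion only goes one way, and the Python 2 footguns simply don't exist.

    text = "héllo"
    data = text.encode("utf-8")          # str -> bytes is the only direction encode() goes
    assert data.decode("utf-8") == text  # bytes -> str is the only direction decode() goes

    # The nonsensical Python 2 idioms are gone entirely:
    assert not hasattr(text, "decode")   # str has no decode() in Python 3
    assert not hasattr(data, "encode")   # bytes has no encode() in Python 3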
Even if you had a better solution than Python 3's (which I doubt, but let's assume you do), what good would that do? That would make the answer: wait 18 months for Python 3.6, then another 12 months for the last of the packages you depend on to finally adjust to the breaking incompatibility that 3.6 introduced, then undo all the changes you introduced trying to solve this problem incorrectly, then make different, more sensible, changes. That's clearly not a better answer.
So, unless you have a better solution than Python 3's and also have a time machine to go back to 2007, what could you possibly have to propose?
> Because
> rewiring neural paths in my brain is much harder
> than paving them from scratch.
> --
> anatoly t.