Why decode()/encode() name is harmful
First, let me start with The Curse of Knowledge https://en.wikipedia.org/wiki/Curse_of_knowledge which can be summarized as: "Once you get something, it becomes hard to imagine how it was to be without it". I assume that all of you know the difference between decode() and encode(), so you're cursed and therefore think that getting it right is just a matter of reading documentation, experience and time. But quite a lot of time has passed and Python 2 is still there, and Python 3, which is all unicode at the core (and which is great for people who finally get it), is not as popular. So, remember that you are biased towards (or against) decode/unicode perception.

Now imagine a person who has a text file. The person needs to process it with Python. That person is probably a journalist and doesn't know anything that "any developer should know about unicode". In Python 2 he just copy-pastes regular expressions to match the letters and is happy. In Python 3 he needs to *convert* that text to unicode. Then he tries to read the documentation, and it already starts to bring conflict to his mind. It tells him to "decode" the text.

I don't know about you, but when I'm told to decode text, I assume that it is encrypted, because I watched a few spy movies including ones with Sherlock Holmes and Stierlitz. But the text looks legit to me, I can clearly see and read it, and now you say that I need to decode it. You're basically ruining my world right here. No wonder that I will resist. I am probably stressed, have a lot of stuff to do, and you are trying to load me with all those abstract concepts that conflict with what I know. No way! Unless I have a really strong motivation (or scientific background) there is no chance for me to get this stuff right on this day. I will probably repeat the exercise and after a few tries will get the output right, but there is no chance I will remember this thing on that day. Because rewiring neural paths in my brain is much harder than paving them from scratch.
-- anatoly t.
On Fri, May 29, 2015 at 6:56 PM, anatoly techtonik <techtonik@gmail.com> wrote:
This is because you fundamentally do not understand the difference between bytes and text. Consequently, you are trying to shoehorn new knowledge into your preconceived idea that the file *already contains text*, which is not true. Go read: http://www.joelonsoftware.com/articles/Unicode.html http://nedbatchelder.com/text/unipain.html Also, why is this on python-ideas? Talk about this sort of thing on python-list. ChrisA
On Fri, May 29, 2015, at 04:56, anatoly techtonik wrote:
Let's think about how it is to be without _the idea that text is a byte stream in the first place_ - which some people here learned from Python 2, some learned from C, some may have learned from some other language. It was the way things always were, after all, before Unicode came along. The language I was using the most immediately before I started using Python was C#. And C# uses Unicode (well, UTF-16, but the important thing is that it's not an ASCII-compatible sequence of bytes) for strings. One could argue that this paradigm - and the attendant "encode" and "decode" concepts, and stream wrappers that take care of it in the common cases, are _the future_, and that one day nobody will learn that text's natural form is as a sequence of ASCII-compatible bytes... even if text files continue to be encoded that way on the disk.
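To make the "not an ASCII-compatible sequence of bytes" remark concrete, here is a small sketch in Python 3. The sample string is an arbitrary choice for illustration; the point is only that the same text has different byte representations depending on the encoding.

```python
# The same two-character text under two encodings.
text = "hi"

utf8_bytes = text.encode("utf-8")       # ASCII-compatible: one byte per ASCII character
utf16_bytes = text.encode("utf-16-le")  # two bytes per character here, with zero bytes interleaved

print(utf8_bytes)
print(utf16_bytes)
```

Both byte strings decode back to the same str, which is the sense in which the text itself is separate from any particular byte form.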
You don't have to do so explicitly, if the text file's encoding matches your locale. You can just open the file and read it, and it will open as a text-mode stream that takes care of this for you and returns unicode strings. It's a text file, so you open it in text mode. Even if it doesn't match your locale, the proper way is to pass an "encoding" argument to the open function; not to go so deep as to open it in binary mode and decode the bytes yourself.
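A minimal sketch of the text-mode approach described above, assuming Python 3. The file name, its contents, and the choice of cp1251 are made up for the example; the only API used is the built-in open() with its "encoding" argument.

```python
import os
import tempfile

# Hypothetical sample: Russian text stored on disk as cp1251 bytes.
sample = "Привет, мир"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(sample.encode("cp1251"))

# Text mode with an explicit encoding: the stream does the decoding,
# and read() hands back a str without any manual decode() call.
with open(path, encoding="cp1251") as f:
    text = f.read()

print(type(text).__name__, text == sample)
```

If the file's encoding already matched the locale, the "encoding" argument could be left out, as the message says.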
On Fri, May 29, 2015 at 3:56 AM, anatoly techtonik <techtonik@gmail.com> wrote:
So, ignoring your lack of suggestions for different names, would you also argue that the codecs module (which is how people should be handling this when dealing with files on disk) should also be renamed? codecs is a portmanteau of coder-decoder and deals with converting the code-points to bytes and back. codecs, "Encoding", and "Decoding" are also used for non-text formats (e.g., files containing video or audio). In all of the related contexts they have the same meaning. I'm failing to understand your problem with the terminology.
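For readers who have not used the codecs module, a short sketch of the "code points to bytes and back" conversion it provides. The sample word is arbitrary; codecs.lookup, codecs.encode and codecs.decode are standard-library functions.

```python
import codecs

# Look up a codec by name; the returned CodecInfo bundles its encoder and decoder.
info = codecs.lookup("utf-8")

word = "na\u00efve"                  # 'naïve', written with an escape for clarity
raw = codecs.encode(word, "utf-8")   # code points -> bytes
back = codecs.decode(raw, "utf-8")   # bytes -> code points

print(info.name, raw, back == word)
```

In Python 3 the built-in open() with an "encoding" argument covers the common file case, so calling codecs directly is one option rather than a requirement.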
On May 29, 2015, at 01:56, anatoly techtonik <techtonik@gmail.com> wrote:
No he doesn't. In Python 3, unless he goes out of his way to open the file in binary mode, or use binary string literals for his regexps, that text is unicode from the moment his code sees it. So he doesn't have to read the docs. Python 3 was deliberately designed to make it easier to never have to use bytes internally, so 80% of the users never even have to think about bytes (even at the cost of sometimes making things harder for the more advanced coders who need to write the low-level stuff like network protocol handlers and can't avoid bytes). Now, all those things _are_ still problems for people who use Python 2. But the only way to fix that is to get those people--and, even more importantly, new people--using Python 3. Which means not introducing any new radical inconsistencies in between Python 2 and 3 (or 4) for no good reason--or, of course, between Python 3.5 and 3.6 (or 4.0).
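A small sketch of the point about regular expressions, assuming Python 3. The sample text is invented; the second half shows what happens when someone does go out of their way to use a bytes pattern on text.

```python
import re

# In Python 3 a plain string literal is already Unicode text (str),
# and \w in a str pattern matches letters from any script.
text = "Привет мир 2015"
words = re.findall(r"\w+", text)
print(words)

# Mixing a bytes pattern with str input is rejected outright.
try:
    re.findall(rb"\w+", text)
    bytes_pattern_error = None
except TypeError as exc:
    bytes_pattern_error = exc
print(type(bytes_pattern_error).__name__)
```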
Where in the documentation does it ever tell you to decode text? If you're inventing fictitious documentation that would confuse people if it existed but doesn't because it doesn't, you can just as well claim that the int method is confusing because it tells him he needs to truncate his integers even though integers are already truncated. Yes, that would be confusing--which is why the docs don't say that.
If you open Shift-JIS text as if it were Latin-1 and see a mess of mojibake, it doesn't seem that surprising to be told that you need to decode it properly. If you open UTF-8 text as if it were UTF-8, and Python has already decoded it for you under the covers, you never have to think about it, so there's no opportunity to be surprised.
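The Shift-JIS versus Latin-1 case can be reproduced in a few lines. This is only an illustration with a made-up sample string; the codec names are the standard Python ones.

```python
original = "日本語"
raw = original.encode("shift_jis")   # the bytes as they would sit on disk

# Latin-1 maps every byte to some character, so this never fails;
# it just yields one meaningless character per byte (mojibake).
wrong = raw.decode("latin-1")

# Decoding with the encoding that was actually used recovers the text.
right = raw.decode("shift_jis")

print(repr(wrong))
print(right)
```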
That's a good point. That's exactly why you see people add random calls to str, unicode, encode, and decode to their Python 2 code until it seems to do the right thing on their one test input, and then freak out when it doesn't work on their second test input and go post a confused mess on StackOverflow or Python-list asking someone to solve it for them. What's the solution? Make it as unlikely as possible that you'll run into the problem in the first place by nearly forcing you to deal in Unicode all the way through your script, and, when you do need to deal with manual encoding and decoding, make the almost-certainly-wrong nonsensical code impossible to write by not having bytes.encode or str.decode or automatic conversions between the two types. Of course that's a backward-incompatible change, and maybe a radical-enough one that it'll take half a decade for the ecosystem to catch up to the point where most users can benefit from it. Which makes it a good thing that Python started that process half a decade ago. So now, to anyone who runs into that confusion, there's an answer: just upgrade from 2.7 to 3.4, undo all the changes you introduced trying to solve this problem incorrectly, and your original code just works. Even if you had a better solution than Python 3's (which I doubt, but let's assume you do), what good would that do? That would make the answer: wait 18 months for Python 3.6, then another 12 months for the last of the packages you depend on to finally adjust to the breaking incompatibility that 3.6 introduced, then undo all the changes you introduced trying to solve this problem incorrectly, then make different, more sensible, changes. That's clearly not a better answer. So, unless you have a better solution than Python 3's and also have a time machine to go back to 2007, what could you possibly have to propose?
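A quick sketch of what "not having bytes.encode or str.decode or automatic conversions" looks like in practice, assuming Python 3. The sample string is arbitrary.

```python
text = "caf\u00e9"            # str: Unicode text ('café')
data = text.encode("utf-8")   # bytes: its UTF-8 representation

# The nonsensical directions simply do not exist in Python 3.
print(hasattr(text, "decode"))   # str has no decode()
print(hasattr(data, "encode"))   # bytes has no encode()

# There is also no implicit conversion between the two types.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc
print(type(mixing_error).__name__)
```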
On Fri, May 29, 2015 at 12:57:16PM -0700, Andrew Barnert via Python-ideas wrote: Before anyone else engages too deeply in this off-topic discussion, some background: Anatoly wrote to python-list asking for help dealing with a problem where he has a bunch of bytes (file names) which probably represent Russian text but in an unknown legacy encoding, and he wants to round-trip it from bytes to Unicode and back again losslessly. (Russian is a particularly nasty example, because there are multiple mutually-incompatible Russian encodings in widespread use.) As far as I can see, he has been given the solution, or at least a potential solution, on python-list, but as far as I can tell he either hasn't read it, or doesn't like the solutions offered and so is ignoring them. So there's a real problem hidden here, buried beneath the dramatic presentation of imaginary journalists processing text, but I don't think it's a problem that needs discussing *here* (at least not unless somebody comes up with a concrete proposal or idea to be discussed). A couple more comments follow:
On May 29, 2015, at 01:56, anatoly techtonik <techtonik@gmail.com> wrote:
This is not the case when you have to deal with unknown encodings. And from the perspective of people who only have ASCII (or at worst, Latin-1) text, or who don't care about moji-bake, Python 2 appears easier to work with. To quote Chris Smith: "I find it amusing when novice programmers believe their main job is preventing programs from crashing. More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn’t crash is a horrible nightmare." Python 2's string handling is designed to minimize the chance of getting an exception when dealing with text in an unknown encoding, but the consequence is that it also minimizes the chance of it doing the right thing except by accident. In Python 2, you can give me a bunch of arbitrary bytes as a string, and I can read them as text, in a sort of ASCII-ish pseudo-encoding, regardless of how inappropriate it is or how much moji-bake it generates. But it won't raise an exception, which for some people is all that matters. Moving to Unicode (in Python 2 or 3) can come as a shock to users who have never had to think about this before. Moji-bake is ubiquitous on the Internet, so there is a real problem to be solved. Python 2's string model is not the way to solve it. I don't think there is any "no-brainer" solution which doesn't involve thinking about bytes and encodings, but if Anatoly or anyone else wants to suggest one, we can discuss it. [...]
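For the file-name round trip mentioned in the background above, one technique that is often used in Python 3 is the "surrogateescape" error handler. The thread does not say which solution was offered on python-list, so treat this as an assumption about a possible approach, not a record of that answer. The file name and the KOI8-R encoding are invented for the example.

```python
# Pretend these bytes came from a directory listing and the encoding is unknown.
raw_name = "отчёт.txt".encode("koi8-r")

# Undecodable bytes are smuggled through as lone surrogate code points,
# so no information is lost even though the guess (UTF-8) is wrong.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and handler restores the exact original bytes.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw_name)

# Readable text still requires knowing (or correctly guessing) the real encoding.
readable = raw_name.decode("koi8-r")
print(readable)
```

The intermediate str is only safe to pass around and convert back; it is not guaranteed to display as meaningful text.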
These same issues occur in Python 2 if you exclusively use unicode strings u"" instead of the default string type. [...]
Surely you would have to go back to 1963 when the ASCII encoding first started, so we can skip over the whole mess of dozens of mutually incompatible "extended ASCII" code pages? -- Steve
On Sat, May 30, 2015 at 3:18 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Let me update you on this. There was no solution given. Only pointers to go read some more articles on the internet again. So, yes, I read the replies. But I have very little time to analyse and follow up. The idea I wanted to convey in this thread is that encode/decode is confusing, so if you agree with that, I can start to propose alternatives. And just to make you understand the importance of the question about translating from bytes to unicode and back, let me just say that this question is the third most voted, with 221k views, on SO in the Python 3 tag. http://stackoverflow.com/questions/tagged/python-3.x -- anatoly t.
On Jun 1, 2015, at 08:46, anatoly techtonik <techtonik@gmail.com> wrote:
Hold on. You had a question, you don't have time to read the answers you were given, so instead you think Python needs to change?
First, as multiple people including the OP say in the comments to that question, what's confusing to novices is that subprocess pipes are the first thing they've used that are binary by default instead of text by default. (For other novices that will instead happen with sockets. But it will eventually happen somewhere.) So, maybe the subprocess docs need a prominent link to, say, the Unicode HOWTO, which is what the OP of that question seems to be proposing. Or maybe it should just be easier to open subprocess pipes in text mode, as it is for files. But I don't see how renaming the methods could possibly help anything. The problem is not that the OP saw the answer and didn't understand or believe it, it's that he didn't know how to search for it. When told the right answer, he immediately said "Thanks, that does it" not "Whatchootalkinbout Willis, I don't have any crypto here". I've never heard of anyone besides you having that reaction. Also, your own answer there is a really bad idea. It was an intentional part of the design of UTF-8 that decoding non-UTF-8 non-ASCII text as if it were UTF-8 will almost always signal an error. It's not a good thing to silently get mojibake instead of getting an error--it just pushes the problem back further, to somewhere it's harder to understand, find, and debug. In the worst case, it just pushes the problem all the way to the end user, who's even less equipped to deal with it than you when his Russian characters get turned into box graphics. If you have bytes and you want text, the only solution to that is to find out the encoding and decode it. That's not a problem with Python, it's a problem with the proliferation of incompatible encodings that people have used without any in-band or out-of-band indications over the past few decades. Of course there are cases where you want to smuggle bytes with text, or degrade as gracefully as possible on errors, or whatever. That's why decode takes an error handler.
But in the usual case, if you try to interpret something as UTF-8 when it's really cp1252, or interpret something as Big5 when it's really Shift-JIS, or whatever, an error is exactly what you should hope for, to tell you that you guessed wrong. That's why it's the default.
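A short sketch of the default behaviour and of an explicit error handler, assuming Python 3. The sample text and the cp1252/UTF-8 pairing follow the example in the message; the variable names are made up.

```python
# cp1252 bytes for 'naïve café', written with escapes for clarity.
raw = "na\u00efve caf\u00e9".encode("cp1252")

# Default (errors="strict"): a wrong guess is reported immediately.
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
print(type(error).__name__)

# Opting in to graceful degradation: bad bytes become U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
print(replaced)

# The actual fix is to decode with the encoding that was really used.
correct = raw.decode("cp1252")
print(correct)
```

The "replace" variant hides the wrong guess from the program, which is the kind of silent damage the message warns against, so it is best reserved for cases where lossy output is acceptable.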
On 5/29/2015 4:56 AM, anatoly techtonik wrote: This essay, which is mostly about the clash between python2 thinking and python3 thinking, is off topic for this list. Please use python-list, which is open to any python-related topic. -- Terry Jan Reedy
participants (7)
- anatoly techtonik
- Andrew Barnert
- Chris Angelico
- Ian Cordasco
- random832@fastmail.us
- Steven D'Aprano
- Terry Reedy