
On Fri, May 29, 2015, at 04:56, anatoly techtonik wrote:
First, let me start with The Curse of Knowledge https://en.wikipedia.org/wiki/Curse_of_knowledge which can be summarized as:
"Once you get something, it becomes hard to think how it was to be without it".
Let's think about how it is to be without _the idea that text is a byte stream in the first place_ - which some people here learned from Python 2, some learned from C, some may have learned from some other language. It was the way things always were, after all, before Unicode came along. The language I was using the most immediately before I started using Python was C#. And C# uses Unicode (well, UTF-16, but the important thing is that it's not an ASCII-compatible sequence of bytes) for strings. One could argue that this paradigm, along with the attendant "encode" and "decode" concepts and the stream wrappers that take care of it in the common cases, is _the future_, and that one day nobody will learn that text's natural form is a sequence of ASCII-compatible bytes... even if text files continue to be encoded that way on disk.
Now imagine a person who has a text file. The person needs to process that with Python. That person is probably a journalist and doesn't know anything that "any developer should know about unicode". In Python 2 he just copy-pastes regular expressions to match the letter and is happy. In Python 3 he needs to *convert* that text to unicode.
You don't have to do so explicitly if the text file's encoding matches your locale. You can just open the file and read it, and it will open as a text-mode stream that takes care of the decoding for you and returns Unicode strings. It's a text file, so you open it in text mode. Even if the encoding doesn't match your locale, the proper way is to pass an "encoding" argument to the open function, not to go so far as to open the file in binary mode and decode the bytes yourself.
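
As a rough sketch of what I mean, in Python 3: the sample file, its contents, the choice of Latin-1 and the helper name find_words are all made up for illustration, and the regular expression is just a stand-in for whatever the journalist copy-pasted.

```python
import os
import re
import tempfile


def find_words(path, encoding=None):
    """Return the words in a text file, letting open() do the decoding.

    encoding=None makes open() fall back to the locale's preferred encoding.
    """
    with open(path, encoding=encoding) as f:
        text = f.read()  # already a str; no manual bytes.decode() needed
    return re.findall(r"\w+", text)


# Hypothetical sample: a small Latin-1 file standing in for the text file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("café naïve".encode("latin-1"))
    sample_path = tmp.name

words = find_words(sample_path, encoding="latin-1")
os.remove(sample_path)
print(words)
```

If the file happened to be in the locale's encoding, calling find_words(sample_path) with no encoding argument would be enough; the regular expression itself doesn't need to change either way.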