Re: [Python-ideas] Why decode()/encode() name is harmful

30 May 2015

On Fri, May 29, 2015 at 12:57:16PM -0700, Andrew Barnert via Python-ideas wrote:

Before anyone else engages too deeply in this off-topic discussion, some 
background: Anatoly wrote to python-list asking for help dealing with a 
problem where he has a bunch of bytes (file names) which probably 
represent Russian text but in an unknown legacy encoding, and he wants 
to round-trip it from bytes to Unicode and back again losslessly.

(Russian is a particularly nasty example, because there are multiple 
mutually-incompatible Russian encodings in widespread use.)
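To make the incompatibility concrete, here is a minimal Python 3 sketch. 
The sample word and the four codecs are my own choice for illustration; 
they are not taken from Anatoly's actual file names.

```python
# Assumption: a sample Russian word, encoded with one legacy code page.
text = "Привет"
raw = text.encode("cp1251")

# Decode the very same bytes with several Russian encodings in common use.
# Each of these codecs assigns a character to every byte value, so none of
# the calls raises; most of them just produce different (wrong) text.
decoded = {
    codec: raw.decode(codec)
    for codec in ("cp1251", "koi8_r", "cp866", "iso8859_5")
}

for codec, result in decoded.items():
    print(codec, repr(result))
```

Only the codec that was actually used to encode the bytes gives back the 
original word; the others succeed silently with garbage.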

As far as I can see, he has been given a solution, or at least a 
potential one, on python-list, but he either hasn't read it or doesn't 
like the solutions offered and so is ignoring them.
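
This message does not spell out what was suggested on python-list, so 
the following is only a sketch of two techniques commonly used for this 
kind of lossless round-trip in Python 3: the "surrogateescape" error 
handler (PEP 383) and decoding as Latin-1. The file name is hypothetical.

```python
# Assumption: a file name whose bytes are in some unknown legacy encoding
# (KOI8-R is used here only to manufacture realistic non-UTF-8 bytes).
raw = "привет.txt".encode("koi8_r")

# Option 1: decode as UTF-8, letting undecodable bytes pass through as
# lone surrogate code points instead of raising.
name = raw.decode("utf-8", errors="surrogateescape")
restored = name.encode("utf-8", errors="surrogateescape")

# Option 2: Latin-1 maps each byte to exactly one code point, so it also
# round-trips, although the intermediate string is not meaningful text.
as_latin1 = raw.decode("latin-1")
restored_latin1 = as_latin1.encode("latin-1")

print(restored == raw, restored_latin1 == raw)
```

Neither option recovers the real text; both only guarantee that the 
original bytes come back unchanged.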

So there's a real problem hidden here, buried beneath the dramatic 
presentation of imaginary journalists processing text, but I don't think 
it's a problem that needs discussing *here* (at least not unless 
somebody comes up with a concrete proposal or idea to be discussed).

A couple more comments follow:

> On May 29, 2015, at 01:56, anatoly techtonik <techtonik@gmail.com> wrote:
>> In Python 2 he just copy pastes regular
>> expressions to match the letter and is happy. In
>> Python 3 he needs to *convert* that text to unicode.
>
> No he doesn't. In Python 3, unless he goes out of his way to open the 
> file in binary mode, or use binary string literals for his regexps, 
> that text is unicode from the moment his code sees it. So he doesn't 
> have to read the docs.

This is not the case when you have to deal with unknown encodings. And 
from the perspective of people who only have ASCII (or at worst, 
Latin-1) text, or who don't care about mojibake, Python 2 appears 
easier to work with. To quote Chris Smith:

"I find it amusing when novice programmers believe their main job is
preventing programs from crashing. More experienced programmers realize
that correct code is great, code that crashes could use improvement, but
incorrect code that doesn’t crash is a horrible nightmare."

Python 2's string handling is designed to minimize the chance of getting 
an exception when dealing with text in an unknown encoding, but the 
consequence is that it also minimizes the chance of it doing the right 
thing except by accident. In Python 2, you can give me a bunch of 
arbitrary bytes as a string, and I can read them as text, in a sort of 
ASCII-ish pseudo-encoding, regardless of how inappropriate it is or how 
much mojibake it generates. But it won't raise an exception, which for 
some people is all that matters.
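
The trade-off between crashing and being silently wrong can be sketched 
in Python 3. This is an illustration under my own assumptions (the 
sample word, and Latin-1 standing in for an "anything goes" 
pseudo-encoding); it is not a reproduction of Python 2's exact behaviour.

```python
# Assumption: bytes that are really UTF-8 encoded Russian text.
original = "Привет"
raw = original.encode("utf-8")

# A strict decode with the wrong codec fails loudly.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A permissive decode never raises, but the result is mojibake.
garbled = raw.decode("latin-1")

print(strict_failed, garbled == original)
```

The first case is "code that crashes"; the second is "incorrect code 
that doesn't crash".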

Moving to Unicode (in Python 2 or 3) can come as a shock to users who 
have never had to think about this before. Mojibake is ubiquitous on 
the Internet, so there is a real problem to be solved. Python 2's string 
model is not the way to solve it. I don't think there is any 
"no-brainer" solution which doesn't involve thinking about bytes and 
encodings, but if Anatoly or anyone else wants to suggest one, we can 
discuss it.

[...]
> Now, all those things _are_ still problems for people who use Python 
> 2. But the only way to fix that is to get those people--and, even more 
> importantly, new people--using Python 3. Which means not introducing 
> any new radical inconsistencies in between Python 2 and 3 (or 4) for 
> no good reason--or, of course, between Python 3.5 and 3.6 (or 4.0).

These same issues occur in Python 2 if you exclusively use unicode 
strings u"" instead of the default string type.

[...]
> So, unless you have a better solution than Python 3's and also have a 
> time machine to go back to 2007, what could you possibly have to 
> propose?

Surely you would have to go back to 1963, when the ASCII standard was 
first published, so we can skip over the whole mess of dozens of 
mutually incompatible "extended ASCII" code pages?

-- 
Steve

Steven D'Aprano