[Python-ideas] Why decode()/encode() name is harmful
Steven D'Aprano
steve at pearwood.info
Sat May 30 02:18:12 CEST 2015
On Fri, May 29, 2015 at 12:57:16PM -0700, Andrew Barnert via Python-ideas wrote:
Before anyone else engages too deeply in this off-topic discussion, some
background: Anatoly wrote to python-list asking for help dealing with a
problem where he has a bunch of bytes (file names) which probably
represent Russian text but in an unknown legacy encoding, and he wants
to round-trip it from bytes to Unicode and back again losslessly.
(Russian is a particularly nasty example, because there are multiple
mutually-incompatible Russian encodings in widespread use.)
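
To make the parenthetical concrete, here is a minimal sketch of how the same bytes read differently under four Cyrillic codecs that ship with Python. The sample word and the choice of codecs are assumptions made for illustration; nothing here comes from Anatoly's actual data.

```python
# Illustration only: the sample word and codec list are assumptions.
sample = "Привет"
raw = sample.encode("cp1251")  # pretend we do not know this was cp1251

# Decode the very same bytes with several legacy Cyrillic codecs.
codecs_to_try = ["cp1251", "koi8_r", "cp866", "iso8859_5"]
decodings = {name: raw.decode(name) for name in codecs_to_try}

for name, text in decodings.items():
    print(name, repr(text))

# Only the codec that was actually used gives back the original word.
matches = [name for name, text in decodings.items() if text == sample]
```

None of the wrong guesses raises an exception; each one just produces different, plausible-looking Cyrillic or box-drawing garbage.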
As far as I can see, he has been given a solution, or at least a
potential solution, on python-list, but he either hasn't read it, or
doesn't like the solutions offered and so is ignoring them.
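
For readers who have not followed the python-list thread: one common way to get a lossless bytes-to-str-to-bytes round trip in Python 3 is the `surrogateescape` error handler (PEP 383), and decoding as Latin-1 is another. The sketch below is an assumption about what such a solution can look like, not necessarily what was suggested on python-list; the helper name `round_trip` and the sample file name are hypothetical.

```python
def round_trip(raw, encoding="utf-8"):
    """Decode bytes to str and back without losing any byte values.

    Bytes that are invalid in `encoding` are smuggled through as lone
    surrogate code points and restored on the way back out.
    """
    text = raw.decode(encoding, errors="surrogateescape")
    restored = text.encode(encoding, errors="surrogateescape")
    return text, restored

# Hypothetical file name in a legacy encoding we pretend not to know.
raw_name = "привет.txt".encode("koi8_r")

text, restored = round_trip(raw_name)
assert restored == raw_name  # byte-for-byte identical

# Alternative: Latin-1 maps each of the 256 byte values to one code
# point, so it also round-trips, at the cost of meaningless text.
latin1_text = raw_name.decode("latin-1")
assert latin1_text.encode("latin-1") == raw_name
```

Neither approach recovers the Russian text; both only carry the bytes through a `str` unharmed. The `surrogateescape` result contains lone surrogates, so encoding it with the default strict handler will raise.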
So there's a real problem hidden here, buried beneath the dramatic
presentation of imaginary journalists processing text, but I don't think
it's a problem that needs discussing *here* (at least not unless
somebody comes up with a concrete proposal or idea to be discussed).
A couple more comments follow:
> On May 29, 2015, at 01:56, anatoly techtonik <techtonik at gmail.com> wrote:
> > In Python 2 he just copy pastes regular
> > expressions to match the letter and is happy. In
> > Python 3 he needs to *convert* that text to unicode.
>
> No he doesn't. In Python 3, unless he goes out of his way to open the
> file in binary mode, or use binary string literals for his regexps,
> that text is unicode from the moment his code sees it. So he doesn't
> have to read the docs.
This is not the case when you have to deal with unknown encodings. And
from the perspective of people who only have ASCII (or at worst,
Latin-1) text, or who don't care about moji-bake, Python 2 appears
easier to work with. To quote Chris Smith:
"I find it amusing when novice programmers believe their main job is
preventing programs from crashing. More experienced programmers realize
that correct code is great, code that crashes could use improvement, but
incorrect code that doesn’t crash is a horrible nightmare."
Python 2's string handling is designed to minimize the chance of getting
an exception when dealing with text in an unknown encoding, but the
consequence is that it also minimizes the chance of it doing the right
thing except by accident. In Python 2, you can give me a bunch of
arbitrary bytes as a string, and I can read them as text, in a sort of
ASCII-ish pseudo-encoding, regardless of how inappropriate it is or how
much moji-bake it generates. But it won't raise an exception, which for
some people is all that matters.
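
The trade-off can be imitated in Python 3. This is only a rough sketch: it assumes Latin-1 decoding as a stand-in for Python 2's "any byte is a character" behaviour, and the sample text is made up for illustration.

```python
# Sample data (an assumption): Russian text that was encoded as UTF-8.
original = "Привет"
raw = original.encode("utf-8")

# Permissive reading: never raises, but the result is wrong text.
silently_wrong = raw.decode("latin-1")
print(repr(silently_wrong))

# Strict reading with the wrong codec: fails loudly instead.
try:
    raw.decode("ascii")
    raised = False
except UnicodeDecodeError as exc:
    raised = True
    print("refused:", exc)
```

The permissive path is the "incorrect code that doesn't crash" from the quote above; the strict path at least tells you that your assumption about the encoding was wrong.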
Moving to Unicode (in Python 2 or 3) can come as a shock to users who
have never had to think about this before. Moji-bake is ubiquitous on
the Internet, so there is a real problem to be solved. Python 2's string
model is not the way to solve it. I don't think there is any
"no-brainer" solution which doesn't involve thinking about bytes and
encodings, but if Anatoly or anyone else wants to suggest one, we can
discuss it.
[...]
> Now, all those things _are_ still problems for people who use Python
> 2. But the only way to fix that is to get those people--and, even more
> importantly, new people--using Python 3. Which means not introducing
> any new radical inconsistencies in between Python 2 and 3 (or 4) for
> no good reason--or, of course, between Python 3.5 and 3.6 (or 4.0).
These same issues occur in Python 2 if you exclusively use unicode
strings u"" instead of the default string type.
[...]
> So, unless you have a better solution than Python 3's and also have a
> time machine to go back to 2007, what could you possibly have to
> propose?
Surely you would have to go back to 1963, when the ASCII standard was
first published, so we can skip over the whole mess of dozens of mutually
incompatible "extended ASCII" code pages?
--
Steve