[Python-ideas] Py3 unicode impositions

Sat Feb 11 05:12:13 CET 2012

Jim Jewett writes:

 > Are you saying that some (many?  all?) platforms make a bad choice there?

No.  I'm saying that whatever choice is made, there is data that will
cause that idiom to fail on Python 3 where it would not on Python 2.
The only exceptions are 'latin-1' and PEP 383's
"errors='surrogateescape'", both of which accept all bytes regardless
of the actual encoding of the data, and both of which are unacceptable
defaults for production code for exactly that reason.
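
To make that concrete, here is a minimal sketch.  The sample bytes are
made up for illustration (they are not from any real file); the point
is only to show which decoders reject stray bytes and which silently
accept anything.

```python
# Hypothetical sample: "cafe" with an e-acute stored as the single
# Latin-1 byte 0xE9.  Valid Latin-1, not valid UTF-8.
data = b"caf\xe9"

# Strict UTF-8 decoding rejects the stray byte.
try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 maps every byte value to a code point, so it never fails,
# whatever the real encoding of the data was.
as_latin1 = data.decode("latin-1")

# The flip side: UTF-8 data read as Latin-1 also "succeeds", but the
# result is mojibake rather than an error.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")

# PEP 383: undecodable bytes become lone surrogates (0xE9 -> U+DCE9)
# and are restored on encoding with the same error handler.
escaped = data.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

Neither of the last two raises, which is exactly why they hide bad
data instead of reporting it.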

This is especially the case if you work with older text data on Mac or
modern Linux where UTF-8 is used, because you're almost certain to run
into Latin-1-encoded files.  My favorite example is ChangeLogs, which
broke my Gentoo package manager when I experimented with using Python
3 as the default Python.  Most packages worked fine, but for some
reason some Python program in the PMS was actually reading the
ChangeLogs, and sometimes they contained non-ASCII bytes (I don't
recall whether it was UTF-8 or Latin-1), which raised a fatal
UnicodeError and brought everything to a halt.

That is reason enough for the naive to embrace fear, uncertainty, and
doubt about Python 3's use of Unicode.

The fact is that with a little bit of knowledge, you can almost
certainly get more reliable (and in case of failure, more debuggable)
results from Python 3 than from Python 2.  But people are happy to
deal with the devil they know, even though it's more noxious than the
devil they don't.  Counteracting FUD with words generally doesn't work
IME, unless the words are a "magic spell" that reduces the unknown to
the known.
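
As one possible example of that "little bit of knowledge", here is a
sketch of a lenient decoder for files of uncertain encoding.  The
helper name decode_lenient and the choice of fallback order are my
assumptions, not anything from the stdlib or from the Gentoo tools;
it reports which encoding it ended up using so that a surprise is at
least visible.

```python
def decode_lenient(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return (text, encoding_used).

    Hypothetical helper for illustration.  With "latin-1" last in the
    list the loop always succeeds, because Latin-1 accepts every byte.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the data" % (encodings,))


# Usage sketch: read bytes, then decode explicitly.
#     with open(path, "rb") as f:
#         text, used = decode_lenient(f.read())
#     if used != "utf-8":
#         print("warning: %s decoded as %s" % (path, used))
```

Reading in binary mode and decoding explicitly keeps the decision in
the program's hands rather than leaving it to the platform default,
and the returned encoding name gives you something to log when a file
turns out not to be what you expected.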