[Python-ideas] Py3 unicode impositions

Sat Feb 11 11:40:20 CET 2012

On 11 February 2012 04:12, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> This is especially the case if you work with older text data on Mac or
> modern Linux where UTF-8 is used, because you're almost certain to run
> into Latin-1-encoded files.  My favorite example is ChangeLogs, which
> broke my Gentoo package manager when I experimented with using Python
> 3 as the default Python.  Most packages would work fine, but for some
> reason some Python program in the PMS was actually reading the
> ChangeLogs, and sometimes they'd be impure ASCII (I don't recall
> whether it was utf-8 or latin-1), giving a fatal UnicodeError, and
> everything would grind to a halt.
>
> That is reason enough for the naive to embrace fear, uncertainty, and
> doubt about Python 3's use of Unicode.

My concern about Unicode in Python 3 is that the principle is, you
specify the right encoding. But often, I don't *know* the encoding ;-(
Text files (changelogs are a good example) generally have no
marker specifying the encoding, and they can be in all sorts of
encodings (depending on where the package came from). Worse, I am on
Windows and changelogs
usually come from Unix developers - so I'm not familiar with the
common conventions ("well, of course it's in UTF-8, that's what
everyone uses"...)

In Python 2, I can ignore the issue. Sure, I can end up with mojibake,
but for my uses, that's not a disaster. Mostly-readable works. But in
Python 3, I get an error and can't process the file.
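
As a rough illustration of the failure mode described here, the
sketch below decodes some Latin-1 bytes as strict UTF-8. The sample
text and the helper name try_strict_utf8 are made up for the example;
they are not taken from any real changelog or package manager.

```python
# Hypothetical sample: a changelog line written out as Latin-1.
raw = "Björn fixed the café bug\n".encode("latin-1")


def try_strict_utf8(data):
    """Return (text, None) on success, or (None, error) if decoding fails."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        # Python 3 refuses to guess: bytes that are not valid UTF-8 raise.
        return None, exc


text, error = try_strict_utf8(raw)
print("decoded text:", text)
print("error:", error)
```

Python 2 would hand back the same bytes as a str without decoding
them, which is why the problem only shows up later (if at all) as
mojibake.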

I can just decode as latin-1, or use the surrogateescape error
handler. But that doesn't come naturally to me yet. Maybe it will in
time... Or maybe there's a better solution I don't know about yet.
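
For reference, a minimal sketch of the two workarounds named above,
applied to a made-up byte string. Both decode without raising and
both let the original bytes be recovered exactly. The same encoding=
and errors= arguments can be passed to open() when reading a file.

```python
# Hypothetical Latin-1 bytes that are not valid UTF-8.
raw = b"Fixed by Bj\xf6rn\n"

# Option 1: latin-1 maps each byte 0-255 to the code point with the same
# number, so decoding can never fail (but may produce mojibake if the
# data was really in some other encoding).
as_latin1 = raw.decode("latin-1")

# Option 2: decode as UTF-8, but smuggle undecodable bytes through as
# lone surrogates (U+DC80..U+DCFF) instead of raising.
as_escaped = raw.decode("utf-8", errors="surrogateescape")

# Either way the original bytes can be reconstructed.
back_from_latin1 = as_latin1.encode("latin-1")
back_from_escaped = as_escaped.encode("utf-8", errors="surrogateescape")

print(repr(as_latin1))
print(repr(as_escaped))
```

One caveat with surrogateescape: the escaped characters are not valid
text, so encoding the string with the default strict error handler
(for example when printing it to a UTF-8 terminal) raises
UnicodeEncodeError.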

To be clear - I am fully in favour of the Python 3 approach, and I
completely support the idea that people should know the encodings of
the stuff they are working with (I've seen others naively make
encoding mistakes often enough to know that when it matters, it really
does matter). But having to worry, not so much about the encoding to
use, but rather about the fact that Python is asking you a question
you can't answer, is a genuine stumbling block. And from what I've
seen, it's at the root of the problems many people have with Unicode
in Python 3.

I'm not arguing for changes to the default behaviour of Python 3. But
if we had a good place to put it, a FAQ entry about "what to do if I
need to process a file whose encoding I don't know" would be useful.
And certainly having a standard answer that people could give when the
question comes up (something practical, not a purist answer like "all
files have an encoding, so you should find out") would help.
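
One possible shape for such a practical answer, offered as a sketch
and not as an agreed recommendation: try the likely encodings
strictly, then fall back to latin-1 so the file can always be
processed. The helper name read_text_best_effort, its encodings
parameter, and the demo file are all hypothetical.

```python
import os
import tempfile


def read_text_best_effort(path, encodings=("utf-8",)):
    """Hypothetical helper: return (text, encoding_used) for a file.

    Each candidate encoding is tried strictly, in order. If none fits,
    latin-1 is used as a last resort, since it accepts any byte sequence.
    """
    with open(path, "rb") as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Cannot fail, but the result may be mojibake ("mostly readable").
    return data.decode("latin-1"), "latin-1"


# Demo with a throwaway file containing made-up Latin-1 bytes.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"Fixed by Bj\xf6rn\n")

text, used = read_text_best_effort(demo_path)
os.remove(demo_path)
print(used, repr(text))
```

Returning the encoding that was actually used lets the caller log or
warn when the fallback kicked in, without ever stopping on an error.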

Paul.