[Python-ideas] Py3 unicode impositions

Tue Feb 14 22:08:09 CET 2012

On 14 February 2012 09:36, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>  > As I say:
>  > - I know what to do
>  > - It can be a lot of work
>  > - Frankly, the damage is minor (these are usually personal or low-risk scripts)
>  > - The temptation to say "stuff it" and get on with my life is high
>  > - It frustrates me that Python by default tempts me to *not* do the right thing
>
> Please don't blame it on Python.  Python tempts you because it offers
> the choice to do it right.  There is no way that Python can do it
> right *for* you, not even all the resources Microsoft or Apple can
> bring to bear have managed to do it right (you can't get 100% even
> within an all-Windows or all-Mac shop, let alone cross-platform).  Not
> yet; it requires your help.

Point taken.

I think my point is that I wish there were a more obvious way for me to
tell Python that I just want to do it nearly right on this occasion
(like "everything else" does) because I really don't need to care for
now. I'm getting a lot closer to knowing how to do that as this thread
progresses, though, which is why I think of this as more of an
educational issue than anything else.

Thinking about how I'd code something like "cat" naively in C
(while ((i = getchar()) != EOF) { putchar(i); }), I guess
encoding="latin1" is the way for Python to "work like everything else"
in this context.
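
As a rough illustration of that idea, here is a minimal sketch of a
"cat"-style copy that goes through text decoded as latin-1. The function
name naive_cat and the use of in-memory streams are assumptions made for
the example, not anything from the standard library or this thread, and
newline="" is passed so that line endings are left untouched:

```python
import io

def naive_cat(src, dst):
    """Copy a binary stream to another via latin-1 text (hypothetical helper)."""
    # newline="" disables newline translation in both directions, so
    # \r, \n and \r\n pass through exactly as they appear in the input.
    reader = io.TextIOWrapper(src, encoding="latin-1", newline="")
    writer = io.TextIOWrapper(dst, encoding="latin-1", newline="")
    for line in reader:
        writer.write(line)
    writer.flush()
    # Detach so the underlying binary streams stay open for the caller.
    reader.detach()
    writer.detach()

# Every possible byte value survives the round trip unchanged.
data = bytes(range(256)) * 2
out = io.BytesIO()
naive_cat(io.BytesIO(data), out)
assert out.getvalue() == data
```

This only shows that the bytes pass through intact; the text seen inside
the loop is still wrong whenever the real encoding is not latin-1.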

So I suppose there's a question. Do we really want to document how to
"do it wrong"? At first glance, obviously not. But if we don't, it
seems that the "Python 3 forces you to know Unicode" meme thrives, and
we keep getting bad press. Maybe we could add a note to the open()
documentation, something like the following:

"""To open a file, you need to know its encoding. This is not always
obvious, depending on where the file came from, among other things.
Other tools can process files without knowing the encoding by assuming
the bytes of the file map 1-1 to the first 256 Unicode characters.
This can cause issues such as mojibake or corrupted data, but for
casual use is sometimes sufficient. To get this behaviour in Python
(with all the same risks and problems) you can use the "latin1"
encoding, which maps bytes to Unicode as described above. It is far,
far better to specify the correct encoding, if at all possible,
however."""
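
To make the trade-off in that note concrete, here is a small sketch
(the variable names are just for illustration) showing both the 1-1
mapping and the mojibake it can produce when the data is really UTF-8:

```python
# latin-1 maps each byte value 0..255 to the code point with the same number.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
assert [ord(c) for c in text] == list(range(256))
assert text.encode("latin-1") == all_bytes

# Decoding UTF-8 data as latin-1 never fails, but the text is wrong:
# the two bytes of "é" become two separate characters (mojibake).
utf8_bytes = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")
assert mojibake == "\u00c3\u00a9"          # displays as 'Ã©'

# Writing the text back with the same encoding restores the original bytes.
assert mojibake.encode("latin-1") == utf8_bytes
```

So the data is preserved as long as it is only passed through, but any
processing that looks at the characters themselves sees the wrong text.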

I have no real opinion on whether this is the right thing to do.
Unfortunately (in a sense :-)) it doesn't matter much to me any more,
as I now have the benefit of learning from this thread, so I'm no
longer in the target audience of the comment :-)

> Thanks for caring!<wink/>

Thanks for helping me learn!
Paul