[Python-ideas] PEP 540: Add a new UTF-8 mode

Thu Jan 5 20:54:49 EST 2017

2017-01-06 2:15 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
>>> Always use UTF-8 (...)
>>    Please don't! (...)
>
> For stdio (including console), PYTHONIOENCODING can be used for
> supporting legacy system.
> e.g. `export PYTHONIOENCODING=$(locale charmap)`

The problem with ignoring the locale by default and forcing UTF-8 is
that Python works with many libraries which use the locale, not UTF-8.
PEP 538 also describes mojibake issues if Python is embedded in an
application.

> For commandline argument and filepath, UTF-8/surrogateescape can round trip.
> But mojibake may happens when pass the path to GUI.

Let's say that you have the filename b'nonascii\xff': it's decoded as
'nonascii\udcff' by the UTF-8 mode. How do GUIs handle such a filename?
(I don't know the answer, it's a real question ;-))
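To make the decoding above concrete, here is a minimal sketch using only
the standard str/bytes methods (it is not the PEP 540 implementation
itself, and it does not answer how any particular GUI toolkit reacts).
It shows that the undecodable byte becomes a lone surrogate, that the
original bytes can be recovered, and that a strict UTF-8 encoder, which
is what a consumer expecting valid text would apply, rejects the string:

```python
raw = b'nonascii\xff'

# 0xff is not valid UTF-8, so surrogateescape maps it to U+DCFF.
decoded = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same error handler restores the original bytes.
roundtrip = decoded.encode('utf-8', 'surrogateescape')

# A strict encoder refuses the lone surrogate.
try:
    decoded.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```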

> If we chose "Always use UTF-8 for fs encoding", I think
> PYTHONFSENCODING envvar should be
> added again.  (It should be used from startup: decoding command line argument).

Last time I implemented PYTHONFSENCODING, I had many major issues:
https://mail.python.org/pipermail/python-dev/2010-October/104509.html

Do you mean that these issues are now outdated and that you have an
idea how to fix them?

> 3) unzip zip file sent by Windows.   Windows user use no-ASCII filenames, and
> create legacy (no UTF-8) zip file very often.
>
> I think people using non UTF-8 should solve encoding issue by themselves.
> People should use ASCII or UTF-8 always if they don't want to see mojibake.

ZIP files are out of the scope of PEPs 538 and 540. Python cannot
guess the encoding, so it was proposed to add an option to give the
user the ability to specify an encoding: see
https://bugs.python.org/issue10614 for example.
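
As an illustration of "the user specifies the encoding", here is a rough
sketch of a workaround, not the API proposed in the issue above. It
relies on the fact that the zipfile module decodes a member name as
cp437 when the UTF-8 flag (bit 11 of the general purpose flags) is not
set. The helper name fix_legacy_zip_name and the cp932 example are
assumptions made up for this sketch:

```python
def fix_legacy_zip_name(name, flag_bits, encoding):
    """Re-decode a member name that zipfile decoded as cp437.

    Hypothetical helper: 'encoding' is whatever the user knows (or
    guesses) the archive creator used, for example 'cp932'.
    """
    if flag_bits & 0x800:
        # Bit 11 set: the name was stored as UTF-8, leave it alone.
        return name
    # cp437 maps every byte value, so this recovers the raw bytes.
    return name.encode('cp437').decode(encoding)


# Simulate a cp932 name stored without the UTF-8 flag.
original = '\u65e5\u672c\u8a9e.txt'
seen_by_zipfile = original.encode('cp932').decode('cp437')
fixed = fix_legacy_zip_name(seen_by_zipfile, 0, 'cp932')
```

With a real archive, the two first arguments would come from
ZipInfo.filename and ZipInfo.flag_bits.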

But yeah, data encoded in encodings other than UTF-8 is still
common, and that's not going to change soon. Since many Windows
applications use the ANSI code page, I can easily imagine that many
documents are encoded in various incompatible code pages...

What I understood is that many users don't want Python to complain
about data encoded in different, incompatible encodings: they want to
process data as a stream of bytes or characters, depending on the case.
Something closer to Python 2 (stream of bytes). That's what I try to
describe in this section:
https://www.python.org/dev/peps/pep-0540/#old-data-stored-in-different-encodings-and-surrogateescape
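
A small sketch of that "stream of bytes" behaviour, using the
surrogateescape error handler directly rather than the UTF-8 mode (the
sample data is made up: one line encoded in UTF-8 and one in Latin-1).
The text can be decoded without an error, edited as str, and written
back with the undecodable byte unchanged:

```python
# Hypothetical input mixing two encodings in one stream.
data = 'caf\u00e9\n'.encode('utf-8') + 'caf\u00e9\n'.encode('latin-1')

# Strict decoding fails on the Latin-1 byte 0xe9.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape never fails: 0xe9 becomes U+DCE9.
text = data.decode('utf-8', 'surrogateescape')

# Process the text as str, then write it back.
out = text.replace('caf', 'tea').encode('utf-8', 'surrogateescape')
```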

Victor