[Python-ideas] PEP 540: Add a new UTF-8 mode

Fri Jan 6 06:52:51 EST 2017

2017-01-06 10:50 GMT+01:00 M.-A. Lemburg <mal at egenix.com>:
> Victor: I think you are taking the UTF-8 idea a bit too far.

Hum, sorry, the PEP is still a draft, the rationale is far from
perfect yet. Let me try to simplify the issue: users are unable to
configure a locale for various reasons and expect Python 3 to
"just work", that is, never fail on encoding or decoding.

Do you mean that you must try to fix this issue? Or that my approach
is not the right one?

> Nick was trying to address the situation where the locale is
> set to "C", or rather not set at all (in which case the lib C
> defaults to the "C" locale). The latter is a fairly standard
> situation when piping data on Unix or when spawning processes
> which don't inherit the current OS environment.

In the second version of my PEP, Python 3.7 will basically "just work"
with the POSIX locale (or C locale if you prefer). This locale enables
the UTF-8 mode, which forces UTF-8/surrogateescape, and this error
handler prevents the most common encode/decode errors (but not all of
them!).
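
To illustrate what the surrogateescape error handler does, here is a
minimal sketch. The file name is made up for the example; it is a
Latin-1 encoded name, so it is not valid UTF-8.

```python
# Sketch: surrogateescape round-trips bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"  # hypothetical Latin-1 encoded file name

# Strict decoding fails on the invalid byte 0xE9.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes unchanged.
restored = text.encode("utf-8", "surrogateescape")
```

The data is never lost: what comes in as bytes goes out as the same
bytes, even if Python cannot say which characters they represent.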

When I read the different issues on the bug tracker, I understood that
people have different opinions because they have different use cases
and so different expectations.

I tried to describe a few use cases to help understand why we don't
have the same expectations:
https://www.python.org/dev/peps/pep-0540/#replace-a-word-in-a-text

I guess that "piping data on Unix" is represented by my "Replace a
word in a text" example, right? It implements the "sed -e
s/apple/orange/g" command using Python 3. Classical usage:

   cat input_file | sed -e s/apple/orange/g > output

"UNIX users" don't want Unicode errors here.
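
A rough Python equivalent of that filter could look like the sketch
below. The function name replace_word() is only illustrative, and the
code spells out UTF-8/surrogateescape explicitly instead of relying on
any locale or interpreter option.

```python
def replace_word(data: bytes, old: str = "apple", new: str = "orange") -> bytes:
    """Replace `old` with `new`, passing undecodable bytes through unchanged."""
    # Decode as UTF-8; invalid bytes become lone surrogates instead of errors.
    text = data.decode("utf-8", "surrogateescape")
    # Encode back with the same handler so those bytes are restored as-is.
    return text.replace(old, new).encode("utf-8", "surrogateescape")
```

For example, replace_word(b"apple \xff pie") returns b"orange \xff pie":
the invalid byte 0xFF is carried through instead of raising an error.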

> The problem with the "C" locale is that the encoding defaults to
> "ASCII" and thus does not allow Python to show its built-in
> Unicode support.

I don't think that it's the main annoyance for users.

Users complain because basic functions like (1) "List a directory into
stdout" or (2) "List a directory into a text file" fail badly:

(1) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout
(2) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-a-text-file

They don't really care about powerful Unicode features, but are bitten
early just on writing data back to the disk, into a pipe, or something
else.

Python 3.6 tries to be nice with users when *getting* data, and it is
very pedantic when you try to put the data somewhere. The only
exception is that stdout now uses the surrogateescape error handler,
but only with the POSIX locale.
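
The asymmetry can be reproduced with an in-memory stream. This is only
a sketch: write_name() is a hypothetical helper, and the name below
stands for what os.listdir() returns on Unix for a file name that is
not valid in the filesystem encoding.

```python
import io

def write_name(name: str, errors: str) -> bytes:
    """Write `name` to an in-memory UTF-8 text stream and return the bytes."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    stream.write(name)
    stream.flush()
    return buffer.getvalue()

# "Getting" the data: decoding with surrogateescape never fails.
name = b"r\xe9sum\xe9.txt".decode("utf-8", "surrogateescape")

# "Putting" the data: a strict text stream rejects the lone surrogates.
try:
    write_name(name, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With surrogateescape on the output side, the original bytes come back.
written = write_name(name, "surrogateescape")
```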

> Nick's PEP and the discussion on the ticket
> http://bugs.python.org/issue28180 are trying to address this
> particular situation, not enforce any particular encoding
> overriding the user's configured environment.
>
> So I think it would be better if you'd focus your PEP on the
> same situation: locale set to "C" or not set at all.

I'm not sure that I understood: are you suggesting to only modify the
behaviour when the POSIX locale is used, without adding any option to
ignore the locale and force UTF-8?

At least, I would like to get a UTF-8/strict mode which would require
an option to enable it.

About -X utf8, the idea is to write explicitly that you are sure that
all inputs are encoded to UTF-8 and that you request to encode outputs
to UTF-8.

I guess that you are concerned by locales using encodings other than
ASCII or UTF-8 like Latin1, ShiftJIS or something else?

> BTW: You mention a locale "POSIX" in a few places. I have
> never seen this used in practice and wonder why we should
> even consider this in Python as possible work-around for
> a particular set of features. The locale setting in your
> environment does have a lot of influence on your user
> experience, so forcing people to set a "POSIX" locale doesn't
> sound like a good idea - if they have to go through the
> trouble of correctly setting up their environment for Python
> to correctly run, they would much more likely use the correct
> setting rather than a generic one like "POSIX", which is
> defined as alias for the "C" locale and not as a separate
> locale: (...)

Hum, the POSIX locale is the "C" locale in my PEP.

I don't ask users to force the POSIX locale. I propose to make
Python nicer when users already *get* the POSIX locale for various
reasons:

* OS not correctly configured
* SSH connection failing to set the locale
* user using LANG=C to get messages in English
* LANG=C used for a bad reason
* program run in an empty environment
* user locale set to a non-existent locale => the libc falls back on POSIX
* etc.

"LANG=C": "LC_ALL=C" is more correct, but it seems like LANG=C is more
common than LC_ALL=C or LC_CTYPE=C in the wild.
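
The difference between these variables comes from the lookup order
defined by POSIX: LC_ALL wins over LC_CTYPE, which wins over LANG. A
minimal sketch of that order follows; the function names are
hypothetical, and the code only looks at the environment, so it does
not detect the case where the libc falls back on POSIX because the
requested locale does not exist.

```python
def effective_ctype_locale(env):
    """Return the locale name that applies to LC_CTYPE for a given environment."""
    # POSIX precedence: LC_ALL, then LC_CTYPE, then LANG (empty means unset).
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    # Nothing set: assume the default "C" locale.
    return "C"

def is_posix_locale(env):
    """True if the environment selects the POSIX (aka "C") locale for LC_CTYPE."""
    return effective_ctype_locale(env) in ("C", "POSIX")
```

With env={"LANG": "C", "LC_ALL": "en_US.UTF-8"}, LANG=C is ignored
because LC_ALL takes priority, which is why LC_ALL=C is the more
reliable way to force the POSIX locale.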

>> It's actually very similar to your PEP, except that instead of adding
>> the ability to make CPython ignore the C level locale settings (which
>> I think is a bad idea based on your own previous work in that area and
>> on the way that CPython interacts with other C/C++ components in the
>> same process and in subprocesses), it just *changes* those settings
>> when we're pretty sure they're wrong.
>
> ... and this is taking the original intent of the ticket
> a little too far as well :-)

By ticket, do you mean a Python issue? By the way, I'm aware of these
two issues:

http://bugs.python.org/issue19846
http://bugs.python.org/issue28180

I'm sure that other issues were opened to request something similar,
but they probably got less feedback, and I have been too lazy so far
to find them all.

> Without the "C.UTF-8" locale available, your PEP [538] only affects
> the FS encoding, AFAICT, unless other parts of the application
> try to interpret the locale env settings as well and use their
> own logic for the interpretation.

I decided to write the PEP 540 because only a few operating systems
provide C.UTF-8 or C.utf8. I'm trying to find a solution working on
all UNIX and BSD systems. Maybe I'm wrong, and my approach (ignore the
locale, rather than really "fixing" the locale) is plain wrong.

Again, it's a very hard issue, I don't think that any perfect solution
exists. Otherwise, we would already have fixed this issue 8 years ago!

It's a matter of compromises and finding a practical design which
works for most users.

> For the purpose of experimentation, I would find it better
> to start with just fixing the FS encoding in 3.7 and
> leaving the option to adjust the locale setting turned off
> per default.

Sorry, what do you mean by "fixing the FS encoding"? I understand that
it's basically my PEP 540 without -X utf8 and PYTHONUTF8, only with
the UTF-8 mode enabled for the POSIX locale?

By the way, Nick's PEP 538 doesn't mention surrogateescape. IMHO if we
override or ignore the locale, it's safer to use surrogateescape. The
Use Cases of my PEP 540 should help to understand why.

Victor