PEP 540: Add a new UTF-8 mode

Hi,

Nick Coghlan asked me to review his PEP 538 "Coercing the legacy C locale to C.UTF-8": https://www.python.org/dev/peps/pep-0538/

Nick wants to change the default behaviour. I'm not sure that I'm brave enough to follow this direction, so I proposed my old "-X utf8" command line idea as a new PEP: add a new UTF-8 mode, *disabled by default*.

These two PEPs are the follow-up of the Windows PEP 529 (Change Windows filesystem encoding to UTF-8) and of issue #19977 (Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale).

The topic (switching to UTF-8 on UNIX) is actively discussed on: http://bugs.python.org/issue28180

Read the PEP online (HTML): https://www.python.org/dev/peps/pep-0540/

Victor

PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7

Abstract
========

Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system data instead of the locale encoding. Add a ``-X utf8`` command line option and a ``PYTHONUTF8`` environment variable.

Context
=======

Locale and operating system data
--------------------------------

Python uses the ``LC_CTYPE`` locale to decide how to encode and decode data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages
* name of a timezone
* user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user's ``LC_CTYPE`` locale and then stores the locale encoding as ``sys.getfilesystemencoding()``.
For the whole lifetime of a Python process, the same encoding and error handler are used to encode and decode data from/to the operating system.

.. note:: In some corner cases, the *current* ``LC_CTYPE`` locale must be used instead of ``sys.getfilesystemencoding()``. For example, the ``time`` module uses the *current* ``LC_CTYPE`` locale to decode timezone names.

The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in this order of preference:

* ``LC_ALL``, the most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is started in an empty environment.

The encoding of the POSIX locale must be ASCII or a superset of ASCII. On Linux, the POSIX locale uses the ASCII encoding. On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of the ASCII encoding, whereas the ``mbstowcs()`` and ``wcstombs()`` functions use the ISO 8859-1 encoding (Latin-1) in practice.

The problem is that ``os.fsencode()`` and ``os.fsdecode()`` use the ``locale.getpreferredencoding()`` codec. For example, if command line arguments are decoded by ``mbstowcs()`` and encoded back by ``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead of retrieving the original byte string.

To fix this issue, since Python 3.4, Python checks whether ``mbstowcs()`` really uses the ASCII encoding when ``LC_CTYPE`` uses the POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an alias of ASCII). If not (the effective encoding is not ASCII), Python uses its own ASCII codec instead of the ``mbstowcs()`` and ``wcstombs()`` functions for operating system data.

See the `POSIX locale (2016 Edition) <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
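The roundtrip failure described above is easy to reproduce in pure Python. This is a sketch, not the actual C-level code path: it simulates ``mbstowcs()`` decoding as Latin-1 while the filesystem codec is ASCII, using an arbitrary byte string as the "command line argument".

```python
# Sketch of the roundtrip failure described above: the C library decodes
# command line arguments as Latin-1, but the ASCII filesystem codec
# cannot encode the result back to the original bytes.
raw = b"caf\xe9"                  # a byte string coming from the OS
decoded = raw.decode("latin-1")   # what mbstowcs() effectively does
try:
    decoded.encode("ascii")       # what os.fsencode() would then do
    roundtrip_ok = True
except UnicodeEncodeError:
    roundtrip_ok = False
print(roundtrip_ok)
```

Since Python 3.4 sidesteps this by using its own ASCII codec when the announced and effective encodings disagree, the error surfaces at decode time instead of silently losing the original bytes.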
C.UTF-8 and C.utf8 locales
--------------------------

Some operating systems provide a variant of the POSIX locale using the UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8 proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating system data instead of the locale encoding. For Windows, see `PEP 529: Change Windows filesystem encoding to UTF-8 <https://www.python.org/dev/peps/pep-0529/>`_.

On Linux, UTF-8 became the de facto standard encoding, replacing legacy encodings like ISO 8859-1 or Shift JIS. For example, using different encodings for filenames and standard streams is likely to create mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of the XML and JSON file formats. In January 2017, UTF-8 was used in `more than 88% of web pages <https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML, JavaScript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general information on the UTF-8 codec.

.. note:: Some applications and operating systems (especially Windows) use Byte Order Marks (BOM) to indicate the Unicode encoding in use: UTF-7, UTF-8, UTF-16-LE, etc. BOMs are not well supported and rarely used in Python.

Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in the wild which don't use UTF-8, and there is a lot of data stored in different encodings. For example, an old USB key using the ext3 filesystem with filenames encoded to ISO 8859-1.
The Linux kernel and the libc don't decode filenames: a filename is used as a raw array of bytes. The common solution to support any filename is to store filenames as bytes and not try to decode them. When displayed to stdout, mojibake appears if the filename and the terminal don't use the same encoding.

Python 3 promotes Unicode everywhere, including for filenames. A solution to support filenames not decodable from the locale encoding was found: the ``surrogateescape`` error handler (`PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_), which stores undecodable bytes as surrogate characters. This error handler is used by default for operating system data, for example by ``os.fsdecode()`` and ``os.fsencode()`` (except on Windows, which uses the ``strict`` error handler).

Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and stderr. The ``strict`` error handler is used by stdin and stdout to prevent mojibake. The ``backslashreplace`` error handler is used by stderr to avoid Unicode encode errors when displaying non-ASCII text. It is especially useful when the POSIX locale is used, because this locale usually uses the ASCII encoding.

The problem is that operating system data like filenames are decoded using the ``surrogateescape`` error handler (PEP 383). Displaying a filename to stdout raises a Unicode encode error if the filename contains an undecodable byte stored as a surrogate character.

Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The idea is to pass operating system data through even if it means mojibake, because most UNIX applications work like that. Most UNIX applications store filenames as bytes, usually simply because bytes are first-class citizens in the programming language used, whereas Unicode is badly supported.

..
.. note:: The encoding and/or the error handler of standard streams can be overridden with the ``PYTHONIOENCODING`` environment variable.

Proposal
========

Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system data instead of the locale encoding:

* Add a ``-X utf8`` command line option
* Add a ``PYTHONUTF8=1`` environment variable

Also add a strict UTF-8 mode, enabled by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.

The UTF-8 mode changes the default encoding and error handler used by open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and sys.stderr:

============================ ======================= ======================= ====================== ======================
Function                     Default, other locales  Default, POSIX locale   UTF-8                  UTF-8 Strict
============================ ======================= ======================= ====================== ======================
open()                       locale/strict           locale/strict           UTF-8/surrogateescape  UTF-8/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape  UTF-8/surrogateescape  UTF-8/strict
sys.stdin                    locale/strict           locale/surrogateescape  UTF-8/surrogateescape  UTF-8/strict
sys.stdout                   locale/strict           locale/surrogateescape  UTF-8/surrogateescape  UTF-8/strict
sys.stderr                   locale/backslashreplace locale/backslashreplace UTF-8/backslashreplace UTF-8/backslashreplace
============================ ======================= ======================= ====================== ======================

The UTF-8 mode is disabled by default to keep hard Unicode errors when encoding or decoding operating system data fails, and to keep backward compatibility. The user is responsible for explicitly enabling the UTF-8 mode, and so is better prepared for mojibake than if the UTF-8 mode were enabled *by default*.

The UTF-8 mode should be used on systems known to be configured with UTF-8, where most applications speak UTF-8.
It prevents Unicode errors if the user overrides a locale *by mistake* or if a Python program is started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle operating system data as bytes, so the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a limited impact on how these data are handled by the application.

The Python UTF-8 mode should help to make Python more interoperable with the other UNIX applications on the system, assuming that *UTF-8* is used everywhere and that users *expect* UTF-8.

Ignoring the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in Python is more convenient, since they are more commonly misconfigured *by mistake* (configured to use an encoding different from UTF-8, whereas the system uses UTF-8) than misconfigured by intent.

Backward Compatibility
======================

Since the UTF-8 mode is disabled by default, it has no impact on backward compatibility. The new UTF-8 mode must be enabled explicitly.

Alternatives
============

Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows. Since UTF-8 became the de facto encoding, it makes sense to always use it on all platforms with any locale. The risk is to introduce mojibake if the locale uses a different encoding, especially for locales other than the POSIX locale.

Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8 when the ``LC_CTYPE`` locale is the POSIX locale. Nick Coghlan's `PEP 538: Coercing the legacy C locale to C.UTF-8 <https://www.python.org/dev/peps/pep-0538/>`_ proposes to implement that using the ``C.UTF-8`` locale.

Related Work
============

Perl has a ``-C`` command line option and a ``PERL_UNICODE`` environment variable to force UTF-8: see `perlrun <http://perldoc.perl.org/perlrun.html>`_.
It is possible to configure UTF-8 per standard stream, on input and output streams, etc.

Copyright
=========

This document has been placed in the public domain.

I read the PEP 538, PEP 540, and the issues related to switching to UTF-8. At least, I can say one thing: people have different points of view :-) To understand why people disagree, I tried to categorize the different points of view and Python expectations:

"UNIX mode": Python 2 developers and long-time UNIX users expect that their code "just works". They like Python 3 features, but Python 3 annoys them with various encoding errors. The expectation is to be able to read data encoded in various incompatible encodings and write it into stdout or a text file. In short, mojibake is not a bug but a feature!

"Strict Unicode mode" for real Unicode fans: Python 3 is strict and it's a good thing! Strict codecs help to detect bugs in the code very early. These developers understand Unicode very well and are able to fix complex encoding issues. Mojibake is a no-no for them.

Python 3.6 is not exactly in the first or the latter category: "it depends". To read data from the operating system, Python 3.6 behaves in "UNIX mode": os.listdir() *does* return invalid filenames, using a funny encoding based on surrogates. To write data back to the operating system, Python 3.6 becomes strict. It's no longer possible to write data read from the operating system back to the operating system. Writing a filename read from os.listdir() into stdout or into a text file fails with an encode error.

Subtle behaviour: since Python 3.6, with the POSIX locale, Python uses the "UNIX mode" but only to write into stdout. It's possible to write a filename into stdout, but not into a text file.

In its current shape, my PEP 540 leaves Python's default unchanged, but adds two modes: UTF-8 and UTF-8 strict. The UTF-8 mode is more or less the UNIX mode generalized for all inputs and outputs: mojibake is a feature, just pass bytes through unchanged. The UTF-8 strict mode is more extreme than the current "Strict Unicode mode" since it also fails on *decoding* data from the operating system.
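The "funny encoding based on surrogates" mentioned above is the ``surrogateescape`` error handler; a minimal sketch of its round trip, using a made-up filename:

```python
# An OS-level filename containing 0xFF, which is invalid UTF-8.
raw = b"report-\xff.txt"

# Decoding with surrogateescape maps the bad byte to a lone surrogate
# (U+DCFF) instead of raising UnicodeDecodeError -- this is what
# os.fsdecode() does on UNIX.
text = raw.decode("utf-8", "surrogateescape")
assert text == "report-\udcff.txt"

# Encoding back with the same error handler restores the original bytes,
# so the "UNIX mode" round trip is lossless.
restored = text.encode("utf-8", "surrogateescape")
assert restored == raw
print("round trip ok")
```

Writing ``text`` to a strict stream, on the other hand, fails, which is exactly the reading/writing asymmetry described above.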
Now that I have a better view of what we have and what we want, the question is whether the default behaviour should be changed and if yes, how.

Nick's PEP 538 doesn't exactly move to the "UNIX mode" (open() doesn't use surrogateescape) nor to the "Strict Unicode mode" (fsdecode() still uses surrogateescape); it's still in a grey area. Maybe Nick can elaborate the use case or update his PEP?

I guess that all users and most developers are more in the "UNIX mode" camp. *If* we want to change the default, I suggest using the "UNIX mode" by default.

The question is whether someone relies on/likes the current Python 3.6 behaviour: reading "just works", writing is strict. If you like this behaviour, what do you think of the tiny Python 3.6 change: use surrogateescape for stdout when the locale is POSIX?

Victor
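The defaults being debated here can be inspected directly in any interpreter; a quick probe (the values naturally vary with the platform, Python version and locale):

```python
import sys

# The codec used by os.fsdecode()/os.fsencode() for operating system data.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)

# Error handlers of the standard streams; with a non-POSIX locale,
# stdin/stdout default to "strict" while stderr uses "backslashreplace",
# which is the reading/writing asymmetry discussed in the thread.
stream_errors = tuple(
    getattr(stream, "errors", None)
    for stream in (sys.stdin, sys.stdout, sys.stderr)
)
print(stream_errors)
```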

2017-01-05 17:50 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
A common request is that "Python just works" without having to pass a command line option or set an environment variable. Maybe the default behaviour should be left unchanged, but the behaviour with the POSIX locale should change.

Maybe we can enable the UTF-8 mode (or "UNIX mode") of PEP 540 when the POSIX locale is used? Said differently, the UTF-8 mode would not only be enabled by -X utf8 and PYTHONUTF8=1, but also by the common LANG=C, and when Python is started in an empty environment (no env vars).

Victor

Ok, I modified my PEP: the POSIX locale now enables the UTF-8 mode. 2017-01-05 18:10 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
http://bugs.python.org/issue28180 asks to "change the default" to get a Python which "just works" without any kind of configuration, in the context of a Docker image (I don't have any details about the image yet).
Maybe we can enable the UTF-8 mode (or "UNIX mode") of the PEP 540 when the POSIX locale is used?
I read the other issues again and I confirm that users are looking for a Python 3 which behaves like Python 2: simply don't bother them with encodings. I see the UTF-8 mode as an opportunity to answer this request. Moreover, the most common cause of encoding issues is a program run with no locale variable set, and so using the POSIX locale.

So I modified my PEP 540: the POSIX locale now enables the UTF-8 mode. I had to update the "Backward Compatibility" section since the PEP now introduces a backward incompatible change (the POSIX locale), but my bet is that the new behaviour is the one expected by users and that it cannot break applications.

I moved my initial proposition to an alternative. I added a "Use Cases" section to explain in depth the "always work" behaviour, which I called the "UNIX mode" in my previous email.

Latest version of the PEP: https://github.com/python/peps/blob/master/pep-0540.txt

https://www.python.org/dev/peps/pep-0540/ will be updated shortly.

Victor

LGTM. Some comments:

I want the UTF-8 mode to be enabled by default (an opt-out option), even if the locale is not POSIX, like `PYTHONLEGACYWINDOWSFSENCODING`. Users who depend on the locale know what a locale is and how to configure it. They can understand the difference between the locale mode and the UTF-8 mode, and they can opt out of the UTF-8 mode. But many people live in a "UTF-8 everywhere" world, and don't know about locales.

The `-X utf8` option should be parsed before converting command line arguments to wchar_t*. How about adding Py_UnixMain(int argc, char** argv) which is available only on Unix?

I dislike the wchar_t type and the mbstowcs functions on Unix. (I love wchar_t on Windows, of course.) I hope we can remove `wchar_t *wstr` from PyASCIIObject and deprecate all wchar_t APIs on Unix in the future.

On Fri, Jan 6, 2017 at 10:43 AM, Victor Stinner <victor.stinner@gmail.com> wrote:

2017-01-06 8:21 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
You do, I don't :-) It shouldn't be hard to find very concrete issues from the mojibake issues described at: https://www.python.org/dev/peps/pep-0540/#expected-mojibake-issues

IMHO there are 3 steps before being able to reach your dream:

1) add opt-in support for UTF-8
2) use UTF-8 if the locale is POSIX
3) UTF-8 is enabled by default

I would prefer to begin with a first Python release at stage (1) or (2), wait for user complaints, and later decide if we can move to (3). Right now, I haven't implemented the PEP 540 yet, so I wasn't able to experiment anything in practice.

Well, at least it means that I have to elaborate the "Always use UTF-8" alternative of my PEP to explain why I consider that we are not ready to switch directly to this "obvious" option.
Users depends on locale know what locale is and how to configure it.
It's not a matter of users, but a matter of code in the wild which directly uses C functions like mbstowcs() or wcstombs(). These functions use the current locale encoding; they are not aware of the new Python UTF-8 mode.
But many people lives in "UTF-8 everywhere" world, and don't know about locale.
The PEP 540 was written to help users in very concrete cases. I have been repeating since Python 3.0 that users must learn how to configure their locale. Well, 8 years later, I keep getting exactly the same user complaints: "Python doesn't work, it must just work!". It's really hard to decode bytes, later encode the text, and prevent any kind of encoding error. That's why no solution was proposed before.
`-X utf8` option should be parsed before converting commandline (...)
Yeah, that's a tough technical issue. I'm not sure right now how to implement this with a clean design. Maybe I will just try with a hack? :-)

Victor

INADA Naoki writes:
I find all this very strange from someone with what looks like a Japanese name. I see mojibake and non-Unicode encodings around me all the time. Caveat: I teach at a university that prides itself on being the most international of Japanese national universities, so in my daily work I see Japanese in 4 different encodings (5 if you count the UTF-16 used internally by MS Office), Chinese in 3 different (claimed) encodings, and occasionally Russian in at least two encodings, ..., uh, I could go on but won't.

In any case, the biggest problems are legacy email programs and busted websites in Japanese, plus email that is labeled "GB2312" but actually conforms to GBK (and this is a reply in Japanese to a Chinese applicant writing in Japanese encoded as GBK).

I agree that people around me mostly know only two encodings: "works for me" and "mojibake", but they also use locales configured for them by technical staff. On top of that, international students (the most likely victims of "UTF-8 by default", because students are the biggest Python users) typically have non-Japanese locales set on their imported computers.

I'm not going to say my experience is typical enough to block "UTF-8 by default", but let's do this very carefully and with thought.

On 8 January 2017 at 02:47, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Unsurprisingly (given where I work [1]), one of my key concerns is to enable large Python-using institutions to keep moving forward, regardless of whether they've fully standardised their internal environments on UTF-8 or not.

As such, while I'm entirely in favour of pushing people towards UTF-8 as the default choice everywhere, I also want to make sure that system and application integrators, including the folks responsible for defining the Standard Operating Environments in large organisations, get warnings of potential problems when they arise, and continue to get encoding errors when we have definitive evidence of a compatibility problem.

For me, that boils down to:

- if a locale is properly configured, we'll continue to respect it
- if we're ignoring or changing the locale setting without an explicit config option, we'll emit a warning on stderr that we're doing so (*without* using the warnings system, so there's no way to turn it into an exception)
- if a UTF-8 based Linux container is run on a GB-18030/ISO-2022/Shift-JIS/etc host and tries to exchange locally encoded data with that host (rather than exchanging UTF-8 encoded data over a network connection), getting an exception is preferable to silently corrupting the data stream

(I think I'll add something along those lines to PEP 538 as a new "Core Design Principles" section)

Cheers, Nick.

[1] https://docs.python.org/devguide/motivations.html#published-entries

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jan 8, 2017 at 1:47 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Since I work at a tech company, and use Linux mostly for "server-side" programs, I don't live in such a situation. But when I see non-UTF-8 text, I don't change the locale to read such text. (Actually speaking, the locale doesn't solve mojibake because it doesn't change my terminal emulator's encoding.) And I don't change my terminal emulator's settings only to read such text. What I do is convert it to UTF-8 through a command like `view text-from-windows.txt ++enc=cp932`.

So there is no problem when Python always uses UTF-8 for the filesystem encoding and the stdio encoding.
Hmm, which OS do they use? There is no problem on macOS and Windows. Do they use Linux with a locale whose encoding is not UTF-8, and terminal emulators using a non-UTF-8 encoding?

As far as I can tell, UTF-8 started dominating about 10 years ago, and ja_JP.EUC_JP (the most common locale for Japanese before UTF-8) is completely legacy. There is only one machine I can ssh into which has the ja_JP.eucjp locale (it is on the LAN, has lived for 10+ years, and its /usr/bin/python is Python 1.5!).

INADA Naoki writes:
But when I see non UTF-8 text, I don't change locale to read such text.
Nobody does. The problem is if people have locales set for non-UTF-8, which Chinese people often do ("GB18030 isn't just a good idea, it's the law"). Especially forcing stdout to something other than the locale is likely to mess things up.
My university's internal systems typically produce database output (class registration lists and the like) in Shift JIS, but that's not reliable. Some departments still have their home pages in EUC-JP, and pages where the meta http-equiv elements disagree with the content are not unusual. The private sector may be up to date, but the academic sector (and, from the state of e-stat.go.jp, government in general, I suspect) is stuck in the Jomon era.

I don't know that there's going to be a problem, but the idea of implicitly forcing an encoding different from the locale seems likely to cause confusion to me. Aside from Nick's special case of containers supplied by a vendor different from the host OS, I don't really see why this is a good idea. I think it's best to go with the locale that is set (or not), unless we have very good reason to believe that by far most users would be surprised by that, and that those who aren't surprised are mostly expert enough to know how to deal with a forced UTF-8 environment if they *don't* want it. A user-selected option is another matter.

Hi Stephen,

2017-01-09 19:42 GMT+01:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
I went to that page, checked the HTML and found:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Admittedly, the page is in HTML 4.01, but then the Jomon era predates HTML5 by about 16,000 years, so I'll cut them some slack.

Anyway, I am quite willing to believe that the situation is as dire as you describe on Windows. However, on OS X, Apple enforces UTF-8, and the Linux vendors are moving in that direction too. And the proposal under discussion is specifically about Linux. So, again, I am wondering if there are many people who end up with a *Linux* system which has a non-UTF-8 locale. For example, if you use the Ubuntu graphical installer, it asks for your language and then gives you the UTF-8 locale which comes with that. You have to really dive into the non-graphical configuration to get yourself a non-UTF-8 locale.

Stephan
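Whether a given system actually ended up with a UTF-8 locale can be checked from Python itself; a quick probe (the printed values depend entirely on the environment the interpreter runs in):

```python
import locale
import sys

# Adopt the user's LC_CTYPE locale, as Python does at interpreter startup.
locale.setlocale(locale.LC_CTYPE, "")

# The encoding Python derives from the locale, and the one it actually
# uses for filenames (identical on most UNIX systems without the UTF-8 mode).
locale_encoding = locale.getpreferredencoding(False)
fs_encoding = sys.getfilesystemencoding()
print(locale_encoding, fs_encoding)
```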

Oh, I didn't know non-UTF-8 locales were still used for LC_CTYPE these days!
I talked about LC_CTYPE. We have some legacy files too, but they relate to neither the filesystem encoding nor the stdio encoding.
Yes. This is a balancing matter. Some people are surprised that Python may not use UTF-8 even when their source code is written in UTF-8, unlike most other languages (not only Rust, Go and node.js, but also Ruby, Perl, or even C!). And some people are surprised because they used the locale to tell some commands the terminal encoding (which is not UTF-8), and Python up to 3.6 followed it.

I thought the latter group is very small, and will be even smaller when 3.7 is released. And if we can drop locale support in the future, we will be able to remove some very dirty code in Python/fileutils.c. That's why I prefer a locale-free UTF-8 mode by default, and a locale-aware mode as opt-in. But I'm OK with starting by ignoring the C locale, sure.

Victor Stinner writes:
The point of this, I suppose, is that piping to xargs works by default. I haven't read the PEPs (don't have time, mea culpa), but my ideal would be three options:

--transparent -> errors=surrogateescape on input and output
--postel -> errors=surrogateescape on input, =strict on output
--unicode-me-harder -> errors=strict on input and output

with --postel being the default. Unix aficionados with lots of xargs use can use --transparent. Since people have different preferences, I guess there should be an envvar for this. Others probably should configure open() by open().

I'll try to get to the PEPs over the weekend but can't promise.

Steve
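The three options above map directly onto Python's existing codec error handlers; here is a toy illustration (the option names --transparent, --postel and --unicode-me-harder are Stephen's proposals, not real Python flags):

```python
# Bytes that are invalid as UTF-8 (a Latin-1 encoded "café").
data = b"caf\xe9"

# "transparent": surrogateescape both ways -- bytes pass through losslessly.
text = data.decode("utf-8", "surrogateescape")
assert text.encode("utf-8", "surrogateescape") == data

# "postel": lenient input, strict output -- reading works, writing back fails.
try:
    text.encode("utf-8", "strict")
    postel_write_ok = True
except UnicodeEncodeError:
    postel_write_ok = False
print(postel_write_ok)

# "unicode-me-harder": strict both ways -- even the initial decode fails.
try:
    data.decode("utf-8", "strict")
    strict_read_ok = True
except UnicodeDecodeError:
    strict_read_ok = False
print(strict_read_ok)
```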

2017-01-06 3:10 GMT+01:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
The point of this, I suppose, is that piping to xargs works by default.
Please read the second version (latest) version of my PEP 540 which contains a new "Use Cases" section which helps to define issues and the behaviour of the different modes.
PEP 540:

* --postel is the default
* --transparent is the UTF-8 mode
* --unicode-me-harder is the UTF-8 mode configured to strict

The POSIX locale enables --transparent.
The PEP adds a new -X utf8 command line option and a PYTHONUTF8 environment variable to configure the UTF-8 mode.
Others probably should configure open() by open().
My PEP 540 does change the default encoding used by open(): https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

Obviously, you can still explicitly set the encoding when calling open().
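An explicit encoding argument always wins over whatever default a mode would install; a small sketch with a temporary file:

```python
import os
import tempfile

# Write and read a file with an explicit encoding, independent of the
# locale and of any UTF-8 mode default.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8", errors="strict") as f:
    f.write("caf\u00e9")

with open(path, "r", encoding="utf-8") as f:
    content = f.read()
print(content == "caf\u00e9")
```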
I'll try to get to the PEPs over the weekend but can't promise.
Please read at least the abstract of my PEP 540 ;-) Victor

On Jan 05, 2017, at 05:50 PM, Victor Stinner wrote:
FWIW, it seems to be a general and widespread recommendation to always pass universal_newlines=True to Popen and friends when you only want to deal with unicode from subprocesses:

    If encoding or errors are specified, or universal_newlines is true, the file objects stdin, stdout and stderr will be opened in text mode using the encoding and errors specified in the call or the defaults for io.TextIOWrapper.

Cheers,
-Barry

Passing universal_newlines will use whatever locale.getdefaultencoding() returns (which at least on Windows is useless enough that I added the encoding and errors parameters in 3.6). So it sounds like it'll only actually do Unicode on Linux if enough of the planets have aligned, which is what Victor is trying to do, but you can't force the other process to use a particular encoding. universal_newlines may become a bad choice if the default encoding no longer matches what the environment says, and personally, I wouldn't lose much sleep over that.

(As an aside, when I was doing all the Unicode changes for Windows in 3.6, I eventually decided that changing locale.getdefaultencoding() was too big a breaking change to ever be a good idea. Perhaps that will be the same result here too, but I'm nowhere near familiar enough with the conventions at play to state that with any certainty.)

Cheers,
Steve

On Jan 06, 2017, at 11:08 PM, Steve Dower wrote:
Passing universal_newlines will use whatever locale.getdefaultencoding()
There is no locale.getdefaultencoding(); I think you mean locale.getpreferredencoding(False). (See the "Changed in version 3.3" note in §17.5.1.1 of the stdlib docs.)
universal_newlines is also problematic because it's misnamed relative to the more common motivation for using it. Very often people do want to open std* in text mode (and thus trade in Unicode), but they rarely equate that to "universal newlines". So the option is just more hidden, magical side-effect and cargo-culted lore. It's certainly *useful* though, and I think we want to be sure that we don't break existing code that uses it for this purpose.

Cheers,
-Barry
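The behaviour quoted from the docs can be seen with a child interpreter; a sketch comparing universal_newlines with the explicit encoding parameter added in 3.6:

```python
import subprocess
import sys

# universal_newlines=True makes stdout a str, decoded with the default
# encoding for the platform/locale.
out_text = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
).stdout

# Since 3.6, the encoding parameter spells the same thing explicitly
# and without the misleading name.
out_explicit = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    encoding="utf-8",
).stdout

print(out_text == out_explicit)
```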

Hi! On Thu, Jan 05, 2017 at 04:38:22PM +0100, Victor Stinner <victor.stinner@gmail.com> wrote:
Please don't! I use different locales and encodings, sometimes it's utf-8, sometimes not - but I have properly configured LC_* settings and I prefer Python to follow my command. It'd be disgusting if Python starts to bend me to its preferences.
The risk is to introduce mojibake if the locale uses a different encoding, especially for locales other than the POSIX locale.
There is no such risk for me as I already have mojibake on my systems. The two most notable sources of mojibake are:

1) FTP servers - people create files (both names and content) in different encodings; w32 FTP clients usually send file names and content in cp1251 (Russian Windows encoding), sometimes in cp866 (Russian Windows OEM encoding).

2) MP3 tags and playlists - almost always cp1251.

So whatever my personal encoding is - koi8-r or utf-8 - I have to deal with file names and content in different encodings.

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.
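Oleg's point is easy to reproduce: the same bytes simply read as different text under different encodings, with no error raised. A sketch using a cp1251 byte string (the Russian word "Привет", picked purely for illustration):

```python
# The same bytes under two single-byte encodings: intended text vs. mojibake.
data = b"\xcf\xf0\xe8\xe2\xe5\xf2"      # "Привет" encoded as cp1251

as_cp1251 = data.decode("cp1251")       # what the Windows author meant
as_koi8r = data.decode("koi8-r")        # what a koi8-r terminal would show
print(as_cp1251)
print(as_koi8r)
print(as_cp1251 == as_koi8r)
```

Both decodes succeed, because every byte is valid in both codecs; that is exactly why a filesystem full of mixed-encoding names cannot be fixed by any single locale setting.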

For stdio (including the console), PYTHONIOENCODING can be used to support legacy systems, e.g. `export PYTHONIOENCODING=$(locale charmap)`. For command line arguments and file paths, UTF-8/surrogateescape can round-trip. But mojibake may happen when passing such a path to a GUI. If we choose "Always use UTF-8 for the fs encoding", I think the PYTHONFSENCODING envvar should be added again. (It should be used from startup: decoding command line arguments.)
3) Unzipping zip files sent from Windows. Windows users use non-ASCII filenames, and create legacy (non-UTF-8) zip files very often. I think people using non-UTF-8 should solve encoding issues by themselves. People should always use ASCII or UTF-8 if they don't want to see mojibake.

2017-01-06 2:15 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
The problem with ignoring the locale by default and forcing UTF-8 is that Python works with many libraries which use the locale, not UTF-8. The PEP 538 also describes mojibake issues if Python is embedded in an application.
For command line arguments and file paths, UTF-8/surrogateescape can round-trip. But mojibake may happen when passing such a path to a GUI.
Let's say that you have the filename b'nonascii\xff': it's decoded as 'nonascii\udcff' by the UTF-8 mode. How do GUIs handle such a filename? (I don't know the answer, it's a real question ;-))
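The round trip Victor alludes to can be sketched in a few lines: the surrogateescape error handler smuggles the invalid byte 0xFF through str as the lone surrogate U+DCFF, and only the same handler can turn it back into bytes.

```python
raw = b'nonascii\xff'  # not valid UTF-8: 0xff cannot start a UTF-8 sequence

# Decoding with surrogateescape maps the bad byte to a lone surrogate...
text = raw.decode('utf-8', 'surrogateescape')
assert text == 'nonascii\udcff'

# ...and encoding with the same handler recovers the original bytes exactly.
assert text.encode('utf-8', 'surrogateescape') == raw

# With the default strict handler, encoding the surrogate fails instead.
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    pass
else:
    raise AssertionError('expected UnicodeEncodeError')
```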
Last time I implemented PYTHONFSENCODING, I had many major issues: https://mail.python.org/pipermail/python-dev/2010-October/104509.html Do you mean that these issues are now outdated and that you have an idea how to fix them?
ZIP files are out of the scope of PEPs 538 and 540. Python cannot guess the encoding, so it was proposed to add an option to give the user the ability to specify an encoding: see https://bugs.python.org/issue10614 for example. But yeah, data encoded in encodings other than UTF-8 is still common, and that's not going to change soon. Since many Windows applications use the ANSI code page, I can easily imagine that many documents are encoded in various incompatible code pages... What I understood is that many users don't want Python to complain about data encoded in different incompatible encodings: process data as a stream of bytes or characters, it depends. Something closer to Python 2 (stream of bytes). That's what I try to describe in this section: https://www.python.org/dev/peps/pep-0540/#old-data-stored-in-different-encod... Victor

On Fri, Jan 06, 2017 at 02:54:49AM +0100, Victor Stinner wrote:
I ran this in Python 2.7 to create the file:

open(b'/tmp/nonascii\xff-', 'w')

and then confirmed the filename:

[steve@ando tmp]$ ls -b nonascii*
nonascii\377-

Konqueror in KDE 3 displays it with *two* "missing character" glyphs (small hollow boxes) before the hyphen. The KDE "Open File" dialog box shows the file with two blank spaces before the hyphen. My interpretation of this is that the difference is due to using different fonts: the file name is shown the same way, but in one font the missing character is a small box and in the other it is a blank space. I cannot tell what KDE is using for the invalid character; if I copy it as text and paste it into a file I just get the original \xFF. The Geany text editor, which I think uses the same GUI toolkit as Gnome, shows the file with a single "missing glyph" character, this time a black diamond with a question mark in it. It looks like Geany (Gnome?) is displaying the invalid byte as U+FFFD, the Unicode "REPLACEMENT CHARACTER". So at least two Linux GUI environments are capable of dealing with filenames that are invalid UTF-8, in two different ways. Does this answer your question about GUIs? -- Steve

Hi all, One meta-question I have which may already have been discussed much earlier in this whole proposal series, is: How common is this problem? Because I have the impression that nowadays all Linux distributions are UTF-8 by default and you have to show some bloody-mindedness to end up with a POSIX locale. Docker was mentioned, is this not really an issue which should be solved at the Docker level? Since it would affect *all* applications which are doing something non-trivial with encodings? I realise there is some attractiveness in solving the issue "for Python", since that will reduce the amount of bug reports and get people off the chests of the maintainers, but to get this fixed in the wider Linux ecosystem it might be preferable to "Let them eat mojibake", to paraphrase what Marie-Antoinette never said. Stephan 2017-01-06 5:49 GMT+01:00 Steven D'Aprano <steve@pearwood.info>:

2017-01-06 7:22 GMT+01:00 Stephan Houben <stephanh42@gmail.com>:
How common is this problem?
In the last 2 or 3 years, I don't recall having been bitten by such an issue. On the bug tracker, new issues are opened infrequently:

* http://bugs.python.org/issue19977 opened at 2013-12-13, closed at 2014-04-27
* http://bugs.python.org/issue19846 opened at 2013-11-30, closed as NOTABUG at 2015-05-17, but got new comments after it was closed
* http://bugs.python.org/issue19847 opened at 2013-11-30, closed as NOTABUG at 2013-12-13
* http://bugs.python.org/issue28180 opened at 2016-09-16, still open

Again, I don't think that this list is complete; I recall other similar issues.
What do you mean by "eating mojibake"? Users complain because their application is stopped by a Python exception. Currently, most Python 3 applications don't produce or display mojibake, since Python is strict on outputs. (One exception: stdout with the POSIX locale since Python 3.5.) Victor

Hi Victor, 2017-01-06 13:01 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
What do you mean by "eating mojibake"?
OK, I erroneously understood that the failure mode was that mojibake was produced.
Users complain because their application is stopped by a Python exception.
Got it.
OK, I now tried it myself and indeed it produces the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xfe' in position 0: ordinal not in range(128)

My suggestion would be to make this error message more specific. In particular, if we have LC_CTYPE/LANG=C or unset, we could print something like the following information (on Linux only):

"""
You are attempting to use non-ASCII Unicode characters while your system has been configured (possibly erroneously) to operate in the legacy "C" locale, which is pure ASCII. It is strongly recommended that you configure your system to allow arbitrary non-ASCII Unicode characters. This can be done by configuring a UTF-8 locale, for example:

export LANG=en_US.UTF-8

Use:

locale -a | grep UTF-8

to get a list of all valid UTF-8 locales on your system.
"""

Stephan
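The failure Stephan reproduces boils down to the ASCII codec rejecting any code point above 127; a minimal sketch of the same error without touching the locale:

```python
# Under the C locale, sys.stdout is backed by the ASCII codec, so printing
# any non-ASCII character fails with exactly this kind of error.
try:
    '\xfe'.encode('ascii')
except UnicodeEncodeError as exc:
    message = str(exc)

assert "'ascii' codec can't encode character" in message
assert '\\xfe' in message  # the offending character is named in the message
```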

Chris Barker - NOAA Federal writes:
Of course. The question is not "should cb@noaa properly configure docker?", it's "Can docker properly configure docker (soon enough)? And if not, should we configure Python?" The third question depends on whether fixing it for you breaks things for others.

When talking about a general Docker image, using the C locale is OK for most cases. In other words, images using the C locale are properly configured. Applications written in node.js, Ruby, Perl, Go or Rust can all talk UTF-8 in Docker using the C locale, without special configuration. Locale-dependent applications are the exception in this area.

INADA Naoki writes:
s/properly/compatibly/. "Proper" has strong connotations of "according to protocol". Configuring LC_CTYPE for ASCII and then expecting applications to say "You're lying!" and spew UTF-8 anyway is not "proper". That kind of thing makes me very nervous, and I think justifiably so. And it's only *sufficient* to justify a change to Python's defaults if Python checks for and accurately identifies when it's in a container. Anyway, I need to look more carefully at the actual PEPs and see if there's something concrete to worry about. But remember, we have about 18 months to chew over this if necessary -- I'm only asking for a few more days (until after the "cripple the minds of Japanese youth day", er, "University Admissions Center Examination" this weekend ;-). Steve

In my company, we use network boot servers. To reduce the boot image, the image is built with a minimalistic approach too. So there was only the C locale on most of our servers, and many people in my company had been bitten by this problem. I taught them to add `export PYTHONIOENCODING=utf-8` to their .bashrc. But they were bitten again when using cron. So this problem is not only about Docker containers. Since UTF-8 dominated, many people including me use the C locale to avoid unintentional behavior of commands which look at the locale (sort, ls, date, bash, etc...), and use a non-C locale only for reading non-English output from some command, like `man` or `hg help`. That's for i18n / l10n, not for changing the encoding. People living in the UTF-8 world are never helped by the locale changing the encoding. They are only bitten by the behavior.
Of course. And neither PEP proposes changing the default behavior except for the C locale. So there are 36+ months for changing the default behavior. I hope 36 months is enough for people using legacy systems to move to the UTF-8 world. Regards,

Here is one example of a locale pitfall:

# from http://unix.stackexchange.com/questions/169739/why-is-coreutils-sort-slower-...
$ cat letters.py
import string
import random

def main():
    for _ in range(1_000_000):
        c = random.choice(string.ascii_letters)
        print(c)

main()

$ python3 letters.py > letters.txt
$ LC_ALL=C time sort letters.txt > /dev/null
        0.35 real         0.32 user         0.02 sys
$ LC_ALL=C.UTF-8 time sort letters.txt > /dev/null
        0.36 real         0.33 user         0.02 sys
$ LC_ALL=ja_JP.UTF-8 time sort letters.txt > /dev/null
       11.03 real        10.95 user         0.04 sys
$ LC_ALL=en_US.UTF-8 time sort letters.txt > /dev/null
       11.05 real        10.97 user         0.04 sys

This is why some engineers, including me, use the C locale on Linux, at least when there is no C.UTF-8 locale. Of course, we can use LC_CTYPE=en_US.UTF-8 instead of LANG or LC_ALL. (I wonder if we can use LC_CTYPE=UTF-8...) But I dislike the current situation that "people should learn how to configure the locale properly, and the pitfalls of non-C locales, only to use UTF-8 in Python".

INADA Naoki writes:
Off course, we can use LC_CTYPE=en_US.UTF-8, instead of LANG or LC_ALL.
You can also use LC_COLLATE=C.
(I wonder if we can use LC_CTYPE=UTF-8...)
Syntactically incorrect: that means the language UTF-8. "LC_CTYPE=.UTF-8" might work, but IIRC the language tag is required, and the region and encoding are optional. Thus ja_JP and ja.UTF-8 are OK, but .UTF-8 is not. Rant follows:
You can use a distro that implements and defaults to the C.utf-8 locale, and presumably you'll be OK tomorrow, well before 3.7 gets released. (If there are no leftover mines in the field, I don't see a good reason to wait for 3.8 given the known deficiencies of the C locale and the precedent of PEPs 528/529.) Really, we're catering to users who won't set their locales properly and insist on old distros. For Debian, C.utf-8 was suggested in 2009[1], and that RFE refers to other distros that had already implemented it. I have all the sympathy in the world for them -- systems *should* Just Work -- but I'm going to lean against kludges if they mean punishing people who actually learn about and conform to applicable standards (and that includes well-motivated, properly-documented, and carefully-implemented platform-specific extensions), or use systems designed by developers who do.[2] Footnotes: [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=609306 [2] I know how bad standards can suck -- I'm a Mailman developer, looking at you RFC 561, er, 5322. While I'm all for nonconformism if you take responsibility for any disasters that result, developers who conform on behalf of their users are heroes.

I'm sorry. I know it, but I'm not good at English. I meant "I wish POSIX allowed the LC_CTYPE=UTF-8 setting." It's just my desire.
Many people use a new Python on legacy Linux systems which don't have the C.UTF-8 locale. I learned how to configure the locale for using UTF-8 in Python. But I don't want to force people to learn it, only for Python.
CentOS 7 (and maybe RHEL 7) doesn't seem to provide C.UTF-8 by default. It means C.UTF-8 won't be a universally available locale for at least the next 5 years.

$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
$ locale -a | grep ^C
C

Hi INADA Naoki, (Sorry, I am unsure if INADA or Naoki is your first name...) While I am very much in favour of everything working "out of the box", an issue is that we don't have control over external code (be it Python extensions or external processes invoked from Python). And that code will only look at LANG/LC_CTYPE and ignore any cleverness we build into Python. For example, this may mean that a built-in Python string sort will give you a different ordering than invoking the external "sort" command. I have been bitten by this kind of issue, leading to spurious "diffs" if you try to use sorting to put strings into a canonical order. So my feeling is that people are ultimately not being helped by Python trying to be "nice", since they will be bitten by locale issues anyway. IMHO it is ultimately better to educate them to configure the locale. (I realise that people may reasonably disagree with this assessment ;-) ) I would then recommend setting it to en_US.UTF-8, which is slower and less elegant but at least more widely supported. By the way, I know a bit about how Node.js deals with locales, and it doesn't try to compensate for "C" locales either. But what it *does* do is that Node never uses the locale settings to determine the encoding of a file: you either have to specify it explicitly OR it defaults to UTF-8 (the latter on output only). So in this respect it is by specification immune against misconfiguration of the encoding. However, other stuff (e.g. date formatting) will still be influenced by the "C" locale as usual. Stephan 2017-01-11 9:17 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:

On 01/11/2017 11:46 AM, Stephan Houben wrote:
AFAIK, this would not be a problem under PEP 538, which effectively treats the "C" locale as "C.UTF-8". Strings of Unicode codepoints and the corresponding UTF-8-encoded bytes sort the same way. Is that wrong, or do you have a better example of trouble with using "C.UTF-8" instead of "C"?
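The claim that Unicode strings and their UTF-8 encodings sort the same way rests on a design property of UTF-8: lexicographic byte order matches code point order. A quick sketch (the word list is my own illustration):

```python
# UTF-8 is designed so that comparing encoded byte strings gives the same
# order as comparing the decoded strings code point by code point.
words = ['Zebra', 'abc', '\xe9norme', 'na\xefve', '\u00ff', '\U0001F600']

by_codepoint = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode('utf-8'))

assert by_codepoint == by_utf8_bytes
```

This holds for all code points, including those above U+FFFF (unlike UTF-16, where surrogate pairs break byte-order comparisons).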
What about the spurious diffs you'd get when switching from "C" to "en_US.UTF-8"?

$ LC_ALL=en_US.UTF-8 sort file.txt
a
a
A
A
$ LC_ALL=C sort file.txt
A
A
a
a
I believe the main problem is that the "C" locale really means two very different things: a) Text is encoded as 7-bit ASCII; higher codepoints are an error b) No encoding was specified In both cases, treating "C" as "C.UTF-8" is not bad: a) For 7-bit "text", there's no real difference between these locales b) UTF-8 is a much better default than ASCII

Hi Petr, 2017-01-11 12:22 GMT+01:00 Petr Viktorin <encukou@gmail.com>:
...and this is also something new I learned.
Is that wrong, or do you have a better example of trouble with using "C.UTF-8" instead of "C"?
After long deliberation, it seems I cannot find any source of trouble. +1 So my feeling is that people are ultimately not being helped by
That taught me to explicitly invoke "sort" using LANG=en_US.UTF-8 sort
A "C" locale also means that a program should not *output* non-ASCII characters, unless they are explicitly fed in (like in the case of "cat" or "sort" or the "ls" equivalent from PEP 540). So for example, a program might want to print fancy Unicode box characters to show progress, and check sys.stderr.encoding to see if it can do so. However, under a "C" locale it should not do so, since for example the terminal is unlikely to display the fancy box characters properly. Note that the current PEP 540 proposal is that sys.stderr uses UTF-8/backslashreplace under the "C" locale. I think this may be a minor concern ultimately, but it would be nice if we had some API to at least reliably answer the question "can I safely output non-ASCII myself?" Stephan
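No such API exists; a hypothetical helper along the lines Stephan asks for might look like this (the name can_output and the probe logic are my own invention, a sketch rather than a proposal):

```python
import sys

def can_output(text, stream=None):
    """Return True if `text` can be encoded for `stream` without errors."""
    stream = sys.stdout if stream is None else stream
    # Streams without a declared encoding are treated pessimistically.
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    errors = getattr(stream, 'errors', None) or 'strict'
    try:
        text.encode(encoding, errors)
    except (UnicodeEncodeError, LookupError):
        return False
    return True

# e.g. pick a progress character the terminal can actually show:
# bar = '\u2588' if can_output('\u2588') else '#'
```

Note the caveat implicit in the email: under PEP 540's UTF-8/backslashreplace stderr, encoding never fails, so this check would report success even when the terminal cannot display the characters.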

On Wed, Jan 11, 2017 at 7:46 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Hi INADA Naoki,
(Sorry, I am unsure if INADA or Naoki is your first name...)
Never mind, I don't care about name ordering. (INADA is family name).
I'm sorry, could you give me more concrete example? My opinion is +1 to PEP 540, there should be an option to ignore locale setting. (And I hope it will be default setting in future version.) What is your concern?
But some people can't accept a 30x slowdown just for sorting ASCII text. At least, the infrastructure engineers in my company love the C locale. New Python programmers (e.g. there are many data scientists learning Python) may want to work on a Linux server, and learning about locales is not their concern. Web programmers are the same: they just want to print UTF-8, and learning about locales may not be worth it for them. But I think there should be an option, and I want to use it.
Yes. Both PEP 538 and PEP 540 are about encoding. I'm sorry about my misleading word "locale-free". There should be locale support for time formatting, at least with a UTF-8 locale. Regards,

On 11 January 2017 at 17:05, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
FWIW, I'm hoping to backport whatever improved handling of the C locale that we agree on for Python 3.7+ to the initial system Python 3.6.0 release in Fedora 26 [1] - hence the section about redistributor backports in PEP 538. While the problems with the C locale have been known for a while, this latest attempt to do something about it started as an idea I had for a downstream Fedora-specific patch (which became PEP 538), while that PEP in turn served as motivation for Victor to write PEP 540 as an alternative approach that didn't depend on having the C.UTF-8 locale available. With the F26 Alpha at the end of February and the F26 Beta in late April, I'm hoping we can agree on a way forward without requiring months to make a decision :)
-- I'm only asking for a few more days
Yeah, while I'd prefer not to see the discussions drag out indefinitely, there's also still plenty of time for folks to consider the PEPs closely and formulate their perspective. Cheers, Nick. [1] https://fedoraproject.org/wiki/Releases/26/Schedule -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jan 06, 2017, at 07:22 AM, Stephan Houben wrote:
It can still happen in some corner cases, even on Debian and Ubuntu where C.UTF-8 is available and e.g. my desktop defaults to en_US.UTF-8. For example, in an sbuild/schroot environment[*], the default locale is C and I've seen package build failures because of this. There may be other such "corner case" environments where this happens too. Cheers, -Barry [*] Where sbuild/schroot is a very common suite of package building tools.

On Sat, Jan 7, 2017 at 8:20 AM, Barry Warsaw <barry@python.org> wrote:
A lot of background jobs get run in a purged environment, too. I don't remember exactly which ones land in the C locale and which don't, but check cron jobs, systemd background processes, inetd, etc, etc, etc. Having Python DTRT in those situations would be a Good Thing. ChrisA

2017-01-06 22:20 GMT+01:00 Barry Warsaw <barry@python.org>:
Right, that's the whole point of Nick's PEP 538 and my PEP 540: it's still common to get the POSIX locale. I began to list examples of practical use cases where you get the POSIX locale. https://www.python.org/dev/peps/pep-0540/#posix-locale-used-by-mistake I'm not sure about the title of the section: "POSIX locale used by mistake". Barry: About chroot, why do you get a C locale? Is it because no locale is explicitly configured? Or because no locale is installed in the chroot? Would it work if we had a tool to copy the locale from the host when creating the chroot: env vars and the data files required by the locale (if any)? This seems close to the chroot issue reported at http://bugs.python.org/issue28180. I understand that it's more a configuration issue than a deliberate choice to use the POSIX locale. Again, the user requirement is that Python 3 should just work without any kind of specific configuration, as other classic UNIX tools do. Victor

On Jan 06, 2017, at 11:33 PM, Victor Stinner wrote:
For some reason it's not configured:

% schroot -u root -c sid-amd64
(sid-amd64)# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
(sid-amd64)# export LC_ALL=C.UTF-8
(sid-amd64)# locale
LANG=
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=C.UTF-8

I'm not sure why that's the default inside a chroot. I thought there was a bug or discussion about this, but I can't find it right now. Generally when this happens, exporting this environment variable in your debian/rules file is the way to work around the default. Cheers, -Barry

2017-01-07 1:06 GMT+01:00 Barry Warsaw <barry@python.org>:
For some reason it's not configured: (...)
Ok, thanks for the information.
I'm not sure why that's the default inside a chroot.
I found at least one good reason to use the POSIX locale to build a package: it helps to get reproducible builds, see: https://reproducible-builds.org/docs/locales/ I used it as an example in my new rationale: https://www.python.org/dev/peps/pep-0540/#it-s-not-a-bug-you-must-fix-your-l... I tried to explain how using LANG=C can be a smart choice in some cases, and so that Python 3 should do its best to not annoy the user with Unicode errors. I also started to list cases where you get the POSIX locale "by mistake". As I wrote previously, I'm not sure that it's correct to add "by mistake". https://www.python.org/dev/peps/pep-0540/#posix-locale-used-by-mistake By the way, I tried to force the POSIX locale in my benchmarking "perf" module. The idea is to get more reproducible results between heterogeneous computers. But I got a bug report. So I decided to copy the locale by default and add an opt-in --no-locale option to ignore the locale (force the POSIX locale). https://github.com/haypo/perf/issues/15 Victor

On Fri, Jan 06, 2017 at 10:15:52AM +0900, INADA Naoki <songofacandy@gmail.com> wrote:
This means one more thing to reconfigure when I switch locales, instead of Python catching up automatically.
Good example, thank you! I forgot about it because I have written my own zip.py and unzip.py that encode/decode filenames.
I think people using non-UTF-8 should solve encoding issues by themselves. People should always use ASCII or UTF-8 if they don't want to see mojibake.
Impossible. Even if I always used UTF-8, I would still receive a lot of cp1251/cp866. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

2017-01-06 3:10 GMT+01:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
The "always ignore locale and force UTF-8" option has supporters. For example, Nick Coghlan wrote a whole PEP, PEP 538, to support this. I want my PEP to be complete, and so it lists all the well-known alternatives. Victor

On 6 January 2017 at 12:37, Victor Stinner <victor.stinner@gmail.com> wrote:
Err, no, that's not what PEP 538 does. PEP 538 doesn't do *anything* if a locale is already properly configured - it only changes the locale if the current locale is "C". It's actually very similar to your PEP, except that instead of adding the ability to make CPython ignore the C level locale settings (which I think is a bad idea based on your own previous work in that area and on the way that CPython interacts with other C/C++ components in the same process and in subprocesses), it just *changes* those settings when we're pretty sure they're wrong. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 06.01.2017 04:32, Nick Coghlan wrote:
Victor: I think you are taking the UTF-8 idea a bit too far. Nick was trying to address the situation where the locale is set to "C", or rather not set at all (in which case libc defaults to the "C" locale). The latter is a fairly standard situation when piping data on Unix or when spawning processes which don't inherit the current OS environment. The problem with the "C" locale is that the encoding defaults to "ASCII" and thus does not allow Python to show its built-in Unicode support. Nick's PEP and the discussion on the ticket http://bugs.python.org/issue28180 are trying to address this particular situation, not enforce any particular encoding overriding the user's configured environment. So I think it would be better if you'd focus your PEP on the same situation: locale set to "C" or not set at all. BTW: You mention a "POSIX" locale in a few places. I have never seen this used in practice and wonder why we should even consider it in Python as a possible work-around for a particular set of features. The locale setting in your environment does have a lot of influence on your user experience, so forcing people to set a "POSIX" locale doesn't sound like a good idea - if they have to go through the trouble of correctly setting up their environment for Python to run correctly, they would much more likely use the correct setting rather than a generic one like "POSIX", which is defined as an alias for the "C" locale and not as a separate locale: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
... and this is taking the original intent of the ticket a little too far as well :-) The original request was to have the FS encoding default to UTF-8, in case the locale is not set or set to "C", with the reasoning being that this makes it easier to use Python in situations where you have exactly these situations (see above). Your PEP takes this approach further by fixing the locale setting to "C.UTF-8" in those two cases - intentionally, with all the implications this has on other parts of the C lib. The latter only has an effect on the C lib if the "C.UTF-8" locale is available on the system, which it isn't on many systems, since C locales have to be explicitly generated. Without the "C.UTF-8" locale available, your PEP only affects the FS encoding, AFAICT, unless other parts of the application try to interpret the locale env settings as well and use their own logic for the interpretation. For the purpose of experimentation, I would find it better to start with just fixing the FS encoding in 3.7 and leaving the option to adjust the locale setting turned off by default. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 06 2017)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/

2017-01-06 10:50 GMT+01:00 M.-A. Lemburg <mal@egenix.com>:
Victor: I think you are taking the UTF-8 idea a bit too far.
Hum, sorry, the PEP is still a draft, the rationale is far from perfect yet. Let me try to simplify the issue: users are unable to configure a locale for various reasons and expect that Python 3 must "just work", so never fail on encoding or decoding. Do you mean that we shouldn't try to fix this issue? Or that my approach is not the right one?
In the second version of my PEP, Python 3.7 will basically "just work" with the POSIX locale (or C locale if you prefer). This locale enables the UTF-8 mode which forces UTF-8/surrogateescape, and this error handler prevents the most common encode/decode errors (but not all of them!). When I read the different issues on the bug tracker, I understood that people have different opinions because they have different use cases and so different expectations. I tried to describe a few use cases to help to understand why we don't all have the same expectations: https://www.python.org/dev/peps/pep-0540/#replace-a-word-in-a-text I guess that "piping data on Unix" is represented by my "Replace a word in a text" example, right? It implements the "sed -e s/apple/orange/g" command using Python 3. Classical usage: cat input_file | sed -e s/apple/orange/g > output "UNIX users" don't want Unicode errors here.
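The "Replace a word in a text" use case is essentially this filter; a minimal sketch, written as a function so the stream endpoints can be any file-like objects (the name replace_stream is mine, not from the PEP):

```python
def replace_stream(inp, out, old='apple', new='orange'):
    # Equivalent of `sed -e s/apple/orange/g`: copy input to output,
    # substituting the word. With surrogateescape on stdin/stdout,
    # undecodable bytes elsewhere on a line pass through untouched.
    for line in inp:
        out.write(line.replace(old, new))

# e.g. as a pipe filter:
#     import sys
#     replace_stream(sys.stdin, sys.stdout)
```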
I don't think that this is the main annoyance for users. Users complain because basic functions like (1) "List a directory into stdout" or (2) "List a directory into a text file" fail badly: (1) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout (2) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-a-text-file They don't really care about powerful Unicode features, but are bitten early just when writing data back to the disk, into a pipe, or somewhere else. Python 3.6 tries to be nice with users when *getting* data, and it is very pedantic when you try to put the data somewhere. The only exception is that stdout now uses the surrogateescape error handler, but only with the POSIX locale.
I'm not sure that I understood: do you suggest to only modify the behaviour when the POSIX locale is used, and not add any option to ignore the locale and force UTF-8? At least, I would like to get a UTF-8 Strict mode which would require an option to enable it. About -X utf8, the idea is to state explicitly that you are sure that all inputs are encoded in UTF-8 and that you request outputs to be encoded in UTF-8. I guess that you are concerned about locales using encodings other than ASCII or UTF-8, like Latin1, ShiftJIS or something else?
Hum, the POSIX locale is the "C" locale in my PEP. I don't request users to force the POSIX locale. I propose to make Python nicer when users already *get* the POSIX locale for various reasons:

* OS not correctly configured
* SSH connection failing to set the locale
* user using LANG=C to get messages in English
* LANG=C used for a bad reason
* program run in an empty environment
* user locale set to a non-existent locale => the libc falls back on POSIX
* etc.

"LANG=C": "LC_ALL=C" is more correct, but it seems like LANG=C is more common than LC_ALL=C or LC_CTYPE=C in the wild.
By ticket, do you mean a Python issue? By the way, I'm aware of these two issues: http://bugs.python.org/issue19846 http://bugs.python.org/issue28180 I'm sure that other issues were opened to request something similar, but they probably got less feedback, and I was too lazy yet to find them all.
I decided to write the PEP 540 because only a few operating systems provide C.UTF-8 or C.utf8. I'm trying to find a solution working on all UNIX and BSD systems. Maybe I'm wrong, and my approach (ignoring the locale, rather than really "fixing" it) is plain wrong. Again, it's a very hard issue; I don't think that any perfect solution exists. Otherwise, we would already have fixed this issue 8 years ago! It's a matter of compromises and finding a practical design which works for most users.
Sorry, what do you mean by "fixing the FS encoding"? I understand that it's basically my PEP 540 without -X utf8 and PYTHONUTF8, only with the UTF-8 mode enabled for the POSIX locale? By the way, Nick's PEP 538 doesn't mention surrogateescape. IMHO if we override or ignore the locale, it's safer to use surrogateescape. The Use Cases of my PEP 540 should help to understand why. Victor

2017-01-06 10:50 GMT+01:00 M.-A. Lemburg <mal@egenix.com>:
My PEP 540 is different from Nick's PEP 538, even for the POSIX locale. I propose to always use the surrogateescape error handler, whereas Nick wants to keep the strict error handler for inputs and outputs. https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler The surrogateescape error handler is useful to write programs which work as pipes, like the cat, grep and sed UNIX programs: https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipe... You can get the behaviour of Nick's PEP 538 using my UTF-8 Strict mode. Compare the "UTF-8 mode" and "UTF-8 Strict mode" lines in the tables of my use cases. The UTF-8 mode always works, but can produce mojibake, whereas UTF-8 Strict doesn't produce mojibake but can fail depending on the data and the locale. IMHO most users prefer usability ("just work") over correctness (preventing mojibake). So Nick and I don't have exactly the same scope and use cases. Victor

I'm ±0 on surrogateescape by default. I feel +1 for stdout and -1 for stdin. In the output case, surrogateescape is weaker than strict, but it only allows surrogateescaped binary. If a program carefully uses surrogateescaped decode, surrogateescape on stdout is safe enough. On the other hand, surrogateescape is very weak for input: it accepts arbitrary bytes, so it should be used carefully. But I agree that different error handlers for stdin and stdout are not beautiful. That's why I'm ±0. FYI, when http://bugs.python.org/issue15216 is merged, we will be able to change the error handler easily: ``sys.stdout.set_encoding(errors='surrogateescape')`` So it's controllable from Python. Some programs which handle filenames may prefer surrogateescape, and some programs like CGI scripts may prefer strict UTF-8 because JSON and HTML5 shouldn't contain arbitrary bytes.
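Until the proposed ``set_encoding()`` API lands, the error handler of a stream can already be chosen by re-wrapping its binary buffer. A minimal sketch, with an in-memory buffer standing in for ``sys.stdout.buffer``:

```python
import io

# Stand-in for sys.stdout.buffer; re-wrapping the buffer is today's way
# to pick an error handler (set_encoding() is only a proposed API).
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="ascii", errors="surrogateescape")

out.write("file\udcff")  # a filename decoded earlier with surrogateescape
out.flush()
# buf.getvalue() == b"file\xff": the original undecodable byte is restored
```

The same re-wrapping works on the real ``sys.stdout.buffer``, at the cost of replacing the ``sys.stdout`` object.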

It seems to me that having a C locale can mean two things: 1) It really is meant to be ASCII. 2) It's mis-configured (or un-configured), meaning the system encoding is unknown. If (2), then utf-8 is a fine default. If (1), then there are two options: 1) Everything on the system really is ASCII -- in which case, utf-8 would "just work" -- no problem. 2) There are non-ASCII file names, etc. on this supposedly ASCII system. In which case, do folks expect their Python programs to find these issues and raise errors? They may well expect that their Python program will not let them try to save a non-ASCII filename, for instance. But I suspect that they wouldn't want it to raise an obscure encoding error -- but rather would want the app to do something friendly. So I see no downside to using utf-8 when the C locale is defined. -CHB On Wed, Jan 11, 2017 at 4:23 PM, INADA Naoki <songofacandy@gmail.com> wrote:
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

Chris Barker writes:
Actually, IME, just like you, they expect it to DTRT, which for *them* is to save it in Shift-JIS or Alternativj or UTF-totally-whacked as their other programs do.
So I see no downside to using utf-8 when the C locale is defined.
You don't have much incentive to look for one, and I doubt you have the experience of the edge cases (if you do, please correct me), so that does not surprise me. I'm not saying there are such cases here, I just want a little time to look harder. Steve

On Thu, Jan 12, 2017 at 7:50 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
that's correct. I left out a sentence: This is a good time for others with experience with the ugly edge cases to speak up! The real challenge is that "output" has three (at least :-) ) use cases:

1) Passing on data that came from input on the same system: Victor's "Unix pipe style". In that case, if a supposedly ASCII-based system has non-ASCII data, most users would want it to get passed through unchanged. They're not likely to expect their Python program to enforce their locale (unless it was a program designed to do that -- but then it could be explicit about things).

2) The program generating data itself: the mentioned "outputting boxes to the console" example. I think that folks writing these programs should consider whether they really need non-ASCII output -- but if they do, I'd imagine most folks would rather see weird characters in the console than have the program crash. So these both point to utf-8 (with surrogateescape).

3) A program getting input from a user, or a data file, or... (like a filename, etc). This may be a program intended to be able to handle unicode filenames, etc. (this is my use-case :-) ) -- what should it do when run on an ASCII-only system? This is the tough one -- if the C locale indicates "not configured", then users would likely want _something_ written to the FS, rather than a program crash: so utf-8. However, if the system really is properly configured to be ASCII only, then they may want a program to never write non-ASCII to the filesystem.

However, ultimately, I think it's up to the application developer, rather than Python itself (or the sysadmin of the OS that it's running on), to know whether the app is supposed to support non-ASCII filenames, etc. i.e. one should expect that running a unicode-aware app on an ASCII-only filesystem is going to lead to problems anyway. So I think the "never crash" option is the better one in this imperfect trade-off. -CHB -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
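The "Unix pipe style" case (1) above can be sketched with surrogateescape on both ends of the pipe; here in-memory buffers stand in for the real stdin/stdout pipes:

```python
import io

# Bytes that are not valid UTF-8 (0xE9 is Latin-1 'é'), as might arrive
# on stdin from another program on a supposedly ASCII system.
raw = b"caf\xe9\n"

src = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8",
                       errors="surrogateescape")
dst_buf = io.BytesIO()
dst = io.TextIOWrapper(dst_buf, encoding="utf-8",
                       errors="surrogateescape")

# cat-like pass-through: undecodable bytes survive the round trip unchanged
for line in src:
    dst.write(line)
dst.flush()
assert dst_buf.getvalue() == raw
```

With strict error handlers on either side, the same loop would raise a UnicodeDecodeError or UnicodeEncodeError instead of passing the data through.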

2017-01-12 1:23 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
I'm ±0 on surrogateescape by default. I feel +1 for stdout and -1 for stdin.
The use case is to be able to write a Python 3 program which works with UNIX pipes without failing with encoding errors: https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipe... If you want something stricter, there is the UTF-8 Strict mode which prevents mojibake everywhere. I'm not sure that the UTF-8 Strict mode is really useful. When I implemented it, I quickly understood that using strict *everywhere* is just a dead end: it would fail in too many places. https://www.python.org/dev/peps/pep-0540/#use-the-strict-error-handler-for-o... I'm not even sure yet that a Python 3 with stdin using strict is "usable".
What do you mean by "carefully use surrogateescaped decode"? The rationale for using surrogateescape on stdout is to support this use case: https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout
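The listdir-into-stdout case can be made concrete: a name decoded with surrogateescape (as os.listdir() does) cannot be written back through a strict stream, while a surrogateescape stream restores the original bytes. A sketch using in-memory streams in place of real files and stdout:

```python
import io

# As os.listdir() would return it under a UTF-8 locale:
name = b"caf\xe9".decode("utf-8", "surrogateescape")  # 'caf\udce9'

# Python 3.6 default: writing is strict, so the lone surrogate fails
strict = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="strict")
try:
    strict.write(name)
    strict.flush()
    failed = False
except UnicodeEncodeError:
    failed = True
assert failed

# UTF-8 mode: surrogateescape on output restores the original byte
out_buf = io.BytesIO()
out = io.TextIOWrapper(out_buf, encoding="utf-8", errors="surrogateescape")
out.write(name)
out.flush()
assert out_buf.getvalue() == b"caf\xe9"
```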
In my experience with the Python bug tracker, almost nobody understands Unicode and locales. For the "Producer-consumer model using pipes" use case, the encoding issues of Python 3.6 can be a blocker. Some developers may prefer a different programming language which doesn't bother them with Unicode: basically, *all* other programming languages, no?
But I agree that different error handlers for stdin and stdout are not beautiful. That's why I'm ±0.
That's why there are two modes: UTF-8 and UTF-8 Strict. But I'm not 100% sure yet which encodings and error handlers should be used ;-) I started to play with my PEP 540 implementation. I already had to update the PEP 540 and its implementation for Windows. On Windows, os.fsdecode/fsencode now use surrogatepass, not surrogateescape (Python 3.5 uses strict on Windows). Victor

On Fri, Jan 13, 2017 at 12:12 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
I want http://bugs.python.org/issue15216 to be merged in 3.7. It allows applications to select the error handler via a straightforward API. So, the problem is "which should be the default"? * Programs like `ls` can opt in to surrogateescape. * Programs which want to output valid UTF-8 can opt out of surrogateescape. And I feel the former is better, regarding Python's Zen. But it's not a strong opinion.
Applications which are intended to output surrogateescaped data (filenames) should use surrogateescape, surely. But some applications are intended to live in a UTF-8 world. For example, think about an application which reads UTF-8 CSV and inserts it into a database. When a CSV file is accidentally encoded in Shift_JIS and passed to stdin, an error is better than silently inserting it into the database.
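This point can be demonstrated: Shift_JIS bytes are usually not valid UTF-8, so a strict decoder rejects the mis-encoded input instead of letting mojibake reach the database:

```python
# '日本語' ("Japanese") encoded as Shift_JIS, arriving where UTF-8 is expected
raw = "日本語".encode("shift_jis")

try:
    raw.decode("utf-8")  # strict: the accident is caught early
    caught = False
except UnicodeDecodeError:
    caught = True
assert caught

# surrogateescape would instead smuggle the bytes through as lone
# surrogates, and they would end up stored silently
lax = raw.decode("utf-8", "surrogateescape")
assert lax.encode("utf-8", "surrogateescape") == raw
```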
I agree. Some developers prefer other languages (or Python 2) to Python 3, because "Unicode by default doesn't fit POSIX". Both "strict by default" and "weak by default" have downsides.

On Thu, Jan 05, 2017 at 04:38:22PM +0100, Victor Stinner wrote: [...]
PEP 393 is the Flexible String Representation. I think you want PEP 383, Non-decodable Bytes in System Character Interfaces. https://www.python.org/dev/peps/pep-0383/
The problem is that operating system data like filenames are decoded using the ``surrogateescape`` error handler (PEP 393).
s/393/383/ -- Steve

I read the PEP 538, PEP 540, and issues related to switching to UTF-8. At least, I can say one thing: people have different points of view :-) To understand why people disagree, I tried to categorize the different points of view and Python expectations:

"UNIX mode": Python 2 developers and long-time UNIX users expect their code to "just work". They like Python 3 features, but Python 3 annoys them with various encoding errors. The expectation is to be able to read data encoded in various incompatible encodings and write it into stdout or a text file. In short, mojibake is not a bug but a feature!

"Strict Unicode mode" for real Unicode fans: Python 3 is strict and it's a good thing! Strict codecs help to detect bugs very early in the code. These developers understand Unicode very well and are able to fix complex encoding issues. Mojibake is a no-no for them.

Python 3.6 is not exactly in the first or the latter category: "it depends". To read data from the operating system, Python 3.6 behaves in "UNIX mode": os.listdir() *does* return invalid filenames, using a funny encoding based on surrogates. To write data back to the operating system, Python 3.6 puts its strict Unicode hat back on. It's no longer possible to write data from the operating system back to the operating system: writing a filename read from os.listdir() into stdout or into a text file fails with an encoding error. Subtle behaviour: since Python 3.6, with the POSIX locale, Python uses the "UNIX mode" but only to write into stdout. It's possible to write a filename into stdout, but not into a text file.

In its current shape, my PEP 540 leaves the Python defaults unchanged, but adds two modes: UTF-8 and UTF-8 strict. The UTF-8 mode is more or less the UNIX mode generalized for all inputs and outputs: mojibake is a feature, just pass bytes unchanged. The UTF-8 strict mode is more extreme than the current "Strict Unicode mode" since it fails on *decoding* data from the operating system.
Now that I have a better view of what we have and what we want, the question is if the default behaviour should be changed, and if yes, how. Nick's PEP 538 moves exactly to neither the "UNIX mode" (open() doesn't use surrogateescape) nor the "Strict Unicode mode" (fsdecode() still uses surrogateescape); it's still in a grey area. Maybe Nick can elaborate the use case or update his PEP? I guess that all users and most developers are more in the "UNIX mode" camp. *If* we want to change the default, I suggest to use the "UNIX mode" by default. The question is if someone relies on/likes the current Python 3.6 behaviour: reading "just works", writing is strict. If you like this behaviour, what do you think of the tiny Python 3.6 change: use surrogateescape for stdout when the locale is POSIX. Victor

2017-01-05 17:50 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
A common request is that "Python just works" without having to pass a command line option or set an environment variable. Maybe the default behaviour should be left unchanged, but the behaviour with the POSIX locale should change. Maybe we can enable the UTF-8 mode (or "UNIX mode") of the PEP 540 when the POSIX locale is used? Said differently, the UTF-8 mode would not only be enabled by -X utf8 and PYTHONUTF8=1, but also by the common LANG=C and when Python is started in an empty environment (no env vars). Victor

Ok, I modified my PEP: the POSIX locale now enables the UTF-8 mode. 2017-01-05 18:10 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
http://bugs.python.org/issue28180 asks to "change the default" to get a Python which "just works" without any kind of configuration, in the context of a Docker image (I don't have any details about the image yet).
Maybe we can enable the UTF-8 mode (or "UNIX mode") of the PEP 540 when the POSIX locale is used?
I read again other issues and I confirm that users are looking for a Python 3 which behaves like Python 2: simply don't bother them with encodings. I see the UTF-8 mode as an opportunity to answer this request. Moreover, the most common cause of encoding issues is a program run with no locale variable set, and so using the POSIX locale. So I modified my PEP 540: the POSIX locale now enables the UTF-8 mode. I had to update the "Backward Compatibility" section since the PEP now introduces a backward incompatible change (POSIX locale), but my bet is that the new behaviour is the one expected by users and that it cannot break applications. I moved my initial proposal to the alternatives. I added a "Use Cases" section to explain in depth the "always works" behaviour, which I called the "UNIX mode" in my previous email. Latest version of the PEP: https://github.com/python/peps/blob/master/pep-0540.txt https://www.python.org/dev/peps/pep-0540/ will be updated shortly. Victor

LGTM. Some comments: I want the UTF-8 mode to be enabled by default (as an opt-out option) even if the locale is not POSIX, like `PYTHONLEGACYWINDOWSFSENCODING`. Users who depend on the locale know what a locale is and how to configure it. They can understand the difference between the locale mode and the UTF-8 mode, and they can opt out of the UTF-8 mode. But many people live in a "UTF-8 everywhere" world, and don't know about locales. The `-X utf8` option should be parsed before converting command line arguments to wchar_t*. How about adding Py_UnixMain(int argc, char** argv) which is available only on Unix? I dislike the wchar_t type and the mbstowcs functions on Unix. (I love wchar_t on Windows, of course). I hope we can remove `wchar_t *wstr` from PyASCIIObject and deprecate all wchar_t APIs on Unix in the future. On Fri, Jan 6, 2017 at 10:43 AM, Victor Stinner <victor.stinner@gmail.com> wrote:

2017-01-06 8:21 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
You do, I don't :-) It shouldn't be hard to find very concrete issues among the mojibake issues described at: https://www.python.org/dev/peps/pep-0540/#expected-mojibake-issues IMHO there are 3 steps before being able to reach your dream: 1) add opt-in support for UTF-8 2) use UTF-8 if the locale is POSIX 3) UTF-8 is enabled by default I would prefer to begin with a first Python release at stage (1) or (2), wait for user complaints, and later decide if we can move to (3). Right now, I haven't implemented the PEP 540 yet, so I wasn't able to experiment anything in practice. Well, at least it means that I have to elaborate the "Always use UTF-8" alternative of my PEP to explain why I consider that we are not ready to switch directly to this "obvious" option.
Users who depend on the locale know what a locale is and how to configure it.
It's not a matter of users, but a matter of code in the wild which uses C functions like mbstowcs() or wcstombs() directly. These functions use the current locale encoding; they are not aware of the new Python UTF-8 mode.
But many people live in a "UTF-8 everywhere" world, and don't know about locales.
The PEP 540 was written to help users in very concrete cases. I've been repeating since Python 3.0 that users must learn how to configure their locale. Well, 8 years later, I keep getting exactly the same user complaint: "Python doesn't work, it must just work!". It's really hard to decode bytes and later encode the text while preventing any kind of encoding error. That's why no solution was proposed before.
`-X utf8` option should be parsed before converting commandline (...)
Yeah, that's a tough technical issue. I'm not sure right now how to implement this with a clean design. Maybe I will just try with a hack? :-) Victor

INADA Naoki writes:
I find all this very strange coming from someone with what looks like a Japanese name. I see mojibake and non-Unicode encodings around me all the time. Caveat: I teach at a university that prides itself on being the most international of Japanese national universities, so in my daily work I see Japanese in 4 different encodings (5 if you count the UTF-16 used internally by MS Office), Chinese in 3 different (claimed) encodings, and occasionally Russian in at least two encodings, ..., uh, I could go on but won't. In any case, the biggest problems are legacy email programs and busted websites in Japanese, plus email that is labeled "GB2312" but actually conforms to GBK (and this is a reply in Japanese to a Chinese applicant writing in Japanese encoded as GBK). I agree that people around me mostly know only two encodings: "works for me" and "mojibake", but they also use locales configured for them by technical staff. On top of that, international students (the most likely victims of "UTF-8 by default", because students are the biggest Python users) typically have non-Japanese locales set on their imported computers. I'm not going to say my experience is typical enough to block "UTF-8 by default", but let's do this very carefully and thoughtfully.

On 8 January 2017 at 02:47, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Unsurprisingly (given where I work [1]), one of my key concerns is to enable large Python-using institutions to keep moving forward, regardless of whether they've fully standardised their internal environments on UTF-8 or not. As such, while I'm entirely in favour of pushing people towards UTF-8 as the default choice everywhere, I also want to make sure that system and application integrators, including the folks responsible for defining the Standard Operating Environments in large organisations, get warnings of potential problems when they arise, and continue to get encoding errors when we have definitive evidence of a compatibility problem. For me, that boils down to: - if a locale is properly configured, we'll continue to respect it - if we're ignoring or changing the locale setting without an explicit config option, we'll emit a warning on stderr that we're doing so (*without* using the warnings system, so there's no way to turn it into an exception) - if a UTF-8 based Linux container is run on a GB-18030/ISO-2022/Shift-JIS/etc host and tries to exchange locally encoded data with that host (rather than exchanging UTF-8 encoded data over a network connection), getting an exception is preferable to silently corrupting the data stream (I think I'll add something along those lines to PEP 538 as a new "Core Design Principles" section) Cheers, Nick. [1] https://docs.python.org/devguide/motivations.html#published-entries -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jan 8, 2017 at 1:47 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Since I work at a tech company, and mostly use Linux only for "server-side" programs, I don't live in such a situation. But when I see non-UTF-8 text, I don't change my locale to read it. (Actually speaking, the locale doesn't solve mojibake because it doesn't change my terminal emulator's encoding). And I don't change my terminal emulator settings just to read such text. What I do is convert it to UTF-8 through a command like `view text-from-windows.txt ++enc=cp932`. So there is no problem when Python always uses UTF-8 for the fs encoding and stdio encoding.
Hmm, which OS do they use? There is no problem on macOS and Windows. Do they use Linux with a locale whose encoding is not UTF-8, and does their terminal emulator use a non-UTF-8 encoding? My feeling is that UTF-8 started dominating about 10 years ago, and ja_JP.EUC_JP (the most common locale for Japanese before UTF-8) is completely legacy. There is only one machine I can ssh into which has the ja_JP.eucjp locale (it's on the LAN, has been alive for 10+ years, and its /usr/bin/python is Python 1.5!).

INADA Naoki writes:
But when I see non-UTF-8 text, I don't change my locale to read it.
Nobody does. The problem is if people have locales set for non-UTF-8, which Chinese people often do ("GB18030 isn't just a good idea, it's the law"). Especially forcing stdout to something other than the locale is likely to mess things up.
My university's internal systems typically produce database output (class registration lists and the like) in Shift JIS, but that's not reliable. Some departments still have their home pages in EUC-JP, and pages where the meta http-equiv elements disagree with the content are not unusual. Private sector may be up to date, but academic sector (and from the state of e-stat.go.jp, government in general, I suspect) is stuck in the Jomon era. I don't know that there's going to be a problem, but the idea of implicitly forcing an encoding different from the locale seems likely to cause confusion to me. Aside from Nick's special case of containers supplied by a vendor different from the host OS, I don't really see why this is a good idea. I think it's best to go with the locale that is set (or not), unless we have very good reason to believe that by far most users would be surprised by that, and those who aren't surprised are mostly expert enough to know how to deal with a forced UTF-8 environment if they *don't* want it. A user-selected option is another matter.

Hi Stephen, 2017-01-09 19:42 GMT+01:00 Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp>:
I went to that page, checked the HTML and found: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> Admittedly, the page is in HTML 4.01, but then the Jomon era predates HTML5 by about 16,000 years, so I'll cut them some slack. Anyway, I am quite willing to believe that the situation is as dire as you describe on Windows. However, on OS X, Apple enforces UTF-8. And the Linux vendors are moving in that direction too. And the proposal under discussion is specifically about Linux. So, again, I am wondering whether there are many people who end up with a *Linux* system which has a non-UTF-8 locale. For example, if you use the Ubuntu graphical installer, it asks for your language and then gives you the UTF-8 locale which comes with that. You have to really dive into the non-graphical configuration to get yourself a non-UTF-8 locale. Stephan

Oh, I didn't know non-UTF-8 was still being used for LC_CTYPE in recent years!
I talked about LC_CTYPE. We have some legacy files too, but that relates to neither the fs encoding nor the stdio encoding.
Yes. This is a balancing matter. Some people are surprised that Python may not use UTF-8 even when their source code is written in UTF-8, unlike most other languages. (Not only Rust, Go, node.js, but also Ruby, Perl, or even C!) And some people are surprised because they used the locale to tell some commands the terminal encoding (which is not UTF-8), and Python ~3.6 followed it. I think the latter group is very small, and will be even smaller when 3.7 is released. And if we can drop locale support in the future, we will be able to remove some very dirty code in Python/fileutil.c. That's why I prefer a locale-free UTF-8 mode by default, and a locale-aware mode as opt-in. But I'm OK if we start by only ignoring the C locale, sure.

Victor Stinner writes:
The point of this, I suppose, is that piping to xargs works by default. I haven't read the PEPs (don't have time, mea culpa), but my ideal would be three options: --transparent -> errors=surrogateescape on input and output --postel -> errors=surrogateescape on input, =strict on output --unicode-me-harder -> errors=strict on input and output with --postel being the default. Unix aficionados with lots of xargs use can use --transparent. Since people have different preferences, I guess there should be an envvar for this. Others probably should configure open() by open(). I'll try to get to the PEPs over the weekend but can't promise. Steve

2017-01-06 3:10 GMT+01:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
The point of this, I suppose, is that piping to xargs works by default.
Please read the second version (latest) version of my PEP 540 which contains a new "Use Cases" section which helps to define issues and the behaviour of the different modes.
PEP 540: --postel is the default; --transparent is the UTF-8 mode; --unicode-me-harder is the UTF-8 mode configured to strict. The POSIX locale enables --transparent.
The PEP adds a new -X utf8 command line option and a PYTHONUTF8 environment variable to configure the UTF-8 mode.
Others probably should configure open() by open().
My PEP 540 does change the encoding used by open() by default: https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler Obviously, you can still explicitly set the encoding when calling open().
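Passing encoding= explicitly keeps open() independent of both the locale and the UTF-8 mode. A small sketch (the file name is made up for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")

# An explicit encoding wins over whatever default the active mode would pick
with open(path, "w", encoding="latin-1") as f:
    f.write("café")

with open(path, "rb") as f:
    assert f.read() == b"caf\xe9"  # 'é' is the single byte 0xE9 in Latin-1
```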
I'll try to get to the PEPs over the weekend but can't promise.
Please read at least the abstract of my PEP 540 ;-) Victor

On Jan 05, 2017, at 05:50 PM, Victor Stinner wrote:
FWIW, it seems to be a general and widespread recommendation to always pass universal_newlines=True to Popen and friends when you only want to deal with unicode from subprocesses: If encoding or errors are specified, or universal_newlines is true, the file objects stdin, stdout and stderr will be opened in text mode using the encoding and errors specified in the call or the defaults for io.TextIOWrapper. Cheers, -Barry

Passing universal_newlines will use whatever locale.getdefaultencoding() returns (which at least on Windows is useless enough that I added the encoding and errors parameters in 3.6). So it sounds like it'll only actually do Unicode on Linux if enough of the planets have aligned, which is what Victor is trying to do, but you can't force the other process to use a particular encoding. universal_newlines may become a bad choice if the default encoding no longer matches what the environment says, and personally, I wouldn't lose much sleep over that. (As an aside, when I was doing all the Unicode changes for Windows in 3.6, I eventually decided that changing locale.getdefaultencoding() was too big a breaking change to ever be a good idea. Perhaps that will be the same result here too, but I'm nowhere near familiar enough with the conventions at play to state that with any certainty.) Cheers, Steve

On Jan 06, 2017, at 11:08 PM, Steve Dower wrote:
Passing universal_newlines will use whatever locale.getdefaultencoding()
There is no locale.getdefaultencoding(); I think you mean locale.getpreferredencoding(False). (See the "Changed in version 3.3" note in §17.5.1.1 of the stdlib docs.)
universal_newlines is also problematic because it's misnamed relative to the more common motivation for using it. Very often people do want to open std* in text mode (and thus trade in Unicode), but they rarely equate that to "universal newlines". So the option is just one more hidden magical side-effect and cargo-culted lore. It's certainly *useful* though, and I think we want to be sure that we don't break existing code that uses it for this purpose. Cheers, -Barry
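For reference, a minimal use of universal_newlines, which despite its name mainly turns the std* pipes into text streams decoded with locale.getpreferredencoding(False):

```python
import subprocess
import sys

# Run a child interpreter and get its stdout back as str, not bytes
out = subprocess.check_output(
    [sys.executable, "-c", "print('hello')"],
    universal_newlines=True,
)
assert out == "hello\n"
assert isinstance(out, str)
```

Without universal_newlines=True (or, in 3.6+, an explicit encoding= argument), the same call returns raw bytes.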

Hi! On Thu, Jan 05, 2017 at 04:38:22PM +0100, Victor Stinner <victor.stinner@gmail.com> wrote:
Please don't! I use different locales and encodings, sometimes it's utf-8, sometimes not - but I have properly configured LC_* settings and I prefer Python to follow my command. It'd be disgusting if Python starts to bend me to its preferences.
The risk is to introduce mojibake if the locale uses a different encoding, especially for locales other than the POSIX locale.
There is no such risk for me as I already have mojibake in my systems. Two most notable sources of mojibake are: 1) FTP servers - people create files (both names and content) in different encodings; w32 FTP clients usually send file names and content in cp1251 (Russian Windows encoding), sometimes in cp866 (Russian Windows OEM encoding). 2) MP3 tags and play lists - almost always cp1251. So whatever my personal encoding is - koi8-r or utf-8 - I have to deal with file names and content in different encodings. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

For stdio (including the console), PYTHONIOENCODING can be used to support legacy systems, e.g. `export PYTHONIOENCODING=$(locale charmap)`. For command line arguments and file paths, UTF-8/surrogateescape can round trip. But mojibake may happen when passing the path to a GUI. If we choose "Always use UTF-8 for fs encoding", I think the PYTHONFSENCODING envvar should be added again. (It should be used from startup: to decode command line arguments).
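The PYTHONIOENCODING workaround can be checked from Python itself: here a child interpreter is forced to Latin-1 stdio (the choice of latin-1 is only for illustration):

```python
import os
import subprocess
import sys

# Force the child interpreter's stdio to Latin-1, overriding its locale
env = dict(os.environ, PYTHONIOENCODING="latin-1")
out = subprocess.check_output(
    [sys.executable, "-c", "print('caf\\u00e9')"],
    env=env,
)
# The child's stdout is now Latin-1: 'é' comes back as the byte 0xE9
assert out.rstrip() == b"caf\xe9"
```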
3) unzip zip files sent by Windows users. Windows users use non-ASCII filenames, and create legacy (non UTF-8) zip files very often. I think people using non UTF-8 should solve encoding issues by themselves. People should always use ASCII or UTF-8 if they don't want to see mojibake.

2017-01-06 2:15 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
The problem with ignoring the locale by default and forcing UTF-8 is that Python works with many libraries which use the locale, not UTF-8. The PEP 538 also describes mojibake issues if Python is embedded in an application.
For commandline argument and filepath, UTF-8/surrogateescape can round trip. But mojibake may happens when pass the path to GUI.
Let's say that you have the filename b'nonascii\xff': it's decoded as 'nonascii\udcff' by the UTF-8 mode. How do GUIs handle such a filename? (I don't know the answer, it's a real question ;-))
Last time I implemented PYTHONFSENCODING, I had many major issues: https://mail.python.org/pipermail/python-dev/2010-October/104509.html Do you mean that these issues are now outdated and that you have an idea how to fix them?
ZIP files are out of the scope of PEPs 538 and 540. Python cannot guess the encoding, so it was proposed to add an option to give the user the ability to specify an encoding: see https://bugs.python.org/issue10614 for example. But yeah, data encoded in encodings other than UTF-8 is still common, and that's not going to change shortly. Since many Windows applications use the ANSI code page, I can easily imagine that many documents are encoded in various incompatible code pages... What I understood is that many users don't want Python to complain about data encoded in different incompatible encodings: process data as a stream of bytes or characters, it depends. Something closer to Python 2 (stream of bytes). That's what I try to describe in this section: https://www.python.org/dev/peps/pep-0540/#old-data-stored-in-different-encod... Victor

On Fri, Jan 06, 2017 at 02:54:49AM +0100, Victor Stinner wrote:
I ran this in Python 2.7 to create the file: open(b'/tmp/nonascii\xff-', 'w') and then confirmed the filename: [steve@ando tmp]$ ls -b nonascii* nonascii\377- Konqueror in KDE 3 displays it with *two* "missing character" glyphs (small hollow boxes) before the hyphen. The KDE "Open File" dialog box shows the file with two blank spaces before the hyphen. My interpretation of this is that the difference is due to using different fonts: the file name is shown the same way, but in one font the missing character is a small box and in the other it is a blank space. I cannot tell what KDE is using for the invalid character; if I copy it as text and paste it into a file I just get the original \xFF. The Geany text editor, which I think uses the same GUI toolkit as Gnome, shows the file with a single "missing glyph" character, this time a black diamond with a question mark in it. It looks like Geany (Gnome?) is displaying the invalid byte as U+FFFD, the Unicode "REPLACEMENT CHARACTER". So at least two Linux GUI environments are capable of dealing with filenames that are invalid UTF-8, in two different ways. Does this answer your question about GUIs? -- Steve

Hi all, One meta-question I have which may already have been discussed much earlier in this whole proposal series, is: How common is this problem? Because I have the impression that nowadays all Linux distributions are UTF-8 by default and you have to show some bloody-mindedness to end up with a POSIX locale. Docker was mentioned, is this not really an issue which should be solved at the Docker level? Since it would affect *all* applications which are doing something non-trivial with encodings? I realise there is some attractiveness in solving the issue "for Python", since that will reduce the amount of bug reports and get people off the chests of the maintainers, but to get this fixed in the wider Linux ecosystem it might be preferable to "Let them eat mojibake", to paraphrase what Marie-Antoinette never said. Stephan 2017-01-06 5:49 GMT+01:00 Steven D'Aprano <steve@pearwood.info>:

2017-01-06 7:22 GMT+01:00 Stephan Houben <stephanh42@gmail.com>:
How common is this problem?
In the last 2 or 3 years, I don't recall having been bitten by such an issue. On the bug tracker, new issues are opened infrequently:

* http://bugs.python.org/issue19977 opened 2013-12-13, closed 2014-04-27
* http://bugs.python.org/issue19846 opened 2013-11-30, closed as NOTABUG 2015-05-17, but got new comments after it was closed
* http://bugs.python.org/issue19847 opened 2013-11-30, closed as NOTABUG 2013-12-13
* http://bugs.python.org/issue28180 opened 2016-09-16, still open

Again, I don't think that this list is complete; I recall other similar issues.
What do you mean by "eating mojibake"? Users complain because their application is stopped by a Python exception. Currently, most Python 3 applications don't produce or display mojibake, since Python is strict on outputs. (One exception: stdout with the POSIX locale since Python 3.5.) Victor

Hi Victor, 2017-01-06 13:01 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
What do you mean by "eating mojibake"?
OK, I erroneously understood that the failure mode was that mojibake was produced.
Users complain because their application is stopped by a Python exception.
Got it.
OK, I now tried it myself and indeed it produces the following error:

    UnicodeEncodeError: 'ascii' codec can't encode character '\xfe' in position 0: ordinal not in range(128)

My suggestion would be to make this error message more specific. In particular, if we have LC_CTYPE/LANG=C or unset, we could print something like the following information (on Linux only):

"""
You are attempting to use non-ASCII Unicode characters while your system
has been configured (possibly erroneously) to operate in the legacy "C"
locale, which is pure ASCII. It is strongly recommended that you
configure your system to allow arbitrary non-ASCII Unicode characters.
This can be done by configuring a UTF-8 locale, for example:

    export LANG=en_US.UTF-8

Use:

    locale -a | grep UTF-8

to get a list of all valid UTF-8 locales on your system.
"""

Stephan
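[Editor's note: a minimal sketch of the detection step behind this suggestion. The function name and exact wording are hypothetical, not part of any PEP; it simply inspects the locale environment variables in their POSIX precedence order.]

```python
import os
import sys

def is_legacy_c_locale(env=None):
    """Return True if the environment selects the legacy C/POSIX locale."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    lc = env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG")
    return lc in (None, "", "C", "POSIX")

def warn_ascii_locale():
    # Hypothetical friendlier diagnostic, as suggested above.
    if is_legacy_c_locale():
        print("Warning: your system is configured for the pure-ASCII 'C' "
              "locale; consider e.g. 'export LANG=en_US.UTF-8' "
              "(run 'locale -a | grep UTF-8' for valid choices).",
              file=sys.stderr)
```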

Chris Barker - NOAA Federal writes:
Of course. The question is not "should cb@noaa properly configure docker?", it's "Can docker properly configure docker (soon enough)? And if not, should we configure Python?" The third question depends on whether fixing it for you breaks things for others.

When talking about a general Docker image, using the C locale is OK in most cases. In other words, an image using the C locale is properly configured. Node.js, Ruby, Perl, Go and Rust applications can all talk UTF-8 in Docker using the C locale, without special configuration. Locale-dependent applications are very much the exception in this area.

INADA Naoki writes:
s/properly/compatibly/. "Proper" has strong connotations of "according to protocol". Configuring LC_CTYPE for ASCII expecting applications to say "You're lying!" and spew UTF-8 anyway is not "proper". That kind of thing makes me very nervous, and I think justifiably so. And it's only *sufficient* to justify a change to Python's defaults if Python checks for and accurately identifies when it's in a container. Anyway, I need to look more carefully at the actual PEPs and see if there's something concrete to worry about. But remember, we have about 18 months to chew over this if necessary -- I'm only asking for a few more days (until after the "cripple the minds of Japanese youth day", er, "University Admissions Center Examination" this weekend ;-). Steve

In my company, we use network boot servers. To reduce the boot image, the image is built with a minimalistic approach too. So there was only the C locale on most of our servers, and many people in my company have been bitten by this problem. I told them to add `export PYTHONIOENCODING=utf-8` to their .bashrc. But they were bitten again when using cron. So this problem is not only for Docker containers. Since UTF-8 became dominant, many people including me use the C locale to avoid unintentional behavior from commands that consult the locale (sort, ls, date, bash, etc...), and use a non-C locale only for reading non-English output from some commands, like `man` or `hg help`. It's for i18n / l10n, but not for changing the encoding. People who live in a UTF-8 world are never helped by the locale changing the encoding. They are only bitten by the behavior.
Of course. And neither PEP proposes changing the default behavior except for the C locale. So there are 36+ months for changing the default behavior. I hope 36 months is enough for people on legacy systems to move to the UTF-8 world. Regards,

Here is one example of a locale pitfall.

---
# from http://unix.stackexchange.com/questions/169739/why-is-coreutils-sort-slower-...

$ cat letters.py
import string
import random

def main():
    for _ in range(1_000_000):
        c = random.choice(string.ascii_letters)
        print(c)

main()

$ python3 letters.py > letters.txt
$ LC_ALL=C time sort letters.txt > /dev/null
        0.35 real         0.32 user         0.02 sys
$ LC_ALL=C.UTF-8 time sort letters.txt > /dev/null
        0.36 real         0.33 user         0.02 sys
$ LC_ALL=ja_JP.UTF-8 time sort letters.txt > /dev/null
       11.03 real        10.95 user         0.04 sys
$ LC_ALL=en_US.UTF-8 time sort letters.txt > /dev/null
       11.05 real        10.97 user         0.04 sys
---

This is why some engineers, including me, use the C locale on Linux, at least when there is no C.UTF-8 locale. Of course, we can use LC_CTYPE=en_US.UTF-8 instead of LANG or LC_ALL. (I wonder if we can use LC_CTYPE=UTF-8...) But I dislike the current situation, where "people should learn how to configure the locale properly, and the pitfalls of non-C locales, only to use UTF-8 with Python".

INADA Naoki writes:
Off course, we can use LC_CTYPE=en_US.UTF-8, instead of LANG or LC_ALL.
You can also use LC_COLLATE=C.
(I wonder if we can use LC_CTYPE=UTF-8...)
Syntactically incorrect: that means the language UTF-8. "LC_CTYPE=.UTF-8" might work, but IIRC the language tag is required, while the region and encoding are optional. Thus ja_JP and ja.UTF-8 are OK, but .UTF-8 is not. Rant follows:
You can use a distro that implements and defaults to the C.utf-8 locale, and presumably you'll be OK tomorrow, well before 3.7 gets released. (If there are no leftover mines in the field, I don't see a good reason to wait for 3.8 given the known deficiencies of the C locale and the precedent of PEPs 528/529.) Really, we're catering to users who won't set their locales properly and insist on old distros. For Debian, C.utf-8 was suggested in 2009[1], and that RFE refers to other distros that had already implemented it. I have all the sympathy in the world for them -- systems *should* Just Work -- but I'm going to lean against kludges if they mean punishing people who actually learn about and conform to applicable standards (and that includes well-motivated, properly- documented, and carefully-implemented platform-specific extensions), or use systems designed by developers who do.[2] Footnotes: [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=609306 [2] I know how bad standards can suck -- I'm a Mailman developer, looking at you RFC 561, er, 5322. While I'm all for nonconformism if you take responsibility for any disasters that result, developers who conform on behalf of their users are heroes.

I'm sorry. I know, but I'm not good at English. I meant "I wish POSIX allowed an LC_CTYPE=UTF-8 setting." It's just my wish.
Many people use a new Python on legacy Linux systems which don't have the C.UTF-8 locale. I learned how to configure the locale for using UTF-8 with Python. But I don't want to force people to learn it, only for Python.
CentOS 7 (and RHEL 7, presumably) doesn't seem to provide C.UTF-8 by default. This means C.UTF-8 will not be a universally available locale for at least the next 5 years.

$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
$ locale -a | grep ^C
C

Hi INADA Naoki, (Sorry, I am unsure if INADA or Naoki is your first name...) While I am very much in favour of everything working "out of the box", an issue is that we don't have control over external code (be it Python extensions or external processes invoked from Python). And that code will only look at LANG/LC_CTYPE and ignore any cleverness we build into Python. For example, this may mean that a built-in Python string sort will give you a different ordering than invoking the external "sort" command. I have been bitten by this kind of issue, leading to spurious "diffs" if you try to use sorting to put strings into a canonical order. So my feeling is that people are ultimately not being helped by Python trying to be "nice", since they will be bitten by locale issues anyway. IMHO it is ultimately better to educate them to configure the locale. (I realise that people may reasonably disagree with this assessment ;-) ) I would then recommend setting it to en_US.UTF-8, which is slower and less elegant but at least more widely supported. By the way, I know a bit about how Node.js deals with locales, and it doesn't try to compensate for "C" locales either. But what it *does* do is that Node never uses the locale settings to determine the encoding of a file: you either have to specify it explicitly OR it defaults to UTF-8 (the latter on output only). So in this respect it is by specification immune against misconfiguration of the encoding. However, other stuff (e.g. date formatting) will still be influenced by the "C" locale as usual. Stephan 2017-01-11 9:17 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
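[Editor's note: the divergence Stephan describes can be demonstrated directly. Python's built-in ``sorted()`` always orders strings by code point, while a locale-aware comparison (what the external ``sort`` command performs) goes through ``strcoll``. A small sketch:]

```python
import functools
import locale

words = ["banana", "Apple", "apple", "Banana"]

# Code point order, what Python's sorted() always does: uppercase first.
assert sorted(words) == ["Apple", "Banana", "apple", "banana"]

# Locale-aware order, what the external `sort` uses: it depends on the
# user's LC_COLLATE, so the result below may differ between systems.
try:
    locale.setlocale(locale.LC_COLLATE, "")  # adopt the configured locale
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "C")  # fall back if unavailable
collated = sorted(words, key=functools.cmp_to_key(locale.strcoll))
```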

On 01/11/2017 11:46 AM, Stephan Houben wrote:
AFAIK, this would not be a problem under PEP 538, which effectively treats the "C" locale as "C.UTF-8". Strings of Unicode codepoints and the corresponding UTF-8-encoded bytes sort the same way. Is that wrong, or do you have a better example of trouble with using "C.UTF-8" instead of "C"?
What about the spurious diffs you'd get when switching from "C" to "en_US.UTF-8"?

$ LC_ALL=en_US.UTF-8 sort file.txt
a
a
A
A
$ LC_ALL=C sort file.txt
A
A
a
a
I believe the main problem is that the "C" locale really means two very different things:

a) Text is encoded as 7-bit ASCII; higher codepoints are an error
b) No encoding was specified

In both cases, treating "C" as "C.UTF-8" is not bad:

a) For 7-bit "text", there's no real difference between these locales
b) UTF-8 is a much better default than ASCII
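[Editor's note: the property underlying Petr's earlier point that the substitution doesn't affect sorting — a quick check, not from the thread — is that UTF-8 was designed so that byte-wise comparison of encoded strings matches code point comparison of the decoded strings.]

```python
words = ["a", "A", "\u00e9", "z", "\u65e5\u672c", "\U0001F600"]

# Sorting the Unicode strings by code point...
by_str = sorted(words)

# ...gives the same order as sorting their UTF-8 encodings byte-wise.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

assert by_str == by_utf8_bytes
```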

Hi Petr, 2017-01-11 12:22 GMT+01:00 Petr Viktorin <encukou@gmail.com>:
...and this is also something new I learned.
Is that wrong, or do you have a better example of trouble with using "C.UTF-8" instead of "C"?
After long deliberation, it seems I cannot find any source of trouble. +1 So my feeling is that people are ultimately not being helped by
That taught me to explicitly invoke "sort" using LANG=en_US.UTF-8 sort
A "C" locale also means that a program should not *output* non-ASCII characters, unless they are explicitly fed in (as in the case of "cat" or "sort" or the "ls" equivalent from PEP 540). So for example, a program might want to print fancy Unicode box characters to show progress, and check sys.stderr.encoding to see if it can do so. However, under a "C" locale it should not do so, since for example the terminal is unlikely to display the fancy box characters properly. Note that the current PEP 540 proposal is that sys.stderr uses UTF-8/backslashreplace under the "C" locale. I think this may be a minor concern ultimately, but it would be nice if we had some API to at least reliably answer the question "can I safely output non-ASCII myself?" Stephan
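[Editor's note: a rough sketch of such an "is non-ASCII output safe?" check. This is a hypothetical helper, not an existing API: it tries to encode the given text with the stream's advertised encoding under strict errors.]

```python
import sys

def can_output(text, stream=None):
    """Best-effort check: can `text` be written to `stream` losslessly?"""
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, "encoding", None) or "ascii"
    try:
        text.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True

# e.g. choose between fancy box-drawing characters and ASCII fallbacks:
bar = "\u2500" * 10 if can_output("\u2500") else "-" * 10
```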

On Wed, Jan 11, 2017 at 7:46 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Hi INADA Naoki,
(Sorry, I am unsure if INADA or Naoki is your first name...)
Never mind, I don't care about name ordering. (INADA is family name).
I'm sorry, could you give me a more concrete example? My opinion is +1 to PEP 540: there should be an option to ignore the locale setting. (And I hope it will be the default setting in a future version.) What is your concern?
But some people can't accept a 30x slowdown just for sorting ASCII text. At least, the infrastructure engineers in my company love the C locale. New Python programmers (e.g. the many data scientists learning Python) may want to work on a Linux server, and learning about locales is not their concern. Web programmers are the same: they just want to print UTF-8. Learning about locales may not be worth it for them. But I think there should be an option, and I want to use it.
Yes. Both PEP 538 and PEP 540 are about encodings. I'm sorry about my misleading word "locale-free". There should be locale support for time formatting, at least with a UTF-8 locale. Regards,

On 11 January 2017 at 17:05, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
FWIW, I'm hoping to backport whatever improved handling of the C locale that we agree on for Python 3.7+ to the initial system Python 3.6.0 release in Fedora 26 [1] - hence the section about redistributor backports in PEP 538. While the problems with the C locale have been known for a while, this latest attempt to do something about it started as an idea I had for a downstream Fedora-specific patch (which became PEP 538), while that PEP in turn served as motivation for Victor to write PEP 540 as an alternative approach that didn't depend on having the C.UTF-8 locale available. With the F26 Alpha at the end of February and the F26 Beta in late April, I'm hoping we can agree on a way forward without requiring months to make a decision :)
-- I'm only asking for a few more days
Yeah, while I'd prefer not to see the discussions drag out indefinitely, there's also still plenty of time for folks to consider the PEPs closely and formulate their perspective. Cheers, Nick. [1] https://fedoraproject.org/wiki/Releases/26/Schedule -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jan 06, 2017, at 07:22 AM, Stephan Houben wrote:
It can still happen in some corner cases, even on Debian and Ubuntu where C.UTF-8 is available and e.g. my desktop defaults to en_US.UTF-8. For example, in an sbuild/schroot environment[*], the default locale is C and I've seen package build failures because of this. There may be other such "corner case" environments where this happens too. Cheers, -Barry [*] Where sbuild/schroot is a very common suite of package building tools.

On Sat, Jan 7, 2017 at 8:20 AM, Barry Warsaw <barry@python.org> wrote:
A lot of background jobs get run in a purged environment, too. I don't remember exactly which ones land in the C locale and which don't, but check cron jobs, systemd background processes, inetd, etc, etc, etc. Having Python DTRT in those situations would be a Good Thing. ChrisA

2017-01-06 22:20 GMT+01:00 Barry Warsaw <barry@python.org>:
Right, that's the whole point of Nick's PEP 538 and my PEP 540: it's still common to get the POSIX locale. I began to list examples of practical use cases where you get the POSIX locale. https://www.python.org/dev/peps/pep-0540/#posix-locale-used-by-mistake I'm not sure about the title of the section: "POSIX locale used by mistake". Barry: About chroot, why do you get a C locale? Is it because no locale is explicitly configured? Or because no locale is installed in the chroot? Would it work if we had a tool to copy the locale from the host when creating the chroot: env vars and the data files required by the locale (if any)? The schroot issue seems close to this reported issue: http://bugs.python.org/issue28180 I understand that it's more a configuration issue than a deliberate choice to use the POSIX locale. Again, the user requirement is that Python 3 should just work without any kind of specific configuration, as other classic UNIX tools do. Victor

On Jan 06, 2017, at 11:33 PM, Victor Stinner wrote:
For some reason it's not configured:

% schroot -u root -c sid-amd64
(sid-amd64)# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
(sid-amd64)# export LC_ALL=C.UTF-8
(sid-amd64)# locale
LANG=
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=C.UTF-8

I'm not sure why that's the default inside a chroot. I thought there was a bug or discussion about this, but I can't find it right now. Generally when this happens, exporting this environment variable in your debian/rules file is the way to work around the default. Cheers, -Barry

2017-01-07 1:06 GMT+01:00 Barry Warsaw <barry@python.org>:
For some reason it's not configured: (...)
Ok, thanks for the information.
I'm not sure why that's the default inside a chroot.
I found at least one good reason to use the POSIX locale to build a package: it helps to get reproducible builds, see: https://reproducible-builds.org/docs/locales/ I used it as an example in my new rationale: https://www.python.org/dev/peps/pep-0540/#it-s-not-a-bug-you-must-fix-your-l... I tried to explain how using LANG=C can be a smart choice in some cases, and so that Python 3 should do its best not to annoy the user with Unicode errors. I also started to list cases where you get the POSIX locale "by mistake". As I wrote previously, I'm not sure that it's correct to add "by mistake". https://www.python.org/dev/peps/pep-0540/#posix-locale-used-by-mistake By the way, I tried to force the POSIX locale in my benchmarking "perf" module. The idea was to get more reproducible results between heterogeneous computers. But I got a bug report. So I decided to copy the locale by default and add an opt-in --no-locale option to ignore the locale (force the POSIX locale). https://github.com/haypo/perf/issues/15 Victor

On Fri, Jan 06, 2017 at 10:15:52AM +0900, INADA Naoki <songofacandy@gmail.com> wrote:
This means one more thing to reconfigure when I switch locales, instead of Python catching up automatically.
Good example, thank you! I forgot about it because I have wrote my own zip.py and unzip.py that encode/decode filenames.
I think people using non-UTF-8 encodings should solve encoding issues by themselves. People should always use ASCII or UTF-8 if they don't want to see mojibake.
Impossible. Even if I always used UTF-8, I would still receive a lot of cp1251/cp866. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

2017-01-06 3:10 GMT+01:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
The "always ignore the locale and force UTF-8" option has supporters. For example, Nick Coghlan wrote a whole PEP, PEP 538, to support this. I want my PEP to be complete, so it lists all the well-known alternatives. Victor

On 6 January 2017 at 12:37, Victor Stinner <victor.stinner@gmail.com> wrote:
Err, no, that's not what PEP 538 does. PEP 538 doesn't do *anything* if a locale is already properly configured - it only changes the locale if the current locale is "C". It's actually very similar to your PEP, except that instead of adding the ability to make CPython ignore the C level locale settings (which I think is a bad idea based on your own previous work in that area and on the way that CPython interacts with other C/C++ components in the same process and in subprocesses), it just *changes* those settings when we're pretty sure they're wrong. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
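[Editor's note: the coercion trigger Nick describes can be sketched as follows. This is a rough Python-level approximation; the actual CPython startup code works in C, before the interpreter is up.]

```python
import locale

def legacy_c_locale_detected():
    """Approximate the PEP 538 check: does the user's configured
    LC_CTYPE locale resolve to the legacy "C"/"POSIX" locale?"""
    try:
        # Adopt the user's locale, as CPython does at startup.
        current = locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return True  # an unknown locale falls back to C anyway
    return current in ("C", "POSIX")

# Under PEP 538, only this case would trigger coercion to C.UTF-8;
# a properly configured locale (e.g. en_US.UTF-8) is left alone.
```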

On 06.01.2017 04:32, Nick Coghlan wrote:
Victor: I think you are taking the UTF-8 idea a bit too far. Nick was trying to address the situation where the locale is set to "C", or rather not set at all (in which case the libc defaults to the "C" locale). The latter is a fairly standard situation when piping data on Unix or when spawning processes which don't inherit the current OS environment.

The problem with the "C" locale is that the encoding defaults to "ASCII" and thus does not allow Python to show its built-in Unicode support. Nick's PEP and the discussion on the ticket http://bugs.python.org/issue28180 are trying to address this particular situation, not enforce any particular encoding overriding the user's configured environment. So I think it would be better if you'd focus your PEP on the same situation: locale set to "C" or not set at all.

BTW: You mention a locale "POSIX" in a few places. I have never seen this used in practice and wonder why we should even consider this in Python as a possible work-around for a particular set of features. The locale setting in your environment does have a lot of influence on your user experience, so forcing people to set a "POSIX" locale doesn't sound like a good idea - if they have to go through the trouble of correctly setting up their environment for Python to run correctly, they would much more likely use the correct setting rather than a generic one like "POSIX", which is defined as an alias for the "C" locale and not as a separate locale: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
... and this is taking the original intent of the ticket a little too far as well :-) The original request was to have the FS encoding default to UTF-8 in case the locale is not set or set to "C", with the reasoning being that this makes it easier to use Python in exactly those situations (see above). Your PEP takes this approach further by fixing the locale setting to "C.UTF-8" in those two cases - intentionally, with all the implications this has on other parts of the C lib. The latter only has an effect on the C lib if the "C.UTF-8" locale is available on the system, which it isn't on many systems, since locales have to be explicitly generated. Without the "C.UTF-8" locale available, your PEP only affects the FS encoding, AFAICT, unless other parts of the application try to interpret the locale env settings as well and use their own logic for the interpretation. For the purpose of experimentation, I would find it better to start with just fixing the FS encoding in 3.7 and leaving the option to adjust the locale setting turned off per default. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 06 2017)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/

2017-01-06 10:50 GMT+01:00 M.-A. Lemburg <mal@egenix.com>:
Victor: I think you are taking the UTF-8 idea a bit too far.
Hum, sorry, the PEP is still a draft, the rationale is far from perfect yet. Let me try to simplify the issue: users are unable to configure a locale for various reasons and expect that Python 3 must "just work", so never fail on encoding or decoding. Do you mean that we shouldn't try to fix this issue? Or that my approach is not the right one?
In the second version of my PEP, Python 3.7 will basically "just work" with the POSIX locale (or C locale if you prefer). This locale enables the UTF-8 mode, which forces UTF-8/surrogateescape, and this error handler prevents the most common encode/decode errors (but not all of them!). When I read the different issues on the bug tracker, I understood that people have different opinions because they have different use cases and so different expectations. I tried to describe a few use cases to help to understand why we don't all share the same expectations: https://www.python.org/dev/peps/pep-0540/#replace-a-word-in-a-text I guess that "piping data on Unix" is represented by my "Replace a word in a text" example, right? It implements the "sed -e s/apple/orange/g" command using Python 3. Classical usage:

    cat input_file | sed -e s/apple/orange/g > output

"UNIX users" don't want Unicode errors here.
I don't think that those are the most annoying issues for users. Users complain because basic functions like (1) "List a directory into stdout" or (2) "List a directory into a text file" fail badly: (1) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout (2) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-a-text-file They don't really care about powerful Unicode features, but are bitten early just when writing data back to the disk, into a pipe, or somewhere else. Python 3.6 tries to be nice with users when *getting* data, and is very pedantic when you try to put the data somewhere. The only exception is that stdout now uses the surrogateescape error handler, but only with the POSIX locale.
I'm not sure that I understood: do you suggest to only modify the behaviour when the POSIX locale is used, but not add any option to ignore the locale and force UTF-8? At least, I would like to get a UTF-8/strict mode which would require an option to enable it. About -X utf8, the idea is to state explicitly that you are sure that all inputs are encoded in UTF-8 and that you request outputs to be encoded in UTF-8. I guess that you are concerned about locales using encodings other than ASCII or UTF-8, like Latin1, ShiftJIS or something else?
Hum, the POSIX locale is the "C" locale in my PEP. I don't require users to force the POSIX locale. I propose to make Python nicer when users already *get* the POSIX locale for various reasons:

* OS not correctly configured
* SSH connection failing to set the locale
* user using LANG=C to get messages in English
* LANG=C used for a bad reason
* program run in an empty environment
* user locale set to a non-existent locale => the libc falls back on POSIX
* etc.

"LANG=C": "LC_ALL=C" is more correct, but it seems like LANG=C is more common than LC_ALL=C or LC_CTYPE=C in the wild.
By ticket, do you mean a Python issue? By the way, I'm aware of these two issues: http://bugs.python.org/issue19846 http://bugs.python.org/issue28180 I'm sure that other issues were opened to request something similar, but they probably got less feedback, and I was too lazy to find them all.
I decided to write the PEP 540 because only few operating systems provide C.UTF-8 or C.utf8. I'm trying to find a solution working on all UNIX and BSD systems. Maybe I'm wrong, and my approach (ignore the locale, rather than really "fixing" the locale) is plain wrong. Again, it's a very hard issue, I don't think that any perfect solution exists. Otherwise, we would already have fixed this issue 8 years ago! It's a matter of compromises and finding a practical design which works for most users.
Sorry, what do you mean by "fixing the FS encoding"? I understand that it's basically my PEP 540 without -X utf8 and PYTHONUTF8, only with the UTF-8 mode enabled for the POSIX locale? By the way, Nick's PEP 538 doesn't mention surrogateescape. IMHO if we override or ignore the locale, it's safer to use surrogateescape. The Use Cases of my PEP 540 should help to understand why. Victor

2017-01-06 10:50 GMT+01:00 M.-A. Lemburg <mal@egenix.com>:
My PEP 540 is different from Nick's PEP 538, even for the POSIX locale. I propose to always use the surrogateescape error handler, whereas Nick wants to keep the strict error handler for inputs and outputs. https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler The surrogateescape error handler is useful to write programs which work as pipes, like the cat, grep, sed, ... UNIX programs: https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipe... You can get the behaviour of Nick's PEP 538 using my UTF-8 Strict mode. Compare the "UTF-8 mode" and "UTF-8 Strict mode" lines in the tables of my use cases. The UTF-8 mode always works, but can produce mojibake, whereas UTF-8 Strict doesn't produce mojibake but can fail depending on the data and the locale. IMHO most users prefer usability ("just works") over correctness (preventing mojibake). So Nick and I don't have exactly the same scope and use cases. Victor

I'm ±0 on surrogateescape by default. I feel +1 for stdout and -1 for stdin. In the output case, surrogateescape is weaker than strict, but it only emits surrogateescaped binary data. If a program carefully uses surrogateescape when decoding, surrogateescape on stdout is safe enough. On the other hand, surrogateescape is very weak for input: it accepts arbitrary bytes, so it should be used carefully. But I agree that different error handlers for stdin and stdout are not beautiful. That's why I'm ±0. FYI, when http://bugs.python.org/issue15216 is merged, we will be able to change the error handler easily: ``sys.stdout.set_encoding(errors='surrogateescape')`` So it's controllable from Python. Some programs which handle filenames may prefer surrogateescape, and some programs like CGI may prefer strict UTF-8, because JSON and HTML5 shouldn't contain arbitrary bytes.
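[Editor's note: the asymmetry INADA describes rests on the surrogateescape round-trip property, which is easy to check:]

```python
# Undecodable bytes become lone surrogates on decode and are restored
# byte-for-byte on encode, so data survives a decode/encode round trip.
data = b"caf\xe9 nonascii\xff"          # not valid UTF-8
text = data.decode("utf-8", "surrogateescape")
assert "\udce9" in text                  # 0xe9 escaped as a lone surrogate
assert text.encode("utf-8", "surrogateescape") == data

# But on input it accepts *anything*, which is why it is "weak" there:
b"\xff\xfe\x00".decode("utf-8", "surrogateescape")  # never raises
```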

It seems to me that having a C locale can mean two things:

1) It really is meant to be ASCII
2) It's mis-configured (or un-configured), meaning the system encoding is unknown.

If (2), then utf-8 is a fine default. If (1), then there are two options:

1) Everything on the system really is ASCII -- in which case, utf-8 would "just work" -- no problem.
2) There are non-ASCII file names, etc. on this supposedly ASCII system. In which case, do folks expect their Python programs to find these issues and raise errors? They may well expect that their Python program will not let them try to save a non-ASCII filename, for instance. But I suspect that they wouldn't want it to raise an obscure encoding error -- but rather would want the app to do something friendly.

So I see no downside to using utf-8 when the C locale is defined. -CHB On Wed, Jan 11, 2017 at 4:23 PM, INADA Naoki <songofacandy@gmail.com> wrote:
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

Chris Barker writes:
Actually, IME, just like you, they expect it to DTRT, which for *them* is to save it in Shift-JIS or Alternativnyj or UTF-totally-whacked, as their other programs do.
So I see no downside to using utf-8 when the C locale is defined.
You don't have much incentive to look for one, and I doubt you have the experience of the edge cases (if you do, please correct me), so that does not surprise me. I'm not saying there are such cases here, I just want a little time to look harder. Steve

On Thu, Jan 12, 2017 at 7:50 AM, Stephen J. Turnbull <turnbull.stephen.fw@u. tsukuba.ac.jp> wrote:
that's correct. I left out a sentence: this is a good time for others with experience with the ugly edge cases to speak up!

The real challenge is that "output" has three (at least :-) ) use cases:

1) Passing on data that came from input from the same system: Victor's "Unix pipe style". In that case, if a supposedly ASCII-based system has non-ASCII data, most users would want it to get passed through unchanged. They are not likely to expect their Python program to enforce their locale (unless it was a program designed to do that, but then it could be explicit about things).

2) The program generating data itself: the mentioned "outputting boxes to the console" example. I think that folks writing these programs should consider whether they really need non-ASCII output, but if they do, I'd imagine most folks would rather see weird characters in the console than have the program crash.

So these both point to UTF-8 (with surrogateescape).

3) A program getting input from a user, or a data file, or ...... (like a filename, etc.). This may be a program intended to be able to handle Unicode filenames, etc. (this is my use case :-) ) -- what should it do when run on an ASCII-only system? This is the tough one: if the C locale indicates "not configured", then users would likely want *something* written to the FS, rather than a program crash, so UTF-8. However, if the system really is properly configured to be ASCII-only, then they may want a program to never write non-ASCII to the filesystem.

Ultimately, though, I think it's up to the application developer, rather than to Python itself (or the sysadmin of the OS it's running on), to know whether the app is supposed to support non-ASCII filenames, etc. That is, one should expect that running a Unicode-aware app on an ASCII-only filesystem is going to lead to problems anyway. So I think the "never crash" option is the better one in this imperfect trade-off.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
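Use case 1 above ("Unix pipe style") relies on a property of the ``surrogateescape`` error handler that is easy to demonstrate. A minimal sketch, with hypothetical byte data not taken from the thread:

```python
# surrogateescape lets arbitrary bytes survive a decode/encode round
# trip unchanged, which is what "pass it through unchanged" requires.
data = b"caf\xe9.txt"  # Latin-1 bytes, invalid as UTF-8

# Strict decoding rejects the non-UTF-8 byte.
try:
    data.decode("utf-8", "strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the bad byte through as a lone surrogate...
text = data.decode("utf-8", "surrogateescape")
# text is 'caf\udce9.txt'

# ...and encoding with the same handler restores the original bytes.
assert not strict_ok
assert text.encode("utf-8", "surrogateescape") == data
```

This round-trip guarantee is why surrogateescape is the handler of choice when a program merely relays operating system data it did not produce.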

2017-01-12 1:23 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
I'm ±0 on surrogateescape by default: I feel +1 for stdout and -1 for stdin.
The use case is to be able to write a Python 3 program which works with UNIX pipes without failing with encoding errors: https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipe...

If you want something stricter, there is the UTF-8 Strict mode which prevents mojibake everywhere. I'm not sure that the UTF-8 Strict mode is really useful. When I implemented it, I quickly understood that using strict *everywhere* is just a dead end: it would fail in too many places. https://www.python.org/dev/peps/pep-0540/#use-the-strict-error-handler-for-o...

I'm not even sure yet that a Python 3 with stdin using strict is "usable".
What do you mean by "carefully use surrogateescaped decode"? The rationale for using surrogateescape on stdout is to support this use case: https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout
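The "list a directory into stdout" case can be sketched without touching a real filesystem. This is an illustrative emulation, not code from the PEP: the filename bytes are hypothetical, and in-memory streams stand in for stdout.

```python
import io

# A filename that is not valid UTF-8, as os.listdir(str) would return
# it after decoding with surrogateescape on POSIX.
raw = b"caf\xe9"
name = raw.decode("utf-8", "surrogateescape")

# Strict stdout: printing the name raises UnicodeEncodeError.
strict_out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="strict")
try:
    strict_out.write(name)
    strict_out.flush()
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape stdout: the original bytes reach the pipe unchanged.
escape_out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="surrogateescape")
escape_out.write(name)
escape_out.flush()

assert strict_failed
assert escape_out.buffer.getvalue() == raw
```

The crash in the strict case is exactly the failure mode the PEP's default is meant to avoid; the surrogateescape case reproduces the behaviour of ``ls`` piped to another process.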
In my experience with the Python bug tracker, almost nobody understands Unicode and locales. For the "Producer-consumer model using pipes" use case, the encoding issues of Python 3.6 can be a blocker. Some developers may prefer a different programming language which doesn't bother them with Unicode: basically *all* other programming languages, no?
But I agree that using different error handlers for stdin and stdout is not beautiful. That's why I'm ±0.
That's why there are two modes: UTF-8 and UTF-8 Strict. But I'm not 100% sure yet which encodings and error handlers should be used ;-) I started to play with my PEP 540 implementation. I already had to update the PEP 540 and its implementation for Windows: on Windows, os.fsdecode/fsencode now use surrogatepass, not surrogateescape (Python 3.5 uses strict on Windows). Victor
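The difference between the two handlers Victor mentions is concrete enough to show in a few lines. A sketch, assuming a lone surrogate such as surrogateescape decoding produces:

```python
# "surrogateescape" maps a lone surrogate back to the single original
# byte it stands for; "surrogatepass" (now used by os.fsencode/fsdecode
# on Windows) instead emits the literal UTF-8 byte sequence of the
# surrogate code point itself.
lone = "\udce9"  # produced by decoding b"\xe9" with surrogateescape

assert lone.encode("utf-8", "surrogateescape") == b"\xe9"
assert lone.encode("utf-8", "surrogatepass") == b"\xed\xb3\xa9"

# The default strict handler refuses to encode lone surrogates at all.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok
```

So surrogateescape is about round-tripping foreign *bytes*, while surrogatepass is about round-tripping Python *strings* that happen to contain surrogates.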

On Fri, Jan 13, 2017 at 12:12 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
I want http://bugs.python.org/issue15216 to be merged in 3.7. It allows applications to select the error handler through a straightforward API. So the question is: which should be the default?

* Programs like `ls` can opt in to surrogateescape.
* Programs that want to output valid UTF-8 can opt out of surrogateescape.

I feel the former is better, with regard to Python's Zen. But it's not a strong opinion.
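The API referred to here (bpo-15216) did land in Python 3.7 as ``io.TextIOWrapper.reconfigure()``. A sketch of the opt-in/opt-out idea, demonstrated on an in-memory wrapper rather than the real ``sys.stdout``:

```python
import io

# A real program would call sys.stdout.reconfigure(...) instead; we use
# a wrapper over BytesIO so the sketch is self-contained.
out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="strict")

# An `ls`-like program opting in, to pass raw filename bytes through:
out.reconfigure(errors="surrogateescape")
assert out.errors == "surrogateescape"

# A program that must only ever emit valid UTF-8 opting back out:
out.reconfigure(errors="strict")
assert out.errors == "strict"
```

Because reconfiguration happens after startup, each application can make this choice itself regardless of which default Python picks.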
An application which is intended to output surrogateescaped data (filenames) should use surrogateescape, surely. But some applications are intended to live in a UTF-8 world. For example, think about an application that reads UTF-8 CSV and inserts it into a database. When a CSV file accidentally encoded in Shift_JIS is passed to stdin, an error is better than silently inserting mojibake into the database.
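The Shift_JIS CSV argument can be made concrete. A sketch with hypothetical sample data: strict UTF-8 decoding catches the wrong encoding early, while surrogateescape lets the mojibake through silently.

```python
# Text that was encoded as Shift_JIS but arrives on a stream the
# program believes is UTF-8.
shift_jis_bytes = "テスト".encode("shift_jis")  # b"\x83e\x83X\x83g"

# Strict mode flags the mistake immediately.
try:
    shift_jis_bytes.decode("utf-8", "strict")
    caught = False
except UnicodeDecodeError:
    caught = True
assert caught

# surrogateescape "succeeds", yielding garbage riddled with lone
# surrogates that would end up in the database unnoticed.
garbage = shift_jis_bytes.decode("utf-8", "surrogateescape")
assert any(0xDC80 <= ord(c) <= 0xDCFF for c in garbage)
```

This is the trade-off in miniature: surrogateescape never crashes, but it also never tells you when the input was simply in the wrong encoding.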
I agree. Some developers prefer other languages (or Python 2) to Python 3, because "Unicode by default doesn't fit POSIX". Both "strict by default" and "weak by default" have downsides.

On Thu, Jan 05, 2017 at 04:38:22PM +0100, Victor Stinner wrote: [...]
PEP 393 is the Flexible String Representation. I think you want PEP 383, Non-decodable Bytes in System Character Interfaces. https://www.python.org/dev/peps/pep-0383/
The problem is that operating system data like filenames are decoded using the ``surrogateescape`` error handler (PEP 393).
s/393/383/ -- Steve
participants (14)

- Barry Warsaw
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- INADA Naoki
- M.-A. Lemburg
- Nick Coghlan
- Oleg Broytman
- Petr Viktorin
- Stephan Houben
- Stephen J. Turnbull
- Steve Dower
- Steven D'Aprano
- Victor Stinner