On 5 June 2014 22:37, Paul Sokolovsky firstname.lastname@example.org wrote:
On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan email@example.com wrote:
problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there).
... And until those golden times come, it would be nice if Python did not force its perfect-world model, which unfortunately is not grounded in the surrounding reality, and instead let users solve their encoding problems themselves - when they actually need to, because, again, one can go quite a long way without dealing with encodings at all. Whereas now Python3 forces users to deal with encodings almost universally, while imposing one particular model on all strings (which, again, doesn't correspond to the state of the surrounding reality). I can already hear the response that it's good that users are taught to deal with encodings, that it will make them write correct programs - but that's a bit far away from the original aim of making it easy and pleasant to write "correct" programs. (And definitions of "correct" vary.)
As I've said before in other contexts: find me Windows, Mac OS X, and JVM developers, or educators and scientists, who are as concerned about the text model changes as folks primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.
Windows, Mac OS X, and the JVM are all opinionated about the text encodings to be used at platform boundaries (UTF-16, UTF-8, and UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX) says "well, it's configurable, but we won't provide a reliable mechanism for finding out what the encoding is. So either guess as best you can based on the info the OS *does* provide; assume UTF-8; assume 'some ASCII compatible encoding'; or don't do anything that requires knowing the encoding of the data being exchanged with the OS - like, say, displaying file names to users, or accepting arbitrary text as input, transforming it in a content-aware fashion, and echoing it back in a console application".
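The "guess as best you can" option is essentially what CPython does on POSIX systems: it asks the C library which codeset the active locale implies and adopts that as the locale encoding. A minimal sketch of that lookup (POSIX-only, since locale.nl_langinfo isn't available on Windows):

```python
import codecs
import locale

# Adopt whatever locale the environment (LANG / LC_*) configures;
# fall back to the default "C" locale if the setting is unsupported.
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass

# Ask the C library which codeset the active locale implies --
# this is the "guess based on the info the OS does provide" option.
codeset = locale.nl_langinfo(locale.CODESET)

# CPython's own notion of the locale encoding is derived the same way.
preferred = locale.getpreferredencoding(False)

print(codeset, preferred)
```

Note that this only reports what the locale machinery *claims* - it's exactly the mechanism that goes wrong when the environment carries stale or missing settings.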
None of those options is a perfectly good choice. 6(ish) years ago, we chose the first option, because it has the best chance of working properly on Linux systems that use ASCII incompatible encodings like Shift-JIS, ISO-2022, and various other East Asian codecs. For normal user space programming, Linux is pretty reliable when it comes to ensuring the locale encoding is set to something sensible, but the price we currently pay for that decision is interoperability issues with things like daemons not receiving any configuration settings and hence falling back to the POSIX locale, and ssh environment forwarding carrying a client's encoding settings over to a session on a server with different settings. I still consider it preferable to impose inconveniences like that based on use case (situations where Linux systems don't provide sensible encoding settings) rather than on geographic region (locales where ASCII incompatible encodings are likely to still be in common use).
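The daemon failure mode described above is easy to reproduce: launch a child interpreter with an environment that carries no usable locale settings and see which encoding it falls back to. A sketch (the PYTHONCOERCECLOCALE/PYTHONUTF8 settings disable the locale-coercion and UTF-8-mode escape hatches later added by PEPs 538 and 540, which postdate this thread):

```python
import codecs
import subprocess
import sys

# Simulate a daemon (or a mis-forwarded ssh session) that received no
# locale configuration: force the POSIX ("C") locale and switch off the
# newer coercion/UTF-8-mode mechanisms so the raw fallback is visible.
child_env = {"LC_ALL": "C", "PYTHONCOERCECLOCALE": "0", "PYTHONUTF8": "0"}

fallback = subprocess.run(
    [sys.executable, "-c",
     "import locale; print(locale.getpreferredencoding())"],
    env=child_env, capture_output=True, text=True, check=True,
).stdout.strip()

# On glibc systems this is typically "ANSI_X3.4-1968", i.e. plain ASCII.
print(fallback)
```

Any non-ASCII file name or text stream the child then touches is interpreted through that fallback encoding, which is where the interoperability issues come from.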
If I (or someone else) ever finds the time to implement PEP 432 (or something like it) to address some of the limitations of the interpreter startup sequence that currently make it difficult to avoid relying on the POSIX locale encoding on Linux, then we'll be in a position to reassess that decision based on the increased adoption of UTF-8 by Linux distributions in recent years. As the major community Linux distributions complete the migration of their system utilities to Python 3, we'll get to see whether they decide it's better to make their locale settings more reliable, or to help make it easier for Python 3 to ignore them when they're wrong.