[Python-Dev] Bytes path support

Nick Coghlan ncoghlan at gmail.com
Sun Aug 24 05:27:55 CEST 2014

On 24 August 2014 04:37, Oleg Broytman <phd at phdru.name> wrote:
> On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore at gmail.com> wrote:
>> Generally, it seems to be mostly a reaction to the repeated claims
>> that Python, or Windows, or whatever, is "broken".
>    Ah, if that's the only problem I certainly can live with that. My
> problem is that it *seems* this anti-Unix attitude infiltrates Python
> core development. I very much hope I'm wrong and it really isn't.

The POSIX locale based approach to handling encodings is genuinely
broken - it's almost as broken as code pages are on Windows. The
fundamental flaw is that locales encourage *bilingual* computing:
handling English plus one other language correctly. Given a global
internet, bilingual computing *is a fundamentally broken approach*. We
need multilingual computing (any human language, all the time), and
that means Unicode.

As some examples of where bilingual computing breaks down:

* My NFS client and server may have different locale settings
* My FTP client and server may have different locale settings
* My SSH client and server may have different locale settings
* I save a file locally and send it to someone with a different locale setting
* I attempt to access a Windows share from a Linux client (or vice-versa)
* I clone my POSIX hosted git or Mercurial repository on a Windows client
* I have to connect my Linux client to a Windows Active Directory
domain (or vice-versa)
* I have to interoperate between native code and JVM code
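
To make the failure mode concrete, here is a minimal sketch of two
machines disagreeing about the encoding of the same bytes. The file
name and the choice of UTF-8 versus Latin-1 are assumptions made
purely for illustration:

```python
# Hypothetical file name ("r\u00e9sum\u00e9.txt" is "resume" with two e-acutes).
name = "r\u00e9sum\u00e9.txt"

# The "server" stores the name as bytes in its locale encoding (UTF-8 here).
wire_bytes = name.encode("utf-8")

# A "client" whose locale claims Latin-1 decodes without any error,
# but silently ends up with mojibake instead of the original name.
seen_by_latin1_client = wire_bytes.decode("latin-1")

# The reverse mismatch (Latin-1 bytes read as UTF-8) fails loudly instead.
latin1_bytes = name.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    utf8_decode_failed = False
except UnicodeDecodeError:
    utf8_decode_failed = True

print(repr(seen_by_latin1_client))
print(utf8_decode_failed)
```

Neither machine is misbehaving by its own rules; the bytes simply
carry no record of which encoding produced them.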

The entire computing industry is currently struggling with this
monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
encoding/code pages) -> multilingual (Unicode) transition. It's been
going on for decades, and it's still going to be quite some time
before we're done.

The POSIX world is slowly clawing its way towards a multilingual model
that actually works: UTF-8.

Windows (including the CLR) and the JVM adopted a different
multilingual model, but still one that actually works: UTF-16-LE.

POSIX is hampered by legacy ASCII defaults in various subsystems (most
notably the default locale) and the assumption that system metadata is
"just bytes" (an assumption that breaks down as soon as you have to
hand that metadata over to another machine that may have different
locale settings).

Windows is hampered by the fact they kept the old 8-bit APIs around
for backwards compatibility purposes, so applications using those APIs
are still only bilingual (at best) rather than multilingual.

JVM and CLR applications will at least handle the Basic Multilingual
Plane (UCS-2) correctly, but may not correctly handle code points
beyond the 16-bit boundary (this is the "Python narrow builds don't
handle Unicode correctly" problem that was resolved for Python 3.3+ by
PEP 393).
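
As a small sketch of that 16-bit boundary (assuming Python 3.3 or
later, and using an arbitrarily chosen code point outside the Basic
Multilingual Plane), one code point needs two UTF-16 code units:

```python
import struct

ch = "\U0001F600"  # arbitrary code point above U+FFFF

# Python 3.3+ counts code points, so this is a single character.
code_points = len(ch)

# UTF-16-LE needs a surrogate pair: two 16-bit code units.
encoded = ch.encode("utf-16-le")
code_units = struct.unpack("<2H", encoded)

print(code_points)
print([hex(unit) for unit in code_units])
```

Code that treats each 16-bit unit as a character will see two
"characters" here, which is where slicing and length calculations
go wrong.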

Individual users (including some organisations) may have the luxury of
saying "well, all my clients and all my servers are POSIX, so I don't
care about interoperability with other platforms". As the providers of
a cross-platform runtime environment, we don't have that luxury - we
need to figure out how to get *all* the major platforms playing nice
with each other, regardless of whether they chose UTF-8 or UTF-16-LE
as the basis for their approach towards providing multilingual
computing environments.

Historically, that question of cross platform interoperability for
open source software has been handled in a few different ways:

* Don't really interoperate with anybody, reinvent all the wheels (the JVM way)
* Emulate POSIX on Windows (the Cygwin/MinGW way)
* Let the application developer figure it out (the Python 2 way)

The first approach is inordinately expensive - it took the resources
of Sun in its heyday to make it possible, and it effectively locks the
JVM out of certain kinds of computing (e.g. it's hard to do array
oriented programming in JVM languages, because the CPU and GPU
vectorisation features aren't readily accessible).

The second approach prevents the creation of truly native Windows
applications, which makes it uncompelling as a way of attracting
Windows users - it sends a clear signal that the project doesn't
*really* care about supporting Windows as a platform, but instead only
grudgingly accepts that there are Windows users out there that might
like to use their software.

The third approach is the one we tried for a long time with Python 2,
and essentially found to be an "experts only" solution. Yes, you can
*make* it work, but the runtime isn't set up so it works *by default*.

The Unicode changes in Python 3 are a result of the Python core
development team saying "it really shouldn't be this hard for
application developers to get cross-platform interoperability between
correctly configured systems when dealing solely with correctly
encoded data and metadata". The idea of Python 3 is that applications
should require additional complexity solely to deal with *incorrectly*
configured systems and improperly encoded data and metadata (and,
ideally, the detection of the need for such handling should be "Python
3 threw an exception" rather than "something further down the line
detected corrupted data").
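
A minimal sketch of what "threw an exception" looks like in practice.
The helper name read_name and the sample bytes are hypothetical; the
point is only that strict decoding reports bad input at the boundary:

```python
def read_name(raw: bytes) -> str:
    """Decode metadata that is supposed to be UTF-8 (hypothetical helper)."""
    # bytes.decode() uses errors="strict" by default.
    return raw.decode("utf-8")


good = b"caf\xc3\xa9"  # valid UTF-8
bad = b"caf\xe9"       # Latin-1 bytes, not valid UTF-8

decoded = read_name(good)

try:
    read_name(bad)
    error_position = None
except UnicodeDecodeError as exc:
    # The exception records where the offending byte sits.
    error_position = exc.start

print(decoded)
print(error_position)
```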

This is software rather than magic, though - these improvements only
happen through people actually knuckling down and solving the related
problems. When folks complain about Python 3's operating system
interface handling causing problems in some situations? They're almost
always referring to areas where we're still relying on the locale
system on POSIX or the code page system on Windows. Both of those
approaches are irredeemably broken - the answer is to stop relying on
them, but appropriately updating the affected subsystems generally
isn't a trivial task. A lot of the affected code runs before the
interpreter is fully initialised, which makes it really hard to test,
and a lot of it is incredibly convoluted due to various configuration
options and platform specific details, which makes it incredibly hard
to modify without breaking anything.

One of those areas is the fact that we still use the old 8-bit APIs to
interact with the Windows console. Those are just as broken in a
multilingual world as the other Windows 8-bit APIs, so Drekin came up
with a project to expose the Windows console as a UTF-16-LE stream
that uses the 16-bit APIs instead:

I personally hope we'll be able to get the issues Drekin references
there resolved for Python 3.5 - if other folks hope for the same
thing, then one of the best ways to help that happen is to try out the
win_unicode_console module and provide feedback on what does and
doesn't work.

Another was getting exceptions when attempting to write OS data to
sys.stdout after the locale settings had been scrubbed from the
environment. For Python 3.5, we better tolerate that situation by
setting "errors=surrogateescape" on sys.stdout when the environment
claims "ascii" as a suitable encoding for talking to the operating
system (this is our way of saying "we don't actually believe you, but
also don't have the data we need to overrule you completely").
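
For anyone who hasn't used the surrogateescape error handler, here is
a rough sketch of what it does. The sample bytes are made up, and the
io.TextIOWrapper below is only a stand-in for a stream configured the
way described above, not the interpreter's actual startup code:

```python
import io

raw = b"caf\xe9"  # made-up OS data: not valid ASCII

# Undecodable bytes are mapped to lone surrogates (U+DC80..U+DCFF).
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")

# A strict encoder refuses the escaped byte rather than guessing.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Stand-in for an output stream using the same error handler.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="ascii",
                          errors="surrogateescape")
stream.write(text)
stream.flush()
written = buffer.getvalue()

print(ascii(text))
print(round_tripped == raw, strict_encode_failed, written == raw)
```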

While I was going to wait for more feedback from Fedora folks before
pushing the idea again, this thread also makes me think it would be
worth our while to add more tools for dealing with surrogate escapes
and latin-1 binary data smuggling just to help make those techniques
more discoverable and accessible:
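
Separately from whatever tools end up being added, the latin-1
technique itself can be sketched in a few lines. This is only an
illustration (the header bytes are made up): every byte value maps to
the code point with the same number, so arbitrary binary data survives
a trip through str and back.

```python
# Every possible byte value round-trips through latin-1 unchanged.
all_bytes = bytes(range(256))
smuggled = all_bytes.decode("latin-1")
restored = smuggled.encode("latin-1")

# Made-up example: use str methods on data with an unknown encoding.
header = b"Name: \xff\xfe"
key, _, value = header.decode("latin-1").partition(": ")

# Convert back to bytes before handing the value to anything else.
value_bytes = value.encode("latin-1")

print(restored == all_bytes)
print(key, value_bytes)
```

The resulting str is not meaningful text, so it has to be encoded
back with latin-1 before it leaves the code that did the smuggling.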

These various discussions are also giving me plenty of motivation to
get back to working on PEP 432 (the rewrite of the interpreter startup
sequence) for Python 3.5. A lot of these things are just plain hard to
change because of the complexity of the current startup code.
Redesigning that to use a cleaner, multiphase startup sequence that
gets the core interpreter running *before* configuring the operating
system integration should give us several more options when it comes
to dealing with some of these challenges.


Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
