[Python-Dev] Bytes path support

Mon Aug 25 12:15:31 CEST 2014

Hi! Thank you very much, Nick, for the long and detailed explanation!

On Sun, Aug 24, 2014 at 01:27:55PM +1000, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 24 August 2014 04:37, Oleg Broytman <phd at phdru.name> wrote:
> > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore at gmail.com> wrote:
> >> Generally, it seems to be mostly a reaction to the repeated claims
> >> that Python, or Windows, or whatever, is "broken".
> >
> >    Ah, if that's the only problem I certainly can live with that. My
> > problem is that it *seems* this anti-Unix attitude infiltrates Python
> > core development. I very much hope I'm wrong and it really isn't.
> 
> The POSIX locale based approach to handling encodings is genuinely
> broken - it's almost as broken as code pages are on Windows. The
> fundamental flaw is that locales encourage *bilingual* computing:
> handling English plus one other language correctly. Given a global
> internet, bilingual computing *is a fundamentally broken approach*. We
> need multilingual computing (any human language, all the time), and
> that means Unicode.
> 
> As some examples of where bilingual computing breaks down:
> 
> * My NFS client and server may have different locale settings
> * My FTP client and server may have different locale settings
> * My SSH client and server may have different locale settings
> * I save a file locally and send it to someone with a different locale setting
> * I attempt to access a Windows share from a Linux client (or vice-versa)
> * I clone my POSIX hosted git or Mercurial repository on a Windows client
> * I have to connect my Linux client to a Windows Active Directory
>   domain (or vice-versa)
> * I have to interoperate between native code and JVM code
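
A minimal sketch of the failure mode behind the examples listed above: the same bytes read under two different locale encodings. The file name and the choice of KOI8-R and Latin-1 are illustrative assumptions, not something taken from the thread.

```python
# Hypothetical file name ("файл.txt"), written with escapes so the
# snippet does not depend on the source file's own encoding.
name = "\u0444\u0430\u0439\u043b.txt"

# Machine A has a KOI8-R locale and stores the name as these bytes.
raw = name.encode("koi8-r")

# Machine B assumes Latin-1: decoding "succeeds", but the text is wrong.
misread = raw.decode("latin-1")

# Machine C assumes UTF-8: here the same bytes are not valid UTF-8,
# so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# If both ends agree on UTF-8, the name round-trips unchanged.
roundtrip = name.encode("utf-8").decode("utf-8")
```

Nothing in the bytes themselves says which encoding produced them, so neither receiving machine can recover the original name without out-of-band knowledge.
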
> 
> The entire computing industry is currently struggling with this
> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
> encoding/code pages) -> multilingual (Unicode) transition. It's been
> going on for decades, and it's still going to be quite some time
> before we're done.
> 
> The POSIX world is slowly clawing its way towards a multilingual model
> that actually works: UTF-8.
> 
> Windows (including the CLR) and the JVM adopted a different
> multilingual model, but still one that actually works: UTF-16-LE.
> 
> POSIX is hampered by legacy ASCII defaults in various subsystems (most
> notably the default locale) and the assumption that system metadata is
> "just bytes" (an assumption that breaks down as soon as you have to
> hand that metadata over to another machine that may have different
> locale settings).
> 
> Windows is hampered by the fact they kept the old 8-bit APIs around
> for backwards compatibility purposes, so applications using those APIs
> are still only bilingual (at best) rather than multilingual.
> 
> JVM and CLR applications will at least handle the Basic Multilingual
> Plane (UCS-2) correctly, but may not correctly handle code points
> beyond the 16-bit boundary (this is the "Python narrow builds don't
> handle Unicode correctly" problem that was resolved for Python 3.3+ by
> PEP 393).
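
A small sketch of the "beyond the 16-bit boundary" issue, assuming Python 3.3 or later. The sample characters are arbitrary choices for illustration.

```python
import struct

bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside the BMP

# In UTF-16 a BMP character is one 16-bit code unit, while a code point
# above U+FFFF needs two (a surrogate pair).
bmp_units = len(bmp_char.encode("utf-16-le")) // 2
astral_units = len(astral_char.encode("utf-16-le")) // 2

# The two code units that make up the surrogate pair.
high, low = struct.unpack("<2H", astral_char.encode("utf-16-le"))

# With PEP 393 (Python 3.3+), str is indexed by code point, so the
# length is 1 regardless of how many UTF-16 code units are involved.
astral_len = len(astral_char)
```

Code that counts or slices by 16-bit units sees two items for the second character, which is where runtimes built around UTF-16 code units can go wrong.
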
> 
> Individual users (including some organisations) may have the luxury of
> saying "well, all my clients and all my servers are POSIX, so I don't
> care about interoperability with other platforms". As the providers of
> a cross-platform runtime environment, we don't have that luxury - we
> need to figure out how to get *all* the major platforms playing nice
> with each other, regardless of whether they chose UTF-8 or UTF-16-LE
> as the basis for their approach towards providing multilingual
> computing environments.
> 
> Historically, that question of cross platform interoperability for
> open source software has been handled in a few different ways:
> 
> * Don't really interoperate with anybody, reinvent all the wheels (the JVM way)
> * Emulate POSIX on Windows (the Cygwin/MinGW way)
> * Let the application developer figure it out (the Python 2 way)
> 
> The first approach is inordinately expensive - it took the resources
> of Sun in its heyday to make it possible, and it effectively locks the
> JVM out of certain kinds of computing (e.g. it's hard to do array
> oriented programming in JVM languages, because the CPU and GPU
> vectorisation features aren't readily accessible).
> 
> The second approach prevents the creation of truly native Windows
> applications, which makes it uncompelling as a way of attracting
> Windows users - it sends a clear signal that the project doesn't
> *really* care about supporting Windows as a platform, but instead only
> grudgingly accepts that there are Windows users out there that might
> like to use their software.
> 
> The third approach is the one we tried for a long time with Python 2,
> and essentially found to be an "experts only" solution. Yes, you can
> *make* it work, but the runtime isn't set up so it works *by default*.
> 
> The Unicode changes in Python 3 are a result of the Python core
> development team saying "it really shouldn't be this hard for
> application developers to get cross-platform interoperability between
> correctly configured systems when dealing solely with correctly
> encoded data and metadata". The idea of Python 3 is that applications
> should require additional complexity solely to deal with *incorrectly*
> configured systems and improperly encoded data and metadata (and,
> ideally, the detection of the need for such handling should be "Python
> 3 threw an exception" rather than "something further down the line
> detected corrupted data").
> 
> This is software rather than magic, though - these improvements only
> happen through people actually knuckling down and solving the related
> problems. When folks complain about Python 3's operating system
> interface handling causing problems in some situations? They're almost
> always referring to areas where we're still relying on the locale
> system on POSIX or the code page system on Windows. Both of those
> approaches are irredeemably broken - the answer is to stop relying on
> them, but appropriately updating the affected subsystems generally
> isn't a trivial task. A lot of the affected code runs before the
> interpreter is fully initialised, which makes it really hard to test,
> and a lot of it is incredibly convoluted due to various configuration
> options and platform specific details, which makes it incredibly hard
> to modify without breaking anything.
> 
> One of those areas is the fact that we still use the old 8-bit APIs to
> interact with the Windows console. Those are just as broken in a
> multilingual world as the other Windows 8-bit APIs, so Drekin came up
> with a project to expose the Windows console as a UTF-16-LE stream
> that uses the 16-bit APIs instead:
> https://pypi.python.org/pypi/win_unicode_console
> 
> I personally hope we'll be able to get the issues Drekin references
> there resolved for Python 3.5 - if other folks hope for the same
> thing, then one of the best ways to help that happen is to try out the
> win_unicode_console module and provide feedback on what does and
> doesn't work.
> 
> Another was getting exceptions attempting to write OS data to
> sys.stdout when the locale settings had been scrubbed from the
> environment. For Python 3.5, we better tolerate that situation by
> setting "errors=surrogateescape" on sys.stdout when the environment
> claims "ascii" as a suitable encoding for talking to the operating
> system (this is our way of saying "we don't actually believe you, but
> also don't have the data we need to overrule you completely").
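
A rough illustration of what the surrogateescape error handler does, using an in-memory stream as a stand-in for sys.stdout. The byte string is a made-up example, and this is only a sketch of the mechanism, not the interpreter's actual startup code.

```python
import io

# Hypothetical OS data: a Latin-1 encoded name, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding with surrogateescape maps the undecodable byte 0xE9 to the
# lone surrogate U+DCE9 instead of raising an exception.
text = raw.decode("utf-8", errors="surrogateescape")

# A stream that claims ASCII but uses surrogateescape can write the
# text back out, restoring the original byte.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="ascii", errors="surrogateescape")
out.write(text)
out.flush()
written = buf.getvalue()

# The same write with the default strict handler fails.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The escaped data passes through unchanged, but only because the same handler is used on the way in and on the way out.
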
> 
> While I was going to wait for more feedback from Fedora folks before
> pushing the idea again, this thread also makes me think it would be
> worth our while to add more tools for dealing with surrogate escapes
> and latin-1 binary data smuggling just to help make those techniques
> more discoverable and accessible:
> http://bugs.python.org/issue18814#msg225791
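
For anyone unfamiliar with the latin-1 technique mentioned above, here is a minimal sketch of it as I understand it. The header line is a hypothetical example, and none of this is the API proposed in the linked issue.

```python
# Latin-1 maps every byte value 0-255 to the code point with the same
# number, so any byte string survives a decode/encode round trip.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Hypothetical use: edit the ASCII part of a line with str methods while
# carrying the non-ASCII payload (UTF-8 bytes here) through untouched.
header = b"Subject: caf\xc3\xa9\r\n"
as_text = header.decode("latin-1")
key, sep, value = as_text.partition(": ")
rebuilt = (key.upper() + sep + value).encode("latin-1")
```

The intermediate string is not meaningful text outside its ASCII parts, so it should be encoded back to bytes before being shown to a user or passed to code that expects real Unicode.
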
> 
> These various discussions are also giving me plenty of motivation to
> get back to working on PEP 432 (the rewrite of the interpreter startup
> sequence) for Python 3.5. A lot of these things are just plain hard to
> change because of the complexity of the current startup code.
> Redesigning that to use a cleaner, multiphase startup sequence that
> gets the core interpreter running *before* configuring the operating
> system integration should give us several more options when it comes
> to dealing with some of these challenges.
> 
> Regards,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            phd at phdru.name
           Programmers don't die, they just GOSUB without RETURN.