[Python-Dev] File system path encoding on Windows

Mon Aug 22 11:58:31 EDT 2016

On 22Aug2016 0247, Stephen J. Turnbull wrote:
> Nick Coghlan writes:
>  > On 21 August 2016 at 06:31, Steve Dower <steve.dower at python.org> wrote:
>
>  > > My biggest concern is that it then falls onto users to know how
>  > > to start Python with that flag.
>
> The users I'm most worried about belong to organizations where
> concerted effort has been made to "purify" the environment so that
> they *can* use bytes-oriented code.  That is, getfilesystemencoding()
> == getpreferredencoding() == what is actually used throughout the
> system.  Such organizations will be able to choose the flag correctly,
> and implement it organization-wide, I'm pretty sure.  I doubt that all
> will choose UTF-8 at this point in time, though I wish they would.

I think that these are also the people who are likely to read a PEP and 
enable an environment variable to preserve the current behaviour. I'm 
more concerned about uncontrolled environments where a library breaks on
a random user's machine because that user downloaded a file from a
foreign website.

I don't recall whether I mentioned an environment variable (i.e. 
PYTHONUSELEGACYENCODING or similar) to switch back to mbcs:ignore by 
default, but it was always my intent to have one.
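
As a rough illustration of the opt-out described above, here is a minimal
sketch of how a startup check might pick the file system encoding. The
variable name PYTHONUSELEGACYENCODING is only the placeholder from this
message ("or similar"), the helper choose_fs_encoding is hypothetical, and
the "surrogatepass" error handler for the UTF-8 case is an assumption that
this message does not state.

```python
import os


def choose_fs_encoding(environ=None):
    """Return an (encoding, errors) pair for encoding file system paths.

    Hypothetical helper: the environment variable name is the placeholder
    used in this thread, not a confirmed interpreter option.
    """
    if environ is None:
        environ = os.environ
    # Any non-empty value opts back in to the legacy behaviour.
    if environ.get("PYTHONUSELEGACYENCODING"):
        # "mbcs:ignore" is the legacy default named in this message.
        return ("mbcs", "ignore")
    # Assumed new default; the error handler is a guess for illustration.
    return ("utf-8", "surrogatepass")


if __name__ == "__main__":
    print(choose_fs_encoding())
```

Passing a plain dict instead of os.environ makes the decision easy to
exercise without touching the real environment.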

> Python itself is already ready for UTF-8, except that on Windows
> getfilesystemencoding and getpreferredencoding can't honestly return
> 'utf-8', AIUI.  I understand that that is exactly what Steve wants to
> change, but "honestly" is the rub.  What happens if Python 3.6 is only
> part of a bytes-oriented system, receives a filename forced to UTF-8-
> encoded bytes, and passes that over a pipe or in shared memory or in a
> file to a non-Python-3.6 application that trusts the system defaults?
> "Boom!", no?  Is there any experience anywhere in any implementation
> language with systems used on Windows that use this approach of
> pretending the Windows world is UTF-8?  If not, why is it a good idea
> for Python to go first?

The Windows world is Unicode. Mostly represented in UTF-16, but UTF-8 is 
entirely equivalent.
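
To make the equivalence claim concrete, the sketch below round-trips one
file name through UTF-16, UTF-8 and a legacy code page. The file name and
the roundtrip helper are made up for illustration, and cp1252 stands in for
"the active code page" because the mbcs codec only exists on Windows.

```python
def roundtrip(name, encoding, errors="strict"):
    """Encode a str path to bytes and decode it back with the same codec."""
    return name.encode(encoding, errors).decode(encoding, errors)


# Hypothetical file name mixing Latin-1 and Japanese characters.
name = "r\u00e9sum\u00e9_\u65e5\u672c\u8a9e.txt"

# Both Unicode encodings can represent every character, so nothing is lost.
assert roundtrip(name, "utf-16-le") == name
assert roundtrip(name, "utf-8") == name

# A code page with errors="ignore" silently drops what it cannot represent,
# so the bytes no longer identify the original file.
lossy = roundtrip(name, "cp1252", "ignore")
assert lossy != name
```

The two Unicode round trips are lossless, while the code page round trip
returns a different (shorter) name, which is the failure mode for a file
downloaded from a foreign website.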

All MSVC users have been pushed towards Unicode for many years. The .NET
Framework has defaulted to UTF-8 for its entire existence. The use of code
pages has been discouraged for decades. We're not going first :)

>  > > On the other hand, having code opt-in or out of the new handling
>  > > requires changing code (which is presumably not going to happen,
>  > > or we wouldn't consider keeping the old behaviour and/or letting
>  > > the user control it),
>
> I don't understand why this argument doesn't cut both ways equally.
> If you believe that, you should also believe that the same people who
> won't change code to opt in also won't use a Python containing fix #1,
> and may not install it at all.  Doesn't that matter?

People already do this (e.g. Python 2.7). I don't think it should matter 
enough to prevent us from making changes in new versions of Python. 
Otherwise, why would we ever release new versions?

So I guess the question here is: for organisations that have already
(incorrectly) assumed that the file system encoding and the active code
page are always the same, have built solid infrastructure around this
using bytes (including ensuring that their systems never encounter
external paths in glob/listdir/etc.), and are currently using 3.5 and
want to migrate to 3.6, is an environment variable to change back to
mbcs sufficient to meet their needs?

Cheers,
Steve