[Python-Dev] File system path encoding on Windows

Steve Dower steve.dower at python.org
Mon Aug 29 12:30:36 EDT 2016


On 28Aug2016 2043, Stephen J. Turnbull wrote:
> tritium-list at sdamon.com writes:
>
>  > Once you get to var lengths like that, arcane single character flags start
>  > looking preferable.  How about "PYTHONWINLEGACY" to just turn it all on or
>  > off.  If the code breaks on one thing, it obviously isn't written to use the
>  > other two, so might as well shut them all off.
>
> Since Steve is thinking about three separate PEPs (among other things,
> they might be implemented on different timelines), that's not really
> possible (placing the features under control of one switch at
> different times would be an unacceptable compatibility break).

Yeah, the likelihood of different timelines basically means three PEPs 
are going to be necessary. But I think we can have a single 
"PYTHONWINDOWSANSI" (or ...MBCS) flag to cover all three whenever they 
come in without it being a compatibility break, especially if (as Nick 
suggested) there are _PYTHONWINDOWSANSI(CONSOLE|PATH|LOCALE) flags too. 
That also gives us the ability to say "all ANSI or all UTF-8 are
supported; mix-and-match at your own risk".

> Anyway, it's not *obvious* that your premise is true, because code
> isn't written to do any of those things.  It's written to process
> bytes agnostically.  The question is what does the environment look
> like.  Steve obviously has a perspective on environment which suggests
> that these aspects are often decoupled because in Windows the actual
> filesystem is never bytes-oriented.  I don't know if it's possible to
> construct a coherent environment where these aspects are decoupled,
> but I can't say it's impossible, either.

Actually, the three items are basically completely decoupled, though it 
isn't obvious.

* stdin/stdout/stderr are text wrappers by default (under my changes, 
using the console encoding when it's a console and the locale encoding 
when it's a file/pipe). There's no point reading bytes from the console, 
and redirected files or pipes are unaffected by the change.
* the file system encoding only affects paths passed into/returned from 
the OS as bytes, and...
* the locale encoding affects files opened in text mode, which means...
* if you open('rb') and read paths, the locale encoding has no effect on 
whether the bytes are the right encoding to be used as paths
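
To make the decoupling concrete, here is a minimal sketch (not part of the
proposed changes) that reports the three encodings separately. It only uses
existing calls: sys.stdin.encoding, sys.getfilesystemencoding() and
locale.getpreferredencoding(False). The function name report_encodings and
the dictionary keys are made up for illustration, and the values you see
will depend on the platform and Python version.

```python
import locale
import sys


def report_encodings():
    """Return the three independently chosen encodings as a dict.

    The keys are illustrative labels, not official names.
    """
    # sys.stdin may be None (e.g. under pythonw.exe), so guard the lookup.
    stdio_encoding = getattr(sys.stdin, "encoding", None)
    return {
        # Used by the text wrappers around stdin/stdout/stderr.
        "stdio": stdio_encoding,
        # Used when bytes paths are passed to or returned from the OS.
        "filesystem": sys.getfilesystemencoding(),
        # Default for files opened in text mode without encoding=...
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for label, value in report_encodings().items():
        print(f"{label}: {value}")
```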

So while there are scenarios that use multiple pieces of this, there 
should only be one change impacting any scenario:
* reading str paths from a file - locale encoding
* reading bytes paths from a file - filesystem encoding
* reading str paths from a pipe/redirected file - locale encoding
* reading bytes paths from a pipe/redirected file - filesystem encoding
* reading str paths from the console - console encoding
* reading bytes paths from the console (i.e. 
sys.stdin.buffer.raw.read()) - filesystem encoding

The last case doesn't make sense anyway right now, as
sys.stdin.buffer.raw has no specified encoding and you can't reliably
read paths from it. Perhaps there are examples where this is put to
good use (bearing in mind it must be an actual console, not a
redirection or pipe). I would love to hear about them.
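
For the two file-based scenarios in the list above, a rough sketch of the
difference might look like this. The helper names (read_paths_as_str,
read_paths_as_bytes, demo) and the file name paths.txt are hypothetical. The
sample path is plain ASCII, so both routes agree here; with non-ASCII names
the str route depends on the locale encoding and the bytes route on the
filesystem encoding.

```python
import os
import tempfile


def read_paths_as_str(list_file):
    """Text mode, no explicit encoding: lines are decoded with the
    locale encoding, so only a locale change would affect this."""
    with open(list_file, "r") as f:
        return [line.rstrip("\n") for line in f]


def read_paths_as_bytes(list_file):
    """Binary mode: no decoding happens on read. The filesystem encoding
    only matters later, when the bytes are handed to the OS as a path."""
    with open(list_file, "rb") as f:
        return [line.rstrip(b"\r\n") for line in f]


def demo():
    """Write one ASCII path to a temporary list file and read it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        list_file = os.path.join(tmp, "paths.txt")  # hypothetical file name
        with open(list_file, "wb") as f:
            f.write(b"example.txt\n")
        return read_paths_as_str(list_file), read_paths_as_bytes(list_file)


if __name__ == "__main__":
    str_paths, bytes_paths = demo()
    print(str_paths, bytes_paths)
    # os.fsdecode() applies the filesystem encoding to the bytes paths.
    print([os.fsdecode(p) for p in bytes_paths])
```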

As far as I can tell, any other combination requires the Python
developer to convert between str and bytes themselves. That may lead to
errors if they have assumed that the encoding of the bytes would never
change, but code that ignores encodings and uses bytes or str
exclusively is only going to encounter one (bytes) or two (str) of the
changes.
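
Where a conversion is unavoidable, one way to avoid baking in an assumption
about the encoding is to go through os.fsencode() and os.fsdecode(), which
use whatever the filesystem encoding currently is rather than a hard-coded
codec name. A small sketch; roundtrip is an illustrative helper name, not
an existing API:

```python
import os


def roundtrip(path):
    """Convert a str path to bytes and back via the filesystem encoding."""
    as_bytes = os.fsencode(path)    # str -> bytes
    as_str = os.fsdecode(as_bytes)  # bytes -> str, same encoding
    return as_bytes, as_str


if __name__ == "__main__":
    print(roundtrip(os.path.join("data", "report.txt")))
```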

Cheers,
Steve


