[Python-Dev] File system path encoding on Windows
Steve Dower
steve.dower at python.org
Mon Aug 29 12:30:36 EDT 2016
On 28Aug2016 2043, Stephen J. Turnbull wrote:
> tritium-list at sdamon.com writes:
>
> > Once you get to var lengths like that, arcane single character flags start
> > looking preferable. How about "PYTHONWINLEGACY" to just turn it all on or
> > off. If the code breaks on one thing, it obviously isn't written to use the
> > other two, so might as well shut them all off.
>
> Since Steve is thinking about three separate PEPs (among other things,
> they might be implemented on different timelines), that's not really
> possible (placing the features under control of one switch at
> different times would be an unacceptable compatibility break).
Yeah, the likelihood of different timelines basically means three PEPs
are going to be necessary. But I think we can have a single
"PYTHONWINDOWSANSI" (or ...MBCS) flag that covers all three as they
come in without it being a compatibility break, especially if (as Nick
suggested) there are _PYTHONWINDOWSANSI(CONSOLE|PATH|LOCALE) flags too.
It also gives us the ability to say "all ANSI or all UTF-8 is
supported; mix-and-match at your own risk".
> Anyway, it's not *obvious* that your premise is true, because code
> isn't written to do any of those things. It's written to process
> bytes agnostically. The question is what does the environment look
> like. Steve obviously has a perspective on environment which suggests
> that these aspects are often decoupled because in Windows the actual
> filesystem is never bytes-oriented. I don't know if it's possible to
> construct a coherent environment where these aspects are decoupled,
> but I can't say it's impossible, either.
Actually, the three items are basically completely decoupled, though it
isn't obvious.
* stdin/stdout/stderr are text wrappers by default (under my changes,
using the console encoding when it's a console and the locale encoding
when it's a file/pipe). There's no point reading bytes from the console,
and redirected files or pipes are unaffected by the change.
* the file system encoding only affects paths passed into/returned from
the OS as bytes
* the locale encoding affects files opened in text mode, which means that
if you open a file in 'rb' mode and read paths from it, the locale
encoding has no effect on whether the bytes are in the right encoding to
be used as paths
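To make the three items concrete, here is a minimal sketch that reports
each encoding for the running interpreter. The function name
current_encodings is hypothetical and only for illustration; the calls it
wraps are standard library ones.

```python
import locale
import sys


def current_encodings() -> dict:
    """Collect the three encodings discussed above for this interpreter."""
    return {
        # Encoding of the text wrapper around stdin; None if stdin is
        # missing or has been replaced by an object without an encoding.
        "stdin": getattr(sys.stdin, "encoding", None),
        # Encoding used when paths cross the OS boundary as bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Locale encoding, normally the default for text-mode open().
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in current_encodings().items():
        print(f"{name}: {value}")
```

The three values come from three separate calls, so they can differ from
each other, which is the sense in which the items are decoupled.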
So while there are scenarios that use multiple pieces of this, there
should only be one change impacting any scenario:
* reading str paths from a file - locale encoding
* reading bytes paths from a file - filesystem encoding
* reading str paths from a pipe/redirected file - locale encoding
* reading bytes paths from a pipe/redirected file - filesystem encoding
* reading str paths from the console - console encoding
* reading bytes paths from the console (i.e.
sys.stdin.buffer.raw.read()) - filesystem encoding
The last case doesn't make sense anyway right now, as
sys.stdin.buffer.raw has no specified encoding and you can't reliably
read paths from it. Perhaps there exist examples of where this is put to
good use (bearing in mind it must be an actual console - not a
redirection or pipe) - I would love to hear about them.
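The first two file cases in the list above can be sketched as follows.
This is only an illustration: both helper names are hypothetical, and the
example path is kept ASCII so that it round-trips whatever the local
encodings happen to be.

```python
import locale
import os
import tempfile


def roundtrip_bytes_path(path: str) -> str:
    """Bytes path stored in a file: only the filesystem encoding matters."""
    raw = os.fsencode(path)  # str -> bytes using the filesystem encoding
    with tempfile.TemporaryDirectory() as tmp:
        listing = os.path.join(tmp, "paths.bin")
        with open(listing, "wb") as f:
            f.write(raw)
        with open(listing, "rb") as f:  # binary mode: no locale decoding
            data = f.read()
    return os.fsdecode(data)  # bytes -> str using the filesystem encoding


def roundtrip_str_path(path: str) -> str:
    """Str path stored in a text file: only the locale encoding matters."""
    enc = locale.getpreferredencoding(False)
    with tempfile.TemporaryDirectory() as tmp:
        listing = os.path.join(tmp, "paths.txt")
        # The locale encoding is passed explicitly to make the dependency visible.
        with open(listing, "w", encoding=enc, newline="") as f:
            f.write(path)
        with open(listing, "r", encoding=enc, newline="") as f:
            return f.read()
```

Each helper touches exactly one of the encodings, which matches the "one
change per scenario" claim.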
As far as I can tell, any other combination requires the Python
developer to convert between str and bytes themselves. That may lead to
errors if they have assumed that the encoding of the bytes would never
change. Code that ignores encodings and uses bytes or str
exclusively is only going to encounter one (bytes) or two (str) of the
changes.
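As a rough illustration of the risky case, the sketch below contrasts a
hand-rolled conversion that assumes a fixed encoding with os.fsencode().
The helper names are hypothetical, and cp1252 is only an assumed example
of a hard-coded ANSI code page, not something taken from the proposal.

```python
import os


def encode_path_hardcoded(path: str) -> bytes:
    # Assumes the bytes encoding of paths is always cp1252 (an assumption
    # baked in by the hypothetical developer).
    return path.encode("cp1252")


def encode_path_portable(path: str) -> bytes:
    # Follows whatever the filesystem encoding currently is.
    return os.fsencode(path)


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


legacy = encode_path_hardcoded("caf\u00e9.txt")
# The hard-coded bytes are valid cp1252 but are not valid UTF-8, so they
# stop working as a bytes path if the filesystem encoding becomes UTF-8.
print(decodes_as(legacy, "cp1252"), decodes_as(legacy, "utf-8"))
```

Code that calls os.fsencode() and os.fsdecode() instead of naming an
encoding does not carry that assumption.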
Cheers,
Steve