Re: [Python-ideas] Fix default encodings on Windows
On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
I'm still not sure we're talking about the same thing right now.
For `open(path_as_bytes).read()`, are we talking about the way path_as_bytes is passed to the file system? Or the codec used to decide the returned string?
We are talking about the way path_as_bytes is passed to the filesystem, and in particular what encoding path_as_bytes is *actually* in, when it was obtained from a file or other stream opened in binary mode.
On 15Aug2016 0954, Random832 wrote:
On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
I'm still not sure we're talking about the same thing right now.
For `open(path_as_bytes).read()`, are we talking about the way path_as_bytes is passed to the file system? Or the codec used to decide the returned string?
We are talking about the way path_as_bytes is passed to the filesystem, and in particular what encoding path_as_bytes is *actually* in, when it was obtained from a file or other stream opened in binary mode.
Okay good, we are talking about the same thing. Passing path_as_bytes in that location has been deprecated since 3.3, so we are well within our rights (and probably overdue) to make it a TypeError in 3.6. While it's obviously an invalid assumption, for the purposes of changing the language we can assume that no existing code is passing bytes into any functions where it has been deprecated. As far as I'm concerned, there are currently no filesystem APIs on Windows that accept paths as bytes.

Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like:

    if os.name == 'nt':
        f = os.stat(os.listdir('.')[-1])
    else:
        f = os.stat(os.listdir(b'.')[-1])

Or simply using the bytes variant unconditionally because they heard it was faster (sacrificing cross-platform correctness, since it may not correctly round-trip on Windows).

My proposal is to remove all use of the *A APIs and only use the *W APIs. That completely removes the (already deprecated) use of bytes as paths. I then propose to change the (unused on Windows) sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into filesystem functions by transcoding into UTF-16 and calling the *W APIs. This completely removes the active codepage from the chain, allows paths returned from the filesystem to correctly round-trip via bytes in Python, and allows those bytes paths to be manipulated at '\' characters.

(Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate pairs. But that would prevent basic manipulation, which seems to be a higher priority.)

This does not allow you to take bytes from an arbitrary source and assume that they are correctly encoded for the file system. Python 3.3, 3.4 and 3.5 have been warning that doing that is deprecated and the path needs to be decoded to a known encoding first.
At this stage, it's time for us to either make byte paths an error, or to specify a suitable encoding that can correctly round-trip paths. If this does not answer the question, I'm going to need the question to be explained more clearly for me. Cheers, Steve
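A quick illustrative sketch (not from the thread) of why a fixed UTF-8 convention round-trips where a legacy code page cannot: the hypothetical filename below survives a UTF-8 encode/decode cycle, while a single-byte ANSI code page such as cp1252 cannot represent it at all.

```python
# Hypothetical filename mixing scripts that no single legacy ANSI
# code page covers; UTF-8 round-trips it losslessly.
name = "отчёт-αβγ.txt"
assert name.encode("utf-8").decode("utf-8") == name

# A legacy code page (cp1252 chosen here purely as an example) fails:
try:
    name.encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot represent this name")
```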
On 15Aug2016 1126, Steve Dower wrote:
My proposal is to remove all use of the *A APIs and only use the *W APIs. That completely removes the (already deprecated) use of bytes as paths. I then propose to change the (unused on Windows) sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into filesystem functions by transcoding into UTF-16 and calling the *W APIs.
Of course, I meant sys.getfilesystemencoding() here. The C functions have "FSDefault" in many of the names, which is why I guessed the wrong Python variant. Cheers, Steve
On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.dower@python.org> wrote:
(Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate pairs. But that would prevent basic manipulation which seems to be a higher priority.)
The CRT manually decodes and encodes using the private functions __acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use either the ANSI or OEM codepage, depending on the value returned by WinAPI AreFileApisANSI. CPython could follow suit. Doing its own encoding and decoding would enable using filesystem functions that will never get an [A]NSI version (e.g. GetFileInformationByHandleEx), while still retaining backward compatibility.

Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning when lpUsedDefaultChar is true. Filesystem decoding could use MB_ERR_INVALID_CHARS and raise a warning and retry without this flag for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This could be implemented with a new "warning" handler for PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new 'fsmbcs' encoding could be added that checks AreFileApisANSI to choose between CP_ACP and CP_OEMCP.
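As a rough cross-platform analogue of the MB_ERR_INVALID_CHARS behaviour described above (a sketch using Python's strict cp932 codec as a stand-in for the Win32 conversion functions, not the actual CRT path): an invalid DBCS sequence raises instead of being silently mapped.

```python
# b'\x81' is a cp932 lead byte with no trail byte: an invalid DBCS
# sequence. Strict decoding refuses to guess, mirroring what
# MultiByteToWideChar does when MB_ERR_INVALID_CHARS is set.
bad = b"abc\x81"
try:
    bad.decode("cp932")
except UnicodeDecodeError:
    print("invalid DBCS sequence rejected")
```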
On 15Aug2016 1819, eryk sun wrote:
On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.dower@python.org> wrote:
(Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate pairs. But that would prevent basic manipulation which seems to be a higher priority.)
The CRT manually decodes and encodes using the private functions __acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use either the ANSI or OEM codepage, depending on the value returned by WinAPI AreFileApisANSI. CPython could follow suit. Doing its own encoding and decoding would enable using filesystem functions that will never get an [A]NSI version (e.g. GetFileInformationByHandleEx), while still retaining backward compatibility.
Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning when lpUsedDefaultChar is true. Filesystem decoding could use MB_ERR_INVALID_CHARS and raise a warning and retry without this flag for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This could be implemented with a new "warning" handler for PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new 'fsmbcs' encoding could be added that checks AreFileApisANSI to choose between CP_ACP and CP_OEMCP.
None of that makes it less complicated or more reliable. Warnings based on values are bad (they should be based on types), and using the *W APIs exclusively is the right way to go.

The question then is whether we allow file system functions to return bytes, and if so, which encoding to use. This then directly informs what the functions accept, for the purposes of round-tripping.

*Any* encoding that may silently lose data is a problem, which basically leaves utf-16 as the only option. However, as that causes other problems, maybe we can accept the tradeoff of returning utf-8 and failing when a path contains invalid surrogate pairs (which is extremely rare by comparison to characters outside of CP_ACP)?

If utf-8 is unacceptable, we're back to the current situation and should be removing the support for bytes that was deprecated three versions ago. Cheers, Steve
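The tradeoff described here can be sketched in a few lines (illustrative, not from the thread): a lone surrogate, which is representable in Windows UTF-16 filenames, fails strict UTF-8 encoding, while the 'surrogatepass' error handler provides a WTF-8-style escape hatch.

```python
lone = "abc\ud800"        # lone high surrogate: legal in an NTFS name
try:
    lone.encode("utf-8")  # strict UTF-8 refuses it
except UnicodeEncodeError:
    print("strict utf-8 fails on a lone surrogate")

# 'surrogatepass' round-trips it, at the cost of producing bytes that
# are not valid UTF-8 for other consumers:
raw = lone.encode("utf-8", "surrogatepass")
assert raw.decode("utf-8", "surrogatepass") == lone
```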
On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.dower@python.org> wrote:
and using the *W APIs exclusively is the right way to go.
My proposal was to use the wide-character APIs, but transcoding CP_ACP without best-fit characters and raising a warning whenever the default character is used (e.g. substituting Katakana middle dot when creating a file using a bytes path that has an invalid sequence in CP932). This proposal was in response to the case made by Stephen Turnbull. If using UTF-8 is getting such heavy pushback, I thought half a solution was better than nothing, and it also sets up the infrastructure to easily switch to UTF-8 if that idea eventually gains acceptance. It could raise exceptions instead of warnings if that's preferred, since bytes paths on Windows are already deprecated.
*Any* encoding that may silently lose data is a problem, which basically leaves utf-16 as the only option. However, as that causes other problems, maybe we can accept the tradeoff of returning utf-8 and failing when a path contains invalid surrogate pairs
Are there any common sources of illegal UTF-16 surrogates in Windows filenames? I see that WTF-8 (Wobbly) was developed to handle this problem. A WTF-8 path would roundtrip back to the filesystem, but it should only be used internally in a program.
2016-08-16 8:06 GMT+02:00 eryk sun <eryksun@gmail.com>:
My proposal was to use the wide-character APIs, but transcoding CP_ACP without best-fit characters and raising a warning whenever the default character is used (e.g. substituting Katakana middle dot when creating a file using a bytes path that has an invalid sequence in CP932).
A problem with all these proposals is that they *add* new code to the CPython code base, code specific to Windows. There are very few core developers (1 or 2?) who work on this code specific to Windows. I would prefer to *drop* code specific to Windows rather than *adding* (or changing) code specific to Windows, just to make the CPython code base simpler to maintain. It's already annoying enough.

It's common that a Python function has one implementation for all platforms except Windows, and a second implementation specific to Windows. An example: os.listdir()

* ~150 lines of C code for the Windows implementation
* ~100 lines of C code for the UNIX/BSD implementation
* The Windows implementation is split in two parts: Unicode and bytes, so the code is basically duplicated (2 versions)

If you remove the bytes support, the Windows function is reduced to 100 lines (-50).

I'm not sure that modifying the API using bytes would solve any issue on Windows, and there is an obvious risk of regression (mojibake when you concatenate strings encoded to UTF-8 and to the ANSI code page).

I'm in favor of forcing developers to use Unicode on Windows, which is the correct way to use the Windows API. The side effect is that such code works perfectly well on UNIX/BSD ;-) To be clear: drop the deprecated code to support bytes on Windows. I already proposed to drop bytes support on Windows and most answers were "please keep them", so another option is to keep the "broken code" as the status quo...

I really hate APIs using bytes on Windows because they use WideCharToMultiByte() (encode unicode to bytes) in a mode which is likely to lead to mojibake: unencodable characters are replaced with "best fit characters" or "?".
https://unicodebook.readthedocs.io/operating_systems.html#encode-and-decode-...

In a perfect world, I would also propose to deprecate bytes filenames on UNIX, but I expect an insane flamewar on the definition of "UNIX", history of UNIX, etc.
(non technical discussion, since Unicode works very well on Python 3...). Victor
Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like:
    if os.name == 'nt':
        f = os.stat(os.listdir('.')[-1])
    else:
        f = os.stat(os.listdir(b'.')[-1])
REALLY? Do we really want to encourage using bytes as paths? IIUC, anyone that wants to platform-independentify that code just needs to use proper strings (or pathlib) for paths everywhere, yes? I understand that pre-surrogate-escape, there was a need for bytes paths, but those days are gone, yes? So why, at this late date, kludge what should be a deprecated pattern into the Windows build??? -CHB
My proposal is to remove all use of the *A APIs and only use the *W APIs. That completely removes the (already deprecated) use of bytes as paths.
Yes, this is good.
I then propose to change the (unused on Windows) sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into filesystem functions by transcoding into UTF-16 and calling the *W APIs.
I'm really not sure utf-8 is magic enough to do this. Where do you imagine that utf-8 is coming from as bytes??? AIUI, while utf-8 is almost universal in *nix for file system names, folks do not want to count on it -- hence the use of bytes. And it is far less prevalent in the Windows world...
, allows paths returned from the filesystem to correctly roundtrip via bytes in Python,
That you could do with native bytes (UTF-16, yes?)
. But that would prevent basic manipulation which seems to be a higher priority.)
Still think Unicode is the answer to that...
At this stage, it's time for us to either make byte paths an error,
+1. :-) CHB
On 16 August 2016 at 11:34, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like:
    if os.name == 'nt':
        f = os.stat(os.listdir('.')[-1])
    else:
        f = os.stat(os.listdir(b'.')[-1])
REALLY? Do we really want to encourage using bytes as paths? IIUC, anyone that wants to platform-independentify that code just needs to use proper strings (or pathlib) for paths everywhere, yes?
The problem is that bytes-as-paths actually *does* work for Mac OS X and systemd based Linux distros properly configured to use UTF-8 for OS interactions. This means that a lot of backend network service code makes that assumption, especially when it was originally written for Python 2, and rather than making it work properly on Windows, folks just drop Windows support as part of migrating to Python 3. At an ecosystem level, that means we're faced with a choice between implicitly encouraging folks to make their code *nix only, and finding a way to provide a more *nix like experience when running on Windows (where UTF-8 encoded binary data just works, and either other encodings lead to mojibake or else you use chardet to figure things out). Steve is suggesting that the latter option is preferable, a view I agree with since it lowers barriers to entry for Windows based developers to contribute to primarily *nix focused projects.
I understand that pre-surrogate-escape, there was a need for bytes paths, but those days are gone, yes?
No, UTF-8 encoded bytes are still the native language of network service development: http://utf8everywhere.org/ It also helps with cases where folks are switching back and forth between Python and other environments like JavaScript and Go where the UTF-8 assumption is more prevalent.
So why, at this late date, kludge what should be a deprecated pattern into the Windows build???
Promoting cross-platform consistency often leads to enabling patterns that are considered a bad idea from a native platform perspective, and this strikes me as an example of that (just as the binary/text separation itself is a case where Python 3 diverged from the POSIX text model to improve consistency across *nix, Windows, JVM and CLR environments). Cheers, Nick.
On 15 August 2016 at 19:26, Steve Dower <steve.dower@python.org> wrote:
Passing path_as_bytes in that location has been deprecated since 3.3, so we are well within our rights (and probably overdue) to make it a TypeError in 3.6. While it's obviously an invalid assumption, for the purposes of changing the language we can assume that no existing code is passing bytes into any functions where it has been deprecated.
As far as I'm concerned, there are currently no filesystem APIs on Windows that accept paths as bytes.
[...] On 16 August 2016 at 03:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem is that bytes-as-paths actually *does* work for Mac OS X and systemd based Linux distros properly configured to use UTF-8 for OS interactions. This means that a lot of backend network service code makes that assumption, especially when it was originally written for Python 2, and rather than making it work properly on Windows, folks just drop Windows support as part of migrating to Python 3.
At an ecosystem level, that means we're faced with a choice between implicitly encouraging folks to make their code *nix only, and finding a way to provide a more *nix like experience when running on Windows (where UTF-8 encoded binary data just works, and either other encodings lead to mojibake or else you use chardet to figure things out).
Steve is suggesting that the latter option is preferable, a view I agree with since it lowers barriers to entry for Windows based developers to contribute to primarily *nix focused projects.
So does this mean that you're recommending reverting the deprecation of bytes as paths in favour of documenting that bytes as paths is acceptable, but it will require an encoding of UTF-8 rather than the current behaviour? If so, that raises some questions:

1. Is it OK to backtrack on a deprecation by changing the behaviour like this? (I think it is, but others who rely on the current, deprecated, behaviour may not).
2. Should we be making "always UTF-8" the behaviour on all platforms, rather than just Windows (e.g., Unix systems which haven't got UTF-8 as their locale setting)? This doesn't seem to be a Windows-specific question any more (I'm assuming that if bytes-as-paths are deprecated, that's a cross-platform change, but see below).

Having said all this, I can't find the documentation stating that bytes paths are deprecated - the open() documentation for 3.5 says "file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped" and there's no mention of a deprecation. Steve - could you provide a reference? Paul
On Tue, Aug 16, 2016 at 10:53 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Having said all this, I can't find the documentation stating that bytes paths are deprecated - the open() documentation for 3.5 says "file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped" and there's no mention of a deprecation.
Bytes paths aren't deprecated on Unix -- only on Windows, and only for the os functions. You can see the deprecation warning with -Wall:

    >>> os.listdir(b'.')
    __main__:1: DeprecationWarning: The Windows bytes API has been deprecated, use Unicode filenames instead

AFAIK this isn't documented.

Since the Windows CRT's _open implementation uses MultiByteToWideChar without the flag MB_ERR_INVALID_CHARS, bytes paths should also be deprecated for io.open. The problem is that bad DBCS sequences are mapped silently to the default Unicode character instead of raising an error.
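The silent substitution described here can be approximated (a sketch; Python's codec machinery with the 'replace' error handler stands in for the CRT's lenient conversion): the invalid byte is quietly turned into a replacement character and the original data is unrecoverable.

```python
# Without strict error checking, the truncated cp932 lead byte is
# silently replaced with U+FFFD rather than raising, which is the
# kind of lossy behaviour the deprecation was meant to flag.
decoded = b"abc\x81".decode("cp932", errors="replace")
assert "\ufffd" in decoded      # data replaced, original bytes lost
```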
On 16 August 2016 at 14:09, eryk sun <eryksun@gmail.com> wrote:
On Tue, Aug 16, 2016 at 10:53 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Having said all this, I can't find the documentation stating that bytes paths are deprecated - the open() documentation for 3.5 says "file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped" and there's no mention of a deprecation.
Bytes paths aren't deprecated on Unix -- only on Windows, and only for the os functions. You can see the deprecation warning with -Wall:
    >>> os.listdir(b'.')
    __main__:1: DeprecationWarning: The Windows bytes API has been deprecated, use Unicode filenames instead
Thanks. So this remains a Windows-only issue (which is good).
AFAIK this isn't documented.
It probably should be. Although if we're changing the deprecation to a behaviour change, then maybe there's no point. But some of the arguments here about breaking code are hinging on the idea that people currently using the bytes API are using an (on the way to being) unsupported feature, and it's not really acceptable to take that position if the deprecation wasn't announced.

If the objections being raised here (in the context of Japanese encodings and similar) would apply equally to the bytes API being removed, then it seems to me that we have a failure in our deprecation process, as those objections should have been addressed when we started the deprecation.

Alternatively, if the deprecation of the os functions is OK, but it's the deprecation of open (and presumably io.open) that's the issue, then the whole process is somewhat problematic - it seems daft in the long term to deprecate bytes paths in os functions like os.open and yet allow them in the supposedly higher level io.open and the open builtin. (And in the short term, it's illogical to me that the deprecation isn't for open as well as the os functions).

I don't have a view on whether the cost to Japanese users is sufficiently high that we should continue along the deprecation path (or even divert to an enforced-UTF8 approach that's just as problematic for them). But maybe it's worth a separate thread, specifically focused on the use of bytes paths, rather than being lumped in with other Windows encoding issues? Paul
On Tue, Aug 16, 2016, at 09:59, Paul Moore wrote:
It probably should be. Although if we're changing the deprecation to a behaviour change, then maybe there's no point. But some of the arguments here about breaking code are hinging on the idea that people currently using the bytes API are using an (on the way to being) unsupported feature and it's not really acceptable to take that position if the deprecation wasn't announced. If the objections being raised here (in the context of Japanese encodings and similar) would apply equally to the bytes API being removed,
There also seems to be an undercurrent in the discussions we're having now that using bytes paths and not unicode paths is somehow The Right Thing for unix-like OSes, and that breaking it (in whatever way) on windows causes code that Does The Right Thing on unix to require extra work to port to windows. That's seemingly both the rationale for the proposal itself and for the objections.
There also seems to be an undercurrent in the discussions we're having now that using bytes paths and not unicode paths is somehow The Right Thing for unix-like OSes,
Almost -- from my perusing of discussions from the last few years, there do seem to be some library developers and *nix aficionados that DO think it's The Right Thing -- after all, a char* has always worked, yes? But these folks also seem to think that a *nix system with no way of knowing the encoding of the names in the file system (which could have more than one) is not "broken" in any way.

A note about "utf-8 everywhere": while maybe a good idea, it's my understanding that *nix developers absolutely do not want utf-8 to be assumed in the Python APIs. Rather, this is all about punting the handling of encodings down to the application level, rather than the OS and library level. Which is more backward compatible, but otherwise a horrible idea. And very much in conflict with Python 3's approach. So it seems odd to assume utf-8 on Windows, where it is less ubiquitous.

Back to "The Right Thing" -- it's clear to me that everyone supporting this proposal is very much doing so because it's "The Pragmatic Thing". But it seems folks porting from py2 need to explicitly convert the calls from str to bytes anyway to get the bytes behavior. With surrogate escapes, now you need to do nothing. So we're really supporting code that was ported to py3 earlier in the game - but it seems a bad idea to cement that hacky solution in place.

And if the filenames in question are coming from a byte stream somehow, rather than file system API calls, then you really do need to know the encoding -- yes really! If a developer wants to assume utf-8, that's fine, but the developer should be making that decision, not Python itself. And not on Windows only. -CHB
and that breaking it (in whatever way) on windows causes code that Does The Right Thing on unix to require extra work to port to windows. That's seemingly both the rationale for the proposal itself and for the objections.
I just want to clearly address two points, since I feel like multiple posts have been unclear on them.

1. The bytes API was deprecated in 3.3 and it is listed in https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is an unfortunate oversight, but it was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.

2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.

This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?" The choices are:

* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

Currently we have the second option. My preference is the fourth option, as it will cause the least breakage of existing code and enable the most amount of code to just work in the presence of non-ACP characters. The fifth option is the best for round-tripping within Windows APIs.

The only code that will break with any change is code that was using an already deprecated API. Code that correctly uses str to represent "encoding agnostic text" is unaffected.

If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices? Cheers, Steve
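A small sketch (not from the thread) of why the fifth option blocks the basic path manipulation mentioned earlier: UTF-16-LE bytes interleave NULs, so naive splitting on b'\\' misfires, whereas UTF-8 bytes keep ASCII intact.

```python
path = "C:\\Temp\\file.txt"

# UTF-8 leaves ASCII bytes as-is, so b'\\' manipulation just works:
assert path.encode("utf-8").split(b"\\") == [b"C:", b"Temp", b"file.txt"]

# UTF-16-LE interleaves NUL bytes with every ASCII character,
# so the same split produces NUL-littered fragments instead:
assert b"\x00" in path.encode("utf-16-le")
assert path.encode("utf-16-le").split(b"\\") != [b"C:", b"Temp", b"file.txt"]
```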
On 2016-08-16 08:56, Steve Dower wrote:
I just want to clearly address two points, since I feel like multiple posts have been unclear on them.
1. The bytes API was deprecated in 3.3 and it is listed in https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is an unfortunate oversight, but it was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.
I strongly disagree with that. If using the code does not raise a visible warning (because DeprecationWarning is silent by default), and the documentation does not say it's deprecated, it hasn't actually been deprecated. Deprecation is the communicative act of saying "don't do this anymore". If that information is not communicated in the appropriate places (e.g., the docs), the deprecation has not occurred. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
Thanks for the clarity, Steve, a couple questions/thoughts: The choices are:
* don't represent them at all (remove bytes API)
Would the bytes API be removed on *nix also?
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page

"Failure is not an option" -- These two seem like a plain old bad idea.

* convert and fail on invalid surrogate pairs

where would an invalid surrogate pair come from? never from a file system API call, yes?

* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

would this be doing anything -- or just keeping whatever the Windows API takes/returns? i.e. exactly what is done on *nix?
The fifth option is the best for round-tripping within Windows APIs.
How is it better? only performance (i.e. no encoding/decoding required) -- or would it be more reliable as well? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, Aug 16, 2016, at 12:12, Chris Barker wrote:
* convert and fail on invalid surrogate pairs
where would an invalid surrogate pair come from? never from a file system API call, yes?
In principle it could, if the filesystem contains a file with an invalid surrogate pair. Nothing else, in general, prevents such a file from being created, though it's not easy to do so by accident.
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
I just want to clearly address two points, since I feel like multiple posts have been unclear on them.
1. The bytes API was deprecated in 3.3 and it is listed in https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is an unfortunate oversight, but it was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.
For clarity, the statement was:

"""
issue 13374: The Windows bytes API has been deprecated in the os module. Use Unicode filenames, instead of bytes filenames, to not depend on the ANSI code page anymore and to support any filename.
"""

First of all, note that I'm perfectly OK with deprecating bytes paths. However, this statement specifically does *not* say anything about use of bytes paths outside of the os module (builtin open and the io module being the obvious places).

Secondly, it appears that unfortunately the main Python documentation wasn't updated to state this. So while "we can freely change or remove the support now" may be true, it's not that simple - the debate here is at least in part about builtin open, and there's nothing anywhere that I can see that states that bytes support in open has been deprecated. Maybe there should have been, and maybe everyone involved at the time assumed that it was, but that's water under the bridge.
2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
People passing bytes to open() have in my view, already chosen not to follow the standard advice of "decode incoming data at the boundaries of your application". They may have good reasons for that, but it's perfectly reasonable to expect them to take responsibility for manually tracking the encoding of the resulting bytes values flowing through their code. It is of course, also true that "works for me in my environment" is a viable strategy - but the maintenance cost of this strategy if things change (whether in Python, or in the environment) is on the application developers - they are hoping that cost is minimal, but that's a risk they choose to take.
The choices are:
* don't represent them at all (remove bytes API) * convert and drop characters not in the (legacy) active code page * convert and fail on characters not in the (legacy) active code page * convert and fail on invalid surrogate pairs * represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
Actually, with the exception of the last one (which seems "obviously not sensible") these all feel more to me like answers to the question "how do we best interpret bytes provided to us as UTF-16?". It's a subtle point, but IMO important. It's much easier to answer the question you posed, but what people are actually concerned about is interpreting bytes, not representing Unicode. The correct answer to "how do we interpret bytes" is "in the face of ambiguity, refuse to guess" - but people using the bytes API have *already* bought into the current heuristic for guessing, so changing affects them.
Currently we have the second option.
My preference is the fourth option, as it will cause the least breakage of existing code and enable the most code to just work in the presence of non-ACP characters.
It changes the encoding used to interpret bytes. While it preserves more information in the "UTF-16 to bytes" direction, nobody really cares about that direction. And in the "bytes to UTF-16" direction, it changes the interpretation of basically all non-ASCII bytes. That's a lot of breakage. Although as already noted, it's only breaking things that currently work while relying on a (maybe) undocumented API (byte paths to builtin open isn't actually documented) and on an arguably bad default that nevertheless works for them.
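To make the "bytes to text" ambiguity concrete, here is a small sketch. It is not Windows-specific; cp1252 merely stands in for a Western European active code page, so the same byte string can be decoded two different ways depending on which codec the runtime assumes:

```python
# The same bytes name a different file depending on the assumed codec.
# cp1252 stands in here for a legacy active code page (an assumption for
# illustration only); the proposal would assume UTF-8 instead.
raw = "café".encode("cp1252")          # b'caf\xe9' under the legacy codec
assert raw.decode("cp1252") == "café"  # legacy interpretation round-trips

# Under a UTF-8 interpretation, b'\xe9' on its own is invalid, so the
# "convert and fail" options would reject these bytes outright:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # refused rather than silently misinterpreted
```

This is exactly the breakage being weighed: bytes that "work" today under the ACP heuristic would be rejected (or reinterpreted) under a UTF-8 assumption.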
The fifth option is the best for round-tripping within Windows APIs.
The only code that will break with any change is code that was using an already deprecated API. Code that correctly uses str to represent "encoding agnostic text" is unaffected.
Code using Unicode is unaffected, certainly. Ideally that means that only a tiny minority of users should be affected. Are we over-reacting to reports of standard practices in Japan? I've no idea.
If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
Accept that we should have deprecated builtin open and the io module, but didn't do so. Extend the existing deprecation of bytes paths on Windows to cover *all* APIs, not just the os module, but modify the deprecation to be "use of the Windows CP_ACP code page (via the ...A Win32 APIs) is deprecated and will be replaced with use of UTF-8 as the implied encoding for all bytes paths on Windows starting in Python 3.7". Document and publicise it much more prominently, as it is a breaking change. Then leave it one release for people to prepare for the change. Oh, and (obviously) check back with Guido on his view - he's expressed concern, but I for one don't have the slightest idea in this case what his preference would be... Paul
Paul Moore writes:
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
That's incomplete, AFAICS. (Paul makes this point somewhat differently.) We don't want to represent paths in bytes on Windows if we can avoid it. Nor does UTF-16 really enter into it (except for the technical issue of invalid surrogate pairs). So a full statement is, "How do we best represent Windows file system paths in bytes for interoperability with systems that natively represent paths in bytes?" ("Other systems" refers to both other platforms and existing programs on Windows.) BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)? Paul says:
People passing bytes to open() have in my view, already chosen not to follow the standard advice of "decode incoming data at the boundaries of your application". They may have good reasons for that, but it's perfectly reasonable to expect them to take responsibility for manually tracking the encoding of the resulting bytes values flowing through their code.
Abstractly true, but in practice there's no such need for those who made the choice! In a properly set up POSIX locale[1], it Just Works by design, especially if you use UTF-8 as the preferred encoding. It's Windows developers and users who suffer, not those who wrote the code, nor their primary audience which uses POSIX platforms.
It is of course, also true that "works for me in my environment" is a viable strategy - but the maintenance cost of this strategy if things change (whether in Python, or in the environment) is on the application developers - they are hoping that cost is minimal, but that's a risk they choose to take.
Nick's point is that the risk is on Windows users and developers for the Windows platform who did *not* make that choice, but rather had it made for them by developers on a different platform where it Just Works. He argues that we should level the playing field. It's also relevant that those developers on the originating platform for the code typically resist complexifying changes to make things work on other platforms too (cf. Victor's advocacy of removing the bytes APIs on Windows). Victor's points are good IMO; he's not just resisting Windows, there are real resource consequences.
Code using Unicode is unaffected, certainly. Ideally that means that only a tiny minority of users should be affected. Are we over-reacting to reports of standard practices in Japan? I've no idea.
AFAIK, India and Southeast Asia have already abandoned their indigenous standards in favor of Unicode/UTF-8, so it doesn't matter if they use str or bytes, either way Steve's proposal will Just Work. I don't know anything about Arabic, Hebrew, Cyrillic, or the Eastern European scripts. That leaves China, which is like Japan in having had a practically universal encoding (ie, every script you'll actually see roundtrips, emoji being the only practical issue) since the 1970s. So I suspect Chinese also primarily use their local code page (GB2312 or GB18030) for plain text documents, possibly including .ini and Makefiles.

Over-reaction? I have no idea either. Just a potentially widespread risk, both to users and to Python's reputation for maintaining compatibility. (I don't think it's "fair", but among my acquaintances Python has a poor rep -- Steve's argument that if you develop code for 3.5 you should expect to have to modify it to use it with 3.6 cuts no ice with them.)
If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
Accept that we should have deprecated builtin open and the io module, but didn't do so. Extend the existing deprecation of bytes paths on Windows to cover *all* APIs, not just the os module, but modify the deprecation to be "use of the Windows CP_ACP code page (via the ...A Win32 APIs) is deprecated and will be replaced with use of UTF-8 as the implied encoding for all bytes paths on Windows starting in Python 3.7". Document and publicise it much more prominently, as it is a breaking change. Then leave it one release for people to prepare for the change.
I like this one! If my paranoid fears are realized, in practice it might have to wait two releases, but at least this announcement should get people who are at risk to speak up. If they don't, then you can just call me "Chicken Little" and go ahead! Footnotes: [1] An oxymoron, but there you go.
On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)?
A program can pass the filesystem a name containing one or more surrogate codes that isn't in a valid UTF-16 surrogate pair (i.e. a leading code in the range D800-DBFF followed by a trailing code in the range DC00-DFFF). In the user-mode runtime library and kernel executive, nothing up to the filesystem driver checks for a valid UTF-16 string. Microsoft's filesystems remain compatible with UCS2 from the 90s and don't care that the name isn't legal UTF-16. The same goes for the in-memory filesystems used for named pipes (NPFS, \\.\pipe) and mailslots (MSFS, \\.\mailslot). But non-Microsoft filesystems don't necessarily store names as wide-character strings. They may use UTF-8, in which case an invalid UTF-16 name will cause the system call to fail because it's an invalid parameter. If the filesystem allows creating such a badly named file or directory, it can still be accessed using a regular unicode path, which is how things stand currently. I see that Victor has suggested using "surrogatepass" in issue 27781. That would allow seamless operation. The downside is that bytes have a higher chance of leaking out of Python than strings created by 'surrogateescape' on Unix. But since it isn't a proper Unicode string on disk, at least nothing has changed substantively by transcoding to "surrogatepass" UTF-8.
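The "surrogatepass" behaviour described above can be illustrated in a few lines (this is a generic sketch of the error handler, not the patch from issue 27781 itself). A lone surrogate is legal in a Windows filename but illegal in strict UTF-8, and "surrogatepass" lets it round-trip through bytes anyway:

```python
# A lone trailing surrogate, as Windows filesystems permit:
name = "\udc00bad"

# Strict UTF-8 refuses to encode it...
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    pass

# ...but "surrogatepass" produces the (invalid-UTF-8 / "WTF-8") bytes
# and can decode them back to the original string:
wobbly = name.encode("utf-8", "surrogatepass")
assert wobbly == b"\xed\xb0\x80bad"
assert wobbly.decode("utf-8", "surrogatepass") == name
```

This is what "seamless operation" means here: the bytes path round-trips even though it was never a proper Unicode string to begin with.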
eryk sun writes:
On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)?
Microsoft's filesystems remain compatible with UCS2
So it's not just invalid surrogate *pairs*, it's invalid surrogates of all kinds. This means that it's theoretically possible (though I gather that it's unlikely in the extreme) for a real Windows filename to be indistinguishable from one generated by Python's surrogateescape handler.

What happens when Python's directory manipulation functions on Windows encounter such a filename? Do they try to write it to the disk directory? Do they succeed? Does that depend on surrogateescape? Is there a reason in practice to allow surrogateescape at all on names in Windows filesystems, at least when using the *W API? You mention non-Microsoft filesystems; are they common enough to matter?

I admit that as we converge on sanity (UTF-8 for text/* content, some kind of Unicode for filesystem names) none of this is very likely to matter, but I'm a worrywart.... Steve
On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
So it's not just invalid surrogate *pairs*, it's invalid surrogates of all kinds. This means that it's theoretically possible (though I gather that it's unlikely in the extreme) for a real Windows filename to be indistinguishable from one generated by Python's surrogateescape handler.
Absolutely if the filesystem is one of Microsoft's such as NTFS, FAT32, exFAT, ReFS, NPFS (named pipes), MSFS (mailslots) -- and I'm pretty sure it's also possible with CDFS and UDFS. UDF allows any Unicode character except NUL.
What happens when Python's directory manipulation functions on Windows encounter such a filename? Do they try to write it to the disk directory? Do they succeed? Does that depend on surrogateescape?
Python allows these 'Unicode' (but not strictly UTF compatible) strings, so it doesn't have a problem with such filenames, as long as it's calling the Windows wide-character APIs.
Is there a reason in practice to allow surrogateescape at all on names in Windows filesystems, at least when using the *W API? You mention non-Microsoft filesystems; are they common enough to matter?
Previously I gave an example with a VirtualBox shared folder, which rejects names with invalid surrogates. I don't know how common that is in general. I typically switch between 2 guests on a Linux host and share folders between systems. In Windows I mount shared folders as directory symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd is a free ext2/ext3 filesystem driver for Windows. I mounted an ext2 disk in Windows 10. Next, in Python I created a file named "\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to using UTF-8 as the drive codepage, so I expected it to reject this filename, just like VBoxSF does. But it worked:

    >>> os.listdir('.')[-1]
    '\udc00b\udc00a\udc00d'

As expected the ANSI API substitutes question marks for the surrogate codes:

    >>> os.listdir(b'.')[-1]
    b'?b?a?d'

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I mounted the disk in Linux to check:

    >>> os.listdir(b'.')[-1]
    b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I think it's called WTF-8 (Wobbly Transformation Format). The file manager in Linux displays this file as "���b���a���d (invalid encoding)", and ls prints "???b???a???d". Python uses its surrogateescape error handler:

    >>> os.listdir('.')[-1]
    '\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

    >>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
    '\udc00b\udc00a\udc00d'
On 17Aug2016 0235, Stephen J. Turnbull wrote:
Paul Moore writes:
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
That's incomplete, AFAICS. (Paul makes this point somewhat differently.) We don't want to represent paths in bytes on Windows if we can avoid it. Nor does UTF-16 really enter into it (except for the technical issue of invalid surrogate pairs). So a full statement is, "How do we best represent Windows file system paths in bytes for interoperability with systems that natively represent paths in bytes?" ("Other systems" refers to both other platforms and existing programs on Windows.)
That's incorrect, or at least possible to interpret correctly as the wrong thing. The goal is "code compatibility with systems ...", not interoperability. Nothing about this will make it easier to take a path from Windows and use it on Linux or vice versa, but it will make it easier/more reliable to take code that uses paths on Linux and use it on Windows.
BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)?
Eryk answered this better than I would have.
Paul says:
People passing bytes to open() have in my view, already chosen not to follow the standard advice of "decode incoming data at the boundaries of your application". They may have good reasons for that, but it's perfectly reasonable to expect them to take responsibility for manually tracking the encoding of the resulting bytes values flowing through their code.
Abstractly true, but in practice there's no such need for those who made the choice! In a properly set up POSIX locale[1], it Just Works by design, especially if you use UTF-8 as the preferred encoding. It's Windows developers and users who suffer, not those who wrote the code, nor their primary audience which uses POSIX platforms.
You mentioned "locale", "preferred" and "encoding" in the same sentence, so I hope you're not thinking of locale.getpreferredencoding()? Changing that function is orthogonal to this discussion, despite the fact that in most cases it returns the same code page as what is going to be used by the file system functions (which in most cases will also be used by the encoding returned from sys.getfilesystemencoding()). When Windows developers and users suffer, I see it as my responsibility to reduce that suffering. Changing Python on Windows should do that without affecting developers on Linux, even though the Right Way is to change all the developers on Linux to use str for paths.
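The distinction Steve draws here can be checked directly. The two functions below are both real stdlib APIs; their results vary by platform and configuration, which is precisely why conflating them is a trap (so no particular output is assumed here):

```python
import locale
import sys

# The encoding used for filesystem functions (the subject of this thread):
fs_enc = sys.getfilesystemencoding()

# The locale's preferred encoding for text I/O - often the same value in
# practice, but nothing guarantees it, and changing one is orthogonal to
# changing the other:
pref_enc = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_enc)
print("locale preferred encoding:", pref_enc)
```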
If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
Accept that we should have deprecated builtin open and the io module, but didn't do so. Extend the existing deprecation of bytes paths on Windows to cover *all* APIs, not just the os module, but modify the deprecation to be "use of the Windows CP_ACP code page (via the ...A Win32 APIs) is deprecated and will be replaced with use of UTF-8 as the implied encoding for all bytes paths on Windows starting in Python 3.7". Document and publicise it much more prominently, as it is a breaking change. Then leave it one release for people to prepare for the change.
I like this one! If my paranoid fears are realized, in practice it might have to wait two releases, but at least this announcement should get people who are at risk to speak up. If they don't, then you can just call me "Chicken Little" and go ahead!
I don't think there's any reasonable way to noisily deprecate these functions within Python, but certainly the docs can be made clearer. People who explicitly encode with sys.getfilesystemencoding() should not get the deprecation message, but we can't tell whether they got their bytes from the right encoding or a RNG, so there's no way to discriminate.

I'm going to put together a summary post here (hopefully today) and get those who have been contributing to basically sign off on it, then I'll take it to python-dev. The possible outcomes I'll propose will basically be "do we keep the status quo, undeprecate and change the functionality, deprecate the deprecation and undeprecate/change in a couple releases, or say that it wasn't a real deprecation so we can deprecate and then change functionality in a couple releases".

Cheers, Steve
Steve Dower writes:
On 17Aug2016 0235, Stephen J. Turnbull wrote:
So a full statement is, "How do we best represent Windows file system paths in bytes for interoperability with systems that natively represent paths in bytes?" ("Other systems" refers to both other platforms and existing programs on Windows.)
That's incorrect, or at least possible to interpret correctly as the wrong thing. The goal is "code compatibility with systems ...", not interoperability.
You're right, I stated that incorrectly. I don't have anything to add to your corrected version.
In a properly set up POSIX locale[1], it Just Works by design, especially if you use UTF-8 as the preferred encoding. It's Windows developers and users who suffer, not those who wrote the code, nor their primary audience which uses POSIX platforms.
You mentioned "locale", "preferred" and "encoding" in the same sentence, so I hope you're not thinking of locale.getpreferredencoding()? Changing that function is orthogonal to this discussion,
You consistently ignore Makefiles, .ini, etc. It is *not* orthogonal, it is *the* reason for all opposition to your proposal or request that it be delayed. Filesystem names *are* text in part because they are *used as filenames in text*.
When Windows developers and users suffer, I see it as my responsibility to reduce that suffering. Changing Python on Windows should do that without affecting developers on Linux, even though the Right Way is to change all the developers on Linux to use str for paths.
I resent that. If I were a partisan Linux fanboy, I'd be cheering you on, because I think your proposal is going to hurt an identifiable and large class of *Windows* users. I know about and fear this possibility because they use a language I love (Japanese) and an encoding I hate but have achieved a state of peaceful coexistence with (Shift JIS).

And on the general principle, *I* don't disagree. I mentioned earlier that I use only the str interfaces in my own code on Linux and Mac OS X, and that I suspect that there are no real efficiency implications to using str rather than bytes for those interfaces. On the other hand, the programming convenience of reading the occasional "text" filename (or other text, such as XML tags) out of a binary stream and passing it directly to filesystem APIs cannot be denied. I think that the kind of usage you propose (a fixed, universal codec, universally accepted; ie, 'utf-8') is the best way to handle that in the long run. But as Grandmaster Lasker said, "Before the end game, the gods have placed the middle game." (Lord Keynes isn't relevant here, Python will outlive all of us. :-)
I don't think there's any reasonable way to noisily deprecate these functions within Python, but certainly the docs can be made clearer. People who explicitly encode with sys.getfilesystemencoding() should not get the deprecation message, but we can't tell whether they got their bytes from the right encoding or a RNG, so there's no way to discriminate.
I agree with you within Python; the custom is for DeprecationWarnings to be silent by default. As for "making noise", how about announcing the deprecation as the top headline for 3.6, postponing the actual change to 3.7, and in the meantime you and Nick do a keynote duet at PyCon? (Your partner could be Guido, too, but Nick has been the most articulate proponent for this particular aspect of "inclusion". I think having a representative from the POSIX world explaining the importance of this for "all of us" would greatly multiply the impact.) Perhaps, given my proposed timing, a discussion at the language summit in '17 and the keynote in '18 would be the best timing.

(OT, political: I've been strongly influenced in this proposal by recently reading http://blog.aurynn.com/contempt-culture. There's not as much of it in Python as in other communities I'm involved in, but I think this would be a good symbolic opportunity to express our opposition to it. "Inclusion" isn't just about gender and race!)
I'm going to put together a summary post here (hopefully today) and get those who have been contributing to basically sign off on it, then I'll take it to python-dev. The possible outcomes I'll propose will basically be "do we keep the status quo, undeprecate and change the functionality, deprecate the deprecation and undeprecate/change in a couple releases, or say that it wasn't a real deprecation so we can deprecate and then change functionality in a couple releases".
FWIW, of those four, I dislike 'status quo' the most, and like 'say it wasn't real, deprecate and change' the best. Although I lean toward phrasing that as "we deprecated it, but we realize that practitioners are by and large not aware of the deprecation, and nobody expects the Spanish Inquisition". @Nick, if you're watching: I wonder if it would be possible to expand the "in the file system, bytes are UTF-8" proposal to POSIX as well, perhaps for 3.8?
"You consistently ignore Makefiles, .ini, etc."

Do people really do open('makefile', 'rb'), extract filenames and try to use them without ever decoding the file contents? I've honestly never seen that, and it certainly looks like the sort of thing Python 3 was intended to discourage. (As soon as you open(..., 'r') you're only affected by this change if you explicitly encode again with mbcs.)
On Thu, Aug 18, 2016 at 6:23 AM, Steve Dower <steve.dower@python.org> wrote:
"You consistently ignore Makefiles, .ini, etc."
Do people really do open('makefile', 'rb'), extract filenames and try to use them without ever decoding the file contents?
I'm sure they do :-(

But this has always confused me - back in the python2 "good old days" text and binary mode were exactly the same on *nix -- so folks sometimes fell into the trap of opening binary files as text on *nix, and then it failing on Windows - but I can't imagine why anyone would have done the opposite. So in porting to py3, they would have had to *add* that 'b' (and a bunch of b'filename') to keep the good old bytes-is-text interface. Why would anyone do that? Honestly confused.

I've honestly never seen that, and it certainly looks like the sort of thing Python 3 was intended to discourage.
exactly -- we really don't need to support folks reading text files in binary mode and not considering encoding... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 19 August 2016 at 08:05, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Aug 18, 2016 at 6:23 AM, Steve Dower <steve.dower@python.org> wrote:
"You consistently ignore Makefiles, .ini, etc."
Do people really do open('makefile', 'rb'), extract filenames and try to use them without ever decoding the file contents?
I'm sure they do :-(
But this has always confused me - back in the python2 "good old days" text and binary mode were exactly the same on *nix -- so folks sometimes fell into the trap of opening binary files as text on *nix, and then it failing on Windows but I can't image why anyone would have done the opposite.
So in porting to py3, they would have had to *add* that 'b' (and a bunch of b'filename') to keep the good old bytes is text interface.
Why would anyone do that?
For a fair amount of *nix-centric code that primarily works with ASCII data, adding the 'b' prefix is the easiest way to get into the common subset of Python 2 & 3. However, this means that such code is currently relying on deprecated functionality on Windows, and if we actually followed through on the deprecation with feature removal, Steve's expectation (which I agree with) is that many affected projects would just drop Windows support entirely, rather than changing their code to use str instead of bytes (at least under Python 3 on Windows).

The end result of Steve's proposed changes should be that such code would typically do the right thing across all of Mac OS X, Linux and Windows, as long as the latter two are configured to use "utf-8" as their default locale encoding or active code page (respectively). Linux and Windows would still both have situations encountered with ssh environment variable forwarding and with East Asian system configurations that have the potential to result in mojibake, where these challenges come up mainly with network communications on Linux, and local file processing on Windows.
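The common-subset style Nick describes looks something like the sketch below (the ".cfg" filter is purely illustrative). The same source runs on Python 2 and 3 on POSIX because the bytes API returns bytes names on both:

```python
# Python 2/3 common-subset style: bytes in, bytes out (POSIX).
import os

entries = os.listdir(b".")                 # bytes names on both 2 and 3
conf = [e for e in entries if e.endswith(b".cfg")]

# On Windows today this path goes through the deprecated *A APIs and the
# active code page; under the proposal the same bytes would instead be
# treated as UTF-8 and routed through the *W APIs.
```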
The reason I like Steve's proposal is that it gets us to a better baseline situation for cross-platform compatibility (including with the CLR and JVM API models), and replaces the status quo with three smaller as yet unsolved problems:

- network protocol interoperability on Linux systems configured with a non UTF-8 locale
- system access on Linux servers with a forwarded SSH environment that doesn't match the server settings
- processing file contents on Windows systems with an active code page other than UTF-8

For Linux, our answer is basically "UTF-8 is really the only system locale that works properly for other reasons, so we'll essentially wait for non-UTF-8 Linux systems to slowly age out of humanity's collective IT infrastructure".

For Windows, our preliminary answer is the same as the situation on Linux, which is why Stephen's concerned by the proposal - it reduces the incentive for folks to support Windows *properly*, by switching to modelling paths as text the way pathlib does. However, it seems to me that those higher level pathlib APIs are the best way to encourage future code to be more Windows friendly - they sweep a lot of these messy low level concerns under the API rug, so more Python 3 native code will use str paths by default, with bytes paths mainly showing up in Python 2/3 compatible code bases and some optimised data processing code.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Aug 19, 2016 at 12:30 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
So in porting to py3, they would have had to *add* that 'b' (and a bunch of b'filename') to keep the good old bytes is text interface.
Why would anyone do that?
For a fair amount of *nix-centric code that primarily works with ASCII data, adding the 'b' prefix is the easiest way to get into the common subset of Python 2 & 3.
Sure -- but it's entirely unnecessary, yes? If you don't change your code, you'll get py2 (bytes) strings as paths in py2, and py3 (Unicode) strings as paths on py3. So different, yes. But wouldn't it all work? So folks are making an active choice to change their code to get some perceived (real?) performance benefit???

However, as I understand it, py3 string paths did NOT "just work" in place of py2 paths before surrogate pairs were introduced (when was that?) -- so are we dealing with all of this because some (a lot, and important) libraries ported to py3 early in the game?

What I'm getting at is whether there is anything other than inertia that keeps folks using bytes paths in py3 code? Maybe it wouldn't be THAT hard to get folks to make the switch: it's EASIER to port your code to py3 this way!

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
Chris Barker writes:
Sure -- but it's entirely unnecessary, yes? If you don't change your code, you'll get py2(bytes) strings as paths in py2, and py3 (Unicode) strings as paths on py3. So different, yes. But wouldn't it all work?
The difference is that if you happen to have a file name on Unix that is *not* encoded in the default locale, bytes Just Works, while Something Bad happens with unicode (mixing Python 3 and Python 2 terminology for clarity). Also, in Python the C/POSIX default locale implied a codec of 'ascii' which is quite risky nowadays, so using unicode meant always being conscious of encodings.
So folks are making an active choice to change their code to get some perceived (real?) performance benefit???
No, they're making a passive choice to not fix whut ain't broke nohow, but in Python 3 is spelled differently. It's the same order of change as "print stuff" (Python 2) to "print(stuff)" (Python 3), except that it's not as automatic. (Ie, where print is *always* a function call in Python 3, often in a Python 2 -> 3 port you're better off with str than bytes, especially before PEP 461 "% formatting for bytes".)
However, as I understand it, py3 string paths did NOT "just work" in place of py2 paths before surrogate pairs were introduced (when was that?)
I'm not sure what you're referring to. Python 2 unicode and Python 3 str have been capable of representing (for values of "representing" that require appropriate choice of I/O codecs) the entire repertoire of Unicode since version 1.6 [sic!]. I suppose you mean PEP 383 (implemented in Python 3.1), which added a pseudo-encoding for unencodable bytes, ie, the surrogateescape error handler. This was never a major consideration in practice, however, as you could always get basically the same effect with the 'latin-1' codec. That is, the surrogateescape handler is primarily of benefit to those who are already convinced that fully conformant Unicode is the way to go. It doesn't make a difference to those who prefer bytes.
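[Editor's note: a small self-contained illustration of the surrogateescape error handler mentioned above, not part of the original thread. Undecodable bytes survive a decode/encode round-trip as lone surrogates.]

```python
# PEP 383's surrogateescape error handler: a byte that is invalid in the
# chosen codec becomes a lone surrogate in the str, and encodes back
# to the original byte losslessly.
raw = b'caf\xe9'                     # latin-1 bytes, invalid as utf-8
s = raw.decode('utf-8', 'surrogateescape')
assert s == 'caf\udce9'              # 0xE9 smuggled through as U+DCE9
assert s.encode('utf-8', 'surrogateescape') == raw  # lossless round-trip
```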
What I'm getting at is whether there is anything other than inertia that keeps folks using bytes paths in py3 code? Maybe it wouldn't be THAT hard to get folks to make the switch: it's EASIER to port your code to py3 this way!
It's not. First, encoding awareness is real work. If you try to DTRT, you open yourself up to UnicodeErrors anywhere in your code where there's a Python/rest-of-world boundary. If you just use bytes, you may be producing garbage, but your program doesn't stop running, and you can always argue it's either your upstream's or your downstream's fault. I *personally* have always found the work to be worthwhile, as my work always involves "real" text processing, and frequently not in pure ASCII.

Second, there are a lot of low-level use cases where (1) efficiency matters and (2) all the processing actually done involves switching on byte values in the range 32-126. It makes sense to do that work on bytes, wouldn't you say?<wink/> And to make the switch cases easier to read, it's common practice to form (or contort) those bytes into human words. These cases include a lot of the familiar acronyms: SMTP, HTTP, DNS, VCS, VM (as in "bytecode interpreter"), ... and the projects are familiar: Twisted, Mercurial, ....

Bottom line: I'm with you! I think that "filenames are text" *should* be the default mode for Python programmers. But there are important use cases where it's sometimes more work to make that work than to make bytes work (on POSIX), and typically those cases also inherit largish, battle-tested code bases that assume a "bytes in, bytes through, bytes out" model. We can't deprecate "filenames as bytes" on POSIX yet, and if we want to encourage participation in projects that use that model by Windows-based programmers, we can't deprecate completely on Windows, either.
Summary for python-dev. This is the email I'm proposing to take over to the main mailing list to get some actual decisions made. As I don't agree with some of the possible recommendations, I want to make sure that they're represented fairly. I also want to summarise the background leading to why we should consider making a change here at all, rather than simply leaving it alone. There's a chance this will all make its way into a PEP, depending on how controversial the core team thinks this is.

Please let me know if you think I've misrepresented (or unfairly represented) any of the positions, or if you think I can simplify/clarify anything in here. Please don't treat this like a PEP review - it's just going to be an email to python-dev - but the more we can avoid having the discussions there we've already had here the better.

Cheers, Steve

---

Background
==========

File system paths are almost universally represented as text in some encoding determined by the file system. In Python, we expose these paths via a number of interfaces, such as the os and io modules. Paths may be passed in either direction across these interfaces, that is, from the filesystem to the application (for example, os.listdir()), or from the application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they are either passed through as a bytes blob or converted to/from str using sys.getfilesystemencoding(). The result of encoding a string with sys.getfilesystemencoding() is a blob of bytes in the native format for the default file system.

On Windows, the native format for the filesystem is utf-16-le. The recommended platform APIs for accessing the filesystem all accept and return text encoded in this format. However, prior to Windows NT (and possibly further back), the native format was a configurable machine option and a separate set of APIs existed to accept this format.
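[Editor's note: the str/bytes conversion contract described above is what the stdlib helpers os.fsencode() and os.fsdecode() implement; a minimal illustration:]

```python
import os

# os.fsencode/os.fsdecode convert using sys.getfilesystemencoding() plus
# the platform's error handler, so a path round-trips between str and bytes.
p = 'example.txt'
b = os.fsencode(p)            # str -> bytes blob for the filesystem
assert isinstance(b, bytes)
assert os.fsdecode(b) == p    # bytes -> str, round-trips losslessly
```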
The option (the "active code page") and these APIs (the "*A functions") still exist in recent versions of Windows for backwards compatibility, though new functionality often only has a utf-16-le API (the "*W functions"). In Python, we recommend using str as the default format on Windows because it can correctly round-trip all the characters representable in utf-16-le. Our support for bytes explicitly uses the *A functions and hence the encoding for the bytes is "whatever the active code page is". Since the active code page cannot represent all Unicode characters, the conversion of a path into bytes can lose information without warning. As a demonstration of this:
>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']
The Unicode character in the second call to glob is missing information. You can observe the same results in os.listdir() or any function that matches its result type to the parameter type.

Why is this a problem?
======================

While the obvious and correct answer is to just use str everywhere, it remains well known that on Linux and MacOS it is perfectly okay to use bytes when taking values from the filesystem and passing them back. Doing so also avoids the cost of decoding and reencoding, such that (theoretically), code like below should be faster because of the `b'.'`:
>>> for f in os.listdir(b'.'):
...     os.stat(f)
...
On Windows, if a filename exists that cannot be encoded with the active code page, you will receive an error from the above code. These errors are why in Python 3.3 the use of bytes paths on Windows was deprecated (listed in the What's New, but not clearly obvious in the documentation - more on this later). The above code produces multiple deprecation warnings in 3.3, 3.4 and 3.5 on Windows.

However, we still keep seeing libraries use bytes paths, which can cause unexpected issues on Windows. Given that the current approaches of quietly recommending that library developers either write their code twice (once for bytes and once for str) or use str exclusively are not working, we should consider alternative mitigations.

Proposals
=========

There are two dimensions here - the fix and the timing. We can basically choose any fix and any timing.

The main differences between the fixes are the balance between incorrect behaviour and backwards-incompatible behaviour. The main issue with respect to timing is whether or not we believe using bytes as paths on Windows was correctly deprecated in 3.3 and sufficiently advertised since to allow us to change the behaviour in 3.6.

Fixes
-----

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a meta-encoder that uses the active code page. In reality, our implementation uses the *A APIs and we don't explicitly decode bytes in order to pass them to the filesystem. This allows the OS to quietly replace invalid characters (the equivalent of 'mbcs:replace').

This proposal would remove all use of the *A APIs and only ever call the *W APIs. When paths are returned to Python as str, they will be decoded from utf-16-le. When paths are to be returned as bytes, we would transcode from utf-16-le to utf-8 using surrogatepass. Equally, when paths are provided as bytes, they are transcoded from utf-8 to utf-16-le and passed to the *W APIs.
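[Editor's note: an illustrative sketch, not from the patch itself, of the round-trip that Fix #1 relies on. The utf-8 + surrogatepass combination preserves any utf-16-le name, including lone surrogates, and keeps '\' visible as a single byte:]

```python
# Fix #1's conversion: filesystem text (native utf-16-le) <-> utf-8 bytes,
# with surrogatepass preserving even invalid surrogate code points.
name = 'dir\\test\uAB00.txt'
b = name.encode('utf-8', 'surrogatepass')
assert b.decode('utf-8', 'surrogatepass') == name   # lossless round-trip
assert b.split(b'\\') == [b'dir', b'test\xea\xac\x80.txt']  # '\' slicing works
# A lone surrogate (an invalid utf-16 pair) also survives:
odd = 'bad\udc80name'
assert odd.encode('utf-8', 'surrogatepass').decode('utf-8', 'surrogatepass') == odd
```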
The choice of utf-8 is to ensure the ability to round-trip, while also allowing basic manipulation of paths as bytes (basically, locating and slicing at '\' characters).

It is debated, but I believe this is not a backwards compatibility issue because:

* byte paths in Python are specified as being encoded by sys.getfilesystemencoding()
* byte paths on Windows have been deprecated for three versions

Unfortunately, the deprecation is not explicitly called out anywhere in the docs apart from the What's New page, so there is an argument that it shouldn't be counted despite the warnings in the interpreter. However, this is more directly addressed in the discussion of timing below. Equally, sys.getfilesystemencoding() documents the specific return values for various platforms, as well as that it is part of the protocol for using bytes to represent filesystem strings.

I believe both of these arguments are invalid, that the only code that will break as a result of this change is relying on deprecated functionality and not correctly following the encoding contract, and that the (probably noisy) breakage that will occur is less bad than the silent breakage that currently exists.

As far as implementation goes, there is already a patch for this at http://bugs.python.org/issue27781. In short, we update the path converter to decode bytes (path->narrow) to Unicode (path->wide) and remove all the code that would call *A APIs. In my patch I've changed path->narrow to a flag that indicates whether to convert back to bytes on return, and also to prevent compilation of code that tries to use ->narrow as a string on Windows (maybe that will get too annoying for contributors? good discussion for the tracker IMHO).

Fix #2: Do the mbcs decoding ourselves

This is essentially the same as fix #1, but instead of changing to utf-8 we keep mbcs as the encoding.
This approach will allow us to utilise new functionality that is only available as *W APIs, and also lets us be more strict about encoding/decoding to bytes. For example, rather than silently replacing Unicode characters with '?', we could warn or fail the operation, potentially modifying that behaviour with an environment variable or flag.

Compared to fix #1, this will enable some new functionality but will not fix any of the problems immediately. New runtime errors may cause some problems to be more obvious and lead to fixes, provided library maintainers are interested in supporting Windows and adding a separate code path to treat filesystem paths as strings.

Fix #3: Make bytes paths on Windows an error

By preventing the use of bytes paths on Windows completely we prevent users from hitting encoding issues. However, we do this at the expense of usability.

I don't have numbers of libraries that will simply fail on Windows if this "fix" is made, but given I've already had people directly email me and tell me about their problems we can safely assume it's non-zero. I'm really not a fan of this fix, because it doesn't actually make things better in a practical way, despite being more "pure".

Timing #1: Change it in 3.6

This timing assumes that we believe the deprecation of using bytes for paths in Python 3.3 was sufficiently well advertised that we can freely make changes in 3.6. A typical deprecation cycle would be two versions before removal (though we also often leave things in forever when they aren't fundamentally broken), so we have passed that point and theoretically can remove or change the functionality without breaking it.

In this case, we would announce in 3.6 that using bytes as paths on Windows is no longer deprecated, and that the encoding used is whatever is returned by sys.getfilesystemencoding().

Timing #2: Change it in 3.7

This timing assumes that the deprecation in 3.3 was valid, but acknowledges that it was not well publicised.
For 3.6, we aggressively make it known that only strings should be used to represent paths on Windows and bytes are invalid and going to change in 3.7. (It has been suggested that I could use a keynote at PyCon to publicise this, and while I'd totally accept a keynote, I'd hate to subject a crowd to just this issue for an hour :) ).

My concern with this approach is that there is no benefit to the change at all. If we aggressively publicise the fact that libraries that don't handle Unicode paths on Windows properly are using deprecated functionality and need to be fixed by 3.7 in order to avoid breaking (more precisely - continuing to be broken, but with a different error message), then we will alienate non-Windows developers further from the platform (net loss for the ecosystem) and convince some to switch to str everywhere (net gain for the ecosystem). The latter case removes the need to make any change in 3.7 at all, so we would really just be making noise about something that people haven't noticed and not necessarily going in and fixing anything.

Timing #3: Change it in 3.8

This timing assumes that the deprecation in 3.3 was not sufficient and we need to start a new deprecation cycle. This is strengthened by the fact that the deprecation announcement does not explicitly include the io module or the builtin open() function, and so some developers may believe that using bytes for paths with these is okay despite the os module being deprecated.

The one upside to this approach is that it would also allow us to change locale.getpreferredencoding() to utf-8 on Windows (to affect the default behaviour of open(..., 'r') ), which I don't believe is going to be possible without a new deprecation cycle. There is a strong argument that the following code should also round-trip regardless of platform:
>>> with open('list.txt', 'w') as f:
...     for i in os.listdir('.'):
...         print(i, file=f)
...
>>> with open('list.txt', 'r') as f:
...     files = list(f)
...
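[Editor's note: a sketch of why that round-trip can fail today - a filename may be representable by the filesystem but not by open()'s default text encoding. cp1252 stands in here for a narrow ANSI code page:]

```python
# A filename the filesystem can represent but a narrow text codec cannot:
# the print(i, file=f) step would hit this UnicodeEncodeError when f's
# encoding is an ANSI code page (cp1252 used as a stand-in).
name = 'test\uAB00.txt'
try:
    name.encode('cp1252')
    failed = False
except UnicodeEncodeError:
    failed = True
assert failed
```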
Currently, the default encoding for open() cannot represent all filenames that may be returned from listdir(). This may affect makefiles and configuration files that contain paths. Currently they will work correctly for paths that can be represented in the machine's active code page (though it should be noted that the *A APIs may be changed to use the OEM code page rather than the active code page, which would also break this case).

Possibly resolving both issues simultaneously is worth waiting for two more releases? I'm not convinced the change to getfilesystemencoding() needs to wait for getpreferredencoding() to also change, or that they necessarily need to match, but it would not be hugely surprising to see the changes bundled together.

I'll also note that there has been no discussion about changing getpreferredencoding() so far, though there have been a number of "+1" votes alongside some "+1 with significant concerns" votes. Changing the default encoding of the contents of data files is pretty scary, so I'm not in any rush to force it in.

Acknowledgements
================

Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for their significant contributions and willingness to engage, and to everyone else on python-ideas for contributing to the discussion.
On Fri, Aug 19, 2016 at 1:25 AM, Steve Dower <steve.dower@python.org> wrote:
>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']
The Unicode character in the second call to glob is missing information. You can observe the same results in os.listdir() or any function that matches its result type to the parameter type.
Apologies if this is just noise, but I'm a little confused by this. The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this? ChrisA
On Thu, Aug 18, 2016, at 11:29, Chris Angelico wrote:
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']
The Unicode character in the second call to glob is missing information.
Apologies if this is just noise, but I'm a little confused by this. The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this?
The unicode character is in the actual name of the actual file being matched. That the byte string returned by glob fails to represent that character in any encoding is the problem. Glob results don't exist in a vacuum, they're supposed to represent, and be usable to access, files that actually exist on the real filesystem.
On 18Aug2016 0829, Chris Angelico wrote:
The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this?
You're not the only one - I think this has been the most common misunderstanding. On Windows, the paths as stored in the filesystem are actually all text - more precisely, utf-16-le encoded bytes, represented as 16-bit character strings. Converting to an 8-bit character representation only exists for compatibility with code written for other platforms (either Linux, or much older versions of Windows). The operating system has one way to do the conversion to bytes, which Python currently uses, but since we control that transformation I'm proposing an alternative conversion that is more reliable than compatible (with Windows 3.1... shouldn't affect compatibility with code that properly handles multibyte encodings, which should include anything developed for Linux in the last decade or two). Does that help? I tried to keep the explanation short and focused :) Cheers, Steve
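[Editor's note: to make "utf-16-le encoded bytes, represented as 16-bit character strings" concrete, a minimal illustration:]

```python
# On disk, each character occupies a 16-bit little-endian code unit,
# even a plain ASCII letter.
assert 'A'.encode('utf-16-le') == b'A\x00'
assert 'A\uAB00.txt'.encode('utf-16-le') == b'A\x00\x00\xab.\x00t\x00x\x00t\x00'
```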
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower@python.org> wrote:
On 18Aug2016 0829, Chris Angelico wrote:
The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this?
You're not the only one - I think this has been the most common misunderstanding.
On Windows, the paths as stored in the filesystem are actually all text - more precisely, utf-16-le encoded bytes, represented as 16-bit character strings.
Converting to an 8-bit character representation only exists for compatibility with code written for other platforms (either Linux, or much older versions of Windows). The operating system has one way to do the conversion to bytes, which Python currently uses, but since we control that transformation I'm proposing an alternative conversion that is more reliable than compatible (with Windows 3.1... shouldn't affect compatibility with code that properly handles multibyte encodings, which should include anything developed for Linux in the last decade or two).
Does that help? I tried to keep the explanation short and focused :)
Ah, I think I see what you mean. There's a slight ambiguity in the word "missing" here.

1) The Unicode character in the result lacks some of the information it should have
2) The Unicode character in the file name is information that has now been lost.

My reading was the first, but AIUI you actually meant the second. If so, I'd be inclined to reword it very slightly, eg:

"The Unicode character in the second call to glob is now lost information."

Is that a correct interpretation?

ChrisA
On 18Aug2016 0900, Chris Angelico wrote:
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower@python.org> wrote:
On 18Aug2016 0829, Chris Angelico wrote:
The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this?
You're not the only one - I think this has been the most common misunderstanding.
On Windows, the paths as stored in the filesystem are actually all text - more precisely, utf-16-le encoded bytes, represented as 16-bit character strings.
Converting to an 8-bit character representation only exists for compatibility with code written for other platforms (either Linux, or much older versions of Windows). The operating system has one way to do the conversion to bytes, which Python currently uses, but since we control that transformation I'm proposing an alternative conversion that is more reliable than compatible (with Windows 3.1... shouldn't affect compatibility with code that properly handles multibyte encodings, which should include anything developed for Linux in the last decade or two).
Does that help? I tried to keep the explanation short and focused :)
Ah, I think I see what you mean. There's a slight ambiguity in the word "missing" here.
1) The Unicode character in the result lacks some of the information it should have
2) The Unicode character in the file name is information that has now been lost.
My reading was the first, but AIUI you actually meant the second. If so, I'd be inclined to reword it very slightly, eg:
"The Unicode character in the second call to glob is now lost information."
Is that a correct interpretation?
I think so, though I find the wording a little awkward (and on rereading, my original wording was pretty bad). How about: "The second call to glob has replaced the Unicode character with '?', which means the actual filename cannot be recovered and the path is no longer valid." Cheers, Steve
On Fri, Aug 19, 2016 at 2:07 AM, Steve Dower <steve.dower@python.org> wrote:
I think so, though I find the wording a little awkward (and on rereading, my original wording was pretty bad). How about:
"The second call to glob has replaced the Unicode character with '?', which means the actual filename cannot be recovered and the path is no longer valid."
I like that. Very clear and precise, without losing too much concision. Thank you for explaining, as Cameron Baum often says. ChrisA
On Thu, Aug 18, 2016 at 4:07 PM, Steve Dower <steve.dower@python.org> wrote:
On 18Aug2016 0900, Chris Angelico wrote:
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower@python.org> wrote:
On 18Aug2016 0829, Chris Angelico wrote:
The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this?
You're not the only one - I think this has been the most common misunderstanding.
On Windows, the paths as stored in the filesystem are actually all text - more precisely, utf-16-le encoded bytes, represented as 16-bit character strings.
Converting to an 8-bit character representation only exists for compatibility with code written for other platforms (either Linux, or much older versions of Windows). The operating system has one way to do the conversion to bytes, which Python currently uses, but since we control that transformation I'm proposing an alternative conversion that is more reliable than compatible (with Windows 3.1... shouldn't affect compatibility with code that properly handles multibyte encodings, which should include anything developed for Linux in the last decade or two).
Does that help? I tried to keep the explanation short and focused :)
Ah, I think I see what you mean. There's a slight ambiguity in the word "missing" here.
1) The Unicode character in the result lacks some of the information it should have
2) The Unicode character in the file name is information that has now been lost.
My reading was the first, but AIUI you actually meant the second. If so, I'd be inclined to reword it very slightly, eg:
"The Unicode character in the second call to glob is now lost information."
Is that a correct interpretation?
I think so, though I find the wording a little awkward (and on rereading, my original wording was pretty bad). How about:
"The second call to glob has replaced the Unicode character with '?', which means the actual filename cannot be recovered and the path is no longer valid."
They're all just characters in the context of Unicode, so I think it's clearest to use the character code, e.g.: The second call to glob has replaced the U+AB00 character with '?', which means ...
On Fri, Aug 19, 2016 at 2:39 AM, eryk sun <eryksun@gmail.com> wrote:
They're all just characters in the context of Unicode, so I think it's clearest to use the character code, e.g.:
The second call to glob has replaced the U+AB00 character with '?', which means ...
Technically the character has been replaced with the byte value 63, although at this point, we're getting into dangerous areas of bytes being interpreted in one way or another. ChrisA
On Thu, Aug 18, 2016 at 4:44 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Fri, Aug 19, 2016 at 2:39 AM, eryk sun <eryksun@gmail.com> wrote:
They're all just characters in the context of Unicode, so I think it's clearest to use the character code, e.g.:
The second call to glob has replaced the U+AB00 character with '?', which means ...
Technically the character has been replaced with the byte value 63, although at this point, we're getting into dangerous areas of bytes being interpreted in one way or another.
Windows NLS codepages are all supersets of ASCII (no EBCDIC to worry about), and the default character when encoding is always b"?". The default Unicode character when decoding is also almost always "?", except Japanese uses U+30FB.
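[Editor's note: the b"?" default can be checked from Python directly, using cp1252 as a stand-in for an arbitrary ANSI code page:]

```python
# Encoding an unrepresentable character with 'replace' substitutes b'?'
# (byte value 63), matching the default substitution described above.
assert '\uAB00'.encode('cp1252', 'replace') == b'?'
assert b'?'[0] == 63
```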
On 8/18/2016 11:25 AM, Steve Dower wrote:
In this case, we would announce in 3.6 that using bytes as paths on Windows is no longer deprecated,
My understanding is that the first 2 fixes refine the deprecation rather than reversing it. And #3 simply applies it. -- Terry Jan Reedy
On 18Aug2016 1036, Terry Reedy wrote:
On 8/18/2016 11:25 AM, Steve Dower wrote:
In this case, we would announce in 3.6 that using bytes as paths on Windows is no longer deprecated,
My understanding is that the first 2 fixes refine the deprecation rather than reversing it. And #3 simply applies it.
#3 certainly just applies the deprecation. As for the first two, I don't see any reason to deprecate the functionality once the issues are resolved. If using utf-8 encoded bytes is going to work fine in all the same cases as using str, why discourage it?
On 8/18/2016 1:39 PM, Steve Dower wrote:
On 18Aug2016 1036, Terry Reedy wrote:
On 8/18/2016 11:25 AM, Steve Dower wrote:
In this case, we would announce in 3.6 that using bytes as paths on Windows is no longer deprecated,
My understanding is that the first 2 fixes refine the deprecation rather than reversing it. And #3 simply applies it.
#3 certainly just applies the deprecation.
As for the first two, I don't see any reason to deprecate the functionality once the issues are resolved. If using utf-8 encoded bytes is going to work fine in all the same cases as using str, why discourage it?
As I understand it, you are still proposing to remove the use of bytes encoded with anything other than utf-8 (and the corresponding *A internal functions) and in particular stop lossy path transformations. Am I wrong? -- Terry Jan Reedy
On Thu, Aug 18, 2016 at 3:25 PM, Steve Dower <steve.dower@python.org> wrote:
allow us to change locale.getpreferredencoding() to utf-8 on Windows
_bootlocale.getpreferredencoding would need to be hard coded to return 'utf-8' on Windows. _locale._getdefaultlocale() itself shouldn't return 'utf-8' as the encoding because the CRT doesn't allow it as a locale encoding. site.aliasmbcs() uses getpreferredencoding, so it will need to be modified.

The codecs module could add get_acp and get_oemcp functions based on GetACP and GetOEMCP, returning for example 'cp1252' and 'cp850'. Then aliasmbcs could call get_acp. Adding get_oemcp would also help with decoding output from subprocess.Popen. There's been discussion about adding encoding and errors options to Popen, and what the default should be. When writing to a pipe or file, some programs use OEM, some use ANSI, some use the console codepage if available, and far fewer use Unicode encodings. Obviously it's better to specify the encoding in each case if you know it.

Regarding the locale module, how about modernizing _locale._getdefaultlocale to return the Windows locale name [1] from GetUserDefaultLocaleName? For example, it could return a tuple such as ('en-UK', None) and ('uz-Latn-UZ', None) -- always with the encoding set to None. The CRT accepts the new locale names, but it isn't quite up to speed. It still sets a legacy locale when the locale string is empty. In this case the high-level setlocale could call _getdefaultlocale. Also _parse_localename, which is called by getlocale, needs to return a tuple with the encoding as None. Currently it raises a ValueError for Windows locale names as defined by [1].

[1]: https://msdn.microsoft.com/en-us/library/dd373814
2016-08-16 17:56 GMT+02:00 Steve Dower <steve.dower@python.org>:
2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
I think that you missed an important issue (or "use case") which is called the "Makefile problem" by Mercurial developers: https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem....

I already explained it before, but maybe you misunderstood or just missed it, so here is a more concrete example. A runner.py script produces a bytes filename and sends it to a second read_file.py script through stdin/stdout. The read_file.py script opens the file using open(filename). The read_file.py script is run by Python 2 which works naturally on bytes. The question is how the runner.py produces (encodes) the filename.

runner.py (script run by Python 3.7):
---
import os, sys, subprocess, tempfile

filename = 'h\xe9.txt'
content = b'foo bar'

print("filename unicode: %a" % filename)

root = os.path.realpath(os.path.dirname(__file__))
script = os.path.join(root, 'read_file.py')

old_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmpdir:
    os.chdir(tmpdir)

    with open(filename, 'wb') as fp:
        fp.write(content)

    filenameb = os.listdir(b'.')[0]
    # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
    # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
    print("filename bytes: %a" % filenameb)

    proc = subprocess.Popen(['py', '-2', script],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    stdout = proc.communicate(filenameb)[0]
    print("File content: %a" % stdout)

    os.chdir(old_cwd)
---

read_file.py (run by Python 2):
---
import sys

filename = sys.stdin.read()

# Python 2 calls the Windows C open() function
# which expects a filename encoded to the ANSI code page
with open(filename) as fp:
    content = fp.read()

sys.stdout.write(content)
sys.stdout.flush()
---

read_file.py only works if the non-ASCII filename is encoded to the ANSI code page. The question is how you expect developers should handle such use case. For example, are developers responsible to transcode communicate() data (input and outputs) manually?
That's why I keep repeating that the ANSI code page is the best *default* encoding, because it is the encoding expected by other applications. I know that the ANSI code page is usually limited and has caused various painful issues when handling non-ASCII data, but it's the status quo if you really want to handle data as bytes...

Sorry, I didn't read all the emails of this long thread, so maybe I missed your answer to this issue.

Victor
On 16Aug2016 1603, Victor Stinner wrote:
2016-08-16 17:56 GMT+02:00 Steve Dower <steve.dower@python.org>:
2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
I think that you missed an important issue (or "use case") which is called the "Makefile problem" by Mercurial developers: https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem....
I already explained it before, but maybe you misunderstood or just missed it, so here is a more concrete example.
I guess I misunderstood. The concrete example really helps, thank you. The problem here is that there is an application boundary without a defined encoding, right where you put the comment.
filenameb = os.listdir(b'.')[0]
# Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
# what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
print("filename bytes: %a" % filenameb)

proc = subprocess.Popen(['py', '-2', script],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
stdout = proc.communicate(filenameb)[0]
print("File content: %a" % stdout)
If you are defining the encoding as 'mbcs', then you need to check that sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode. Alternatively, since this script is the "new" code, you would use `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly determined that mbcs is the encoding for the later transfer.

Essentially, the problem is that this code is relying on a certain non-guaranteed behaviour of a deprecated API, where using sys.getfilesystemencoding() as documented would have prevented any issue (see https://docs.python.org/3/library/os.html#file-names-command-line-arguments-...). In one of the emails I think you missed, I called this out as the only case where code will break with a change to sys.getfilesystemencoding().

So yes, breaking existing code is something I would never do lightly. However, I'm very much of the opinion that the only code that will break is code that is already broken (or at least fragile) and that nobody is forced to take a major upgrade to Python or should necessarily expect 100% compatibility between major versions.

Cheers,
Steve
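[Editorial aside: a minimal sketch of the check-and-reencode step described above. The codec names are illustrative: 'cp1252' stands in for the machine's ANSI code page, since the 'mbcs' codec itself only exists on Windows, and the helper name is invented for this example.]

```python
import sys

def reencode_for_legacy_consumer(path_bytes, fs_codec=None,
                                 target_codec='cp1252'):
    """Re-encode a bytes path from the filesystem encoding to the
    encoding a legacy consumer expects (hypothetical helper).

    'cp1252' stands in for the ANSI code page here; on Windows the
    target would be 'mbcs'.
    """
    if fs_codec is None:
        fs_codec = sys.getfilesystemencoding()
    if fs_codec == target_codec:
        return path_bytes  # already in the expected encoding
    return path_bytes.decode(fs_codec).encode(target_codec)

# A UTF-8 bytes path re-encoded for an ANSI-code-page consumer:
ansi_name = reencode_for_legacy_consumer('h\xe9.txt'.encode('utf-8'),
                                         fs_codec='utf-8')
# -> b'h\xe9.txt'
```

This is the transcoding step that becomes the caller's responsibility once the filesystem encoding and the channel encoding no longer happen to coincide.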
2016-08-17 1:27 GMT+02:00 Steve Dower <steve.dower@python.org>:
filenameb = os.listdir(b'.')[0]
# Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
# what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
print("filename bytes: %a" % filenameb)

proc = subprocess.Popen(['py', '-2', script],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
stdout = proc.communicate(filenameb)[0]
print("File content: %a" % stdout)
If you are defining the encoding as 'mbcs', then you need to check that sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.
Sorry, I don't understand. What do you mean by "defining an encoding"? It's not possible to modify sys.getfilesystemencoding() in Python. What does "reencode" mean? I'm lost.
Alternatively, since this script is the "new" code, you would use `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly determined that mbcs is the encoding for the later transfer.
My example is not new code. It is a very simplified script to explain the issue that can occur in a large code base which *currently* works well on Python 2 and Python 3 in the common case (only handling data encodable to the ANSI code page).
Essentially, the problem is that this code is relying on a certain non-guaranteed behaviour of a deprecated API, where using sys.getfilesystemencoding() as documented would have prevented any issue (see https://docs.python.org/3/library/os.html#file-names-command-line-arguments-...).
sys.getfilesystemencoding() is used in applications which store data as Unicode, but we are talking about applications storing data as bytes, no?
So yes, breaking existing code is something I would never do lightly. However, I'm very much of the opinion that the only code that will break is code that is already broken (or at least fragile) and that nobody is forced to take a major upgrade to Python or should necessarily expect 100% compatibility between major versions.
Well, it's somehow the same issue that we had in Python 2: applications work in most cases, but start to fail with non-ASCII characters, or maybe only in some cases.

In this case, the ANSI code page is fine if all data can be encoded to the ANSI code page. You start to get troubles when you start to use characters not encodable to your ANSI code page. Last time I checked, Microsoft Visual Studio behaved badly (has bugs) with such filenames. It's the same for many applications. So it's not like Windows applications already handle this case very well. So let me call it a corner case.

I'm not sure that it's worth it to explicitly break the Python backward compatibility on Windows for such corner case, especially because it's already possible to fix applications by starting to use Unicode everywhere (which would likely fix more issues than expected as a side effect). It's still unclear to me if it's simpler to modify an application using bytes to start using Unicode (for filenames), or if your proposition requires less changes.

My main concern is the "makefile issue" which requires more complex code to transcode data between UTF-8 and the ANSI code page. To me, it's like we are going back to Python 2 where no data had a known encoding and mojibake was the default. If you manipulate strings in two encodings, it's likely to make mistakes and concatenate two strings encoded to two different encodings (=> mojibake).

Victor
On 16Aug2016 1650, Victor Stinner wrote:
2016-08-17 1:27 GMT+02:00 Steve Dower <steve.dower@python.org>:
filenameb = os.listdir(b'.')[0]
# Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
# what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
print("filename bytes: %a" % filenameb)

proc = subprocess.Popen(['py', '-2', script],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
stdout = proc.communicate(filenameb)[0]
print("File content: %a" % stdout)
If you are defining the encoding as 'mbcs', then you need to check that sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.
Sorry, I don't understand. What do you mean by "defining an encoding"? It's not possible to modify sys.getfilesystemencoding() in Python. What does "reencode" mean? I'm lost.
You are transferring text between two applications without specifying what the encoding is. sys.getfilesystemencoding() does not apply to proc.communicate() - you can use your choice of encoding for communicating between two processes.
Alternatively, since this script is the "new" code, you would use `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly determined that mbcs is the encoding for the later transfer.
My example is not new code. It is a very simplified script to explain the issue that can occur in a large code base which *currently* works well on Python 2 and Python 3 in the common case (only handling data encodable to the ANSI code page).
If you are planning to run it with Python 3.6, then I'd argue it's "new" code. When you don't want anything to change, you certainly don't change the major version of your runtime.
Essentially, the problem is that this code is relying on a certain non-guaranteed behaviour of a deprecated API, where using sys.getfilesystemencoding() as documented would have prevented any issue (see https://docs.python.org/3/library/os.html#file-names-command-line-arguments-...).
sys.getfilesystemencoding() is used in applications which store data as Unicode, but we are talking about applications storing data as bytes, no?
No, we're talking about how Python code communicates with the file system. Applications can store their data however they like, but when they pass it to a filesystem function they need to pass it as str, or as bytes encoded with sys.getfilesystemencoding() (this has always been the case).
So yes, breaking existing code is something I would never do lightly. However, I'm very much of the opinion that the only code that will break is code that is already broken (or at least fragile) and that nobody is forced to take a major upgrade to Python or should necessarily expect 100% compatibility between major versions.
Well, it's somehow the same issue that we had in Python 2: applications work in most cases, but start to fail with non-ASCII characters, or maybe only in some cases.
In this case, the ANSI code page is fine if all data can be encoded to the ANSI code page. You start to get troubles when you start to use characters not encodable to your ANSI code page. Last time I checked, Microsoft Visual Studio behaved badly (has bugs) with such filenames. It's the same for many applications. So it's not like Windows applications already handle this case very well. So let me call it a corner case.
The existence of bugs in other applications is not a good reason to help people create new bugs.
I'm not sure that it's worth it to explicitly break the Python backward compatibility on Windows for such corner case, especially because it's already possible to fix applications by starting to use Unicode everywhere (which would likely fix more issues than expected as a side effect).
It's still unclear to me if it's simpler to modify an application using bytes to start using Unicode (for filenames), or if your proposition requires less changes.
My proposition requires less changes *when you target multiple platforms and would prefer to use bytes*. It allows the below code to be written as either branch without losing the ability to round-trip whatever filename happens to be returned:

if os.name == 'nt':
    f = open(os.listdir('.')[-1])
else:
    f = open(os.listdir(b'.')[-1])

If you choose just the first branch (use str for paths), then you do get a better result. However, we have been telling people to do that since 3.0 (and made it easier in 3.2 IIRC) and it's now 3.5 and they are still complaining about not getting to use bytes for paths. So rather than have people say "Windows support is too hard", this change enables the second branch to be used on all platforms.
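[Editorial aside: for code that genuinely must round-trip paths as bytes, the documented helpers are os.fsencode() and os.fsdecode(), which always use sys.getfilesystemencoding() - so the same call works whatever that encoding ends up being. A minimal sketch, assuming the running interpreter's filesystem encoding can represent the name:]

```python
import os

# os.fsencode()/os.fsdecode() convert between str and bytes paths
# using sys.getfilesystemencoding(), on every platform, so this
# round-trip does not depend on which encoding is configured.
name = 'h\xe9.txt'
as_bytes = os.fsencode(name)
round_tripped = os.fsdecode(as_bytes)
assert round_tripped == name
```

Under the proposal, os.fsencode() on Windows would simply start producing UTF-8 bytes instead of ANSI-code-page bytes, and this round-trip would keep working unchanged.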
My main concern is the "makefile issue" which requires more complex code to transcode data between UTF-8 and ANSI code page. To me, it's like we are going back to Python 2 where no data had known encoding and mojibake was the default. If you manipulate strings in two encodings, it's likely to make mistakes and concatenate two strings encoded to two different encodings (=> mojibake).
Your makefile example is going back to Python 2, as it has no known encoding. If you want to associate an encoding with bytes, you decode it to text or you explicitly specify what the encoding should be. Your own example makes assumptions about what encoding the bytes have, which is why it has a bug.

Cheers,
Steve
On 2016-08-16 17:14, Steve Dower wrote:
The existence of bugs in other applications is not a good reason to help people create new bugs.
I haven't been following all the details in this thread, but isn't the whole purpose of this proposed change to accommodate code (apparently on Linux?) that is buggy in that it assumes it can use bytes for paths without knowing the encoding? It seems like from one perspective allowing bytes in paths is just helping to accommodate a certain very widespread class of bugs.

--
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail."
--author unknown
On 16Aug2016 1915, Brendan Barnwell wrote:
On 2016-08-16 17:14, Steve Dower wrote:
The existence of bugs in other applications is not a good reason to help people create new bugs.
I haven't been following all the details in this thread, but isn't the whole purpose of this proposed change to accommodate code (apparently on Linux?) that is buggy in that it assumes it can use bytes for paths without knowing the encoding? It seems like from one perspective allowing bytes in paths is just helping to accommodate a certain very widespread class of bugs.
Using bytes on Linux (in Python) is incorrect but works reliably, while using bytes on Windows is incorrect and unreliable. This change makes it incorrect and reliable on both platforms.

I said at the start the correct alternative would be to actually force all developers to use str for paths everywhere. That seems infeasible, so I'm trying to at least improve the situation for Windows users who are running code written by Linux developers. Hence there are tradeoffs, rather than perfection.

(Also, you took my quote out of context - it was referring to the fact that non-Python developers sometimes fail to get path encoding correct too. But your question was fair.)

Cheers,
Steve
I've just created http://bugs.python.org/issue27781 with a patch removing use of the *A API from posixmodule.c and changing the default FS encoding to utf-8.

Since we're still discussing whether the encoding should be utf-8 or something else, let's keep that here. But if you want to see how the changes would look, feel free to check out the patch and comment on the issue. When we reach some agreement here I'll try and summarize the points of view on the issue so we have a record there.

Cheers,
Steve
On Tue, Aug 16, 2016 at 3:56 PM, Steve Dower <steve.dower@python.org> wrote:
2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
Internal filesystem details don't directly affect this issue, except for how each filesystem handles invalid surrogates in names passed to functions in the wide-character API. Some filesystems that are available on Windows do reject a filename that has an invalid surrogate, so I think any program that attempts to create such malformed names is already broken. For example, with NTFS I can create a file named "\ud800b\ud800a\ud800d", but trying this in a VirtualBox shared folder fails because the VBoxSF filesystem can't transcode the name to its internal UTF-8 encoding. Thus I don't think supporting invalid surrogates should be a deciding factor in favor of UTF-16, which I think is an impractical choice.

Bytes coming from files, databases, and the network are likely to be either UTF-8 or some legacy encoding, so the practical choice is between ANSI/OEM and UTF-8. The reliable choice is UTF-8.

Using UTF-8 for bytes paths can be adopted at first in 3.6 as an option that gets enabled via an environment variable. If it's not enabled or is explicitly disabled, show a visible warning (i.e. not requiring -Wall) that legacy bytes paths are deprecated. In 3.7 UTF-8 can become the default, but the same environment variable should allow opting out to use the legacy encoding. The infrastructure put in place to support this should be able to work either way.

Victor, I haven't checked Steve's patch yet in issue 27781, but making this change should largely simplify the Windows support code in many cases, as the bytes path conversion can be centralized, and relatively few functions return strings that need to be encoded back as bytes. posixmodule.c will no longer need separate code paths that call *A functions, e.g.: CreateFileA, CreateDirectoryA, CreateHardLinkA, CreateSymbolicLinkA, DeleteFileA, RemoveDirectoryA, FindFirstFileA, MoveFileExA, GetFileAttributesA, GetFileAttributesExA, SetFileAttributesA, GetCurrentDirectoryA, SetCurrentDirectoryA, SetEnvironmentVariableA, ShellExecuteA
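[Editorial aside: a short demonstration of why such names are malformed. A lone (unpaired) surrogate is representable in a Python str, but it is not valid Unicode, so strict encoding to any UTF rejects it; only the 'surrogatepass' error handler lets it through:]

```python
# Lone surrogates are rejected by strict UTF encoders.
name = '\ud800b\ud800a\ud800d'
rejected = []
for codec in ('utf-8', 'utf-16-le'):
    try:
        name.encode(codec)
    except UnicodeEncodeError:
        rejected.append(codec)
assert rejected == ['utf-8', 'utf-16-le']

# 'surrogatepass' round-trips them anyway, which is how such names
# can exist on filesystems that don't validate surrogate pairing.
raw = name.encode('utf-16-le', 'surrogatepass')
assert raw.decode('utf-16-le', 'surrogatepass') == name
```

This is why a filesystem that transcodes names to UTF-8 internally (like VBoxSF above) has no choice but to reject them.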
On 2016-08-16 16:56, Steve Dower wrote:
I just want to clearly address two points, since I feel like multiple posts have been unclear on them.
1. The bytes API was deprecated in 3.3 and it is listed in https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is an unfortunate oversight, but it was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.
2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
The choices are:
* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
Currently we have the second option.
My preference is the fourth option, as it will cause the least breakage of existing code and enable the most amount of code to just work in the presence of non-ACP characters.
The fifth option is the best for round-tripping within Windows APIs.
The only code that will break with any change is code that was using an already deprecated API. Code that correctly uses str to represent "encoding agnostic text" is unaffected.
If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
Could we still call it 'mbcs', but use 'surrogateescape'?
On Thu, Aug 18, 2016, at 13:18, MRAB wrote:
If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
Could we still call it 'mbcs', but use 'surrogateescape'?
Er, this discussion is about converting *from* unicode (including arbitrary but usually valid characters) *to* bytes.
On 18Aug2016 1018, MRAB wrote:
Could we still call it 'mbcs', but use 'surrogateescape'?
surrogateescape is used for escaping undecodable values when you want to represent arbitrary bytes in Unicode. It's the wrong direction for this situation - we are starting with valid Unicode and encoding to bytes (for the convenience of the Python developer who wants to use bytes everywhere).

Bytes correctly encoded under mbcs can always be correctly decoded to Unicode ('correctly' implies that they were encoded with the same configuration as the machine doing the decoding - mbcs changes from machine to machine). So there's nothing to escape from mbcs->Unicode, and we don't control the definition of Unicode->mbcs well enough to be able to invent an escaping scheme while remaining compatible with the operating system's interpretation of mbcs (CP_ACP).

(One way to look at the utf-8 proposal is saying "we will escape arbitrary Unicode characters within Python bytes strings and decode them at the Python-OS boundary". The main concern about this is the backwards compatibility issues around people taking arbitrarily encoded bytes and sharing them without including the encoding. Previously that would work on a subset of machines without Unicode support, but this change would only make it work within Python 3.6 and later. Hence the discussion about whether this whole thing was deprecated already or not.)

Cheers,
Steve
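[Editorial aside: to make the direction concrete - surrogateescape smuggles undecodable *bytes* through str and restores them on encode, which is the opposite of what is needed when starting from valid text:]

```python
# surrogateescape maps each undecodable byte to a lone surrogate on
# decode, then maps those surrogates back to the original bytes on
# encode: a bytes -> str -> bytes round-trip.
raw = b'caf\xe9'                                # not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')   # -> 'caf\udce9'
assert text == 'caf\udce9'
assert text.encode('utf-8', 'surrogateescape') == raw
```

Starting from valid Unicode, there is nothing for surrogateescape to escape, which is the point being made above.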
Just to make sure this is clear, the Pragmatic logic is thus:

* There are more *nix-centric developers in the Python ecosystem than Windows-centric (or even Windows-agnostic) developers.
* The bytes path approach works fine on *nix systems.
* Whatever might be Right and Just -- the reality is that a number of projects, including important and widely used libraries and frameworks, use the bytes API for working with filenames and paths, etc.

Therefore, there is a lot of code that does not work right on Windows. Currently, to get it to work right on Windows, you need to write Windows-specific code, which many folks don't want or know how to do (or just can't support one way or the other).

So the Solution is to either:

(A) get everyone to use Unicode "properly", which will work on all platforms (but only on py3.5 and above?)

or

(B) kludge some *nix-compatible support for byte paths into Windows, that will work at least much of the time.

It's clear (to me at least) that (A) is the "Right Thing", but real world experience has shown that it's unlikely to happen any time soon. Practicality beats Purity and all that -- this is a judgment call.

Have I got that right?

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On 16.08.2016 18:06, Chris Barker wrote:
It's clear (to me at least) that (A) it the "Right Thing", but real world experience has shown that it's unlikely to happen any time soon.
Practicality beats Purity and all that -- this is a judgment call.
Maybe, but even when it takes a lot of time to get it right, I always prefer the "Right Thing". My past experience taught me that everything will always come back to the "Right Thing", even if only partly, as it is *surprise* the "Right Thing" (TM).

Question is: are we in a hurry? Has somebody too little time to wait for the "Right Thing" to happen?

Sven
On 16Aug2016 1006, Sven R. Kunze wrote:
Question is: are we in a hurry? Has somebody too little time to wait for the "Right Thing" to happen?
Not really in a hurry. It's just that I decided to attack a few of the encoding issues on Windows and this was one of them.

Ideally I'd want to get the change in for 3.6.0b1 so there's plenty of testing time. But we've been waiting many years for this already so I guess a few more won't hurt. The current situation of making Linux developers write different path handling code for Windows vs Linux (or just use str for paths) is painful for some, but not as bad as the other issues I want to fix.

Cheers,
Steve
On 16.08.2016 19:44, Steve Dower wrote:
On 16Aug2016 1006, Sven R. Kunze wrote:
Question is: are we in a hurry? Has somebody too little time to wait for the "Right Thing" to happen?
Not really in a hurry. It's just that I decided to attack a few of the encoding issues on Windows and this was one of them.
Ideally I'd want to get the change in for 3.6.0b1 so there's plenty of testing time. But we've been waiting many years for this already so I guess a few more won't hurt. The current situation of making Linux developers write different path handling code for Windows vs Linux (or just use str for paths) is painful for some, but not as bad as the other issues I want to fix.
I assume one overall goal will be Windows and Linux programs handling paths the same way which I personally find a very good idea. And as long as such long-term goals are properly communicated, people are educated the right way and official deprecation phases are in place, everything is good, I guess. :) Sven
On 17 August 2016 at 02:06, Chris Barker <chris.barker@noaa.gov> wrote:
Just to make sure this is clear, the Pragmatic logic is thus:
* There are more *nix-centric developers in the Python ecosystem than Windows-centric (or even Windows-agnostic) developers.
* The bytes path approach works fine on *nix systems.
For the given value of "works fine" that is "works fine, except when it doesn't, and then you end up with mojibake".
* Whatever might be Right and Just -- the reality is that a number of projects, including important and widely used libraries and frameworks, use the bytes API for working with filenames and paths, etc.
Therefore, there is a lot of code that does not work right on Windows.
Currently, to get it to work right on Windows, you need to write Windows specific code, which many folks don't want or know how to do (or just can't support one way or the other).
So the Solution is to either:
(A) get everyone to use Unicode "properly", which will work on all platforms (but only on py3.5 and above?)
or
(B) kludge some *nix-compatible support for byte paths into Windows, that will work at least much of the time.
It's clear (to me at least) that (A) is the "Right Thing", but real world experience has shown that it's unlikely to happen any time soon.
Practicality beats Purity and all that -- this is a judgment call.
Have I got that right?
Yep, pretty much.

Based on Stephen Turnbull's concerns, I wonder if we could make a whitelist of universal encodings that Python-on-Windows will use in preference to UTF-8 if they're configured as the current code page. If we accepted GB18030, GB2312, Shift-JIS, and ISO-2022-* as overrides, then problems would be significantly less likely.

Another alternative would be to apply a similar solution as we do on Linux with regards to the "surrogateescape" error handler: there are some interfaces (like the standard streams) where we only enable that error handler specifically if the preferred encoding is reported as ASCII. In 2016, we're *very* skeptical about any properly configured system actually being ASCII-only (rather than that value showing up because the POSIX standards mandate it as the default), so we don't really believe the OS when it tells us that.

The equivalent for Windows would be to disbelieve the configured code page only when it was reported as "mbcs" - for folks that had configured their system to use something other than the default, Python would believe them, just as we do on Linux.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 17Aug2016 0901, Nick Coghlan wrote:
On 17 August 2016 at 02:06, Chris Barker <chris.barker@noaa.gov> wrote:
So the Solution is to either:
(A) get everyone to use Unicode "properly", which will work on all platforms (but only on py3.5 and above?)
or
(B) kludge some *nix-compatible support for byte paths into Windows, that will work at least much of the time.
It's clear (to me at least) that (A) is the "Right Thing", but real world experience has shown that it's unlikely to happen any time soon.
Practicality beats Purity and all that -- this is a judgment call.
Have I got that right?
Yep, pretty much. Based on Stephen Turnbull's concerns, I wonder if we could make a whitelist of universal encodings that Python-on-Windows will use in preference to UTF-8 if they're configured as the current code page. If we accepted GB18030, GB2312, Shift-JIS, and ISO-2022-* as overrides, then problems would be significantly less likely.
Another alternative would be to apply a similar solution as we do on Linux with regards to the "surrogateescape" error handler: there are some interfaces (like the standard streams) where we only enable that error handler specifically if the preferred encoding is reported as ASCII. In 2016, we're *very* skeptical about any properly configured system actually being ASCII-only (rather than that value showing up because the POSIX standards mandate it as the default), so we don't really believe the OS when it tells us that.
The equivalent for Windows would be to disbelieve the configured code page only when it was reported as "mbcs" - for folks that had configured their system to use something other than the default, Python would believe them, just as we do on Linux.
The problem here is that "mbcs" is not configurable - it's a meta-encoder that uses whatever is configured as the "language (system locale) to use when displaying text in programs that do not support Unicode" (quote from the dialog where administrators can configure this). So there's nothing to disbelieve here.

And even on machines where the current code page is "reliable", UTF-16 is still the actual encoding, which means UTF-8 is still a better choice for representing the path as a blob of bytes.

Currently we have inconsistent encoding between different Windows machines and could either remove that inconsistency completely or simply reduce it for (approx.) English speakers. I would rather go to an extreme here - either make it consistent regardless of user configuration, or make it so broken that nobody can use it at all.

(And note that the correct way to support *some* other FS encodings would be to change the return value from sys.getfilesystemencoding(), which breaks people who currently ignore that just as badly as changing it to utf-8 would.)

Cheers,
Steve
Hmm, doesn't seem to be explicitly listed as a deprecation, though discussion from around that time makes it clear that everyone thought it was.

I also found this proposal to use strict mbcs to decode bytes for use against the file system, which is basically the same as what I'm proposing now apart from the more limited encoding: https://mail.python.org/pipermail/python-dev/2011-October/114203.html

It definitely results in less C code to maintain if we do the decode ourselves. We could use strict mbcs, but I'd leave the deprecation warnings in there. Or perhaps we provide an env var to use mbcs as the file system encoding but default to utf-8 (I already have one for selecting the legacy console encoding)? Callers should be asking the sys module for the encoding anyway, so I'd expect few libraries to be impacted, though applications might prefer it.

Top-posted from my Windows Phone

-----Original Message-----
From: "Paul Moore" <p.f.moore@gmail.com>
Sent: 8/16/2016 3:54
To: "Nick Coghlan" <ncoghlan@gmail.com>
Cc: "python-ideas" <python-ideas@python.org>
Subject: Re: [Python-ideas] Fix default encodings on Windows

On 15 August 2016 at 19:26, Steve Dower <steve.dower@python.org> wrote:
Passing path_as_bytes in that location has been deprecated since 3.3, so we are well within our rights (and probably overdue) to make it a TypeError in 3.6. While it's obviously an invalid assumption, for the purposes of changing the language we can assume that no existing code is passing bytes into any functions where it has been deprecated.
As far as I'm concerned, there are currently no filesystem APIs on Windows that accept paths as bytes.
[...] On 16 August 2016 at 03:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem is that bytes-as-paths actually *does* work for Mac OS X and systemd based Linux distros properly configured to use UTF-8 for OS interactions. This means that a lot of backend network service code makes that assumption, especially when it was originally written for Python 2, and rather than making it work properly on Windows, folks just drop Windows support as part of migrating to Python 3.
At an ecosystem level, that means we're faced with a choice between implicitly encouraging folks to make their code *nix only, and finding a way to provide a more *nix like experience when running on Windows (where UTF-8 encoded binary data just works, and either other encodings lead to mojibake or else you use chardet to figure things out).
Steve is suggesting that the latter option is preferable, a view I agree with since it lowers barriers to entry for Windows based developers to contribute to primarily *nix focused projects.
So does this mean that you're recommending reverting the deprecation of bytes as paths in favour of documenting that bytes as paths is acceptable, but it will require an encoding of UTF-8 rather than the current behaviour? If so, that raises some questions: 1. Is it OK to backtrack on a deprecation by changing the behaviour like this? (I think it is, but others who rely on the current, deprecated, behaviour may not). 2. Should we be making "always UTF-8" the behaviour on all platforms, rather than just Windows (e.g., Unix systems which haven't got UTF-8 as their locale setting)? This doesn't seem to be a Windows-specific question any more (I'm assuming that if bytes-as-paths are deprecated, that's a cross-platform change, but see below). Having said all this, I can't find the documentation stating that bytes paths are deprecated - the open() documentation for 3.5 says "file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped" and there's no mention of a deprecation. Steve - could you provide a reference? Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Nick Coghlan writes:
At an ecosystem level, that means we're faced with a choice between implicitly encouraging folks to make their code *nix only, and finding a way to provide a more *nix like experience when running on Windows (where UTF-8 encoded binary data just works, and either other encodings lead to mojibake or else you use chardet to figure things out).
Most of the time we do know what the encoding is; we can just ask Windows (although Steve proposes to make Python fib about that, we could add other APIs). This change means that programs that until now could be encoding-agnostic and just pass around bytes on Windows, counting on Python to consistently convert those to the appropriate form for the API, can't do that any more. They have to find out what the encoding is and transcode to UTF-8, or rewrite to do their processing as text. This is a potential burden on existing user code. I suppose that there are such programs, for the same reasons that networking programs tend to use bytes I/O: ports from Python 2, a (misplaced?) emphasis on performance, etc.
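The transcoding burden Stephen describes can be sketched concretely. Under the proposal, bytes obtained from some other source (assumed here, for illustration, to be in cp1252, a common Windows active code page) would have to be decoded with a known encoding and re-encoded as UTF-8 before reaching the filesystem APIs:

```python
# Sketch of the extra step formerly encoding-agnostic code would need.
# The source encoding (cp1252) is an assumption for illustration.
raw = "café.txt".encode("cp1252")    # b'caf\xe9.txt', as read from a stream
text = raw.decode("cp1252")          # requires knowing the source encoding
utf8_path = text.encode("utf-8")     # the form the proposed fs APIs expect
assert utf8_path == b"caf\xc3\xa9.txt"
print(utf8_path)
```

Code that skips the decode step and passes `raw` straight through is exactly the case that would silently produce mojibake file names under the new scheme.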
Steve is suggesting that the latter option is preferable, a view I agree with since it lowers barriers to entry for Windows based developers to contribute to primarily *nix focused projects.
Sure, but do you have any idea what the costs might be? Aside from the code burden mentioned above, there's a reputational issue. Just yesterday I was having a (good-natured) Perl vs. Python discussion on my LUG ML, and two developers volunteered that they avoid Python because "the Python developers frequently break backward compatibility". These memes tend to go off on their own anyway, but this change will really feed that one.
Promoting cross-platform consistency often leads to enabling patterns that are considered a bad idea from a native platform perspective, and this strikes me as an example of that (just as the binary/text separation itself is a case where Python 3 diverged from the POSIX text model to improve consistency across *nix, Windows, JVM and CLR environments).
I would say rather Python 3 chose an across-the-board better, more robust model supporting internationalization and multilingualization properly. POSIX's text model is suitable at best for a fragile localization. This change, OTOH, is a step backward we wouldn't consider except for the intended effect on ease of writing networking code. That's important, but I really don't think that's going to be the only major effect, and I fear it won't be the most important effect. Of course that's FUD -- I have no data on potential burden to existing use cases, or harm to reputation. But neither do you and Steve. :-(
participants (14)
- Brendan Barnwell
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- eryk sun
- MRAB
- Nick Coghlan
- Paul Moore
- Random832
- Stephen J. Turnbull
- Steve Dower
- Sven R. Kunze
- Terry Reedy
- Victor Stinner