On 17Aug2016 0235, Stephen J. Turnbull wrote:
Paul Moore writes:
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
That's incomplete, AFAICS. (Paul makes this point somewhat differently.) We don't want to represent paths in bytes on Windows if we can avoid it. Nor does UTF-16 really enter into it (except for the technical issue of invalid surrogate pairs). So a full statement is, "How do we best represent Windows file system paths in bytes for interoperability with systems that natively represent paths in bytes?" ("Other systems" refers to both other platforms and existing programs on Windows.)
That's incorrect, or at least possible to interpret correctly as the wrong thing. The goal is "code compatibility with systems ...", not interoperability. Nothing about this will make it easier to take a path from Windows and use it on Linux or vice versa, but it will make it easier/more reliable to take code that uses paths on Linux and use it on Windows.
BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)?
Eryk answered this better than I would have.
Paul says:
People passing bytes to open() have in my view, already chosen not to follow the standard advice of "decode incoming data at the boundaries of your application". They may have good reasons for that, but it's perfectly reasonable to expect them to take responsibility for manually tracking the encoding of the resulting bytes values flowing through their code.
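For readers less familiar with that advice, here is a minimal sketch of "decoding at the boundary". The helper name `to_text_path` is hypothetical and not from the thread. `os.fsdecode` is the standard-library call that decodes bytes using the file system encoding and returns str input unchanged.

```python
import os

def to_text_path(path):
    """Hypothetical boundary helper: accept str or bytes, always return str."""
    # os.fsdecode decodes bytes with the file system encoding and
    # passes str through untouched.
    return os.fsdecode(path)

# Both spellings of the same (ASCII-only) name end up as the same str.
text_name = to_text_path("data.txt")
bytes_name = to_text_path(b"data.txt")
print(text_name == bytes_name)
```

Code written this way handles only str internally, so it never has to track which encoding a bytes value came from.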
Abstractly true, but in practice there's no such need for those who made the choice! In a properly set up POSIX locale[1], it Just Works by design, especially if you use UTF-8 as the preferred encoding. It's Windows developers and users who suffer, not those who wrote the code, nor their primary audience which uses POSIX platforms.
You mentioned "locale", "preferred" and "encoding" in the same sentence, so I hope you're not thinking of locale.getpreferredencoding()? Changing that function is orthogonal to this discussion, even though in most cases it returns the same code page that the file system functions will use (which in most cases is also the encoding returned by sys.getfilesystemencoding()).

When Windows developers and users suffer, I see it as my responsibility to reduce that suffering. Changing Python on Windows should do that without affecting developers on Linux, even though the Right Way is to change all the developers on Linux to use str for paths.
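The two functions named above are queried independently, which the following small sketch shows. The variable names are illustrative only, and the values printed depend entirely on the platform and Python version, so no output is shown.

```python
import locale
import sys

# Encoding the locale machinery prefers for text data such as file contents.
preferred = locale.getpreferredencoding(False)

# Encoding Python uses to convert file names between str and bytes.
filesystem = sys.getfilesystemencoding()

print("locale.getpreferredencoding():", preferred)
print("sys.getfilesystemencoding():  ", filesystem)
```

The two values often coincide, but nothing forces them to, which is why a change to one can be discussed separately from the other.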
If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
Accept that we should have deprecated builtin open and the io module, but didn't do so. Extend the existing deprecation of bytes paths on Windows to cover *all* APIs, not just the os module. But modify the deprecation to be "use of the Windows CP_ACP code page (via the ...A Win32 APIs) is deprecated and will be replaced with use of UTF-8 as the implied encoding for all bytes paths on Windows starting in Python 3.7". Document and publicise it much more prominently, as it is a breaking change. Then leave it one release for people to prepare for the change.
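To make concrete why changing the implied encoding is a breaking change, here is a rough sketch using plain codecs rather than any Windows API. It assumes code page 1252 as a stand-in for CP_ACP; the actual ANSI code page varies by system, and the file name is made up for illustration.

```python
# Assumption: cp1252 stands in for CP_ACP; the real code page is system-dependent.
name = "caf\u00e9.txt"

as_utf8 = name.encode("utf-8")      # b'caf\xc3\xa9.txt'
as_cp1252 = name.encode("cp1252")   # b'caf\xe9.txt'

# UTF-8 bytes read as cp1252 decode without error, but to the wrong name.
misread = as_utf8.decode("cp1252")
print(repr(misread))

# cp1252 bytes read as UTF-8 are rejected outright.
try:
    as_cp1252.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```

Bytes produced under one assumption and consumed under the other either name a different file or fail, and the bytes themselves carry no marker saying which assumption was used.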
I like this one! If my paranoid fears are realized, in practice it might have to wait two releases, but at least this announcement should get people who are at risk to speak up. If they don't, then you can just call me "Chicken Little" and go ahead!
I don't think there's any reasonable way to noisily deprecate these functions within Python, but certainly the docs can be made clearer. People who explicitly encode with sys.getfilesystemencoding() should not get the deprecation message, but we can't tell whether they got their bytes from the right encoding or an RNG, so there's no way to discriminate.

I'm going to put together a summary post here (hopefully today) and get those who have been contributing to basically sign off on it, then I'll take it to python-dev. The possible outcomes I'll propose will basically be "do we keep the status quo, undeprecate and change the functionality, deprecate the deprecation and undeprecate/change in a couple releases, or say that it wasn't a real deprecation so we can deprecate and then change functionality in a couple releases".

Cheers,
Steve