Re: [Python-Dev] File system path encoding on Windows
2016-08-30 16:31 GMT+02:00 Steve Dower <steve.dower@python.org>:
It's the random user on Windows who installed their library that has the problem. They don't know the fix, and may not know how to apply it (e.g. if it's their Jupyter notebook that won't find one of their files - no obvious command line options here).
There is already a DeprecationWarning. Sadly, it's hidden by default: you need a debug build of Python or, more simply, to pass the -Wd command line option. Maybe we should make this warning (Deprecation warning on bytes paths) visible by default, or add a new warning suggesting to enable -X utf8 the first time a Python function gets a byte string (like a filename)?
Any system that requires communication between two different versions of Python must have install instructions (if it's public) or someone who maintains it. It won't magically break without an upgrade, and it should not get an upgrade without testing. The environment variable is available for this kind of scenario, though I'd hope the testing occurs during beta and it gets fixed by the time we release.
I disagree that breaking backward compatibility is worth it. Most users don't care about Unicode since their application already "just works well" for their use case. Having to set an env var to "repair" their app to be able to upgrade Python is not really convenient. Victor
On 30Aug2016 0806, Victor Stinner wrote:
2016-08-30 16:31 GMT+02:00 Steve Dower <steve.dower@python.org>:
It's the random user on Windows who installed their library that has the problem. They don't know the fix, and may not know how to apply it (e.g. if it's their Jupyter notebook that won't find one of their files - no obvious command line options here).
There is already a DeprecationWarning. Sadly, it's hidden by default: you need a debug build of Python or, more simply, to pass the -Wd command line option.
It also only appears on Windows, so developers who do the right thing on POSIX never find out about it. Your average user isn't going to see it - they'll just see the OSError when their file is not found due to the lossy encoding.
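To make the "lossy encoding" failure concrete, here is a minimal sketch. It assumes cp1252 as a stand-in for a Windows ANSI code page and uses a made-up file name; it only illustrates the round-trip problem and is not the actual conversion code inside Python 3.5.

```python
# Assumption: cp1252 stands in for the ANSI code page of a Windows machine.
# The file name is hypothetical; its Cyrillic letters do not exist in cp1252.
name = "данные.txt"

# Encoding with "replace" silently substitutes "?" for unrepresentable characters.
lossy = name.encode("cp1252", errors="replace")
lossy_roundtrip_ok = lossy.decode("cp1252") == name

# UTF-8 can represent every character, so the name survives a round trip.
utf8 = name.encode("utf-8")
utf8_roundtrip_ok = utf8.decode("utf-8") == name

print(lossy, lossy_roundtrip_ok, utf8_roundtrip_ok)
```

Once the name has been turned into question marks, the bytes no longer identify the original file, which is how a later open() ends up with a "file not found" style OSError.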
Maybe we should make this warning (Deprecation warning on bytes paths) visible by default, or add a new warning suggesting to enable -X utf8 the first time a Python function gets a byte string (like a filename)?
The more important thing in my opinion is to make it visible on all platforms, regardless of whether bytes paths are suitable or not. But this will probably be seen as hostile by the majority of open-source Python developers, which is why I'd rather just quietly fix the incompatibility.
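As a rough illustration of why the warning goes unseen, the sketch below compares an "ignore" filter with the "default" action that -Wd selects. The function legacy_listdir is a hypothetical stand-in that warns on every platform; it is not the real os.listdir, whose warning was Windows-only.

```python
import warnings


def legacy_listdir(path):
    """Hypothetical stand-in for an API that deprecates bytes paths."""
    if isinstance(path, bytes):
        warnings.warn("bytes paths are deprecated",
                      DeprecationWarning, stacklevel=2)
    return []


def count_warnings(action):
    """Call legacy_listdir(b'.') under the given filter action and count
    how many warnings were actually emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        legacy_listdir(b".")
        return len(caught)


hidden = count_warnings("ignore")   # the warning is silently dropped
shown = count_warnings("default")   # same action as running with -Wd
print(hidden, shown)
```

A library author can get the same effect in a test run by passing -Wd on the command line or by installing the filter at the top of the test suite.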
Any system that requires communication between two different versions of Python must have install instructions (if it's public) or someone who maintains it. It won't magically break without an upgrade, and it should not get an upgrade without testing. The environment variable is available for this kind of scenario, though I'd hope the testing occurs during beta and it gets fixed by the time we release.
I disagree that breaking backward compatibility is worth it. Most users don't care about Unicode since their application already "just works well" for their use case.
Again, the problem is libraries (code written by someone else that you want to reuse), not applications (code written by you to solve your business problem in your environment). Code that assumes the default encodings are sufficient is already broken in the general case, and libraries nearly always need to cover the general case while applications do not. The stdlib needs to cover the general case, which is why I keep using open(os.listdir(b'.')[-1]) as an example of something that should never fail because of encoding issues. In theory, we should encourage library developers to support Windows properly by using str for paths, probably by disabling bytes paths everywhere. Alternatively, we make it so that bytes paths work fine everywhere and stop telling people that their code is wrong for a platform they're already not hugely concerned about.
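The open(os.listdir(b'.')[-1]) case can be spelled out as a small self-contained check. This is a sketch that assumes a file system encoding able to represent the name (for example UTF-8); the file name and contents are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a file with a non-ASCII name (hypothetical example name).
    with open(os.path.join(tmp, "caf\u00e9.txt"), "w", encoding="utf-8") as f:
        f.write("ok")

    # List the directory using a bytes path: entries come back as bytes.
    bytes_dir = os.fsencode(tmp)
    entries = os.listdir(bytes_dir)

    # Every name returned by the OS should be openable again unchanged.
    contents = []
    for entry in entries:
        with open(os.path.join(bytes_dir, entry), encoding="utf-8") as f:
            contents.append(f.read())

print(entries, contents)
```

With a lossy bytes encoding, the listdir step hands back a name that no longer matches any file, so the open step fails; with a round-trippable encoding the loop always succeeds.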
Having to set an env var to "repair" their app to be able to upgrade Python is not really convenient.
Upgrading Python in an already running system isn't going to be really convenient anyway. Going from x.y.z to x.y.z+1 should be convenient, but from x.y to x.y+1 deserves testing and possibly code or environment changes. I don't understand why changing Python at the same time we change the version number is suddenly controversial. Cheers, Steve
Is this thread something I need to follow closely? -- --Guido van Rossum (python.org/~guido)
On 30Aug2016 1108, Guido van Rossum wrote:
Is this thread something I need to follow closely?
I have PEPs coming, and I'll distil the technical parts of the discussion into those. We may need you to impose an opinion on whether 3.6 is an appropriate time for the change or it should wait for 3.7. I think the technical implications are fairly clear, it's just the risk of surprising/upsetting users that is not. Cheers, Steve
On 31 August 2016 at 01:06, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-08-30 16:31 GMT+02:00 Steve Dower <steve.dower@python.org>:
Any system that requires communication between two different versions of Python must have install instructions (if it's public) or someone who maintains it. It won't magically break without an upgrade, and it should not get an upgrade without testing. The environment variable is available for this kind of scenario, though I'd hope the testing occurs during beta and it gets fixed by the time we release.
I disagree that breaking backward compatibility is worth it. Most users don't care about Unicode since their application already "just works well" for their use case.
Having to set an env var to "repair" their app to be able to upgrade Python is not really convenient.
This seems to be the crux of the disagreement: our perceptions of the relative risks to native Windows Python applications that currently work properly on Python 3.5 vs the potential compatibility benefits to primarily *nix applications that currently *don't* work on Windows under Python 3.5.
If I'm understanding Steve's position correctly, his view is that native Python applications that are working well on Windows under Python 3.5 *must already be using strings to interact with the OS*. This means that they will be unaffected by the proposed changes, as the proposed changes only impact attempts to pass bytes to the OS, not attempts to pass strings.
In uncontrolled environments, using bytes to interact with the OS on Windows just *plain doesn't work properly* under the current model, so the proposed change is a matter of changing precisely how those applications fail, rather than moving them from a previously working state to a newly broken state.
For the proposed default behaviour change to cause problems then, there must be large bodies of software that exist in sufficiently controlled environments that they can get bytes-on-Windows to work in the first place by carefully managing the code page settings, but then *also* permit uncontrolled upgrades to Python 3.6 without first learning that they need to add a new environment variable setting to preserve the Python 3.5 (and earlier) bytes handling behaviour. Steve's assertion is that this intersection of "managed code page settings" and "unmanaged Python upgrades" results in the null set.
A new opt-in config option eliminates any risk of breaking anything, but means Steve will have to wait until 3.7 to try out the idea of having more *nix centric software "just work" on Windows. In the grand scheme of things, I still think it's worth taking that additional time, especially if things are designed so that embedding applications can easily flip the default behaviour.
Yes, there will be environments on Windows where the command line option won't help, just as there are environments on Linux where it won't help. I think that's OK, as we can use the 3.6 cycle to thrash out the details of the new behaviour in the environments where it *does* help (e.g. developers running their test suites on Windows systems), get people used to the idea that the behaviour of binary paths on Windows is going to change, and then actually make the switch in Python 3.7. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Le 30 août 2016 8:05 PM, "Nick Coghlan" <ncoghlan@gmail.com> a écrit :
This seems to be the crux of the disagreement: our perceptions of the relative risks to native Windows Python applications that currently work properly on Python 3.5 vs the potential compatibility benefits to primarily *nix applications that currently *don't* work on Windows under Python 3.5.
As I already wrote once, my problem is also that I simply have no idea how much Python 3 code uses bytes filenames. For example, does it concern more than 25% of py3 modules on PyPI, or less than 5%? Having an idea of the ratio would help to move the discussion forward. Victor
2016-08-30 23:51 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
As I already wrote once, my problem is also that I simply have no idea how much Python 3 code uses bytes filenames. For example, does it concern more than 25% of py3 modules on PyPI, or less than 5%?
I made a very quick test on Windows using a modified Python that raises an exception on bytes paths. First of all, setuptools fails. It's a kind of blocker issue :-) I quickly fixed it (only one line needs to be modified). I tried to run the Twisted unit tests (python -m twisted.trial twisted) of Twisted 16.4. I got a lot of exceptions on bytes paths from the twisted/python/filepath.py module, but also from twisted/trial/util.py. It looks like these modules are doing their best to convert all paths to... bytes. I had to modify more than 5 methods just to be able to start running unit tests. Quick result: setuptools and Twisted rely on bytes paths. Dropping bytes path support on Windows breaks these modules. It also means that these modules don't support the full Unicode range on Windows on Python 3.5. Victor
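The test above was done with a modified interpreter. A much cruder approximation is possible in pure Python by temporarily wrapping a few os functions so that they raise on bytes paths. The helper names below (forbid_bytes_paths, _reject_bytes) are hypothetical, and the sketch covers only os.listdir and os.stat rather than every path-accepting API.

```python
import contextlib
import functools
import os
from unittest import mock


def _reject_bytes(func):
    """Wrap func so that a bytes first argument (or path=...) raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        path = args[0] if args else kwargs.get("path")
        if isinstance(path, (bytes, bytearray)):
            raise TypeError(
                "bytes path passed to %s: %r" % (func.__name__, path))
        return func(*args, **kwargs)
    return wrapper


@contextlib.contextmanager
def forbid_bytes_paths():
    """Hypothetical helper: make os.listdir and os.stat reject bytes paths
    for the duration of the with-block, then restore the originals."""
    with mock.patch.object(os, "listdir", _reject_bytes(os.listdir)), \
         mock.patch.object(os, "stat", _reject_bytes(os.stat)):
        yield


with forbid_bytes_paths():
    str_result = os.listdir(".")      # str paths keep working
    try:
        os.listdir(b".")
    except TypeError:
        bytes_rejected = True
    else:
        bytes_rejected = False

print(bytes_rejected)
```

Running a project's test suite inside such a block gives a quick, incomplete signal of where bytes paths are used; code that reaches the OS through other functions would not be caught.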
On 30Aug2016 1611, Victor Stinner wrote:
2016-08-30 23:51 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
As I already wrote once, my problem is also that I simply have no idea how much Python 3 code uses bytes filenames. For example, does it concern more than 25% of py3 modules on PyPI, or less than 5%?
I made a very quick test on Windows using a modified Python that raises an exception on bytes paths.
First of all, setuptools fails. It's a kind of blocker issue :-) I quickly fixed it (only one line needs to be modified).
I tried to run the Twisted unit tests (python -m twisted.trial twisted) of Twisted 16.4. I got a lot of exceptions on bytes paths from the twisted/python/filepath.py module, but also from twisted/trial/util.py. It looks like these modules are doing their best to convert all paths to... bytes. I had to modify more than 5 methods just to be able to start running unit tests.
Quick result: setuptools and Twisted rely on bytes paths. Dropping bytes path support on Windows breaks these modules.
It also means that these modules don't support the full Unicode range on Windows on Python 3.5.
Thanks. That's a good idea (certainly better than mine, which was to go reading code...)
I haven't looked into setuptools, but Twisted appears to be correctly using sys.getfilesystemencoding() when they coerce to bytes, which means the proposed change will simply allow the full Unicode range when paths are encoded. However, if there are places where bytes are not transcoded when they should be *then* there will be new issues. I wonder if we can quickly test whether that happens (e.g. use the file system encoding to "taint" the path somehow - special prefix? - so we can raise if bytes that haven't been correctly encoded at some point are passed in).
Some of my other searching revealed occasional correct use of sys.getfilesystemencoding(), a decent number of uses as a fallback when other encodings are not available, and it's very hard to search for code that uses the os module with bytes not checked to be the right encoding. This is why I argue that the beta period is the best opportunity to check, and why we're better to flip the switch now and flip it back if it all goes horribly wrong - the alternative is a *very* labour intensive exercise that I doubt we can muster.
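Coercing a str path to bytes by asking Python for the encoding, as described above, might look like the following sketch. The helper to_bytes_path is hypothetical and is not Twisted's actual code; it assumes Python 3.6 or later for sys.getfilesystemencodeerrors(), and the standard library's os.fsencode() performs the same conversion.

```python
import os
import sys


def to_bytes_path(path):
    """Hypothetical helper: encode a str path with the file system encoding
    and error handler that Python itself uses for OS calls."""
    if isinstance(path, bytes):
        return path
    return path.encode(sys.getfilesystemencoding(),
                       sys.getfilesystemencodeerrors())


name = "caf\u00e9.txt"            # made-up example name
encoded = to_bytes_path(name)

# Decoding with the matching stdlib helper recovers the original name.
decoded = os.fsdecode(encoded)
print(sys.getfilesystemencoding(), encoded, decoded)
```

Code written this way follows whatever sys.getfilesystemencoding() reports, so it picks up a changed default without modification, whereas code that hard-codes an encoding such as the ANSI code page does not.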
I made another quick&dirty test on Django 1.10 (I ran the Django test suite on my modified Python that raises an exception on bytes paths): I didn't notice any exception related to bytes paths. Django seems to only use Unicode for paths. I can try to run more tests if you know some other major Python applications (modules?) working on Windows/Python 3. Note: About Twisted, I forgot to mention that I'm not really surprised that Twisted uses bytes. Twisted was created something like 10 years ago, when bytes was the de facto choice. Using Unicode in Python 2 was painful when you imagine a module as large as Twisted. Twisted has to support Python 2 and Python 3, so it's not surprising that it still uses bytes in some places, instead of Unicode. Moreover, as with many Python applications/modules, Linux is a first-class citizen, whereas Windows is supported more on a "best effort" basis. Victor
On 30Aug2016 1702, Victor Stinner wrote:
I made another quick&dirty test on Django 1.10 (I ran the Django test suite on my modified Python that raises an exception on bytes paths): I didn't notice any exception related to bytes paths.
Django seems to only use Unicode for paths.
I can try to run more tests if you know some other major Python applications (modules?) working on Windows/Python 3.
The major ones aren't really the concern. I'd be interested to see where numpy and pandas are at, but I suspect they've already encountered and fixed many of these issues due to the size of the user base. (Though skim-reading numpy I see lots of code that would be affected - for better or worse - if the default encoding for open() changed...) I'm more concerned about the long-tail of more focused libraries. Feel free to grab a random selection of Django extensions and try them out, but I don't really think it's worth the effort. I'm certainly not demanding you do it.
Note: About Twisted, I forgot to mention that I'm not really surprised that Twisted uses bytes. Twisted was created something like 10 years ago, when bytes was the de facto choice. Using Unicode in Python 2 was painful when you imagine a module as large as Twisted. Twisted has to support Python 2 and Python 3, so it's not surprising that it still uses bytes in some places, instead of Unicode.
Yeah, I don't think they're doing anything wrong and wouldn't want to call them out on it. Especially since they already correctly handle it by asking Python what encoding should be used for the bytes.
Moreover, as with many Python applications/modules, Linux is a first-class citizen, whereas Windows is supported more on a "best effort" basis.
That last point is exactly why I think this is important. Any arguments against making Windows behave more like Linux (i.e. bytes paths are reliable) need to be clear as to why this doesn't matter or is less important than other concerns. Cheers, Steve
On 31 August 2016 at 10:27, Steve Dower <steve.dower@python.org> wrote:
On 30Aug2016 1702, Victor Stinner wrote:
I can try to run more tests if you know some other major Python applications (modules?) working on Windows/Python 3.
The major ones aren't really the concern. I'd be interested to see where numpy and pandas are at, but I suspect they've already encountered and fixed many of these issues due to the size of the user base. (Though skim-reading numpy I see lots of code that would be affected - for better or worse - if the default encoding for open() changed...)
For a case of "Don't break software already trying to do things right", the https://github.com/beetbox/beets example that Daniel linked earlier would be a good one to test. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia