[Twisted-Python] Python3: should paths be bytes or str?
The porting guide says "No byte paths in sys.path." I am not sure what this means; I would assume that file paths should always be native strings. Or does it mean that sys.path must only contain ASCII bytes?

The doc for FilePath says "On both Python 2 and Python 3, paths can only be bytes." and svn commit 35410 by itamarst changed the doc for some path functions in python/filepath.py from str to bytes, but not all of them:

    fgrep 'type path:' filepath.py
    @type path: L{str}
    @type path: L{str}
    @type path: L{bytes}
    @type path: L{bytes}
    @type path: L{bytes}

I stumbled upon this while trying to find out how much work it might be to make bin/trial run with python3. admin/run-python3-tests already passes for all twisted.spread related tests, but I still need to clean up a lot.

After adding an assert to FilePath.__init__, python3 bin/trial ... gives

    File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 601, in run
      config.parseOptions()
    File "/home/wr/ssdsrc/Twisted/twisted/python/usage.py", line 277, in parseOptions
      self.postOptions()
    File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 472, in postOptions
      _BasicOptions.postOptions(self)
    File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 382, in postOptions
      self['reporter'] = self._loadReporterByName(self['reporter'])
    File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 369, in _loadReporterByName
      for p in plugin.getPlugins(itrial.IReporter):
    File "/home/wr/ssdsrc/Twisted/twisted/plugin.py", line 209, in getPlugins
      allDropins = getCache(package)
    File "/home/wr/ssdsrc/Twisted/twisted/plugin.py", line 134, in getCache
      mod = getModule(module.__name__)
    File "/home/wr/ssdsrc/Twisted/twisted/python/modules.py", line 781, in getModule
      return theSystemPath[moduleName]
    File "/home/wr/ssdsrc/Twisted/twisted/python/modules.py", line 702, in __getitem__
      self._findEntryPathString(moduleObject)),
    File "/home/wr/ssdsrc/Twisted/twisted/python/modules.py", line 627, in _findEntryPathString
      if _isPackagePath(FilePath(topPackageObj.__file__)):
    File "/home/wr/ssdsrc/Twisted/twisted/python/filepath.py", line 664, in __init__
      assert isinstance(path, bytes), 'path must be bytes: %r' % (path,)
    AssertionError: path must be bytes: '/home/wr/ssdsrc/Twisted/twisted/__init__.py'

-- Wolfgang
On 01:26 am, wolfgang.kde@rohdewald.de wrote:
> The porting guide says
> No byte paths in sys.path.
What porting guide is that?
If paths are being represented using unicode somewhere and you want to use them with FilePath, then you have to encode them (or you have to add unicode path support to FilePath and let FilePath encode them).

Unfortunately it's not entirely obvious how to make FilePath support unicode paths, since not all platforms Twisted supports represent filesystem paths using unicode.

The choice python-dev made to bridge this gap was the creation of the "surrogateescape" error handler for the UTF-8 codec. This lets you pretend that any time you need to convert between bytes and unicode the correct codec is UTF-8 (with this special error handler).

It's not clear this was a good choice (since the result is unicode strings that may contain garbage which will confuse other software), but it's also not clear it's possible for Twisted to try to make any other choice (at some point Twisted has to interoperate with the path-related APIs in Python itself - `sys.path`, for example).

Not sure if that helps you at all. Maybe it outlines the problem a little more clearly, at least.

Jean-Paul
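To make the "surrogateescape" point concrete, here is a minimal sketch using only the standard codec machinery. The file name is made up for illustration; nothing here is Twisted API.

```python
# A file name that is not valid UTF-8: the byte 0xff never occurs in UTF-8.
raw = b"caf\xff.txt"

# With the surrogateescape handler, each undecodable byte becomes a lone
# surrogate code point (0xff -> U+DCFF) instead of raising an error.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")

# The "garbage" concern: the same string cannot be encoded strictly,
# so software that is unaware of the handler will choke on it.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(text), round_tripped == raw, strict_ok)
```

The round trip is lossless, but the intermediate string is only safe to hand to code that also encodes with surrogateescape.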
On Monday, 8 September 2014, 02:14:10, exarkun@twistedmatrix.com wrote:
> On 01:26 am, wolfgang.kde@rohdewald.de wrote:
>> The porting guide says
>> No byte paths in sys.path.
> What porting guide is that?
https://twistedmatrix.com/trac/wiki/Plan/Python3 (see the reviewer checklist)
> it's also not clear it's possible for Twisted to try to make any other choice (at some point Twisted has to interoperate with the path-related APIs in Python itself - `sys.path`, for example).
On Monday, 8 September 2014, 02:14:10, exarkun@twistedmatrix.com wrote:
> If paths are being represented using unicode somewhere and you want to use them with FilePath then you have to encode them (or you have to add unicode path support to FilePath and let FilePath encode them).
I always thought module names must be ASCII-only, but now I found PEP 3131. So we should do the same for twisted.python.modules as in those other places grepped below, and add that assert isinstance(path, bytes) for PY3 in FilePath. And maybe this should go into the above checklist? I have no edit rights in the wiki.

BUT: I will stop trying to port python/modules.py. The usage of the same strings for both module names and file paths is too interwoven; I do not want to touch that.

My feeling is that file names should all be unicode, converting them only where needed. But then I am not an expert on this.

Next problem: PEP 3131. See separate mail.

    grep -r __file__ | grep encode
    web/test/test_webclient.py:serverPEM = FilePath(test.__file__.encode("utf-8")).sibling(b'server.pem')
    test/ssl_helpers.py:certPath = nativeString(FilePath(__file__.encode("utf-8")
    test/test_setup.py:if not FilePath(twisted.__file__.encode('utf-8')).sibling(b'topfiles').child(b'setup.py').exists():
    python/test/test_deprecate.py: self.assertEqual(FilePath(module.__file__.encode("utf-8")),
    internet/test/test_gireactor.py: path = FilePath(__file__.encode("utf-8")).sibling(

-- Wolfgang
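The grepped call sites all hard-code `.encode("utf-8")`. As a hedged aside (this alternative is not proposed anywhere in the thread), the standard library's `os.fsencode` performs the same str-to-bytes step using the interpreter's filesystem encoding and error handler, and passes bytes through unchanged. A small sketch, using `os.__file__` only as a convenient example of a module path:

```python
import os

# On Python 3 a module's __file__ is a native (unicode) string.
module_file = os.__file__

# os.fsencode() converts str to bytes with the filesystem encoding;
# the result is what a bytes-only FilePath would expect to receive.
as_bytes = os.fsencode(module_file)

# Bytes input is returned unchanged, so the call is safe on either type.
already_bytes = os.fsencode(b"/tmp/example")

# os.fsdecode() is the inverse and restores the original string.
restored = os.fsdecode(as_bytes)
```

Whether Twisted's own call sites should prefer this over an explicit UTF-8 encode is a separate question that this sketch does not settle.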
On Mon, Sep 8, 2014 at 5:14 AM, <exarkun@twistedmatrix.com> wrote:
> On 01:26 am, wolfgang.kde@rohdewald.de wrote:
>> The porting guide says
>> No byte paths in sys.path.
> What porting guide is that?
> If paths are being represented using unicode somewhere and you want to use them with FilePath then you have to encode them (or you have to add unicode path support to FilePath and let FilePath encode them).
> Unfortunately it's not entirely obvious how to make FilePath support unicode paths since not all platforms Twisted supports represent filesystem paths using unicode.
It really depends on the filesystem, not on the platform. The platform just makes sure that you won't shoot it in the foot. So to behave well you need to translate your path knowledge to platform knowledge when you have to make a change. In data transformation theory that may mean:

[ ] get data about path in native format
[ ] detect the source encoding of filesystem
[ ] figure out if you can work with native format
    [ ] python 2 way - just work with bytes
    [ ] python 3 way - look if native filesystem format is convertible to unicode
        [ ] if conversion is symmetrical - operate in unicode
        [ ] if not convertible, alternatives (options, switches)
            [ ] fail and explain why to user in user actionable manner (don't use ?)
            [ ] use some symmetrical mapping to unicode and mark path objects as `mapped` so that there is a trace of hacks on filepaths
            [ ] provide API on objects without ability to use names directly
            [ ] transform the name to a "safe" valid value, losing the original name, and explain to the user why and what happened to the old name
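One branch of the checklist above (the "python 3 way" with the "symmetrical mapping" fallback) can be sketched as follows. The function name `decode_path` and the boolean `mapped` flag are hypothetical, chosen only to mirror the wording of the list; the other alternatives, such as failing with an explanation, are not shown.

```python
import sys

def decode_path(raw, encoding=None):
    """Return (text, mapped) for a bytes path.

    mapped is False when a strict decode succeeds, and True when the
    surrogateescape fallback had to be used to keep a reversible mapping.
    """
    if encoding is None:
        # Assumption: the interpreter's idea of the filesystem encoding.
        encoding = sys.getfilesystemencoding()
    try:
        # Conversion is symmetrical: operate in unicode.
        return raw.decode(encoding), False
    except UnicodeDecodeError:
        # Not convertible: use a reversible mapping and flag the result.
        return raw.decode(encoding, "surrogateescape"), True

clean = decode_path(b"report.txt", "utf-8")
hacked = decode_path(b"report\xff.txt", "utf-8")
```

A caller can then decide per application what to do with paths whose flag is set, which is the opt-in behaviour argued for in this message.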
> The choice python-dev made to bridge this gap was the creation of the "surrogateescape" error handler for the UTF-8 codec. This lets you pretend that any time you need to convert between bytes and unicode the correct codec is UTF-8 (with this special error handler).
> It's not clear this was a good choice (since the result is unicode strings that may contain garbage which will confuse other software) but it's also not clear it's possible for Twisted to try to make any other choice (at some point Twisted has to interoperate with the path-related APIs in Python itself - `sys.path`, for example).
> Not sure if that helps you at all. Maybe it outlines the problem a little more clearly, at least.
I think that it should be the choice of application maintainers. If they want to create files with dots at the end, Windows allows this, but doesn't support it through standard WinAPI calls, because of FAT. But you can use the special path transformation prefix \\?\ to do this on NTFS or on a remote machine.

In networking the OS plays less of a role, but the new choice for every platform filesystem is not clear to users. They don't realize where the problem comes from.

In the end it all depends on the FS first, then on the OS API (which can be skipped thanks to direct disk access), then on the FS library, then on the application. An application should be able to opt in to handling all possible cases, or just a "safe" subset, or something in the middle, so the task of a framework is just to describe the problem well and ensure that people can make their choice.

-- anatoly t.
On Sep 7, 2014, at 7:14 PM, exarkun@twistedmatrix.com wrote:
> On 01:26 am, wolfgang.kde@rohdewald.de wrote:
>> The porting guide says
>> No byte paths in sys.path.
> What porting guide is that?
> If paths are being represented using unicode somewhere and you want to use them with FilePath then you have to encode them (or you have to add unicode path support to FilePath and let FilePath encode them).
> Unfortunately it's not entirely obvious how to make FilePath support unicode paths since not all platforms Twisted supports represent filesystem paths using unicode.
> The choice python-dev made to bridge this gap was the creation of the "surrogateescape" error handler for the UTF-8 codec. This lets you pretend that any time you need to convert between bytes and unicode the correct codec is UTF-8 (with this special error handler).
> It's not clear this was a good choice (since the result is unicode strings that may contain garbage which will confuse other software) but it's also not clear it's possible for Twisted to try to make any other choice (at some point Twisted has to interoperate with the path-related APIs in Python itself - `sys.path`, for example).
> Not sure if that helps you at all. Maybe it outlines the problem a little more clearly, at least.
The problem with making FilePath support unicode is that we want to provide an interface that applications can rely upon, specified in terms of specific types (bytes or text), so that when you get an IFilePath you know what you can do with it.

As it is currently implemented, FilePath exposes its internal representation fairly directly, most notably as the '.path' attribute, but also in the return type of methods like "basename" and "segmentsFrom".

FilePath doesn't exactly "support" unicode, in that it's specifically documented not to, but it's sort of hard to tell, since you can instantiate one with a unicode string in both Python 2 and Python 3, and get (apparently) correct results out of it for some methods. However, methods that need a string constant as part of their implementation, like siblingExtensionSearch and globChildren, will break unceremoniously when presented with unicode.

Another decision that python-dev made to bridge the gap was to randomly allow different string types to be passed to platform APIs, like this:
>>> import os
>>> os.listdir(u".")
['a', 'b', 'c']
>>> os.listdir(b".")
[b'a', b'b', b'c']
>>> os.path.basename(b".")
b'.'
>>> os.path.basename(".")
'.'
This implies a parallel structure might be possible for FilePath: if you pass its constructor bytes, you get a BytesFilePath; if you pass it text, you get a TextFilePath. You can't mix the two, and once you've chosen a path you can't choose a different one. IFilePath could then document that all of its existing methods have the return type of "whatever got passed to __init__" (which is what the current implementation does about 2/3 of the time anyway on py3, and about 9/10 of the time on py2; we would just be making it work intentionally, all the way).

But it would then be possible to give BytesFilePath an "asText()" method and, vice versa, TextFilePath an "asBytes()" method. Since it's the filesystem, metadata about encodings exists outside your program and you would not need to guess at encodings; you'd simply indicate what return value you'd like from methods like .basename() et al.

The more I think about this, the more I like it. It's a bit of annoying and subtle implementation work, but I think it would supply the behavior that most people want, remain compatible with most of the existing unspecified behavior, and it would address clean text/bytes separation without having a giant deprecation cycle and inventing a new interface. It's also the sort of implementation work which, after some discussion and consideration, we could be reasonably sure is *correct* rather than guessing at things.

Thoughts?

-glyph
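A rough sketch of the parallel structure proposed above, to show the shape of the idea rather than an actual Twisted implementation. The class names follow the message; the `makeFilePath` factory, the `_BaseFilePath` base class, and the use of `os.fsencode`/`os.fsdecode` for the conversions are assumptions made for illustration.

```python
import os

class _BaseFilePath:
    """Shared behaviour; every method returns the type given to __init__."""

    def __init__(self, path):
        self.path = path

    def basename(self):
        return os.path.basename(self.path)

    def child(self, name):
        # Refuse to mix bytes and text segments within one path.
        if not isinstance(name, type(self.path)):
            raise TypeError("cannot mix bytes and text path segments")
        return type(self)(os.path.join(self.path, name))

class BytesFilePath(_BaseFilePath):
    def asBytes(self):
        return self

    def asText(self):
        # Assumption: the filesystem encoding is the right one to use.
        return TextFilePath(os.fsdecode(self.path))

class TextFilePath(_BaseFilePath):
    def asText(self):
        return self

    def asBytes(self):
        return BytesFilePath(os.fsencode(self.path))

def makeFilePath(path):
    """Hypothetical factory: pick the class from the argument's type."""
    if isinstance(path, bytes):
        return BytesFilePath(path)
    if isinstance(path, str):
        return TextFilePath(path)
    raise TypeError("path must be bytes or str")

p = makeFilePath(b"/tmp/a.txt")
print(p.basename(), p.asText().basename())
```

In this sketch the explicit `asText()`/`asBytes()` calls are the only places where an encoding is involved, which matches the goal of keeping bytes and text cleanly separated everywhere else.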
On Monday, 8 September 2014, 23:01:50, Glyph wrote:
> This implies a parallel structure might be possible for FilePath: if you pass its constructor bytes, you get a BytesFilePath; if you pass it text, you get a TextFilePath. You can't mix the two, and once you've chosen a path you can't choose a different one.
This sounds good. After the port of pb to PY3 is done, I might have a look at it; it probably should be done before trying to port modules.py and trial to PY3.

-- Wolfgang
On Sep 9, 2014, at 7:40 PM, Wolfgang Rohdewald <wolfgang.kde@rohdewald.de> wrote:
> On Monday, 8 September 2014, 23:01:50, Glyph wrote:
>> This implies a parallel structure might be possible for FilePath: if you pass its constructor bytes, you get a BytesFilePath; if you pass it text, you get a TextFilePath. You can't mix the two, and once you've chosen a path you can't choose a different one.
> this sounds good. After the port of pb to PY3 is done, I might have a look at it, it probably should be done before trying to port modules.py and trial to PY3.
Thank you very much for putting all this work in. I encourage you to also get involved with doing code reviews, so that the >50 tickets currently on <https://twistedmatrix.com/trac/report/25> won't distract other reviewers from ever getting to your contributions! :) -glyph
participants (4)
- anatoly techtonik
- exarkun@twistedmatrix.com
- Glyph
- Wolfgang Rohdewald