PEP 471 "scandir" accepted
Hi, I asked privately Guido van Rossum if I can be the BDFL-delegate for the PEP 471 and he agreed. I accept the latest version of the PEP: http://legacy.python.org/dev/peps/pep-0471/ I consider that the PEP 471 "scandir" was discussed enough to collect all possible options (variations of the API) and that main flaws have been detected. Ben Hoyt modified his PEP to list all these options, and for each option gives advantages and drawbacks. Great job Ben :-) Thanks all developers who contributed to the threads on the python-dev mailing list! The new version of the PEP has an optional "follow_symlinks" parameter which is True by default. IMO this API fits better the common case, list the content of a single directory, and it's now simple to not follow symlinks to implement a recursive function like os.walk(). The PEP also explicitly mentions that os.walk() will be modified to benefit of the new os.scandir() function. I'm happy because the final API is very close to os.path functions and pathlib.Path methods. Python stays consistent, which is a great power of this language! The PEP is accepted. It's time to review the implementation ;-) The current code can be found at: https://github.com/benhoyt/scandir (I don't think that Ben already updated his implementation for the latest version of the PEP.) Victor
I asked privately Guido van Rossum if I can be the BDFL-delegate for the PEP 471 and he agreed. I accept the latest version of the PEP:
Thank you!
The PEP also explicitly mentions that os.walk() will be modified to benefit of the new os.scandir() function.
Yes, this was a good suggestion to include that explicitly -- in actual fact, speeding up os.walk() was my main goal initially.
The PEP is accepted.
Superb. Could you please update the PEP with the Resolution and BDFL-Delegate fields?
It's time to review the implementation ;-) The current code can be found at:
https://github.com/benhoyt/scandir
(I don't think that Ben already updated his implementation for the latest version of the PEP.)
I have actually updated my GitHub repo for the current PEP (did this last Saturday). However, there are still a few open issues, the main one is that my scandir.py module doesn't handle the bytes/str thing properly. I intend to work on the CPython implementation over the next few weeks. However, a couple of thoughts up-front: I think if I were doing this from scratch I'd reimplement listdir() in Python as "return [e.name for e in scandir(path)]". However, I'm not sure this is a good idea, as I don't really want listdir() to suddenly use more memory and perform slightly *worse* due to the extra DirEntry object allocations. So my basic plan is to have an internal helper function in posixmodule.c that either yields DirEntry objects or strings. And then listdir() would simply be defined something like "return list(_scandir(path, yield_strings=True))" in C or in Python. My reasoning is that then there'll be much less (if any) code duplication between scandir() and listdir(). Does this sound like a reasonable approach? -Ben
Ben Hoyt
I think if I were doing this from scratch I'd reimplement listdir() in Python as "return [e.name for e in scandir(path)]". ... So my basic plan is to have an internal helper function in posixmodule.c that either yields DirEntry objects or strings. And then listdir() would simply be defined something like "return list(_scandir(path, yield_strings=True))" in C or in Python.
My reasoning is that then there'll be much less (if any) code duplication between scandir() and listdir().
Does this sound like a reasonable approach?
Note: listdir() accepts an integer path (an open file descriptor that refers to a directory) that is passed to fdopendir() on POSIX [4] i.e., *you can't use scandir() to replace listdir() in this case* (as I've already mentioned in [1]). See the corresponding tests from [2]. [1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html [2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
From os.listdir() docs [3]:
This function can also support specifying a file descriptor; the file descriptor must refer to a directory.
[3] https://docs.python.org/3.4/library/os.html#os.listdir [4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736 -- Akira
Note: listdir() accepts an integer path (an open file descriptor that refers to a directory) that is passed to fdopendir() on POSIX [4] i.e., *you can't use scandir() to replace listdir() in this case* (as I've already mentioned in [1]). See the corresponding tests from [2].
[1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html [2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
From os.listdir() docs [3]:
This function can also support specifying a file descriptor; the file descriptor must refer to a directory.
[3] https://docs.python.org/3.4/library/os.html#os.listdir [4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736
Fair point. Yes, I hadn't realized listdir supported dir_fd (must have been looking at 2.x docs), though you've pointed it out at [1] above. and I guess I wasn't thinking about implementation at the time. It would be easy enough (I think) to have the helper function support both, but raise an error in the scandir() function if the type of path is an integer. However, given that we have to support this for listdir() anyway, I think it's worth reconsidering whether scandir()'s directory argument can be an integer FD. Given that listdir() already supports it, it will almost certainly be asked for later anyway for someone who's porting some listdir code that uses an FD. Thoughts, Victor? -Ben
2014-07-22 17:52 GMT+02:00 Ben Hoyt
However, given that we have to support this for listdir() anyway, I think it's worth reconsidering whether scandir()'s directory argument can be an integer FD. Given that listdir() already supports it, it will almost certainly be asked for later anyway for someone who's porting some listdir code that uses an FD. Thoughts, Victor?
Please focus on what was accepted in the PEP. We should first test os.scandir(). In a few months, with better feedbacks, we can consider extending os.scandir() to support a file descriptor. There are different issues which should be discussed and decided to implement it (ex: handle the lifetime of the directory file descriptor). Victor
On 23 Jul 2014 02:18, "Victor Stinner"
2014-07-22 17:52 GMT+02:00 Ben Hoyt
: However, given that we have to support this for listdir() anyway, I think it's worth reconsidering whether scandir()'s directory argument can be an integer FD. Given that listdir() already supports it, it will almost certainly be asked for later anyway for someone who's porting some listdir code that uses an FD. Thoughts, Victor?
Please focus on what was accepted in the PEP. We should first test os.scandir(). In a few months, with better feedbacks, we can consider extending os.scandir() to support a file descriptor. There are different issues which should be discussed and decided to implement it (ex: handle the lifetime of the directory file descriptor).
As Victor suggests, getting the core version working and incorporated first is a good way to go. Future enhancements (like accepting a file descriptor) and refactorings (like eliminating the code duplication with listdir) don't need to (and hence shouldn't) go into the initial patch. Cheers, Nick.
Victor _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Makes sense, thanks. -Ben
On Tue, Jul 22, 2014 at 4:57 PM, Nick Coghlan
On 23 Jul 2014 02:18, "Victor Stinner"
wrote: 2014-07-22 17:52 GMT+02:00 Ben Hoyt
: However, given that we have to support this for listdir() anyway, I think it's worth reconsidering whether scandir()'s directory argument can be an integer FD. Given that listdir() already supports it, it will almost certainly be asked for later anyway for someone who's porting some listdir code that uses an FD. Thoughts, Victor?
Please focus on what was accepted in the PEP. We should first test os.scandir(). In a few months, with better feedbacks, we can consider extending os.scandir() to support a file descriptor. There are different issues which should be discussed and decided to implement it (ex: handle the lifetime of the directory file descriptor).
As Victor suggests, getting the core version working and incorporated first is a good way to go. Future enhancements (like accepting a file descriptor) and refactorings (like eliminating the code duplication with listdir) don't need to (and hence shouldn't) go into the initial patch.
Cheers, Nick.
Victor
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ben Hoyt
Note: listdir() accepts an integer path (an open file descriptor that refers to a directory) that is passed to fdopendir() on POSIX [4] i.e., *you can't use scandir() to replace listdir() in this case* (as I've already mentioned in [1]). See the corresponding tests from [2].
[1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html [2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
From os.listdir() docs [3]:
This function can also support specifying a file descriptor; the file descriptor must refer to a directory.
[3] https://docs.python.org/3.4/library/os.html#os.listdir [4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736
Fair point.
Yes, I hadn't realized listdir supported dir_fd (must have been looking at 2.x docs), though you've pointed it out at [1] above. and I guess I wasn't thinking about implementation at the time.
FYI, dir_fd is related but *different*: compare "specifying a file descriptor" [1] vs. "paths relative to directory descriptors" [2]. "NOTE: os.supports_fd and os.supports_dir_fd are different sets." [3]:
import os os.listdir in os.supports_fd True os.listdir in os.supports_dir_fd False
[1] https://docs.python.org/3/library/os.html#path-fd [2] https://docs.python.org/3/library/os.html#dir-fd [3] https://mail.python.org/pipermail/python-dev/2014-July/135296.html To be clear: *listdir() does not support dir_fd* though it can be emulated using os.open(dir_fd=..). You can safely ignore the rest of the e-mail until you want to implement path-fd [1] support for os.scandir() in several months. Here's code example that demonstrates both path-fd [1] and dir-fd [2]: import contextlib import os with contextlib.ExitStack() as stack: dir_fd = os.open('/etc', os.O_RDONLY) stack.callback(os.close, dir_fd) fd = os.open('init.d', os.O_RDONLY, dir_fd=dir_fd) # dir-fd [2] stack.callback(os.close, fd) print("\n".join(os.listdir(fd))) # path-fd [1] It is the same as os.listdir('/etc/init.d') unless '/etc' is symlinked to refer to another directory after the first os.open('/etc',..) call. See also, os.fwalk(dir_fd=..) [4] [4] https://docs.python.org/3/library/os.html#os.fwalk
However, given that we have to support this for listdir() anyway, I think it's worth reconsidering whether scandir()'s directory argument can be an integer FD.
What is entry.path in this case? If input directory is a file descriptor (an integer) then os.path.join(directory, entry.name) won't work. "PEP 471 should explicitly reject the support for specifying a file descriptor so that a code that uses os.scandir may assume that entry.path attribute is always present (no exceptions due to a failure to read /proc/self/fd/NNN or an error while calling fcntl(F_GETPATH) or GetFileInformationByHandleEx() -- see http://stackoverflow.com/q/1188757 )." [5] [5] https://mail.python.org/pipermail/python-dev/2014-July/135441.html On the other hand os.fwalk() [4] that supports both path-fd [1] and dir-fd [2] could be implemented without entry.path property if os.scandir() supports just path-fd [1]. os.fwalk() provides a safe way to traverse a directory tree without symlink races e.g., [6]: def get_tree_size(directory): """Return total size of files in directory and subdirs.""" return sum(entry.lstat().st_size for root, dirs, files, rootfd in fwalk(directory) for entry in files) [6] http://legacy.python.org/dev/peps/pep-0471/#examples where fwalk() is the exact copy of os.fwalk() except that it uses _fwalk() which is defined in terms of scandir(): import os # adapt os._fwalk() to use scandir() instead of os.listdir() def _fwalk(topfd, toppath, topdown, onerror, follow_symlinks): # Note: This uses O(depth of the directory tree) file descriptors: # if necessary, it can be adapted to only require O(1) FDs, see # http://bugs.python.org/issue13734 entries = scandir(topfd) dirs, nondirs = [], [] for entry in entries: #XXX call onerror on OSError on next() and return? # report symlinks to directories as directories (like os.walk) # but no recursion into symlinked subdirectories unless # follow_symlinks is true # add dangling symlinks as nondirs (DirEntry.is_dir() doesn't # raise on broken links) try: (dirs if entry.is_dir() else nondirs).append(entry) except FileNotFoundError: continue # ignore disappeared files if topdown: yield toppath, dirs, nondirs, topfd for entry in dirs: try: orig_st = entry.stat(follow_symlinks=follow_symlinks) #XXX O_DIRECTORY, O_CLOEXEC, [? O_NOCTTY, O_SEARCH ?] dirfd = os.open(entry.name, os.O_RDONLY, dir_fd=topfd) except OSError as err: if onerror is not None: onerror(err) return try: if follow_symlinks or os.path.samestat(orig_st, os.stat(dirfd)): dirpath = os.path.join(toppath, entry.name) # entry.path yield from _fwalk(dirfd, dirpath, topdown, onerror, follow_symlinks) finally: close(dirfd) # or use with entry.opendir() as dirfd: ... if not topdown: yield toppath, dirs, nondirs, topfd i.e., if os.scandir() supports specifying file descriptors [1] then it is relatively straightforward to define os.fwalk() in terms of it. Would scandir() provide the same performance benefits as for os.walk()? entry.stat() can be implemented without entry.path when entry._directory (or whatever other DirEntry's attribute that stores the first parameter to os.scandir(fd)) is an open file descriptor that refers to a directory: def stat(self, *, follow_symlinks=True): return os.stat(self.name, #NOTE: ignore caching follow_symlinks=follow_symlinks, dir_fd=self._directory) lstat = lambda self: self.stat(follow_symlinks=False) -- Akira
Modify os.listdir() to use os.scandir() is not part of the PEP, you should not do that. If you worry about performances, try to implement my free list idea. You may modify the C code of listdir() to share as much code as possible. I mean you can implement your idea in C. Victor
Le 21/07/2014 18:26, Victor Stinner a écrit :
I'm happy because the final API is very close to os.path functions and pathlib.Path methods. Python stays consistent, which is a great power of this language!
By the way, http://bugs.python.org/issue19767 could benefit too. Regards Antoine.
participants (6)
-
Akira Li
-
Antoine Pitrou
-
Ben Hoyt
-
Ethan Furman
-
Nick Coghlan
-
Victor Stinner