os.walk() is going to be *fast* with scandir
Just thought I'd share some of my excitement about how fast the all-C version [1] of os.scandir() is turning out to be.

Below are the results of my scandir / walk benchmark run with three different versions. I'm using an SSD, which seems to make it especially faster than listdir / walk. Note that benchmark results can vary a lot, depending on operating system, file system, hard drive type, and the OS's caching state.

Anyway, os.walk() can be FIFTY times as fast using os.scandir().

# Old ctypes implementation of scandir in scandir.py:
C:\work\scandir>\work\python\cpython\python benchmark.py -r
Using slower ctypes version of scandir
os.walk took 1.144s, scandir.walk took 0.060s -- 19.2x as fast

# Existing "half C" implementation of scandir in _scandir.c:
C:\work\scandir>\Python34-x86\python.exe benchmark.py -r
Using fast C version of scandir
os.walk took 1.160s, scandir.walk took 0.042s -- 27.6x as fast

# New "all C" os.scandir implementation in posixmodule.c:
C:\work\scandir>\work\python\cpython\python benchmark.py -r
Using Python 3.5's builtin os.scandir()
os.walk took 1.141s, scandir.walk took 0.022s -- 53.0x as fast

[1] Work in progress implementation as part of Python 3.5's posixmodule.c available here: https://github.com/benhoyt/scandir/blob/master/posixmodule.c

-Ben
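For readers wondering where the speedup comes from: a listdir-based walk needs a separate stat() call per entry just to tell directories from files, whereas scandir() can usually answer that from data the OS already returned with the directory listing. Below is a minimal sketch of a top-down walk built on os.scandir(), assuming Python 3.6 or later (for the context-manager form). The name walk_with_scandir is made up for illustration; this is not the actual os.walk() code, and it omits error handling, followlinks, and in-place pruning of dirnames.

```python
import os

def walk_with_scandir(top):
    """Yield (dirpath, dirnames, filenames) tuples, top-down, like os.walk(top).

    Hypothetical helper for illustration only.
    """
    dirnames, filenames, to_visit = [], [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() normally uses type information cached from the
            # directory listing, so no extra stat() system call is made.
            if entry.is_dir():
                dirnames.append(entry.name)
                # Like os.walk() with followlinks=False, list symlinked
                # directories but do not descend into them.
                if not entry.is_symlink():
                    to_visit.append(entry.path)
            else:
                filenames.append(entry.name)
    yield top, dirnames, filenames
    for path in to_visit:
        yield from walk_with_scandir(path)
```

Usage is the same as os.walk(): `for dirpath, dirnames, filenames in walk_with_scandir("."): ...`.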
On 09/08/2014 12:43, Ben Hoyt wrote:
> Just thought I'd share some of my excitement about how fast the all-C version [1] of os.scandir() is turning out to be.
>
> Below are the results of my scandir / walk benchmark run with three different versions. I'm using an SSD, which seems to make it especially faster than listdir / walk. Note that benchmark results can vary a lot, depending on operating system, file system, hard drive type, and the OS's caching state.
>
> Anyway, os.walk() can be FIFTY times as fast using os.scandir().

Very nice results, thank you :-)

Regards

Antoine.
On 10 August 2014 13:20, Antoine Pitrou <antoine@python.org> wrote:
> On 09/08/2014 12:43, Ben Hoyt wrote:
>> Just thought I'd share some of my excitement about how fast the all-C version [1] of os.scandir() is turning out to be.
>>
>> Below are the results of my scandir / walk benchmark run with three different versions. I'm using an SSD, which seems to make it especially faster than listdir / walk. Note that benchmark results can vary a lot, depending on operating system, file system, hard drive type, and the OS's caching state.
>>
>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
>
> Very nice results, thank you :-)

Indeed!

This may actually motivate me to start working on a redesign of walkdir at some point, with scandir and DirEntry objects as the basis. My original approach was just too slow to be useful in practice (at least when working with trees on the scale of a full Fedora or RHEL build hosted on an NFS share).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
A small tip from my bzr days - cd into the directory before scanning it - especially if you'll end up statting more than a fraction of the files, or are recursing - otherwise the VFS does a traversal for each path you directly stat / recurse into. This can become a dominating factor in some workloads (I shaved several hundred milliseconds off of bzr stat on kernel trees doing this).

-Rob

On 10 August 2014 15:57, Nick Coghlan <ncoghlan@gmail.com> wrote:
> On 10 August 2014 13:20, Antoine Pitrou <antoine@python.org> wrote:
>> On 09/08/2014 12:43, Ben Hoyt wrote:
>>> Just thought I'd share some of my excitement about how fast the all-C version [1] of os.scandir() is turning out to be.
>>>
>>> Below are the results of my scandir / walk benchmark run with three different versions. I'm using an SSD, which seems to make it especially faster than listdir / walk. Note that benchmark results can vary a lot, depending on operating system, file system, hard drive type, and the OS's caching state.
>>>
>>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
>>
>> Very nice results, thank you :-)
>
> Indeed!
>
> This may actually motivate me to start working on a redesign of walkdir at some point, with scandir and DirEntry objects as the basis. My original approach was just too slow to be useful in practice (at least when working with trees on the scale of a full Fedora or RHEL build hosted on an NFS share).
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
--
Robert Collins <rbtcollins@hp.com>
Distinguished Technologist
HP Converged Cloud
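To make Rob's tip concrete, here is a minimal sketch of "cd in first, then stat short relative names", assuming a single-threaded program (the current directory is process-wide state). The helpers working_directory and total_size are hypothetical names invented for this illustration, and whether the trick actually saves time depends on the OS and file system.

```python
import contextlib
import os
import stat

@contextlib.contextmanager
def working_directory(path):
    """Temporarily chdir into path and restore the previous cwd on exit.

    Not thread-safe: chdir() affects the whole process.
    """
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)

def total_size(directory):
    """Sum the sizes of the regular files directly inside directory."""
    total = 0
    with working_directory(directory):
        for name in os.listdir("."):
            # A bare name is resolved relative to the cwd, so the kernel
            # does not have to walk a long path prefix for every lstat().
            st = os.lstat(name)
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total
```

The try/finally matters here: without it, an exception during the scan would leave the process in the scanned directory.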
Hi Larry,

On 10 August 2014 08:11, Larry Hastings <larry@hastings.org> wrote:
>> A small tip from my bzr days - cd into the directory before scanning it
>
> I doubt that's permissible for a library function like os.scandir().

Indeed, chdir() is notably not compatible with multithreading. There would be a non-portable but clean way to do that: the functions openat() and fstatat(). They only exist on relatively modern Linuxes, though.

See you soon,

Armin.
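From Python, the *at() family is reachable through the dir_fd parameter, so the same effect can be had without touching the process-wide current directory. The following is a minimal sketch, assuming a platform where os.stat is in os.supports_dir_fd and os.listdir is in os.supports_fd (typically Linux and other modern Unixes). The function name file_sizes is hypothetical.

```python
import os
import stat

def file_sizes(directory):
    """Map each regular file name in directory to its size in bytes.

    Paths are resolved relative to an open directory descriptor,
    so no chdir() is needed and other threads are unaffected.
    """
    if os.stat not in os.supports_dir_fd or os.listdir not in os.supports_fd:
        raise NotImplementedError("dir_fd / fd support is missing on this platform")
    sizes = {}
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        for name in os.listdir(dir_fd):
            # With dir_fd set, name is looked up relative to the open
            # directory (fstatat() under the hood on POSIX systems).
            st = os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
            if stat.S_ISREG(st.st_mode):
                sizes[name] = st.st_size
    finally:
        os.close(dir_fd)
    return sizes
```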
Armin Rigo <arigo@tunes.org> writes:
> On 10 August 2014 08:11, Larry Hastings <larry@hastings.org> wrote:
>>> A small tip from my bzr days - cd into the directory before scanning it
>>
>> I doubt that's permissible for a library function like os.scandir().
>
> Indeed, chdir() is notably not compatible with multithreading. There would be a non-portable but clean way to do that: the functions openat() and fstatat(). They only exist on relatively modern Linuxes, though.

There is os.fwalk(), which could be both safer and faster than os.walk(). It yields a rootdir fd that can be used by functions that support the dir_fd parameter (see the os.supports_dir_fd set). They use the *at() functions under the hood.

os.fwalk() could be implemented in terms of os.scandir() if the latter supported an fd parameter like os.listdir() does, that is, if it were in the os.supports_fd set (note: this is different from os.supports_dir_fd).

Victor Stinner suggested [1] to allow scandir(fd) but I don't see it being mentioned in the pep 471 [2]: it neither supports nor rejects the idea.

[1] https://mail.python.org/pipermail/python-dev/2014-July/135283.html
[2] http://legacy.python.org/dev/peps/pep-0471/

--
Akira
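As a rough illustration of the os.fwalk() pattern described above: each iteration yields an open directory descriptor alongside the usual names, and that descriptor can be passed as dir_fd. This is a minimal sketch for Unix platforms where os.fwalk() is available; the function name tree_size is made up for the example.

```python
import os

def tree_size(top):
    """Return the total size in bytes of all non-directory entries under top."""
    total = 0
    # os.fwalk() yields (dirpath, dirnames, filenames, dirfd); dirfd is
    # only valid until the next iteration step.
    for dirpath, dirnames, filenames, dirfd in os.fwalk(top):
        for name in filenames:
            # Resolve name relative to the open directory instead of
            # rebuilding and re-traversing the full path.
            st = os.stat(name, dir_fd=dirfd, follow_symlinks=False)
            total += st.st_size
    return total
```

Note that symbolic links appear in filenames too; with follow_symlinks=False the size counted for them is that of the link itself.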
> Victor Stinner suggested [1] to allow scandir(fd) but I don't see it being mentioned in the pep 471 [2]: it neither supports nor rejects the idea.
>
> [1] https://mail.python.org/pipermail/python-dev/2014-July/135283.html
> [2] http://legacy.python.org/dev/peps/pep-0471/

Yes, listdir() supports fd, and I think scandir() probably will too to parallel that, if not for v1.0 then soon after. Victor and I want to focus on getting the PEP 471 (string path only) version working first.

-Ben
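For reference, later Python releases did gain this: to the best of my knowledge os.scandir() accepts an open directory descriptor on Unix from Python 3.7 onward, and is then listed in os.supports_fd. A minimal sketch under that assumption follows; the function name list_subdirs is hypothetical.

```python
import os

def list_subdirs(directory):
    """Return the sorted names of the subdirectories directly inside directory."""
    if os.scandir not in os.supports_fd:
        raise NotImplementedError("os.scandir(fd) is unavailable on this platform")
    fd = os.open(directory, os.O_RDONLY)
    try:
        # Passing a descriptor instead of a path; the caller still owns
        # fd and closes it below.
        with os.scandir(fd) as entries:
            return sorted(
                entry.name
                for entry in entries
                if entry.is_dir(follow_symlinks=False)
            )
    finally:
        os.close(fd)
```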
Indeed - my suggestion is applicable to people using the library

-Rob

On 10 Aug 2014 18:21, "Larry Hastings" <larry@hastings.org> wrote:
> On 08/09/2014 10:40 PM, Robert Collins wrote:
>> A small tip from my bzr days - cd into the directory before scanning it
>
> I doubt that's permissible for a library function like os.scandir().
>
> */arry*
On Sun, 10 Aug 2014 13:57:36 +1000, Nick Coghlan <ncoghlan@gmail.com> wrote:
> On 10 August 2014 13:20, Antoine Pitrou <antoine@python.org> wrote:
>> On 09/08/2014 12:43, Ben Hoyt wrote:
>>> Just thought I'd share some of my excitement about how fast the all-C version [1] of os.scandir() is turning out to be.
>>>
>>> Below are the results of my scandir / walk benchmark run with three different versions. I'm using an SSD, which seems to make it especially faster than listdir / walk. Note that benchmark results can vary a lot, depending on operating system, file system, hard drive type, and the OS's caching state.
>>>
>>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
>>
>> Very nice results, thank you :-)
>
> Indeed!
>
> This may actually motivate me to start working on a redesign of walkdir at some point, with scandir and DirEntry objects as the basis. My original approach was just too slow to be useful in practice (at least when working with trees on the scale of a full Fedora or RHEL build hosted on an NFS share).

There is another potentially good place in the stdlib to apply scandir: iglob. See issue 22167.

--David
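To illustrate why globbing is a natural fit: matching a pattern that should only hit directories otherwise costs one stat() per candidate, while a DirEntry usually already knows its type. Here is a minimal single-directory sketch, assuming Python 3.6 or later. The function iglob_one_dir and its dirs_only parameter are invented for this example and are not the stdlib glob implementation or the patch in issue 22167.

```python
import fnmatch
import os

def iglob_one_dir(directory, pattern, dirs_only=False):
    """Lazily yield paths in directory whose names match a shell-style pattern.

    Hypothetical helper: handles one directory level only.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            # Mirror glob's convention: hidden names match only when the
            # pattern itself starts with a dot.
            if entry.name.startswith(".") and not pattern.startswith("."):
                continue
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            # is_dir() can usually be answered without another system call.
            if dirs_only and not entry.is_dir():
                continue
            yield entry.path
```

For example, `list(iglob_one_dir(".", "*.py"))` lists the matching entries of the current directory in whatever order the OS returns them.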
participants (8)
- Akira Li
- Antoine Pitrou
- Armin Rigo
- Ben Hoyt
- Larry Hastings
- Nick Coghlan
- R. David Murray
- Robert Collins