Issue 11406: adding os.scandir(), a directory iterator returning stat-like info
A few of us were having a discussion at http://bugs.python.org/issue11406 about adding os.scandir(): a generator version of os.listdir() to make iterating over very large directories more memory efficient. This also reflects how the OS gives things to you -- it doesn't give you a big list, but you call a function to iterate and fetch the next entry.

While I think that's a good idea, I'm not sure just that much is enough of an improvement to make adding the generator version worth it.

But what would make this a killer feature is making os.scandir() generate tuples of (name, stat_like_info). The Windows directory iteration functions (FindFirstFile/FindNextFile) give you the full stat information for free, and the Linux and OS X functions (opendir/readdir) give you partial file information (d_type in the dirent struct, which is basically the st_mode part of a stat, whether it's a file, directory, link, etc).

Having this available at the Python level would mean we can vastly speed up functions like os.walk() that otherwise need to make an os.stat() call for every file returned. In my benchmarks of such a generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X, it's more like 1.5-3x. In my opinion, that kind of gain is huge, especially on Windows, but also on Linux/OS X.

So the idea is to add this relatively low-level function that exposes the extra information the OS gives us for free, but which os.listdir() currently throws away. Then higher-level, platform-independent functions like os.walk() could use os.scandir() to get much better performance. People over at Issue 11406 think this is a good idea.

HOWEVER, there's debate over what kind of object the second element in the tuple, "stat_like_info", should be. My strong vote is for it to be a stat_result-like object, but where the fields are None if they're unknown.
There would be basically three scenarios:

1) stat_result with all fields set: this would happen on Windows, where you get as much info from FindFirst/FindNext as from an os.stat()
2) stat_result with just st_mode set, and all other fields None: this would be the usual case on Linux/OS X
3) stat_result with all fields None: this would happen on systems whose readdir()/dirent doesn't have d_type, or on Linux/OS X when d_type was DT_UNKNOWN

Higher-level functions like os.walk() would then check the fields they needed are not None, and only call os.stat() if needed, for example:

    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.st_mode is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)

Not bad for a 2-10x performance boost, right? What do folks think?

Cheers,
Ben.

P.S. A few non-essential further notes:

1) As a Windows guy, a nice-to-have addition to os.scandir() would be a keyword arg like win_wildcard which would default to '*.*', but which power users could pass in to use the wildcard feature of FindFirst/FindNext on Windows. We have plenty of other low-level functions that expose OS-specific features in the os module, so this would be no different. But then again, it's not nearly as important as exposing the stat info.

2) I've been dabbling with this concept for a while in my BetterWalk library: https://github.com/benhoyt/betterwalk

Note that the benchmarks there are old, and I've made further improvements in my local copy. The ctypes version gives speed gains for os.walk() of 2-3x on Windows, but I've also got a C version, which is giving 9-10x speed gains. I haven't yet got a Linux/OS X version written in C.

3) See also the previous python-ideas thread on BetterWalk: http://mail.python.org/pipermail/python-ideas/2012-November/017944.html
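A minimal sketch of the fallback pattern described in the post above. The (name, stat-like) tuples are only a proposal at this point, so `PartialStat` and `split_entries` are hypothetical names used for illustration; the entries are passed in explicitly rather than coming from a real os.scandir().

```python
import os
import stat
from collections import namedtuple

# Hypothetical stand-in for the proposed "stat_result with None fields".
# Only two fields are modelled here to keep the sketch short.
PartialStat = namedtuple("PartialStat", ["st_mode", "st_size"])


def split_entries(path, entries):
    """Split (name, PartialStat) pairs into (dirs, files).

    os.stat() is called only for entries whose st_mode is unknown (None).
    """
    dirs, files = [], []
    for name, st in entries:
        mode = st.st_mode
        if mode is None:
            # Scenario 3: the OS gave us nothing, so fall back to a real stat.
            mode = os.stat(os.path.join(path, name)).st_mode
        if stat.S_ISDIR(mode):
            dirs.append(name)
        else:
            files.append(name)
    return dirs, files
```

Entries with a known st_mode never touch the filesystem again, which is where the claimed speed-up for os.walk() would come from.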
On 10.05.2013 12:55, Ben Hoyt wrote:
Higher-level functions like os.walk() would then check the fields they needed are not None, and only call os.stat() if needed, for example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.st_mode is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
Have you actually tried the code? It can't give you correct answers. The struct dirent.d_type member as returned by readdir() has different values than stat.st_mode's file type.

For example on my system readdir() returns DT_DIR for a directory but S_ISDIR() checks different bits:

    DT_DIR = 4

    S_ISDIR(mode)  ((mode) & 0170000) == 0040000

Or are you proposing to map d_type to st_mode? That's also problematic because st_mode would only have file type bits, not permission bits. Also POSIX standards state that new file types will not get additional S_IF* constants assigned to them. Some operating systems have IFTODT() / DTTOIF() macros which convert bits between st_mode and d_type but the macros aren't part of the POSIX standard.

Hence I'm +1 on the general idea but -1 on something stat like. IMHO os.scandir() should yield four objects:

 * name
 * inode
 * file type or DT_UNKNOWN
 * stat_result or None

stat_result shall only be returned when the operating system provides a full stat result as returned by os.stat().

Christian
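To make the mismatch concrete, here is a small sketch of what the non-POSIX IFTODT() / DTTOIF() macros do. It assumes the definitions found in glibc's <dirent.h> (a 12-bit shift) and the Linux DT_* values; other platforms may differ, and the function names `dttoif` and `iftodt` are just Python spellings chosen for this example.

```python
import stat

# d_type values as defined in Linux's <dirent.h>; assumed here, not portable.
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def dttoif(d_type):
    """Mirror of glibc's DTTOIF(): move d_type into st_mode's file type bits."""
    return d_type << 12


def iftodt(mode):
    """Mirror of glibc's IFTODT(): extract the file type bits as a d_type."""
    return (mode & 0o170000) >> 12


# A raw d_type is not a usable st_mode, but the shifted value is.
raw_is_dir = stat.S_ISDIR(DT_DIR)
mapped_is_dir = stat.S_ISDIR(dttoif(DT_DIR))
```

Note that the mapped value carries file type bits only; the permission bits are all zero, which is the ambiguity raised later in the thread.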
On Fri, 10 May 2013 13:46:30 +0200, Christian Heimes <christian@python.org> wrote:
Hence I'm +1 on the general idea but -1 on something stat like. IMHO os.scandir() should yield four objects:
 * name
 * inode
 * file type or DT_UNKNOWN
 * stat_result or None

stat_result shall only be returned when the operating system provides a full stat result as returned by os.stat().
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here. Regards Antoine.
On 10 May, 2013, at 14:16, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 10 May 2013 13:46:30 +0200, Christian Heimes <christian@python.org> wrote:
Hence I'm +1 on the general idea but -1 on something stat like. IMHO os.scandir() should yield four objects:
 * name
 * inode
 * file type or DT_UNKNOWN
 * stat_result or None

stat_result shall only be returned when the operating system provides a full stat result as returned by os.stat().
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here.
But how do you detect that the st_mode field on systems with a d_type is incomplete, as opposed to a system that can return a full st_mode from its readdir equivalent and where the permission bits happen to be 0o0000?

One option would be to add a file type field to stat_result; IIRC this was mentioned in some revisions of the extended stat_result proposal over on python-ideas.

Ronald
Regards
Antoine.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/ronaldoussoren%40mac.com
On 10.05.2013 14:16, Antoine Pitrou wrote:
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here.
POSIX only defines the d_ino and d_name members of struct dirent. Linux, BSD and probably some other platforms also happen to provide d_type. The other members of struct dirent (d_reclen, d_namlen) aren't useful in Python space by themselves.

d_type and st_mode aren't compatible in any way. As you know st_mode also contains POSIX permission information. The file type is encoded with a different set of bits, too. Future file types aren't mapped to S_IF* constants for st_mode.

For d_ino you also need the device number from the directory because the inode is only unique within a device.

I don't really see how to map struct dirent to struct stat on POSIX.

Christian
On Fri, May 10, 2013 at 11:46 PM, Christian Heimes <christian@python.org> wrote:
On 10.05.2013 14:16, Antoine Pitrou wrote:
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here.
POSIX only defines the d_ino and d_name members of struct dirent. Linux, BSD and probably some other platforms also happen to provide d_type. The other members of struct dirent (d_reclen, d_namlen) aren't useful in Python space by themselves.
d_type and st_mode aren't compatible in any way. As you know st_mode also contains POSIX permission information. The file type is encoded with a different set of bits, too. Future file types aren't mapped to S_IF* constants for st_mode.
Why are we exposing a bitfield as the primary Python level API, anyway? It makes sense for the well defined permission bits, but why are we copying the C level concept for the other flags? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, 10 May 2013 23:53:37 +1000, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Fri, May 10, 2013 at 11:46 PM, Christian Heimes <christian@python.org> wrote:
On 10.05.2013 14:16, Antoine Pitrou wrote:
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here.
POSIX only defines the d_ino and d_name members of struct dirent. Linux, BSD and probably some other platforms also happen to provide d_type. The other members of struct dirent (d_reclen, d_namlen) aren't useful in Python space by themselves.
d_type and st_mode aren't compatible in any way. As you know st_mode also contains POSIX permission information. The file type is encoded with a different set of bits, too. Future file types aren't mapped to S_IF* constants for st_mode.
Why are we exposing a bitfield as the primary Python level API, anyway? It makes sense for the well defined permission bits, but why are we copying the C level concept for the other flags?
Precisely because they are not well-defined, hence any interpretation by us may be incorrect or incomplete (e.g. obscure system-specific bits). Regards Antoine.
On Fri, 10 May 2013 15:46:21 +0200, Christian Heimes <christian@python.org> wrote:
On 10.05.2013 14:16, Antoine Pitrou wrote:
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here.
POSIX only defines the d_ino and d_name members of struct dirent. Linux, BSD and probably some other platforms also happen to provide d_type. The other members of struct dirent (d_reclen, d_namlen) aren't useful in Python space by themselves.
d_type and st_mode aren't compatible in any way. As you know st_mode also contains POSIX permission information. The file type is encoded with a different set of bits, too. Future file types aren't mapped to S_IF* constants for st_mode.
Thank you and Ronald for clarifying. This does make the API design a bit bothersome. We want to expose as much information as possible in a cross-platform way and with a flexible granularity, but doing so might require a gazillion of namedtuple fields (platonically, as much as one field per stat bit).
For d_ino you also need the device number from the directory because the inode is only unique within a device.
But hopefully you've already stat'ed the directory ;) Regards Antoine.
On 10 May, 2013, at 15:54, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 10 May 2013 15:46:21 +0200, Christian Heimes <christian@python.org> wrote:
On 10.05.2013 14:16, Antoine Pitrou wrote:
But what if some systems return more than the file type and less than a full stat result? The general problem is POSIX's terrible inertia. I feel that a stat result with some None fields would be an acceptable compromise here.
POSIX only defines the d_ino and d_name members of struct dirent. Linux, BSD and probably some other platforms also happen to provide d_type. The other members of struct dirent (d_reclen, d_namlen) aren't useful in Python space by themselves.
d_type and st_mode aren't compatible in any way. As you know st_mode also contains POSIX permission information. The file type is encoded with a different set of bits, too. Future file types aren't mapped to S_IF* constants for st_mode.
Thank you and Ronald for clarifying. This does make the API design a bit bothersome. We want to expose as much information as possible in a cross-platform way and with a flexible granularity, but doing so might require a gazillion of namedtuple fields (platonically, as much as one field per stat bit).
One field per stat bit is overkill; file permissions are well known enough to keep them as a single item. Most if not all uses of the st_mode field can be covered by adding just "filetype" and "permissions" fields. That would also make it possible to use stat_result in os.scandir() without losing information (it would have filetype != None and permissions and st_mode == None on systems with d_type).
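A rough sketch of the "filetype" plus "permissions" split suggested above, using the standard stat.S_IFMT() and stat.S_IMODE() helpers. The `ModeInfo` tuple and the two constructor functions are hypothetical names for illustration, not part of any proposal text.

```python
import stat
from collections import namedtuple

# Hypothetical container for the two proposed fields plus the raw st_mode.
ModeInfo = namedtuple("ModeInfo", ["filetype", "permissions", "st_mode"])


def from_stat(st_mode):
    """Full information: derive both fields from a complete st_mode."""
    return ModeInfo(stat.S_IFMT(st_mode), stat.S_IMODE(st_mode), st_mode)


def from_dirent(filetype):
    """d_type-only information: the file type is known, the rest is None."""
    return ModeInfo(filetype, None, None)
```

With this split, "permissions are 0o000" (permissions == 0) and "permissions are unknown" (permissions is None) stay distinguishable, which addresses the question raised earlier about incomplete st_mode values.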
For d_ino you also need the device number from the directory because the inode is only unique within a device.
But hopefully you've already stat'ed the directory ;)
Why? There's no need to stat the directory when implementing os.walk using os.scandir (for systems that return filetype information in the API used by os.scandir). Anyway, setting st_ino in the result of os.scandir is harmless, even though using st_ino is uncommon. Getting st_dev from the directory isn't good anyway, for example when using rebind mounts to mount a single file into a different directory (which is a convenient way to make a configuration file available in a chroot environment) Ronald
Regards
Antoine.
Have you actually tried the code? It can't give you correct answers. The struct dirent.d_type member as returned by readdir() has different values than stat.st_mode's file type.
Yes, I'm quite aware of that. In the first version of BetterWalk that's exactly how it did it, and this approach worked fine. However...
Or are you proposing to map d_type to st_mode?
Yes, that's exactly what I was proposing -- sorry if that wasn't clear.
Hence I'm +1 on the general idea but -1 on something stat like. IMHO os.scandir() should yield four objects:
 * name
 * inode
 * file type or DT_UNKNOWN
 * stat_result or None
This feels quite heavy to me. And I don't like how, for the normal case (checking whether something is a file or directory), you'd have to check file_type against DT_UNKNOWN as well as stat_result against None before doing anything with it:

    for item in os.scandir():
        if item.file_type == DT_UNKNOWN and item.stat_result is None:
            # call os.stat()

I guess that's not *too* bad.
That's also problematic because st_mode would only have file type bits, not permission bits.
You're right. However, given that scandir() is intended as a low-level, OS-specific function, couldn't we just document this and move on? Keep the API nice and simple and still cover 95% of the use cases. How often does anyone actually iterate through a directory doing stuff with the permission bits?

The nice thing about having it return a stat-like object is that in almost all cases you don't have to have two different code paths (d_type and st_mode), you just deal with st_mode. And we already have the stat module for dealing with st_mode stuff, so we wouldn't need another bunch of code/constants for dealing with d_type.

The documentation could just say something like:

    "The exact information returned in st_mode is OS-specific. In practice, on Windows it returns all the information that stat() does. On Linux and OS X, it's either None or it includes the file type bits (but not the permission bits)."

Antoine said: "But what if some systems return more than the file type and less than a full stat result?"

Again, I just think that debating the very fine points like this to get that last 5% of use cases will mean we never have this very useful function in the library.

In all the *practical* examples I've seen (and written myself), I iterate over a directory and I just need to know whether it's a file or directory (or maybe a link). Occasionally you need the size as well, but that would just mean a similar check "if st.st_size is None: st = os.stat(...)", which on Linux/OS X would call stat(), but it'd still be free and fast on Windows.

-Ben
On Sat, May 11, 2013 at 2:24 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
In all the *practical* examples I've seen (and written myself), I iterate over a directory and I just need to know whether it's a file or directory (or maybe a link). Occasionally you need the size as well, but that would just mean a similar check "if st.st_size is None: st = os.stat(...)", which on Linux/OS X would call stat(), but it'd still be free and fast on Windows.
Here's the full set of fields on a current stat object:

  st_atime
  st_atime_ns
  st_blksize
  st_blocks
  st_ctime
  st_ctime_ns
  st_dev
  st_gid
  st_ino
  st_mode
  st_mtime
  st_mtime_ns
  st_nlink
  st_rdev
  st_size
  st_uid

Do we really want to publish an object with all of those as attributes potentially set to None, when the abstraction we're trying to present is intended primarily for the benefit of os.walk?

And if we're creating a custom object instead, why return a 2-tuple rather than making the entry's name an attribute of the custom object?

To me, that suggests a more reasonable API for os.scandir() might be for it to be an iterator over "dir_entry" objects:

    name (as a string)
    is_file()
    is_dir()
    is_link()
    stat()
    cached_stat (None or a stat object)

On all platforms, the query methods would not require a separate stat() call. On Windows, cached_stat would be populated with a full stat object when scandir builds the entry. On non-Windows platforms, cached_stat would initially be None, and you would have to call stat() to populate it.

If we find other details that we can reliably provide cross-platform from the dir information, then we can add more query methods or attributes to the dir_entry object.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 11.05.2013 16:34, Nick Coghlan wrote:
Here's the full set of fields on a current stat object:
st_atime st_atime_ns st_blksize st_blocks st_ctime st_ctime_ns st_dev st_gid st_ino st_mode st_mtime st_mtime_ns st_nlink st_rdev st_size st_uid
And there are more fields on some platforms, e.g. st_birthtime.
To me, that suggests a more reasonable API for os.scandir() might be for it to be an iterator over "dir_entry" objects:
    name (as a string)
    is_file()
    is_dir()
    is_link()
    stat()
    cached_stat (None or a stat object)
I suggest that we call it .lstat() and .cached_lstat to make clear that we are talking about no-follow stat() here. On platforms that support fstatat() it should use fstatat(dir_fd, name, &buf, AT_SYMLINK_NOFOLLOW) where dir_fd is the fd from dirfd() of opendir()'s return value.
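A small sketch of the fstatat()-style lookup suggested above. In Python 3.3 and later this is spelled os.stat(name, dir_fd=..., follow_symlinks=False); availability is checked through os.supports_dir_fd. The helper name `lstat_in_dir` and its fallback behaviour are assumptions made for this example.

```python
import os


def lstat_in_dir(dir_fd, dir_path, name):
    """Return a no-follow stat of `name` inside a directory.

    Uses the open directory descriptor when the platform supports it,
    otherwise falls back to os.lstat() on the joined path.
    """
    if dir_fd is not None and os.stat in os.supports_dir_fd:
        # Equivalent of fstatat(dir_fd, name, &buf, AT_SYMLINK_NOFOLLOW).
        return os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
    return os.lstat(os.path.join(dir_path, name))
```

Resolving the name relative to the descriptor avoids re-walking the path, so the result still refers to the directory that was opened even if it is renamed in the meantime.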
On all platforms, the query methods would not require a separate stat() call. On Windows, cached_stat would be populated with a full stat object when scandir builds the entry. On non-Windows platforms, cached_stat would initially be None, and you would have to call stat() to populate it.
+1
If we find other details that we can reliably provide cross-platform from the dir information, then we can add more query methods or attributes to the dir_entry object.
I'd like to see d_type and d_ino, too. d_type should default to DT_UNKNOWN, d_ino to None. Christian
On Sun, May 12, 2013 at 1:42 AM, Christian Heimes <christian@python.org> wrote:
I suggest that we call it .lstat() and .cached_lstat to make clear that we are talking about no-follow stat() here.
Fair point.
On platforms that support fstatat() it should use fstatat(dir_fd, name, &buf, AT_SYMLINK_NOFOLLOW) where dir_fd is the fd from dirfd() of opendir()'s return value.
It may actually make sense to expose the dir_fd as another attribute of the dir_entry object.
If we find other details that we can reliably provide cross-platform from the dir information, then we can add more query methods or attributes to the dir_entry object.
I'd like to see d_type and d_ino, too. d_type should default to DT_UNKNOWN, d_ino to None.
I'd prefer to see a more minimal set to start with - just the features needed to implement os.walk and os.fwalk more efficiently, and provide ready access to the full stat result. Once that core functionality is in place, *then* start debating what other use cases to optimise based on which platforms would support those optimisations and which would require dropping back to the full stat implementation anyway. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, May 12, 2013 at 2:30 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Once that core functionality is in place, *then* start debating what other use cases to optimise based on which platforms would support those optimisations and which would require dropping back to the full stat implementation anyway.
Alternatively, we could simply have a full "dirent" attribute that is None on Windows. That would actually make sense at an implementation level anyway - is_file() etc would check self.cached_lstat first, and if that was None they would check self.dirent, and if that was also None they would raise an error. Construction of a dir_entry would require either a stat object or a dirent object, but complain if it received both. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
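One way to picture the lookup order described above, as a hedged sketch. The class below is hypothetical: the DT_* values are the Linux ones, the choice of ValueError and LookupError is arbitrary, and treating DT_UNKNOWN as "no information" is an assumption added here rather than something stated in the message.

```python
import stat
from types import SimpleNamespace

# d_type values from Linux's <dirent.h>; assumed for illustration.
DT_UNKNOWN, DT_DIR, DT_REG, DT_LNK = 0, 4, 8, 10


class DirEntry:
    """Hypothetical dir_entry built from either a stat object or a dirent."""

    def __init__(self, name, cached_lstat=None, dirent=None):
        # Exactly one source of information must be supplied.
        if (cached_lstat is None) == (dirent is None):
            raise ValueError("pass exactly one of cached_lstat or dirent")
        self.name = name
        self.cached_lstat = cached_lstat
        self.dirent = dirent

    def _check(self, s_isfunc, dt_value):
        # Prefer the cached stat, then the dirent, otherwise give up.
        if self.cached_lstat is not None:
            return s_isfunc(self.cached_lstat.st_mode)
        if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
            return self.dirent.d_type == dt_value
        # Assumption: DT_UNKNOWN is handled like missing information.
        raise LookupError("file type unavailable for %r" % self.name)

    def is_file(self):
        return self._check(stat.S_ISREG, DT_REG)

    def is_dir(self):
        return self._check(stat.S_ISDIR, DT_DIR)

    def is_link(self):
        return self._check(stat.S_ISLNK, DT_LNK)


# Usage with a fake dirent, as a POSIX readdir() might provide.
entry = DirEntry("src", dirent=SimpleNamespace(d_type=DT_DIR, d_ino=1234))
entry_is_dir = entry.is_dir()
```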
And if we're creating a custom object instead, why return a 2-tuple rather than making the entry's name an attribute of the custom object?
To me, that suggests a more reasonable API for os.scandir() might be for it to be an iterator over "dir_entry" objects:
    name (as a string)
    is_file()
    is_dir()
    is_link()
    stat()
    cached_stat (None or a stat object)
Nice! I really like your basic idea of returning a custom object instead of a 2-tuple. And I agree with Christian that .stat() would be clearer called .lstat(). I also like your later idea of simply exposing .dirent (would be None on Windows). One tweak I'd suggest is that is_file() etc be called isfile() etc without the underscore, to match the naming of the os.path.is* functions.
That would actually make sense at an implementation level anyway - is_file() etc would check self.cached_lstat first, and if that was None they would check self.dirent, and if that was also None they would raise an error.
Hmm, I'm not sure about this at all. Are you suggesting that the DirEntry object's is* functions would raise an error if both cached_lstat and dirent were None? Wouldn't it make for a much simpler API to just call os.lstat() and populate cached_lstat instead? As far as I'm concerned, that'd be the point of making DirEntry.lstat() a function.

In fact, I don't think .cached_lstat should be exposed to the user. They just call entry.lstat(), and it returns a cached stat or calls os.lstat() to get the real stat if required (and populates the internal cached stat value). And the entry.is* functions would call entry.lstat() if dirent was None or d_type was DT_UNKNOWN. This would change relatively nasty code like this:

    files = []
    dirs = []
    for entry in os.scandir(path):
        try:
            isdir = entry.isdir()
        except NotPresentError:
            st = os.lstat(os.path.join(path, entry.name))
            isdir = stat.S_ISDIR(st.st_mode)
        if isdir:
            dirs.append(entry.name)
        else:
            files.append(entry.name)

Into nice clean code like this:

    files = []
    dirs = []
    for entry in os.scandir(path):
        if entry.isdir():
            dirs.append(entry.name)
        else:
            files.append(entry.name)

This change would make scandir() usable by ordinary mortals, rather than just hardcore library implementors.
In other words, I'm proposing that the DirEntry objects yielded by scandir() would have .name and .dirent attributes, and .isdir(), .isfile(), .islink(), .lstat() methods, and look basically like this (though presumably implemented in C):

    class DirEntry:
        def __init__(self, name, dirent, lstat, path='.'):
            # User shouldn't need to call this, but called internally by scandir()
            self.name = name
            self.dirent = dirent
            self._lstat = lstat  # non-public attributes
            self._path = path

        def lstat(self):
            if self._lstat is None:
                self._lstat = os.lstat(os.path.join(self._path, self.name))
            return self._lstat

        def isdir(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_DIR
            else:
                return stat.S_ISDIR(self.lstat().st_mode)

        def isfile(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_REG
            else:
                return stat.S_ISREG(self.lstat().st_mode)

        def islink(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_LNK
            else:
                return stat.S_ISLNK(self.lstat().st_mode)

Oh, and the .dirent would either be None (Windows) or would have .d_type and .d_ino attributes (Linux, OS X).

This would make the scandir() API nice and simple to use for callers, but still expose all the information the OS provides (both the meaningful fields in dirent, and a full stat on Windows, nicely cached in the DirEntry object).

Thoughts?

-Ben
On 13.05.2013 00:04, Ben Hoyt wrote:
In fact, I don't think .cached_lstat should be exposed to the user. They just call entry.lstat(), and it returns a cached stat or calls os.lstat() to get the real stat if required (and populates the internal cached stat value). And the entry.is* functions would call entry.lstat() if dirent was None or d_type was DT_UNKNOWN. This would change relatively nasty code like this:
I would prefer to go the other route and not expose lstat(). It's cleaner and less confusing to have a property cached_lstat on the object because it actually says what it contains. The property's internal code can do a lstat() call if necessary.

Your code example doesn't handle the case of a failing lstat() call. It can happen when the file is removed or the permissions of a parent directory change.
This change would make scandir() usable by ordinary mortals, rather than just hardcore library implementors.
Why not have both? The os module exposes and leaks the platform details on more than one occasion. A low level function can expose name + dirent struct on POSIX and name + stat_result on Windows. Then you can build a high level API like os.scandir() in pure Python code.
    class DirEntry:
        def __init__(self, name, dirent, lstat, path='.'):
            # User shouldn't need to call this, but called internally by scandir()
            self.name = name
            self.dirent = dirent
            self._lstat = lstat  # non-public attributes
            self._path = path
You should include the fd of the DIR pointer here for the new *at() function family.
        def lstat(self):
            if self._lstat is None:
                self._lstat = os.lstat(os.path.join(self._path, self.name))
            return self._lstat
The function should use fstatat(2) function (os.lstat with dir_fd) when it is available on the current platform. It's better and more secure than lstat() with a joined path.
        def isdir(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_DIR
            else:
                return stat.S_ISDIR(self.lstat().st_mode)

        def isfile(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_REG
            else:
                return stat.S_ISREG(self.lstat().st_mode)

        def islink(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_LNK
            else:
                return stat.S_ISLNK(self.lstat().st_mode)
A bit faster:

    d_type = getattr(self.dirent, "d_type", DT_UNKNOWN)
    if d_type != DT_UNKNOWN:
        return d_type == DT_LNK

The code doesn't handle a failing lstat() call.

Christian
I would prefer to go the other route and don't expose lstat(). It's cleaner and less confusing to have a property cached_lstat on the object because it actually says what it contains. The property's internal code can do a lstat() call if necessary.
Are you suggesting just accessing .cached_lstat could call os.lstat()? That seems very bad to me. It's a property access -- it looks cheap, therefore people will expect it to be. From PEP 8 "Avoid using properties for computationally expensive operations; the attribute notation makes the caller believe that access is (relatively) cheap." Even worse is error handling -- I'd expect the expression "entry.cached_lstat" to only ever raise AttributeError, not OSError in the case it calls stat under the covers. Calling code would have to have a try/except around what looked like a simple attribute access. For these two reasons I think lstat() should definitely be a function.
Your code example doesn't handle the case of a failing lstat() call. It can happen when the file is removed or permission of a parent directory changes.
True. My isdir/isfile/islink implementations should catch any OSError from the lstat() and return False (like os.path.isdir etc do). But then calling code still doesn't need try/excepts around the isdir() calls. This is how os.walk() is implemented -- there's no extra error handling around the isdir() call.
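A minimal sketch of the error handling described above: swallow OSError from lstat() and report False, in the same spirit as os.path.isdir(). The function name `isdir_nofollow` is made up for this example; unlike os.path.isdir(), it uses lstat() and therefore does not follow symlinks.

```python
import os
import stat


def isdir_nofollow(path):
    """Return True if path is a directory (without following symlinks).

    Returns False if the entry is not a directory or if lstat() fails,
    for example because the file was removed in the meantime.
    """
    try:
        st = os.lstat(path)
    except OSError:
        return False
    return stat.S_ISDIR(st.st_mode)
```

A caller looping over directory entries can then use the result directly, with no try/except around each check.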
Why not have both? The os module exposes and leaks the platform details on more than one occasion. A low level function can expose name + dirent struct on POSIX and name + stat_result on Windows. Then you can build a high level API like os.scandir() in pure Python code.
I wouldn't be opposed to that, but it's a scandir() implementation detail. If there's a scandir_helper_win() and scandir_helper_posix() written in C, and the rest is written in Python, that'd be fine by me. As long as the Python part didn't slow it down much.
The function should use fstatat(2) function (os.lstat with dir_fd) when it is available on the current platform. It's better and more secure than lstat() with a joined path.
Sure. I'm primarily a Windows dev, so not too familiar with all the fancy stat* functions. But what you're saying makes sense. -Ben
On 13.05.2013 02:21, Ben Hoyt wrote:
Are you suggesting just accessing .cached_lstat could call os.lstat()? That seems very bad to me. It's a property access -- it looks cheap, therefore people will expect it to be. From PEP 8 "Avoid using properties for computationally expensive operations; the attribute notation makes the caller believe that access is (relatively) cheap."
Even worse is error handling -- I'd expect the expression "entry.cached_lstat" to only ever raise AttributeError, not OSError in the case it calls stat under the covers. Calling code would have to have a try/except around what looked like a simple attribute access.
For these two reasons I think lstat() should definitely be a function.
OK, you got me! I'm now convinced that a property is a bad idea.

I'd still like to annotate that the function may return a cached value. Perhaps lstat() could require an argument?

    def lstat(self, cached):
        if not cached or self._lstat is None:
            self._lstat = os.lstat(...)
        return self._lstat
True. My isdir/isfile/islink implementations should catch any OSError from the lstat() and return False (like os.path.isdir etc do). But then calling code still doesn't need try/excepts around the isdir() calls. This is how os.walk() is implemented -- there's no extra error handling around the isdir() call.
You could take the opportunity and take the 'file was deleted' case into account. I admit it has a very low priority. Please regard the case for bonus points only. ;)
Sure. I'm primarily a Windows dev, so not too familiar with all the fancy stat* functions. But what you're saying makes sense.
I'm glad to be of assistance! The feature is new (added in 3.3) and is available on most POSIX platforms. http://docs.python.org/3/library/os.html#dir-fd If you need any help or testing please feel free to ask me. I'd really like to get this feature into 3.4. Christian
OK, you got me! I'm now convinced that a property is a bad idea.
Thanks. :-)
I'd still like to annotate that the function may return a cached value. Perhaps lstat() could require an argument?
    def lstat(self, cached):
        if not cached or self._lstat is None:
            self._lstat = os.lstat(...)
        return self._lstat
Hmm, I'm just not sure I like the API. Setting cached to True to me would imply it's only ever going to come from the cache (i.e., just return self._lstat). Also, isdir() etc have the same issue, so if you're going this route, their signatures would need this too. The DirEntry instance is really a cached value in itself. ".name" is cached, ".dirent" is cached, and the methods return cached if they can. That's more or less the point of the object. But you have a fair point, and this would need to be explicit in the documentation. -Ben
2013/5/13 Ben Hoyt <benhoyt@gmail.com>:
    class DirEntry:
        def __init__(self, name, dirent, lstat, path='.'):
            # User shouldn't need to call this, but called internally by scandir()
            self.name = name
            self.dirent = dirent
            self._lstat = lstat  # non-public attributes
            self._path = path

        def lstat(self):
            if self._lstat is None:
                self._lstat = os.lstat(os.path.join(self._path, self.name))
            return self._lstat
        ...
You need to provide a way to invalidate the stat cache, DirEntry.clearcache() for example. Victor
On Mon, May 13, 2013 at 12:11 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
2013/5/13 Ben Hoyt <benhoyt@gmail.com>:
    class DirEntry:
        ...
        def lstat(self):
            if self._lstat is None:
                self._lstat = os.lstat(os.path.join(self._path, self.name))
            return self._lstat
        ...
You need to provide a way to invalidate the stat cache, DirEntry.clearcache() for example.
Hmm, I'm not sure why, as the stat result is cached on the DirEntry instance (not the class). If you don't want the cached version, just call os.stat() yourself, or throw away the DirEntry instance. DirEntry instances would just be used for dealing with scandir() results. -Ben
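To make the trade-off concrete: because the stat result lives on the instance, a long-lived entry can report stale data, and the way around it is the one Ben describes (call os.lstat() yourself or drop the entry). A minimal sketch, with CachingEntry as a hypothetical pure-Python stand-in for the proposed class:

```python
import os
import tempfile


class CachingEntry:
    """Hypothetical sketch: the lstat() result is cached per instance."""

    def __init__(self, name, path):
        self.name = name
        self._path = path
        self._lstat = None

    def lstat(self):
        if self._lstat is None:
            self._lstat = os.lstat(os.path.join(self._path, self.name))
        return self._lstat


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        full = os.path.join(tmp, "data.bin")
        with open(full, "wb"):
            pass                               # start with an empty file
        entry = CachingEntry("data.bin", tmp)
        before = entry.lstat().st_size         # fills the cache
        with open(full, "ab") as f:
            f.write(b"hello")                  # the file grows afterwards
        cached = entry.lstat().st_size         # still the old snapshot
        fresh = os.lstat(full).st_size         # bypasses the cache
        return before, cached, fresh
```

Here demo() returns the size seen on first access, the size the cached entry keeps reporting, and the size a direct os.lstat() sees after the write.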
On Sun, May 12, 2013 at 3:04 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
And if we're creating a custom object instead, why return a 2-tuple rather than making the entry's name an attribute of the custom object?
To me, that suggests a more reasonable API for os.scandir() might be for it to be an iterator over "dir_entry" objects:
    name (as a string)
    is_file()
    is_dir()
    is_link()
    stat()
    cached_stat (None or a stat object)
Nice! I really like your basic idea of returning a custom object instead of a 2-tuple. And I agree with Christian that .stat() would be clearer called .lstat(). I also like your later idea of simply exposing .dirent (would be None on Windows).
One tweak I'd suggest is that is_file() etc be called isfile() etc without the underscore, to match the naming of the os.path.is* functions.
That would actually make sense at an implementation level anyway - is_file() etc would check self.cached_lstat first, and if that was None they would check self.dirent, and if that was also None they would raise an error.
Hmm, I'm not sure about this at all. Are you suggesting that the DirEntry object's is* functions would raise an error if both cached_lstat and dirent were None? Wouldn't it make for a much simpler API to just call os.lstat() and populate cached_lstat instead? As far as I'm concerned, that'd be the point of making DirEntry.lstat() a function.
In fact, I don't think .cached_lstat should be exposed to the user. They just call entry.lstat(), and it returns a cached stat or calls os.lstat() to get the real stat if required (and populates the internal cached stat value). And the entry.is* functions would call entry.lstat() if dirent was None or d_type was DT_UNKNOWN. This would change relatively nasty code like this:
    files = []
    dirs = []
    for entry in os.scandir(path):
        try:
            isdir = entry.isdir()
        except NotPresentError:
            st = os.lstat(os.path.join(path, entry.name))
            isdir = stat.S_ISDIR(st.st_mode)
        if isdir:
            dirs.append(entry.name)
        else:
            files.append(entry.name)
Into nice clean code like this:
    files = []
    dirs = []
    for entry in os.scandir(path):
        if entry.isdir():
            dirs.append(entry.name)
        else:
            files.append(entry.name)
This change would make scandir() usable by ordinary mortals, rather than just hardcore library implementors.
In other words, I'm proposing that the DirEntry objects yielded by scandir() would have .name and .dirent attributes, and .isdir(), .isfile(), .islink(), .lstat() methods, and look basically like this (though presumably implemented in C):
    class DirEntry:
        def __init__(self, name, dirent, lstat, path='.'):
            # User shouldn't need to call this, but called internally by scandir()
            self.name = name
            self.dirent = dirent
            self._lstat = lstat  # non-public attributes
            self._path = path

        def lstat(self):
            if self._lstat is None:
                self._lstat = os.lstat(os.path.join(self._path, self.name))
            return self._lstat

        def isdir(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_DIR
            else:
                return stat.S_ISDIR(self.lstat().st_mode)

        def isfile(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_REG
            else:
                return stat.S_ISREG(self.lstat().st_mode)

        def islink(self):
            if self.dirent is not None and self.dirent.d_type != DT_UNKNOWN:
                return self.dirent.d_type == DT_LNK
            else:
                return stat.S_ISLNK(self.lstat().st_mode)
Oh, and the .dirent would either be None (Windows) or would have .d_type and .d_ino attributes (Linux, OS X).
This would make the scandir() API nice and simple to use for callers, but still expose all the information the OS provides (both the meaningful fields in dirent, and a full stat on Windows, nicely cached in the DirEntry object).
Thoughts?
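A note on why the is* methods above can answer from the dirent alone: on Linux and the BSDs, d_type maps onto the file-type bits of st_mode by a 12-bit shift (the C DTTOIF() macro). The sketch below assumes the d_type values from <dirent.h> on those platforms and defines them locally; kind_from_d_type() is a made-up helper for illustration, not part of the proposal.

```python
import stat

# d_type values as found in <dirent.h> on Linux and the BSDs.
# Assumption: defined locally here rather than taken from a Python module.
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def dttoif(d_type):
    """Equivalent of the C DTTOIF() macro: d_type -> file-type bits of st_mode."""
    return d_type << 12


def kind_from_d_type(d_type):
    """Return 'dir', 'file', 'link' or 'other'; None means lstat() is needed."""
    if d_type == DT_UNKNOWN:
        return None
    mode = dttoif(d_type)
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISLNK(mode):
        return "link"
    return "other"
```

The None return for DT_UNKNOWN corresponds to the fallback branch in isdir()/isfile()/islink(), where a real lstat() call is required.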
I like the sound of this (which sounds like what you've implemented now though I haven't looked at your code). -gps
-Ben _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/greg%40krypto.org
On 10/05/2013 11:55, Ben Hoyt wrote:
A few of us were having a discussion at http://bugs.python.org/issue11406 about adding os.scandir(): a generator version of os.listdir() to make iterating over very large directories more memory efficient. This also reflects how the OS gives things to you -- it doesn't give you a big list, but you call a function to iterate and fetch the next entry.
While I think that's a good idea, I'm not sure just that much is enough of an improvement to make adding the generator version worth it.
But what would make this a killer feature is making os.scandir() generate tuples of (name, stat_like_info). The Windows directory iteration functions (FindFirstFile/FindNextFile) give you the full stat information for free, and the Linux and OS X functions (opendir/readdir) give you partial file information (d_type in the dirent struct, which is basically the st_mode part of a stat, whether it's a file, directory, link, etc).
Having this available at the Python level would mean we can vastly speed up functions like os.walk() that otherwise need to make an os.stat() call for every file returned. In my benchmarks of such a generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X, it's more like 1.5-3x. In my opinion, that kind of gain is huge, especially on Windows, but also on Linux/OS X.
So the idea is to add this relatively low-level function that exposes the extra information the OS gives us for free, but which os.listdir() currently throws away. Then higher-level, platform-independent functions like os.walk() could use os.scandir() to get much better performance. People over at Issue 11406 think this is a good idea.
HOWEVER, there's debate over what kind of object the second element in the tuple, "stat_like_info", should be. My strong vote is for it to be a stat_result-like object, but where the fields are None if they're unknown. There would be basically three scenarios:
1) stat_result with all fields set: this would happen on Windows, where you get as much info from FindFirst/FindNext as from an os.stat()
2) stat_result with just st_mode set, and all other fields None: this would be the usual case on Linux/OS X
3) stat_result with all fields None: this would happen on systems whose readdir()/dirent doesn't have d_type, or on Linux/OS X when d_type was DT_UNKNOWN
Higher-level functions like os.walk() would then check the fields they needed are not None, and only call os.stat() if needed, for example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.st_mode is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
Not bad for a 2-10x performance boost, right? What do folks think?
Cheers, Ben.
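The "fields are None if unknown" idea in the quoted proposal can be tried out without any C code. The sketch below is an illustration only: StatLike is a hypothetical three-field stand-in (the proposal covered every st_* field), and complete() shows the "only call os.stat() if needed" step for scenarios 2 and 3.

```python
import collections
import os
import stat
import tempfile

# Hypothetical reduced stat-like tuple; unknown fields are None.
StatLike = collections.namedtuple("StatLike", "st_mode st_size st_mtime")


def complete(path, name, st):
    """Return st unchanged if st_mode is known, else fall back to os.stat()."""
    if st.st_mode is not None:
        return st
    real = os.stat(os.path.join(path, name))
    return StatLike(real.st_mode, real.st_size, real.st_mtime)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "f"), "w").close()
        partial = StatLike(stat.S_IFREG, None, None)   # scenario 2: mode only
        unknown = StatLike(None, None, None)           # scenario 3: nothing known
        a = complete(tmp, "f", partial)                # no system call needed
        b = complete(tmp, "f", unknown)                # falls back to os.stat()
        return a is partial, stat.S_ISREG(b.st_mode), b.st_size
```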
[snip]
In the python-ideas list there's a thread "PEP: Extended stat_result" about adding methods to stat_result.
Using that, you wouldn't necessarily have to look at st.st_mode. The method could perform an additional os.stat() if the field was None. For example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
That looks much nicer.
On 10 May, 2013, at 16:30, MRAB <python@mrabarnett.plus.com> wrote:
[snip] In the python-ideas list there's a thread "PEP: Extended stat_result" about adding methods to stat_result.
Using that, you wouldn't necessarily have to look at st.st_mode. The method could perform an additional os.stat() if the field was None. For example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
That looks much nicer.
I'd prefer a filetype field, with 'st.filetype == "dir"' instead of 'st.is_dir()'. The actual type of filetype values is less important, an enum type would also work although bootstrapping that type could be interesting. Ronald
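As a rough illustration of the filetype idea, here is what an enum-valued field derived from st_mode might look like. Everything named here (FileType, filetype()) is hypothetical; the sketch assumes the enum module from Python 3.4+ and says nothing about how the bootstrapping concern would be solved.

```python
import enum
import stat


class FileType(enum.Enum):
    """Hypothetical set of values for a 'filetype' field."""
    DIR = "dir"
    FILE = "file"
    LINK = "link"
    OTHER = "other"


def filetype(st_mode):
    """Map the type bits of st_mode to a FileType member."""
    checks = (
        (stat.S_ISDIR, FileType.DIR),
        (stat.S_ISREG, FileType.FILE),
        (stat.S_ISLNK, FileType.LINK),
    )
    for predicate, member in checks:
        if predicate(st_mode):
            return member
    return FileType.OTHER
```

With string values on the members, both spellings from the message work: filetype(mode) is FileType.DIR, or filetype(mode).value == "dir".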
In the python-ideas list there's a thread "PEP: Extended stat_result" about adding methods to stat_result.
Using that, you wouldn't necessarily have to look at st.st_mode. The method could perform an additional os.stat() if the field was None. For example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
That's not too bad. However, the st.is_dir() function could potentially call os.stat(), so you'd have to be specific about how errors are handled. Also, I'm not too enthusiastic about how much "API weight" this would add -- do you need st.is_link() and st.size() and st.everything_else() as well? -Ben
Okay, I've renamed my "BetterWalk" module to "scandir" and updated it as per our discussion: https://github.com/benhoyt/scandir/#readme
It's not yet production-ready, and is basically still in API and performance testing stage. For instance, the underlying scandir_helper functions don't even return iterators yet -- they're just glorified versions of os.listdir() that return an additional d_ino/d_type (Linux) or stat_result (Windows).
In any case, I really like the API (thanks mostly to Nick Coghlan), and performance is great, even with DirEntry being written in Python.
PERFORMANCE: On Windows I'm seeing that scandir.walk() on a large test tree (see benchmark.py) is 8-9 times faster than os.walk(), and on Linux it's 3-4 times faster. Yes, it is that much faster, and yes, those numbers are real. :-)
Please critique away. At this stage it'd be most helpful to critique any API or performance-related issues rather than coding style or minor bugs, as I'm expecting the code itself will change quite a bit still.
Todos:
* Make _scandir.scandir_helper functions return real iterators instead of lists
* Move building of DirEntry objects into C module, so basically the entire scandir() is in C
* Add tests
-Ben
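For anyone who wants to see the shape of the comparison without cloning the repository, here is a small sketch. It is not Ben's benchmark.py: it uses os.scandir() as it later shipped in the standard library (Python 3.5, context-manager form from 3.6) in place of the scandir module discussed here, compares a single directory rather than a full walk, and makes no claim about the timings it produces. The function names are made up.

```python
import os
import stat
import tempfile
import time


def split_with_listdir(path):
    """One lstat() call per entry, the listdir-plus-stat pattern."""
    dirs, files = [], []
    for name in os.listdir(path):
        st = os.lstat(os.path.join(path, name))
        (dirs if stat.S_ISDIR(st.st_mode) else files).append(name)
    return sorted(dirs), sorted(files)


def split_with_scandir(path):
    """Same result, but file types come from the directory entries."""
    dirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            is_dir = entry.is_dir(follow_symlinks=False)
            (dirs if is_dir else files).append(entry.name)
    return sorted(dirs), sorted(files)


def demo(num_dirs=5, num_files=50):
    """Build a scratch tree, run both variants, return (same_result, timings)."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(num_dirs):
            os.mkdir(os.path.join(tmp, "dir%d" % i))
        for i in range(num_files):
            open(os.path.join(tmp, "file%d" % i), "w").close()
        results, timings = [], {}
        for func in (split_with_listdir, split_with_scandir):
            start = time.perf_counter()
            results.append(func(tmp))
            timings[func.__name__] = time.perf_counter() - start
        return results[0] == results[1], timings
```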
Hi Ben, On 13.05.13 14:25, Ben Hoyt wrote:
...It's not yet production-ready, and is basically still in API and performance testing stage. ...
In any case, I really like the API (thanks mostly to Nick Coghlan), and performance is great, even with DirEntry being written in Python.
PERFORMANCE: On Windows I'm seeing that scandir.walk() on a large test tree (see benchmark.py) is 8-9 times faster than os.walk(), and on Linux it's 3-4 times faster. Yes, it is that much faster, and yes, those numbers are real. :-)
Please critique away. At this stage it'd be most helpful to critique any API or performance-related issues ...
you asked for critique, but the performance also shows a 2-3x speedup (as stated by benchmark.py) on Mac OS X 10.8.3 (on a MacBook Pro 13 inch, start of 2011, solid state disk) with Python 2.7.4 (the homebrew one):
    $> git clone git://github.com/benhoyt/scandir.git
    $> cd scandir && python setup.py install
    $> python benchmark.py
    USING FAST C version
    Creating tree at benchtree: depth=4, num_dirs=5, num_files=50
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk took 0.104s, scandir.walk took 0.031s -- 3.3x as fast
    $> python benchmark.py -s
    USING FAST C version
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk size 226395000, scandir.walk size 226395000 -- equal
    os.walk took 0.246s, scandir.walk took 0.125s -- 2.0x as fast
So for now, all well and thank you.
All the best, Stefan.
On Mon, May 13, 2013 at 10:25 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
Okay, I've renamed my "BetterWalk" module to "scandir" and updated it as per our discussion:
Nice!
PERFORMANCE: On Windows I'm seeing that scandir.walk() on a large test tree (see benchmark.py) is 8-9 times faster than os.walk(), and on Linux it's 3-4 times faster. Yes, it is that much faster, and yes, those numbers are real. :-)
I'd like to see the numbers for NFS or CIFS - stat() can be brutally slow over a network connection (that's why we added a caching mechanism to importlib).
Please critique away. At this stage it'd be most helpful to critique any API or performance-related issues rather than coding style or minor bugs, as I'm expecting the code itself will change quite a bit still.
I initially quite liked the idea of not offering any methods on DirEntry, only properties, to make it obvious that they don't touch the file system, but just report info from the scandir call. However, I think that it ends up reading strangely, and would be confusing relative to the os.path APIs. What you have now seems like a good, simple alternative.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I'd like to see the numbers for NFS or CIFS - stat() can be brutally slow over a network connection (that's why we added a caching mechanism to importlib).
How do I know what file system Windows networking is using? In any case, here's some numbers on Windows -- it's looking pretty good! This is with default DEPTH/NUM_DIRS/NUM_FILES on a LAN:
    Benchmarking walks on \\anothermachine\docs\Ben\bigtree, repeat 3/3...
    os.walk took 11.345s, scandir.walk took 0.340s -- 33.3x as fast
And this is on a VPN on a remote network with the benchmark.py values cranked down to DEPTH = 3, NUM_DIRS = 3, NUM_FILES = 20 (because otherwise it was taking far too long):
    Benchmarking walks on \\ben1.titanmt.local\c$\dev\scandir\benchtree, repeat 3/3...
    os.walk took 122.310s, scandir.walk took 5.452s -- 22.4x as fast
If anyone can run benchmark.py on Linux / NFS or similar, that'd be great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first and then move the "benchtree" to the network file system to run it against that.
I initially quite liked the idea of not offering any methods on DirEntry, only properties, to make it obvious that they don't touch the file system, but just report info from the scandir call. However, I think that it ends up reading strangely, and would be confusing relative to the os.path APIs.
What you have now seems like a good, simple alternative.
Thanks. Yeah, I kinda liked the "DirEntry doesn't make any OS calls" at first too, but then as I got into it I realized it made for a really nasty API for most use cases. I like how it's ended up. -Ben
On Tue, 14 May 2013 10:41:01 +1200, Ben Hoyt <benhoyt@gmail.com> wrote:
I'd like to see the numbers for NFS or CIFS - stat() can be brutally slow over a network connection (that's why we added a caching mechanism to importlib).
How do I know what file system Windows networking is using? In any case, here's some numbers on Windows -- it's looking pretty good! This is with default DEPTH/NUM_DIRS/NUM_FILES on a LAN:
    Benchmarking walks on \\anothermachine\docs\Ben\bigtree, repeat 3/3...
    os.walk took 11.345s, scandir.walk took 0.340s -- 33.3x as fast
And this is on a VPN on a remote network with the benchmark.py values cranked down to DEPTH = 3, NUM_DIRS = 3, NUM_FILES = 20 (because otherwise it was taking far too long):
    Benchmarking walks on \\ben1.titanmt.local\c$\dev\scandir\benchtree, repeat 3/3...
    os.walk took 122.310s, scandir.walk took 5.452s -- 22.4x as fast
If anyone can run benchmark.py on Linux / NFS or similar, that'd be great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first and then move the "benchtree" to the network file system to run it against that.
Why does your benchmark create such large files? It doesn't make sense. Regards Antoine.
If anyone can run benchmark.py on Linux / NFS or similar, that'd be great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first and then move the "benchtree" to the network file system to run it against that.
Why does your benchmark create such large files? It doesn't make sense.
Yeah, I was just thinking about that last night, and I should probably change that. Originally I did it because I thought it might affect the speed of directory walking, so I was trying to make some of the files large to be more "real world". I've just tested it, and in practice file size doesn't make much difference, so I've fixed that now: https://github.com/benhoyt/scandir/commit/9663c0afcc5c020d5d1fe34a120b0331b8...
Thanks, Ben
On Tue, 14 May 2013 20:54:50 +1200, Ben Hoyt <benhoyt@gmail.com> wrote:
If anyone can run benchmark.py on Linux / NFS or similar, that'd be great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first and then move the "benchtree" to the network file system to run it against that.
Why does your benchmark create such large files? It doesn't make sense.
Yeah, I was just thinking about that last night, and I should probably change that. Originally I did it because I thought it might affect the speed of directory walking, so I was trying to make some of the files large to be more "real world". I've just tested it, and in practice file size doesn't make much difference, so I've fixed that now:
Thanks. I had bumped the number of files, thinking it would make things more interesting, and it filled my disk. Regards Antoine.
large to be more "real world". I've just tested it, and in practice file size doesn't make much difference, so I've fixed that now:
Thanks. I had bumped the number of files, thinking it would make things more interesting, and it filled my disk.
Denial of Pitrou attack -- sorry! :-) Anyway, it shouldn't fill your disk now. Though it still does use more on-disk space than 3 bytes per file on most FSs, depending on the smallest block size. -Ben
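The gap between a file's logical size and what it occupies on disk is easy to check. A minimal sketch, assuming a POSIX-like platform for st_blocks (which Python documents as a count of 512-byte blocks); on Windows the attribute is absent and the helper returns None for the allocation. The helper names are made up, and the 3-byte file simply matches the figure mentioned above.

```python
import os
import tempfile


def size_and_allocation(path):
    """Return (logical size, allocated bytes or None if st_blocks is missing)."""
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)    # not available on Windows
    allocated = None if blocks is None else blocks * 512
    return st.st_size, allocated


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        full = os.path.join(tmp, "tiny")
        with open(full, "wb") as f:
            f.write(b"abc")                    # a 3-byte file
            f.flush()
            os.fsync(f.fileno())               # make sure it reaches the FS
        return size_and_allocation(full)
```

The second value depends entirely on the file system, so no particular number should be expected from it.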
On Tue, 14 May 2013 10:41:01 +1200, Ben Hoyt <benhoyt@gmail.com> wrote:
If anyone can run benchmark.py on Linux / NFS or similar, that'd be great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first and then move the "benchtree" to the network file system to run it against that.
On a locally running VM:
    os.walk took 0.400s, scandir.walk took 0.120s -- 3.3x as fast
Same VM accessed from the host through a local sshfs:
    os.walk took 2.261s, scandir.walk took 2.055s -- 1.1x as fast
Same, but with "sshfs -o cache=no":
    os.walk took 24.060s, scandir.walk took 25.906s -- 0.9x as fast
Regards
Antoine.
On a locally running VM: os.walk took 0.400s, scandir.walk took 0.120s -- 3.3x as fast
Same VM accessed from the host through a local sshfs: os.walk took 2.261s, scandir.walk took 2.055s -- 1.1x as fast
Same, but with "sshfs -o cache=no": os.walk took 24.060s, scandir.walk took 25.906s -- 0.9x as fast
Thanks. I take it those are "USING FAST C version"? What is "-o cache=no"? I'm guessing the last one isn't giving dirents, so my version is slightly slower than the built-in listdir/stat version due to building and calling methods on the DirEntry objects in Python. It should be no slower when it's all moved to C. -Ben
On Tue, 14 May 2013 21:10:08 +1200, Ben Hoyt <benhoyt@gmail.com> wrote:
On a locally running VM: os.walk took 0.400s, scandir.walk took 0.120s -- 3.3x as fast
Same VM accessed from the host through a local sshfs: os.walk took 2.261s, scandir.walk took 2.055s -- 1.1x as fast
Same, but with "sshfs -o cache=no": os.walk took 24.060s, scandir.walk took 25.906s -- 0.9x as fast
Thanks. I take it those are "USING FAST C version"?
Yes.
What is "-o cache=no"? I'm guessing the last one isn't giving dirents, so my version is slightly slower than the built-in listdir/stat version due to building and calling methods on the DirEntry objects in Python.
It disables sshfs's built-in cache (I suppose it's a filesystem metadata cache). The man page doesn't tell much more about it.
It should be no slower when it's all moved to C.
The slowdown is too small to be interesting. The main point is that there was no speedup, though. Regards Antoine.
It should be no slower when it's all moved to C.
The slowdown is too small to be interesting. The main point is that there was no speedup, though.
True, and thanks for testing. I don't think that's a big issue, however. If it's 3-8x faster in the majority of cases (local disk on all systems, Windows networking), and no slower in a minority (sshfs), I'm not too sad about that. I wonder how sshfs compared to nfs. -Ben
On Tue, 14 May 2013 22:14:42 +1200, Ben Hoyt <benhoyt@gmail.com> wrote:
It should be no slower when it's all moved to C.
The slowdown is too small to be interesting. The main point is that there was no speedup, though.
True, and thanks for testing.
I don't think that's a big issue, however. If it's 3-8x faster in the majority of cases (local disk on all systems, Windows networking), and no slower in a minority (sshfs), I'm not too sad about that.
I wonder how sshfs compared to nfs.
Ok, with a NFS mount (default options, especially "sync") to the same local VM:
First run:
    os.walk took 17.137s, scandir.walk took 0.625s -- 27.4x as fast
Second run:
    os.walk took 1.535s, scandir.walk took 0.617s -- 2.5x as fast
(something fishy with caches?)
Regards
Antoine.
I wonder how sshfs compared to nfs.
(I've modified your benchmark to also test the case where data isn't in the page cache).
Local ext3:
    cached: os.walk took 0.096s, scandir.walk took 0.030s -- 3.2x as fast
    uncached: os.walk took 0.320s, scandir.walk took 0.130s -- 2.5x as fast
NFSv3, 1Gb/s network:
    cached: os.walk took 0.220s, scandir.walk took 0.078s -- 2.8x as fast
    uncached: os.walk took 0.269s, scandir.walk took 0.139s -- 1.9x as fast
Very interesting. Although os.walk may not be widely used in cluster applications, anything that lowers the number of calls to stat() in an application is worthwhile for parallel filesystems, as stat() is handled by the only non-parallel node, the MDS.
Small test on another NFS drive:
    Creating tree at benchtree: depth=4, num_dirs=5, num_files=50
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk took 0.117s, scandir.walk took 0.041s -- 2.8x as fast
I may try it on a Lustre FS if I have some time and if I don't forget about this.
Cheers, Matthieu
2013/5/14 Charles-François Natali <cf.natali@gmail.com>
I wonder how sshfs compared to nfs.
(I've modified your benchmark to also test the case where data isn't in the page cache).
Local ext3:
    cached: os.walk took 0.096s, scandir.walk took 0.030s -- 3.2x as fast
    uncached: os.walk took 0.320s, scandir.walk took 0.130s -- 2.5x as fast
NFSv3, 1Gb/s network:
    cached: os.walk took 0.220s, scandir.walk took 0.078s -- 2.8x as fast
    uncached: os.walk took 0.269s, scandir.walk took 0.139s -- 1.9x as fast
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
On Tue, May 14, 2013 at 12:14 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
I don't think that's a big issue, however. If it's 3-8x faster in the majority of cases (local disk on all systems, Windows networking), and no slower in a minority (sshfs), I'm not too sad about that.
Might be interesting to test something like status calls with a hacked Mercurial. Cheers, Dirkjan
participants (12)
- Antoine Pitrou
- Ben Hoyt
- Charles-François Natali
- Christian Heimes
- Dirkjan Ochtman
- Gregory P. Smith
- Matthieu Brucher
- MRAB
- Nick Coghlan
- Ronald Oussoren
- Stefan Drees
- Victor Stinner