Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

I've noticed that os.walk() is a lot slower than it needs to be because it does an os.stat() call for every file/directory. It does this because it uses listdir(), which doesn't return any file attributes. So instead of using the file information provided by FindFirstFile() / FindNextFile() and readdir() -- which listdir() calls -- os.walk() does a stat() on every file to see whether it's a directory or not. Both FindFirst/FindNext and readdir() give this information already. Using it would basically bring the number of system calls down from O(N) to O(log N). I've written a proof-of-concept (see [1] below) using ctypes and FindFirst/FindNext on Windows, showing that for sizeable directory trees it gives a 4x to 6x speedup -- so this is not a micro-optimization! I started trying the same thing with opendir/readdir on Linux, but don't have as much experience there, and wanted to get some feedback on the concept first. I assume it'd be a similar speedup by using d_type & DT_DIR from readdir(). The problem is even worse when you're calling os.walk() and then doing your own stat() on each file, for example, to get the total size of all files in a tree -- see [2]. It means it's calling stat() twice on every file, and I see about a 9x speedup in this scenario using the info FindFirst/Next provide. So there are a couple of things here: 1) The simplest thing to do would be to keep the APIs exactly the same, and get the ~5x speedup on os.walk() -- on Windows, unsure of the exact speedup on Linux. And on OS's where readdir() doesn't return "is directory" information, obviously it'd fall back to using the stat on each file. 2) We could significantly improve the API by adding a listdir_stat() or similar function, which would return a list of (filename, stat_result) tuples instead of just the names. That might be useful in its own right, but of course os.walk() could use it to speed itself up. Then of course it might be good to have walk_stat() which you could use to speed up the "summing sizes" cases. Other related improvements to the listdir/walk APIs that could be considered are: * Using the wildcard/glob that FindFirst/FindNext take to do filtering -- this would avoid fnmatch-ing and might speed up large operations, though I don't have numbers. Obviously we'd have to simulate this with fnmatch on non-Windows OSs, but this kind of filtering is something I've done with os.walk() many times, so just having the API option would be useful ("glob" keyword arg?). * Changing listdir() to yield instead of return a list (or adding yieldir?). This fits both the FindNext/readdir APIs, and would address issues like [3]. Anyway, cutting a long story short -- do folks think 1) is a good idea? What about some of the thoughts in 2)? In either case, what would be the best way to go further on this? Thanks, Ben. [1]: https://gist.github.com/4044946 [2]: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-o... [3]: http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-strea...

On Fri, Nov 9, 2012 at 8:29 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
It's even worse when you add NFS (and other network filesystems) into the mix, so yes, +1 on devising a more efficient API design for bulk stat retrieval than the current listdir+explicit-stat approach that can lead to an excessive number of network round trips. It's a complex enough idea that it definitely needs some iteration outside the stdlib before it could be added, though. You could either start exploring this as a new project, or else if you wanted to fork my walkdir project on BitBucket I'd be interested in reviewing any pull requests you made along those lines - redundant stat calls are currently one of the issues with using walkdir for more complex tasks. (However you decide to proceed, you'll need to set things up to build an extension module, though - walkdir is pure Python at this point). Another alternative you may want to explore is whether or not Antoine Pitrou would be interested in adding such a capability to his pathlib module. pathlib already includes stat result caching in Path objects, and thus may be able to support a clean API for returning path details with the stat results precached. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

It seems many folks think that an os.iterdir() is a good idea, and some that agree that something like os.iterdir_stat() for efficient directory traversal + stat combination is a good idea. And if we get a faster os.walk() for free, that's great too. :-) Nick Coughlan mentioned his walkdir and Antoine's pathlib. While I think these are good third-party libraries, I admit I'm not the biggest fan of either of their APIs. HOWEVER, mainly I think that the stdlib's os.listdir() and os.walk() aren't going away anytime soon, so we might as well make incremental (though significant) improvements to them in the meantime. So I'm going to propose a couple of minimally-invasive changes (API- wise), in what I think is order of importance, highest to lowest: 1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup on Windows by not calling stat() on each file. And on Linux/BSD this same data is available from readdir()'s dirent, so I presume there's be a similar speedup, though it may not be quite 5x. 2) I also propose adding os.iterdir(path='.') to do exactly the same thing as os.listdir(), but yield the results as it gets them instead of returning the whole list at once. 3) Partly for implementing the more efficient walk(), but also for general use, I propose adding os.iterdir_stat() which would be like iterdir but yield (filename, stat) tuples. If stat-while-iterating isn't available on the system, the stat item would be None. If it is available, the stat_result fields that the OS presents would be available -- the other fields would be None. In practice, iterdir_stat() would call FindFirst/Next on Windows and readdir_r on Linux/BSD/Mac OS X, and be implemented in posixmodule.c. This means that on Linux/BSD/Mac OS X it'd return a stat_result with st_mode set but the other fields None, on Windows it'd basically return the full stat_result, and on other systems it'd return (filename, None). The usage pattern (and exactly how os.walk would use it) would be as follows: for filename, st in os.iterdir_stat(path): if st is None or st.st_mode is None: st = os.stat(os.path.join(path, filename)) if stat.S_ISDIR(st.st_mode): # handle directory else: # handle file I'm very keen on 1). And I think adding 2) and 3) make sense, because they're (a) asked for by various folks, (b) fairly simple and self- explanatory APIs, and (c) they'll be needed to implement the faster os.walk() anyway. Thoughts? What's the next step? If I come up with a patch against posixmodule.c, tests, etc, is this likely to be accepted? I could also flesh out my pure-Python proof of concept [1] to do what I'm suggesting above and go from there... Thanks, Ben [1] https://gist.github.com/4044946 On Sat, Nov 10, 2012 at 1:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

On Mon, Nov 12, 2012 at 7:17 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
The issue with patching the stdlib directly rather than releasing something on PyPI is that you likely won't get any design or usability feedback until the first 3.4 alpha, unless it happens to catch the interest of someone willing to tinker with a patched version earlier than that. It only takes one or two savvy users to get solid feedback by publishing something on PyPI (that's all I got for contextlb2, and the design of contextlib.ExitStack in 3.3 benefited greatly from the process). Just the discipline of writing docs, tests and giving people a rationale for downloading your module can help a great deal with making the case for the subsequent stdlib change. The reason I suggested walkdir as a possible venue is that I think your idea here may help with some of walkdir's *other* API design problems (of which there are quite a few, which is why I stopped pushing for it as a stdlib addition in its current state - it has too many drawbacks to be consistently superior to rolling your own custom solution) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 2012-11-12 09:17, Ben Hoyt wrote:
I'm not sure that I like "st is None or st.st_mode is None". You say that if a stat field is not available, it's None. That being the case, if no stat fields are available, couldn't their fields be None? That would lead to just "st.st_mode is None".
[snip]

On 12 Nov, 2012, at 10:17, Ben Hoyt <benhoyt@gmail.com> wrote:
Where would st_mode be retrieved from? The readdir(3) interface only provides d_type (and that field is not in POSIX or SUS). The d_type field contains a file type, and while you could use that to construct a value for st_mode that can be used to test the file type, you cannot reconstruct the file permissions from that. Ronald

On 13 Nov, 2012, at 8:13, Ben Hoyt <benhoyt@gmail.com> wrote:
It would be very odd to have an st_mode that contains a subset of the information the platform can provide. In particular having st_mode would give the impression that it is the full mode. Why not return the information that can be cheaply provided, is useful and can be provided by major platforms (at least Windows, Linux and OS X)? And "useful" would be information for which there is a clear usecase, such as the filetype as it can significantly speed up os.walk. OTOH, FindNextFile on Windows can return most information in struct stat. Ronald

Yes, it's slightly odd, but not as odd as you'd think. This is especially true for Windows users, because we're used to st_mode only being a subset of the information -- the permission bits are basically meaningless on Windows. The alternative is to introduce yet another new tuple/struct with "type size atime ctime mtime" fields. But you still have to specify that it's implementation dependent (Linux/BSD only provides type, Windows provides all those fields), and then you have to have ways of testing what type the type is. stat_result and the stat module already give you those things, which is why I think it's best to stick with the stat_result structure. In terms of what's useful, certainly "type" and "size" are, so you may as well throw in atime/ctime/mtime, which Windows also gives us for free. -Ben

On 13 Nov, 2012, at 21:00, Ben Hoyt <benhoyt@gmail.com> wrote:
That's one more reason for returning a new tuple/struct with a type field: the full st_mode is not useful on Windows, and on Unix readdir doesn't return a full st_mode in the first place.
The interface of the stat module for determining the file type is not very pretty.
How did you measure the 5x speedup you saw with you modified os.walk? It would be interesting to see if Unix platforms have a simular speedup, because if they don't the new API could just return the results of stat (or lstat ...). Ronald

Hmmm, I'm not sure I agree: st_mode from the new iterdir_stat() will be as useful as that currently returned by os.stat(), and it is very useful (mainly for checking whether an entry is a directory or not). You're right that it won't return a full st_mode on Linux/BSD, but I think it's better for folks to use the existing "if stat.S_ISDIR(st.st_mode): ..." idiom than introduce a new thing.
How did you measure the 5x speedup you saw with you modified os.walk?
Just by os.walk()ing through a large directory tree with basically nothing in the inner loop, and comparing that to the same thing with my version.
It would be interesting to see if Unix platforms have a simular speedup, because if they don't the new API could just return the results of stat (or lstat ...).
Yeah, true. I'll do that and post results in the next few days when I get it done. I'm *hoping* for a similar speedup there too, given the increase in system calls, but you never know till you benchmark ... maybe system calls are much faster on Linux, or stat() is cached better or whatever. -Ben

On Wed, Nov 14, 2012 at 5:14 PM, Ronald Oussoren <ronaldoussoren@mac.com>wrote:
One thing to keep in mind with these kind of metrics is that I/O latency is a major factor. Solid state vs spinning disk vs network drive is going to make a *big* difference to the relative performance of the different mechanisms. With NFS (et al), it's particularly important to minimise the number of round trips to the server (that's why the new dir listing caching in the 3.3 import system results in such dramatic speed-ups when some of the sys.path entries are located on network drives). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Nov 14, 2012 at 10:33 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Data from bzr: you can get a very significant speed up by doing two things: - use readdir to get the inode numbers of the files in the directory and stat the files in-increasing-number-order. (this gives you monotonically increasing IO). - chdir to the directory before you stat and use a relative path: it turns out when working with many files that the overhead of absolute paths is substantial. We got a (IIRC 90% reduction in 'bzr status' time applying both of these things, and you can grab the pyrex module needed to do readdir from bzr - though we tuned what we had to match the needs of a VCS, so its likely too convoluted for general purpose use). -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services

Huh, very interesting, thanks. On the first point, did you need to separately stat() the files after the readdir()? Presumably you needed information other than the info in the d_type field from readdir's dirent struct.
Do you have a web link to said source code? I'm having trouble (read: being lazy) figuring out the bzr source repo. -Ben

Le Wed, 14 Nov 2012 22:53:44 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
This assumes directory entries are sorted by inode number (in a btree, I imagine). Is this assumption specific to some Linux / Ubuntu filesystem?
How about using fstatat() instead? chdir() is a no-no since it's a process-wide setting. Regards Antoine.

On Nov 14, 2012 4:55 AM, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
It doesn't assume that, because inodes aren't stored in directories on Posix file systems. Instead, they have names & inode numbers . The inode (which is where the data stat returns lives) is stored elsewhere in the file system, typically in on-disk arrays indexed by inode number (and that's grossly oversimplified). I'm not sure how this would work on modern file systems (zfs, btrfs). <mike

Mike Meyer wrote:
It seems to assume that the inodes referenced by a particular directory will often be clustered, so that many of them reside in the same or nearby disk blocks. But it's hard to imagine this having much effect with the large file system caches used these days. -- Greg

On Wed, Nov 14, 2012, at 5:52, Antoine Pitrou wrote:
I think he was proposing listing the whole directory in advance (which os.walk already does), sorting it, and then looping over it calling stat. If the idea is for an API that exposes more information returned by readdir, though, why not get d_type too when it's available?
A) is fstatat even implemented in python? B) is fstatat even possible under windows? C) using *at functions for this the usual way incurs overhead in the form of having to maintain a number of open file handles equal to the depth of your directory. IIRC, some gnu tools will fork the process to avoid this limit. Though, since we're not doing this for security reasons we could fall back on absolute [or deep relative] paths or reopen '..' to ascend back up, instead. D) you have to close those handles eventually. what if the caller doesn't finish the generator?.

On Wed, 14 Nov 2012 13:20:31 -0500 random832@fastmail.us wrote:
But I don't understand why sorting (by inode? by name?) would make stat() calls faster. That's what I'm trying to understand.
Yup, it's available as a special parameter to os.stat(): http://docs.python.org/dev/library/os.html#os.stat
B) is fstatat even possible under windows?
No, but Windows has its own functions to solve the issue (as explained elsewhere in this thread).
Indeed. But directory trees are usually much wider than they are deep. Regards Antoine.

On Thu, Nov 15, 2012 at 7:43 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Less head seeks. Consider 1000 files in a directory. They will likely be grouped on the disk for locality of reference (e.g. same inode group) though mv's and other operations mean this is only 'likely' not 'a given'. when you do 1000 stat calls in a row, the kernel is working with a command depth of 1, so it can't elevator seek optimise the work load. If it sees linear IO it will trigger readahead, but thats less likely. What the sort by inode does is mean that the disk head never needs to seek backwards, so you get a single outward pass over the needed sectors. This is a widely spread technique: http://git.661346.n2.nabble.com/RFC-HACK-refresh-index-lstat-in-inode-order-... is a recent recurrence of the discussion, with sources cited. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services

On Wed, Nov 14, 2012 at 11:52 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Its definitely not applicable globally ( but its no worse in general than any arbitrary sort, so safe to do everywhere). On the ext* family of file systems inode A < inode B implies inode A is located on a lower sector than B.
fstatat looks *perfect*. Thanks for the pointer. I forget the win32 behaviour, but on Linux a thread is a process -> chdir is localised to the process and not altered across threads. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services

Robert Collins wrote:
chdir is a very bad idea for libraries on Windows - it is meant mainly for the user and not the application. It has also been removed completely for Windows 8 apps (not desktop applications, just the new style ones). Full paths may involve some overhead, but at least with FindFirst/NextFile this should be limited to memory copies and not file lookups. Cheers, Steve

On Wed, Nov 14, 2012, at 13:55, Robert Collins wrote:
I forget the win32 behaviour, but on Linux a thread is a process -> chdir is localised to the process and not altered across threads.
This claim needs to be unpacked, to point out where you are right and where you are wrong: Linux threads are processes: true. cwd _can be_ localized to a single [thread] process rather than shared in the whole [traditional] process: also true. (see http://linux.die.net/man/2/clone, note specifically the CLONE_FS flag) cwd _actually is_ localized to a process when created in the ordinary manner as a thread: false. (see http://linux.die.net/man/7/pthreads )

On Wed, Nov 14, 2012 at 11:54 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
cwd is thread-safe on unix (well, Linux anyhow :P), and the native OS dir walking on Windows is better itself. The only definitions I can find for fwalk are on BSD, not on Linux, so fwalk is also likely something that would need porting; The definitions I did find take a vector of file pointers, and so are totally irrelevant for the point I made. -Rob

Yeah, you're right. The benchmarks I've been doing are only stable after the first run, when I suppose everything is cached and you've taken most of the I/O latency out of the picture. I guess this also means the speed-up for the first run won't be nearly as significant. But I'll do some further testing. -Ben

My very first post, and I screw up the reply to list option. Sorry about that. On 11/13/2012 2:07 AM, Ronald Oussoren wrote:
I think he had the idea that it would just return this incomplete st_mode, and you'd have to deal with it. But maybe the solution is to add a st_type property to stat_result, that returns either this or the appropriate bits from st_mode - is there a reason these are a single field other than a historical artifact of the fact that they are/were in a single 16-bit field in UNIX? Note that according to the documentation not all filesystems on linux support d_type, either. You have to be prepared for the possibility of getting DT_UNKNOWN

P.S. Something else occurs to me. In the presence of reparse points (i.e. symbolic links) on win32, I believe information about whether the destination is meant to be a directory is still provided (I haven't confirmed this, but I know you're required to provide it when making a symlink). This is discarded when the st_mode field is populated with the information that it is a symlink. If the goal is "speed up os.walk", it might be worth keeping this information and using it in os.walk(..., followlinks=True) - maybe the windows version of the stat result has a field for the windows attributes? It's arguable, though, that symbolic links on windows are rare enough not to matter.

On 11/12/12, Ben Hoyt <benhoyt@gmail.com> wrote:
I know that two functions may be better than a keyword, but a combinatorial explosion of functions ... isn't. Even given that listdir can't change for backwards compatibility, and given that iteration might be better for large directories, I'm still not sure an exact analogue is worth it. Could this be someone combined with your 3rd proposal? For example, instead of returning a list of str (or bytes) names, could you return a generator that would yield some sort of File objects? (Well, obviously you *could*, the question is whether that goes too far down the rabbit hole of what a Path object should have.) My strawman is an object such that (a) No extra system calls will be made just to fill in data not available from the dir entry itself. I wouldn't even promise a name, though I can't think of a sensible directory listing that doesn't provide the name. (b) Any metadata you do have -- name, fd, results of stat, results of getxattr ... will be available as an attribute. That way users of filesystems that do send back the size or type won't have to call stat. (c) Attributes will default to None, supporting the "if x is None: x=stat()" pattern for the users who do care about attributes that were not available quickly. (If there is an attribute for which "None" is actually meaningful, the user can use hasattr -- but that is a corner case, not worth polluting the API for.) *Maybe* teach open about these objects, so that it can look for the name or fd attributes. Alternatively, it could return a str (or bytes) subclass that has the other attributes when they are available. That seems a bit contrived, but might be better for backwards compatibility. (os.walk could return such objects too, instead of extracting the name.) -jJ

"Combinatorial explosion of functions"? Slight exaggeration. :-) I'm proposing adding two new functions, iterdir and iterdir_stat. And I definitely think two new functions with pretty standard names and almost self-documenting signatures is much simpler than keyword args and magical return values. For example, Python 2.x has dict.items() and dict.iteritems(), and it's clear what the "iter" version does at a glance.
I considered this, as well as a str subclass with a ".stat" attribute. But iterdir_stat's (filename, stat) tuple is much more explicit, whereas the str subclass just seemed too magical -- though I might be able to be convinced otherwise. Yes, as you mention, I really think the File/Path object is a rabbit hole, and the intent of my proposal is very minimal changes / minimal learning curve -- an incremental addition. -Ben

On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
Two questions: 1) Is there some way to distinguish that your st_mode field is only partially there (i.e. - you get the Linux/BSD d_type value, but not the rest of st_mode)? 2) How about making these attributes properties, so that touching one that isn't there causes them all to be populated. <mike

Very good questions.
Not currently. I haven't thought about this too hard -- there way be a bit that's always set/not set within st_mode itself. Otherwise I'd have to add a full_st_mode or similar property
2) How about making these attributes properties, so that touching one that isn't there causes them all to be populated.
Interesting thought. Might be a bit too magical for my taste. One concern I have with either of the above (adding a property to stat_result) is that then I'd have to use my own class or namedtuple rather than os.stat_result. Presumably that's not a big deal, though, as you'd never do isinstance testing on it. In the case of 2), it might have to be a class, meaning somewhat heavier memory-wise for lots and lots of files. -Ben

It's not a bad idea, but neither am I super-keen on it, because of these two reasons: 1) You've have to add a whole new way / set of constants / functions to test for the different values of d_type. Whereas there's already stuff (stat module) to test for st_mode values. 2) It'd make the typical use case more complex, for example, the straight "if st.st_mode is None ... else ..." I gave earlier becomes this: for filename, st in iterdir_stat(path): if st.d_type is None: if st.st_mode is None: st = os.stat(os.path.join(path, filename)) is_dir = stat.S_ISDIR(st.st_mode) else: is_dir = st.d_type == DT_DIR -Ben

On 11/15/2012 4:50 PM, Ben Hoyt wrote:
if st.d_type is None: st = os.stat(os.path.join(path, filename)) is_dir = st.d_type == DT_DIR Of course, your code would ultimately be more complex anyway since when followlinks=True you want to use isdir, and when it's not you want to use lstat. And consider what information iterdir_stat actually returns when the results are symlinks (if it's readdir/d_type, it's going to say "it's a symlink" and you need to try again to followlinks, if it's WIN32_FIND_DATA you have the information for both in principle, but the stat structure can only include one. Do we need an iterdir_lstat? If so, should iterdir_stat return None if d_type is DT_LNK, or DT_LNK?) ...and ultimately deprecating the S_IS___ stuff. It made sense in the 1970s when there was real savings in packing your 4 bits of type information and your 12 bits of permission information in a single 16-bit field, now it's just a historical artifact that seems like the only reason for it is a vague "make os feel like C on Unix" principle.

On 11/14/12, Mike Meyer <mwm@mired.org> wrote:
On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
On 11/12/12, Ben Hoyt <benhoyt@gmail.com> wrote:
Two questions:
os.iterdir did not call stat; you have partial information. Or are you saying that you want to distinguish between "This filesystem doesn't track that information", "This process couldn't get that information right now", and "That particular piece of information requires a second call that hasn't been made yet"?
2) How about making these attributes properties, so that touching one that isn't there causes them all to be populated.
Part of the motivation was to minimize extra system calls; that suggests making another one should be a function call instead of a property. That said, I can see value in making that optional call return something with the same API. Perhaps: if thefile.X is None: thefile.restat() # namespace clash, but is "restat" really a likely file attribute? or if thefile.X is None: thefile=stat(thefile) -jJ

On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett@gmail.com> wrote:
On 11/14/12, Mike Meyer <mwm@mired.org> wrote:
On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett@gmail.com>
wrote:
Note that you're eliding the proposal these questions were about, that os.iterdir return some kind of object that had attributes that carried the stat values, or None if they weren't available.
I want to distinguish between the case where st_mode is filled from the BSD/Unix d_type directory entry - meaning there is information so st_mode is not None, but the information is incomplete and requires a second system call to fetch - and the case where it's filled via the Windows calls which provide all the information that is available for st_mode, so no second system call is needed.
Except that I don't see that there's anything to do once you've found a None-valued attribute *except* make that extra call. If there's a use case where you find one of the attributes is None and then not get the value from the system, I agree with you. If there isn't, then you might as well roll that one use case into the object rather than force every client to do the stat call and extract the information from it in that case.

On 11/15/12, Mike Meyer <mwm@mired.org> wrote:
On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett@gmail.com> wrote:
So you're basically saying that you want to know whether an explicit stat call would make a difference? (Other than updating the information if it has changed.)
Except that I don't see that there's anything to do once you've found a None-valued attribute *except* make that extra call.
ah. I was thinking of reporting, where you could just leave a column off the report. Or of some sort of optimization, where knowing the size (or last change date) is not required, but may be helpful. I suppose these might be sufficiently uncommon that triggering the extra stat call instead of returning None might be justified. -jJ

On Wed, Nov 14, 2012 at 11:53 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
That's one way of looking at it. The problem is that you tell if a value has been filled or not by having a None value. But st_mode is itself multi-valued, and you don't always get all available value. Maybe d_type should be it's own attribute? If readdir returns it, we use it as is. If not, then the caller either does the None/stat dance or we make it a property that gets filled from the stat structure. Thanks, <mike

I'm inclined to KISS and just let the caller handle it. Many other things in the "os" module are system dependent, including os.stat(), so if the stat_results results returned by iterdir_stat() are system dependent, that's just par for the course. I'm thinking of a docstring something like: """Yield tuples of (filename, stat_result) for each filename in directory given by "path". Like listdir(), '.' and '..' are skipped. The values are yielded in system-dependent order. Each stat_result is an object like you'd get by calling os.stat() on that file, but not all information is present on all systems, and st_* fields that are not available will be None. In practice, stat_result is a full os.stat() on Windows, but only the "is type" bits of the st_mode field are available on Linux/OS X/BSD. """ So in real life, if you're using more than stat.S_ISDIR() of st_mode, you'll need to call stat separately. But 1) it's quite rare to need eg the permissions bits in this context, and 2) if you're expecting st_mode to have that extra stuff your code's already system-dependent, as permission bits don't mean much on Windows. But the main point is that what the OS gives you for free is easily available. -Ben

On Nov 15, 2012 1:40 AM, "Ben Hoyt" <benhoyt <benhoyt@gmail.com>@<benhoyt@gmail.com> gmail.com <benhoyt@gmail.com>> wrote:
There's a code smell here, in that the doc for Unix variants is incomplete and wrong. Whether or not you get the d_type values depends on the OS having that extension. Further, there's a d_type value (DT_UNKNOWN) that isn't a valid value for the S_IFMT bits in st_mode (at least on BSD).
If the goal is to make os.walk fast, then it might be better (on Posix systems, anyway) to see if it can be built on top of ftw instead of low-level directory scanning routines. <mike

From: Mike Meyer <mwm@mired.org> Sent: Thu, November 15, 2012 2:29:44 AM
If the goal is to make os.walk fast, then it might be better (on Posix systems,
anyway) to see if it can be built on top of ftw instead of low-level directory scanning routines.
You can't actually use ftw, because it doesn't give any way to handle the options to os.walk. Plus, it's "obsolescent" (POSIX2008), "deprecated" (linux), or "legacy" (OS X), and at least some platforms will throw warnings when you call it. It's also somewhat underspecified, and different platforms, even different versions of the same platform, will give you different behavior in various cases (especially with symlinks). But you could, e.g., use fts on platforms that have it, nftw on platforms that have a version close enough to recent linux/glibc for our purposes, and fall back to readdir+stat for the rest. That could give a big speed improvement on the most popular platforms, and on the others, at least things would be no worse than today (and anyone who cared could much more easily write the appropriate nftw/fts/whatever port for their platform once the framework was in place).

Not sure I understand why the docstring above is incomplete/wrong. I say "Not all information is present on all systems" and "In practice ... only the 'is type' bits of the st_mode field are available on Linux/OS X/BSD". All three of those systems provide d_type, so I think that's correct. -Ben

On Nov 15, 2012 2:06 PM, "Ben Hoyt" <benhoyt@gmail.com> wrote:
incomplete
It's incomplete because it doesn't say what happens on other Posix systems. It's wrong because it implies that the type bits of st_mode are always available, when that's not the case. Better would be 'on Posix systems, if st_mode is not None only the type bits are valid.' Assuming that the underlying code translates DT_UNKNOWN to binding st_mode to None.

On 11/15/12, Mike Meyer <mwm@mired.org> wrote:
The specification allows other fields as well; is it really the case that *no* filesystem supports them? Perhaps: """Yield tuples of (filename, stat_result) for each file in the "path" directory, excluding the '.' and '..' entries. The order of results is arbitrary, and the effect of modifying a directory after generator creation is filesystem-dependent. Each stat_result is similar to the result of os.stat(filename), except that only the directory entry itself is examined; any attribute which would require a second system call (even os.stat) is set to None. In practice, Windows will typically fill in all attributes; other systems are most likely to fill in only the "is type" bits, or even nothing at all. """ -jJ

I'm pretty much convinced that - if the primary goal is to speed up os.walk by leveraging the Windows calls and the existence of d_type on some posix file systems - the proposed iterdir_stat interface is about as good as we can do. However, as a tool for making it easy to iterate through files in a directory getting some/all stat information, I think it's ugly. It's designed specifically for one system, with another common case sort of wedged in. There's no telling how well it will be handle any other systems, but I can see that they might be problematical. Worse yet, you wind up with stat information you can't trust, so have to basically write code to access multiple attributes like: if st.attr1 is None: st = os.stat(...) func(st.attr1) if st.attr2 is None: st = os.stat(...) func(st.attr2) Not bad if you only want one or two values, but ugly if you want four or more. I can see a number of alternatives to improve this situation: 1) wrap the return partial stat info in a proxy object that will do a real stat if a request is made for a value that isn't there. This has already been rejected. 2) Make iterdir_stat an os.walk internal tool, and don't export it. 3) Add some kind of "we have a full stat" indicator, so that clients that want to use lots of attributes can just check that and do the stat if needed. 4) Pick and document one of the a stat values as a "we have a full stat" indicator, to use like case 3. 5) Add a keyword argument to iterdir_stat that causes it to always just do the full stat. Actually, having three modes might be useful: the default is None, which is the currently proposed behavior. Setting it to True causes the full stat always be done, and setting it to False just returns file names. 6) Depreciate os.walk, and provide os.itertree with an interface that lets us leverage the available tools better. That's a whole other can of worms, though. Thanks, <mike

Greg Ewing suggested:
+1 for following that seventh path. It offers the additional benefit for the library code, that constraints of the backend functionality used are more clearer to handle: If requested and available allthough expensive, "yield nevertheless the attribute values" is then a valid strategy. All the best, Stefan.

I don't love most of these solutions as they seem to complicate things. I'd be happy with 2) if it came to it -- but I think it's a very useful tool, especially on Windows, because it means Windows users would have "pure Python access" to FindFirst/FindNext and a very good speed improvement for os.walk.
Ah, that's an interesting idea. It's nice and explicit. I'd go for either the status quo I've proposed or this one. Though only thing I'd want to push for is a clean API. Any suggestions? Mine would be something like: for filename, st in iterdir_stat(path, stat_fields=['st_mode', 'st_size']: ... However, you might also need 'd_type' field option, because eg for os.walk() you don't actually need all of st_mode, just the type info. This is odd (that most of the fields are named "st_*" but one "d_type") but not terrible. -Ben

From: Mike Meyer <mwm@mired.org> Sent: Thu, November 15, 2012 2:29:44 AM
If the goal is to make os.walk fast, then it might be better (on Posix systems,
anyway) to see if it can be built on top of ftw instead of low-level directory scanning routines.
After a bit of experimentation, I'm not sure there actually is any significant improvement to be had on most POSIX systems working that way. Looking at the source from FreeBSD, OS X, and glibc, they all call stat (or a stat family call) on each file, unless you ask for no stat info. A quick test on OS X shows that calling fts via ctypes is about 5% faster than os.walk, and 5% slower than find -ls or find -mtime (which will stat every file). Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk with the stat calls hacked out, and 40% slower than find. So, a "nostat" option is a potential performance improvement, but switching to ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.

On Fri, Nov 16, 2012 at 4:32 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
I agree with that, so long as you have to stay with the os.walk interface.
Looking at the source from FreeBSD, OS X, and glibc, they all call stat (or a stat family call) on each file, unless you ask for no stat info.
Right. They either they give you *all* the stat information, or they don't give you *any* of it. So there's no way to use it to create the directory/other split in os.walk without doing the stat calls.
Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk with the stat calls hacked out, and 40% slower than find.
That's actually a good thing to know. With FTS_NOSTAT, fts winds up using the d_type field to figure out what's a directory (assuming you were on a file system that has those). That's the proposed change for os.walk, so we now have an estimate of how fast we can expect it to be. I'm surprised that it's slower than find. The FreeBSD version of find uses fts_open/fts_read. Could it be that used FTS_NOCHDIR to emulate os.walk, whereas find doesn't? <mike

I'm not sure I'd put too much confidence on the 3x difference as generally applicable to POSIX. Apple uses FreeBSD's fts unmodified, even though in a quick browser I saw at least one case where a trivial change would have made a difference (the link count check that's only used with ufs/nfs/nfs4/ext2fs would also work on hfs+). Also, OS X with HFS+ drives does some bizarre stuff with disk caching, especially with an SSD (which in itself probably changes the performance characteristics). But I'd guess it's somewhere in the right ballpark, and if anything it'll probably be even more improvement on FreeBSD and linux than on OS X.
No, just FTS_PHYSICAL (with or without FTS_NOSTAT). It looks like more than half of the difference is due to print(ent.fts_path.decode('utf8')) in Python vs. puts(entry->fts_path) in find (based on removing the print entirely). I don't think it's worth the effort to investigate further—let's get the 3x faster before we worry about the last 40%… But if you want to, the source I used is at https://github.com/abarnert/py-fts

Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk with the stat calls hacked out, and 40% slower than find.
Relatedly, I've just finished a proof-of-concept version of iterdir_stat() for Linux using readdir_r and ctypes, and it was only about 10% faster than the existing os.walk on large directories. I was surprised by this, given that I saw a 400% speedup removing the stat()s on Windows, but I guess it means that stat() and/or system calls in general are *much* faster or better cached on Linux. Still, it's definitely worth the huge speedup on Windows, and I think it's the right thing to use the dirent d_type info on Linux, even though the speed gain is small -- it's still faster, and it still saves all those os.stat()s. Also, I'm doing this via ctypes in pure Python, so doing it in C may give another small boost especially for the Linux version. If anyone wants to test what speeds they're getting on Linux or Windows, or critique my proof of concept, please try it at https://github.com/benhoyt/betterwalk -- just run "python benchmark.py [directory]" on a large directory. Note this is only a proof of concept at this stage, not hardened code!
So, a "nostat" option is a potential performance improvement, but switching to ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.
Agreed. Also, this is beyond the scope of my initial suggestion. -Ben

On 11/18/2012 3:52 PM, Ben Hoyt wrote:
Linux readdir can return the file type, but it's always going to be DT_LNK for symlinks, requiring an extra stat call if followlinks=True. Windows functions return both "is a symlink" and "is a directory" in a single call (ntpath.isdir / nt._isdir takes advantage of this, and the same information is present in the Find functions' data) but not other information about the symlink target beyond whether it is a directory. Neither python's existing stat function nor anything proposed here has a way of representing this, but it would be useful for os.walk to be able to get this information.

Am 09.11.2012 11:29, schrieb Ben Hoyt:
+1 for something like yielddir(). I while ago I proposed that the os module shall get another function for iterating over a directory. The new function is to return a generator that yields structs. The structs contain the name and additional metadata that like the d_type. On Unix it should use the reentrant version of readdir(), too. A struct has the benefit that it can grow additional fields or contain operating dependent information like inode on Unix. Christian

On 09.11.12 16:42, Christian Heimes wrote:
+1 for something like yielddir().
See also http://bugs.python.org/issue11406.
The only fields in the dirent structure that are mandated by POSIX.1 are: d_name[], of unspecified size, with at most NAME_MAX characters preceding the terminating null byte; and (as an XSI extension) d_ino. The other fields are unstandardized, and not present on all systems.

Am 09.11.2012 15:54, schrieb Serhiy Storchaka:
The only fields in the dirent structure that are mandated by POSIX.1 are: d_name[], of unspecified size, with at most NAME_MAX characters preceding the terminating null byte; and (as an XSI extension) d_ino. The other fields are unstandardized, and not present on all systems.
I'm well aware of the fact.The idea is to use as many information as we can get for free and acquire missing information from other sources like stat(). Also some information depend on the file system: Currently, only some file systems (among them: Btrfs, ext2, ext3, and ext4) have full support returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN.

Ben Hoyt <benhoyt@...> writes:
Sounds good. I recently answered a Stack Overflow question [1] which showed Python performing an order of magnitude slower than Ruby. Ruby's Dir implementation is written in C and less flexible than os.walk, but there's room for improvement, as you've shown. Regards, Vinay Sajip [1] http://stackoverflow.com/questions/13138160/benchmarks-does-python-have-a-fa...

On 09.11.12 22:30, Vinay Sajip wrote:
This is not so much about walking, as about recursive globbing. See http://bugs.python.org/issue13968. Also note that os.fwalk can be must faster than os.walk (if OS supports it).

On Fri, Nov 9, 2012 at 8:29 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
It's even worse when you add NFS (and other network filesystems) into the mix, so yes, +1 on devising a more efficient API design for bulk stat retrieval than the current listdir+explicit-stat approach that can lead to an excessive number of network round trips. It's a complex enough idea that it definitely needs some iteration outside the stdlib before it could be added, though. You could either start exploring this as a new project, or else if you wanted to fork my walkdir project on BitBucket I'd be interested in reviewing any pull requests you made along those lines - redundant stat calls are currently one of the issues with using walkdir for more complex tasks. (However you decide to proceed, you'll need to set things up to build an extension module, though - walkdir is pure Python at this point). Another alternative you may want to explore is whether or not Antoine Pitrou would be interested in adding such a capability to his pathlib module. pathlib already includes stat result caching in Path objects, and thus may be able to support a clean API for returning path details with the stat results precached. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

It seems many folks think that an os.iterdir() is a good idea, and some that agree that something like os.iterdir_stat() for efficient directory traversal + stat combination is a good idea. And if we get a faster os.walk() for free, that's great too. :-) Nick Coughlan mentioned his walkdir and Antoine's pathlib. While I think these are good third-party libraries, I admit I'm not the biggest fan of either of their APIs. HOWEVER, mainly I think that the stdlib's os.listdir() and os.walk() aren't going away anytime soon, so we might as well make incremental (though significant) improvements to them in the meantime. So I'm going to propose a couple of minimally-invasive changes (API- wise), in what I think is order of importance, highest to lowest: 1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup on Windows by not calling stat() on each file. And on Linux/BSD this same data is available from readdir()'s dirent, so I presume there's be a similar speedup, though it may not be quite 5x. 2) I also propose adding os.iterdir(path='.') to do exactly the same thing as os.listdir(), but yield the results as it gets them instead of returning the whole list at once. 3) Partly for implementing the more efficient walk(), but also for general use, I propose adding os.iterdir_stat() which would be like iterdir but yield (filename, stat) tuples. If stat-while-iterating isn't available on the system, the stat item would be None. If it is available, the stat_result fields that the OS presents would be available -- the other fields would be None. In practice, iterdir_stat() would call FindFirst/Next on Windows and readdir_r on Linux/BSD/Mac OS X, and be implemented in posixmodule.c. This means that on Linux/BSD/Mac OS X it'd return a stat_result with st_mode set but the other fields None, on Windows it'd basically return the full stat_result, and on other systems it'd return (filename, None). The usage pattern (and exactly how os.walk would use it) would be as follows: for filename, st in os.iterdir_stat(path): if st is None or st.st_mode is None: st = os.stat(os.path.join(path, filename)) if stat.S_ISDIR(st.st_mode): # handle directory else: # handle file I'm very keen on 1). And I think adding 2) and 3) make sense, because they're (a) asked for by various folks, (b) fairly simple and self- explanatory APIs, and (c) they'll be needed to implement the faster os.walk() anyway. Thoughts? What's the next step? If I come up with a patch against posixmodule.c, tests, etc, is this likely to be accepted? I could also flesh out my pure-Python proof of concept [1] to do what I'm suggesting above and go from there... Thanks, Ben [1] https://gist.github.com/4044946 On Sat, Nov 10, 2012 at 1:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

On Mon, Nov 12, 2012 at 7:17 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
The issue with patching the stdlib directly rather than releasing something on PyPI is that you likely won't get any design or usability feedback until the first 3.4 alpha, unless it happens to catch the interest of someone willing to tinker with a patched version earlier than that. It only takes one or two savvy users to get solid feedback by publishing something on PyPI (that's all I got for contextlb2, and the design of contextlib.ExitStack in 3.3 benefited greatly from the process). Just the discipline of writing docs, tests and giving people a rationale for downloading your module can help a great deal with making the case for the subsequent stdlib change. The reason I suggested walkdir as a possible venue is that I think your idea here may help with some of walkdir's *other* API design problems (of which there are quite a few, which is why I stopped pushing for it as a stdlib addition in its current state - it has too many drawbacks to be consistently superior to rolling your own custom solution) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 2012-11-12 09:17, Ben Hoyt wrote:
I'm not sure that I like "st is None or st.st_mode is None". You say that if a stat field is not available, it's None. That being the case, if no stat fields are available, couldn't their fields be None? That would lead to just "st.st_mode is None".
[snip]

On 12 Nov, 2012, at 10:17, Ben Hoyt <benhoyt@gmail.com> wrote:
Where would st_mode be retrieved from? The readdir(3) interface only provides d_type (and that field is not in POSIX or SUS). The d_type field contains a file type, and while you could use that to construct a value for st_mode that can be used to test the file type, you cannot reconstruct the file permissions from that. Ronald

On 13 Nov, 2012, at 8:13, Ben Hoyt <benhoyt@gmail.com> wrote:
It would be very odd to have an st_mode that contains a subset of the information the platform can provide. In particular having st_mode would give the impression that it is the full mode. Why not return the information that can be cheaply provided, is useful and can be provided by major platforms (at least Windows, Linux and OS X)? And "useful" would be information for which there is a clear usecase, such as the filetype as it can significantly speed up os.walk. OTOH, FindNextFile on Windows can return most information in struct stat. Ronald

Yes, it's slightly odd, but not as odd as you'd think. This is especially true for Windows users, because we're used to st_mode only being a subset of the information -- the permission bits are basically meaningless on Windows. The alternative is to introduce yet another new tuple/struct with "type size atime ctime mtime" fields. But you still have to specify that it's implementation dependent (Linux/BSD only provides type, Windows provides all those fields), and then you have to have ways of testing what type the type is. stat_result and the stat module already give you those things, which is why I think it's best to stick with the stat_result structure. In terms of what's useful, certainly "type" and "size" are, so you may as well throw in atime/ctime/mtime, which Windows also gives us for free. -Ben

On 13 Nov, 2012, at 21:00, Ben Hoyt <benhoyt@gmail.com> wrote:
That's one more reason for returning a new tuple/struct with a type field: the full st_mode is not useful on Windows, and on Unix readdir doesn't return a full st_mode in the first place.
The interface of the stat module for determining the file type is not very pretty.
How did you measure the 5x speedup you saw with you modified os.walk? It would be interesting to see if Unix platforms have a simular speedup, because if they don't the new API could just return the results of stat (or lstat ...). Ronald

Hmmm, I'm not sure I agree: st_mode from the new iterdir_stat() will be as useful as that currently returned by os.stat(), and it is very useful (mainly for checking whether an entry is a directory or not). You're right that it won't return a full st_mode on Linux/BSD, but I think it's better for folks to use the existing "if stat.S_ISDIR(st.st_mode): ..." idiom than introduce a new thing.
How did you measure the 5x speedup you saw with you modified os.walk?
Just by os.walk()ing through a large directory tree with basically nothing in the inner loop, and comparing that to the same thing with my version.
It would be interesting to see if Unix platforms have a simular speedup, because if they don't the new API could just return the results of stat (or lstat ...).
Yeah, true. I'll do that and post results in the next few days when I get it done. I'm *hoping* for a similar speedup there too, given the increase in system calls, but you never know till you benchmark ... maybe system calls are much faster on Linux, or stat() is cached better or whatever. -Ben

On Wed, Nov 14, 2012 at 5:14 PM, Ronald Oussoren <ronaldoussoren@mac.com>wrote:
One thing to keep in mind with these kind of metrics is that I/O latency is a major factor. Solid state vs spinning disk vs network drive is going to make a *big* difference to the relative performance of the different mechanisms. With NFS (et al), it's particularly important to minimise the number of round trips to the server (that's why the new dir listing caching in the 3.3 import system results in such dramatic speed-ups when some of the sys.path entries are located on network drives). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Nov 14, 2012 at 10:33 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Data from bzr: you can get a very significant speed up by doing two things: - use readdir to get the inode numbers of the files in the directory and stat the files in-increasing-number-order. (this gives you monotonically increasing IO). - chdir to the directory before you stat and use a relative path: it turns out when working with many files that the overhead of absolute paths is substantial. We got a (IIRC 90% reduction in 'bzr status' time applying both of these things, and you can grab the pyrex module needed to do readdir from bzr - though we tuned what we had to match the needs of a VCS, so its likely too convoluted for general purpose use). -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services

Huh, very interesting, thanks. On the first point, did you need to separately stat() the files after the readdir()? Presumably you needed information other than the info in the d_type field from readdir's dirent struct.
Do you have a web link to said source code? I'm having trouble (read: being lazy) figuring out the bzr source repo. -Ben

Le Wed, 14 Nov 2012 22:53:44 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
This assumes directory entries are sorted by inode number (in a btree, I imagine). Is this assumption specific to some Linux / Ubuntu filesystem?
How about using fstatat() instead? chdir() is a no-no since it's a process-wide setting. Regards Antoine.

On Nov 14, 2012 4:55 AM, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
It doesn't assume that, because inodes aren't stored in directories on Posix file systems. Instead, they have names & inode numbers . The inode (which is where the data stat returns lives) is stored elsewhere in the file system, typically in on-disk arrays indexed by inode number (and that's grossly oversimplified). I'm not sure how this would work on modern file systems (zfs, btrfs). <mike

Mike Meyer wrote:
It seems to assume that the inodes referenced by a particular directory will often be clustered, so that many of them reside in the same or nearby disk blocks. But it's hard to imagine this having much effect with the large file system caches used these days. -- Greg

On Wed, Nov 14, 2012, at 5:52, Antoine Pitrou wrote:
I think he was proposing listing the whole directory in advance (which os.walk already does), sorting it, and then looping over it calling stat. If the idea is for an API that exposes more information returned by readdir, though, why not get d_type too when it's available?
A) is fstatat even implemented in python? B) is fstatat even possible under windows? C) using *at functions for this the usual way incurs overhead in the form of having to maintain a number of open file handles equal to the depth of your directory. IIRC, some gnu tools will fork the process to avoid this limit. Though, since we're not doing this for security reasons we could fall back on absolute [or deep relative] paths or reopen '..' to ascend back up, instead. D) you have to close those handles eventually. what if the caller doesn't finish the generator?.

On Wed, 14 Nov 2012 13:20:31 -0500 random832@fastmail.us wrote:
But I don't understand why sorting (by inode? by name?) would make stat() calls faster. That's what I'm trying to understand.
Yup, it's available as a special parameter to os.stat(): http://docs.python.org/dev/library/os.html#os.stat
B) is fstatat even possible under windows?
No, but Windows has its own functions to solve the issue (as explained elsewhere in this thread).
Indeed. But directory trees are usually much wider than they are deep. Regards Antoine.

On Thu, Nov 15, 2012 at 7:43 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Less head seeks. Consider 1000 files in a directory. They will likely be grouped on the disk for locality of reference (e.g. same inode group) though mv's and other operations mean this is only 'likely' not 'a given'. when you do 1000 stat calls in a row, the kernel is working with a command depth of 1, so it can't elevator seek optimise the work load. If it sees linear IO it will trigger readahead, but thats less likely. What the sort by inode does is mean that the disk head never needs to seek backwards, so you get a single outward pass over the needed sectors. This is a widely spread technique: http://git.661346.n2.nabble.com/RFC-HACK-refresh-index-lstat-in-inode-order-... is a recent recurrence of the discussion, with sources cited. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services

On Wed, Nov 14, 2012 at 11:52 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Its definitely not applicable globally ( but its no worse in general than any arbitrary sort, so safe to do everywhere). On the ext* family of file systems inode A < inode B implies inode A is located on a lower sector than B.
fstatat looks *perfect*. Thanks for the pointer. I forget the win32 behaviour, but on Linux a thread is a process -> chdir is localised to the process and not altered across threads. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services

Robert Collins wrote:
chdir is a very bad idea for libraries on Windows - it is meant mainly for the user and not the application. It has also been removed completely for Windows 8 apps (not desktop applications, just the new style ones). Full paths may involve some overhead, but at least with FindFirst/NextFile this should be limited to memory copies and not file lookups. Cheers, Steve

On Wed, Nov 14, 2012, at 13:55, Robert Collins wrote:
I forget the win32 behaviour, but on Linux a thread is a process -> chdir is localised to the process and not altered across threads.
This claim needs to be unpacked, to point out where you are right and where you are wrong: Linux threads are processes: true. cwd _can be_ localized to a single [thread] process rather than shared in the whole [traditional] process: also true. (see http://linux.die.net/man/2/clone, note specifically the CLONE_FS flag) cwd _actually is_ localized to a process when created in the ordinary manner as a thread: false. (see http://linux.die.net/man/7/pthreads )

On Wed, Nov 14, 2012 at 11:54 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
cwd is thread-safe on unix (well, Linux anyhow :P), and the native OS dir walking on Windows is better itself. The only definitions I can find for fwalk are on BSD, not on Linux, so fwalk is also likely something that would need porting; The definitions I did find take a vector of file pointers, and so are totally irrelevant for the point I made. -Rob

Yeah, you're right. The benchmarks I've been doing are only stable after the first run, when I suppose everything is cached and you've taken most of the I/O latency out of the picture. I guess this also means the speed-up for the first run won't be nearly as significant. But I'll do some further testing. -Ben

My very first post, and I screw up the reply to list option. Sorry about that. On 11/13/2012 2:07 AM, Ronald Oussoren wrote:
I think he had the idea that it would just return this incomplete st_mode, and you'd have to deal with it. But maybe the solution is to add a st_type property to stat_result, that returns either this or the appropriate bits from st_mode - is there a reason these are a single field other than a historical artifact of the fact that they are/were in a single 16-bit field in UNIX? Note that according to the documentation not all filesystems on linux support d_type, either. You have to be prepared for the possibility of getting DT_UNKNOWN

P.S. Something else occurs to me. In the presence of reparse points (i.e. symbolic links) on win32, I believe information about whether the destination is meant to be a directory is still provided (I haven't confirmed this, but I know you're required to provide it when making a symlink). This is discarded when the st_mode field is populated with the information that it is a symlink. If the goal is "speed up os.walk", it might be worth keeping this information and using it in os.walk(..., followlinks=True) - maybe the windows version of the stat result has a field for the windows attributes? It's arguable, though, that symbolic links on windows are rare enough not to matter.

On 11/12/12, Ben Hoyt <benhoyt@gmail.com> wrote:
I know that two functions may be better than a keyword, but a combinatorial explosion of functions ... isn't. Even given that listdir can't change for backwards compatibility, and given that iteration might be better for large directories, I'm still not sure an exact analogue is worth it. Could this be someone combined with your 3rd proposal? For example, instead of returning a list of str (or bytes) names, could you return a generator that would yield some sort of File objects? (Well, obviously you *could*, the question is whether that goes too far down the rabbit hole of what a Path object should have.) My strawman is an object such that (a) No extra system calls will be made just to fill in data not available from the dir entry itself. I wouldn't even promise a name, though I can't think of a sensible directory listing that doesn't provide the name. (b) Any metadata you do have -- name, fd, results of stat, results of getxattr ... will be available as an attribute. That way users of filesystems that do send back the size or type won't have to call stat. (c) Attributes will default to None, supporting the "if x is None: x=stat()" pattern for the users who do care about attributes that were not available quickly. (If there is an attribute for which "None" is actually meaningful, the user can use hasattr -- but that is a corner case, not worth polluting the API for.) *Maybe* teach open about these objects, so that it can look for the name or fd attributes. Alternatively, it could return a str (or bytes) subclass that has the other attributes when they are available. That seems a bit contrived, but might be better for backwards compatibility. (os.walk could return such objects too, instead of extracting the name.) -jJ

"Combinatorial explosion of functions"? Slight exaggeration. :-) I'm proposing adding two new functions, iterdir and iterdir_stat. And I definitely think two new functions with pretty standard names and almost self-documenting signatures is much simpler than keyword args and magical return values. For example, Python 2.x has dict.items() and dict.iteritems(), and it's clear what the "iter" version does at a glance.
I considered this, as well as a str subclass with a ".stat" attribute. But iterdir_stat's (filename, stat) tuple is much more explicit, whereas the str subclass just seemed too magical -- though I might be able to be convinced otherwise. Yes, as you mention, I really think the File/Path object is a rabbit hole, and the intent of my proposal is very minimal changes / minimal learning curve -- an incremental addition. -Ben

On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
Two questions: 1) Is there some way to distinguish that your st_mode field is only partially there (i.e. - you get the Linux/BSD d_type value, but not the rest of st_mode)? 2) How about making these attributes properties, so that touching one that isn't there causes them all to be populated. <mike

Very good questions.
Not currently. I haven't thought about this too hard -- there way be a bit that's always set/not set within st_mode itself. Otherwise I'd have to add a full_st_mode or similar property
2) How about making these attributes properties, so that touching one that isn't there causes them all to be populated.
Interesting thought. Might be a bit too magical for my taste. One concern I have with either of the above (adding a property to stat_result) is that then I'd have to use my own class or namedtuple rather than os.stat_result. Presumably that's not a big deal, though, as you'd never do isinstance testing on it. In the case of 2), it might have to be a class, meaning somewhat heavier memory-wise for lots and lots of files. -Ben

It's not a bad idea, but neither am I super-keen on it, because of these two reasons: 1) You've have to add a whole new way / set of constants / functions to test for the different values of d_type. Whereas there's already stuff (stat module) to test for st_mode values. 2) It'd make the typical use case more complex, for example, the straight "if st.st_mode is None ... else ..." I gave earlier becomes this: for filename, st in iterdir_stat(path): if st.d_type is None: if st.st_mode is None: st = os.stat(os.path.join(path, filename)) is_dir = stat.S_ISDIR(st.st_mode) else: is_dir = st.d_type == DT_DIR -Ben

On 11/15/2012 4:50 PM, Ben Hoyt wrote:
if st.d_type is None: st = os.stat(os.path.join(path, filename)) is_dir = st.d_type == DT_DIR Of course, your code would ultimately be more complex anyway since when followlinks=True you want to use isdir, and when it's not you want to use lstat. And consider what information iterdir_stat actually returns when the results are symlinks (if it's readdir/d_type, it's going to say "it's a symlink" and you need to try again to followlinks, if it's WIN32_FIND_DATA you have the information for both in principle, but the stat structure can only include one. Do we need an iterdir_lstat? If so, should iterdir_stat return None if d_type is DT_LNK, or DT_LNK?) ...and ultimately deprecating the S_IS___ stuff. It made sense in the 1970s when there was real savings in packing your 4 bits of type information and your 12 bits of permission information in a single 16-bit field, now it's just a historical artifact that seems like the only reason for it is a vague "make os feel like C on Unix" principle.

On 11/14/12, Mike Meyer <mwm@mired.org> wrote:
On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
On 11/12/12, Ben Hoyt <benhoyt@gmail.com> wrote:
Two questions:
os.iterdir did not call stat; you have partial information. Or are you saying that you want to distinguish between "This filesystem doesn't track that information", "This process couldn't get that information right now", and "That particular piece of information requires a second call that hasn't been made yet"?
2) How about making these attributes properties, so that touching one that isn't there causes them all to be populated.
Part of the motivation was to minimize extra system calls; that suggests making another one should be a function call instead of a property. That said, I can see value in making that optional call return something with the same API. Perhaps: if thefile.X is None: thefile.restat() # namespace clash, but is "restat" really a likely file attribute? or if thefile.X is None: thefile=stat(thefile) -jJ

On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett@gmail.com> wrote:
On 11/14/12, Mike Meyer <mwm@mired.org> wrote:
On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett@gmail.com>
wrote:
Note that you're eliding the proposal these questions were about, that os.iterdir return some kind of object that had attributes that carried the stat values, or None if they weren't available.
I want to distinguish between the case where st_mode is filled from the BSD/Unix d_type directory entry - meaning there is information so st_mode is not None, but the information is incomplete and requires a second system call to fetch - and the case where it's filled via the Windows calls which provide all the information that is available for st_mode, so no second system call is needed.
Except that I don't see that there's anything to do once you've found a None-valued attribute *except* make that extra call. If there's a use case where you find one of the attributes is None and then not get the value from the system, I agree with you. If there isn't, then you might as well roll that one use case into the object rather than force every client to do the stat call and extract the information from it in that case.

On 11/15/12, Mike Meyer <mwm@mired.org> wrote:
On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett@gmail.com> wrote:
So you're basically saying that you want to know whether an explicit stat call would make a difference? (Other than updating the information if it has changed.)
Except that I don't see that there's anything to do once you've found a None-valued attribute *except* make that extra call.
ah. I was thinking of reporting, where you could just leave a column off the report. Or of some sort of optimization, where knowing the size (or last change date) is not required, but may be helpful. I suppose these might be sufficiently uncommon that triggering the extra stat call instead of returning None might be justified. -jJ

On Wed, Nov 14, 2012 at 11:53 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
That's one way of looking at it. The problem is that you tell if a value has been filled or not by having a None value. But st_mode is itself multi-valued, and you don't always get all available value. Maybe d_type should be it's own attribute? If readdir returns it, we use it as is. If not, then the caller either does the None/stat dance or we make it a property that gets filled from the stat structure. Thanks, <mike

I'm inclined to KISS and just let the caller handle it. Many other things in the "os" module are system dependent, including os.stat(), so if the stat_results results returned by iterdir_stat() are system dependent, that's just par for the course. I'm thinking of a docstring something like: """Yield tuples of (filename, stat_result) for each filename in directory given by "path". Like listdir(), '.' and '..' are skipped. The values are yielded in system-dependent order. Each stat_result is an object like you'd get by calling os.stat() on that file, but not all information is present on all systems, and st_* fields that are not available will be None. In practice, stat_result is a full os.stat() on Windows, but only the "is type" bits of the st_mode field are available on Linux/OS X/BSD. """ So in real life, if you're using more than stat.S_ISDIR() of st_mode, you'll need to call stat separately. But 1) it's quite rare to need eg the permissions bits in this context, and 2) if you're expecting st_mode to have that extra stuff your code's already system-dependent, as permission bits don't mean much on Windows. But the main point is that what the OS gives you for free is easily available. -Ben

On Nov 15, 2012 1:40 AM, "Ben Hoyt" <benhoyt <benhoyt@gmail.com>@<benhoyt@gmail.com> gmail.com <benhoyt@gmail.com>> wrote:
There's a code smell here, in that the doc for Unix variants is incomplete and wrong. Whether or not you get the d_type values depends on the OS having that extension. Further, there's a d_type value (DT_UNKNOWN) that isn't a valid value for the S_IFMT bits in st_mode (at least on BSD).
If the goal is to make os.walk fast, then it might be better (on Posix systems, anyway) to see if it can be built on top of ftw instead of low-level directory scanning routines. <mike

From: Mike Meyer <mwm@mired.org> Sent: Thu, November 15, 2012 2:29:44 AM
If the goal is to make os.walk fast, then it might be better (on Posix systems,
anyway) to see if it can be built on top of ftw instead of low-level directory scanning routines.
You can't actually use ftw, because it doesn't give any way to handle the options to os.walk. Plus, it's "obsolescent" (POSIX2008), "deprecated" (linux), or "legacy" (OS X), and at least some platforms will throw warnings when you call it. It's also somewhat underspecified, and different platforms, even different versions of the same platform, will give you different behavior in various cases (especially with symlinks). But you could, e.g., use fts on platforms that have it, nftw on platforms that have a version close enough to recent linux/glibc for our purposes, and fall back to readdir+stat for the rest. That could give a big speed improvement on the most popular platforms, and on the others, at least things would be no worse than today (and anyone who cared could much more easily write the appropriate nftw/fts/whatever port for their platform once the framework was in place).

Not sure I understand why the docstring above is incomplete/wrong. I say "Not all information is present on all systems" and "In practice ... only the 'is type' bits of the st_mode field are available on Linux/OS X/BSD". All three of those systems provide d_type, so I think that's correct. -Ben

On Nov 15, 2012 2:06 PM, "Ben Hoyt" <benhoyt@gmail.com> wrote:
incomplete
It's incomplete because it doesn't say what happens on other Posix systems. It's wrong because it implies that the type bits of st_mode are always available, when that's not the case. Better would be 'on Posix systems, if st_mode is not None only the type bits are valid.' Assuming that the underlying code translates DT_UNKNOWN to binding st_mode to None.

On 11/15/12, Mike Meyer <mwm@mired.org> wrote:
The specification allows other fields as well; is it really the case that *no* filesystem supports them? Perhaps: """Yield tuples of (filename, stat_result) for each file in the "path" directory, excluding the '.' and '..' entries. The order of results is arbitrary, and the effect of modifying a directory after generator creation is filesystem-dependent. Each stat_result is similar to the result of os.stat(filename), except that only the directory entry itself is examined; any attribute which would require a second system call (even os.stat) is set to None. In practice, Windows will typically fill in all attributes; other systems are most likely to fill in only the "is type" bits, or even nothing at all. """ -jJ

I'm pretty much convinced that - if the primary goal is to speed up os.walk by leveraging the Windows calls and the existence of d_type on some posix file systems - the proposed iterdir_stat interface is about as good as we can do. However, as a tool for making it easy to iterate through files in a directory getting some/all stat information, I think it's ugly. It's designed specifically for one system, with another common case sort of wedged in. There's no telling how well it will be handle any other systems, but I can see that they might be problematical. Worse yet, you wind up with stat information you can't trust, so have to basically write code to access multiple attributes like: if st.attr1 is None: st = os.stat(...) func(st.attr1) if st.attr2 is None: st = os.stat(...) func(st.attr2) Not bad if you only want one or two values, but ugly if you want four or more. I can see a number of alternatives to improve this situation: 1) wrap the return partial stat info in a proxy object that will do a real stat if a request is made for a value that isn't there. This has already been rejected. 2) Make iterdir_stat an os.walk internal tool, and don't export it. 3) Add some kind of "we have a full stat" indicator, so that clients that want to use lots of attributes can just check that and do the stat if needed. 4) Pick and document one of the a stat values as a "we have a full stat" indicator, to use like case 3. 5) Add a keyword argument to iterdir_stat that causes it to always just do the full stat. Actually, having three modes might be useful: the default is None, which is the currently proposed behavior. Setting it to True causes the full stat always be done, and setting it to False just returns file names. 6) Depreciate os.walk, and provide os.itertree with an interface that lets us leverage the available tools better. That's a whole other can of worms, though. Thanks, <mike

Greg Ewing suggested:
+1 for following that seventh path. It offers the additional benefit for the library code, that constraints of the backend functionality used are more clearer to handle: If requested and available allthough expensive, "yield nevertheless the attribute values" is then a valid strategy. All the best, Stefan.

I don't love most of these solutions as they seem to complicate things. I'd be happy with 2) if it came to it -- but I think it's a very useful tool, especially on Windows, because it means Windows users would have "pure Python access" to FindFirst/FindNext and a very good speed improvement for os.walk.
Ah, that's an interesting idea. It's nice and explicit. I'd go for either the status quo I've proposed or this one. Though only thing I'd want to push for is a clean API. Any suggestions? Mine would be something like: for filename, st in iterdir_stat(path, stat_fields=['st_mode', 'st_size']: ... However, you might also need 'd_type' field option, because eg for os.walk() you don't actually need all of st_mode, just the type info. This is odd (that most of the fields are named "st_*" but one "d_type") but not terrible. -Ben

From: Mike Meyer <mwm@mired.org> Sent: Thu, November 15, 2012 2:29:44 AM
If the goal is to make os.walk fast, then it might be better (on Posix systems,
anyway) to see if it can be built on top of ftw instead of low-level directory scanning routines.
After a bit of experimentation, I'm not sure there actually is any significant improvement to be had on most POSIX systems working that way. Looking at the source from FreeBSD, OS X, and glibc, they all call stat (or a stat family call) on each file, unless you ask for no stat info. A quick test on OS X shows that calling fts via ctypes is about 5% faster than os.walk, and 5% slower than find -ls or find -mtime (which will stat every file). Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk with the stat calls hacked out, and 40% slower than find. So, a "nostat" option is a potential performance improvement, but switching to ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.

On Fri, Nov 16, 2012 at 4:32 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
I agree with that, so long as you have to stay with the os.walk interface.
Looking at the source from FreeBSD, OS X, and glibc, they all call stat (or a stat family call) on each file, unless you ask for no stat info.
Right. They either they give you *all* the stat information, or they don't give you *any* of it. So there's no way to use it to create the directory/other split in os.walk without doing the stat calls.
Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk with the stat calls hacked out, and 40% slower than find.
That's actually a good thing to know. With FTS_NOSTAT, fts winds up using the d_type field to figure out what's a directory (assuming you were on a file system that has those). That's the proposed change for os.walk, so we now have an estimate of how fast we can expect it to be. I'm surprised that it's slower than find. The FreeBSD version of find uses fts_open/fts_read. Could it be that used FTS_NOCHDIR to emulate os.walk, whereas find doesn't? <mike

I'm not sure I'd put too much confidence on the 3x difference as generally applicable to POSIX. Apple uses FreeBSD's fts unmodified, even though in a quick browser I saw at least one case where a trivial change would have made a difference (the link count check that's only used with ufs/nfs/nfs4/ext2fs would also work on hfs+). Also, OS X with HFS+ drives does some bizarre stuff with disk caching, especially with an SSD (which in itself probably changes the performance characteristics). But I'd guess it's somewhere in the right ballpark, and if anything it'll probably be even more improvement on FreeBSD and linux than on OS X.
No, just FTS_PHYSICAL (with or without FTS_NOSTAT). It looks like more than half of the difference is due to print(ent.fts_path.decode('utf8')) in Python vs. puts(entry->fts_path) in find (based on removing the print entirely). I don't think it's worth the effort to investigate further—let's get the 3x faster before we worry about the last 40%… But if you want to, the source I used is at https://github.com/abarnert/py-fts

Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk with the stat calls hacked out, and 40% slower than find.
Relatedly, I've just finished a proof-of-concept version of iterdir_stat() for Linux using readdir_r and ctypes, and it was only about 10% faster than the existing os.walk on large directories. I was surprised by this, given that I saw a 400% speedup removing the stat()s on Windows, but I guess it means that stat() and/or system calls in general are *much* faster or better cached on Linux. Still, it's definitely worth the huge speedup on Windows, and I think it's the right thing to use the dirent d_type info on Linux, even though the speed gain is small -- it's still faster, and it still saves all those os.stat()s. Also, I'm doing this via ctypes in pure Python, so doing it in C may give another small boost especially for the Linux version. If anyone wants to test what speeds they're getting on Linux or Windows, or critique my proof of concept, please try it at https://github.com/benhoyt/betterwalk -- just run "python benchmark.py [directory]" on a large directory. Note this is only a proof of concept at this stage, not hardened code!
So, a "nostat" option is a potential performance improvement, but switching to ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.
Agreed. Also, this is beyond the scope of my initial suggestion. -Ben

On 11/18/2012 3:52 PM, Ben Hoyt wrote:
Linux readdir can return the file type, but it's always going to be DT_LNK for symlinks, requiring an extra stat call if followlinks=True. Windows functions return both "is a symlink" and "is a directory" in a single call (ntpath.isdir / nt._isdir takes advantage of this, and the same information is present in the Find functions' data) but not other information about the symlink target beyond whether it is a directory. Neither python's existing stat function nor anything proposed here has a way of representing this, but it would be useful for os.walk to be able to get this information.

Am 09.11.2012 11:29, schrieb Ben Hoyt:
+1 for something like yielddir(). I while ago I proposed that the os module shall get another function for iterating over a directory. The new function is to return a generator that yields structs. The structs contain the name and additional metadata that like the d_type. On Unix it should use the reentrant version of readdir(), too. A struct has the benefit that it can grow additional fields or contain operating dependent information like inode on Unix. Christian

On 09.11.12 16:42, Christian Heimes wrote:
+1 for something like yielddir().
See also http://bugs.python.org/issue11406.
The only fields in the dirent structure that are mandated by POSIX.1 are: d_name[], of unspecified size, with at most NAME_MAX characters preceding the terminating null byte; and (as an XSI extension) d_ino. The other fields are unstandardized, and not present on all systems.

Am 09.11.2012 15:54, schrieb Serhiy Storchaka:
The only fields in the dirent structure that are mandated by POSIX.1 are: d_name[], of unspecified size, with at most NAME_MAX characters preceding the terminating null byte; and (as an XSI extension) d_ino. The other fields are unstandardized, and not present on all systems.
I'm well aware of the fact.The idea is to use as many information as we can get for free and acquire missing information from other sources like stat(). Also some information depend on the file system: Currently, only some file systems (among them: Btrfs, ext2, ext3, and ext4) have full support returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN.

Ben Hoyt <benhoyt@...> writes:
Sounds good. I recently answered a Stack Overflow question [1] which showed Python performing an order of magnitude slower than Ruby. Ruby's Dir implementation is written in C and less flexible than os.walk, but there's room for improvement, as you've shown. Regards, Vinay Sajip [1] http://stackoverflow.com/questions/13138160/benchmarks-does-python-have-a-fa...

On 09.11.12 22:30, Vinay Sajip wrote:
This is not so much about walking, as about recursive globbing. See http://bugs.python.org/issue13968. Also note that os.fwalk can be must faster than os.walk (if OS supports it).
participants (18)
-
Andrew Barnert
-
Antoine Pitrou
-
Ben Hoyt
-
Chris Lambacher
-
Christian Heimes
-
Greg Ewing
-
Jim Jewett
-
Mike Meyer
-
MRAB
-
Nick Coghlan
-
Random832
-
random832@fastmail.us
-
Robert Collins
-
Ronald Oussoren
-
Serhiy Storchaka
-
Stefan Drees
-
Steve Dower
-
Vinay Sajip