pathlib and issue 11406 (a directory iterator returning stat-like info)

Hi folks,
I decided to start another thread for my thoughts on the interaction between pathlib (Antoine's new PEP 428), issue 11406 (proposal for a directory iterator returning stat-like info), and my own scandir library, which implements something along the lines of issue 11406.
My scandir library (https://github.com/benhoyt/scandir) is something I've been working on for a while -- it provides a scandir() function which uses the OS's directory iterator functions (readdir, FindFirstFile, etc.) to expose as much stat-like information as possible. This way functions like os.walk() can use the info (particularly "is_dir()") and not require tons of extra calls to os.stat(). This provides a huge speed boost for os.walk() in many cases: I've seen 3-4x on Linux, and up to 20x on Windows.
(It depends on various things, not least of which is Windows' weird stat caching -- if I run my scandir benchmark "fresh", I get os.walk() running 8-9 times as fast as the built-in one. But if I run it after an un-hibernate, suddenly it runs 18-20 times as fast as the built-in one. Either way, huge gains, especially on Windows.)
scandir.scandir() yields DirEntry objects, which have .isdir(), .isfile(), .islink(), and .lstat() methods. Look familiar? When I was reading PEP 428 and saw .is_file(), .is_dir(), and .stat(), I thought -- surely I can merge this with pathlib and Path objects.
The first thing I can do to scandir is rename my isdir()-type methods to match PEP 428's, so that DirEntry quacks like a Path object where it can.
However, I'm wondering if I can change scandir to return actual Path objects. Or better, because Path already helpfully provides iterdir() which yields Path objects, and Path objects have .is_dir() etc, can scandir()-like behaviour simply work out-of-the-box?
This mainly depends on how Path is going to cache stat information. If it caches it, then this will just work. Sounds like Guido's opinion was that both cached and uncached use cases are important, but that it should be very clear which one you're getting. I personally like the .stat() and .restat() idea.
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.
Note in this context that it's not just "network filesystems" on which stat() is slow (https://mail.python.org/pipermail/python-dev/2013-May/125805.html). It's quite slow in Windows under various conditions too.
See also Nick Coghlan's post about a DirEntry-style object on the issue 11406 thread: https://mail.python.org/pipermail/python-dev/2013-May/126148.html
Thoughts and suggestions for how to merge scandir with pathlib's approach? It's important to me that pathlib's API doesn't cut itself off from a more efficient implementation of the ideas from issue 11406 and scandir...
Thanks, Ben.
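A minimal sketch of the directory-walking idea described above. It is not the scandir library's own code: it assumes the os.scandir() function that later shipped in the standard library (Python 3.5; the context-manager form needs 3.6), and the helper name count_dirs_and_files is hypothetical. The point is that is_dir() on each entry can usually be answered from the directory listing itself, without a separate os.stat() call per entry.

```python
import os
import tempfile


def count_dirs_and_files(top):
    """Walk `top` iteratively, counting directories and non-directories.

    DirEntry.is_dir() can usually be answered from the data the OS
    returned while listing the directory, so no per-entry os.stat()
    call is needed on most platforms.
    """
    dirs = files = 0
    stack = [top]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # follow_symlinks=False gives lstat-style semantics:
                # a symlink to a directory is not descended into.
                if entry.is_dir(follow_symlinks=False):
                    dirs += 1
                    stack.append(entry.path)
                else:
                    files += 1
    return dirs, files


# Build a tiny tree to demonstrate: root/a.txt and root/sub/b.txt
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for name in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, name), "w") as f:
            f.write("x")
    result = count_dirs_and_files(root)
    print(result)
```

The follow_symlinks=False argument mirrors the ".lstat() only" behaviour described for DirEntry in the message above.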

On Mon, 25 Nov 2013 11:20:08 +1300 Ben Hoyt <benhoyt@gmail.com> wrote:
This mainly depends on how Path is going to cache stat information. If it caches it, then this will just work. Sounds like Guido's opinion was that both cached and uncached use cases are important, but that it should be very clear which one you're getting. I personally like the .stat() and .restat() idea.
Right now, pathlib doesn't cache. Guido decided it was safer to start off like that, and perhaps later we can add some optional caching. One reason caching didn't go in is that it's not clear which API is best. Working on plugging scandir() into pathlib would actually help choosing a stat-caching API. (or, rather, lstat-caching...)
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.
Path.is_dir() and friends use stat(), i.e. they inform you about whether a symlink's target is a directory (not the symlink itself). Of course, if the DirEntry says the path is a symlink, Path.is_dir() could then run stat() to find out about the target. Do you plan to propose scandir() for inclusion in the stdlib? Regards Antoine.
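The stat() versus lstat() distinction can be seen with a small experiment. This is only an illustrative sketch using pathlib as it exists today; the directory and link names are made up, and creating symlinks may need extra privileges on Windows.

```python
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "real_dir"
    target.mkdir()
    link = Path(root) / "link_to_dir"
    link.symlink_to(target, target_is_directory=True)

    # Path.is_dir() uses stat(), so it reports on the symlink's target.
    follows_link = link.is_dir()
    # lstat() describes the link itself, which is not a directory.
    lstat_says_dir = stat.S_ISDIR(link.lstat().st_mode)
    # The link can still be recognised as a link.
    is_link = link.is_symlink()

    print(follows_link, lstat_says_dir, is_link)
```

So an lstat-only source of information is enough to answer is_dir() for everything except symlinks, where one extra stat() on the target is needed.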

Right now, pathlib doesn't cache. Guido decided it was safer to start off like that, and perhaps later we can add some optional caching.
One reason caching didn't go in is that it's not clear which API is best. Working on plugging scandir() into pathlib would actually help choosing a stat-caching API.
(or, rather, lstat-caching...)
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.
Path.is_dir() and friends use stat(), i.e. they inform you about whether a symlink's target is a directory (not the symlink itself). Of course, if the DirEntry says the path is a symlink, Path.is_dir() could then run stat() to find out about the target.
Do you plan to propose scandir() for inclusion in the stdlib?
Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry objects" for inclusion into the stdlib, and also speed up os.walk() as a result.
However, pathlib's API with .is_dir() and .lstat() etc are so close to DirEntry, I'd be much keener to roll up the scandir functionality into pathlib's iterdir(), as that's already going in the standard library, and iterdir() already returns Path objects.
I'm just not sure it's possible or useful without stat caching.
We could do Path.lstat(cached=True), but we'd also really want is_dir(cached=True), so that API kinda sucks. Alternatively you could have iterdir(cached=True) return PathWithCachedStat style objects -- probably better, but kinda messy.
For these reasons, I would much prefer stat caching on by default in Path -- in my experience, the cached behaviour is desired much much more often than the non-cached. I've written directory walkers more often than I can count, whereas I've maybe only once written a long-running process that needs to re-stat, and if it's clearly documented as cached, then it's super easy to call restat(), or create a new Path instance to get new stat info.
This would allow iterdir() to take advantage of the huge performance improvements you can get when walking directories.
Guido, are you at all open to reconsidering the uncached-by-default in light of this?
-Ben

On Mon, 25 Nov 2013 12:04:28 +1300 Ben Hoyt <benhoyt@gmail.com> wrote:
Right now, pathlib doesn't cache. Guido decided it was safer to start off like that, and perhaps later we can add some optional caching.
One reason caching didn't go in is that it's not clear which API is best. Working on plugging scandir() into pathlib would actually help choosing a stat-caching API.
(or, rather, lstat-caching...)
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.
Path.is_dir() and friends use stat(), i.e. they inform you about whether a symlink's target is a directory (not the symlink itself). Of course, if the DirEntry says the path is a symlink, Path.is_dir() could then run stat() to find out about the target.
Do you plan to propose scandir() for inclusion in the stdlib?
Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry objects" for inclusion into the stdlib, and also speed up os.walk() as a result.
However, pathlib's API with .is_dir() and .lstat() etc are so close to DirEntry, I'd be much keener to roll up the scandir functionality into pathlib's iterdir(), as that's already going in the standard library, and iterdir() already returns Path objects.
We could still expose scandir() as a low-level API, *and* call it in pathlib for optimizations.
We could do Path.lstat(cached=True), but we'd also really want is_dir(cached=True), so that API kinda sucks. Alternatively you could have iterdir(cached=True) return PathWithCachedStat style objects -- probably better, but kinda messy.
Perhaps Path.enable_caching()? It would enable caching not only on this path object, but on all objects constructed from it. Regards Antoine.

On 25 Nov 2013 09:07, "Ben Hoyt" <benhoyt@gmail.com> wrote:
Right now, pathlib doesn't cache. Guido decided it was safer to start off like that, and perhaps later we can add some optional caching.
One reason caching didn't go in is that it's not clear which API is best. Working on plugging scandir() into pathlib would actually help choosing a stat-caching API.
(or, rather, lstat-caching...)
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.
Path.is_dir() and friends use stat(), i.e. they inform you about whether a symlink's target is a directory (not the symlink itself). Of course, if the DirEntry says the path is a symlink, Path.is_dir() could then run stat() to find out about the target.
Do you plan to propose scandir() for inclusion in the stdlib?
Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry objects" for inclusion into the stdlib, and also speed up os.walk() as a result.
However, pathlib's API with .is_dir() and .lstat() etc are so close to DirEntry, I'd be much keener to roll up the scandir functionality into pathlib's iterdir(), as that's already going in the standard library, and iterdir() already returns Path objects.
I'm just not sure it's possible or useful without stat caching.
We could do Path.lstat(cached=True), but we'd also really want is_dir(cached=True), so that API kinda sucks. Alternatively you could have iterdir(cached=True) return PathWithCachedStat style objects -- probably better, but kinda messy.
For these reasons, I would much prefer stat caching on by default in Path -- in my experience, the cached behaviour is desired much much more often than the non-cached. I've written directory walkers more often than I can count, whereas I've maybe only once written a long-running process that needs to re-stat, and if it's clearly documented as cached, then it's super easy to call restat(), or create a new Path instance to get new stat info.
This would allow iterdir() to take advantage of the huge performance improvements you can get when walking directories.
Guido, are you at all open to reconsidering the uncached-by-default in light of this?
No, caching on the object is dangerously unintuitive - it means two Path objects can compare equal, but give different answers for stat-dependent queries.
A global string (or Path) keyed cache (rather than a per-object cache) would actually be a safer option, since it would ensure distinct path objects always gave the same answer. That's the approach I will likely pursue at some point in walkdir.
It's also quite likely the "rich stat object" API will be pursued for 3.5, which is a much safer approach to stat result caching than trying to embed it directly in pathlib.Path objects.
That's why we decided to punt on the caching question until 3.5 - it's better to provide a predictable building block that doesn't provide caching, and then work out how to provide a sensible caching layer on top of that, rather than trying to rush a potentially flawed caching design that leads to inconsistent behaviour.
Cheers, Nick.
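A rough sketch of what a path-keyed cache could look like. This is an assumption for illustration only, not walkdir's or pathlib's design; the names _stat_cache, cached_lstat, and invalidate are hypothetical. Because the cache is keyed by the path string rather than stored on each object, two equal Path objects always see the same (possibly stale) answer.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical module-level cache: path string -> os.stat_result
_stat_cache = {}


def cached_lstat(path):
    """Return the cached lstat result for `path`, filling the cache on a miss."""
    key = os.fspath(path)
    try:
        return _stat_cache[key]
    except KeyError:
        result = _stat_cache[key] = os.lstat(key)
        return result


def invalidate(path=None):
    """Forget one path, or the whole cache when no path is given."""
    if path is None:
        _stat_cache.clear()
    else:
        _stat_cache.pop(os.fspath(path), None)


with tempfile.TemporaryDirectory() as root:
    file_path = os.path.join(root, "data.txt")
    with open(file_path, "w") as f:
        f.write("abc")

    p1 = Path(file_path)
    p2 = Path(file_path)  # distinct object, compares equal to p1

    first = cached_lstat(p1)
    with open(file_path, "a") as f:
        f.write("defgh")  # the file grows from 3 to 8 bytes

    second = cached_lstat(p2)
    same_answer = first is second   # both objects share one cache entry
    stale_size = second.st_size     # still the size seen at the first call

    invalidate(p1)
    fresh_size = cached_lstat(p2).st_size

    print(same_answer, stale_size, fresh_size)
```

The answers can still be stale, but they are consistently stale across equal paths, which is the property the per-object approach lacks.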

Antoine's class-global flag seems like a bad idea.
A global string (or Path) keyed cache (rather than a per-object cache) would actually be a safer option, since it would ensure distinct path objects always gave the same answer. That's the approach I will likely pursue at some point in walkdir.
Interesting approach. This wouldn't really solve the problem for scandir / DirEntry / performance issues, but it's a fair idea in general.
It's also quite likely the "rich stat object" API will be pursued for 3.5, which is a much safer approach to stat result caching than trying to embed it directly in pathlib.Path objects.
As a Windows dev, I'm not sure I love the "rich stat object idea", because stat_result objects are sooo Posixy. On Windows, (some of) the file attribute info is stuffed into a stat_result struct. Which kinda works, but I like how Path exposes the higher-level, cross-platform stuff like .is_dir() so that most of the time you don't need to worry about stat. (You still need to worry about caching, though.)
That's why we decided to punt on the caching question until 3.5 - it's better to provide a predictable building block that doesn't provide caching, and then work out how to provide a sensible caching layer on top of that, rather than trying to rush a potentially flawed caching design that leads to inconsistent behaviour.
Yep, agreed about rushing in a potentially flawed caching design. But I also don't want to "rush in" a design that prohibits scandir()-style performance optimizations -- though I guess it can still go in there one way or the other. "Worst case", we can add os.scandir() separately, which returns DirEntry, "path-like" objects. -Ben

On 25 Nov 2013 09:31, "Ben Hoyt" <benhoyt@gmail.com> wrote:
It's also quite likely the "rich stat object" API will be pursued for 3.5, which is a much safer approach to stat result caching than trying to embed it directly in pathlib.Path objects.
As a Windows dev, I'm not sure I love the "rich stat object idea", because stat_result objects are sooo Posixy. On Windows, (some of) the file attribute info is stuffed into a stat_result struct. Which kinda works, but I like how Path exposes the higher-level, cross-platform stuff like .is_dir() so that most of the time you don't need to worry about stat. (You still need to worry about caching, though.)
The idea of the rich stat result object is that it has all that info prepopulated, based on an initial stat call. "Caching" it amounts to "keep a reference to it". It is suggested that it would be a subset of the pathlib.Path API: http://bugs.python.org/issue19725 If it's also a superset of the existing stat object API, then at least Path.stat and Path.lstat (and perhaps the lower level APIs) can be updated to return it in 3.5.
That's why we decided to punt on the caching question until 3.5 - it's better to provide a predictable building block that doesn't provide caching, and then work out how to provide a sensible caching layer on top of that, rather than trying to rush a potentially flawed caching design that leads to inconsistent behaviour.
Yep, agreed about rushing in a potentially flawed caching design. But I also don't want to "rush in" a design that prohibits scandir()-style performance optimizations -- though I guess it can still go in there one way or the other.
Yeah, the realisation that an initial non-caching approach didn't lock us out of external caching may not have been well communicated to the list. I was discussing the walkdir integration possibilities with Antoine and Guido and realised I would likely still need an external cache, even if pathlib had its own internal caching. At that point, it seemed highly desirable to duck the caching question entirely.
"Worst case", we can add os.scandir() separately, which returns DirEntry, "path-like" objects.
Indeed, we may still want such an object API, since dirent doesn't provide full stat info. A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing. Cheers, Nick.

The idea of the rich stat result object is that it has all that info prepopulated, based on an initial stat call. "Caching" it amounts to "keep a reference to it".
It is suggested that it would be a subset of the pathlib.Path API: http://bugs.python.org/issue19725
If it's also a superset of the existing stat object API, then at least Path.stat and Path.lstat (and perhaps the lower level APIs) can be updated to return it in 3.5.
Got it.
"Worst case", we can add os.scandir() separately, which returns DirEntry, "path-like" objects.
Indeed, we may still want such an object API, since dirent doesn't provide full stat info.
I'm not quite sure what you're suggesting here.
In any case, I'm going to modify my scandir() so its DirEntry objects are closer to pathlib.Path, particularly:
* isdir() -> is_dir()
* isfile() -> is_file()
* islink() -> is_symlink()
* add is_socket(), is_fifo(), is_block_device(), and is_char_device()
I'm considering removing DirEntry's .dirent attribute entirely. The above is_* functions cover everything in .dirent.d_type in a much more Pythonic and cross-platform way, and the only other info in .dirent is d_ino -- can a non-Windows dev tell me how or when d_ino would be useful? If it's useful, is it useful in a higher-level, cross-platform API such as scandir()?
Hmmm, I wonder about this "rich stat object" idea in light of the above. Do the methods on pathlib.Path basically supersede the need for this? Because otherwise folks will always be wondering whether to say "path.is_dir()" or "path.stat().is_dir" ... two ways to do it, right next to each other. So I'd prefer to add the "rich" stuff on the higher-level Path instead of the lower-level stat.
A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing.
Thanks, I'll definitely consider writing a PEP. -Ben

On 25 November 2013 03:18, Ben Hoyt <benhoyt@gmail.com> wrote:
d_ino -- can a non-Windows dev tell me how or when d_ino would be useful? If it's useful, is it useful in a higher-level, cross-platform API such as scandir()?
OK, so I'm a Windows dev, but my understanding is that d_ino is useful to tell if two files are identical - hard links to the same physical file have the same d_ino value. I don't believe it's possible to do this on Windows at all. I've seen it used in tools like diff, to short-circuit doing the actual diff if you know from a stat that the 2 files are the same. Paul

OK, so I'm a Windows dev, but my understanding is that d_ino is useful to tell if two files are identical - hard links to the same physical file have the same d_ino value. I don't believe it's possible to do this on Windows at all.
I've seen it used in tools like diff, to short-circuit doing the actual diff if you know from a stat that the 2 files are the same.
Okay, that helps -- thanks. So the inode number is probably not all that useful in this context at all. Because d_ino doesn't come with the device number, you can't tell whether it identifies a file uniquely (from the posixpath.samestat source, it looks like two stat results refer to the same file only if both the inode and device numbers are equal).
So I think I'm going to drop .dirent entirely, and just expose the d_type information via the is_* functions.
I'm not sure about is_socket(), is_fifo(), is_block_device(), is_char_device(). I'm tempted to just leave them off, as I think they'll basically never be used ... their stat counterparts are exceedingly rare in the stdlib, so if you really want that, just use .lstat().
-Ben
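A small sketch of the inode-plus-device check mentioned above, using os.path.samestat() and a hard link. The file names are made up, and os.link() may not be available on every filesystem.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    original = os.path.join(root, "original.txt")
    hard_link = os.path.join(root, "hard_link.txt")
    copy = os.path.join(root, "copy.txt")

    # Two independent files with identical contents...
    for path in (original, copy):
        with open(path, "w") as f:
            f.write("same contents")
    # ...and a second name for the first file.
    os.link(original, hard_link)

    st_original = os.stat(original)
    st_link = os.stat(hard_link)
    st_copy = os.stat(copy)

    # samestat() compares both the inode number and the device number.
    link_is_same = os.path.samestat(st_original, st_link)
    copy_is_same = os.path.samestat(st_original, st_copy)

    # The same check spelled out by hand.
    manual_check = (st_original.st_ino, st_original.st_dev) == (
        st_link.st_ino,
        st_link.st_dev,
    )

    print(link_is_same, copy_is_same, manual_check)
```

A bare d_ino from a directory entry gives only the first half of that pair, which is why it is of limited use on its own.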

On 25 Nov 2013 13:18, "Ben Hoyt" <benhoyt@gmail.com> wrote:
The idea of the rich stat result object is that it has all that info prepopulated, based on an initial stat call. "Caching" it amounts to "keep a reference to it".
It is suggested that it would be a subset of the pathlib.Path API: http://bugs.python.org/issue19725
If it's also a superset of the existing stat object API, then at least Path.stat and Path.lstat (and perhaps the lower level APIs) can be updated to return it in 3.5.
Got it.
"Worst case", we can add os.scandir() separately, which returns DirEntry, "path-like" objects.
Indeed, we may still want such an object API, since dirent doesn't provide full stat info.
I'm not quite sure what you're suggesting here.
In any case, I'm going to modify my scandir() so its DirEntry objects are closer to pathlib.Path, particularly:
* isdir() -> is_dir()
* isfile() -> is_file()
* islink() -> is_symlink()
* add is_socket(), is_fifo(), is_block_device(), and is_char_device()
I'm considering removing DirEntry's .dirent attribute entirely. The above is_* functions cover everything in .dirent.d_type in a much more Pythonic and cross-platform way, and the only other info in .dirent is d_ino -- can a non-Windows dev tell me how or when d_ino would be useful? If it's useful, is it useful in a higher-level, cross-platform API such as scandir()?
Hmmm, I wonder about this "rich stat object" idea in light of the above. Do the methods on pathlib.Path basically supersede the need for this? Because otherwise folks will always be wondering whether to say "path.is_dir()" or "path.stat().is_dir" ... two ways to do it, right next to each other. So I'd prefer to add the "rich" stuff on the higher-level Path instead of the lower-level stat.
The rich stat API proposal exists precisely to provide a clean way to do stat result caching - path objects always give immediate data, stat objects give cached answers. The direct APIs on Path would just become a trivial shortcut once a rich stat API existed - you could use the long form if you wanted to, but it would be pointless to do so. Cheers, Nick.
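One way to picture the "path objects give immediate data, stat objects give cached answers" split is a thin wrapper around a single stat result. This is a hypothetical sketch, not the API proposed in issue 19725; the class name RichStat and its methods are assumptions chosen to mirror the Path method names.

```python
import os
import stat
import tempfile


class RichStat:
    """Hypothetical snapshot of one stat call, with Path-like queries."""

    def __init__(self, stat_result):
        self._st = stat_result

    def is_dir(self):
        return stat.S_ISDIR(self._st.st_mode)

    def is_file(self):
        return stat.S_ISREG(self._st.st_mode)

    def is_symlink(self):
        return stat.S_ISLNK(self._st.st_mode)

    def __getattr__(self, name):
        # Delegate st_size, st_mtime, etc. to the wrapped stat_result.
        if name == "_st":
            raise AttributeError(name)
        return getattr(self._st, name)


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "note.txt")
    with open(path, "w") as f:
        f.write("hello")

    # "Caching" is just keeping a reference to the snapshot.
    snapshot = RichStat(os.stat(path))
    os.remove(path)

    still_file = snapshot.is_file()    # answer frozen at stat time
    exists_now = os.path.exists(path)  # a fresh query sees the removal
    size = snapshot.st_size

    print(still_file, exists_now, size)
```

Anyone holding the snapshot knows the answers may be out of date, while querying the path again always goes back to the filesystem.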

On Sun, Nov 24, 2013 at 3:04 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
Right now, pathlib doesn't cache. Guido decided it was safer to start off like that, and perhaps later we can add some optional caching.
One reason caching didn't go in is that it's not clear which API is best. Working on plugging scandir() into pathlib would actually help choosing a stat-caching API.
(or, rather, lstat-caching...)
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.
Path.is_dir() and friends use stat(), i.e. they inform you about whether a symlink's target is a directory (not the symlink itself). Of course, if the DirEntry says the path is a symlink, Path.is_dir() could then run stat() to find out about the target.
Do you plan to propose scandir() for inclusion in the stdlib?
Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry objects" for inclusion into the stdlib, and also speed up os.walk() as a result.
However, pathlib's API with .is_dir() and .lstat() etc are so close to DirEntry, I'd be much keener to roll up the scandir functionality into pathlib's iterdir(), as that's already going in the standard library, and iterdir() already returns Path objects.
I'm just not sure it's possible or useful without stat caching.
We could do Path.lstat(cached=True), but we'd also really want is_dir(cached=True), so that API kinda sucks. Alternatively you could have iterdir(cached=True) return PathWithCachedStat style objects -- probably better, but kinda messy.
For these reasons, I would much prefer stat caching on by default in Path -- in my experience, the cached behaviour is desired much much more often than the non-cached. I've written directory walkers more often than I can count, whereas I've maybe only once written a long-running process that needs to re-stat, and if it's clearly documented as cached, then it's super easy to call restat(), or create a new Path instance to get new stat info.
This would allow iterdir() to take advantage of the huge performance improvements you can get when walking directories.
Guido, are you at all open to reconsidering the uncached-by-default in light of this?
I think we should think hard and deep about all the consequences. I was initially in favor of stat caching, but during offline review of PEP 428 Nick pointed out that there are too many different ways to do stat caching, and convinced me that it would be wrong to rush it. Now that beta 1 is out I really don't want to reconsider this -- we really need to stick to the plan.
The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don't want to add it to 3.4.
In general I think there are some tough choices regarding stat caching. You already brought up stat vs. lstat -- there's also the issue of what to do if [l]stat fails -- do we cache the exception?
IMO, the current incarnation is for convenience, correctness and cross-platform semantics -- three C's. The next incarnation can add a fourth C, caching.
--Guido van Rossum (python.org/~guido)
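To make the "do we cache the exception?" question concrete, here is a sketch of one possible answer, in which a failed lstat is remembered just like a successful one. The class CachedLstat, its restat() method, and the probe() helper are hypothetical and only illustrate the trade-off; nothing here is a proposed design.

```python
import os
import tempfile


class CachedLstat:
    """Hypothetical helper that remembers the lstat result or its OSError."""

    _UNSET = object()

    def __init__(self, path):
        self.path = path
        self._result = self._UNSET
        self._error = None

    def lstat(self):
        if self._error is not None:
            # The failure was cached: raise it again without hitting the OS.
            raise self._error
        if self._result is self._UNSET:
            try:
                self._result = os.lstat(self.path)
            except OSError as exc:
                self._error = exc
                raise
        return self._result

    def restat(self):
        """Drop any cached result or error and query the filesystem again."""
        self._result = self._UNSET
        self._error = None
        return self.lstat()


def probe(cached):
    """Report whether cached.lstat() succeeds or raises FileNotFoundError."""
    try:
        cached.lstat()
        return "ok"
    except FileNotFoundError:
        return "missing"


with tempfile.TemporaryDirectory() as root:
    late_path = os.path.join(root, "late.txt")
    cached = CachedLstat(late_path)

    before = probe(cached)              # file does not exist yet
    open(late_path, "w").close()        # now it does
    after_create = probe(cached)        # cached error is still reported
    cached.restat()
    after_restat = probe(cached)

    print(before, after_create, after_restat)
```

Caching the error keeps answers consistent but hides a file that appears later; not caching it means a missing file costs a system call on every query.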

I think we should think hard and deep about all the consequences. I was initially in favor of stat caching, but during offline review of PEP 428 Nick pointed out that there are too many different ways to do stat caching, and convinced me that it would be wrong to rush it. Now that beta 1 is out I really don't want to reconsider this -- we really need to stick to the plan.
Fair call, and thanks for the response.
The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don't want to add it to 3.4.
Yes, I was definitely thinking about 3.5 at this stage. :-) What would be the next step for getting something like os.scandir() added for 3.5 -- a PEP referencing the various issues?
In general I think there are some tough choices regarding stat caching. You already brought up stat vs. lstat -- there's also the issue of what to do if [l]stat fails -- do we cache the exception?
IMO, the current incarnation is for convenience, correctness and cross-platform semantics -- three C's. The next incarnation can add a fourth C, caching.
Three/four C's, I like it! -Ben
participants (5)
- Antoine Pitrou
- Ben Hoyt
- Guido van Rossum
- Nick Coghlan
- Paul Moore