PEP 471 (scandir): Add a new DirEntry.inode() method?

Hi,
TL;DR: on POSIX, is it useful to know the inode number (st_ino) without the device number (st_dev)?
While reading feedback on the Python 3.5 alpha 1 release, I saw a comment saying that the current design of os.scandir() (PEP 471) doesn't fit a very specific use case where the inode number is needed:
"Ah, turns out we needed even more optimizations than that is able to give us; in particular, the underlying system readdir call gives us the inode number, which we need to compare against a cache of hard links, in order to avoid having to stat the underlying files if we've already done so on another hard link. It looks like the DirEntry API used here only includes the path and name, not the inode number, without invoking another stat call, and we needed to optimize out that extra stat call."
https://www.reddit.com/r/Python/comments/2synry/so_8_peps_are_currently_bein...
Since the C function readdir() provides the inode number (d_ino field of the dirent structure), I propose adding a new DirEntry.inode() method.
*** Now the real question: is it useful to know the inode number (st_ino) without the device number (st_dev)? ***
On POSIX, you can still get st_dev from DirEntry.stat(), but it always requires a system call. So you lose the whole purpose of DirEntry (no extra syscall).
I wrote a script, check_stdev.py, attached to this email, to check whether all entries of a directory have the same st_dev value as the directory itself:
- same for /usr/bin, /usr/lib, /tmp, /proc, ...
- different for /dev
What about "union" file systems like UnionFS, or things like "mount -o bind"? Can someone test? Does anyone have some information?
So the answer looks to be: it's useful for all directories except /dev. Example:
---
/dev/hugepages st_dev is different: 35 vs 5
/dev/mqueue st_dev is different: 13 vs 5
/dev/pts st_dev is different: 11 vs 5
/dev/shm st_dev is different: 17 vs 5
---
On POSIX, DirEntry.inode() just exposes the d_ino value from readdir().
On Windows, FindFirstFileW/FindNextFileW return almost a full stat_result structure, except for the st_ino, st_dev and st_nlink fields, which are set to 0. So DirEntry.inode() has to call os.lstat() to read the inode number. The inode number will be cached by DirEntry.inode() in the DirEntry object, but the rest of the os.lstat() result is dropped.
On Windows, I don't want to cache the full os.lstat() result from DirEntry.inode() into DirEntry to replace the previous incomplete stat_result from FindFirstFileW/FindNextFileW, because DirEntry.stat() would return a different result (st_ino, st_dev, st_nlink fields set or not) depending on whether the inode() method was called or not.
Note: scandir-6.patch of http://bugs.python.org/issue22524 contains an implementation of os.scandir() with DirEntry.inode(), if you want to play.
Victor
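Editorial sketch: the check_stdev.py attachment is not reproduced in this archive. The following is only a guess at the kind of check the message describes, comparing each entry's st_dev with that of its directory. The function name find_st_dev_mismatches is hypothetical, and the code assumes Python 3.6 or later (os.scandir() used as a context manager).

```python
import os


def find_st_dev_mismatches(directory):
    """Return (path, entry_dev, dir_dev) for every entry whose st_dev
    differs from the st_dev of the directory itself."""
    dir_dev = os.stat(directory).st_dev
    mismatches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                # os.lstat() is used instead of entry.stat() so that st_dev
                # is also filled in on Windows; symlinks are not followed.
                entry_dev = os.lstat(entry.path).st_dev
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if entry_dev != dir_dev:
                mismatches.append((entry.path, entry_dev, dir_dev))
    return mismatches
```

Calling find_st_dev_mismatches("/dev") on a Linux system would be expected to list mount points such as those shown in the example output above, though the exact entries and device numbers depend on the machine.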

TL;DR: on POSIX, is it useful to know the inode number (st_ino) without the device number (st_dev)?
I can't answer this question (not being a Linux dev and not knowing much about this), but I'm +1 for adding DirEntry.inode(). On Windows, we're exposing all the information FindFirst/FindNext give us, but on Linux we expose everything useful from readdir except d_ino, which is easy to add, and according to that reddit comment, may make scandir useful in more real scenarios. -Ben

Victor Stinner writes:
*** Now the real question: is it useful to know the inode number (st_ino) without the device number (st_dev)? ***
On POSIX, you can still get st_dev from DirEntry.stat(), but it always requires a system call. So you lose the whole purpose of DirEntry (no extra syscall).
True, but I suppose in many cases the user will know that all file system objects handled are on the same device, or will be willing to risk an occasional anomaly. IMO: Document the limitation (if no extra syscall) or inefficiency (with the syscall), and let the user choose. The remaining issue is whether to provide a convenience function for the device number, with appropriately loud warnings about how inefficient it is, or to deter the user with the need to call .stat() and extract the device number.
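Editorial sketch of the hard-link use case under the "document the limitation" option: stat results are cached by inode number alone, on the assumption that every entry is on the same device as the directory. The function name stat_with_hardlink_cache is hypothetical, and the code relies on DirEntry.inode() as it later shipped in Python 3.5 (and on the Python 3.6+ context-manager form of os.scandir()).

```python
import os


def stat_with_hardlink_cache(directory, cache=None):
    """Return ({name: stat_result}, number_of_stat_calls) for `directory`.

    lstat is called at most once per inode number. ASSUMPTION: all entries
    live on the same device as `directory`, so the inode number alone
    identifies a file. Mount points inside the directory break this.
    """
    if cache is None:
        cache = {}
    results = {}
    stat_calls = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            ino = entry.inode()  # from readdir() on POSIX, no extra syscall
            st = cache.get(ino)
            if st is None:
                st = entry.stat(follow_symlinks=False)
                cache[ino] = st
                stat_calls += 1
            results[entry.name] = st
    return results, stat_calls
```

A caller who cannot accept the limitation would instead key the cache on (st_dev, st_ino) taken from a stat result, which costs one system call per entry and gives up the saving.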

On 14Feb2015 11:35, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Victor Stinner writes:
*** Now the real question: is it useful to know the inode number (st_ino) without the device number (st_dev)? ***
On POSIX, you can still get st_dev from DirEntry.stat(), but it always requires a system call. So you lose the whole purpose of DirEntry (no extra syscall).
True, but I suppose in many cases the user will know that all file system objects handled are on the same device, or will be willing to risk an occasional anomaly.
In POSIX, all filesystem objects named by a directory are on the same device unless one is a mount point. (And in that case, st_ino from stat won't match d_ino from scandir, I expect.)
IMO: Document the limitation (if no extra syscall) or inefficiency (with the syscall), and let the user choose.
+1 on .inode(): d_ino has been available in the directory data on POSIX since at least V7 UNIX (1970s), almost certainly earlier. Agree the limitation should be mentioned.
The remaining issue is whether to provide a convenience function for the device number, with appropriately loud warnings about how inefficient it is, or to deter the user with the need to call .stat() and extract the device number.
-1 on that. People will use it! Given the doco above, it should be obvious under what circumstances one might choose to call stat, and making that stat overt means it is less likely to be called unwisely. Since scandir is all about efficiency, providing a very costly convenience function seems to go against the grain. Regarding usefulness: Victor, you've got the typical use case in another post (i.e. useful as in "advantageous"), and your own tests show that st_dev of the dir matches st_dev of a dir's entries in all normal/regular filesystems (i.e. useful as in "meaningful/consistent"). Special filesystems like /dev may be weird, but people relying on this should be aware of the constraint anyway. Since a directory at the low level is essentially a mapping of names to inodes within the directory's filesystem, this is to be expected. Cheers, Cameron Simpson <cs@zip.com.au> Uh, this is only temporary...unless it works. - Red Green
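Editorial sketch of the mount-point remark above: a way to test the expectation that, for a mount point, the inode number reported by the directory entry differs from the st_ino returned by stat. The function name entries_with_inode_mismatch is hypothetical; whether a mismatch actually appears depends on the operating system and filesystem, so this is a probe rather than a reliable mount-point detector.

```python
import os


def entries_with_inode_mismatch(directory):
    """Return the names of entries whose directory-entry inode number
    differs from the st_ino reported by lstat on the same path."""
    names = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                st = os.lstat(entry.path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if entry.inode() != st.st_ino:
                names.append(entry.name)
    return sorted(names)
```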

On 14 Feb 2015 13:17, "Cameron Simpson" <cs@zip.com.au> wrote:
-1 on that. People will use it! Given the doco above, it should be obvious under what circumstances one might choose to call stat, and making that stat overt means it is less likely to be called unwisely.
Since scandir is all about efficiency, providing a very costly convenience function seems to go against the grain.
Regarding usefulness: Victor, you've got the typical use case in another post (i.e. useful as in "advantageous"), and your own tests show that st_dev of the dir matches st_dev of a dir's entries in all normal/regular filesystems (i.e. useful as in "meaningful/consistent"). Special filesystems like /dev may be weird, but people relying on this should be aware of the constraint anyway. Since a directory at the low level is essentially a mapping of names to inodes within the directory's filesystem, this is to be expected.
+1 from me for Cameron's perspective & rationale - it's useful for detecting hardlinks, it will usually work, and the cases where it isn't sufficient on its own are filesystem handling edge cases in more ways than one.
Cheers, Nick.
Cheers, Cameron Simpson <cs@zip.com.au>
Uh, this is only temporary...unless it works. - Red Green
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com

On Saturday, 14 February 2015, Stephen J. Turnbull <stephen@xemacs.org> wrote:
IMO: Document the limitation (if no extra syscall) or inefficiency (with the syscall), and let the user choose.
Hum, by the way, I don't know if we should add the method on Windows. As I said, I don't want to cache the result of os.lstat(). Basically, other methods get no benefit from a call to inode(). A method may be a trap for Windows users. I propose something else: a DirEntry.inode read-only property which would be None on Windows. So you see directly that the property is for POSIX, and that calling os.stat() is required on Windows. That means os.stat(), not DirEntry.stat(): DirEntry.stat() doesn't fill st_ino, st_dev and st_nlink on Windows.
Victor

2015-02-14 11:57 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
I propose something else: a DirEntry.inode read-only property (...)
Full DirEntry API:
- name (str) attribute
- path (str) read-only property, created at the first call
- inode (int or None) attribute <=== my proposition
- is_dir(*, follow_symlinks=True)
- is_file(*, follow_symlinks=True)
- is_symlink(*, follow_symlinks=True)
- stat(*, follow_symlinks=True)
is_dir(), is_file(), is_symlink() and stat() are methods because they may all require a syscall (os.stat or os.lstat). They all cache their result. In some cases, the result is already known when DirEntry is created. In most cases, a single call to os.stat() is required to fill the result of all methods.
Victor
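Editorial sketch exercising the API listed above. It is written against os.DirEntry as it eventually shipped, which differs from this proposal in two ways: inode() is a method rather than an attribute, and is_symlink() takes no arguments. The function name describe_entries and the dictionary keys are hypothetical; Python 3.6 or later is assumed for the context-manager form of os.scandir().

```python
import os


def describe_entries(directory):
    """Return one dict per directory entry, sorted by name."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            rows.append({
                "name": entry.name,    # plain attribute
                "path": entry.path,    # directory joined with name
                "inode": entry.inode(),
                # The is_*() methods and stat() cache their results and
                # may or may not need a system call, depending on platform.
                "is_dir": entry.is_dir(follow_symlinks=False),
                "is_file": entry.is_file(follow_symlinks=False),
                "is_symlink": entry.is_symlink(),
                "size": entry.stat(follow_symlinks=False).st_size,
            })
    return sorted(rows, key=lambda row: row["name"])
```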

On Sat Feb 14 2015 at 3:17:51 AM Victor Stinner <victor.stinner@gmail.com> wrote:
2015-02-14 11:57 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
I propose something else: a DirEntry.inode read-only property (...)
Full DirEntry API:
- name (str) attribute
- path (str) read-only property, created at the first call
- inode (int or None) attribute <=== my proposition
+1 we need to provide the inode (we shouldn't be throwing anything from the underlying directory entry away when possible). But... I think the "or None" semantics are a bad idea. It'd be better for this to raise AttributeError on Windows so that someone can't write the most natural form of code assuming that inode is valid and have it appear to work on Windows when in fact it'd do the wrong thing.
- is_dir(*, follow_symlinks=True)
- is_file(*, follow_symlinks=True)
- is_symlink(*, follow_symlinks=True)
- stat(*, follow_symlinks=True)
is_dir(), is_file(), is_symlink() and stat() are methods because they may all require a syscall (os.stat or os.lstat). They all cache their result. In some cases, the result is already known when DirEntry is created. In most cases, a single call to os.stat() is required to fill the result of all methods.
Victor

+1 we need to provide the inode (we shouldn't be throwing anything from the underlying directory entry away when possible). But...
I think the "or None" semantics are a bad idea. It'd be better for this to raise AttributeError on Windows so that someone can't write the most natural form of code assuming that inode is valid and have it appear to work on Windows when in fact it'd do the wrong thing.
+1 for inode support. I agree with the above -- it should either raise AttributeError on Windows if it's not going to be set ... or it should be more like Victor's original proposal where .inode() is a method that calls stat on Windows. I don't have strong feelings. -Ben

On Sat, 14 Feb 2015 15:32:07 -0500 Ben Hoyt <benhoyt@gmail.com> wrote:
+1 we need to provide the inode (we shouldn't be throwing anything from the underlying directory entry away when possible). But...
I think the "or None" semantics are a bad idea. It'd be better for this to raise AttributeError on Windows so that someone can't write the most natural form of code assuming that inode is valid and have it appear to work on Windows when in fact it'd do the wrong thing.
+1 for inode support. I agree with the above -- it should either raise AttributeError on Windows if it's not going to be set ... or it should be more like Victor's original proposal where .inode() is a method that calls stat on Windows. I don't have strong feelings.
The whole point of scandir is to expose low-level system calls in a cross-platform way. If you start raising some exceptions on some platforms then that quality disappears. Regards Antoine.

That suggests the .inode() method approach makes more sense then.
On Sat, Feb 14, 2015, 12:44 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sat, 14 Feb 2015 15:32:07 -0500 Ben Hoyt <benhoyt@gmail.com> wrote:
+1 we need to provide the inode (we shouldn't be throwing anything from the underlying directory entry away when possible). But...
I think the "or None" semantics are a bad idea. It'd be better for this to raise AttributeError on Windows so that someone can't write the most natural form of code assuming that inode is valid and have it appear to work on Windows when in fact it'd do the wrong thing.
+1 for inode support. I agree with the above -- it should either raise AttributeError on Windows if it's not going to be set ... or it should be more like Victor's original proposal where .inode() is a method that calls stat on Windows. I don't have strong feelings.
The whole point of scandir is to expose low-level system calls in a cross-platform way. If you start raising some exceptions on some platforms then that quality disappears.
Regards
Antoine.

Antoine Pitrou <solipsis@pitrou.net>:
The whole point of scandir is to expose low-level system calls in a cross-platform way.
Cross-platform is great and preferable, but low-level system facilities should be made available even when they are unique to a particular OS. Marko

The whole point of scandir is to expose low-level system calls in a cross-platform way.
Cross-platform is great and preferable, but low-level system facilities should be made available even when they are unique to a particular OS.
Yes, but this can be made cross-platform fairly easily, just like the other method calls. Just as DirEntry.stat() operates differently on each platform (no stat call on Windows, a stat call on POSIX), DirEntry.inode() would too, the other way around (a stat call on Windows, no stat call on POSIX). -Ben
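Editorial sketch of the behaviour described above, using a pure-Python stand-in rather than the real C implementation. The class SketchEntry, its d_ino parameter and its lstat_calls counter are all hypothetical: d_ino models the value readdir() supplies on POSIX, and passing None models Windows, where the first inode() call has to fall back to os.lstat() and only the inode number is cached.

```python
import os


class SketchEntry:
    """Hypothetical stand-in for DirEntry; NOT the real implementation."""

    def __init__(self, path, d_ino=None):
        self.path = path
        # Known up front on POSIX (from readdir); None models Windows.
        self._inode = d_ino
        self.lstat_calls = 0  # bookkeeping for illustration only

    def inode(self):
        if self._inode is None:
            self.lstat_calls += 1
            # Keep only the inode number; the rest of the lstat result is
            # dropped so that a separate stat() result would not change
            # depending on whether inode() was called first.
            self._inode = os.lstat(self.path).st_ino
        return self._inode
```

The design choice this illustrates is the one argued for earlier in the thread: the method form hides the platform difference from callers, at the cost of a system call on the platform that lacks the value.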

+1 for inode support. I agree with the above -- it should either raise AttributeError on Windows if it's not going to be set ... or it should be more like Victor's original proposal where .inode() is a method that calls stat on Windows. I don't have strong feelings.
The whole point of scandir is to expose low-level system calls in a cross-platform way. If you start raising some exceptions on some platforms then that quality disappears.
I agree with that! -Ben

On 14 Feb 2015 18:47, "Gregory P. Smith" <greg@krypto.org> wrote:
I think the "or None" semantics are a bad idea.
Oh, in fact it shouldn't be None but 0 on Windows, to be consistent with DirEntry.stat().st_ino, which is also equal to 0. The value 0 is not a valid inode number.
Victor

Participants (8):
- Antoine Pitrou
- Ben Hoyt
- Cameron Simpson
- Gregory P. Smith
- Marko Rauhamaa
- Nick Coghlan
- Stephen J. Turnbull
- Victor Stinner