My summary of the scandir (PEP 471)

Hi, @Ben: it's time to update your PEP to complete it with this discussion! IMO DirEntry must be as simple as possible and portable: - os.scandir(str) - DirEntry.lstat_result object only available on Windows, same result than os.lstat() - DirEntry.fullname(): os.path.join(directory, DirEntry.name), where directory would be an hidden attribute of DirEntry Notes: - DirEntry.lstat_result is better than DirEntry.lstat() because it makes explicitly that lstat_result is only computed once. When I call DirEntry.lstat(), I expect to get the current status of the file, not the cached one. It's also hard to explain (document) that DirEntry.lstat() may or may call a system call. Don't do that, use DirEntry.lstat_result. - I don't think that we should support scandir(bytes). If you really want to support os.scandir(bytes), it must raise an error on Windows since bytes filename are already deprecated. It wouldn't make sense to add new function with a deprecated feature. Since we have the PEP 383 (surrogateescape), it's better to advice to use Unicode on all platforms. Almost all Python functions are able to encode back Unicode filename automatically. Use os.fsencode() to encode manually if needd. - We may not define a DirEntry.fullname() method: the directory name is usually well known. Ok, but every time that I use os.listdir(), I write os.path.join(directory, name) because in some cases I want the full path. Example: interesting = [] for name in os.listdir(path): fullpath = os.path.join(path, name) if os.path.isdir(fullpath): continue if ... test on the file ...: # i need the full path here, not the relative path # (ex: my own recursive "scandir"/"walk" function) interesting.append(fullpath) - It must not be possible to "refresh" a DirEntry object. Call os.stat(entry.fullname()) or pathlib.Path(entry.fullname()) to get fresh data. DirEntry is only computed once, that's all. It's well defined. - No Windows wildcard, you wrote that the feature has many corner cases, and it's only available on Windows. It's easy to combine scandir with fnmatch. Victor

Thanks for spinning this off to (hopefully) finished the discussion. I agree it's nearly time to update the PEP.
I'm quite strongly against this, and I think it's actually the worst of both worlds. It is not as good an API because: (a) it doesn't call stat for you (on POSIX), so you have to check an attribute and call scandir manually if you need it, turning what should be one line of code into four. Your proposal above was kind of how I had it originally, where you had to do extra tests and call scandir manually if you needed it (see https://mail.python.org/pipermail/python-dev/2013-May/126119.html) (b) the .lstat_result attribute is available on Windows but not on POSIX, meaning it's very easy for Windows developers to write code that will run and work fine on Windows, but then break horribly on POSIX; I think it'd be better if it broke hard on Windows to make writing cross-platform code easy The two alternates are: 1) the original proposal in the current version of PEP 471, where DirEntry has an .lstat() method which calls stat() on POSIX but is free on Windows 2) Nick Coghlan's proposal on the previous thread (https://mail.python.org/pipermail/python-dev/2014-June/135261.html) suggesting an ensure_lstat keyword param to scandir if you need the lstat_result value I would make one small tweak to Nick Coghlan's proposal to make writing cross-platform code easier. Instead of .lstat_result being None sometimes (on POSIX), have it None always unless you specify ensure_lstat=True. (Actually, call it get_lstat=True to kind of make this more obvious.) Per (b) above, this means Windows developers wouldn't accidentally write code which failed on POSIX systems -- it'd fail fast on Windows too if you accessed .lstat_result without specifying get_lstat=True. I'm still unsure which of these I like better. I think #1's API is slightly nicer without the ensure_lstat parameter, and error handling of the stat() is more explicit. But #2 always fetches the stat info at the same time as the dir entry info, so eliminates the problem of having the file info change between scandir iteration and the .lstat() call. I'm leaning towards preferring #2 (Nick's proposal) because it solves or gets around the caching issue. My one concern is error handling. Is it an issue if scandir's __next__ can raise an OSError either from the readdir() call or the call to stat()? My thinking is probably not. In practice, would it ever really happen that readdir() would succeed but an os.stat() immediately after would fail? I guess it could if the file is deleted, but then if it were deleted a microsecond earlier the readdir() would fail anyway, or not? Or does readdir give you a consistent, "snap-shotted" view on things? The one other thing I'm not quite sure about with Nick's proposal is the name .lstat_result, as it's long. I can see why he suggested that, as .lstat sounds like a verb, but maybe that's okay? If we can have .is_dir and .is_file as attributes, my thinking is an .lstat attribute is fine too. I don't feel too strongly though.
Really, are bytes filenames deprecated? I think maybe they should be, as they don't work on Windows :-), but the latest Python "os" docs (https://docs.python.org/3.5/library/os.html) still say that all functions that accept path names accept either str or bytes, and return a value of the same type where necessary. So I think scandir() should do the same thing.
Agreed. I use this a lot too. However, I'd prefer a .fullname attribute rather than a method, as it's free/cheap to compute and doesn't require OS calls. Out of interest, why do we have .is_dir and .stat_result but .fullname rather than .full_name? .fullname seems reasonable to me, but maybe consistency is a good thing here?
I agree refresh() is not needed -- just use os.stat() or pathlib.
Agreed. -Ben

2014-07-01 15:00 GMT+02:00 Ben Hoyt <benhoyt@gmail.com>:
(a) it doesn't call stat for you (on POSIX), so you have to check an attribute and call scandir manually if you need it,
Yes, and that's something common when you use the os module. For example, don't try to call os.fork(), os.getgid() or os.fchmod() on Windows :-) Closer to your PEP, the following OS attributes are only available on UNIX: st_blocks, st_blksize, st_rdev, st_flags; and st_file_attributes is only available on Windows. I don't think that using lstat_result is a common need when browsing a directoy. In most cases, you only need is_dir() and the name attribute.
On UNIX, does it mean that .lstat() calls os.lstat() at the first call, and then always return the same result? It would be different than os.lstat() and pathlib.Path.stat() :-( I would prefer to have the same behaviour than pathlib and os (you know, the well known consistency of Python stdlib). As I wrote, I expect a function call to always retrieve the new status.
I don't like this idea because it makes error handling more complex. The syntax to catch exceptions on an iterator is verbose (while: try: next() except ...). Whereas calling os.lstat(entry.fullname()) is explicit and it's easy to surround it with try/except.
.lstat_result being None sometimes (on POSIX),
Don't do that, it's not how Python handles portability. We use hasattr().
would it ever really happen that readdir() would succeed but an os.stat() immediately after would fail?
Yes, it can happen. The filesystem is system-wide and shared by all users. The file can be deleted.
Really, are bytes filenames deprecated?
Yes, in all functions of the os module since Python 3.3. I'm sure because I implemented the deprecation :-) Try open(b'test.txt', w') on Windows with python -Werror.
I think maybe they should be, as they don't work on Windows :-)
Windows has an API dedicated to bytes filenames, the ANSI API. But this API has annoying bugs: it replaces unencodable characters by question marks, and there is no option to be noticed on the encoding error. Different users complained about that. It was decided to not change Python since Python is a light wrapper over the kernel system calls. But bytes filenames are now deprecated to advice users to use the native type for filenames on Windows: Unicode!
Maybe I forgot to update the documentation :-(
So I think scandir() should do the same thing.
You may support scandir(bytes) on Windows but you will need to emit a deprecation warning too. (which are silent by default.) Victor

On 01.07.2014 15:00, Ben Hoyt wrote:
No need for a microsecond-timed deletion -- a directory with +r but without +x will allow you to list the entries, but stat calls on the files will fail with EPERM: $ ls -l drwxr--r--. 2 root root 60 1. Jul 16:52 test $ sudo ls -l test total 0 -rw-r--r--. 1 root root 0 1. Jul 16:52 foo $ ls test ls: cannot access test/foo: Permission denied total 0 -????????? ? ? ? ? ? foo $ stat test/foo stat: cannot stat ‘test/foo’: Permission denied I had the idea to treat a failing lstat() inside scandir() as if the entry wasn’t found at all, but in this context, this seems wrong too. regards, jwi

Ah -- very good to know, thanks. This definitely points me in the direction of wanting better control over error handling. Speaking of errors, and thinking of handling errors during iteration -- in what cases (if any) would an individual readdir fail if the opendir succeeded? -Ben

On 1 Jul 2014 07:31, "Victor Stinner" <victor.stinner@gmail.com> wrote:
2014-07-01 15:00 GMT+02:00 Ben Hoyt <benhoyt@gmail.com>:
Actually, we may need to copy the os.walk API and accept an "onerror" callback as a scandir argument. Regardless of whether or not we have "ensure_lstat", the iteration step could fail, so I don't believe we can just transfer the existing approach of catching exceptions from the listdir call.
That's not true in general - we do either, depending on context. With the addition of an os.walk style onerror callback, I'm still in favour of a "get_lstat" flag (tweaked as Ben suggests to always be None unless requested, so Windows code is less likely to be inadvertently non-portable)
We need per-iteration error handling for the readdir call anyway, so I think an onerror callback is a better option than dropping the ability to easily obtain full stat information as part of the iteration. Cheers, Nick.

I don't mind the idea of an "onerror" callback, but it's adding complexity. Putting aside the question of caching/timing for a second and assuming .lstat() as per the current PEP 471, do we really need per-iteration error handling for readdir()? When would that actually fail in practice? -Ben

On 1 July 2014 08:42, Ben Hoyt <benhoyt@gmail.com> wrote:
An NFS mount dropping the connection or a USB key being removed are the first that come to mind, but I expect there are others. I find it's generally better to just assume that any system call may fail for obscure reasons and put the infrastructure in place to deal with it rather than getting ugly, hard to track down bugs later. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 07/01/2014 07:59 AM, Jonas Wielicki wrote:
I had the idea to treat a failing lstat() inside scandir() as if the entry wasn’t found at all, but in this context, this seems wrong too.
Well, os.walk supports passing in an error handler -- perhaps scandir should as well. -- ~Ethan~

On 1 July 2014 14:00, Ben Hoyt <benhoyt@gmail.com> wrote:
This is getting very complicated (at least to me, as a Windows user, where the basic idea seems straightforward). It seems to me that the right model is the standard "thin wrapper round the OS feature" that acts as a building block - it's typical of the rest of the os module. I think that thin wrapper is needed - even if the various bells and whistles are useful, they can be built on top of a low-level version (whereas the converse is not the case). Typically, such thin wrappers expose POSIX semantics by default, and Windows behaviour follows as closely as possible (see for example stat, where st_ino makes no sense on Windows, but is present). In this case, we're exposing Windows semantics, and POSIX is the one needing to fit the model, but the principle is the same. On that basis, optional attributes (as used in stat results) seem entirely sensible. The documentation for DirEntry could easily be written to parallel that of a stat result: """ The return value is an object whose attributes correspond to the data the OS returns about a directory entry: * name - the object's name * full_name - the object's full name (including path) * is_dir - whether the object is a directory * is file - whether the object is a plain file * is_symlink - whether the object is a symbolic link On Windows, the following attributes are also available * st_size - the size, in bytes, of the object (only meaningful for files) * st_atime - time of last access * st_mtime - time of last write * st_ctime - time of creation * st_file_attributes - Windows file attribute bits (see the FILE_ATTRIBUTE_* constants in the stat module) """ That's no harder to understand (or to work with) than the equivalent stat result. The only difference is that the unavailable attributes can be queried on POSIX, there's just a separate system call involved (with implications in terms of performance, error handling and potential race conditions). The version of scandir with the ensure_lstat argument is easy to write based on one with optional arguments (I'm playing fast and loose with adding attributes to DirEntry values here, just for the sake of an example - the details are left as an exercise) def scandir_ensure(path='.', ensure_lstat=False): for entry in os.scandir(path): if ensure_lstat and not hasattr(entry, 'st_size'): stat_data = os.lstat(entry.full_name) entry.st_size = stat_data.st_size entry.st_atime = stat_data.st_atime entry.st_mtime = stat_data.st_mtime entry.st_ctime = stat_data.st_ctime # Ignore file_attributes, as we'll never get here on Windows yield entry Variations on how you handle errors in the lstat call, etc, can be added to taste. Please, let's stick to a low-level wrapper round the OS API for the first iteration of this feature. Enhancements can be added later, when real-world usage has proved their value. Paul

On 1 July 2014 14:20, Paul Moore <p.f.moore@gmail.com> wrote:
+1 from me - especially if this recipe goes in at least the PEP, and potentially even the docs. I'm also OK with postponing onerror support for the time being - that should be straightforward to add later if we decide we need it. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Thanks for the effort in your response, Paul. I'm all for KISS, but let's just slow down a bit here.
Yes, but API design is important. For example, urllib2 has a kind of the "thin wrapper approach", but millions of people use the 3rd-party "requests" library because it's just so much nicer to use. There are low-level functions in the "os" module, but there are also a lot of higher-level functions (os.walk) and functions that smooth over cross-platform issues (os.stat). Detailed comments below.
Again, this seems like a nice simple idea, but I think it's actually a worst-of-both-worlds solution -- it has a few problems: 1) It's a nasty API to actually write code with. If you try to use it, it gives off a "made only for low-level library authors" rather than "designed for developers" smell. For example, here's a get_tree_size() function I use written in both versions (original is the PEP 471 version with the addition of .full_name): def get_tree_size_original(path): """Return total size of all files in directory tree at path.""" total = 0 for entry in os.scandir(path): if entry.is_dir(): total += get_tree_size_original(entry.full_name) else: total += entry.lstat().st_size return total def get_tree_size_new(path): """Return total size of all files in directory tree at path.""" total = 0 for entry in os.scandir(path): if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'): is_dir = entry.is_dir size = entry.st_size else: st = os.lstat(entry.full_name) is_dir = stat.S_ISDIR(st.st_mode) size = st.st_size if is_dir: total += get_tree_size_new(entry.full_name) else: total += size return total I know which version I'd rather write and maintain! It seems to me new users and folks new to Python could easily write the top version, but the bottom is longer, more complicated, and harder to get right. It would also be very easy to write code in a way that works on Windows but bombs hard on POSIX. 2) It seems like your assumption is that is_dir/is_file/is_symlink are always available on POSIX via readdir. This isn't actually the case (this was discussed in the original threads) -- if readdir() returns dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat() anyway to get it. So, as the above definition of get_tree_size_new() shows, you have to use getattr/hasattr on everything: is_dir/is_file/is_symlink as well as the st_* attributes. 3) It's not much different in concept to the PEP 471 version, except that PEP 471 has a built-in .lstat() method, making the user's life much easier. This is the sense in which it's the worst of both worlds -- it's a far less nice API to use, but it still has the same issues with race conditions the original does. So thinking about this again: First, based on the +1's to Paul's new solution, I don't think people are too concerned about the race condition issue (attributes being different between the original readdir and the os.stat calls). I think this is probably fair -- if folks care, they can handle it in an application-specific way. So that means Paul's new solution and the original PEP 471 approach are both okay on that score. Second, comparing PEP 471 to Nick's solution: error handling is much more straight-forward and simple to document with the original PEP 471 approach (just try/catch around the function calls) than with Nick's get_lstat=True approach of doing the stat() if needed inside the iterator. To catch errors with that approach, you'd either have to do a "while True" loop and try/catch around next(it) manually (which is very yucky code), or we'd have to add an onerror callback, which is somewhat less nice to use and harder to document (signature of the callback, exception object, etc). So given all of the above, I'm fairly strongly in favour of the approach in the original PEP 471 due to it's easy-to-use API and straight-forward try/catch approach to error handling. (My second option would be Nick's get_lstat=True with the onerror callback. My third option would be Paul's attribute-only solution, as it's just very hard to use.) Thoughts? -Ben

tl;dr - I agree with your points and think that the original PEP 471 proposal is fine. The details here are just clarification of why my proposal wasn't just "use PEP 471 as written" in the first place... On 2 July 2014 13:41, Ben Hoyt <benhoyt@gmail.com> wrote:
Fair point. But *only* because is_dir isn't guaranteed to be available. I could debate other aspects of your translation to use my API, but it's not relevant as my proposal was flawed in terms of is_XXX.
Given the is_dir point, agreed.
You may have a point here - my Windows bias may be showing. It's already awfully easy to write code that works on POSIX but bombs hard on Windows (deleting open files, for example) so I find it tempting to think "give them a taste of their own medicine" :-) More seriously, it seems to me that the scandir API is specifically designed to write efficient code on platforms where the OS gives information that allows you to do so. Warping the API too much to cater for platforms where that isn't possible seems to have the priorities backwards. Making the API not be an accident waiting to happen is fine, though. And let's be careful, too. My position is that it's not too hard to write code that works on Windows, Linux and OS X but you're right you could miss the problem with platforms that don't even support a free is_dir(). It's *easier* to write Windows-only code by mistake, but the fix to cover the "big three" is pretty simple (if not hasattr, lstat).
Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially, that said "everywhere" to me. It might be worth calling out specifically some examples where it's not available without an extra system call, just to make the point explicit. You're right, though, that blows away the simplicity of my proposal. The original PEP 471 seems precisely right to me, in that case. I was only really arguing for attributes because they seem more obviously static than a method call. And personally I don't care about that aspect.
Agreed. My intent was never to remove the race conditions, I see them as the responsibility of the application to consider (many applications simply won't care, and those that do will likely want a specific solution, not a library-level compromise).
+1. That was my main point, in actual fact
Agreed. If my solution had worked, it would have been by isolating a few extra cases where you could guarantee errors won't happen. But actually, errors *can* happen in those cases, on certain systems. So PEP 471 wins on all counts here too.
Agreed. The solution I proposed isn't just "very hard to use", it's actually wrong. If is_XXX are optional attributes, that's not my solution, and I agree it's *awful*. Paul. PS I'd suggest adding a "Rejected proposals" section to the PEP which mentions the race condition issue and points to this discussion as an indication that people didn't seem to see it as a problem. On 2 July 2014 13:41, Ben Hoyt <benhoyt@gmail.com> wrote:

Thanks for the clarifications and support.
Good call. I'll update the wording in the PEP here and try to call out specific examples of where is_dir() could call os.stat(). Hard-core POSIX people, do you know when readdir() d_type will be DT_UNKNOWN on (for example) Linux or OS X? I suspect this can happen on certain network filesystems, but I'm not sure.
Definitely agreed. I'll add this, and clarify various other issues in the PEP, and then repost. -Ben

Ben Hoyt <benhoyt@gmail.com> writes:
Any fuse file system mounted by some other user and without -o allow_other. For these entries, stat() will fail as well. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«
participants (9)
-
Ben Hoyt
-
Chris Angelico
-
Ethan Furman
-
Glenn Linderman
-
Jonas Wielicki
-
Nick Coghlan
-
Nikolaus Rath
-
Paul Moore
-
Victor Stinner