PEP 471 -- os.scandir() function -- a better and faster directory iterator

Hi Python dev folks,

I've written a PEP proposing a specific os.scandir() API for a directory iterator that returns the stat-like info from the OS, the main advantage of which is to speed up os.walk() and similar operations between 4-20x, depending on your OS and file system. Full details, background info, and context links are in the PEP, which Victor Stinner has uploaded at the following URL, and I've also copied inline below.

http://legacy.python.org/dev/peps/pep-0471/

Would love feedback on the PEP, but also of course on the proposal itself.

-Ben

PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt <benhoyt@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5


Abstract
========

This PEP proposes including a new directory iteration function, ``os.scandir()``, in the standard library. This new function adds useful functionality and increases the speed of ``os.walk()`` by 2-10 times (depending on the platform and file system) by significantly reducing the number of times ``stat()`` needs to be called.


Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs to be, because -- in addition to calling ``os.listdir()`` on each directory -- it executes the system call ``os.stat()`` or ``GetFileAttributes()`` on each file to determine whether the entry is a directory or not.

But the underlying system calls -- ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X -- already tell you whether the files returned are directories or not, so no further system calls are needed. In short, you can reduce the number of system calls from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually much wider than they are deep, it's often much better than this.)
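To make the 2N-versus-N arithmetic concrete, here is a minimal sketch. It is not code from the PEP or from the scandir module. It assumes Python 3.5 or later, where ``os.scandir()`` is available in the standard library, and the helper names ``split_with_listdir`` and ``split_with_scandir`` are hypothetical.

```python
import os
import tempfile


def split_with_listdir(path):
    """Classic approach: one listdir() plus one stat()-style call per entry."""
    dirs, files = [], []
    for name in os.listdir(path):
        # os.path.isdir() issues a stat() system call for every single name.
        if os.path.isdir(os.path.join(path, name)):
            dirs.append(name)
        else:
            files.append(name)
    return sorted(dirs), sorted(files)


def split_with_scandir(path):
    """scandir approach: the entry type usually comes with the directory read."""
    dirs, files = [], []
    for entry in os.scandir(path):
        # On most systems is_dir() is answered from data already fetched.
        if entry.is_dir():
            dirs.append(entry.name)
        else:
            files.append(entry.name)
    return sorted(dirs), sorted(files)


# Build a tiny throwaway tree and check that both approaches agree.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    open(os.path.join(root, "a.txt"), "w").close()
    listdir_result = split_with_listdir(root)
    scandir_result = split_with_scandir(root)
```

Both helpers return the same split of names. The difference is only in how many system calls are needed to produce it, which is what the benchmark figures in this PEP measure.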
In practice, removing all those extra system calls makes ``os.walk()`` about **8-9 times as fast on Windows**, and about **2-3 times as fast on Linux and Mac OS X**. So we're not talking about micro-optimizations. See more `benchmarks`_.

.. _`benchmarks`: https://github.com/benhoyt/scandir#benchmarks

Somewhat relatedly, many people (see Python `Issue 11406`_) are also keen on a version of ``os.listdir()`` that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.

So as well as providing a ``scandir()`` iterator function for calling directly, Python's existing ``os.walk()`` function could be sped up a huge amount.

.. _`Issue 11406`: http://bugs.python.org/issue11406


Implementation
==============

The implementation of this proposal was written by Ben Hoyt (initial version) and Tim Golden (who helped a lot with the C extension module). It lives on GitHub at `benhoyt/scandir`_.

.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

Note that this module has been used and tested (see "Use in the wild" section in this PEP), so it's more than a proof-of-concept. However, it is marked as beta software and is not extensively battle-tested. It will need some cleanup and more thorough testing before going into the standard library, as well as integration into `posixmodule.c`.


Specifics of proposal
=====================

Specifically, this PEP proposes adding a single function to the ``os`` module in the standard library, ``scandir``, that takes a single, optional string as its argument::

    scandir(path='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory iteration system calls to get the names of the files in the ``path`` directory, but it's different from ``listdir`` in two ways:

* Instead of bare filename strings, it returns lightweight ``DirEntry`` objects that hold the filename string and provide simple methods that allow access to the stat-like data the operating system returned.

* It returns a generator instead of a list, so that ``scandir`` acts as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and directory in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'`` pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each ``DirEntry`` object has the following attributes and methods:

* ``name``: the entry's filename, relative to ``path`` (corresponds to the return values of ``os.listdir``)

* ``is_dir()``: like ``os.path.isdir()``, but requires no system calls on most systems (Linux, Windows, OS X)

* ``is_file()``: like ``os.path.isfile()``, but requires no system calls on most systems (Linux, Windows, OS X)

* ``is_symlink()``: like ``os.path.islink()``, but requires no system calls on most systems (Linux, Windows, OS X)

* ``lstat()``: like ``os.lstat()``, but requires no system calls on Windows

The ``DirEntry`` attribute and method names were chosen to be the same as those in the new ``pathlib`` module for consistency.
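As an illustration of the interface listed above, here is a minimal pure-Python sketch built on ``os.listdir()`` and ``os.lstat()``. It is not the proposed implementation: the names ``SketchDirEntry`` and ``sketch_scandir`` are hypothetical, every answer costs an ``lstat()`` call (so none of the speedup is obtained), and as a simplification ``is_dir()`` and ``is_file()`` here do not follow symlinks.

```python
import os
import stat
import tempfile


class SketchDirEntry:
    """Hypothetical stand-in exposing the attribute and methods listed above."""

    def __init__(self, directory, name):
        self._directory = directory
        self.name = name
        self._lstat = None  # filled in on first use, never refreshed

    def lstat(self):
        # Fetch once, then keep returning the same cached result object.
        if self._lstat is None:
            self._lstat = os.lstat(os.path.join(self._directory, self.name))
        return self._lstat

    def is_dir(self):
        return stat.S_ISDIR(self.lstat().st_mode)

    def is_file(self):
        return stat.S_ISREG(self.lstat().st_mode)

    def is_symlink(self):
        return stat.S_ISLNK(self.lstat().st_mode)


def sketch_scandir(path="."):
    """Yield one SketchDirEntry per name, lazily, like the proposed generator."""
    for name in os.listdir(path):
        yield SketchDirEntry(path, name)


# Exercise the sketch on a small temporary directory.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("hello")
    entries = {e.name: e for e in sketch_scandir(root)}
    kinds = {name: (e.is_dir(), e.is_file()) for name, e in entries.items()}
    first = entries["a.txt"].lstat()
    cached = entries["a.txt"].lstat() is first
    size = first.st_size
```

A real implementation would fill in the type information (and, on Windows, the full stat data) from the directory read itself rather than from a separate ``lstat()`` call.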

Notes on caching
----------------

The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute is obviously always cached, and the ``is_X`` and ``lstat`` methods cache their values (immediately on Windows via ``FindNextFile``, and on first use on Linux / OS X via a ``stat`` call) and never refetch from the system.

For this reason, ``DirEntry`` objects are intended to be used and thrown away after iteration, not stored in long-lived data structures and the methods called again and again.

If a user wants to do that (for example, for watching a file's size change), they'll need to call the regular ``os.lstat()`` or ``os.path.getsize()`` functions which force a new system call each time.


Examples
========

Here's a good usage pattern for ``scandir``. This is in fact almost exactly how the scandir module's faster ``os.walk()`` implementation uses it::

    dirs = []
    non_dirs = []
    for entry in scandir(path):
        if entry.is_dir():
            dirs.append(entry)
        else:
            non_dirs.append(entry)

The above ``os.walk()``-like code will be significantly faster using scandir on both Windows and Linux or OS X.

Or, for getting the total size of files in a directory tree -- showing use of the ``DirEntry.lstat()`` method::

    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
        size = 0
        for entry in scandir(path):
            if entry.is_dir():
                sub_path = os.path.join(path, entry.name)
                size += get_tree_size(sub_path)
            else:
                size += entry.lstat().st_size
        return size

Note that ``get_tree_size()`` will get a huge speed boost on Windows, because no extra stat calls are needed, but on Linux and OS X the size information is not returned by the directory iteration functions, so this function won't gain anything there.


Support
=======

The scandir module on GitHub has been forked and used quite a bit (see "Use in the wild" in this PEP), but there's also been a fair bit of direct support for a scandir-like function from core developers and others on the python-dev and python-ideas mailing lists.
A sampling:

* **Nick Coghlan**, a core Python developer: "I've had the local Red Hat release engineering team express their displeasure at having to stat every file in a network mounted directory tree for info that is present in the dirent structure, so a definite +1 to os.scandir from me, so long as it makes that info available." [`source1 <http://bugs.python.org/issue11406>`_]

* **Tim Golden**, a core Python developer, supports scandir enough to have spent time refactoring and significantly improving scandir's C extension module. [`source2 <https://github.com/tjguk/scandir>`_]

* **Christian Heimes**, a core Python developer: "+1 for something like yielddir()" [`source3 <https://mail.python.org/pipermail/python-ideas/2012-November/017772.html>`_] and "Indeed! I'd like to see the feature in 3.4 so I can remove my own hack from our code base." [`source4 <http://bugs.python.org/issue11406>`_]

* **Gregory P. Smith**, a core Python developer: "As 3.4beta1 happens tonight, this isn't going to make 3.4 so i'm bumping this to 3.5. I really like the proposed design outlined above." [`source5 <http://bugs.python.org/issue11406>`_]

* **Guido van Rossum** on the possibility of adding scandir to Python 3.5 (as it was too late for 3.4): "The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don't want to add it to 3.4." [`source6 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_]

Support for this PEP itself (meta-support?) was given by Nick Coghlan on python-dev: "A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing." [`source7 <https://mail.python.org/pipermail/python-dev/2013-November/130588.html>`_]


Use in the wild
===============

To date, ``scandir`` is definitely useful, but has been clearly marked "beta", so it's uncertain how much use of it there is in the wild.
Ben Hoyt has had several reports from people using it. For example:

* Chris F: "I am processing some pretty large directories and was half expecting to have to modify getdents. So thanks for saving me the effort." [via personal email]

* bschollnick: "I wanted to let you know about this, since I am using Scandir as a building block for this code. Here's a good example of scandir making a radical performance improvement over os.listdir." [`source8 <https://github.com/benhoyt/scandir/issues/19>`_]

* Avram L: "I'm testing our scandir for a project I'm working on. Seems pretty solid, so first thing, just want to say nice work!" [via personal email]

Others have `requested a PyPI package`_ for it, which has been created. See `PyPI package`_.

.. _`requested a PyPI package`: https://github.com/benhoyt/scandir/issues/12
.. _`PyPI package`: https://pypi.python.org/pypi/scandir

GitHub stats don't mean too much, but scandir does have several watchers, issues, forks, etc. Here's the run-down of the stats as of June 5, 2014:

* Watchers: 17
* Stars: 48
* Forks: 15
* Issues: 2 open, 19 closed

**However, the much larger point is this:** if this PEP is accepted, ``os.walk()`` can easily be reimplemented using ``scandir`` rather than ``listdir`` and ``stat``, increasing the speed of ``os.walk()`` very significantly. There are thousands of developers, scripts, and production code that would benefit from this large speedup of ``os.walk()``. For example, on GitHub, there are almost as many uses of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).


Open issues and optional things
===============================

There are a few open issues or optional additions:


Should scandir be in its own module?
------------------------------------

Should the function be included in the standard library in a new module, ``scandir.scandir()``, or just as ``os.scandir()`` as discussed? The preference of this PEP's author (Ben Hoyt) would be ``os.scandir()``, as it's just a single function.


Should there be a way to access the full path?
----------------------------------------------

Should ``DirEntry`` objects have a way to get the full path without using ``os.path.join(path, entry.name)``? This is a pretty common pattern, and it may be useful to add pathlib-like ``str(entry)`` functionality. This functionality has also been requested in `issue 13`_ on GitHub.

.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13


Should it expose Windows wildcard functionality?
------------------------------------------------

Should ``scandir()`` have a way of exposing the wildcard functionality in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The scandir module on GitHub exposes this as a ``windows_wildcard`` keyword argument, allowing Windows power users the option to pass a custom wildcard to ``FindFirstFile``, which may avoid the need to use ``fnmatch`` or similar on the resulting names. It is named the unwieldy ``windows_wildcard`` to remind you you're writing power-user, Windows-only code if you use it.

This boils down to whether ``scandir`` should be about exposing all of the system's directory iteration features, or simply providing a fast, simple, cross-platform directory iteration API.

This PEP's author votes for not including ``windows_wildcard`` in the standard library version, because even though it could be useful in rare cases (say the Windows Dropbox client?), it'd be too easy to use it just because you're a Windows developer, and create code that is not cross-platform.


Possible improvements
=====================

There are many possible improvements one could make to scandir, but here is a short list of some this PEP's author has in mind:

* scandir could potentially be further sped up by calling ``readdir`` / ``FindNextFile`` say 50 times per ``Py_BEGIN_ALLOW_THREADS`` block so that it stays in the C extension module for longer, and may be somewhat faster as a result.
  This approach hasn't been tested, but was suggested on Issue 11406 by Antoine Pitrou. [`source9 <http://bugs.python.org/msg130125>`_]


Previous discussion
===================

* `Original thread Ben Hoyt started on python-ideas`_ about speeding up ``os.walk()``

* Python `Issue 11406`_, which includes the original proposal for a scandir-like function

* `Further thread Ben Hoyt started on python-dev`_ that refined the ``scandir()`` API, including Nick Coghlan's suggestion of scandir yielding ``DirEntry``-like objects

* `Final thread Ben Hoyt started on python-dev`_ to discuss the interaction between scandir and the new ``pathlib`` module

* `Question on StackOverflow`_ about why ``os.walk()`` is slow and pointers on how to fix it (this inspired the author of this PEP early on)

* `BetterWalk`_, this PEP's author's previous attempt at this, on which the scandir code is based

.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-o...
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

On 27 June 2014 09:28, MRAB <python@mrabarnett.plus.com> wrote:
Personally, I'd prefer the name 'iterdir' because it emphasises that it's an iterator.
Exactly what I was going to post (with the added note that there's an obvious symmetry with listdir).

+1 for iterdir rather than scandir

Other than that:

+1 for adding scandir to the stdlib
-1 for windows_wildcard (it would be an attractive nuisance to write windows-only code)

Tim Delaney

I don't mind iterdir() and would take it :-), but I'll just say why I chose the name scandir() -- though it wasn't my suggestion originally:

iterdir() sounds like just an iterator version of listdir(), kinda like keys() and iterkeys() in Python 2. Whereas in actual fact the return values are quite different (DirEntry objects vs strings), and so the name change reflects that difference a little.

I'm also -1 on windows_wildcard. I think it's asking for trouble, and wouldn't gain much on Windows in most cases anyway.

-Ben

On Thu, Jun 26, 2014 at 7:43 PM, Ethan Furman <ethan@stoneleaf.us> wrote:

On 2014-06-27 02:37, Ben Hoyt wrote:
[snip] The re module has 'findall', which returns a list of strings, and 'finditer', which returns an iterator that yields match objects, so there's a precedent. :-)
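
The precedent mentioned here can be seen in a few lines. This is only an illustration of the two existing re functions; the sample text and variable names are made up.

```python
import re

text = "a1 b22 c333"

# findall() returns a fully built list of plain strings.
as_list = re.findall(r"\d+", text)

# finditer() returns a lazy iterator that yields richer match objects.
as_iter = re.finditer(r"\d+", text)
spans = [(m.group(), m.start()) for m in as_iter]

print(as_list)  # ['1', '22', '333']
print(spans)    # [('1', 1), ('22', 4), ('333', 8)]
```

The parallel with the proposal is that listdir() returns a list of strings, while the new function would yield objects carrying more than the name.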

+1 on getting this in for 3.5.

If the only objection people are having is the stupid paint color of the name I don't care what it's called! scandir matches the libc API of the same name. iterdir also makes sense to anyone reading it. Whoever checks this in can pick one and be done with it. We have other Python APIs with iter in the name and tend not to be trying to mirror C so much these days so the iterdir folks do have a valid point.

I'm not a huge fan of the DirEntry object and the method calls on it instead of simply yielding tuples of (filename, partially_filled_in_stat_result) but I don't *really* care which is used as they both work fine and it is trivial to wrap with another generator expression to turn it into exactly what you want anyways.

Python not having the ability to operate on large directories means Python simply cannot be used for common system maintenance tasks. Python being slow to walk a file system due to unnecessary stat calls (often each an entire io op. requiring a disk seek!) due to the existing information that it throws away not being used via listdir is similarly a problem. This addresses both.

IMNSHO, it is a single function, it belongs in the os module right next to listdir.

-gps

On Thu, Jun 26, 2014 at 6:37 PM, Ben Hoyt <benhoyt@gmail.com> wrote:

On Jun 26, 2014, at 4:38 PM, Tim Delaney <timothy.c.delaney@gmail.com> wrote: On 27 June 2014 09:28, MRAB <python@mrabarnett.plus.com> wrote:
-1 for windows_wildcard (it would be an attractive nuisance to write windows-only code)

Could you emulate it on other platforms?

+1 on the rest of it.

-Chris

Hello, On Thu, 26 Jun 2014 18:59:45 -0400 Ben Hoyt <benhoyt@gmail.com> wrote:
I noticed obvious inefficiency of os.walk() implemented in terms of os.listdir() when I worked on "os" module for MicroPython. I essentially did what your PEP suggests - introduced internal generator function (ilistdir_ex() in https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py... ), in terms of which both os.listdir() and os.walk() are implemented.

With my MicroPython hat on, os.scandir() would make things only worse. With current interface, one can either have inefficient implementation (like CPython chose) or efficient implementation (like MicroPython chose) - all transparently. os.scandir() supposedly opens up efficient implementation for everyone, but at the price of bloating API and introducing heavy-weight objects to wrap info. PEP calls it "lightweight DirEntry objects", but that cannot be true, because all Python objects are heavy-weight, especially those which have methods.

It would be better if os.scandir() was specified to return a struct (named tuple) compatible with return value of os.stat() (with only fields relevant to underlying readdir()-like system call). The grounds for that are obvious: it's already existing data interface in module "os", which is also based on open standard for operating systems - POSIX, so if one is to expect something about file attributes, it's what one can reasonably base expectations on.

But reusing os.stat struct is glaringly not what's proposed. And it's clear where that comes from - "[DirEntry.]lstat(): like os.lstat(), but requires no system calls on Windows". Nice, but OS "FooBar" can do much more than Windows - it has a system call to send a file by email, right when scanning a directory containing it. So, why not to have DirEntry.send_by_email(recipient) method? I hear the answer - it's because CPython strives to support Windows well, while doesn't care about "FooBar" OS.

And then it again leads to the question I posed several times - where's line between "CPython" and "Python"?
Is it grounded for CPython to add (or remove) to Python stdlib something which is useful for its users, but useless or complicating for other Python implementations? Especially taking into account that there's "win32api" module allowing Windows users to use all wonders of its API? Especially that os.stat struct is itself pretty extensible (https://docs.python.org/3.4/library/os.html#os.stat : "On other Unix systems (such as FreeBSD), the following attributes may be available ...", "On Mac OS systems...", - so extra fields can be added for Windows just the same, if really needed).
[] -- Best regards, Paul mailto:pmiscml@gmail.com

Hello, On Thu, 26 Jun 2014 17:35:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:
Because you need to call them. And if the only thing they do is return object field, call overhead is rather noticeable.
namedtuples have methods.
Yes, unfortunately. But fortunately, named tuple is a subclass of tuple, so user caring for efficiency can just use numeric indexing which existed for os.stat values all the time, blissfully ignoring cruft which have been accumulating there since 1.5 times. -- Best regards, Paul mailto:pmiscml@gmail.com
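
For readers unfamiliar with the numeric indexing mentioned here, a small sketch follows. It uses only the standard os and stat modules; the variable names are illustrative. Note that os.stat_result is a struct sequence rather than a literal collections.namedtuple, but it supports the same tuple-style access.

```python
import os
import stat

st = os.stat(os.curdir)

# Tuple-style access, using the index constants from the stat module.
mode_by_index = st[stat.ST_MODE]
size_by_index = st[stat.ST_SIZE]

# Attribute-style access to the same underlying fields.
mode_by_name = st.st_mode
size_by_name = st.st_size

# Both spellings read the same values for these integer fields.
same_values = (mode_by_index, size_by_index) == (mode_by_name, size_by_name)
```

The time fields are a known exception: indexing yields whole seconds as integers, while attributes such as st_mtime are floats.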

Nice (though I see the implementation is very *nix specific).
It's a fair point that os.walk() can be implemented efficiently without adding a new function and API. However, often you'll want more info, like the file size, which scandir() can give you via DirEntry.lstat(), which is free on Windows. So opening up this efficient API is beneficial. In CPython, I think the DirEntry objects are as lightweight as stat_result objects. I'm an embedded developer by background, so I know the constraints here, but I really don't think Python's development should be tailored to fit MicroPython. If os.scandir() is not very efficient on MicroPython, so be it -- 99% of all desktop/server users will gain from it.
Yes, we considered this early on (see the python-ideas and python-dev threads referenced in the PEP), but decided it wasn't a great API to overload stat_result further, and have most of the attributes None or not present on Linux.
Yes. Incidentally, I just submitted an (accepted) patch for Python 3.5 that adds the full Win32 file attribute data to stat_result objects on Windows (see https://docs.python.org/3.5/whatsnew/3.5.html#os). However, for scandir() to be useful, you also need the name. My original version of this directory iterator returned two-tuples of (name, stat_result). But most people didn't like the API, and I don't really either. You could overload stat_result with a .name attribute in this case, but it still isn't a nice API to have most of the attributes None, and then you have to test for that, etc. So basically we tweaked the API to do what was best, and ended up with it returning DirEntry objects with is_file() and similar methods. Hope that helps give a bit more context. If you haven't read the relevant python-ideas and python-dev threads, those are interesting too. -Ben

Hello, On Thu, 26 Jun 2014 21:52:43 -0400 Ben Hoyt <benhoyt@gmail.com> wrote: []
Surely, tailoring Python to MicroPython's needs is completely not what I suggest. It was an example of alternative implementation which optimized os.walk() without need for any additional public module APIs. Vice-versa, high-level nature of API call like os.walk() and underspecification of low-level details (like which function implemented in terms of which others) allow MicroPython provide optimized implementation even with its resource constraints. So, power of high-level interfaces and underspecification should not be underestimated ;-). But I don't want to argue that os.scandir() is "not needed", because that's hardly productive. Something I'd like to prototype in uPy and ideally lead further up to PEP status is to add iterator-based string methods, and I pretty much can expect "we lived without it" response, so don't want to go the same way regarding addition of other iterator-based APIs - it's clear that more iterator/generator based APIs is a good direction for Python to evolve.
[]
Yes, returning (name, stat_result) would be my first motion too, I don't see why someone wouldn't like pair of 2 values, with each value of obvious type and semantics within "os" module. Regarding stat result, os.stat() provides full information about a file, and intuitively, one may expect that os.scandir() would provide subset of that info, asymptotically reaching volume of what os.stat() may provide, depending on OS capabilities. So, if truly OS-independent interface is wanted to salvage more data from a dir scanning, using os.stat struct as data interface is hard to ignore.

But well, if it was rejected already, what can be said? Perhaps, at least the PEP could be extended to explicitly mention other approaches which were discussed and rejected, not just link to a discussion archive (from experience with reading other PEPs, they oftentimes contained such subsections, so hope this suggestion is not ungrounded).
-- Best regards, Paul mailto:pmiscml@gmail.com

On Fri, Jun 27, 2014 at 03:07:46AM +0300, Paul Sokolovsky wrote:
os.scandir is not part of the Python API, it is not a built-in function. It is part of the CPython standard library. That means (in my opinion) that there is an expectation that other Pythons should provide it, but not an absolute requirement. Especially for the os module, which by definition is platform-specific. In my opinion that means you have four options:

1. provide os.scandir, with exactly the same semantics as on CPython;
2. provide os.scandir, but change its semantics to be more lightweight (e.g. return an ordinary tuple, as you already suggest);
3. don't provide os.scandir at all; or
4. do something different depending on whether the platform is Linux or an embedded system.

I would consider any of those acceptable for a library feature, but not for a language feature.

[...]
Correct. If there is sufficient demand for FooBar, then CPython may support it. Until then, FooBarPython can support it, and offer whatever platform-specific features are needed within its standard library.
I think so. And other implementations are free to do the same thing.

Of course there is an expectation that the standard library of most implementations will be broadly similar, but not that they will be identical.

I am surprised that both Jython and IronPython offer a non-functioning dis module: you can import it successfully, but if there's a way to actually use it, I haven't found it:

steve@orac:~$ jython
Jython 2.5.1+ (Release_2_5_1, Aug 4 2010, 07:18:19)
[OpenJDK Server VM (Sun Microsystems Inc.)] on java1.6.0_27
Type "help", "copyright", "credits" or "license" for more information.

IronPython gives a different exception:

steve@orac:~$ ipy
IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
It's quite annoying, I would have rather that they just removed the module altogether. Better still would have been to disassemble code objects to whatever byte code the Java and .Net platforms use. But there's surely no requirement to disassemble to CPython byte code! -- Steven

Hello, On Fri, 27 Jun 2014 12:08:41 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
Ok, so standard library also has API, and that's the API being discussed.
Yes, that's intuitive, but not strict and formal, so is subject to interpretations. As a developer working on alternative Python implementation, I'd like to have better understanding of what needs to be done to be a compliant implementation (in particular, because I need to pass that info down to the users). So, I was told that https://docs.python.org/3/reference/index.html describes Python, not CPython. Next step is figuring out whether https://docs.python.org/3/library/index.html describes Python or CPython, and if the latter, how to separate Python's stdlib essence from extended library CPython provides?
Good, thanks. If that represents shared opinion of (C)Python developers (so, there won't be claims like "MicroPython is not Python because it doesn't provide os.scandir()" (or hundred of other missing stdlib functions ;-) )) that's good enough already. With that in mind, I wished that any Python implementation was as complete and as efficient as possible, and one way to achieve that is to not add stdlib entities without real need (be it more API calls or more data types). So, I'm glad to know that os.scandir() passed thru Occam's Razor in this respect and specified the way it is really for common good. [] -- Best regards, Paul mailto:pmiscml@gmail.com

I'm generally +1, with opinions noted below on these two topics. On 6/26/2014 3:59 PM, Ben Hoyt wrote:
+1
Because another common pattern is to check whether the name matches a pattern, I think it would be good to have a feature that provides this. I do that in my own private directory listing extensions, and some command lines also expose it to the user. Where exposed to the user, I use -p windows-pattern and -P regexp.

My implementation converts the windows-pattern to a regexp, and then uses common code, but for this particular API, because the windows_wildcard can be optimized by the Windows API call used, it would make more sense to pass windows_wildcard directly to FindFirst on Windows, but on *nix convert it to a regexp. Both Windows and *nix would call re to process pattern matches except for the case on Windows of having a Windows pattern passed in. The alternate parameter could simply be called wildcard, and would be a regexp. If desired, other flavors of wildcard (bsd_wildcard?) could also be implemented, but I'm not sure there are any benefits to them, as there are, as far as I am aware, no optimizations for those patterns in those systems.
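
A rough sketch of the "convert the wildcard to a regexp, then use common code" idea follows. It is not the implementation described in this message. It assumes Python 3.5 or later for os.scandir(), the helper name scandir_matching is hypothetical, and fnmatch implements Unix shell-style wildcards, which only approximate the matching rules of the Windows FindFirstFile call.

```python
import fnmatch
import os
import re
import tempfile


def scandir_matching(path, wildcard):
    """Yield os.scandir() entries whose names match a shell-style wildcard."""
    # fnmatch.translate() converts a pattern such as "*.jpg" to a regex string.
    # IGNORECASE is an assumption, loosely mimicking Windows name matching.
    pattern = re.compile(fnmatch.translate(wildcard), re.IGNORECASE)
    for entry in os.scandir(path):
        if pattern.match(entry.name):
            yield entry


# Try it on a temporary directory with mixed file names.
with tempfile.TemporaryDirectory() as root:
    for name in ("photo.JPG", "pic.jpg", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    matched = sorted(e.name for e in scandir_matching(root, "*.jpg"))
```

Filtering in Python like this is portable, but it still reads every entry from the operating system; only a native wildcard passed to the OS call could avoid that.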

On 26 June 2014 23:59, Ben Hoyt <benhoyt@gmail.com> wrote:
Would love feedback on the PEP, but also of course on the proposal itself.
A solid +1 from me.

Some specific points:

- I'm in favour of it being in the os module. It's more discoverable there, as well as the other reasons mentioned.
- I prefer scandir as the name, for the reason you gave (the output isn't the same as an iterator version of listdir)
- I'm mildly against windows_wildcard (even though I'm a windows user)
- You mention the caching behaviour of DirEntry objects. The limitations should be clearly covered in the final docs, as it's the sort of thing people will get wrong otherwise.

Paul

Hi, You wrote a great PEP Ben, thanks :-) But it's now time for comments!
But the underlying system calls -- ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide readdir?

You should add a link to FindFirstFile doc: http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%2...

It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we should mimic stat_result recent addition: the new stat_result.file_attributes field. Add DirEntry.file_attributes which would only be available on Windows.

The Windows structure also contains

FILETIME ftCreationTime;
FILETIME ftLastAccessTime;
FILETIME ftLastWriteTime;
DWORD nFileSizeHigh;
DWORD nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that the exact API is different depending on the OS for functions of the os module.
Does your implementation uses a free list to avoid the cost of memory allocation? A short free list of 10 or maybe just 1 may help. The free list may be stored directly in the generator object.
Does it support also bytes filenames on UNIX? Python now supports undecodable filenames thanks to the PEP 383 (surrogateescape). I prefer to use the same type for filenames on Linux and Windows, so Unicode is better. But some users might prefer bytes for other reasons.
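
For context on the PEP 383 mechanism mentioned here, a minimal sketch follows. It calls the codec error handler directly so it behaves the same on any platform; on POSIX, os.fsdecode() and os.fsencode() apply the same handler using the filesystem encoding. The sample name is made up.

```python
# Latin-1 encoded bytes for "café.txt"; the \xe9 byte is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# here U+DCE9, instead of raising UnicodeDecodeError.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text_name.encode("utf-8", "surrogateescape")

print(ascii(text_name))        # 'caf\udce9.txt'
print(round_trip == raw_name)  # True
```

This is why a str-only API can still represent every file name on UNIX, although the resulting strings cannot be printed or encoded strictly without care.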
The ``DirEntry`` attribute and method names were chosen to be the same as those in the new ``pathlib`` module for consistency.
Great! That's exactly what I expected :-) Consistency with other modules.
Crazy idea: would it be possible to "convert" a DirEntry object to a pathlib.Path object without losing the cache? I guess that pathlib.Path expects a full stat_result object.
I don't understand how you can build a full lstat() result without really calling stat. I see that WIN32_FIND_DATA contains the size, but here you call lstat(). If you know that it's not a symlink, you already know the size, but you still have to call stat() to retrieve all fields required to build a stat_result no?
Do you plan to continue to maintain your module for Python < 3.5, but upgrade your module for the final PEP?
Yes, put it in the os module which is already bloated :-)
I think that it would be very convenient to store the directory name in the DirEntry. It should be light, it's just a reference. And provide a fullname() method which would just return os.path.join(path, entry.name) without trying to resolve path to get an absolute path.
Would it be hard to implement the wildcard feature on UNIX to compare the performance of scandir('*.jpg') with and without the wildcard built into os.scandir? I implemented it in C for the tracemalloc module (Filter object): http://hg.python.org/features/tracemalloc Get the revision 69fd2d766005 and search for match_filename_joker() in Modules/_tracemalloc.c. The function matches the filename backward because in most cases, the last letter is enough to reject a filename (ex: "*.jpg" => reject filenames not ending with "g"). The filename is normalized before matching the pattern: converted to lowercase and / is replaced with \ on Windows. It was decided to drop the Filter object to keep the tracemalloc module as simple as possible. Charles-François was not convinced by the speedup. But the tracemalloc case is different because the OS didn't provide an API for that. Victor
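For comparison, cross-platform filtering can be layered on top of the iterator in pure Python. The following is only a minimal sketch, assuming the os.scandir() API as it eventually shipped (context-manager support needs Python 3.6+). The helper name scandir_filtered is hypothetical, and fnmatch follows shell-style rules, not the Windows wildcard semantics discussed in this thread.

```python
import fnmatch
import os


def scandir_filtered(path, pattern):
    """Yield DirEntry objects whose name matches a shell-style pattern.

    Hypothetical helper: the filtering happens in Python after the
    directory scan, not inside the OS call.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry
```

Usage would look like `for entry in scandir_filtered('.', '*.jpg'): ...`. Because every name still crosses into Python before being rejected, this says nothing about how fast an OS-level wildcard would be.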

I guess it'd be better to say "Windows" and "Unix-based OSs" throughout the PEP? Because all of these (including Mac OS X) are Unix-based.
I think you've misunderstood how DirEntry.lstat() works on Windows -- it's basically a no-op, as Windows returns the full stat information with the original FindFirst/FindNext OS calls. This is fairly explicit in the PEP, but I'm sure I could make it clearer: DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows". So you can already get the dwFileAttributes for free by saying entry.lstat().st_file_attributes. You can also get all the other fields you mentioned for free via .lstat() with no additional OS calls on Windows, for example: entry.lstat().st_size. Feel free to suggest changes to the PEP or scandir docs if this isn't clear. Note that is_dir()/is_file()/is_symlink() are free on all systems, but .lstat() is only free on Windows.
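To make the cost split concrete, here is a rough sketch of a recursive size calculation. It assumes the API as it eventually shipped in Python 3.5+, where DirEntry.stat(follow_symlinks=False) plays the role of the lstat() method discussed here and DirEntry.path holds the joined path. The function name tree_size is only illustrative.

```python
import os


def tree_size(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # The type checks can usually be answered from the directory
            # scan itself, without another system call.
            if entry.is_dir(follow_symlinks=False):
                total += tree_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                # The size needs stat data: already available on Windows,
                # one extra system call per file on POSIX.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

Only the regular files pay for stat data in this sketch; directories and symlinks are classified from what the scan already returned.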
No, it doesn't. I might add this to the PEP under "possible improvements". However, I think the speed increase by removing the extra OS call and/or disk seek is going to be way more than memory allocation improvements, so I'm not sure this would be worth it.
Does it support also bytes filenames on UNIX?
I forget exactly now what my scandir module does, but for os.scandir() I think this should behave exactly like os.listdir() does for Unicode/bytes filenames.
The main problem is that pathlib.Path objects explicitly don't cache stat info (and Guido doesn't want them to, for good reason I think). There's a thread on python-dev about this earlier. I'll add it to a "Rejected ideas" section.
See above.
Do you plan to continue to maintain your module for Python < 3.5, but upgrade your module for the final PEP?
Yes, I intend to maintain the standalone scandir module for 2.6 <= Python < 3.5, at least for a good while. For integration into the Python 3.5 stdlib, the implementation will be integrated into posixmodule.c, of course.
Yeah, fair suggestion. I'm still slightly on the fence about this, but I think an explicit fullname() is a good suggestion. Ideally I think it'd be better to mimic pathlib.Path.__str__() which is kind of the equivalent of fullname(). But how does pathlib deal with unicode/bytes issues if it's the str function which has to return a str object? Or at least, it'd be very weird if __str__() returned bytes. But I think it'd need to if you passed bytes into scandir(). Do others have thoughts?
It's a good idea, the problem with this is that the Windows wildcard implementation has a bunch of crazy edge cases where *.ext will catch more things than just a simple regex/glob. This was discussed on python-dev or python-ideas previously, so I'll dig it up and add to a Rejected Ideas section. In any case, this could be added later if there's a way to iron out the Windows quirks. -Ben

On 29 June 2014 05:48, Ben Hoyt <benhoyt@gmail.com> wrote:
*nix and POSIX-based are the two conventions I use.
The key problem with caches on pathlib.Path objects is that you could end up with two separate path objects that referred to the same filesystem location but returned different answers about the filesystem state because their caches might be stale. DirEntry is different, as the content is generally *assumed* to be stale (referring to when the directory was scanned, rather than the current filesystem state). DirEntry.lstat() on POSIX systems will be an exception to that general rule (referring to the time of first lookup, rather than when the directory was scanned, so the answer from lstat() may be inconsistent with other data stored directly on the DirEntry object), but one we can probably live with. More generally, as part of the pathlib PEP review, we figured out that a *per-object* cache of filesystem state would be an inherently bad idea, but a string based *process global* cache might make sense for modules like walkdir (not part of the stdlib - it's an iterator pipeline based approach to file tree scanning I wrote a while back, that currently suffers badly from the performance impact of repeated stat calls at different stages of the pipeline). We realised this was getting into a space where application and library specific concerns are likely to start affecting the caching design, though, so the current status of standard library level stat caching is "it's not clear if there's an available approach that would be sufficiently general purpose to be appropriate for inclusion in the standard library". Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 28 Jun 2014, at 21:48, Ben Hoyt wrote:
However, it would be bad to have two implementations of the concept of "filename" with different attribute and method names. The best way to ensure compatible APIs would be if one class was derived from the other.
[...]
Servus, Walter

On 27.06.2014 00:59, Ben Hoyt wrote:
I find this behaviour a bit misleading: using methods and having them return cached results. How much (implementation and/or performance and/or memory) overhead would be incurred by using property-like access here? I think this would underline the static nature of the data. This would break the semantics with respect to pathlib, but they’re only marginally equal anyways -- and as far as I understand it, pathlib won’t cache, so I think this has a fair point here. regards, jwi

On 28 Jun 2014 01:27, "Jonas Wielicki" <j.wielicki@sotecware.net> wrote:
Indeed - using properties rather than methods may help emphasise the deliberate *difference* from pathlib in this case (i.e. value when the result was retrieved from the OS, rather than the value right now). The main benefit is that switching from using the DirEntry object to a pathlib Path will require touching all the places where the performance characteristics switch from "memory access" to "system call". This benefit is also the main downside, so I'd actually be OK with either decision on this one. Other comments:
* +1 on the general idea
* +1 on scandir() over iterdir, since it *isn't* just an iterator version of listdir
* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration of the API (that could be done later as a separate RFE instead)
* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
* regarding "why not a 2-tuple", we know from experience that operating systems evolve and we end up wanting to add additional info to this kind of API. A dedicated DirEntry type lets us adjust the information returned over time, without breaking backwards compatibility and without resorting to ugly hacks like those in some of the time and stat APIs (or even our own codec info APIs)
* it would be nice to see some relative performance numbers for NFS and CIFS network shares - the additional network round trips can make excessive stat calls absolutely brutal from a speed perspective when using a network drive (that's why the stat caching added to the import system in 3.3 dramatically sped up the case of having network drives on sys.path, and why I thought AJ had a point when he was complaining about the fact we didn't expose the dirent data from os.listdir)
Regards, Nick.
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com

On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Agreed. Globbing or filtering support should not hold this up. If that part isn't settled, just don't include it and work out what it should be as a future enhancement.
* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
+1. IMNSHO, one of the most important parts of PEPs: capturing the entire decision process to document the "why nots".
fwiw, I wouldn't wait for benchmark numbers. A needless stat call when you've got the information from an earlier API call is already brutal. It is easy to compute from existing ballparks remote file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms. fetch of stat info cached in memory on file server on the local network: ~500us. You can go down further to local system call overhead which can vary wildly but should likely be assumed to be at least 10us. You don't need a benchmark to tell you that adding needless >= 500us-100ms blocking operations to your program is bad. :) -gps
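The ballpark arithmetic can be spelled out in a few lines. This is only an illustration: the latencies are the rough figures quoted above, while the tree size N and the helper name wasted_seconds are made-up assumptions.

```python
# Ballpark per-call latencies quoted above, in seconds.
LATENCY = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call": 10e-6,
}


def wasted_seconds(entries, latency):
    """Time spent on one avoidable stat call per directory entry."""
    return entries * latency


N = 100_000  # hypothetical number of entries in the tree
for label, latency in LATENCY.items():
    print(f"{label}: {wasted_seconds(N, latency):,.1f} s")
```

Even at the cheapest tier the cost grows linearly with the number of entries, and each step up the latency ladder multiplies it by one to two orders of magnitude.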

On 28 June 2014 16:17, Gregory P. Smith <greg@krypto.org> wrote:
Agreed, but walking even a moderately large tree over the network can really hammer home the point that this offers a significant performance enhancement as the latency of access increases. I've found that kind of comparison can be eye-opening for folks that are used to only operating on local disks (even spinning disks, let alone SSDs) and/or relatively small trees (distro build trees aren't *that* big, but they're big enough for this kind of difference in access overhead to start getting annoying). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 28 June 2014 19:17, Nick Coghlan <ncoghlan@gmail.com> wrote:
Oops, forgot to add - I agree this isn't a blocking issue for the PEP, it's definitely only in "nice to have" territory. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Re is_dir etc being properties rather than methods:
The problem with this is that properties "look free", they look just like attribute access, so you wouldn't normally handle exceptions when accessing them. But .lstat() and .is_dir() etc may do an OS call, so if you're needing to be careful with error handling, you may want to handle errors on them. Hence I think it's best practice to make them functions(). Some of us discussed this on python-dev or python-ideas a while back, and I think there was general agreement with what I've stated above and therefore they should be methods. But I'll dig up the links and add to a Rejected ideas section.
* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
Great idea. I'll add a bunch of stuff, including the above, to a new section, Rejected Design Options.
Fully agreed.
Don't know if you saw, but there are actually some benchmarks, including one over NFS, on the scandir GitHub page: https://github.com/benhoyt/scandir#benchmarks os.walk() was 23 times faster with scandir() than the current listdir() + stat() implementation on the Windows NFS file system I tried. Pretty good speedup! -Ben
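For readers who have not looked at the scandir module, the shape of the speedup can be sketched as a stripped-down walk. This is not the implementation that was benchmarked, only an illustration assuming the os.scandir() API as it shipped in Python 3.5+. Unlike os.walk(), it lists symlinks to directories under files and has no topdown, onerror or followlinks options.

```python
import os


def walk(top):
    """Simplified top-down os.walk() built on os.scandir()."""
    dirs, files = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() normally reuses the type reported by the directory
            # scan, so no separate stat() call is needed per entry.
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    yield top, dirs, files
    for name in dirs:
        yield from walk(os.path.join(top, name))
```

The listdir() + stat() version has to ask the OS about every name a second time just to sort it into dirs or files; here that classification comes along with the scan.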

On 29 June 2014 05:55, Ben Hoyt <benhoyt@gmail.com> wrote:
Yes, only the stuff that *never* needs a system call (regardless of OS) would be a candidate for handling as a property rather than a method call. Consistency of access would likely trump that idea anyway, but it would still be worth ensuring that the PEP is clear on which values are guaranteed to reflect the state at the time of the directory scanning and which may imply an additional stat call.
No, I hadn't seen those - may be worth referencing explicitly from the PEP (and if there's already a reference... oops!)
Ah, nice! Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Jun 28, 2014 at 03:55:00PM -0400, Ben Hoyt wrote:
I think this one could go either way. Methods look like they actually re-test the value each time you call it. I can easily see people not realising that the value is cached and writing code like this toy example:

    # Detect a file change.
    t = the_file.lstat().st_mtime
    while the_file.lstat().st_mtime == t:
        sleep(0.1)
    print("Changed!")

I know that's not the best way to detect file changes, but I'm sure people will do something like that and not realise that the call to lstat is cached. Personally, I would prefer a property. If I forget to wrap a call in a try...except, it will fail hard and I will get an exception. But with a method call, the failure is silent and I keep getting the cached result. Speaking of caching, is there a way to freshen the cached values? -- Steven
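The staleness concern is easy to reproduce. The sketch below assumes the API as it eventually shipped in Python 3.5+, where the cached call is DirEntry.stat() and not lstat(); the file name and variable names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    with open(target, "wb") as f:
        f.write(b"x")  # file starts out 1 byte long

    with os.scandir(tmp) as entries:
        entry = next(entries)

    cached_size = entry.stat().st_size  # first call; the result is cached

    with open(target, "ab") as f:
        f.write(b"yyyy")  # file grows to 5 bytes

    stale_size = entry.stat().st_size         # served from the DirEntry cache
    fresh_size = os.stat(entry.path).st_size  # asks the OS again
    print(cached_size, stale_size, fresh_size)
```

The second entry.stat() call still reports the old size, so a polling loop like the toy example would never terminate. Getting a fresh answer means going back to os.stat() (or a pathlib.Path) on entry.path.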

On 29 June 2014 20:52, Steven D'Aprano <steve@pearwood.info> wrote:
Speaking of caching, is there a way to freshen the cached values?
Switch to a full Path object instead of relying on the cached DirEntry data. This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is *really* hard to make intuitive, especially when it *sometimes* returns data that looks fresh (as it does on first call on POSIX systems). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 29.06.2014 13:08, Nick Coghlan wrote:
This bugs me too. An idea I had was adding a keyword argument to scandir which specifies whether stat data should be added to the direntry or not. If the flag is set to True, this would implicitly call lstat on POSIX before returning the DirEntry, and use the available data on Windows. If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows. This is not optimal in cases where the stat information is needed only for some of the DirEntry objects, but would also reduce the required logic in the DirEntry object. Thoughts?
Regards, Nick.

On 06/29/2014 04:12 AM, Jonas Wielicki wrote:
If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows.
-1 This consistency is unnecessary. -- ~Ethan~

On 29.06.2014 19:04, Ethan Furman wrote:
I’m not sure -- similar to the windows_wildcard option this might be a temptation to write platform dependent code, although possibly by accident (i.e. not reading the docs carefully).

On 29 June 2014 12:08, Nick Coghlan <ncoghlan@gmail.com> wrote:
If it matters that much we *could* simply call it cached_lstat(). It's ugly, but I really don't like the idea of throwing the information away - after all, the fact that we currently throw data away is why there's even a need for scandir. Let's not make the same mistake again... Paul

On 29 June 2014 21:45, Paul Moore <p.f.moore@gmail.com> wrote:
Future-proofing is the reason DirEntry is a full fledged class in the first place, though. Effectively communicating the behavioural difference between DirEntry and pathlib.Path is the main thing that makes me nervous about adhering too closely to the Path API. To restate the problem and the alternative proposal, these are the DirEntry methods under discussion:

    is_dir(): like os.path.isdir(), but requires no system calls on at least POSIX and Windows
    is_file(): like os.path.isfile(), but requires no system calls on at least POSIX and Windows
    is_symlink(): like os.path.islink(), but requires no system calls on at least POSIX and Windows
    lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make them properties (or just ordinary attributes):

    is_dir
    is_file
    is_symlink

What to do with lstat() is currently less clear, since POSIX directory scanning doesn't provide that level of detail by default. The PEP also doesn't currently state whether the is_dir(), is_file() and is_symlink() results would be updated if a call to lstat() produced different answers than the original directory scanning process, which further suggests to me that allowing the stat call to be delayed on POSIX systems is a potentially problematic and inherently confusing design. We would have two options:

- update them, meaning calling lstat() may change those results from being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the DirEntry.lstat() result may give different answers

Those both sound ugly to me. So, here's my alternative proposal: add an "ensure_lstat" flag to scandir() itself, and don't have *any* methods on DirEntry, only attributes. That would make the DirEntry attributes:

    is_dir: boolean, always populated
    is_file: boolean, always populated
    is_symlink: boolean, always populated
    lstat_result: stat result, may be None on POSIX systems if ensure_lstat is False

(I'm not particularly sold on "lstat_result" as the name, but "lstat" reads as a verb to me, so doesn't sound right as an attribute name)

What this would allow:

- by default, scanning is efficient everywhere, but lstat_result may be None on POSIX systems
- if you always need the lstat result, setting "ensure_lstat" will trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat() explicitly when the DirEntry lstat attribute is None

Most importantly, *regardless of platform*, the cached stat result (if not None) would reflect the state of the entry at the time the directory was scanned, rather than at some arbitrary later point in time when lstat() was first called on the DirEntry object. There'd still be a slight window of discrepancy (since the filesystem state may change between reading the directory entry and making the lstat() call), but this could be effectively eliminated from the perspective of the Python code by making the result of the lstat() call authoritative for the whole DirEntry object. Regards, Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__ to keep the memory overhead to a minimum (that's just a general comment - it's really irrelevant to the methods-or-attributes question).

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
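The attribute-only design can be emulated on top of the API that eventually shipped, which makes its trade-offs easier to see. This is just a sketch under assumptions: EntrySnapshot and scandir_snapshot are hypothetical names, a name field is added for convenience, and with ensure_lstat=False the stat result is left as None on every platform (the variant discussed later in the thread), even where the OS returned it for free.

```python
import os
from collections import namedtuple

# Hypothetical snapshot type mirroring the attributes proposed above.
EntrySnapshot = namedtuple(
    "EntrySnapshot", "name is_dir is_file is_symlink lstat_result")


def scandir_snapshot(path=".", ensure_lstat=False):
    """Yield attribute-only snapshots of each entry in path."""
    with os.scandir(path) as entries:
        for entry in entries:
            # With ensure_lstat=True the stat data is fetched right away,
            # so it reflects (almost) the moment of the directory scan.
            lstat_result = (
                entry.stat(follow_symlinks=False) if ensure_lstat else None)
            yield EntrySnapshot(
                name=entry.name,
                is_dir=entry.is_dir(follow_symlinks=False),
                is_file=entry.is_file(follow_symlinks=False),
                is_symlink=entry.is_symlink(),
                lstat_result=lstat_result,
            )
```

A caller that only sometimes needs stat data would check `snap.lstat_result is None` and call os.lstat() itself, which is exactly the extra application-level branch that later replies in the thread object to.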

On 6/29/2014 5:28 AM, Nick Coghlan wrote:
+1 to this in particular, but this whole refresh of the semantics sounds better overall. Finally, for the case where someone does want to keep the DirEntry around, a .refresh() API could rerun lstat() and update all the data. And with that (initial data potentially always populated, or None, and an explicit refresh() API), the data could all be returned as properties, implying that they aren't fetching new data themselves, because they wouldn't be. Glenn

Yeah, I quite like this. It does make the caching more explicit and consistent. It's slightly annoying that it's less like pathlib.Path now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't matter. The differences in naming may highlight the difference in caching, so maybe it's a good thing. Two further questions from me: 1) How does error handling work? Now os.stat() will/may be called during iteration, so in __next__. But it's hard to catch errors because you don't call __next__ explicitly. Is this a problem? How do other iterators that make system calls or raise errors handle this? 2) There's still the open question in the PEP of whether to include a way to access the full path. This is cheap to build, it has to be built anyway on POSIX systems, and it's quite useful for further operations on the file. I think the best way to handle this is a .fullname or .full_name attribute as suggested elsewhere. Thoughts? -Ben

On 1 July 2014 03:05, Ben Hoyt <benhoyt@gmail.com> wrote:
I'm torn between whether I'd prefer the stat fields to be populated on Windows if ensure_lstat=False or not. There are good arguments each way, but overall I'm inclining towards having it consistent with POSIX - don't populate them unless ensure_lstat=True. +0 for stat fields to be None on all platforms unless ensure_lstat=True.
See my comments below on .fullname.
I think it just needs to be documented that iterating may throw the same exceptions as os.lstat(). It's a little trickier if you don't want the scope of your exception to be too broad, but you can always wrap the iteration in a generator to catch and handle the exceptions you care about, and allow the rest to propagate.

    import os

    def scandir_accessible(path='.'):
        gen = os.scandir(path)
        while True:
            try:
                yield next(gen)
            except StopIteration:
                # End of directory: return explicitly, since StopIteration
                # must not escape a generator body (PEP 479).
                return
            except PermissionError:
                # Skip entries we are not allowed to inspect.
                pass

2) There's still the open question in the PEP of whether to include a
+1 for .fullname. The earlier suggestion to have __str__ return the name is killed I think by the fact that .fullname could be bytes. It would be nice if pathlib.Path objects were enhanced to take a DirEntry and use the .fullname automatically, but you could always call Path(direntry.fullname). Tim Delaney
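For reference, the released API settled on a DirEntry.path attribute where this thread says .fullname. The short sketch below assumes Python 3.6+ and shows the two properties discussed here: the value is the scanned path joined with the entry name, and its type follows the type passed to scandir(). The file name is arbitrary.

```python
import os
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()

    with os.scandir(tmp) as entries:
        entry = next(entries)
    # DirEntry.path is the scanned path joined with the entry name,
    # without resolving it to an absolute path.
    joined = os.path.join(tmp, entry.name)
    as_path = pathlib.Path(entry.path)

    # Scanning a bytes path yields bytes names and bytes paths.
    with os.scandir(os.fsencode(tmp)) as entries:
        bytes_entry = next(entries)
    path_type = type(bytes_entry.path)

    print(entry.path == joined, as_path.name, path_type.__name__)
```

Converting to pathlib is then a plain `Path(entry.path)` call for str paths; none of the cached DirEntry data is carried over to the Path object.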

On 06/30/2014 03:07 PM, Tim Delaney wrote:
If a Windows user just needs the free info, why should s/he have to pay the price of a full stat call? I see no reason to hold the Windows side back and not take advantage of what it has available. There are plenty of posix calls that Windows is not able to use, after all. -- ~Ethan~

On 1 July 2014 08:38, Ethan Furman <ethan@stoneleaf.us> wrote:
On Windows ensure_lstat would be either a NOP (if the fields are always populated), or it simply determines if the fields get populated. No extra stat call. On POSIX it's the difference between an extra stat call or not. Tim Delaney

On 06/30/2014 04:15 PM, Tim Delaney wrote:
I suppose the exact behavior is still under discussion, as there are only two or three fields one gets "for free" on Windows (I think...), whereas an os.stat call would get everything available for the platform.
On POSIX it's the difference between an extra stat call or not.
Agreed on this part. Still, no reason to slow down the Windows side by throwing away info unnecessarily -- that's why this PEP exists, after all. -- ~Ethan~

On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
This won't work well if lstat info is only needed for some entries. Is that a common use-case? It was mentioned earlier in the thread. -- Devin

The proposal I was replying to was that: - There is no .refresh() - ensure_lstat=False means no OS has populated attributes - ensure_lstat=True means every OS has populated attributes Even if we add a .refresh(), the latter two items mean that you can't avoid doing extra work (either too much on windows, or too much on linux), if you want only a subset of the files' lstat info. -- Devin P.S. your mail client's quoting breaks my mail client (gmail)'s quoting. On Mon, Jun 30, 2014 at 7:04 PM, Glenn Linderman <v+python@g.nevcal.com> wrote:

On 30 Jun 2014 19:13, "Glenn Linderman" <v+python@g.nevcal.com> wrote:
If it is, use ensure_lstat=False, and use the proposed (by me) .refresh()
API to update the data for those that need it. I'm -1 on a refresh API for DirEntry - just use pathlib in that case. Cheers, Nick.

On 6/30/2014 10:17 PM, Nick Coghlan wrote:
I'm not sure refresh() is the best name, but I think a "get_stat_info_from_direntry_or_call_stat()" (hah!) makes sense. If you really need the stat info, then you can write simple code like:

    for entry in os.scandir(path):
        mtime = entry.get_stat_info_from_direntry_or_call_stat().st_mtime

And it won't call stat() any more times than needed. Once per file on Posix, zero times per file on Windows. Without an API like this, you'll need a check in the application code on whether or not to call stat(). Eric.

2014-07-01 4:04 GMT+02:00 Glenn Linderman <v+python@g.nevcal.com>:
We should make DirEntry as simple as possible. In Python, the classic behaviour is to not define an attribute if it's not available on a platform. For example, stat().st_file_attributes is only available on Windows. I don't like the idea of the ensure_lstat parameter because os.scandir would have to make two system calls, which makes it harder to guess which syscall failed (readdir or lstat). If you need lstat on UNIX, write:

    if hasattr(entry, 'lstat_result'):
        size = entry.lstat_result.st_size
    else:
        size = os.lstat(entry.fullname()).st_size

Victor

Ben Hoyt <benhoyt@gmail.com> writes:
Have you considered adding support for paths relative to directory descriptors [1] via keyword only dir_fd=None parameter if it may lead to more efficient implementations on some platforms? [1]: https://docs.python.org/3.4/library/os.html#dir-fd -- akira

On Sat, Jun 28, 2014 at 11:05 PM, Akira Li <4kir4.1i@gmail.com> wrote:
Potentially more efficient and also potentially safer (see 'man openat')... but an enhancement that can wait, if necessary. ChrisA

Chris Angelico <rosuav@gmail.com> writes:
Introducing the feature later creates unnecessary incompatibilities. Either it should be supported from the start, or it should be explicitly rejected in PEP 471 and something like `os.scandir(os.open(relative_path, dir_fd=fd))` recommended instead (assuming `os.scandir in os.supports_fd` like `os.listdir()`). At C level it could be implemented using fdopendir/openat or scandirat. Here's the function description using Argument Clinic DSL:

    /*[clinic input]
    os.scandir

        path : path_t(allow_fd=True, nullable=True) = '.'
            *path* can be specified as either str or bytes. On some
            platforms, *path* may also be specified as an open file
            descriptor; the file descriptor must refer to a directory.
            If this functionality is unavailable, using it raises
            NotImplementedError.

        *

        dir_fd : dir_fd = None
            If not None, it should be a file descriptor open to a
            directory, and *path* should be a relative string; path will
            then be relative to that directory. if *dir_fd* is
            unavailable, using it raises NotImplementedError.

    Yield a DirEntry object for each file and directory in *path*.

    Just like os.listdir, the '.' and '..' pseudo-directories are skipped,
    and the entries are yielded in system-dependent order.

    {parameters}

    It's an error to use *dir_fd* when specifying *path* as an open file
    descriptor.

    [clinic start generated code]*/

And corresponding tests (from test_posix:PosixTester), to show the compatibility with os.listdir argument parsing in detail:

    def test_scandir_default(self):
        # When scandir is called without argument,
        # it's the same as scandir(os.curdir).
        self.assertIn(support.TESTFN, [e.name for e in posix.scandir()])

    def _test_scandir(self, curdir):
        filenames = sorted(e.name for e in posix.scandir(curdir))
        self.assertIn(support.TESTFN, filenames)
        #NOTE: assume listdir, scandir accept the same types on the platform
        self.assertEqual(sorted(posix.listdir(curdir)), filenames)

    def test_scandir(self):
        self._test_scandir(os.curdir)

    def test_scandir_none(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(None)

    def test_scandir_bytes(self):
        # When scandir is called with a bytes object,
        # the returned entries names are still of type str.
        # Call `os.fsencode(entry.name)` to get bytes
        self.assertIn('a', {'a'})
        self.assertNotIn(b'a', {'a'})
        self._test_scandir(b'.')

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd_minus_one(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(-1)

    def test_scandir_float(self):
        # invalid args
        self.assertRaises(TypeError, posix.scandir, -1.0)

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd(self):
        fd = posix.open(posix.getcwd(), posix.O_RDONLY)
        self.addCleanup(posix.close, fd)
        self._test_scandir(fd)
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))
        # call 2nd time to test rewind
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))

    @unittest.skipUnless(posix.scandir in os.supports_dir_fd,
                         "test needs dir_fd support for os.scandir()")
    def test_scandir_dir_fd(self):
        relpath = 'relative_path'
        with support.temp_dir() as parent:
            fullpath = os.path.join(parent, relpath)
            with support.temp_dir(path=fullpath):
                support.create_empty_file(os.path.join(parent, 'a'))
                support.create_empty_file(os.path.join(fullpath, 'b'))
                fd = posix.open(parent, posix.O_RDONLY)
                self.addCleanup(posix.close, fd)
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))
                # check that fd is still useful
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))

-- Akira
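As a point of comparison with the proposal above: as far as I know, the released os.scandir() never gained a dir_fd parameter, but since Python 3.7 it accepts an open directory descriptor as path on Unix, and os.supports_fd reports whether that is available. A minimal sketch follows; the helper name names_via_fd and the fallback branch are assumptions for illustration.

```python
import os


def names_via_fd(path):
    """List entry names by scanning an open directory descriptor."""
    if os.scandir not in os.supports_fd:
        # Assumed fallback: plain path scan where descriptors are
        # not supported (for example on Windows).
        with os.scandir(path) as entries:
            return sorted(entry.name for entry in entries)
    fd = os.open(path, os.O_RDONLY)
    try:
        with os.scandir(fd) as entries:
            return sorted(entry.name for entry in entries)
    finally:
        # The caller still owns the descriptor it opened.
        os.close(fd)
```

A directory relative to another descriptor can be reached in two steps, by opening it with `os.open(name, os.O_RDONLY, dir_fd=parent_fd)` and scanning the resulting descriptor, which is close to the workaround suggested at the top of this message.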

On 6/26/2014 6:59 PM, Ben Hoyt wrote:
One of the major reasons for this seems to be efficiently using information that is already available from the OS "for free". Unfortunately it seems that the current API and most of the leading alternate proposals hide from the user what information is actually there "free" and what is going to incur an extra cost. I would prefer an API that simply gives whatever came for free from the OS and then let the user decide if the extra expense is worth the extra information. Maybe that stat information was only going to be used for an informational log that can be skipped if it's going to incur extra expense? Janzert

On 27 June 2014 09:28, MRAB <python@mrabarnett.plus.com> wrote:
Personally, I'd prefer the name 'iterdir' because it emphasises that it's an iterator.
Exactly what I was going to post (with the added note that there's an obvious symmetry with listdir).
+1 for iterdir rather than scandir
Other than that:
+1 for adding scandir to the stdlib
-1 for windows_wildcard (it would be an attractive nuisance to write windows-only code)
Tim Delaney

I don't mind iterdir() and would take it :-), but I'll just say why I chose the name scandir() -- though it wasn't my suggestion originally: iterdir() sounds like just an iterator version of listdir(), kinda like keys() and iterkeys() in Python 2. Whereas in actual fact the return values are quite different (DirEntry objects vs strings), and so the name change reflects that difference a little. I'm also -1 on windows_wildcard. I think it's asking for trouble, and wouldn't gain much on Windows in most cases anyway. -Ben On Thu, Jun 26, 2014 at 7:43 PM, Ethan Furman <ethan@stoneleaf.us> wrote:

On 2014-06-27 02:37, Ben Hoyt wrote:
[snip] The re module has 'findall', which returns a list of strings, and 'finditer', which returns an iterator that yields match objects, so there's a precedent. :-)

+1 on getting this in for 3.5. If the only objection people are having is the stupid paint color of the name I don't care what it's called! scandir matches the libc API of the same name. iterdir also makes sense to anyone reading it. Whoever checks this in can pick one and be done with it. We have other Python APIs with iter in the name and tend not to be trying to mirror C so much these days so the iterdir folks do have a valid point. I'm not a huge fan of the DirEntry object and the method calls on it instead of simply yielding tuples of (filename, partially_filled_in_stat_result) but I don't *really* care which is used as they both work fine and it is trivial to wrap with another generator expression to turn it into exactly what you want anyways. Python not having the ability to operate on large directories means Python simply cannot be used for common system maintenance tasks. Python being slow to walk a file system due to unnecessary stat calls (often each an entire io op. requiring a disk seek!) due to the existing information that it throws away not being used via listdir is similarly a problem. This addresses both. IMNSHO, it is a single function, it belongs in the os module right next to listdir. -gps On Thu, Jun 26, 2014 at 6:37 PM, Ben Hoyt <benhoyt@gmail.com> wrote:

On Jun 26, 2014, at 4:38 PM, Tim Delaney <timothy.c.delaney@gmail.com> wrote: On 27 June 2014 09:28, MRAB <python@mrabarnett.plus.com> wrote:
-1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code) Could you emulate it on other platforms? +1 on the rest of it. -Chris

Hello, On Thu, 26 Jun 2014 18:59:45 -0400 Ben Hoyt <benhoyt@gmail.com> wrote:
I noticed obvious inefficiency of os.walk() implemented in terms of os.listdir() when I worked on "os" module for MicroPython. I essentially did what your PEP suggests - introduced internal generator function (ilistdir_ex() in https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py... ), in terms of which both os.listdir() and os.walk() are implemented. With my MicroPython hat on, os.scandir() would make things only worse. With current interface, one can either have inefficient implementation (like CPython chose) or efficient implementation (like MicroPython chose) - all transparently. os.scandir() supposedly opens up efficient implementation for everyone, but at the price of bloating API and introducing heavy-weight objects to wrap info. PEP calls it "lightweight DirEntry objects", but that cannot be true, because all Python objects are heavy-weight, especially those which have methods. It would be better if os.scandir() was specified to return a struct (named tuple) compatible with return value of os.stat() (with only fields relevant to underlying readdir()-like system call). The grounds for that are obvious: it's already existing data interface in module "os", which is also based on open standard for operating systems - POSIX, so if one is to expect something about file attributes, it's what one can reasonably base expectations on. But reusing os.stat struct is glaringly not what's proposed. And it's clear where that comes from - "[DirEntry.]lstat(): like os.lstat(), but requires no system calls on Windows". Nice, but OS "FooBar" can do much more than Windows - it has a system call to send a file by email, right when scanning a directory containing it. So, why not to have DirEntry.send_by_email(recipient) method? I hear the answer - it's because CPython strives to support Windows well, while doesn't care about "FooBar" OS. And then it again leads to the question I posed several times - where's line between "CPython" and "Python"? 
Is it grounded for CPython to add (or remove) to Python stdlib something which is useful for its users, but useless or complicating for other Python implementations? Especially taking into account that there's "win32api" module allowing Windows users to use all wonders of its API? Especially that os.stat struct is itself pretty extensible (https://docs.python.org/3.4/library/os.html#os.stat : "On other Unix systems (such as FreeBSD), the following attributes may be available ...", "On Mac OS systems...", - so extra fields can be added for Windows just the same, if really needed).
[] -- Best regards, Paul mailto:pmiscml@gmail.com

Hello, On Thu, 26 Jun 2014 17:35:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:
Because you need to call them. And if the only thing they do is return object field, call overhead is rather noticeable.
namedtuples have methods.
Yes, unfortunately. But fortunately, named tuple is a subclass of tuple, so user caring for efficiency can just use numeric indexing which existed for os.stat values all the time, blissfully ignoring cruft which have been accumulating there since 1.5 times. -- Best regards, Paul mailto:pmiscml@gmail.com
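As a small illustration of the numeric indexing mentioned above: os.stat() results can still be indexed like the original tuple, and the stat module provides the index constants. The temporary file is only there to have something to stat.

```python
import os
import stat
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    st = os.stat(f.name)

# stat_result still behaves like a tuple, so the size is available both
# by attribute name and by position (stat.ST_SIZE is the index constant).
size_by_name = st.st_size
size_by_index = st[stat.ST_SIZE]
```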

Nice (though I see the implementation is very *nix specific).
It's a fair point that os.walk() can be implemented efficiently without adding a new function and API. However, often you'll want more info, like the file size, which scandir() can give you via DirEntry.lstat(), which is free on Windows. So opening up this efficient API is beneficial. In CPython, I think the DirEntry objects are as lightweight as stat_result objects. I'm an embedded developer by background, so I know the constraints here, but I really don't think Python's development should be tailored to fit MicroPython. If os.scandir() is not very efficient on MicroPython, so be it -- 99% of all desktop/server users will gain from it.
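To make the "more info, like the file size" point concrete, here is a rough sketch of summing file sizes in a tree. It is written against the os.scandir() released in Python 3.5, where lstat() became stat(follow_symlinks=False); the demo directory layout is made up.

```python
import os
import tempfile


def get_tree_size(path):
    """Return the total size in bytes of the regular files under path."""
    total = 0
    for entry in os.scandir(path):
        # is_dir()/is_file() normally need no extra system call.
        if entry.is_dir(follow_symlinks=False):
            total += get_tree_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # Free on Windows; one lstat call per file on POSIX.
            total += entry.stat(follow_symlinks=False).st_size
    return total


# Throwaway tree: root/a.bin (3 bytes) and root/sub/b.bin (4 bytes).
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.bin"), "wb") as f:
        f.write(b"123")
    with open(os.path.join(root, "sub", "b.bin"), "wb") as f:
        f.write(b"4567")
    tree_size = get_tree_size(root)
```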
Yes, we considered this early on (see the python-ideas and python-dev threads referenced in the PEP), but decided it wasn't a great API to overload stat_result further, and have most of the attributes None or not present on Linux.
Yes. Incidentally, I just submitted an (accepted) patch for Python 3.5 that adds the full Win32 file attribute data to stat_result objects on Windows (see https://docs.python.org/3.5/whatsnew/3.5.html#os). However, for scandir() to be useful, you also need the name. My original version of this directory iterator returned two-tuples of (name, stat_result). But most people didn't like the API, and I don't really either. You could overload stat_result with a .name attribute in this case, but it still isn't a nice API to have most of the attributes None, and then you have to test for that, etc. So basically we tweaked the API to do what was best, and ended up with it returning DirEntry objects with is_file() and similar methods. Hope that helps give a bit more context. If you haven't read the relevant python-ideas and python-dev threads, those are interesting too. -Ben

Hello, On Thu, 26 Jun 2014 21:52:43 -0400 Ben Hoyt <benhoyt@gmail.com> wrote: []
Surely, tailoring Python to MicroPython's needs is completely not what I suggest. It was an example of alternative implementation which optimized os.walk() without need for any additional public module APIs. Vice-versa, high-level nature of API call like os.walk() and underspecification of low-level details (like which function implemented in terms of which others) allow MicroPython provide optimized implementation even with its resource constraints. So, power of high-level interfaces and underspecification should not be underestimated ;-). But I don't want to argue that os.scandir() is "not needed", because that's hardly productive. Something I'd like to prototype in uPy and ideally lead further up to PEP status is to add iterator-based string methods, and I pretty much can expect "we lived without it" response, so don't want to go the same way regarding addition of other iterator-based APIs - it's clear that more iterator/generator based APIs is a good direction for Python to evolve.
[]
Yes, returning (name, stat_result) would be my first motion too, I don't see why someone wouldn't like pair of 2 values, with each value of obvious type and semantics within "os" module. Regarding stat result, os.stat() provides full information about a file, and intuitively, one may expect that os.scandir() would provide subset of that info, asymptotically reaching volume of what os.stat() may provide, depending on OS capabilities. So, if truly OS-independent interface is wanted to salvage more data from a dir scanning, using os.stat struct as data interface is hard to ignore. But well, if it was rejected already, what can be said? Perhaps, at least the PEP could be extended to explicitly mention other approaches which were discussed and rejected, not just link to a discussion archive (from experience with reading other PEPs, they oftentimes contained such subsections, so hope this suggestion is not ungrounded).
-- Best regards, Paul mailto:pmiscml@gmail.com

On Fri, Jun 27, 2014 at 03:07:46AM +0300, Paul Sokolovsky wrote:
os.scandir is not part of the Python API, it is not a built-in function. It is part of the CPython standard library. That means (in my opinion) that there is an expectation that other Pythons should provide it, but not an absolute requirement. Especially for the os module, which by definition is platform-specific. In my opinion that means you have four options: 1. provide os.scandir, with exactly the same semantics as on CPython; 2. provide os.scandir, but change its semantics to be more lightweight (e.g. return an ordinary tuple, as you already suggest); 3. don't provide os.scandir at all; or 4. do something different depending on whether the platform is Linux or an embedded system. I would consider any of those acceptable for a library feature, but not for a language feature. [...]
Correct. If there is sufficient demand for FooBar, then CPython may support it. Until then, FooBarPython can support it, and offer whatever platform-specific features are needed within its standard library.
I think so. And other implementations are free to do the same thing. Of course there is an expectation that the standard library of most implementations will be broadly similar, but not that they will be identical. I am surprised that both Jython and IronPython offer a non-functioning dis module: you can import it successfully, but if there's a way to actually use it, I haven't found it: steve@orac:~$ jython Jython 2.5.1+ (Release_2_5_1, Aug 4 2010, 07:18:19) [OpenJDK Server VM (Sun Microsystems Inc.)] on java1.6.0_27 Type "help", "copyright", "credits" or "license" for more information.
IronPython gives a different exception: steve@orac:~$ ipy IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433 Type "help", "copyright", "credits" or "license" for more information.
It's quite annoying, I would have rather that they just removed the module altogether. Better still would have been to disassemble code objects to whatever byte code the Java and .Net platforms use. But there's surely no requirement to disassemble to CPython byte code! -- Steven

Hello, On Fri, 27 Jun 2014 12:08:41 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
Ok, so standard library also has API, and that's the API being discussed.
Yes, that's intuitive, but not strict and formal, so is subject to interpretations. As a developer working on alternative Python implementation, I'd like to have better understanding of what needs to be done to be a compliant implementation (in particular, because I need to pass that info down to the users). So, I was told that https://docs.python.org/3/reference/index.html describes Python, not CPython. Next step is figuring out whether https://docs.python.org/3/library/index.html describes Python or CPython, and if the latter, how to separate Python's stdlib essence from extended library CPython provides?
Good, thanks. If that represents shared opinion of (C)Python developers (so, there won't be claims like "MicroPython is not Python because it doesn't provide os.scandir()" (or hundred of other missing stdlib functions ;-) )) that's good enough already. With that in mind, I wished that any Python implementation was as complete and as efficient as possible, and one way to achieve that is to not add stdlib entities without real need (be it more API calls or more data types). So, I'm glad to know that os.scandir() passed thru Occam's Razor in this respect and specified the way it is really for common good. [] -- Best regards, Paul mailto:pmiscml@gmail.com

I'm generally +1, with opinions noted below on these two topics. On 6/26/2014 3:59 PM, Ben Hoyt wrote:
+1
Because another common pattern is to check for name matches pattern, I think it would be good to have a feature that provides such. I do that in my own private directory listing extensions, and also some command lines expose it to the user. Where exposed to the user, I use -p windows-pattern and -P regexp. My implementation converts the windows-pattern to a regexp, and then uses common code, but for this particular API, because the windows_wildcard can be optimized by the window API call used, it would make more sense to pass windows_wildcard directly to FindFirst on Windows, but on *nix convert it to a regexp. Both Windows and *nix would call re to process pattern matches except for the case on Windows of having a Windows pattern passed in. The alternate parameter could simply be called wildcard, and would be a regexp. If desired, other flavors of wildcard bsd_wildcard? could also be implemented, but I'm not sure there are any benefits to them, as there are, as far as I am aware, no optimizations for those patterns in those systems.
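A rough sketch of the "convert the pattern to a regexp" approach, using the standard fnmatch.translate(). The function name scandir_matching is hypothetical, and this only implements ordinary shell-style globbing; it does not try to reproduce whatever matching rules FindFirstFile applies on Windows.

```python
import fnmatch
import os
import re
import tempfile


def scandir_matching(path, wildcard):
    """Yield DirEntry objects whose name matches a shell-style wildcard."""
    # fnmatch.translate() turns a pattern such as "*.jpg" into a regex
    # string; compile it once and reuse it for every entry.
    pattern = re.compile(fnmatch.translate(wildcard))
    for entry in os.scandir(path):
        if pattern.match(entry.name):
            yield entry


# Throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as d:
    for name in ("a.jpg", "b.txt", "c.jpg"):
        open(os.path.join(d, name), "w").close()
    matched = sorted(e.name for e in scandir_matching(d, "*.jpg"))
```

Filtering in Python like this still reads every directory entry, so it gives convenience rather than the OS-level optimisation discussed above.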

On 26 June 2014 23:59, Ben Hoyt <benhoyt@gmail.com> wrote:
Would love feedback on the PEP, but also of course on the proposal itself.
A solid +1 from me. Some specific points:

- I'm in favour of it being in the os module. It's more discoverable there, as well as the other reasons mentioned.
- I prefer scandir as the name, for the reason you gave (the output isn't the same as an iterator version of listdir)
- I'm mildly against windows_wildcard (even though I'm a windows user)
- You mention the caching behaviour of DirEntry objects. The limitations should be clearly covered in the final docs, as it's the sort of thing people will get wrong otherwise.

Paul

Hi, You wrote a great PEP Ben, thanks :-) But it's now time for comments!
But the underlying system calls -- ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir? You should add a link to FindFirstFile doc: http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%2... It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we should mimic stat_result recent addition: the new stat_result.file_attributes field. Add DirEntry.file_attributes which would only be available on Windows. The Windows structure also contains FILETIME ftCreationTime; FILETIME ftLastAccessTime; FILETIME ftLastWriteTime; DWORD nFileSizeHigh; DWORD nFileSizeLow; It would be nice to expose them as well. I'm no more surprised that the exact API is different depending on the OS for functions of the os module.
Does your implementation use a free list to avoid the cost of memory allocation? A short free list of 10 or maybe just 1 may help. The free list may be stored directly in the generator object.
Does it support also bytes filenames on UNIX? Python now supports undecodable filenames thanks to the PEP 383 (surrogateescape). I prefer to use the same type for filenames on Linux and Windows, so Unicode is better. But some users might prefer bytes for other reasons.
The ``DirEntry`` attribute and method names were chosen to be the same as those in the new ``pathlib`` module for consistency.
Great! That's exactly what I expected :-) Consistency with other modules.
Crazy idea: would it be possible to "convert" a DirEntry object to a pathlib.Path object without losing the cache? I guess that pathlib.Path expects a full stat_result object.
I don't understand how you can build a full lstat() result without really calling stat. I see that WIN32_FIND_DATA contains the size, but here you call lstat(). If you know that it's not a symlink, you already know the size, but you still have to call stat() to retrieve all fields required to build a stat_result no?
Do you plan to continue to maintain your module for Python < 3.5, but upgrade your module for the final PEP?
Yes, put it in the os module which is already bloated :-)
I think that it would be very convenient to store the directory name in the DirEntry. It should be light, it's just a reference. And provide a fullname() method which would just return os.path.join(path, entry.name) without trying to resolve path to get an absolute path.
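The fullname() suggestion is simple enough to sketch as a free function. The name fullname comes from the message above; writing it as a function taking the directory explicitly is just for illustration. For comparison, the DirEntry that shipped in Python 3.5 exposes the same value as the entry.path attribute.

```python
import os
import tempfile


def fullname(directory, entry):
    """Join the scanned directory and the entry name, without resolving it."""
    return os.path.join(directory, entry.name)


with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "photo.jpg"), "w").close()
    entries = list(os.scandir(d))
    # Compare against the released entry.path attribute.
    same_as_path = all(fullname(d, e) == e.path for e in entries)
    names = [e.name for e in entries]
```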
Would it be hard to implement the wildcard feature on UNIX to compare performances of scandir('*.jpg') with and without the wildcard built in os.scandir? I implemented it in C for the tracemalloc module (Filter object): http://hg.python.org/features/tracemalloc Get the revision 69fd2d766005 and search match_filename_joker() in Modules/_tracemalloc.c. The function matches the filename backward because in most cases, the last letter is enough to reject a filename (ex: "*.jpg" => reject filenames not ending with "g"). The filename is normalized before matching the pattern: converted to lowercase and / is replaced with \ on Windows. It was decided to drop the Filter object to keep the tracemalloc module as simple as possible. Charles-François was not convinced by the speedup. But tracemalloc case is different because the OS didn't provide an API for that. Victor

I guess it'd be better to say "Windows" and "Unix-based OSs" throughout the PEP? Because all of these (including Mac OS X) are Unix-based.
I think you've misunderstood how DirEntry.lstat() works on Windows -- it's basically a no-op, as Windows returns the full stat information with the original FindFirst/FindNext OS calls. This is fairly explicit in the PEP, but I'm sure I could make it clearer: DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows". So you can already get the dwFileAttributes for free by saying entry.lstat().st_file_attributes. You can also get all the other fields you mentioned for free via .lstat() with no additional OS calls on Windows, for example: entry.lstat().st_size. Feel free to suggest changes to the PEP or scandir docs if this isn't clear. Note that is_dir()/is_file()/is_symlink() are free on all systems, but .lstat() is only free on Windows.
No, it doesn't. I might add this to the PEP under "possible improvements". However, I think the speed increase by removing the extra OS call and/or disk seek is going to be way more than memory allocation improvements, so I'm not sure this would be worth it.
Does it support also bytes filenames on UNIX?
I forget exactly now what my scandir module does, but for os.scandir() I think this should behave exactly like os.listdir() does for Unicode/bytes filenames.
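For reference, a minimal sketch of the "behave exactly like os.listdir()" rule as it ended up in the released os.scandir(): a str path gives str names, a bytes path gives bytes names. The file name is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "x.txt"), "w").close()

    # str in, str out
    str_names = [e.name for e in os.scandir(d)]

    # bytes in, bytes out -- the same rule os.listdir() follows
    bytes_names = [e.name for e in os.scandir(os.fsencode(d))]
    listdir_bytes = os.listdir(os.fsencode(d))
```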
The main problem is that pathlib.Path objects explicitly don't cache stat info (and Guido doesn't want them to, for good reason I think). There's a thread on python-dev about this earlier. I'll add it to a "Rejected ideas" section.
See above.
Do you plan to continue to maintain your module for Python < 3.5, but upgrade your module for the final PEP?
Yes, I intend to maintain the standalone scandir module for 2.6 <= Python < 3.5, at least for a good while. For integration into the Python 3.5 stdlib, the implementation will be integrated into posixmodule.c, of course.
Yeah, fair suggestion. I'm still slightly on the fence about this, but I think an explicit fullname() is a good suggestion. Ideally I think it'd be better to mimic pathlib.Path.__str__() which is kind of the equivalent of fullname(). But how does pathlib deal with unicode/bytes issues if it's the str function which has to return a str object? Or at least, it'd be very weird if __str__() returned bytes. But I think it'd need to if you passed bytes into scandir(). Do others have thoughts?
It's a good idea, the problem with this is that the Windows wildcard implementation has a bunch of crazy edge cases where *.ext will catch more things than just a simple regex/glob. This was discussed on python-dev or python-ideas previously, so I'll dig it up and add to a Rejected Ideas section. In any case, this could be added later if there's a way to iron out the Windows quirks. -Ben

On 29 June 2014 05:48, Ben Hoyt <benhoyt@gmail.com> wrote:
*nix and POSIX-based are the two conventions I use.
The key problem with caches on pathlib.Path objects is that you could end up with two separate path objects that referred to the same filesystem location but returned different answers about the filesystem state because their caches might be stale. DirEntry is different, as the content is generally *assumed* to be stale (referring to when the directory was scanned, rather than the current filesystem state). DirEntry.lstat() on POSIX systems will be an exception to that general rule (referring to the time of first lookup, rather than when the directory was scanned, so the answer from lstat() may be inconsistent with other data stored directly on the DirEntry object), but one we can probably live with. More generally, as part of the pathlib PEP review, we figured out that a *per-object* cache of filesystem state would be an inherently bad idea, but a string based *process global* cache might make sense for modules like walkdir (not part of the stdlib - it's an iterator pipeline based approach to file tree scanning I wrote a while back, that currently suffers badly from the performance impact of repeated stat calls at different stages of the pipeline). We realised this was getting into a space where application and library specific concerns are likely to start affecting the caching design, though, so the current status of standard library level stat caching is "it's not clear if there's an available approach that would be sufficiently general purpose to be appropriate for inclusion in the standard library". Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 28 Jun 2014, at 21:48, Ben Hoyt wrote:
However, it would be bad to have two implementations of the concept of "filename" with different attribute and method names. The best way to ensure compatible APIs would be if one class was derived from the other.
[...]
Servus, Walter

On 27.06.2014 00:59, Ben Hoyt wrote:
I find this behaviour a bit misleading: using methods and have them return cached results. How much (implementation and/or performance and/or memory) overhead would incur by using property-like access here? I think this would underline the static nature of the data. This would break the semantics with respect to pathlib, but they’re only marginally equal anyways -- and as far as I understand it, pathlib won’t cache, so I think this has a fair point here. regards, jwi

On 28 Jun 2014 01:27, "Jonas Wielicki" <j.wielicki@sotecware.net> wrote:
Indeed - using properties rather than methods may help emphasise the deliberate *difference* from pathlib in this case (i.e. value when the result was retrieved from the OS, rather than the value right now). The main benefit is that switching from using the DirEntry object to a pathlib Path will require touching all the places where the performance characteristics switch from "memory access" to "system call". This benefit is also the main downside, so I'd actually be OK with either decision on this one. Other comments:

* +1 on the general idea
* +1 on scandir() over iterdir, since it *isn't* just an iterator version of listdir
* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration of the API (that could be done later as a separate RFE instead)
* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
* regarding "why not a 2-tuple", we know from experience that operating systems evolve and we end up wanting to add additional info to this kind of API. A dedicated DirEntry type lets us adjust the information returned over time, without breaking backwards compatibility and without resorting to ugly hacks like those in some of the time and stat APIs (or even our own codec info APIs)
* it would be nice to see some relative performance numbers for NFS and CIFS network shares - the additional network round trips can make excessive stat calls absolutely brutal from a speed perspective when using a network drive (that's why the stat caching added to the import system in 3.3 dramatically sped up the case of having network drives on sys.path, and why I thought AJ had a point when he was complaining about the fact we didn't expose the dirent data from os.listdir)

Regards, Nick.

On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Agreed. Globbing or filtering support should not hold this up. If that part isn't settled, just don't include it and work out what it should be as a future enhancement.
* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
+1. IMNSHO, one of the most important parts of PEPs: capturing the entire decision process to document the "why nots".
fwiw, I wouldn't wait for benchmark numbers. A needless stat call when you've got the information from an earlier API call is already brutal. It is easy to compute from existing ballparks remote file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms. fetch of stat info cached in memory on file server on the local network: ~500us. You can go down further to local system call overhead which can vary wildly but should likely be assumed to be at least 10us. You don't need a benchmark to tell you that adding needless >= 500us-100ms blocking operations to your program is bad. :) -gps
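The back-of-the-envelope calculation above can be written out as follows. The per-call latencies are the ballpark figures quoted in the message; the tree size of 100,000 entries is a made-up example, and the model simply assumes one avoidable stat call per entry.

```python
# Ballpark per-call latencies from the message above, in seconds.
LATENCY = {
    "remote file server / cloud": 100e-3,
    "local spinning disk": 10e-3,
    "LAN file server, stat cached in memory": 500e-6,
    "local system call overhead": 10e-6,
}

n_entries = 100_000  # hypothetical tree size

# Extra wall-clock time spent if every entry costs one needless stat call.
extra_seconds = {name: n_entries * t for name, t in LATENCY.items()}

for name, seconds in extra_seconds.items():
    print(f"{name}: about {seconds:,.0f} s of avoidable waiting")
```

Under these assumptions the avoidable cost ranges from about a second for purely local system calls to hours for a high-latency remote store.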

On 28 June 2014 16:17, Gregory P. Smith <greg@krypto.org> wrote:
Agreed, but walking even a moderately large tree over the network can really hammer home the point that this offers a significant performance enhancement as the latency of access increases. I've found that kind of comparison can be eye-opening for folks that are used to only operating on local disks (even spinning disks, let alone SSDs) and/or relatively small trees (distro build trees aren't *that* big, but they're big enough for this kind of difference in access overhead to start getting annoying). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 28 June 2014 19:17, Nick Coghlan <ncoghlan@gmail.com> wrote:
Oops, forgot to add - I agree this isn't a blocking issue for the PEP, it's definitely only in "nice to have" territory. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Re is_dir etc being properties rather than methods:
The problem with this is that properties "look free", they look just like attribute access, so you wouldn't normally handle exceptions when accessing them. But .lstat() and .is_dir() etc may do an OS call, so if you're needing to be careful with error handling, you may want to handle errors on them. Hence I think it's best practice to make them functions(). Some of us discussed this on python-dev or python-ideas a while back, and I think there was general agreement with what I've stated above and therefore they should be methods. But I'll dig up the links and add to a Rejected ideas section.
* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
Great idea. I'll add a bunch of stuff, including the above, to a new section, Rejected Design Options.
Fully agreed.
Don't know if you saw, but there are actually some benchmarks, including one over NFS, on the scandir GitHub page: https://github.com/benhoyt/scandir#benchmarks os.walk() was 23 times faster with scandir() than the current listdir() + stat() implementation on the Windows NFS file system I tried. Pretty good speedup! -Ben

On 29 June 2014 05:55, Ben Hoyt <benhoyt@gmail.com> wrote:
Yes, only the stuff that *never* needs a system call (regardless of OS) would be a candidate for handling as a property rather than a method call. Consistency of access would likely trump that idea anyway, but it would still be worth ensuring that the PEP is clear on which values are guaranteed to reflect the state at the time of the directory scanning and which may imply an additional stat call.
No, I hadn't seen those - may be worth referencing explicitly from the PEP (and if there's already a reference... oops!)
Ah, nice! Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Jun 28, 2014 at 03:55:00PM -0400, Ben Hoyt wrote:
I think this one could go either way. Methods look like they actually re-test the value each time you call it. I can easily see people not realising that the value is cached and writing code like this toy example:

    # Detect a file change.
    t = the_file.lstat().st_mtime
    while the_file.lstat().st_mtime == t:
        sleep(0.1)
    print("Changed!")

I know that's not the best way to detect file changes, but I'm sure people will do something like that and not realise that the call to lstat is cached. Personally, I would prefer a property. If I forget to wrap a call in a try...except, it will fail hard and I will get an exception. But with a method call, the failure is silent and I keep getting the cached result. Speaking of caching, is there a way to freshen the cached values? -- Steven

On 29 June 2014 20:52, Steven D'Aprano <steve@pearwood.info> wrote:
Speaking of caching, is there a way to freshen the cached values?
Switch to a full Path object instead of relying on the cached DirEntry data. This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is *really* hard to make intuitive, especially when it *sometimes* returns data that looks fresh (as it does on first call on POSIX systems). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
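A small sketch of the caching behaviour being discussed, written against the DirEntry that shipped in Python 3.5 (where lstat() became stat()). It shows the cached value going stale and a plain os.stat() call on the path giving the fresh answer. The file name and contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "log.txt")
    with open(path, "wb") as f:
        f.write(b"abc")

    entry = list(os.scandir(d))[0]
    size_first_call = entry.stat().st_size   # fetched (and cached) here

    # The file grows after the DirEntry was created and first queried.
    with open(path, "ab") as f:
        f.write(b"def")

    cached_size = entry.stat().st_size        # still the cached value
    fresh_size = os.stat(entry.path).st_size  # a new system call
```

So the toy polling loop shown earlier in the thread would never terminate if written against a DirEntry; it has to go back to os.stat() (or a Path) for each check.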

On 29.06.2014 13:08, Nick Coghlan wrote:
This bugs me too. An idea I had was adding a keyword argument to scandir which specifies whether stat data should be added to the direntry or not. If the flag is set to True, this would implicitly call lstat on POSIX before returning the DirEntry, and use the available data on Windows. If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows. This is not optimal in cases where the stat information is needed only for some of the DirEntry objects, but would also reduce the required logic in the DirEntry object. Thoughts?
Regards, Nick.

On 06/29/2014 04:12 AM, Jonas Wielicki wrote:
If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows.
-1 This consistency is unnecessary. -- ~Ethan~

On 29.06.2014 19:04, Ethan Furman wrote:
I’m not sure -- similar to the windows_wildcard option this might be a temptation to write platform dependent code, although possibly by accident (i.e. not reading the docs carefully).

On 29 June 2014 12:08, Nick Coghlan <ncoghlan@gmail.com> wrote:
If it matters that much we *could* simply call it cached_lstat(). It's ugly, but I really don't like the idea of throwing the information away - after all, the fact that we currently throw data away is why there's even a need for scandir. Let's not make the same mistake again... Paul

On 29 June 2014 21:45, Paul Moore <p.f.moore@gmail.com> wrote:
Future-proofing is the reason DirEntry is a full fledged class in the first place, though. Effectively communicating the behavioural difference between DirEntry and pathlib.Path is the main thing that makes me nervous about adhering too closely to the Path API. To restate the problem and the alternative proposal, these are the DirEntry methods under discussion:

    is_dir(): like os.path.isdir(), but requires no system calls on at least POSIX and Windows
    is_file(): like os.path.isfile(), but requires no system calls on at least POSIX and Windows
    is_symlink(): like os.path.islink(), but requires no system calls on at least POSIX and Windows
    lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make them properties (or just ordinary attributes):

    is_dir
    is_file
    is_symlink

What to do with lstat() is currently less clear, since POSIX directory scanning doesn't provide that level of detail by default. The PEP also doesn't currently state whether the is_dir(), is_file() and is_symlink() results would be updated if a call to lstat() produced different answers than the original directory scanning process, which further suggests to me that allowing the stat call to be delayed on POSIX systems is a potentially problematic and inherently confusing design. We would have two options:

- update them, meaning calling lstat() may change those results from being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the DirEntry.lstat() result may give different answers

Those both sound ugly to me. So, here's my alternative proposal: add an "ensure_lstat" flag to scandir() itself, and don't have *any* methods on DirEntry, only attributes.
That would make the DirEntry attributes: is_dir: boolean, always populated is_file: boolean, always populated is_symlink boolean, always populated lstat_result: stat result, may be None on POSIX systems if ensure_lstat is False (I'm not particularly sold on "lstat_result" as the name, but "lstat" reads as a verb to me, so doesn't sound right as an attribute name) What this would allow: - by default, scanning is efficient everywhere, but lstat_result may be None on POSIX systems - if you always need the lstat result, setting "ensure_lstat" will trigger the extra system call implicitly - if you only sometimes need the stat result, you can call os.lstat() explicitly when the DirEntry lstat attribute is None Most importantly, *regardless of platform*, the cached stat result (if not None) would reflect the state of the entry at the time the directory was scanned, rather than at some arbitrary later point in time when lstat() was first called on the DirEntry object. There'd still be a slight window of discrepancy (since the filesystem state may change between reading the directory entry and making the lstat() call), but this could be effectively eliminated from the perspective of the Python code by making the result of the lstat() call authoritative for the whole DirEntry object. Regards, Nick. P.S. We'd be generating quite a few of these, so we can use __slots__ to keep the memory overhead to a minimum (that's just a general comment - it's really irrelevant to the methods-or-attributes question). -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
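A minimal sketch of how the attribute-only proposal might behave, emulated on top of the os.scandir() that eventually shipped (the context-manager form needs Python 3.6 or later). The names SnapshotEntry and scandir_snapshot are hypothetical and exist only for illustration, and the choice not to follow symlinks for is_dir/is_file is an assumption made so that the flags agree with the lstat result.

```python
import os
import stat


class SnapshotEntry:
    """Hypothetical attribute-only entry: every field is fixed at scan time."""

    __slots__ = ("name", "is_dir", "is_file", "is_symlink", "lstat_result")

    def __init__(self, name, is_dir, is_file, is_symlink, lstat_result):
        self.name = name
        self.is_dir = is_dir
        self.is_file = is_file
        self.is_symlink = is_symlink
        self.lstat_result = lstat_result


def scandir_snapshot(path=".", ensure_lstat=False):
    """Yield SnapshotEntry objects for path (illustrative helper, not a real API)."""
    with os.scandir(path) as entries:
        for entry in entries:
            if ensure_lstat:
                # Make the lstat result authoritative for the whole entry.
                st = entry.stat(follow_symlinks=False)
                mode = st.st_mode
                yield SnapshotEntry(entry.name, stat.S_ISDIR(mode),
                                    stat.S_ISREG(mode), stat.S_ISLNK(mode), st)
            else:
                # Cheap path: only the type flags, no stat data.
                yield SnapshotEntry(entry.name,
                                    entry.is_dir(follow_symlinks=False),
                                    entry.is_file(follow_symlinks=False),
                                    entry.is_symlink(), None)
```

A caller that only sometimes needs stat data would scan with the default ensure_lstat=False and call os.lstat() itself for the few entries whose lstat_result is None.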

On 6/29/2014 5:28 AM, Nick Coghlan wrote:
+1 to this in particular, but this whole refresh of the semantics sounds better overall. Finally, for the case where someone does want to keep the DirEntry around, a .refresh() API could rerun lstat() and update all the data. And with that (initial data potentially always populated, or None, and an explicit refresh() API), the data could all be returned as properties, implying that they aren't fetching new data themselves, because they wouldn't be. Glenn

Yeah, I quite like this. It does make the caching more explicit and consistent. It's slightly annoying that it's less like pathlib.Path now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't matter. The differences in naming may highlight the difference in caching, so maybe it's a good thing.

Two further questions from me:

1) How does error handling work? Now os.stat() will/may be called during iteration, so in __next__. But it's hard to catch errors because you don't call __next__ explicitly. Is this a problem? How do other iterators that make system calls or raise errors handle this?

2) There's still the open question in the PEP of whether to include a way to access the full path. This is cheap to build, it has to be built anyway on POSIX systems, and it's quite useful for further operations on the file. I think the best way to handle this is a .fullname or .full_name attribute as suggested elsewhere.

Thoughts?

-Ben

On 1 July 2014 03:05, Ben Hoyt <benhoyt@gmail.com> wrote:
I'm torn between whether I'd prefer the stat fields to be populated on Windows if ensure_lstat=False or not. There are good arguments each way, but overall I'm inclining towards having it consistent with POSIX - don't populate them unless ensure_lstat=True. +0 for stat fields to be None on all platforms unless ensure_lstat=True.
See my comments below on .fullname.
I think it just needs to be documented that iterating may throw the same exceptions as os.lstat(). It's a little trickier if you don't want the scope of your exception to be too broad, but you can always wrap the iteration in a generator to catch and handle the exceptions you care about, and allow the rest to propagate.

    import os

    def scandir_accessible(path='.'):
        gen = os.scandir(path)
        while True:
            try:
                yield next(gen)
            except StopIteration:
                # End of the directory: finish the generator cleanly.
                return
            except PermissionError:
                # Skip entries that cannot be examined.
                pass

2) There's still the open question in the PEP of whether to include a
+1 for .fullname. The earlier suggestion to have __str__ return the name is killed I think by the fact that .fullname could be bytes. It would be nice if pathlib.Path objects were enhanced to take a DirEntry and use the .fullname automatically, but you could always call Path(direntry.fullname). Tim Delaney
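A short sketch of the Path(direntry.fullname) idea. It assumes the full-path attribute is spelled .path, which is the name used by the os.scandir() released in Python 3.5; .fullname as discussed in this thread is not a real attribute. The helper name subdirs_as_paths is hypothetical.

```python
import os
from pathlib import Path


def subdirs_as_paths(top="."):
    """Return pathlib.Path objects for the immediate subdirectories of top."""
    with os.scandir(top) as entries:
        # entry.path is top joined with entry.name, so no extra work is needed.
        return [Path(entry.path) for entry in entries if entry.is_dir()]
```

This only works when top is a str: scanning a bytes path yields bytes names and paths, which Path() rejects, so such callers would need os.fsdecode() first. That is the same str/bytes concern raised above about __str__.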

On 06/30/2014 03:07 PM, Tim Delaney wrote:
If a Windows user just needs the free info, why should s/he have to pay the price of a full stat call? I see no reason to hold the Windows side back and not take advantage of what it has available. There are plenty of posix calls that Windows is not able to use, after all. -- ~Ethan~

On 1 July 2014 08:38, Ethan Furman <ethan@stoneleaf.us> wrote:
On Windows ensure_lstat would either be a NOP (if the fields are always populated), or it simply determines if the fields get populated. No extra stat call. On POSIX it's the difference between an extra stat call or not. Tim Delaney

On 06/30/2014 04:15 PM, Tim Delaney wrote:
I suppose the exact behavior is still under discussion, as there are only two or three fields one gets "for free" on Windows (I think...), whereas an os.stat call would get everything available for the platform.
On POSIX it's the difference between an extra stat call or not.
Agreed on this part. Still, no reason to slow down the Windows side by throwing away info unnecessarily -- that's why this PEP exists, after all. -- ~Ethan~

On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
This won't work well if lstat info is only needed for some entries. Is that a common use-case? It was mentioned earlier in the thread. -- Devin

The proposal I was replying to was that:

- There is no .refresh()
- ensure_lstat=False means no OS has populated attributes
- ensure_lstat=True means every OS has populated attributes

Even if we add a .refresh(), the latter two items mean that you can't avoid doing extra work (either too much on Windows, or too much on Linux), if you want only a subset of the files' lstat info.

-- Devin

P.S. your mail client's quoting breaks my mail client (gmail)'s quoting.

On Mon, Jun 30, 2014 at 7:04 PM, Glenn Linderman <v+python@g.nevcal.com> wrote:

On 30 Jun 2014 19:13, "Glenn Linderman" <v+python@g.nevcal.com> wrote:
If it is, use ensure_lstat=False, and use the proposed (by me) .refresh() API to update the data for those that need it.
I'm -1 on a refresh API for DirEntry - just use pathlib in that case. Cheers, Nick.

On 6/30/2014 10:17 PM, Nick Coghlan wrote:
I'm not sure refresh() is the best name, but I think a "get_stat_info_from_direntry_or_call_stat()" (hah!) makes sense. If you really need the stat info, then you can write simple code like:

    for entry in os.scandir(path):
        mtime = entry.get_stat_info_from_direntry_or_call_stat().st_mtime

And it won't call stat() any more times than needed. Once per file on Posix, zero times per file on Windows. Without an API like this, you'll need a check in the application code on whether or not to call stat(). Eric.
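One way the "use the scan data if present, otherwise call stat once" idea could look, as a rough sketch. LazyStatEntry and get_lstat are hypothetical names chosen for illustration; nothing here is the PEP's API.

```python
import os


class LazyStatEntry:
    """Hypothetical entry that calls os.lstat() at most once, and only on demand."""

    __slots__ = ("name", "path", "_lstat")

    def __init__(self, name, path, lstat_result=None):
        self.name = name
        self.path = path
        # Pre-populated when the directory scan already supplied stat data
        # (the Windows case); None when it did not (the POSIX case).
        self._lstat = lstat_result

    def get_lstat(self):
        """Return the cached stat result, fetching it on first use if needed."""
        if self._lstat is None:
            self._lstat = os.lstat(self.path)
        return self._lstat
```

Application code can then call entry.get_lstat().st_mtime unconditionally, without a platform check, and pays for a system call only when the scan did not already provide the data.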

2014-07-01 4:04 GMT+02:00 Glenn Linderman <v+python@g.nevcal.com>:
We should make DirEntry as simple as possible. In Python, the classic behaviour is to not define an attribute if it's not available on a platform. For example, stat().st_file_attributes is only available on Windows. I don't like the idea of the ensure_lstat parameter because os.scandir would have to make two system calls, and that makes it harder to guess which syscall failed (readdir or lstat). If you need lstat on UNIX, write:

    if hasattr(entry, 'lstat_result'):
        size = entry.lstat_result.st_size
    else:
        size = os.lstat(entry.fullname()).st_size

Victor

Ben Hoyt <benhoyt@gmail.com> writes:
Have you considered adding support for paths relative to directory descriptors [1] via keyword only dir_fd=None parameter if it may lead to more efficient implementations on some platforms? [1]: https://docs.python.org/3.4/library/os.html#dir-fd -- akira

On Sat, Jun 28, 2014 at 11:05 PM, Akira Li <4kir4.1i@gmail.com> wrote:
Potentially more efficient and also potentially safer (see 'man openat')... but an enhancement that can wait, if necessary. ChrisA

Chris Angelico <rosuav@gmail.com> writes:
Introducing the feature later creates unnecessary incompatibilities. Either it should be explicitly rejected in the PEP 471 and something-like `os.scandir(os.open(relative_path, dir_fd=fd))` recommended instead (assuming `os.scandir in os.supports_fd` like `os.listdir()`). At C level it could be implemented using fdopendir/openat or scandirat.

Here's the function description using Argument Clinic DSL:

    /*[clinic input]
    os.scandir

        path : path_t(allow_fd=True, nullable=True) = '.'
            *path* can be specified as either str or bytes. On some
            platforms, *path* may also be specified as an open file
            descriptor; the file descriptor must refer to a directory.
            If this functionality is unavailable, using it raises
            NotImplementedError.
        *
        dir_fd : dir_fd = None
            If not None, it should be a file descriptor open to a
            directory, and *path* should be a relative string; path will
            then be relative to that directory. if *dir_fd* is
            unavailable, using it raises NotImplementedError.

    Yield a DirEntry object for each file and directory in *path*.

    Just like os.listdir, the '.' and '..' pseudo-directories are
    skipped, and the entries are yielded in system-dependent order.

    {parameters}
    It's an error to use *dir_fd* when specifying *path* as an open
    file descriptor.

    [clinic start generated code]*/

And corresponding tests (from test_posix:PosixTester), to show the compatibility with os.listdir argument parsing in detail:

    def test_scandir_default(self):
        # When scandir is called without argument,
        # it's the same as scandir(os.curdir).
        self.assertIn(support.TESTFN, [e.name for e in posix.scandir()])

    def _test_scandir(self, curdir):
        filenames = sorted(e.name for e in posix.scandir(curdir))
        self.assertIn(support.TESTFN, filenames)
        #NOTE: assume listdir, scandir accept the same types on the platform
        self.assertEqual(sorted(posix.listdir(curdir)), filenames)

    def test_scandir(self):
        self._test_scandir(os.curdir)

    def test_scandir_none(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(None)

    def test_scandir_bytes(self):
        # When scandir is called with a bytes object,
        # the returned entries names are still of type str.
        # Call `os.fsencode(entry.name)` to get bytes
        self.assertIn('a', {'a'})
        self.assertNotIn(b'a', {'a'})
        self._test_scandir(b'.')

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd_minus_one(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(-1)

    def test_scandir_float(self):
        # invalid args
        self.assertRaises(TypeError, posix.scandir, -1.0)

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd(self):
        fd = posix.open(posix.getcwd(), posix.O_RDONLY)
        self.addCleanup(posix.close, fd)
        self._test_scandir(fd)
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))
        # call 2nd time to test rewind
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))

    @unittest.skipUnless(posix.scandir in os.supports_dir_fd,
                         "test needs dir_fd support for os.scandir()")
    def test_scandir_dir_fd(self):
        relpath = 'relative_path'
        with support.temp_dir() as parent:
            fullpath = os.path.join(parent, relpath)
            with support.temp_dir(path=fullpath):
                support.create_empty_file(os.path.join(parent, 'a'))
                support.create_empty_file(os.path.join(fullpath, 'b'))
                fd = posix.open(parent, posix.O_RDONLY)
                self.addCleanup(posix.close, fd)
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))
                # check that fd is still useful
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))

-- Akira
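The os.scandir(os.open(relative_path, dir_fd=fd)) fallback mentioned above can be sketched with today's standard library, assuming Python 3.7 or later on a Unix platform, where os.scandir() accepts an open directory descriptor. The helper name scandir_names_at is hypothetical, and an explicit os.O_RDONLY flag is passed because os.open() requires a flags argument.

```python
import os


def scandir_names_at(dir_fd, relative_path):
    """Return sorted entry names of relative_path, resolved relative to dir_fd."""
    if os.scandir not in os.supports_fd or os.open not in os.supports_dir_fd:
        raise NotImplementedError("fd-based scanning is unavailable here")
    # Open the target directory relative to the already-open parent descriptor.
    fd = os.open(relative_path, os.O_RDONLY, dir_fd=dir_fd)
    try:
        with os.scandir(fd) as entries:
            return sorted(entry.name for entry in entries)
    finally:
        # scandir() does not close the descriptor it was given.
        os.close(fd)
```

The caller keeps ownership of dir_fd, so the same parent descriptor can be reused for several scans, which is the property the "check that fd is still useful" test above is after.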

On 6/26/2014 6:59 PM, Ben Hoyt wrote:
One of the major reasons for this seems to be efficiently using information that is already available from the OS "for free". Unfortunately it seems that the current API and most of the leading alternate proposals hide from the user what information is actually there "free" and what is going to incur an extra cost. I would prefer an API that simply gives whatever came for free from the OS and then let the user decide if the extra expense is worth the extra information. Maybe that stat information was only going to be used for an informational log that can be skipped if it's going to incur extra expense? Janzert
participants (22)
- Akira Li
- Ben Hoyt
- Benjamin Peterson
- Chris Angelico
- Chris Barker - NOAA Federal
- Devin Jeanpierre
- Eric V. Smith
- Ethan Furman
- Glenn Linderman
- Gregory P. Smith
- Janzert
- Jonas Wielicki
- MRAB
- Nick Coghlan
- Paul Moore
- Paul Sokolovsky
- Ryan
- Steven D'Aprano
- Terry Reedy
- Tim Delaney
- Victor Stinner
- Walter Dörwald