[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Nick Coghlan ncoghlan at gmail.com
Fri Jun 27 23:58:50 CEST 2014

On 28 Jun 2014 01:27, "Jonas Wielicki" <j.wielicki at sotecware.net> wrote:
> On 27.06.2014 00:59, Ben Hoyt wrote:
> > Specifics of proposal
> > =====================
> > [snip] Each ``DirEntry`` object has the following
> > attributes and methods:
> > [snip]
> > Notes on caching
> > ----------------
> >
> > The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
> > is obviously always cached, and the ``is_X`` and ``lstat`` methods
> > cache their values (immediately on Windows via ``FindNextFile``, and
> > on first use on Linux / OS X via a ``stat`` call) and never refetch
> > from the system.
> I find this behaviour a bit misleading: these are methods, yet they
> return cached results. How much (implementation and/or performance
> and/or memory) overhead would be incurred by using property-like access here?
> I think this would underline the static nature of the data.
> This would break the semantics with respect to pathlib, but they’re only
> marginally equal anyways -- and as far as I understand it, pathlib won’t
> cache, so I think this has a fair point here.

Indeed - using properties rather than methods may help emphasise the
deliberate *difference* from pathlib in this case (i.e. the value at the
time the result was retrieved from the OS, rather than the value right now). The
main benefit is that switching from using the DirEntry object to a pathlib
Path will require touching all the places where the performance
characteristics switch from "memory access" to "system call". This benefit
is also the main downside, so I'd actually be OK with either decision on
this one.
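
To make the "memory access" versus "system call" distinction concrete,
here is a minimal sketch. It assumes the os.scandir() that is available
in Python 3.5 and later, where these lookups are methods rather than
properties; the file name and sizes are made up for illustration.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.bin"      # hypothetical file name
    target.write_bytes(b"abc")           # 3 bytes on disk

    # Exhausting the iterator also closes it; the entry stays usable.
    entry = list(os.scandir(tmp))[0]

    # First use: may hit the OS (e.g. on Linux), then the result is cached.
    size_at_first_use = entry.stat().st_size

    # The file changes after the entry was created.
    target.write_bytes(b"abcdefghij")    # now 10 bytes on disk

    # Same DirEntry: answers from its cache, no refetch from the system.
    cached_size = entry.stat().st_size

    # pathlib: asks the OS again on every call.
    live_size = Path(entry.path).stat().st_size

print(size_at_first_use, cached_size, live_size)
```

The DirEntry keeps reporting the size it saw first, while the Path
reports the current one, which is the difference a property spelling
would be trying to signal.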

Other comments:

* +1 on the general idea
* +1 on scandir() over iterdir, since it *isn't* just an iterator version
of listdir
* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration
of the API (that could be done later as a separate RFE instead)
* +1 on a new section in the PEP covering rejected design options (calling
it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
* regarding "why not a 2-tuple", we know from experience that operating
systems evolve and we end up wanting to add additional info to this kind of
API. A dedicated DirEntry type lets us adjust the information returned over
time, without breaking backwards compatibility and without resorting to
ugly hacks like those in some of the time and stat APIs (or even our own
codec info APIs)
* it would be nice to see some relative performance numbers for NFS and
CIFS network shares - the additional network round trips can make excessive
stat calls absolutely brutal from a speed perspective when using a network
drive (that's why the stat caching added to the import system in 3.3
dramatically sped up the case of having network drives on sys.path, and why
I thought AJ had a point when he was complaining about the fact we didn't
expose the dirent data from os.listdir)
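
The "why not a 2-tuple" point above can be sketched as follows.
Everything here is hypothetical: the function names, the Entry class
and the extra inode field are invented for illustration and are not the
API proposed in the PEP.

```python
# Hypothetical first release: each entry is a (name, is_dir) 2-tuple.
def entries_as_tuples_v1():
    return [("a.txt", False), ("src", True)]

# Hypothetical later release that wants to expose one more field.
def entries_as_tuples_v2():
    return [("a.txt", False, 101), ("src", True, 102)]

# Caller code written against the 2-tuple shape.
def dir_names_from_tuples(entries):
    return [name for name, is_dir in entries if is_dir]

# Hypothetical dedicated type: new fields are added as new attributes.
class Entry:
    def __init__(self, name, is_dir, inode=None):
        self.name = name
        self.is_dir = is_dir
        self.inode = inode  # added later; older callers never look at it

# Caller code written against attribute access.
def dir_names_from_entries(entries):
    return [e.name for e in entries if e.is_dir]

old_result = dir_names_from_tuples(entries_as_tuples_v1())

try:
    dir_names_from_tuples(entries_as_tuples_v2())
    tuple_caller_survived = True
except ValueError:
    # Unpacking three values into two names fails.
    tuple_caller_survived = False

new_result = dir_names_from_entries(
    [Entry("a.txt", False, inode=101), Entry("src", True, inode=102)]
)

print(old_result, tuple_caller_survived, new_result)
```

Callers that unpack the tuple break as soon as a field is added, while
callers that read attributes keep working unchanged.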


> regards,
> jwi