Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

29 Jun 2014

Re: is_dir etc. being properties rather than methods:
...
...
I find this behaviour a bit misleading: using methods and having them
return cached results. How much (implementation, performance and/or
memory) overhead would be incurred by using property-like access here?
I think this would underline the static nature of the data.
This would break the semantics with respect to pathlib, but they're only
marginally equal anyway -- and as far as I understand it, pathlib won't
cache, so I think this is a fair point.
Indeed - using properties rather than methods may help emphasise the
deliberate *difference* from pathlib in this case (i.e. the value when the
result was retrieved from the OS, rather than the value right now). The main
benefit is that switching from using the DirEntry object to a pathlib Path
will require touching all the places where the performance characteristics
switch from "memory access" to "system call". This benefit is also the main
downside, so I'd actually be OK with either decision on this one.
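To make the "value when retrieved" versus "value right now" distinction concrete, here is a minimal sketch. It assumes os.scandir() and DirEntry as they later shipped in the standard library (Python 3.5+, with-statement support from 3.6), not necessarily the API as proposed at the time of this thread; the file name is hypothetical.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.txt")  # hypothetical file name
    with open(target, "w") as f:
        f.write("data")

    # Collect the entries so the directory handle is closed before deleting.
    with os.scandir(tmp) as it:
        entries = list(it)
    entry = entries[0]

    # First call: answered from the directory read, or from one stat
    # call whose result is then cached on the entry.
    before = entry.is_file()

    os.remove(target)

    cached = entry.is_file()           # still the value captured earlier
    live = Path(entry.path).is_file()  # asks the OS again
```

After the file is removed, the DirEntry keeps reporting what it saw, while the Path reflects the current state of the file system.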
The problem with this is that properties "look free", they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you need to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
methods.
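A rough sketch of what careful error handling around the method calls could look like. It assumes os.scandir() as it later shipped in the standard library; list_subdirs and the directory names are hypothetical, used only for illustration.

```python
import os
import tempfile

def list_subdirs(path):
    """Return sorted names of subdirectories, skipping entries that error."""
    subdirs = []
    with os.scandir(path) as it:
        for entry in it:
            try:
                # A call, not an attribute: it may need a system call
                # (for example on a symlink), and that call can fail.
                if entry.is_dir():
                    subdirs.append(entry.name)
            except OSError:
                # Skip entries whose type cannot be determined.
                continue
    return sorted(subdirs)

# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    open(os.path.join(tmp, "file.txt"), "w").close()
    found = list_subdirs(tmp)
```

The try/except reads naturally around a method call; wrapping a plain attribute access in the same way would look surprising to most readers.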

Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and that they should therefore be methods. But I'll dig up the links and
add them to a Rejected Ideas section.
...
* +1 on a new section in the PEP covering rejected design options (calling
it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
Great idea. I'll add a bunch of stuff, including the above, to a new
section, Rejected Design Options.
...
* regarding "why not a 2-tuple", we know from experience that operating
systems evolve and we end up wanting to add additional info to this kind of
API. A dedicated DirEntry type lets us adjust the information returned over
time, without breaking backwards compatibility and without resorting to ugly
hacks like those in some of the time and stat APIs (or even our own codec
info APIs)
Fully agreed.
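The compatibility argument can be sketched as follows. The two tuple-returning functions are hypothetical alternatives invented for this illustration (they are not part of any proposal's API); the sketch assumes os.scandir() as later shipped.

```python
import os
import tempfile

def scandir_tuples(path):
    # Hypothetical alternative API: yield (name, is_dir) 2-tuples.
    with os.scandir(path) as it:
        for entry in it:
            yield entry.name, entry.is_dir()

def scandir_tuples_v2(path):
    # Hypothetical later version that adds a third field.
    with os.scandir(path) as it:
        for entry in it:
            yield entry.name, entry.is_dir(), entry.is_symlink()

def names_from_tuples(results):
    # Caller code written against the 2-tuple form.
    return sorted(name for name, is_dir in results)

def names_from_entries(path):
    # Caller code written against a dedicated entry type: new
    # attributes on the entry do not affect it.
    with os.scandir(path) as it:
        return sorted(entry.name for entry in it)

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    old_ok = names_from_tuples(scandir_tuples(tmp))
    try:
        names_from_tuples(scandir_tuples_v2(tmp))
        tuple_caller_broke = False
    except ValueError:  # too many values to unpack
        tuple_caller_broke = True
    entry_names = names_from_entries(tmp)
```

Growing the tuple breaks every caller that unpacks it, whereas adding an attribute to an entry object leaves existing callers alone.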
...
* it would be nice to see some relative performance numbers for NFS and CIFS
network shares - the additional network round trips can make excessive stat
calls absolutely brutal from a speed perspective when using a network drive
(that's why the stat caching added to the import system in 3.3 dramatically
sped up the case of having network drives on sys.path, and why I thought AJ
had a point when he was complaining about the fact we didn't expose the
dirent data from os.listdir)
Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:

https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!
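For readers who want to see where the saved stat calls come from, here is a small sketch comparing the two styles of tree walk. It is not the benchmark code from the scandir repository; the function names and the directory layout are made up for illustration, and it assumes os.scandir() as later shipped in the standard library.

```python
import os
import tempfile

def count_files_listdir(top):
    # listdir() returns names only, so each entry costs extra
    # stat-style calls (isdir and islink) to learn its type.
    count = 0
    for name in os.listdir(top):
        full = os.path.join(top, name)
        if os.path.isdir(full) and not os.path.islink(full):
            count += count_files_listdir(full)
        else:
            count += 1
    return count

def count_files_scandir(top):
    # scandir() entries usually carry the type from the directory
    # read itself, so is_dir() often needs no further system call.
    count = 0
    with os.scandir(top) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                count += count_files_scandir(entry.path)
            else:
                count += 1
    return count

# Hypothetical layout: b.txt, a/c.txt, a/d/e.txt
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "d"))
    for rel in ("b.txt", os.path.join("a", "c.txt"),
                os.path.join("a", "d", "e.txt")):
        open(os.path.join(tmp, rel), "w").close()
    n_listdir = count_files_listdir(tmp)
    n_scandir = count_files_scandir(tmp)
```

Both functions return the same count; the difference is only in how many round trips to the file system each entry costs, which is what hurts on network shares.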

-Ben

Ben Hoyt