[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Nick Coghlan ncoghlan at gmail.com
Sun Jun 29 07:03:27 CEST 2014


On 29 June 2014 05:55, Ben Hoyt <benhoyt at gmail.com> wrote:
> Re is_dir etc being properties rather than methods:
>
>>> I find this behaviour a bit misleading: using methods and having them
>>> return cached results. How much (implementation and/or performance
>>> and/or memory) overhead would be incurred by using property-like access here?
>>> I think this would underline the static nature of the data.
>>>
>>> This would break the semantics with respect to pathlib, but they're only
>>> marginally equal anyways -- and as far as I understand it, pathlib won't
>>> cache, so I think this has a fair point here.
>>
>> Indeed - using properties rather than methods may help emphasise the
>> deliberate *difference* from pathlib in this case (i.e. value when the
>> result was retrieved from the OS, rather than the value right now). The main
>> benefit is that switching from using the DirEntry object to a pathlib Path
>> will require touching all the places where the performance characteristics
>> switch from "memory access" to "system call". This benefit is also the main
>> downside, so I'd actually be OK with either decision on this one.
>
> The problem with this is that properties "look free", they look just
> like attribute access, so you wouldn't normally handle exceptions when
> accessing them. But .lstat() and .is_dir() etc may do an OS call, so
> if you're needing to be careful with error handling, you may want to
> handle errors on them. Hence I think it's best practice to make them
> functions().
>
> Some of us discussed this on python-dev or python-ideas a while back,
> and I think there was general agreement with what I've stated above
> and therefore they should be methods. But I'll dig up the links and
> add to a Rejected ideas section.

Yes, only the values that *never* need a system call (regardless of
OS) would be candidates for exposure as properties rather than
methods. Consistency of access would likely trump that idea anyway,
but it would still be worth ensuring that the PEP is clear on which
values are guaranteed to reflect the state at the time of the
directory scan and which may imply an additional stat call.
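
To make the distinction concrete, here is a minimal sketch. It assumes
the API as it later shipped in the standard library (os.scandir() in
Python 3.5, usable as a context manager from 3.6), which may differ in
detail from the draft under discussion. The helper name list_subdirs is
hypothetical. The point is that entry.name is a plain attribute and
cannot fail, while entry.is_dir() is a method that may need a system
call on some platforms and so is wrapped in error handling.

```python
import os


def list_subdirs(path):
    """Return the sorted names of the subdirectories of `path`.

    Hypothetical helper for illustration only.
    """
    subdirs = []
    with os.scandir(path) as entries:
        for entry in entries:
            # entry.name comes straight from the directory scan:
            # a plain attribute, never a system call.
            name = entry.name
            try:
                # is_dir() is usually answered from data cached during
                # the scan, but may fall back to a stat call, so it can
                # raise OSError and is written as a method.
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(name)
            except OSError:
                # The entry's type could not be determined; skip it.
                continue
    return sorted(subdirs)
```

Callers that do not care about such errors can drop the try/except, but
the method-call syntax at least signals where one might be needed.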

>> * it would be nice to see some relative performance numbers for NFS and CIFS
>> network shares - the additional network round trips can make excessive stat
>> calls absolutely brutal from a speed perspective when using a network drive
>> (that's why the stat caching added to the import system in 3.3 dramatically
>> sped up the case of having network drives on sys.path, and why I thought AJ
>> had a point when he was complaining about the fact we didn't expose the
>> dirent data from os.listdir)
>
> Don't know if you saw, but there are actually some benchmarks,
> including one over NFS, on the scandir GitHub page:
>
> https://github.com/benhoyt/scandir#benchmarks

No, I hadn't seen those - may be worth referencing explicitly from the
PEP (and if there's already a reference... oops!)

> os.walk() was 23 times faster with scandir() than the current
> listdir() + stat() implementation on the Windows NFS file system I
> tried. Pretty good speedup!

Ah, nice!
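
For readers following along, the two approaches being compared can be
sketched roughly as below. This is a simplified illustration, not the
actual os.walk() implementation: the function names walk_listdir and
walk_scandir are hypothetical, error handling is omitted, and it assumes
os.scandir() as it later shipped in Python 3.5+ (context manager from
3.6). The first version issues a stat-style call per entry to classify
it; the second reuses the type information returned by the directory
scan where the OS provides it.

```python
import os


def walk_listdir(top):
    """Top-down walk using listdir() plus one stat-style call per entry."""
    dirs, files = [], []
    for name in os.listdir(top):
        # os.path.isdir() performs a stat call for every entry.
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            files.append(name)
    yield top, dirs, files
    for name in dirs:
        path = os.path.join(top, name)
        # Do not descend into symlinked directories (another system call).
        if not os.path.islink(path):
            yield from walk_listdir(path)


def walk_scandir(top):
    """Top-down walk using scandir(), reusing data from the directory scan."""
    dirs, files, subpaths = [], [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() and is_symlink() are normally answered from the
            # scan results, without an extra system call per entry.
            if entry.is_dir():
                dirs.append(entry.name)
                if not entry.is_symlink():
                    subpaths.append(entry.path)
            else:
                files.append(entry.name)
    yield top, dirs, files
    for path in subpaths:
        yield from walk_scandir(path)
```

Both generators yield the same (dirpath, dirnames, filenames) tuples;
only the number of system calls needed to produce them differs.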

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

