[Python-Dev] PEP 471 (scandir): Poll to choose the implementation (full C or C+Python)

Guido van Rossum guido at python.org
Fri Feb 13 18:31:44 CET 2015


I vote for the C implementation.

On Fri, Feb 13, 2015 at 2:07 AM, Victor Stinner <victor.stinner at gmail.com>
wrote:

> Hi,
>
> TL;DR: are you OK with adding 800 lines of C code for os.scandir(),
> which is 4x faster than os.listdir() when the file type is checked?
>
> I accepted PEP 471 (os.scandir) a few months ago, but it is not
> implemented yet in Python 3.5, because I haven't chosen an
> implementation.
>
> Ben Hoyt wrote different implementations:
> - full C: os.scandir() and DirEntry are written in C (no change to os.py)
> - C+Python: os._scandir() (wrapper for opendir/readdir and
> FindFirstFileW/FindNextFileW) in C, DirEntry in Python
> - ctypes: os.scandir() and DirEntry fully implemented in Python
>
> I'm not interested in the ctypes implementation. It's useful for a
> third-party project hosted on PyPI, but for CPython I prefer to wrap C
> functions using C code.
>
>
> In short, the C implementation is faster than the C+Python implementation.
>
> Issue #22524 (*) is full of benchmark numbers. IMO the most
> interesting benchmark compares os.listdir() + os.stat() with
> os.scandir() + DirEntry.is_dir(). Let me try to summarize the results
> of this benchmark:
>
> * C implementation: scandir is at least 3.5x faster than listdir, up
> to 44.6x faster on Windows
> * C+Python implementation: scandir is not really faster than listdir,
> between 1.3x and 1.4x faster
>
> (*) http://bugs.python.org/issue22524
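
The comparison described above can be sketched roughly as follows. This is
only an illustration, not the benchmark script from issue #22524: the helper
names (make_sample_tree, count_dirs_listdir, count_dirs_scandir) and the tree
size are assumptions made up for this example, and the timings depend heavily
on the OS, file system and hardware.

```python
import os
import stat
import tempfile
import timeit


def make_sample_tree(root, n_dirs, n_files):
    """Create n_dirs empty subdirectories and n_files empty files in root.

    Hypothetical helper, only here to give the benchmark something to scan.
    """
    for i in range(n_dirs):
        os.mkdir(os.path.join(root, "dir%d" % i))
    for i in range(n_files):
        open(os.path.join(root, "file%d" % i), "w").close()


def count_dirs_listdir(path):
    """Count subdirectories with os.listdir() plus one os.stat() per entry."""
    count = 0
    for name in os.listdir(path):
        st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            count += 1
    return count


def count_dirs_scandir(path):
    """Count subdirectories with os.scandir() and DirEntry.is_dir().

    is_dir() can usually answer from the data returned by the directory
    listing itself, without an extra stat call per entry.
    """
    count = 0
    for entry in os.scandir(path):
        if entry.is_dir():
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_sample_tree(root, n_dirs=100, n_files=900)
        for func in (count_dirs_listdir, count_dirs_scandir):
            seconds = timeit.timeit(lambda: func(root), number=50)
            print("%s: %.4f s" % (func.__name__, seconds))
```

Both functions return the same count; only the number of system calls per
entry differs, which is where the reported speedup comes from.
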
>
>
> Ben Hoyt reminded me that os.scandir() (PEP 471) doesn't add any new
> feature: pathlib already provides a nice API on top of the os and
> os.path modules. (You may even notice that DirEntry has far fewer
> methods ;-)) The main (only?) purpose of the PEP is performance.
>
> If os.scandir() is "only" 1.4x faster, I don't think that it is
> worth using os.scandir() in an application. I guess that all
> applications/libraries will want to keep compatibility with Python 3.4
> and older, and so will have to duplicate the code anyway to use
> os.listdir() + os.stat(). So is it worth duplicating code for such a
> small speedup?
>
> Now I see 3 choices:
>
> - take the full C implementation, because it's much faster (at least
> 3.4x faster!)
> - reject the whole PEP 471 (not nice), because it adds too much code
> for a minor speedup (not true on Windows: up to 44x faster!)
> - take the C+Python implementation, because maintenance matters more
> than performance (only 1.3x faster, sorry)
>
> => IMO the best option is to take the C implementation. What do you think?
>
>
> I'm concerned about the length of the C code: the full C implementation
> adds ~800 lines of C code to posixmodule.c. This file is already the
> longest C file in CPython. I don't want to make it longer, but I'm not
> motivated to start splitting it. Last time I proposed to split a file
> (unicodeobject.c), some developers complained that it makes searching
> harder. I don't understand this; there are so many tools to navigate
> C code. But it was enough for me to give up on this idea.
>
> An alternative is to add a new _scandir.c module to host the new C
> code, and share some code with posixmodule.c: remove the "static"
> keyword from the required C functions (functions to convert Windows
> attributes to an os.stat_result object). That's a reasonable choice.
> What do you think?
>
>
> FYI I ran the benchmark on different hardware (SSD, HDD, tmpfs), file
> systems (ext4, tmpfs, NFS/ext4), operating systems (Linux, Windows).
>
> Victor



-- 
--Guido van Rossum (python.org/~guido)