On Thu, 26 Jun 2014 18:59:45 -0400 Ben Hoyt firstname.lastname@example.org wrote:
Hi Python dev folks,
I've written a PEP proposing a specific os.scandir() API for a directory iterator that returns the stat-like info from the OS, the main advantage of which is to speed up os.walk() and similar operations by 4-20x, depending on your OS and file system. Full details, background info, and context links are in the PEP, which Victor Stinner has uploaded at the following URL, and I've also copied inline below.
I noticed the obvious inefficiency of os.walk() being implemented in terms of os.listdir() when I worked on the "os" module for MicroPython. I essentially did what your PEP suggests: I introduced an internal generator function (ilistdir_ex() in https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py... ), in terms of which both os.listdir() and os.walk() are implemented.
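To illustrate the layering, here is a minimal sketch. It is not the MicroPython code: the names iter_entries(), listdir_sketch() and walk_sketch() are hypothetical, and it assumes Python 3.6+ where os.scandir() is used purely as a stand-in for a raw readdir() call that reports the entry type.

```python
import os


def iter_entries(path):
    """Hypothetical internal generator: yield (name, is_dir) per entry.

    os.scandir() stands in for a raw readdir() call here (an assumption
    made for this sketch); the entry type comes from the directory scan
    and normally needs no extra stat() per entry.
    """
    with os.scandir(path) as it:
        for entry in it:
            yield entry.name, entry.is_dir(follow_symlinks=False)


def listdir_sketch(path):
    # listdir() only needs the names, so the type info is discarded.
    return [name for name, _ in iter_entries(path)]


def walk_sketch(top):
    # walk() sorts entries into dirs and files using the type info
    # already returned by the scan.
    dirs, files = [], []
    for name, is_dir in iter_entries(top):
        (dirs if is_dir else files).append(name)
    yield top, dirs, files
    for d in dirs:
        yield from walk_sketch(os.path.join(top, d))
```

Both public functions keep their existing signatures; only the internal generator knows about entry types.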
With my MicroPython hat on, os.scandir() would only make things worse. With the current interface, one can have either an inefficient implementation (as CPython chose) or an efficient one (as MicroPython chose), all transparently. os.scandir() supposedly opens up an efficient implementation for everyone, but at the price of bloating the API and introducing heavyweight objects to wrap the info. The PEP calls them "lightweight DirEntry objects", but that cannot be true, because all Python objects are heavyweight, especially those that have methods.
It would be better if os.scandir() were specified to return a struct (named tuple) compatible with the return value of os.stat() (with only the fields relevant to the underlying readdir()-like system call). The grounds for that are obvious: it's an already existing data interface in the "os" module, and it is based on an open standard for operating systems, POSIX, so if one is to expect something about file attributes, that is what one can reasonably base expectations on.
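A rough sketch of what such a return value could look like, under stated assumptions: PartialStat and scan_sketch() are hypothetical names, only st_mode (file type bits only) and st_ino are filled in because those are roughly what a POSIX readdir() can report, and today's os.scandir() (Python 3.6+) again stands in for the raw system call.

```python
import os
import stat
from collections import namedtuple

# Hypothetical stat-like struct: field names match os.stat_result,
# but only the fields a readdir()-like call can supply are present.
PartialStat = namedtuple("PartialStat", ["st_mode", "st_ino"])


def scan_sketch(path):
    """Yield (name, PartialStat) pairs for the entries of path."""
    with os.scandir(path) as it:
        for entry in it:
            # Only the file-type bits of st_mode are known from the scan;
            # permission bits are left as zero.
            if entry.is_symlink():
                mode = stat.S_IFLNK
            elif entry.is_dir(follow_symlinks=False):
                mode = stat.S_IFDIR
            elif entry.is_file(follow_symlinks=False):
                mode = stat.S_IFREG
            else:
                mode = 0  # type not known or not one of the above
            yield entry.name, PartialStat(st_mode=mode, st_ino=entry.inode())
```

Code that already inspects os.stat() results could then reuse the same helpers, for example stat.S_ISDIR(info.st_mode), without learning a new object type.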
But reusing the os.stat struct is glaringly not what's proposed. And it's clear where that comes from: "[DirEntry.]lstat(): like os.lstat(), but requires no system calls on Windows". Nice, but OS "FooBar" can do much more than Windows: it has a system call to send a file by email, right while scanning the directory containing it. So why not have a DirEntry.send_by_email(recipient) method? I hear the answer: it's because CPython strives to support Windows well, while it doesn't care about the "FooBar" OS.
And that again leads to the question I have posed several times: where is the line between "CPython" and "Python"? Is it justified for CPython to add to (or remove from) the Python stdlib something which is useful for its own users, but useless or complicating for other Python implementations? Especially taking into account that there's a "win32api" module that lets Windows users use all the wonders of its API? Especially given that the os.stat struct is itself pretty extensible (https://docs.python.org/3.4/library/os.html#os.stat : "On other Unix systems (such as FreeBSD), the following attributes may be available ...", "On Mac OS systems...", so extra fields can be added for Windows just the same, if really needed).
Would love feedback on the PEP, but also of course on the proposal itself.