[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
victor.stinner at gmail.com
Fri Jun 27 09:44:17 CEST 2014
You wrote a great PEP Ben, thanks :-) But it's now time for comments!
> But the underlying system calls -- ``FindFirstFile`` /
> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide readdir?
You should add a link to FindFirstFile doc:
It looks like WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic the recent addition to stat_result: the new
stat_result.st_file_attributes field. Add DirEntry.file_attributes, which
would only be available on Windows.
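As a minimal sketch of what a Windows-only attribute looks like to callers, the snippet below reads the existing `st_file_attributes` field of a stat result and falls back to `None` on other platforms. The helper names `windows_attributes` and `is_hidden` are made up for illustration; they are not part of the PEP or of the `os` module.

```python
import os
import stat


def windows_attributes(path):
    """Return the Windows attribute bits for path, or None on other systems."""
    # st_file_attributes only exists on Windows, so use getattr with a default.
    return getattr(os.lstat(path), "st_file_attributes", None)


def is_hidden(path):
    """Return True/False on Windows, None where the field is unavailable."""
    attrs = windows_attributes(path)
    if attrs is None:
        return None
    return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)
```

A proposed DirEntry.file_attributes would presumably need the same kind of guard in portable code.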
The Windows structure also contains
It would be nice to expose them as well. I'm no longer surprised that
the exact API is different depending on the OS for functions of the os
module.
> * Instead of bare filename strings, it returns lightweight
> ``DirEntry`` objects that hold the filename string and provide
> simple methods that allow access to the stat-like data the operating
> system returned.
Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.
> ``scandir()`` yields a ``DirEntry`` object for each file and directory
> in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
> pseudo-directories are skipped, and the entries are yielded in
> system-dependent order. Each ``DirEntry`` object has the following
> attributes and methods:
Does it also support bytes filenames on UNIX?
Python now supports undecodable filenames thanks to PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.
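To illustrate why Unicode names are enough even for undecodable filenames, here is a small sketch of the PEP 383 round trip. The filename bytes are an invented example, and the codec is spelled out as UTF-8 so the result does not depend on the local filesystem encoding.

```python
# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF).
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")
```

Because the round trip is lossless, a str name can be passed back to the OS even when the original bytes were not valid in the filesystem encoding.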
> The ``DirEntry`` attribute and method names were chosen to be the same
> as those in the new ``pathlib`` module for consistency.
Great! That's exactly what I expected :-) Consistency with other modules.
> Notes on caching
> The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
> is obviously always cached, and the ``is_X`` and ``lstat`` methods
> cache their values (immediately on Windows via ``FindNextFile``, and
> on first use on Linux / OS X via a ``stat`` call) and never refetch
> from the system.
> For this reason, ``DirEntry`` objects are intended to be used and
> thrown away after iteration, not stored in long-lived data structures
> and the methods called again and again.
> If a user wants to do that (for example, for watching a file's size
> change), they'll need to call the regular ``os.lstat()`` or
> ``os.path.getsize()`` functions which force a new system call each
> time.
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
> Or, for getting the total size of files in a directory tree -- showing
> use of the ``DirEntry.lstat()`` method::
>     def get_tree_size(path):
>         """Return total size of files in path and subdirs."""
>         size = 0
>         for entry in scandir(path):
>             if entry.is_dir():
>                 sub_path = os.path.join(path, entry.name)
>                 size += get_tree_size(sub_path)
>             else:
>                 size += entry.lstat().st_size
>         return size
> Note that ``get_tree_size()`` will get a huge speed boost on Windows,
> because no extra stat calls are needed, but on Linux and OS X the size
> information is not returned by the directory iteration functions, so
> this function won't gain anything there.
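For readers who want to run the example, here is a sketch against `os.scandir()` as it later shipped in the standard library (Python 3.5+, with context-manager support from 3.6). In that final API, `stat(follow_symlinks=False)` takes the place of the draft's `lstat()`, and `entry.path` replaces the manual `os.path.join()`. The `demo()` helper and its file names are invented for illustration.

```python
import os
import tempfile


def get_tree_size(path):
    """Return the total size in bytes of non-directory entries under path."""
    size = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Recurse into real subdirectories only, not symlinked ones.
                size += get_tree_size(entry.path)
            else:
                # Cached from the directory listing on Windows; one stat call elsewhere.
                size += entry.stat(follow_symlinks=False).st_size
    return size


def demo():
    """Build a tiny tree with 10 + 5 bytes of file data and measure it."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        with open(os.path.join(root, "a.bin"), "wb") as f:
            f.write(b"x" * 10)
        with open(os.path.join(root, "sub", "b.bin"), "wb") as f:
            f.write(b"y" * 5)
        return get_tree_size(root)
```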
I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat(). If you know that it's not a symlink, you
already know the size, but you still have to call stat() to retrieve
all fields required to build a stat_result, no?
> The scandir module on GitHub has been forked and used quite a bit (see
> "Use in the wild" in this PEP),
Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?
> Should scandir be in its own module?
> Should the function be included in the standard library in a new
> module, ``scandir.scandir()``, or just as ``os.scandir()`` as
> discussed? The preference of this PEP's author (Ben Hoyt) would be
> ``os.scandir()``, as it's just a single function.
Yes, put it in the os module which is already bloated :-)
> Should there be a way to access the full path?
> Should ``DirEntry``'s have a way to get the full path without using
> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
> and it may be useful to add pathlib-like ``str(entry)`` functionality.
> This functionality has also been requested in `issue 13`_ on GitHub.
> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
I think that it would be very convenient to store the directory name
in the DirEntry. It should be light, it's just a reference.
And provide a fullname() method which would just return
os.path.join(path, entry.name) without trying to resolve path to get
an absolute path.
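A minimal sketch of that idea, using a hypothetical stand-in class (`SketchDirEntry` is not the real DirEntry, and `dirpath` is an assumed attribute name): the entry keeps a reference to the directory string passed to scandir() and joins it lazily.

```python
import os


class SketchDirEntry:
    """Hypothetical stand-in for DirEntry, for illustration only."""

    __slots__ = ("dirpath", "name")

    def __init__(self, dirpath, name):
        self.dirpath = dirpath  # just a reference to the caller's string
        self.name = name

    def fullname(self):
        # Plain join: no abspath() or realpath(), so a relative path stays relative.
        return os.path.join(self.dirpath, self.name)
```

For example, `SketchDirEntry("photos", "cat.jpg").fullname()` joins the two parts with the platform separator and does not touch the filesystem.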
> Should it expose Windows wildcard functionality?
> Should ``scandir()`` have a way of exposing the wildcard functionality
> in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
> scandir module on GitHub exposes this as a ``windows_wildcard``
> keyword argument, allowing Windows power users the option to pass a
> custom wildcard to ``FindFirstFile``, which may avoid the need to use
> ``fnmatch`` or similar on the resulting names. It is named the
> unwieldy ``windows_wildcard`` to remind you you're writing
> power-user, Windows-only code if you use it.
> This boils down to whether ``scandir`` should be about exposing all of
> the system's directory iteration features, or simply providing a fast,
> simple, cross-platform directory iteration API.
Would it be hard to implement the wildcard feature on UNIX to compare
the performance of scandir('*.jpg') with and without the wildcard built
I implemented it in C for the tracemalloc module (Filter object):
Get the revision 69fd2d766005 and search match_filename_joker() in
Modules/_tracemalloc.c. The function matches the filename backward
because in most cases, the last letter is enough to reject a filename
(ex: "*.jpg" => reject filenames not ending with "g").
The filename is normalized before matching the pattern: converted to
lowercase and / is replaced with \ on Windows.
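The backward comparison can be sketched in Python as follows. This is not the match_filename_joker() code from tracemalloc: it is a simplified illustration that only handles patterns with a single leading `*`, and the function names are invented. It also assumes that both normalization steps apply on Windows only, which is one reading of the sentence above.

```python
def normalize(filename, windows):
    """Normalize a filename before matching (assumed to be Windows-only)."""
    if windows:
        return filename.lower().replace("/", "\\")
    return filename


def matches_backward(filename, pattern):
    """Match filename against a literal pattern or one of the form '*suffix'."""
    if not pattern.startswith("*"):
        return filename == pattern
    suffix = pattern[1:]
    if len(suffix) > len(filename):
        return False
    # Walk from the last character toward the first: most mismatches
    # (e.g. "photo.png" against "*.jpg") are detected on the first comparison.
    for offset in range(1, len(suffix) + 1):
        if filename[-offset] != suffix[-offset]:
            return False
    return True
```

With this sketch, "photo.png" is rejected against "*.jpg" after comparing only the final "g".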
It was decided to drop the Filter object to keep the tracemalloc
module as simple as possible. Charles-François was not convinced by
But the tracemalloc case is different because the OS doesn't provide an API for that.