Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

27 Jun 2014

Hi,

You wrote a great PEP, Ben, thanks :-) But it's now time for comments!
...
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide readdir?

You should add a link to FindFirstFile doc:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%2...

It looks like WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic the recent addition to stat_result: the new
stat_result.st_file_attributes field. Add DirEntry.file_attributes which
would only be available on Windows.
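
As a rough illustration of how such an attribute would be consumed, here is a minimal sketch based on the `st_file_attributes` field of `os.stat_result` (Python 3.5+, Windows only) and the `FILE_ATTRIBUTE_*` constants of the `stat` module. The helper name `is_hidden` is hypothetical, and treating a missing field as "no attributes" is an assumption made for portability.

```python
import os
import stat


def is_hidden(st):
    """Return True if a stat_result carries the Windows 'hidden' attribute."""
    # st_file_attributes only exists on Windows; fall back to 0 elsewhere.
    attrs = getattr(st, "st_file_attributes", 0)
    return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)


print(is_hidden(os.stat(os.curdir)))
```

A DirEntry.file_attributes attribute could be tested against the same constants.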

The Windows structure also contains

  FILETIME ftCreationTime;
  FILETIME ftLastAccessTime;
  FILETIME ftLastWriteTime;
  DWORD    nFileSizeHigh;
  DWORD    nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that
the exact API differs depending on the OS for functions of the os
module.
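
To make the conversion concrete, here is a small sketch of how these raw fields map to the usual Python values. A FILETIME is a 64-bit count of 100-nanosecond ticks since 1601-01-01, split into a high and a low DWORD, and the file size is split the same way. The function names are hypothetical and only illustrate the arithmetic.

```python
# Seconds between the FILETIME epoch (1601-01-01) and the Unix epoch (1970-01-01).
EPOCH_DIFF_SECONDS = 11644473600
TICKS_PER_SECOND = 10 ** 7  # one tick is 100 ns


def file_size(size_high, size_low):
    """Combine nFileSizeHigh and nFileSizeLow into one 64-bit size."""
    return (size_high << 32) | size_low


def filetime_to_unix(high, low):
    """Convert the two DWORD halves of a FILETIME to a Unix timestamp."""
    ticks = (high << 32) | low
    return ticks / TICKS_PER_SECOND - EPOCH_DIFF_SECONDS


print(file_size(1, 0))
```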
...
* Instead of bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the stat-like data the operating
  system returned.
Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.
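
To show what I mean by a free list, here is a Python-level sketch of the idea only. In CPython this would be done in C inside the iterator, and an entry could only be recycled once nobody else holds a reference to it. The `Entry` and `FreeList` classes and the `max_size` parameter are hypothetical.

```python
class Entry:
    """Tiny stand-in for a DirEntry-like object."""
    __slots__ = ("name",)

    def __init__(self, name=None):
        self.name = name


class FreeList:
    """Keep up to max_size released entries around for reuse."""

    def __init__(self, max_size=10):
        self._items = []
        self._max_size = max_size

    def acquire(self, name):
        # Reuse a released object when one is available, else allocate.
        if self._items:
            entry = self._items.pop()
            entry.name = name
            return entry
        return Entry(name)

    def release(self, entry):
        # Drop the object if the free list is already full.
        if len(self._items) < self._max_size:
            entry.name = None
            self._items.append(entry)


free_list = FreeList(max_size=1)
first = free_list.acquire("a.txt")
free_list.release(first)
second = free_list.acquire("b.txt")
print(second is first)
```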
...
``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:
Does it also support bytes filenames on UNIX?

Python now supports undecodable filenames thanks to PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.
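
For reference, a short sketch of the PEP 383 round trip: an undecodable byte becomes a lone surrogate in the str and is restored on encoding. UTF-8 is hard-coded here as an assumption to keep the example deterministic; os.fsdecode() and os.fsencode() do the same thing with the real file system encoding on UNIX. The helper names are hypothetical.

```python
def decode_name(raw):
    """Decode a bytes filename, escaping undecodable bytes as surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name):
    """Encode a str filename back to the original bytes."""
    return name.encode("utf-8", "surrogateescape")


raw = b"caf\xe9.txt"  # Latin-1 encoded name, not valid UTF-8
name = decode_name(raw)
print(ascii(name))
print(encode_name(name) == raw)
```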
...
The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.
Great! That's exactly what I expected :-) Consistency with other modules.
...
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
is obviously always cached, and the ``is_X`` and ``lstat`` methods
cache their values (immediately on Windows via ``FindNextFile``, and
on first use on Linux / OS X via a ``stat`` call) and never refetch
from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
and the methods called again and again.
If a user wants to do that (for example, for watching a file's size
change), they'll need to call the regular ``os.lstat()`` or
``os.path.getsize()`` functions which force a new system call each
time.
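
The caching behaviour described above can be sketched with a toy class. This is not the scandir implementation: `CachedEntry` is a hypothetical stand-in that stats on first use (the Linux / OS X behaviour) and then never refetches.

```python
import os


class CachedEntry:
    """Hypothetical DirEntry-like object with a cached lstat() result."""

    def __init__(self, dirpath, name):
        self.name = name
        self._path = os.path.join(dirpath, name)
        self._lstat = None

    def lstat(self):
        # The first call hits the file system; later calls reuse the result.
        if self._lstat is None:
            self._lstat = os.lstat(self._path)
        return self._lstat
```

If the file grows after the first `lstat()` call, the entry keeps reporting the old size, whereas `os.lstat()` on the same path reports the new one.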
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
...
Or, for getting the total size of files in a directory tree -- showing
use of the ``DirEntry.lstat()`` method::
    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
        size = 0
        for entry in scandir(path):
            if entry.is_dir():
                sub_path = os.path.join(path, entry.name)
                size += get_tree_size(sub_path)
            else:
                size += entry.lstat().st_size
        return size
Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on Linux and OS X the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.
I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat(). If you know that it's not a symlink, you
already know the size, but you still have to call stat() to retrieve
all fields required to build a stat_result, no?
...
Support
=======
The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP),
Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?
...
Should scandir be in its own module?
------------------------------------
Should the function be included in the standard library in a new
module, ``scandir.scandir()``, or just as ``os.scandir()`` as
discussed? The preference of this PEP's author (Ben Hoyt) would be
``os.scandir()``, as it's just a single function.
Yes, put it in the os module which is already bloated :-)
...
Should there be a way to access the full path?
----------------------------------------------
Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
I think that it would be very convenient to store the directory name
in the DirEntry. It should be light, it's just a reference.

And provide a fullname() method which would just return
os.path.join(path, entry.name) without trying to resolve path to get
an absolute path.
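
Something like the following sketch, where the `Entry` class and its `dirpath` attribute are hypothetical names used only to illustrate the idea. The entry keeps a reference to the directory string passed to scandir() and joins on demand.

```python
import os


class Entry:
    """Hypothetical DirEntry-like object that remembers its directory."""
    __slots__ = ("dirpath", "name")

    def __init__(self, dirpath, name):
        self.dirpath = dirpath  # just a reference to the caller's string
        self.name = name

    def fullname(self):
        # Plain join: no normalization and no conversion to an absolute path.
        return os.path.join(self.dirpath, self.name)


print(Entry("photos", "cat.jpg").fullname())
```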
...
Should it expose Windows wildcard functionality?
------------------------------------------------
Should ``scandir()`` have a way of exposing the wildcard functionality
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
scandir module on GitHub exposes this as a ``windows_wildcard``
keyword argument, allowing Windows power users the option to pass a
custom wildcard to ``FindFirstFile``, which may avoid the need to use
``fnmatch`` or similar on the resulting names. It is named the
unwieldy ``windows_wildcard`` to remind you you're writing power-user,
Windows-only code if you use it.
This boils down to whether ``scandir`` should be about exposing all of
the system's directory iteration features, or simply providing a fast,
simple, cross-platform directory iteration API.
Would it be hard to implement the wildcard feature on UNIX, to compare
the performance of scandir('*.jpg') with and without the wildcard built
into os.scandir?
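
The "without" side of that comparison could look like the sketch below: filter the names in Python with fnmatch. It assumes a scandir() exposed as os.scandir() (as in Python 3.5 and later), and the function name `scan_matching` is hypothetical.

```python
import fnmatch
import os


def scan_matching(path, pattern):
    """Yield the names in path that match a shell-style pattern."""
    for entry in os.scandir(path):
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry.name


for name in scan_matching(os.curdir, "*.jpg"):
    print(name)
```

Timing this against a version with the wildcard built in would show whether the extra complexity pays off.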

I implemented it in C for the tracemalloc module (Filter object):
http://hg.python.org/features/tracemalloc

Get the revision 69fd2d766005 and search match_filename_joker() in
Modules/_tracemalloc.c. The function matches the filename backward
because in most cases, the last letter is enough to reject a filename
(ex: "*.jpg" => reject filenames not ending with "g").

The filename is normalized before matching the pattern: converted to
lowercase and / is replaced with \ on Windows.
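
Here is a simplified Python sketch of the backward-matching idea. It is not the code from _tracemalloc.c: it only supports patterns with a single leading `*`, the function names are hypothetical, and applying both normalization steps on Windows only is an assumption.

```python
def normalize(filename, windows):
    """Normalize a filename before matching (assumed Windows-only steps)."""
    if windows:
        filename = filename.lower().replace("/", "\\")
    return filename


def match_suffix_backward(filename, pattern):
    """Match a '*suffix' pattern, comparing from the last character."""
    if not pattern.startswith("*") or "*" in pattern[1:]:
        raise ValueError("this sketch only supports a single leading '*'")
    suffix = pattern[1:]
    if len(suffix) > len(filename):
        return False
    # Walk backward: most mismatches are caught on the very last character.
    for i in range(1, len(suffix) + 1):
        if filename[-i] != suffix[-i]:
            return False
    return True


print(match_suffix_backward("photo.jpg", "*.jpg"))
print(match_suffix_backward("notes.txt", "*.jpg"))
```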

It was decided to drop the Filter object to keep the tracemalloc
module as simple as possible. Charles-François was not convinced by
the speedup.

But the tracemalloc case is different because the OS doesn't provide an API for that.

Victor

Victor Stinner