You wrote a great PEP Ben, thanks :-) But it's now time for comments!
But the underlying system calls -- ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide readdir?
You should add a link to FindFirstFile doc: http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%2...
It looks like WIN32_FIND_DATA has a dwFileAttributes field. So we should mimic the recent addition to stat_result, the new stat_result.st_file_attributes field: add DirEntry.file_attributes, which would only be available on Windows.
The Windows structure also contains:

    FILETIME ftCreationTime;
    FILETIME ftLastAccessTime;
    FILETIME ftLastWriteTime;
    DWORD nFileSizeHigh;
    DWORD nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that the exact API differs depending on the OS for functions of the os module.
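To make the suggestion concrete, here is a rough sketch of how those raw fields could be turned into Python values. The helper names are mine, purely for illustration; the only facts assumed are that a FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC and that the size is split into two 32-bit halves.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond intervals since 1601-01-01 00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(high, low):
    """Combine the high/low 32-bit halves of a FILETIME into an aware datetime."""
    ticks = (high << 32) | low  # number of 100 ns units
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def file_size(n_file_size_high, n_file_size_low):
    """Combine nFileSizeHigh / nFileSizeLow into a single 64-bit size."""
    return (n_file_size_high << 32) | n_file_size_low
```

How such values would be exposed on DirEntry (raw attributes, or folded into the lstat() result) is a separate design question.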
- Instead of bare filename strings, it returns lightweight ``DirEntry`` objects that hold the filename string and provide simple methods that allow access to the stat-like data the operating system returned.
Does your implementation use a free list to avoid the cost of memory allocation? A short free list of 10 or maybe just 1 may help. The free list may be stored directly in the generator object.
``scandir()`` yields a ``DirEntry`` object for each file and directory in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'`` pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each ``DirEntry`` object has the following attributes and methods:
Does it also support bytes filenames on UNIX?
Python now supports undecodable filenames thanks to PEP 383 (surrogateescape). I prefer to use the same type for filenames on Linux and Windows, so Unicode is better. But some users might prefer bytes for other reasons.
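For readers less familiar with PEP 383, a small illustration of why str filenames are lossless. The filename is made up, and UTF-8 is used explicitly here only to keep the example independent of the local filesystem encoding.

```python
# A filename encoded in Latin-1: the byte 0xE9 is not valid UTF-8 here.
raw = b'caf\xe9.txt'

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode('utf-8', 'surrogateescape')
```

On POSIX, os.fsdecode() and os.fsencode() apply this same error handler with the filesystem encoding, so a str name can always be converted back to the bytes the OS returned.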
The ``DirEntry`` attribute and method names were chosen to be the same as those in the new ``pathlib`` module for consistency.
Great! That's exactly what I expected :-) Consistency with other modules.
Notes on caching
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute is obviously always cached, and the ``is_X`` and ``lstat`` methods cache their values (immediately on Windows via ``FindNextFile``, and on first use on Linux / OS X via a ``stat`` call) and never refetch from the system.
For this reason, ``DirEntry`` objects are intended to be used and thrown away after iteration, not stored in long-lived data structures and the methods called again and again.
If a user wants to do that (for example, for watching a file's size change), they'll need to call the regular ``os.lstat()`` or ``os.path.getsize()`` functions which force a new system call each time.
Crazy idea: would it be possible to "convert" a DirEntry object to a pathlib.Path object without losing the cache? I guess that pathlib.Path expects a full stat_result object.
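To illustrate the caching behaviour described in the quoted notes, here is a minimal pure-Python sketch. CachedEntry is a hypothetical stand-in, not the real DirEntry; it only mimics the "fetch on first use, never refetch" rule of the POSIX case.

```python
import os


class CachedEntry:
    """Hypothetical stand-in for DirEntry: lstat() is fetched once, then cached."""

    def __init__(self, directory, name):
        self.name = name
        self._path = os.path.join(directory, name)
        self._lstat = None

    def lstat(self):
        # First call hits the system; later calls return the stale cached result.
        if self._lstat is None:
            self._lstat = os.lstat(self._path)
        return self._lstat
```

If the file grows after the first lstat() call, the entry keeps reporting the old size, whereas a fresh os.lstat() on the same path sees the new one.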
Or, for getting the total size of files in a directory tree -- showing use of the ``DirEntry.lstat()`` method::
    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
        size = 0
        for entry in scandir(path):
            if entry.is_dir():
                sub_path = os.path.join(path, entry.name)
                size += get_tree_size(sub_path)
            else:
                size += entry.lstat().st_size
        return size

Note that ``get_tree_size()`` will get a huge speed boost on Windows, because no extra stat calls are needed, but on Linux and OS X the size information is not returned by the directory iteration functions, so this function won't gain anything there.
I don't understand how you can build a full lstat() result without really calling stat. I see that WIN32_FIND_DATA contains the size, but here you call lstat(). If you know that it's not a symlink, you already know the size, but you still have to call stat() to retrieve all fields required to build a stat_result, no?
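One possible answer, sketched below, is a stat_result in which the fields the directory scan cannot know are simply zero. This is only my guess at an approach, not a description of the actual implementation; partial_stat_result is a hypothetical helper. The only thing relied upon is that os.stat_result can be built from a 10-item sequence.

```python
import os
import stat


def partial_stat_result(is_dir, size, mtime):
    """Build an os.stat_result from the few fields a directory scan may provide.

    Unknown fields (inode, device, link count, uid, gid) are left at 0, and
    mtime is reused for atime and ctime purely to keep the sketch short.
    """
    mode = stat.S_IFDIR if is_dir else stat.S_IFREG
    # Field order: mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime
    return os.stat_result((mode, 0, 0, 0, 0, 0, size, mtime, mtime, mtime))
```

With such an object, entry.lstat().st_size works without a system call, but callers would have to know that fields like st_ino are meaningless.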
The scandir module on GitHub has been forked and used quite a bit (see "Use in the wild" in this PEP),
Do you plan to continue to maintain your module for Python < 3.5, but upgrade your module for the final PEP?
Should scandir be in its own module?
Should the function be included in the standard library in a new module, ``scandir.scandir()``, or just as ``os.scandir()`` as discussed? The preference of this PEP's author (Ben Hoyt) would be ``os.scandir()``, as it's just a single function.
Yes, put it in the os module which is already bloated :-)
Should there be a way to access the full path?
Should ``DirEntry``'s have a way to get the full path without using ``os.path.join(path, entry.name)``? This is a pretty common pattern, and it may be useful to add pathlib-like ``str(entry)`` functionality. This functionality has also been requested in `issue 13`_ on GitHub.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
I think that it would be very convenient to store the directory name in the DirEntry. It should be light, it's just a reference.
And provide a fullname() method which would just return os.path.join(path, entry.name) without trying to resolve path to get an absolute path.
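Roughly what I have in mind, as a pure-Python sketch. EntryWithDir and its attribute names are hypothetical, just to show the shape of the idea.

```python
import os


class EntryWithDir:
    """Hypothetical DirEntry-like object that keeps a reference to its directory."""

    def __init__(self, directory, name):
        self.directory = directory  # the path given to scandir(), stored as-is
        self.name = name

    def fullname(self):
        # Plain join: no normalization and no conversion to an absolute path.
        return os.path.join(self.directory, self.name)
```

A relative path passed to scandir() therefore yields a relative fullname(), the same thing callers get today from writing the os.path.join() call by hand.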
Should it expose Windows wildcard functionality?
Should ``scandir()`` have a way of exposing the wildcard functionality in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The scandir module on GitHub exposes this as a ``windows_wildcard`` keyword argument, allowing Windows power users the option to pass a custom wildcard to ``FindFirstFile``, which may avoid the need to use ``fnmatch`` or similar on the resulting names. It is named the unwieldy ``windows_wildcard`` to remind you you're writing power-user, Windows-only code if you use it.
This boils down to whether ``scandir`` should be about exposing all of the system's directory iteration features, or simply providing a fast, simple, cross-platform directory iteration API.
Would it be hard to implement the wildcard feature on UNIX to compare the performance of scandir('*.jpg') with and without the wildcard built into os.scandir?
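The baseline for such a comparison would be filtering in Python on top of scandir(), something like the sketch below. scandir_matching is a hypothetical helper; the import fallback assumes either an os.scandir() that behaves as the PEP describes or the scandir module from GitHub.

```python
import fnmatch

try:
    from os import scandir          # if the function lives in the os module
except ImportError:
    from scandir import scandir     # otherwise the standalone scandir module


def scandir_matching(path, pattern):
    """Yield only the entries whose name matches a shell-style pattern."""
    for entry in scandir(path):
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry
```

Timing this against a wildcard handled inside scandir() itself would show whether the built-in variant is worth the extra API surface.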
I implemented it in C for the tracemalloc module (Filter object): http://hg.python.org/features/tracemalloc
Get the revision 69fd2d766005 and search for match_filename_joker() in Modules/_tracemalloc.c. The function matches the filename backward because in most cases, the last letter is enough to reject a filename (ex: "*.jpg" => reject filenames not ending with "g").
The filename is normalized before matching the pattern: converted to lowercase and / is replaced with \ on Windows.
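The backward-matching idea, reduced to a Python sketch. This is not the tracemalloc C code: it only handles patterns of the form "*suffix", and the function name is made up. os.path.normcase() is used for the normalization because it lowercases and replaces / with \ on Windows and leaves names unchanged on POSIX.

```python
import os


def match_suffix_backward(filename, pattern):
    """Match a '*suffix' pattern by comparing characters from the end."""
    if not pattern.startswith('*') or '*' in pattern[1:]:
        raise ValueError("only patterns of the form '*suffix' are supported")
    filename = os.path.normcase(filename)
    suffix = os.path.normcase(pattern[1:])
    if len(suffix) > len(filename):
        return False
    # Walk backward: a mismatch on the last character rejects immediately.
    for i in range(1, len(suffix) + 1):
        if filename[-i] != suffix[-i]:
            return False
    return True
```

For "*.jpg", a name such as "notes.txt" is rejected after a single character comparison.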
It was decided to drop the Filter object to keep the tracemalloc module as simple as possible. Charles-François was not convinced by the speedup.
But the tracemalloc case is different because the OS didn't provide an API for that.