Please consider skipping hidden directories in os.walk, os.fwalk, etc.

In many uses of os.walk it is desirable to skip version control directories (which are usually hidden directories), to the point that almost all of the examples given look like:

    import os
    for root, dirs, files in os.walk(some_dir):
        if 'CVS' in dirs:
            dirs.remove('CVS')  # or .svn or .hg etc.
        # do something...

But of course there are many version control systems, to the point that much of my personal code looks like this (note that I have to use a multitude of version control systems due to project requirements):

    import os
    vcs_dirs = ['.hg', '.svn', 'CVS', '.git', '.bzr']  # Version control directory names I know
    for root, dirs, files in os.walk(some_dir):
        for dirname in vcs_dirs:
            if dirname in dirs:  # dirs.remove() raises ValueError for a missing name
                dirs.remove(dirname)

I am sure that I am missing many other version control systems, but the one thing that all of the ones I am familiar with have in common is that they default to creating their files in hidden directories. I know that the above sometimes hits problems on Windows if someone manually created a directory and you end up with oddities such as Csv\ or .SVN.

Since it could be argued that hidden directories are possibly more common than symlinks (especially in the Windows world, of course), and that hidden directories have normally been hidden by someone for a reason, it seems to make sense to me to normally ignore them in directory traversal.

Obviously there are also occasions when it makes sense to include VCS, or other hidden, directories and files (e.g. "Where did all of my disk space go?" or "delete recursively"), so I would like to suggest including in the os.walk family of functions an additional parameter to control skipping all hidden directories, either positively or negatively.

Names that spring to mind include:

* nohidden
* nohidden_dirs
* hidden
* hidden_dirs

This change could be made with no impact on current behaviour by defaulting to hidden=True (or nohidden=False), which would just about ensure that no existing code is broken. Alternatively, quite a few bugs in existing code could be quietly fixed (and some new ones introduced) by defaulting to skipping hidden directories.

Since the implementation of os.walk has changed to use os.scandir, which exposes the returned file statuses through os.DirEntry.stat(), the overhead should be minimal.

An alternative would be to add another new function, say os.vwalk(), to only walk visible entries.

Note that a decision would have to be made on whether to include such filtering when topdown is False. Personally I am tempted to include the filtering so as to maintain consistency, but ignoring the filter when topdown is False (or if topdown is False and the hidden behaviour is unspecified) might make sense if the skipping of hidden directories becomes the new default (then recursively removing files & directories would still include processing hidden items by default).

If this receives a positive response I would be happy to undertake the effort involved in producing a PR.

--
Steve (Gadget) Barnes
Any opinions in this message are my personal opinions and do not reflect those of my employer.

On Mon, May 07, 2018 at 06:05:15AM +0000, Steve Barnes wrote:
I would write something like:

    for root, dirs, files in filter(ignorable, os.walk(some_dir)):
        ...

where ignorable() is a function that returns False for whatever you want to ignore.

I don't think we can possibly agree on a single definition of "ignorable". This could include any combination of:

- dot files;
- files with the invisible bit set, for file systems which support that;
- files within certain directories;
- files ending in ~ (backup files);
- files with certain extensions;

or more.

Possibly this is a good use-case for composable functions, so we could have a set of pre-built filters:

    ignorable = invisible + dotfiles + directories('.git', '.hg') + extensions('~', '.pdf')

but that sounds like it ought to be a separate module, not built in.

--
Steve
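To make the composable-filter idea above concrete, here is a minimal sketch (not code posted in the thread). The class `Ignore`, the helpers `dotfiles`, `directories` and `extensions`, and the function `walk_without` are all hypothetical names chosen for illustration. Two simplifying assumptions: a predicate returns True for names to ignore (the opposite sense of the filter() example), and each predicate looks only at the bare name, so `directories('CVS')` would also drop a regular file called CVS.

```python
import os


class Ignore:
    """Hypothetical composable predicate: a + b ignores what either one ignores."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, name):
        return self.predicate(name)

    def __add__(self, other):
        return Ignore(lambda name: self(name) or other(name))


# Names starting with a dot (the Unix convention for hidden entries).
dotfiles = Ignore(lambda name: name.startswith("."))


def directories(*names):
    """Ignore entries whose name is exactly one of the given names."""
    return Ignore(lambda name: name in names)


def extensions(*suffixes):
    """Ignore entries whose name ends with one of the given suffixes."""
    return Ignore(lambda name: name.endswith(suffixes))


ignorable = dotfiles + directories("CVS") + extensions("~", ".bak")


def walk_without(top, ignorable):
    """Like os.walk(top), but prune ignorable directories and drop ignorable files."""
    for root, dirs, files in os.walk(top):
        # Assigning to dirs[:] prunes in place, so os.walk never descends into them.
        dirs[:] = [d for d in dirs if not ignorable(d)]
        yield root, dirs, [f for f in files if not ignorable(f)]
```

Pruning relies on os.walk's documented top-down behaviour of honouring in-place changes to `dirs`, so it would not work with topdown=False.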

On Mon, May 7, 2018 at 9:44 PM Steve Barnes <gadgetsteve@live.co.uk> wrote:
On Tue, May 8, 2018 at 12:06 AM Steven D'Aprano <steve@pearwood.info> wrote:
I would write something like: for root, dirs, files in filter(ignorable, os.walk(some_dir)):
I agree with Steven with regard to `filter` needing to be flexible. If you want to avoid duplicate `stat` calls, you'll probably write:

    import os
    import stat

    def is_hidden(st):
        # Note: st_file_attributes is only available on Windows.
        return bool(st.st_file_attributes & stat.FILE_ATTRIBUTE_HIDDEN)

    def visible_walk(path):
        for entry in os.scandir(path):
            if entry.is_dir():
                if not is_hidden(entry.stat()):
                    yield from visible_walk(entry.path)
            else:
                if not is_hidden(entry.stat()):
                    yield entry.path

Then you can decide whether you want to ignore hidden files or just hidden directories. The variations for such a need are many, so it makes sense to leave any specific filtering need outside of the standard library. A PyPI package with a few standard filtered walks could be a nice exploration for this idea.

Cheers,
Yuval
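The `is_hidden` helper in the message above depends on `st_file_attributes`, which exists only on Windows. As a rough sketch of a portable variant (not code from the thread), the version below treats a leading dot as hidden everywhere and consults the Windows hidden attribute only when running on Windows. That definition of "hidden" is an assumption, and `visible_files` is a hypothetical name.

```python
import os
import stat


def is_hidden(entry):
    """Return True if an os.DirEntry looks hidden under the assumed definition."""
    # Dot-prefixed names: the Unix convention, applied on every platform here.
    if entry.name.startswith("."):
        return True
    # Other platforms have no st_file_attributes, so stop before touching it.
    if os.name != "nt":
        return False
    attrs = entry.stat(follow_symlinks=False).st_file_attributes
    return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)


def visible_files(path):
    """Yield paths of non-hidden files, never descending into hidden directories."""
    with os.scandir(path) as it:
        entries = list(it)  # close the directory handle before recursing
    for entry in entries:
        if is_hidden(entry):
            continue
        if entry.is_dir(follow_symlinks=False):
            yield from visible_files(entry.path)
        else:
            yield entry.path
```

Returning early on non-Windows systems avoids an extra stat call per entry there.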

Hi!

On Tue, May 08, 2018 at 07:12:35AM +0000, Yuval Greenfield <ubershmekel@gmail.com> wrote:

So anyone who wants to filter os.walk() must reimplement os.walk() themselves instead of passing something like filter_dir and filter_file (or accept_dir/accept_file) to os.walk()? Kind of painful, no?

> Cheers, Yuval

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

I like the idea. I think an argument to os.walk() is the simplest option for most users. But per some comments, "hidden" is sometimes more subtle than the filesystem bit, i.e. dot-files, ~ suffix, maybe .bak, etc.

I'd suggest combining the ideas slightly and making the new argument 'filter' or 'skip', which takes a callable. Default to None, but provide an os.is_hidden so that users don't need to figure out how to implement it. E.g.

    os.walk(PATH, skip=os.is_hidden)
    os.walk(PATH, skip=lambda entry: entry.name.endswith(('~', '.bak', '.tmp')))

On Tue, May 8, 2018, 5:47 AM Oleg Broytman <phd@phdru.name> wrote:
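To illustrate how a `skip` callable like the one suggested above might behave (a sketch, not code posted in the thread and not an existing os.walk parameter), the hypothetical function `walk_skipping` below mimics top-down os.walk on top of os.scandir and hands each os.DirEntry to `skip`. One assumed simplification: symlinks to directories are reported among the files rather than the directories.

```python
import os


def walk_skipping(top, skip=None):
    """Top-down walk yielding (root, dirs, files); entries where skip(entry) is true are omitted."""
    dirs, files = [], []
    with os.scandir(top) as it:
        for entry in it:
            if skip is not None and skip(entry):
                continue
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    yield top, dirs, files
    # Skipped directories never reach `dirs`, so they are never descended into.
    for name in dirs:
        yield from walk_skipping(os.path.join(top, name), skip)
```

A call such as `walk_skipping(PATH, skip=lambda entry: entry.name.startswith('.'))` would then ignore dot-prefixed files and directories.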

On Tue, May 8, 2018 at 2:00 PM, David Mertz <mertz@gnosis.cx> wrote:
I think this would be a good addition because it gives direct access to the underlying os.scandir() objects, which are currently inaccessible and discarded (if os.walk() were to be written today it'd probably yield (root, os.DirEntry) instead of (root, dirs, files)). As such, one can implement advanced filtering logic without having to call os.stat() for each path string yielded by os.walk(), which is faster. IMO the callback should accept a (root, os.DirEntry) pair though, because the "root" path can also be part of the filtering logic.

--
Giampaolo - http://grodola.blogspot.com
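As a rough illustration of a callback that receives a (root, os.DirEntry) pair (a sketch under assumptions, not an existing API), the hypothetical generator `walk_entries` below yields (root, entry) pairs and lets a hypothetical `accept(root, entry)` callable use the root path in its decision.

```python
import os


def walk_entries(top, accept=None):
    """Yield (root, entry) pairs top-down; accept(root, entry) decides what is kept."""
    with os.scandir(top) as it:
        entries = list(it)  # close the directory handle before recursing
    for entry in entries:
        if accept is not None and not accept(top, entry):
            continue  # a rejected directory is not descended into either
        yield top, entry
        if entry.is_dir(follow_symlinks=False):
            yield from walk_entries(entry.path, accept)
```

For example, `accept=lambda root, entry: not (root == PATH and entry.name == 'build')` would skip a top-level build directory while keeping directories named build deeper in the tree, which a name-only filter cannot express.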

On 08/05/2018 15:53, Giampaolo Rodola' wrote:
I like the idea of extending the original idea to a filtered walk, possibly with some predefined filters.

As there does not seem to be a lot of strong opposition so far to the basic idea (other than some "why bother, it is too easy to do yourself"), it seems that the choice now is between:

a) raising an enhancement request on the tracker (I am not sure if this is major enough to require a PEP), or
b) setting up a new library on PyPI and putting it out there to see if it sinks or swims.

What is the general feeling between the two options?

--
Steve (Gadget) Barnes

fnmatch.filter does Unix filename pattern matching.
https://docs.python.org/3/library/fnmatch.html#fnmatch.filter

grin and grind are like grep and find with options to filter hidden files and VCS directories by default.
https://pypi.org/project/grin/

There's an example of using the Python API here:
https://github.com/rkern/grin/blob/master/examples/grinpython.py

- grin.get_regex(args)
- grin.get_filenames(args)

https://github.com/rkern/grin/blob/master/grin.py

On Wednesday, May 9, 2018, Steve Barnes <gadgetsteve@live.co.uk> wrote:
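As a small illustration of how fnmatch.filter could be applied to the pruning problem in this thread (a sketch, not code from the thread or from grin), the hypothetical helper `prune` below removes directory names matching any of a list of glob patterns. It is meant to be called on the `dirs` list inside a top-down os.walk loop.

```python
import fnmatch


def prune(dirs, patterns):
    """Remove, in place, every name in dirs that matches any of the glob patterns."""
    doomed = set()
    for pattern in patterns:
        # fnmatch.filter returns the subset of names matching one pattern.
        doomed.update(fnmatch.filter(dirs, pattern))
    dirs[:] = [d for d in dirs if d not in doomed]
```

Unlike the glob module, fnmatch does not treat a leading dot specially, so a pattern such as '.*' matches dot-directories like .git.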

On Tue, May 8, 2018 at 2:31 AM, Oleg Broytman <phd@phdru.name> wrote:
Not really. It's pretty simple code so you put it in your 'usual suspects' module and just forget about it. Here's our version, maybe 10 years old (last reworked whenever scandir came out):

    import fnmatch as _fnmatch
    import os as _scandir  # provides scandir() on Python 3.5+; the original presumably imported the scandir backport
    import re as _re

    def _compiled_patterns(patterns, globs=True, flags=0):
        """
        $uuid:95a9b8e2-fb6a-59be-b9c2-da0e6e12f8d3$
        Compile a list of patterns into regex patterns.  If ``globs`` is
        true, use ``fnmatch`` to convert the patterns into regular
        expressions prior to compilation.  ``flags`` is any of the ``re``
        module's regular expression flags.
        """
        if globs:
            patterns = list(_fnmatch.translate(glob) for glob in patterns)
        return list(_re.compile(regex, flags=flags) for regex in patterns)

    def walk(root_path, ignore_directories=(), show_directories=False):
        """
        $uuid:f77197cd-239b-5d93-9253-c3eb7439d720$
        Walk the directory tree and return all the file entries, trimming
        directories as we go.  ``ignore_directories`` is a list of Unix
        file globs.
        """
        ignore_directories = _compiled_patterns(ignore_directories)

        def do_walk(top):
            """ $uuid:e6a4f789-5b5f-56a2-8551-297c142c3e17$ """
            for entry in _scandir.scandir(top):
                if not entry.is_dir():
                    yield entry
                elif entry.is_symlink():
                    pass  # Ignore links.
                elif not any(ignore.match(entry.name) for ignore in ignore_directories):
                    if show_directories:
                        yield entry
                    for entry in do_walk(entry.path):
                        yield entry

        return do_walk(root_path)

On Mon, May 7, 2018, at 02:05, Steve Barnes wrote:
CVS isn't a hidden directory on Linux. Maybe it can be on Windows, but it probably won't be if it's manually created, which you mentioned having issues with.

There's probably a discussion we should be having about exposing these system-specific attributes, but they really can't be a general solution for the problem you have.

macOS, incidentally, has two distinct attributes for hiding files [chflags hidden and setfile -a V], along with a ".private" file that can be in a directory containing a list of filenames to hide.

There are hidden directories, and then there are hidden directories :-). It makes sense to me to add an option to the stdlib functions to skip directories (and files) that the system considers hidden, so I guess that means dotfiles on Unix and files with the hidden attribute on Windows. But if you want "smart" matching that has special knowledge of CVS directories and so forth, then that seems like something that would fit better as a library on PyPI.

The Rust "ignore" crate has a pretty good set of semantics, for reference. It's not trivial, but it sure is handy :-):
https://docs.rs/ignore/0.4.2/ignore/struct.WalkBuilder.html

-n

On Tue, May 8, 2018, 00:43 Steve Barnes <gadgetsteve@live.co.uk> wrote:

Participants (10): David Mertz, Eric Fahlgren, Giampaolo Rodola', Nathaniel Smith, Oleg Broytman, Random832, Steve Barnes, Steven D'Aprano, Wes Turner, Yuval Greenfield