Adding '**' recursive search to glob.glob

This may be simple enough to just be a feature request on the tracker, but I thought I'd post it here first to see if people thought it was a good idea. I'd like it if the glob module supported the (relatively common) facility to use ** to mean recursively search subdirectories. It's a reasonably straightforward patch, and offers a feature that is fairly difficult to implement in user code on top of the existing functionality. The syntax is supported in a number of places (for example the bash shell and things like Java Ant) so it will be relatively familiar to users. For people who don't know the syntax, "a/**/b" is equivalent to "a/*/b or a/*/*/b or a/*/*/*/b or ..." (for as many levels as needed). One obvious downside is that if used carelessly, it can make globbing pretty slow. So I'd propose that it be added as an optional extension enabled using a flag argument (glob(pat, allow_recursive=True)) which is false by default. That would also mean that backward compatibility should not be an issue. Any comments? Is this worth submitting a patch to the tracker? Paul.

Paul Moore <p.f.moore@...> writes:
I'd like it if the glob module supported the (relatively common) facility to use ** to mean recursively search subdirectories. It's a reasonably straightforward patch, and offers a feature that is fairly difficult to implement in user code on top of the existing functionality. The syntax is supported in a number of places (for example the bash shell and things like Java Ant) so it will be relatively familiar to users.
Agreed. This was in packaging/distutils2 and I have now got it in distlib [1]; it supports both recursive globs and variants using the {opt1,opt2,op3} syntax.
One obvious downside is that if used carelessly, it can make globbing pretty slow. So I'd propose that it be added as an optional extension enabled using a flag argument (glob(pat, allow_recursive=True)) which is false by default. That would also mean that backward compatibility should not be an issue.
Isn't the requirement to recurse implied by the presence of '**' in the pattern? What's to be gained by specifying it using allow_recursive as well? Will having allow_recursive=True have any effect if '**' is not in the pattern? If you specify a pattern with '**' and allow_recursive=False, does that mean that '**' effectively acts as '*' would (i.e. one directory level only)? Regards, Vinay Sajip [1] https://bitbucket.org/vinay.sajip/distlib/src/29666/distlib/glob.py?at=defau...

On 14.01.13 17:46, Vinay Sajip wrote:
Isn't the requirement to recurse implied by the presence of '**' in the pattern? What's to be gained by specifying it using allow_recursive as well?
I'll be glad to make it enabled by default, however I'm feeling this is too dangerous. glob('**') on FS root takes too long time. Perhaps that's why (and for backward compatibility) this option (called "starglob") is disabled by default in Bash.
Will having allow_recursive=True have any effect if '**' is not in the pattern? If you specify a pattern with '**' and allow_recursive=False, does that mean that '**' effectively acts as '*' would (i.e. one directory level only)?
Yes, as now.

Le Mon, 14 Jan 2013 18:21:40 +0200, Serhiy Storchaka <storchaka@gmail.com> a écrit :
On 14.01.13 17:46, Vinay Sajip wrote:
Isn't the requirement to recurse implied by the presence of '**' in the pattern? What's to be gained by specifying it using allow_recursive as well?
I'll be glad to make it enabled by default, however I'm feeling this is too dangerous. glob('**') on FS root takes too long time. Perhaps that's why (and for backward compatibility) this option (called "starglob") is disabled by default in Bash.
But there's no reason to write glob('**') with the current API. Regards Antoine.

On 15/01/13 02:46, Vinay Sajip wrote:
Paul Moore<p.f.moore@...> writes:
I'd like it if the glob module supported the (relatively common) facility to use ** to mean recursively search subdirectories.
+1
One obvious downside is that if used carelessly, it can make globbing pretty slow. So I'd propose that it be added as an optional extension enabled using a flag argument (glob(pat, allow_recursive=True)) which is false by default. That would also mean that backward compatibility should not be an issue.
Isn't the requirement to recurse implied by the presence of '**' in the pattern? What's to be gained by specifying it using allow_recursive as well?
Not necessarily. At the moment, a glob like "/**/spam" is equivalent to "/*/spam": [steve@ando /]$ touch /tmp/spam [steve@ando /]$ mkdir /tmp/ham [steve@ando /]$ touch /tmp/ham/spam [steve@ando /]$ python3.3 -c "import glob; print(glob.glob('/**/spam'))" ['/tmp/spam'] With the suggested new functionality, the meaning of the glob will change. From a backwards-compatibility point of view, one might not want to enable the new semantics by default. But, from a *future*-compatibility point of view, I don't know that it is a good idea to require a flag every time a new globbing feature is added. glob.glob(pattern, allow_recurse=True, allow_spam=True, allow_ham=True, allow_eggs=True, ...) Rather than a flag, I suggest a version number: glob.glob(pattern, version=1) # current behaviour, as of 3.3 glob.glob(pattern, version=2) # adds ** recursion in Python 3.4 Then in Python 3.5 or 3.6 support for version 1 globs could be dropped.
Will having allow_recursive=True have any effect if '**' is not in the pattern?
I would expect that it will not have any effect unless ** is present. After all, it simply allows ** to recurse, and no other glob metacharacter can recurse.
If you specify a pattern with '**' and allow_recursive=False, does that mean that '**' effectively acts as '*' would (i.e. one directory level only)?
I expect that without allow_recursive=True, ** would behave identically to a single * -- Steven

Steven D'Aprano wrote:
Rather than a flag, I suggest a version number:
glob.glob(pattern, version=1) # current behaviour, as of 3.3 glob.glob(pattern, version=2) # adds ** recursion in Python 3.4
Yuck, then the reader has to know what features are enabled by which version numbers -- not something that's easy to keep in one's head. -- Greg

On Mon, Jan 14, 2013 at 1:38 PM, Greg Ewing <greg.ewing@canterbury.ac.nz>wrote:
Steven D'Aprano wrote:
Rather than a flag, I suggest a version number:
glob.glob(pattern, version=1) # current behaviour, as of 3.3 glob.glob(pattern, version=2) # adds ** recursion in Python 3.4
Yuck, then the reader has to know what features are enabled by which version numbers -- not something that's easy to keep in one's head.
And if you write glob.glob(..., foofeature=True) it will automatically raise an exception if you use it in a version that doesn't support the feature rather than silently ignoring the error. --- Bruce Check this out: http://bit.ly/yearofpuzzles

On 15/01/13 08:38, Greg Ewing wrote:
Steven D'Aprano wrote:
Rather than a flag, I suggest a version number:
glob.glob(pattern, version=1) # current behaviour, as of 3.3 glob.glob(pattern, version=2) # adds ** recursion in Python 3.4
Yuck, then the reader has to know what features are enabled by which version numbers -- not something that's easy to keep in one's head.
True. But neither are a plethora of enable_feature flags. Is it allow_recursion or allow_recursive or enable_double_star? Globbing is not likely to be something that most people use often enough that the name of the arguments will stick in their head. People will likely need to look it up one way or the other. All this assumes that we need to care about backward compatibility of ** in existing globs. It does seem to be an unlikely thing for people to write. If we don't, then no need for a flag at all. Instead, we could raise a warning for globs with ** in 3.4, and then drop the warning in 3.5. Another option, is a new function. Bool parameters that do nothing but change the behaviour of a function are somewhat of a mild anti-pattern. Perhaps it is better to just keep glob.glob as is, and add glob.recglob or rglob to support **. -- Steven

On 2013-01-15 01:31, Steven D'Aprano wrote:
On 15/01/13 08:38, Greg Ewing wrote:
Steven D'Aprano wrote:
Rather than a flag, I suggest a version number:
glob.glob(pattern, version=1) # current behaviour, as of 3.3 glob.glob(pattern, version=2) # adds ** recursion in Python 3.4
Yuck, then the reader has to know what features are enabled by which version numbers -- not something that's easy to keep in one's head.
True. But neither are a plethora of enable_feature flags. Is it allow_recursion or allow_recursive or enable_double_star? Globbing is not likely to be something that most people use often enough that the name of the arguments will stick in their head. People will likely need to look it up one way or the other.
All this assumes that we need to care about backward compatibility of ** in existing globs. It does seem to be an unlikely thing for people to write. If we don't, then no need for a flag at all. Instead, we could raise a warning for globs with ** in 3.4, and then drop the warning in 3.5.
Another option, is a new function. Bool parameters that do nothing but change the behaviour of a function are somewhat of a mild anti-pattern. Perhaps it is better to just keep glob.glob as is, and add glob.recglob or rglob to support **.
If there's rglob, then shouldn't there also be riglob or irglob? If so, then which one? :-)

On Mon, Jan 14, 2013 at 5:31 PM, Steven D'Aprano <steve@pearwood.info>wrote:
Yuck, then the reader has to know what features are
enabled by which version numbers -- not something that's easy to keep in one's head.
True. But neither are a plethora of enable_feature flags. Is it allow_recursion or allow_recursive or enable_double_star? Globbing is not likely to be something that most people use often enough that the name of the arguments will stick in their head. People will likely need to look it up one way or the other.
I see nothing wrong with asking people to consult the documentation for features they don't use that frequently. Better to check the docs than get it wrong. But the reader of the code is more likely to notice something special is going on when they see glob(..., allow_recursive=True) then rglob(...). And I'd rather have flags than rglob to allow recursion and iglob to ignore case and then either riglob or irglob to do both. Yuck. --- Bruce Check this out: http://bit.ly/yearofpuzzles

On Tue, Jan 15, 2013 at 6:03 AM, Bruce Leban <bruce@leapyear.org> wrote:
[...] and iglob to ignore case and [....]
OT - iglob is the iterator version of glob. perhaps in python 2 this should have been called "xglob". In python 3 it should have been just "glob".
rglob('**.py')
or
glob('**.py', True)
I don't mind either, though I think the first one is a bit clearer because "r" is more telling than "True". Don't mention glob('**.py', allow_recursive=True) because that's probably not going to be the norm. Yuval

On 01/14/2013 09:15 PM, Yuval Greenfield wrote:
rglob('**.py')
or
glob('**.py', True)
I don't mind either, though I think the first one is a bit clearer because "r" is more telling than "True". Don't mention glob('**.py', allow_recursive=True) because that's probably not going to be the norm.
If `allow_recursive` is a keyword-only parameter it will be the norm. :) ~Ethan~

On Tue, Jan 15, 2013 at 11:31 AM, Steven D'Aprano <steve@pearwood.info> wrote:
All this assumes that we need to care about backward compatibility of ** in existing globs. It does seem to be an unlikely thing for people to write. If we don't, then no need for a flag at all. Instead, we could raise a warning for globs with ** in 3.4, and then drop the warning in 3.5.
Another option, is a new function. Bool parameters that do nothing but change the behaviour of a function are somewhat of a mild anti-pattern. Perhaps it is better to just keep glob.glob as is, and add glob.recglob or rglob to support **.
Making boolean parameters less awful from a readability perspective is part of the rationale for keyword-only arguments: they force you to include the parameter name, thus making the call self-documenting. In this case, the conservative backwards compatible migration path would be: In 3.4: - add the recursive globbing capability - add "allow_recursive=None" as a keyword-only argument - emit a DeprecationWarning if the double-star pattern is seen when allow_recursive is None (but not when it is explicitly False) In 3.5: - switch the allow_recursive default value to True - drop the deprecation warning Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Vinay Sajip wrote:
Isn't the requirement to recurse implied by the presence of '**' in the pattern? What's to be gained by specifying it using allow_recursive as well? Will having allow_recursive=True have any effect if '**' is not in the pattern? If you specify a pattern with '**' and allow_recursive=False, does that mean that '**' effectively acts as '*' would (i.e. one directory level only)?
The glob string may come from the user or a remote source. It is possible that developer using glob has never considered "**" might be added, leading to an attacker accessing files in directories they are not allowed to, or DoS attacks because glob becomes very slow. Jeremy

On 14.01.13 15:22, Paul Moore wrote:
This may be simple enough to just be a feature request on the tracker, but I thought I'd post it here first to see if people thought it was a good idea.
There were several issues on tracker for this feature. Issue 13968 has almost ready patch (I should only protect recursive glob from infinite symlink loops). Except symlink loops the patch looks working and you can try it and make a review. I'm going to finish the work this week.
For people who don't know the syntax, "a/**/b" is equivalent to "a/*/b or a/*/*/b or a/*/*/*/b or ..." (for as many levels as needed).
Or a/b.
One obvious downside is that if used carelessly, it can make globbing pretty slow. So I'd propose that it be added as an optional extension enabled using a flag argument (glob(pat, allow_recursive=True)) which is false by default. That would also mean that backward compatibility should not be an issue.
Indeed. That's why I added the "recursive" parameter and disable this by default.

Le Mon, 14 Jan 2013 18:14:55 +0200, Serhiy Storchaka <storchaka@gmail.com> a écrit :
One obvious downside is that if used carelessly, it can make globbing pretty slow. So I'd propose that it be added as an optional extension enabled using a flag argument (glob(pat, allow_recursive=True)) which is false by default. That would also mean that backward compatibility should not be an issue.
Indeed. That's why I added the "recursive" parameter and disable this by default.
Using APIs carelessly is the user's problem, not ours. It should be sufficient to add a small warning in the docs, as I did in pathlib: https://pathlib.readthedocs.org/en/latest/#pathlib.Path.glob Regards Antoine.

On 14.01.13 18:21, Antoine Pitrou wrote:
Using APIs carelessly is the user's problem, not ours. It should be sufficient to add a small warning in the docs, as I did in pathlib:
We need a time machine to publish this warning in 1994, before anyone used the glob in his program. Pathlib has an advantage in this.

On Mon, Jan 14, 2013 at 6:27 PM, Serhiy Storchaka <storchaka@gmail.com>wrote:
We need a time machine to publish this warning in 1994, before anyone used the glob in his program.
Pathlib has an advantage in this.
The following have been discussed already: - deprecate the 'glob' module, moving its functionality to shutil - "starglob" or "use_recursive" option - have a separate "rglob" or "tree" function do this for you http://mail.python.org/pipermail/python-bugs-list/2012-February/thread.html#... And more at https://www.google.com/search?q=site%3Amail.python.org+recursive+glob The patch has been discussed to death already. Not to say that it's too late to speak your mind, but I think if it passes the proper tests and review - it should go in. Yuval Greenfield

On 14 January 2013 16:14, Serhiy Storchaka <storchaka@gmail.com> wrote:
For people who don't know the syntax, "a/**/b" is equivalent to "a/*/b or a/*/*/b or a/*/*/*/b or ..." (for as many levels as needed).
Or a/b.
Hmm, from my experiments, bash doesn't show a/b as matching the pattern a/**/b ...
One obvious downside is that if used carelessly, it can make globbing pretty slow. So I'd propose that it be added as an optional extension enabled using a flag argument (glob(pat, allow_recursive=True)) which is false by default. That would also mean that backward compatibility should not be an issue.
Indeed. That's why I added the "recursive" parameter and disable this by default.
Although I can see Vinay's point, that ** is not useful syntax currently, so there's no compatibility problem. Careless use resulting in long glob times is more of a user issue. Having said that, this debate is *precisely* why I suggested making it a parameter in the first place, so people can choose for themselves. So I guess I agree with your decision :-) Paul.

On 14 January 2013 16:33, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 14.01.13 18:25, Paul Moore wrote:
Hmm, from my experiments, bash doesn't show a/b as matching the pattern a/**/b ...
$ shopt -s globstar $ echo Lib/**/test Lib/ctypes/test Lib/sqlite3/test Lib/test Lib/tkinter/test Lib/unittest/test
Ah, thanks. I hadn't enabled globstar. See what happens when you let a Windows user near a Unix shell? :-) And the fact that globstar is an option gives some weight to having a globstar-like flag in the function signature. Sorry for the noise. Paul.
participants (12)
-
Antoine Pitrou
-
Bruce Leban
-
Ethan Furman
-
Greg Ewing
-
Jeremy Sanders
-
MRAB
-
Nick Coghlan
-
Paul Moore
-
Serhiy Storchaka
-
Steven D'Aprano
-
Vinay Sajip
-
Yuval Greenfield