
fnmatch.filter works great. As a reminder, it takes an iterable of strings and a pattern and returns a list of the strings that match the pattern. And that is wonderful. However, I often need to filter *out* the items that match the pattern (to ignore them). In every project where I need this, I end up copying the function out of the fnmatch library and adding 'not' to the test clause. It would be wonderful if there were a filter_false version in the standard library. Or an inversion Boolean option. Or something, to stop me from having to copy code every time I need to ignore files.
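For concreteness, here is a minimal sketch of the two behaviours using the documented fnmatch API (the file names are just made-up sample data):

from fnmatch import fnmatch, filter

names = ['spam.py', 'spam.pyc', 'eggs.py', 'eggs.pyc']

# what fnmatch.filter already does: keep the matches
matching = filter(names, '*.pyc')                        # ['spam.pyc', 'eggs.pyc']

# what I keep re-implementing by hand: drop the matches
ignored = [n for n in names if not fnmatch(n, '*.pyc')]  # ['spam.py', 'eggs.py']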

On Wed, May 17, 2017 at 12:14:05PM -0400, Alex Walters <tritium-list@sdamon.com> wrote:
Why not create a package and publish it on PyPI? Then all you need is 'pip install fnmatch_filter_false' in your virtual env.

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

That is my normal thought on something like this, but in the case of adding a Boolean argument to fnmatch.filter, it might be as simple as a three-line diff that does not break the API and, as far as I can tell, has no performance implications. Copying a module out of the standard library that is identical except for a three-line diff that does not break compatibility with the standard library... that just reeks of something that should be in the standard library to begin with.

In the case of adding a separate function to fnmatch, it's still not that big of a diff, and it wouldn't have much duplicated code, at least in the way I would implement it: the Boolean-option version would do the work, and the new function would wrap it. A filter_false function, now that I think about it, is less ideal than just adding a keyword-only Boolean option to fnmatch.filter.
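Roughly what I mean by "wrap that", as a sketch (the helper name _filtered and the function names here are hypothetical, not existing stdlib names):

from fnmatch import fnmatch

def _filtered(names, pat, *, invert):
    # shared implementation; 'invert' flips the sense of the test
    return [n for n in names if fnmatch(n, pat) != invert]

def filter(names, pat):
    """Return the names that match pat."""
    return _filtered(names, pat, invert=False)

def filter_false(names, pat):
    """Return the names that do NOT match pat."""
    return _filtered(names, pat, invert=True)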

Top posting, apologies. I'm sure there is a better way to do it, and there is a performance hit, but it's negligible. This is also a three-line delta of the function.

from fnmatch import _compile_pattern, filter as old_filter
import os
import os.path
import posixpath

data = os.listdir()

def filter(names, pat, *, invert=False):
    """Return the subset of the list NAMES that match PAT."""
    result = []
    pat = os.path.normcase(pat)
    match = _compile_pattern(pat)
    if os.path is posixpath:
        # normcase on posix is NOP. Optimize it away from the loop.
        for name in names:
            if bool(match(name)) == (not invert):
                result.append(name)
    else:
        for name in names:
            if bool(match(os.path.normcase(name))) == (not invert):
                result.append(name)
    return result

if __name__ == '__main__':
    import timeit
    print(timeit.timeit(
        "filter(data, '__*')",
        setup="from __main__ import filter, data"
    ))
    print(timeit.timeit(
        "filter(data, '__*')",
        setup="from __main__ import old_filter as filter, data"
    ))

The first test (modified code) timed at 22.492161903402575 seconds; the second test (unmodified) timed at 19.555531892032324 seconds.

On 05/17/2017 07:55 PM, tritium-list@sdamon.com wrote:
If you don't care about slow-downs in this range, you could use this pattern:

excluded = set(filter(data, '__*'))
result = [item for item in data if item not in excluded]

It seems to incur a similar slow-down, although the slow-down is not constant but depends on the size of the set you need to generate.

Wolfgang

On 19.05.2017 20:01, tritium-list@sdamon.com wrote:
I'm sorry, but then your statement above doesn't make any sense to me: "I'm sure there is a better way to do it, and there is a performance hit, but it's negligible." I'm proposing an alternative that times very similarly to your own suggestion, without copying or modifying stdlib code. That said, I still like your idea of adding the exclude functionality to fnmatch. I just thought you might be interested in a solution that works right now.

On Wed, May 17, 2017 at 12:14:05PM -0400, Alex Walters wrote:
At the cost of a slight inefficiency, you could use the pure Python equivalent given in the docs:

https://docs.python.org/3/library/fnmatch.html#fnmatch.filter

    fnmatch.filter(names, pattern)
        Return the subset of the list of names that match pattern.
        It is the same as [n for n in names if fnmatch(n, pattern)],
        but implemented more efficiently.

So your filter_false is:

[n for n in names if not fnmatch(n, pattern)]

which avoids the need for the copy-and-paste anti-pattern. Otherwise, I would support:

- filter_false
- or a glob symbol to reverse the sense of the test, e.g. ~ or ! as the first character;

but I dislike functions that take boolean arguments to change their behaviour. It's not a 100% hard and fast rule, but in general I prefer to avoid functions that take a constant bool argument:

# rather than this:
function(important_args, True)
function(important_args, False)

# use this:
function(important_args)
function_false(important_args)

(although there may be exceptions).

-- Steve
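The glob-symbol idea could be prototyped as a thin wrapper. A minimal sketch, assuming a leading '!' reverses the sense of the test (a hypothetical convention and function name, not an existing fnmatch feature):

from fnmatch import fnmatch

def filter_with_negation(names, pattern):
    # hypothetical: '!' as the first character inverts the match
    if pattern.startswith('!'):
        return [n for n in names if not fnmatch(n, pattern[1:])]
    return [n for n in names if fnmatch(n, pattern)]

One wrinkle with any such convention: a pattern that genuinely starts with a literal '!' would then need some form of escaping.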

I ran a test on the same dataset using listcomps. The modified version of filter is still 22.<some noise>, the unmodified version is still 19.<some noise>. However, the listcomp method is 41.<some noise>. That performance gap is important, at least to me. Before you ask: yes, I have profiled my code, and filtering is a bottleneck.

That is a fair enough argument. In moderate defense of the code I posted earlier, I did make the inversion variable keyword-only.

On Thu, May 18, 2017 at 12:07:05AM -0400, tritium-list@sdamon.com wrote:
41 what? Nanoseconds? Hours? For how many files? *wink*

In any case, are you running on Linux or Unix? If so, try replacing the fnmatch with fnmatchcase, since that avoids calling os.path.normcase (a no-op on Linux) twice for each file.

If you want to reduce the number of function calls even more, try this untested code:

import fnmatch

# using an implementation detail is a bit naughty;
# _compile_pattern returns a regex match function
match = fnmatch._compile_pattern(pattern)
results = [n for n in names if match(n) is None]

-- Steve
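For concreteness, the fnmatchcase variant of the exclusion comprehension would look like this (a sketch with hypothetical sample data, assuming case-sensitive matching is acceptable on your platform):

from fnmatch import fnmatchcase

names = ['spam.py', '__init__.py', '__main__.py']  # hypothetical sample data
pattern = '__*'

# fnmatchcase never calls os.path.normcase, so nothing is
# normalised twice per file
results = [n for n in names if not fnmatchcase(n, pattern)]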

On Wed, May 17, 2017 at 12:14:05PM -0400, Alex Walters wrote:
Since I haven't seen any substantial objections to this idea, I've created an issue on the bug tracker, including a patch:

http://bugs.python.org/issue30413

Unfortunately I have no CPU cycles available to learn the new GitHub way of doing things right now, so somebody else will need to shepherd this through the rest of the process (making a PR, doing a review, rejecting or approving it, etc.).

-- Steve

On Sun, May 21, 2017 at 08:04:04AM +0200, Pavol Lisy wrote:
If fnmatch.filter was written for performance, isn't calling _filtered a step back?
It adds the overhead of *one* function call, regardless of how many thousands or millions of names you are filtering. And the benefit is that filter() and filterfalse() can share the same implementation, without duplicating code.

If you have a tiny list, then it might be a tiny bit slower, but we shouldn't care too much about the performance for tiny lists. We should care about performance for large lists, and for large N the overhead of one extra function call is insignificant. At least... I hope so. If somebody demonstrates an actual and significant performance hit, then we should rethink the implementation. But even a small performance hit is acceptable, if it comes with a significant benefit (no need to duplicate code) and it is small enough.

On my computer, using Python 3.5, I get little significant difference:

# original
[steve@ando ~]$ python3.5 -m timeit -s "from fnmatch import filter" -s "L = list('abcdefghi')" "filter(L, 'b')"
100000 loops, best of 3: 11.7 usec per loop

# patched
100000 loops, best of 3: 12.8 usec per loop

Who cares about 1 millionth of a second? :-)

For a larger list, the difference vanishes:

# original
[steve@ando ~]$ python3.5 -m timeit -s "from fnmatch import filter" -s "L = list('abcdefghi')*10000" "filter(L, 'b')"
10 loops, best of 3: 80.2 msec per loop

# patched
10 loops, best of 3: 79.1 msec per loop

(Yes, the patched version was faster, on my computer.)

These are quick and dirty benchmarks, using an older version of Python, but I expect the same will apply in 3.7 if anyone bothers to check :-)

-- Steve
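For readers following the implementation discussion, the shared-helper refactoring being described would look roughly like this. This is a sketch reconstructed from the thread, not the actual tracker patch; _filtered and filterfalse are the names used in the discussion:

import os
import posixpath
from fnmatch import _compile_pattern  # private stdlib helper

def _filtered(names, pat, *, invert):
    # one shared loop; 'invert' flips the sense of each match test
    result = []
    pat = os.path.normcase(pat)
    match = _compile_pattern(pat)
    if os.path is posixpath:
        # normcase is a no-op on POSIX; skip it inside the loop
        for name in names:
            if bool(match(name)) != invert:
                result.append(name)
    else:
        for name in names:
            if bool(match(os.path.normcase(name))) != invert:
                result.append(name)
    return result

def filter(names, pat):
    """Return the subset of names that match pat."""
    return _filtered(names, pat, invert=False)

def filterfalse(names, pat):
    """Return the subset of names that do not match pat."""
    return _filtered(names, pat, invert=True)

The single extra call is the filter()/filterfalse() wrapper dispatching to _filtered(), which is the overhead being measured above.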

On 5/21/17, Steven D'Aprano <steve@pearwood.info> wrote:
I was reading a 9-10% slowdown. :) It could be a really good improvement if you could run the 110m hurdles in 11.7s instead of 12.8s (nice coincidence ;) since according to the wiki, 12.80s is the current world record: https://en.wikipedia.org/wiki/Men%27s_110_metres_hurdles_world_record_progre... ) But I really agree with this:

> But even a small performance hit is acceptable, if it comes with a significant benefit (no need to duplicate code) and it is small enough.

although I was rather thinking about something like a macro... PL.

participants (7):
- Alex Walters
- Oleg Broytman
- Pavol Lisy
- Steven D'Aprano
- tritium-list@sdamon.com
- Wolfgang Maier
- אלעזר