Addition to fnmatch.py

I’d like to propose a simple addition to the 'fnmatch' module. Specifically, a function that returns a list of names that match a pattern AND a list of those that don't.

In a recent project I found that I wished to split a list of files and move those that matched a pattern to one folder and those that didn't to another folder. I was using fnmatch.filter to select the first set of files and then a list comprehension to generate the second set. For a small number of files (~10) this was perfectly adequate. However, as I needed to process many files (>>10000), the execution time was very significant. Profiling the code showed that the time was spent in generating the second set.

I tried a number of solutions, including designing a negative filter, walking the file system to find those files that had not been moved, and using more and more convoluted ways to improve the second selection. Eventually I gave in and hacked a local copy of fnmatch.py as below:

```
def split(names, pat):
    """Return the subset of the list NAMES that match PAT,
    and also those names that do not match."""
    result = []
    notresult = []
    pat = os.path.normcase(pat)
    pattern_match = _compile_pattern(pat)
    if os.path is posixpath:
        # normcase on posix is NOP. Optimize it away from the loop.
        for name in names:
            if pattern_match(name):
                result.append(name)
            else:
                notresult.append(name)
    else:
        for name in names:
            if pattern_match(os.path.normcase(name)):
                result.append(name)
            else:
                notresult.append(name)
    return result, notresult
```

The change is the addition of else clauses to the if pattern_match statements. This solved the problem, and benchmarking showed that it only took a very small additional time (20 ms for a million strings) to generate both lists.

Number of test cases: 1000000
Example data: ['Ba1txmKkiC', 'KlJx.f_AGj', 'Umwbw._Wa9', '4YlgA5LVpI']
Search for '*A*'

```
Test             Time (sec)   Positive   Negative
WCmatch.filter     1.953125      26211          0
filter             0.328125      14259          0
split              0.343750      14259      85741
List Comp.       270.468751      14259      85741
```

The list comprehension, [x for x in a if x not in b]*, was nearly 900 times slower. 'fnmatch' was an appropriate solution to this problem, as typing 'glob'-style search patterns was easier than having to enter regular expressions when prompted by my code.

I would like to propose that split, even though it is very simple, be included in the 'fnmatch' module.

John

* a is the original and b is those that match.
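For illustration, a minimal usage sketch of the proposal, assuming the split() above had actually been added to fnmatch (the file names are invented):

```python
import fnmatch

names = ['report_A.txt', 'notes.md', 'data_A.csv', 'readme.rst']
# Hypothetical call: assumes the proposed split() exists in fnmatch.
matched, unmatched = fnmatch.split(names, '*A*')
# matched   -> ['report_A.txt', 'data_A.csv']
# unmatched -> ['notes.md', 'readme.rst']
```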

On Sun, Jun 5, 2022, 12:08 PM MRAB <python@mrabarnett.plus.com> wrote:
> On 2022-06-05 16:12, David Mertz, Ph.D. wrote:
> > This is exactly the problem solved by set difference. E.g. `{a, b, c} - {a, c}`.
> >
> > This operation is linear on the size of the removed set.
>
> You don't have to use a set difference, which might matter if the order was important, but just make a set of the ones found:
>
> s = set(b)
>
> And then the list comprehension:
>
> [x for x in a if x not in s]

Sure, that's nice enough code and has the same big-O complexity. I suspect set difference is a little faster (by a constant multiple) because it hits C code more, but I haven't benchmarked.

The OP said the elements were from fnmatch though, which explicitly does not promise order. So really it's just whether you like your code or this better aesthetically:

list(set(a) - set(b))

Personally I like my (somewhat shorter) version as more clear. But YMMV.
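To make the comparison concrete, a minimal sketch of the two approaches under discussion, assuming (per the OP's footnote) that a is the original list and b the matched subset:

```python
a = ['w.txt', 'x.txt', 'y.log', 'z.log']
b = ['w.txt', 'x.txt']   # e.g. the output of fnmatch.filter(a, '*.txt')

# MRAB's version: build one set for O(1) membership tests; keeps a's order.
s = set(b)
rest = [x for x in a if x not in s]   # ['y.log', 'z.log']

# Set difference: same linear complexity, but no ordering guarantee.
rest2 = list(set(a) - set(b))         # {'y.log', 'z.log'} in arbitrary order
```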

Op 5/06/2022 om 18:47 schreef David Mertz, Ph.D.:
I benchmarked it and indeed the list difference is a little faster.

```
timeit.timeit('s=set(b); [x for x in a if x not in s]', setup='a = list(range(10000)); b = list(range(1000))', number=1000)
0.6156670850032242
```
And what's more, it seems to also preserve order. I guess because a set is implemented like a dict, and will preserve order if you only remove some elements from the set.

I think y’all have addressed the OP’s problem — at least the performance issues. But I think this brings up a pattern that I don’t have a nifty way to address: How do you divide a sequence into two sequences cleanly?

The filter pattern: selectively remove items from a sequence, returning a new sequence. There are a few ways to do that in Python. But what if you need both the remaining items and the removed ones? Easy enough to write a loop that populates two lists, but is there a nifty one-liner? Is this a known functional pattern?

-CHB

On Sun, Jun 5, 2022 at 7:39 PM Eric V. Smith via Python-ideas <python-ideas@python.org> wrote:
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
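For reference, the simple two-list loop Christopher alludes to, as a minimal sketch (the names are mine):

```python
def divide(pred, seq):
    # One pass; every item lands in exactly one of the two lists.
    kept, removed = [], []
    for item in seq:
        (kept if pred(item) else removed).append(item)
    return kept, removed

evens, odds = divide(lambda n: n % 2 == 0, range(10))
# evens -> [0, 2, 4, 6, 8]; odds -> [1, 3, 5, 7, 9]
```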

On Mon, 6 Jun 2022 at 18:22, Paul Moore <p.f.moore@gmail.com> wrote:
There’s an itertools recipe, “partition”.
Yes, there is (and plenty of equivalents in other languages). But it's still not what I'd call "particularly easy". What that recipe does is filter the same iterable with the same predicate twice, so it depends on the predicate being repeatable. You can't use it, for instance, to split a data set into training and test sets, even though conceptually this should work:

training, test = partition(lambda x: random.random() < 0.9, dataset)

To truly perform a one-pass partitioning operation, you'd have to do something like this:

```
def partition(pred, iter):
    results = [], []
    for thing in iter:
        results[not pred(thing)].append(thing)
    return results
```

Obviously that works only on finite iterables, so it's not appropriate for an itertools recipe, but it does guarantee that the predicate is called once per item and each item ends up in precisely one list.

It would be kinda nice to be able to do something like this with a comprehension, since there's a fair bit of overhead to calling a function (predicate functions have to be called for each element, but a comprehension wraps everything up into one function). But since it's not, the above is good enough for the cases where it's needed - not particularly easy, but not all that hard either.

ChrisA
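A quick usage sketch of the eager version above (values mine): pred(thing) being True gives index 0, so the first list holds the items satisfying the predicate:

```python
big, small = partition(lambda n: n >= 5, [3, 7, 1, 9, 4, 6])
# big   -> [7, 9, 6]   (results[not True]  == results[0])
# small -> [3, 1, 4]   (results[not False] == results[1])
```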

```
from collections import deque

def partition(pred, iterable):
    results = deque([]), deque([])

    def gen_split(only):
        for thing in iterable:
            if results[only]:
                yield results[only].popleft()
            results[not pred(thing)].append(thing)
        for thing in results[only]:
            yield thing

    return gen_split(True), gen_split(False)
```

This is really nice, but I think you have a negation wrong in there somewhere:

```
In [1]: from collections import deque
   ...:
   ...: def partition(pred, iterable):
   ...:     results = deque([]), deque([])
   ...:
   ...:     def gen_split(only):
   ...:         for thing in iterable:
   ...:             if results[only]:
   ...:                 yield results[only].popleft()
   ...:             results[not pred(thing)].append(thing)
   ...:         for thing in results[only]:
   ...:             yield thing
   ...:
   ...:     return gen_split(True), gen_split(False)
   ...:

In [2]: def isEven(n):
   ...:     return n/2 == n//2
   ...:

In [3]: from itertools import count

In [4]: evens, odds = partition(isEven, count())

In [5]: next(evens)
Out[5]: 1

In [6]: next(evens)
Out[6]: 3

In [7]: next(evens)
Out[7]: 5

In [8]: next(odds)
Out[8]: 0

In [9]: next(odds)
Out[9]: 2

In [10]: next(odds)
Out[10]: 4
```

On Mon, Jun 6, 2022 at 10:32 AM Mathew Elman <mathew.elman@ocado.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

oops, you're right, I have ended up with an unnecessary "not" in there, should be:

```
from collections import deque

def partition(pred, iterable):
    results = deque([]), deque([])

    def gen_split(only):
        for thing in iterable:
            if results[only]:
                yield results[only].popleft()
            results[pred(thing)].append(thing)
        for thing in results[only]:
            yield thing

    return gen_split(True), gen_split(False)
```

Op 6/06/2022 om 18:07 schreef Mathew Elman:
There still is something wrong. I get the second list twice:

```
odd, even = partition(lambda i: i % 2, range(20))
print(list(odd))
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(list(even))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

On Mon, Jun 06, 2022 at 06:17:32PM +0200, Benedict Verhegghe wrote:
Confirmed. But if you replace range(20) with `iter(range(20))`, it works correctly.

The plot thickens. The duplicated list is not from the second list created, but the second list *evaluated*. So if you run:

odd, even = partition(...)

as before, but evaluate *even* first and odd second:

print(list(even))
print(list(odd))

it is odd that is doubled, not even.

-- Steve
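A minimal sketch of the likely cause (my reading; the thread doesn't spell it out): each gen_split closure runs its own `for thing in iterable` loop, and each such loop calls iter(iterable) afresh. A range is re-iterable, so whichever generator is evaluated second gets a second full pass over the data; a plain iterator is consumed once, so the two generators are forced to share it:

```python
r = range(3)
print(list(r), list(r))    # [0, 1, 2] [0, 1, 2]  - re-iterable: fresh pass each time

it = iter(range(3))
print(list(it), list(it))  # [0, 1, 2] []         - one shared, consumable pass
```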

Why don't you use the version from the itertools recipes? ``` from itertools import tee, filterfalse def partition(pred, iterable): "Use a predicate to partition entries into false entries and true entries" # partition(is_odd, range(10)) --> 0 2 4 6 8 and 1 3 5 7 9 t1, t2 = tee(iterable) return filterfalse(pred, t1), filter(pred, t2) ``` -- Steve
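A quick usage sketch of that recipe (example values mine):

```python
evens, odds = partition(lambda n: n % 2, range(10))
list(evens)  # [0, 2, 4, 6, 8] - the entries where the predicate is false
list(odds)   # [1, 3, 5, 7, 9] - the entries where the predicate is true
```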

This calls the predicate once per element:

```
from itertools import tee

def partition(pred, iterable):
    t1, t2 = tee((pred(x), x) for x in iterable)
    return (x for b, x in t1 if not b), (x for b, x in t2 if b)
```

It's kind of inefficient though.
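A small check that the predicate really does run once per element here (the counter is mine):

```python
calls = 0
def is_odd(n):
    global calls
    calls += 1
    return n % 2

falses, trues = partition(is_odd, range(10))
print(list(falses), list(trues))  # [0, 2, 4, 6, 8] [1, 3, 5, 7, 9]
print(calls)                      # 10, not 20 as with the two-filter recipe
```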

On Tue, 7 Jun 2022 at 19:09, Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
Honestly, if it weren't that there's currently a recipe in itertools, I think the list-based version would be the best recipe to offer. The cost of doing the job lazily AND maintaining all the other expectations is too high; if you really need it to be lazy, it's probably worth seeing if one of the other demands can be relaxed (eg if it's fine to call the predicate twice). For the OP's task, doing the partitioning eagerly wouldn't be an issue. ChrisA

On 2022-06-07 11:03, Chris Angelico wrote:
The problem with a lazy solution is that if you're producing 2 streams of output, as it were, and if you're consuming from only one of them, what are you going to do about the other one? You still need to store the items for it in case you want to consume them later, and if the original iterable is infinite, you have a problem...

As Chris Angelico notes, that's only good for finite iterables. The recipe in the itertools docs handles unbounded iterables, but the cache it needs is also unbounded if you get long runs of items that do (or don't) satisfy the predicate.

On Mon, Jun 6, 2022 at 7:06 AM Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
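To make the cache concern concrete, a small sketch (numbers mine): drain one of the recipe's two streams without touching the other, and tee must buffer every item the untouched stream has not yet yielded:

```python
from itertools import tee, filterfalse

t1, t2 = tee(range(1_000_000))
falses = filterfalse(lambda n: n % 2, t1)
trues = filter(lambda n: n % 2, t2)

sum(falses)  # consumes all of t1...
# ...so tee is now buffering all 1,000,000 items on behalf of t2.
```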

David Mertz, Ph.D. writes:
Couldn't it be linear on min(len(left), len(right))? I.e.:

```
if len(left) < len(right):
    for elem in list(left):   # iterate a copy: can't discard while iterating a set
        if elem in right:
            left.discard(elem)
else:
    for elem in right:
        left.discard(elem)
```

The upper loop is probably only about half as fast as the lower (two hash lookups needed), so maybe the length test could be tuned. The whole idea is probably not worth it; I can't think of practical use cases at a scale where it would matter.
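Wrapped up as a runnable sketch (the function name and demo sets are mine):

```python
def discard_all(left, right):
    # Remove from `left` every element of `right`, looping over the smaller set.
    if len(left) < len(right):
        for elem in list(left):   # copy: can't discard while iterating a set
            if elem in right:
                left.discard(elem)
    else:
        for elem in right:
            left.discard(elem)

s = {1, 2, 3, 4, 5}
discard_all(s, {2, 4, 6})
print(s)  # {1, 3, 5}
```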

Le 05/06/2022 à 16:52, John Carter a écrit :
Searching for an element in a list scans the elements one by one, so [x for x in a if x not in b] has quadratic complexity. Try list(set(a) - set(b)) instead.

```
import timeit
timeit.timeit('[x for x in a if x not in b]', setup='a = list(range(10_000)); b = list(range(1000))', number=100)
5.258457429998089
timeit.timeit('list(set(a)-set(b))', setup='a = list(range(10_000)); b = list(range(1000))', number=100)
0.07163406799372751
```

Jean

Thanks for all the comments.

I did consider using sets, but order was important, and when partitioning a list of general strings (not file names) there could be identical but individually important strings: strings at different positions could mean different things, especially when generating test and training sets.

I also tried doing it twice (generating lists) and it took approximately twice as long, though generating two iterables could defer the partitioning until it was needed, i.e. lazy evaluation.

I've added some of the solutions to my timing test, identified by the contributors' initials at the end of the table (columns: test, time in seconds, positive count, negative count, total processed, order-check flag).

Reference list lengths: 144004 and 855996
Number of test cases: 1000000
Example data: ['xDy7AWbXau', 'TXlzsZV3Ra', 'YJh8uD9ovK', 'aRJ2U7nWs8', 'geu.vHlogu']

```
Test        Time (sec)  Positive  Negative    Total  Order
FNmatch       0.671875    268756         0   268756      0
WCmatch       1.562500    264939         0   264939      0
IWCmatch      1.281250         0         0        0      1
Easy          0.093750    144004    855996  1000000      1
Re            0.328125    855996    144004  1000000      0
Positive      0.281250    144004         0   144004      1
Negative      0.328125         0    855996   855996      1
UpperCase     0.375000    268756         0   268756      0
Null          0.000000         0         0        0      1
Both          0.328125    144004    855996  1000000      1
Partition     1.171875    855996    144004  1000000      0
IBoth         0.328125    855996    144004  1000000      0
MRAB          0.437500    144004    855996  1000000      1
METZ          0.500000    144004    855996  1000000      1
CA            0.343750    144004    855996  1000000      1
SJB           0.328125    144004    855996  1000000      1
```

I checked for order and, interestingly, all the set solutions preserve it. I think this is because there are no duplicates in the test data. Order is only checked correctly if there are the same number of elements in the test and reference lists.

I also tried more_itertools.partition. Nearly 4 times slower.

John

participants (14)
- Ben Rudiak-Gould
- Benedict Verhegghe
- Chris Angelico
- Christopher Barker
- David Mertz, Ph.D.
- Eric V. Smith
- Jean Abou Samra
- John Carter
- Mathew Elman
- MRAB
- Paul Moore
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano