Possible new itertool: comm()
In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups. I couldn't find an obvious implementation around, so I ended up having to write my own (see below). Would this be something useful to add to the itertools module? The reasons I think it might be worth including in the stdlib are: 1. The implementation is somewhat tricky to get right, especially the edge cases (I'm not 100% sure my implementation is completely right in this regard). So implementing it yourself each time is error-prone and time consuming. 2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal. There's plenty of room for API bikeshedding here, but the basic algorithm remains the same. What do people think? Paul def comm(it1, it2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] """ it1 = iter(it1) it2 = iter(it2) it1_finished = False it2_finished = False try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True while not (it1_finished or it2_finished): if v1 < v2: yield ('<', v1) try: v1 = next(it1) except StopIteration: it1_finished = True elif v1 > v2: yield ('>', v2) try: v2 = next(it2) except StopIteration: it2_finished = True else: yield ('=', v1) try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True if it1_finished and not it2_finished: yield ('>', v2) for v2 in it2: yield ('>', v2) if it2_finished and not it1_finished: yield ('<', v1) for v1 in it1: yield ('<', v1) if __name__ == '__main__': import doctest doctest.testmod()
On Tue, Jan 6, 2015 at 10:22 AM, Paul Moore
In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
I couldn't find an obvious implementation around, so I ended up having to write my own (see below).
Would this be something useful to add to the itertools module? The reasons I think it might be worth including in the stdlib are:
1. The implementation is somewhat tricky to get right, especially the edge cases (I'm not 100% sure my implementation is completely right in this regard). So implementing it yourself each time is error-prone and time consuming. 2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.
There's plenty of room for API bikeshedding here, but the basic algorithm remains the same.
What do people think?
Paul
def comm(it1, it2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd.
>>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] """
it1 = iter(it1) it2 = iter(it2)
it1_finished = False it2_finished = False try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True while not (it1_finished or it2_finished): if v1 < v2: yield ('<', v1) try: v1 = next(it1) except StopIteration: it1_finished = True elif v1 > v2: yield ('>', v2) try: v2 = next(it2) except StopIteration: it2_finished = True else: yield ('=', v1) try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True
if it1_finished and not it2_finished: yield ('>', v2) for v2 in it2: yield ('>', v2) if it2_finished and not it1_finished: yield ('<', v1) for v1 in it1: yield ('<', v1)
if __name__ == '__main__': import doctest doctest.testmod() _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
So my question is how well with this work with generators/iterators? Your examples use lists, but it would be impossible to use this with anything that isn't finite right?
On 6 January 2015 at 16:33, Ian Cordasco
So my question is how well with this work with generators/iterators? Your examples use lists, but it would be impossible to use this with anything that isn't finite right?
The examples use lists because I was using doctest and needed checkable output. But I think it should work fine with any (finite) iterator. I don't see a problem with the algorithm if I use infinite iterators (the algorithm is one-pass and generates results on demand) but I haven't tested it explicitly. [pause, test...] Yep, looks OK: >>> import itertools >>> c1 = itertools.count() >>> c2 = itertools.count(3) >>> output = comm(c1, c2) >>> next(output) ('<', 0) >>> next(output) ('<', 1) >>> next(output) ('<', 2) >>> next(output) ('=', 3) Paul
On Wed, Jan 7, 2015 at 3:33 AM, Ian Cordasco
So my question is how well with this work with generators/iterators? Your examples use lists, but it would be impossible to use this with anything that isn't finite right?
I think it ought to work fine on infinite iterators, based on my reading of the code. Obviously it would itself be infinite in that case. With one infinite and one finite iterator, it'll eventually get into one of the loops at the end, and forever yield from the infinite iterator; if both are infinite, it'll never break out of the primary loop, and continue consuming values from one or the other and yielding tuples:
def count_by(x): n = 0 while True: n += x yield n it = comm(count_by(2), count_by(3)) next(it) ('<', 2) next(it) ('>', 3) next(it) ('<', 4) next(it) ('=', 6) ... etc etc etc ...
ChrisA
On Wed, Jan 7, 2015 at 3:46 AM, Chris Angelico
def count_by(x): n = 0 while True: n += x yield n
And thanks to the magic of sloppy editing, I just rewrote itertools.count(). In my own testing, there was a print call to show what the internal iterators were yielding, but that got left out, leaving this looking a little silly. Anyhow. ChrisA
On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need. Raymond
On Jan 6, 2015, at 16:22, Paul Moore
2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.
Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.) And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.
On Tue, Jan 06, 2015 at 04:22:23PM +0000, Paul Moore wrote:
In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
It seems awfully specialised to me, so I'm not sure that it belongs in itertools. But for what it's worth, here's my implementation which I think is easier to understand than yours, and covers the unfortunate case where you have unorderable items such as float NANs in your data. def comm(values1, values2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] >>> list(comm([1,2,3,4], [2,3,float('nan'),4])) [('<', 1), ('=', 2), ('=', 3), ('?', (4, nan)), ('>', 4)] """ MISSING = object() it1 = iter(values1) it2 = iter(values2) a = next(it1, MISSING) b = next(it2, MISSING) while True: if a is MISSING: if b is MISSING: return yield ('>', b) for x in it2: yield ('>', x) return if b is MISSING: assert a is not MISSING yield ('<', a) for x in it1: yield ('<', x) return if a == b: yield ('=', a) # Or could use b. a = next(it1, MISSING) b = next(it2, MISSING) elif a < b: yield ('<', a) a = next(it1, MISSING) elif a > b: yield ('>', b) b = next(it2, MISSING) else: # Unorderable a and b, like float NANs. # Maybe we should raise instead? yield ('?', (a, b)) a = next(it1, MISSING) b = next(it2, MISSING) If you prefer this version, feel free to just use it. If you need a licence, consider it being under the MIT licence. (Over the next day or so, time permitting, I'll try to put this on ActiveState's Python recipes and make the licence more official.) -- Steve
On 6 January 2015 at 17:14, Raymond Hettinger
On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist. Its use cases are about as common as those of the Unix "comm" tool :-) But I take your point, it's not *that* common. Paul
On 6 January 2015 at 17:39, Andrew Barnert
2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.
Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.)
Well, I usually write my "little utility scripts" as simple .py files to be run with the system Python. I tend to use them on multiple machines. So unless a dependency is one of the modules I routinely install (things like requests) the process goes run script, oops, needed that distribution, pip install dist, run it again. Not a big issue certainly (and hardly a showstopper) but annoying. And it does mean I'd need to make a PyPI project for my personal utility functions, which doesn't really seem an appropriate use for PyPI, tbh. Requiring Python 3.5+ isn't a big deal, I routinely put the newest version of Python on all my machines. The ones I can't tend to be "secure", meaning I have no access to PyPI either :-(
And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.
Thanks for the suggestion, I might do that. Paul
On 6 January 2015 at 18:28, Ethan Furman
On 01/06/2015 10:00 AM, Steven D'Aprano wrote:
a = next(it1, MISSING) b = next(it2, MISSING)
Don't forget to guard your next calls -- RunTimeError would not be a friendly way to stop iterating. ;)
Doesn't the second argument get used instead of raising StopIteration? Paul
On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore
On 6 January 2015 at 17:14, Raymond Hettinger
wrote: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.
Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs. Regards Antoine.
On 6 January 2015 at 18:00, Steven D'Aprano
But for what it's worth, here's my implementation which I think is easier to understand than yours,
Possibly - it's a trade-off between using a sentinel vs explicitly catching the exceptions. I'm on the fence, personally.
and covers the unfortunate case where you have unorderable items such as float NANs in your data.
Technically I don't think this is possible, as the input iterators need to be sorted, and that property will be violated if there are unorderable items in there... Paul
On 01/06/2015 10:30 AM, Paul Moore wrote:
On 6 January 2015 at 18:28, Ethan Furman
wrote: On 01/06/2015 10:00 AM, Steven D'Aprano wrote:
a = next(it1, MISSING) b = next(it2, MISSING)
Don't forget to guard your next calls -- RunTimeError would not be a friendly way to stop iterating. ;)
Doesn't the second argument get used instead of raising StopIteration?
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators. -- ~Ethan~
On 6 January 2015 at 18:36, Antoine Pitrou
On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore
wrote: On 6 January 2015 at 17:14, Raymond Hettinger
wrote: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.
Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.
For the case described you're right. I had a case the other day where one of the problems was that one list had duplicates, and I needed to see that ([1,1,2,2] vs [1,2] needed to show [1,2] as "only in the first list" and [1,2] as "in both"). I could probably have used counters. But I was starting from thinking that I had a pair of sorted lists, and "in the Unix shell I'd use comm"...) Paul
Folks, I realize I'm sounding a bit like Raymond here, but can you all please find a different forum to discuss algorithms and coding problems and pick each other's solutions apart? The focus on python-ideas should be to quickly validate ideas for the language or the stdlib, not to discuss arbitrary snippets of code. On Tue, Jan 6, 2015 at 10:44 AM, Paul Moore
On 6 January 2015 at 18:36, Antoine Pitrou
wrote: On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore
wrote: On 6 January 2015 at 17:14, Raymond Hettinger
wrote: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.
Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.
For the case described you're right. I had a case the other day where one of the problems was that one list had duplicates, and I needed to see that ([1,1,2,2] vs [1,2] needed to show [1,2] as "only in the first list" and [1,2] as "in both"). I could probably have used counters. But I was starting from thinking that I had a pair of sorted lists, and "in the Unix shell I'd use comm"...)
Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido)
On 6 January 2015 at 18:43, Ethan Furman
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators.
I should probably point out that when writing my code, I was conscious of the generator change and took special case of catching all of the StopIterations that might leak out. It's probably good that I was made to think carefully about it, but all those try...except blocks sure look ugly. (And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value - it's just a generally annoying algorithm with all the code paths for corner cases). Paul
On Jan 6, 2015, at 18:29, Paul Moore
On 6 January 2015 at 17:39, Andrew Barnert
wrote: 2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.
Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.)
Well, I usually write my "little utility scripts" as simple .py files to be run with the system Python. I tend to use them on multiple machines. So unless a dependency is one of the modules I routinely install (things like requests) the process goes run script, oops, needed that distribution, pip install dist, run it again.
I do the same thing; I just long ago got into the habit of including more-itertools as one of my routine installs alongside requests, etc. :) I wonder if there's a common-enough set of things that don't really belong in stdlib (often just because they update too frequently) but are often worth having for many people. If I could just "pip install extralib" on every machine, even at the cost of getting a few libs I don't actually need, it would be worth it. On the other hand, if the common set isn't common enough, so for many people extralib only had half of what they want and was 70% stuff they didn't care shouted it wouldn't be that useful...
Not a big issue certainly (and hardly a showstopper) but annoying. And it does mean I'd need to make a PyPI project for my personal utility functions, which doesn't really seem an appropriate use for PyPI, tbh.
Requiring Python 3.5+ isn't a big deal, I routinely put the newest version of Python on all my machines. The ones I can't tend to be "secure", meaning I have no access to PyPI either :-(
And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.
Thanks for the suggestion, I might do that.
Paul
On 01/06/2015 10:47 AM, Paul Moore wrote:
On 6 January 2015 at 18:43, Ethan Furman wrote:
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators.
I should probably point out that when writing my code, I was conscious of the generator change and took special case of catching all of the StopIterations that might leak out. It's probably good that I was made to think carefully about it, but all those try...except blocks sure look ugly.
Having looked more closely at your code, I don't see any thing that wouldn't have had to be there even without the generator change because your exhausting two iterators.
(And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value
I think I prefer his, although I would rename MISSING to FINISHED. ;) -- ~Ethan~
On 6 January 2015 at 18:47, Guido van Rossum
Folks, I realize I'm sounding a bit like Raymond here, but can you all please find a different forum to discuss algorithms and coding problems and pick each other's solutions apart? The focus on python-ideas should be to quickly validate ideas for the language or the stdlib, not to discuss arbitrary snippets of code.
Agreed. I'm happy with the consensus (reached pretty quickly) that this isn't useful enough for the stdlib. Thanks all. Paul
On 06Jan2015 09:14, Raymond Hettinger
On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups. As far as I can tell, this would be a very rare need.
Really? I do this on an ad hoc basis in shell scripts a lot. I think it might
just be rare for you.
In Python I would generally be choosing to use sets, but that is at least
partially because sets are there and the stdlib doesn't have a comm. With sets,
I inherently need finite sources, and I also don't get to yield results
progressively from ordered iterators. Also, it needs to fit into memory.
The most obvious Python use case I happen to actually have to hand would almost
be an abuse of the suggested comm(): a log merge tool I wrote for merging
multiple logs; in that case the "common" set is always empty, hence the "abuse"
idea; it isn't really abuse, just a corner use case.
It gets run many times every day.
Reviewing the code, I notice it starts with:
from heapq import merge
It strikes me that it might be easier to write comm() as a wrapper to
heapq.merge. Though that wouldn't handle Steven's "unorderable items" case.
Cheers,
Cameron Simpson
On 06Jan2015 19:36, Antoine Pitrou
On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore
wrote: On 6 January 2015 at 17:14, Raymond Hettinger
wrote: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.
Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.
In my use case (an existing tool):
1) I'm merging log files of arbitrary size; I am _not_ going to suck them into
memory. A comm()-like function has a tiny and fixed memory footprint, versus an
unbounded out.
2) I want ordered output, and my inputs are already ordered; why on earth would
I impose a pointless sorting cost on my (currently linear) runtime?
Sets are the "obvious" Python way to do this, because comm() is more or less a
set intersection operation and sets are right there in Python. But for
unbounded sorted inputs and progressive output, they are a _bad_ choice.
Cheers,
Cameron Simpson
On Wed, 7 Jan 2015 08:09:59 +1100
Cameron Simpson
On 06Jan2015 19:36, Antoine Pitrou
wrote: On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore
wrote: On 6 January 2015 at 17:14, Raymond Hettinger
wrote: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.
Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.
In my use case (an existing tool):
1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out.
I don't understand what your use case has to do with comm(). If you just want to merge sorted iterators you don't need all the complication this function has. Regards Antoine.
On 06Jan2015 22:24, Antoine Pitrou
On Wed, 7 Jan 2015 08:09:59 +1100 Cameron Simpson
wrote: On 06Jan2015 19:36, Antoine Pitrou
wrote: On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore
wrote: On 6 January 2015 at 17:14, Raymond Hettinger
wrote: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.
Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.
In my use case (an existing tool):
1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out.
I don't understand what your use case has to do with comm().
I've got two! I, like another poster, also very commonly compare two similar directory trees for commonality, and equivalent list comparisons.
If you just want to merge sorted iterators you don't need all the complication this function has.
Indeed not, but they are very similar tasks: you're pulling in 2 or more
streams of sorted inputs and classifying them. In my direct use case, all into
the same output stream, in order. In comm(), into three streams (well, being an
iterator output: one stream with three classifications).
Cheers,
Cameron Simpson
The comm functionality sounds very reasonable for inclusion. But please call it itertools.common.
Hi,
2015-01-06 18:14 GMT+01:00 Raymond Hettinger
On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
I never used the UNIX comm tool. I didn't know that it exists :-) Can you maybe explain the purpose the tool/your function? Show an example. difflib may be a better candidate to add such function than contextlib. Victor
On 7 January 2015 at 14:19, Victor Stinner
I never used the UNIX comm tool. I didn't know that it exists :-)
Can you maybe explain the purpose the tool/your function? Show an example.
The function has doctests included which may help. Basically, take 2 sorted lists (the Unix tool takes 2 sorted files of text), and merge them, preserving order, with each line marked as "only occurs in the first input", "only occurs in the second input" and "occurs in both inputs". So, for example, with inputs 1,1,2,4 and 1,3,4 we would get both: 1 first: 1 first: 2 second: 3 both: 4 The problem with difflib/diff is mainly that it does a *lot* more work than is needed - for sorted input, a fast single-pass algorithm is fine, whereas a general diff needs to work to find common subsequences. Paul
On Wed, Jan 07, 2015 at 03:19:31PM +0100, Victor Stinner wrote:
Hi,
2015-01-06 18:14 GMT+01:00 Raymond Hettinger
: On Jan 6, 2015, at 8:22 AM, Paul Moore
wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.
As far as I can tell, this would be a very rare need.
I never used the UNIX comm tool. I didn't know that it exists :-)
Can you maybe explain the purpose the tool/your function? Show an example.
It might help to realise that comm is abbreviated from "common". It took me an embarrassingly long time to work that out. This example of using comm in Unix might help: http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
difflib may be a better candidate to add such function than contextlib.
I think you mean itertools rather than contextlib. It seems to me that difflib would be a good place for it. -- Steve
On 7 January 2015 at 15:38, Steven D'Aprano
This example of using comm in Unix might help:
http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had). Paul
Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable. However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me. I could plausibly see the tool raising some exception when the pre-condition was violated. Of course, in that case, we might well have already returned multiple elements on next(comm(it1, it2)) calls before the exception is encountered, and the initial values might be deceptive or retroactively invalidated. I could plausibly see it ignoring the non-sorted elements in some way. Maybe just pretend they are sorted rather than checking, returning somewhat ill-defined values, I'd think. Or maybe just discarding the missorted values? Or flagging them with some other indicator indicating the missorting? I dunno. In any case, this is not something that can be applied to iterators generically, but only to iterators about which one makes lots of assumptions in advance, and hence seems ill-suited for the stdlib itertools. On Wed, Jan 7, 2015 at 8:09 AM, Paul Moore
On 7 January 2015 at 15:38, Steven D'Aprano
wrote: This example of using comm in Unix might help:
http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had).
Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 01/07/2015 01:20 PM, David Mertz wrote:
Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable.
However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me.
itertools.groupby has the same requirement for a pre-sorted input iterator. But I do agree that itertools isn't the best place in the stdlib for common(). difflib seems reasonable to me, if it's added at all. Eric.
I don't know if this clarifies anything, but I tried out such a mis-sorted
example using the actual 'comm' tool
511-tmp % comm comm1 comm2
A
B
C
E
F
J
A
K
M
N
T
U
V
512-tmp % cat comm[12]
A
B
C
E
F
J
A
K
T
U
V
A
B
M
N
T
U
V
On Wed, Jan 7, 2015 at 10:20 AM, David Mertz
Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable.
However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me.
I could plausibly see the tool raising some exception when the pre-condition was violated. Of course, in that case, we might well have already returned multiple elements on next(comm(it1, it2)) calls before the exception is encountered, and the initial values might be deceptive or retroactively invalidated.
I could plausibly see it ignoring the non-sorted elements in some way. Maybe just pretend they are sorted rather than checking, returning somewhat ill-defined values, I'd think. Or maybe just discarding the missorted values? Or flagging them with some other indicator indicating the missorting? I dunno.
In any case, this is not something that can be applied to iterators generically, but only to iterators about which one makes lots of assumptions in advance, and hence seems ill-suited for the stdlib itertools.
On Wed, Jan 7, 2015 at 8:09 AM, Paul Moore
wrote: On 7 January 2015 at 15:38, Steven D'Aprano
wrote: This example of using comm in Unix might help:
http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had).
Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Wed, Jan 7, 2015 at 12:24 PM, Eric V. Smith
On 01/07/2015 01:20 PM, David Mertz wrote:
However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me.
itertools.groupby has the same requirement for a pre-sorted input iterator.
groupby doesn't require sorted input, and behaves in an obvious way for jumbled input. But you're right, of all the functions, it's the most like comm() in terms of the kind of requirements placed on input. -- Devin
participants (15)
-
Andrew Barnert
-
Antoine Pitrou
-
Cameron Simpson
-
Chris Angelico
-
David Mertz
-
Devin Jeanpierre
-
Eric V. Smith
-
Ethan Furman
-
Guido van Rossum
-
Ian Cordasco
-
Jim Baker
-
Paul Moore
-
Raymond Hettinger
-
Steven D'Aprano
-
Victor Stinner