Mailman 3 Possible new itertool: comm() - Python-ideas

newer
Adding `pathlib.Path` method that...

Possible new itertool: comm()

older
build:// Path / URI / URN with the...

Paul Moore

6 Jan 2015 6 Jan '15

4:22 p.m.

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups. I couldn't find an obvious implementation around, so I ended up having to write my own (see below). Would this be something useful to add to the itertools module? The reasons I think it might be worth including in the stdlib are: 1. The implementation is somewhat tricky to get right, especially the edge cases (I'm not 100% sure my implementation is completely right in this regard). So implementing it yourself each time is error-prone and time consuming. 2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal. There's plenty of room for API bikeshedding here, but the basic algorithm remains the same. What do people think? Paul def comm(it1, it2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] """ it1 = iter(it1) it2 = iter(it2) it1_finished = False it2_finished = False try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True while not (it1_finished or it2_finished): if v1 < v2: yield ('<', v1) try: v1 = next(it1) except StopIteration: it1_finished = True elif v1 > v2: yield ('>', v2) try: v2 = next(it2) except StopIteration: it2_finished = True else: yield ('=', v1) try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True if it1_finished and not it2_finished: yield ('>', v2) for v2 in it2: yield ('>', v2) if it2_finished and not it1_finished: yield ('<', v1) for v1 in it1: yield ('<', v1) if __name__ == '__main__': import doctest doctest.testmod()

Show replies by thread

Ian Cordasco

6 Jan 6 Jan

4:33 p.m.

On Tue, Jan 6, 2015 at 10:22 AM, Paul Moore wrote:

...

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

I couldn't find an obvious implementation around, so I ended up having to write my own (see below).

Would this be something useful to add to the itertools module? The reasons I think it might be worth including in the stdlib are:

1. The implementation is somewhat tricky to get right, especially the edge cases (I'm not 100% sure my implementation is completely right in this regard). So implementing it yourself each time is error-prone and time consuming. 2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.

There's plenty of room for API bikeshedding here, but the basic algorithm remains the same.

What do people think?

Paul

def comm(it1, it2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd.

>>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] """

it1 = iter(it1) it2 = iter(it2)

it1_finished = False it2_finished = False try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True while not (it1_finished or it2_finished): if v1 < v2: yield ('<', v1) try: v1 = next(it1) except StopIteration: it1_finished = True elif v1 > v2: yield ('>', v2) try: v2 = next(it2) except StopIteration: it2_finished = True else: yield ('=', v1) try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True

if it1_finished and not it2_finished: yield ('>', v2) for v2 in it2: yield ('>', v2) if it2_finished and not it1_finished: yield ('<', v1) for v1 in it1: yield ('<', v1)

if __name__ == '__main__': import doctest doctest.testmod() _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

So my question is how well with this work with generators/iterators? Your examples use lists, but it would be impossible to use this with anything that isn't finite right?

Paul Moore

4:44 p.m.

On 6 January 2015 at 16:33, Ian Cordasco wrote:

...

So my question is how well with this work with generators/iterators? Your examples use lists, but it would be impossible to use this with anything that isn't finite right?

The examples use lists because I was using doctest and needed checkable output. But I think it should work fine with any (finite) iterator. I don't see a problem with the algorithm if I use infinite iterators (the algorithm is one-pass and generates results on demand) but I haven't tested it explicitly. [pause, test...] Yep, looks OK: >>> import itertools >>> c1 = itertools.count() >>> c2 = itertools.count(3) >>> output = comm(c1, c2) >>> next(output) ('<', 0) >>> next(output) ('<', 1) >>> next(output) ('<', 2) >>> next(output) ('=', 3) Paul

Chris Angelico

4:46 p.m.

On Wed, Jan 7, 2015 at 3:33 AM, Ian Cordasco wrote:

...

So my question is how well with this work with generators/iterators? Your examples use lists, but it would be impossible to use this with anything that isn't finite right?

I think it ought to work fine on infinite iterators, based on my reading of the code. Obviously it would itself be infinite in that case. With one infinite and one finite iterator, it'll eventually get into one of the loops at the end, and forever yield from the infinite iterator; if both are infinite, it'll never break out of the primary loop, and continue consuming values from one or the other and yielding tuples:

...

...
...
def count_by(x): n = 0 while True: n += x yield n it = comm(count_by(2), count_by(3)) next(it) ('<', 2) next(it) ('>', 3) next(it) ('<', 4) next(it) ('=', 6) ... etc etc etc ...

ChrisA

Chris Angelico

4:47 p.m.

On Wed, Jan 7, 2015 at 3:46 AM, Chris Angelico wrote:

...

...
...
...
def count_by(x): n = 0 while True: n += x yield n

And thanks to the magic of sloppy editing, I just rewrote itertools.count(). In my own testing, there was a print call to show what the internal iterators were yielding, but that got left out, leaving this looking a little silly. Anyhow. ChrisA

Raymond Hettinger

5:14 p.m.

...

On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need. Raymond

Andrew Barnert

5:39 p.m.

On Jan 6, 2015, at 16:22, Paul Moore wrote:

...

2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.

Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.) And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.

Steven D'Aprano

6 p.m.

On Tue, Jan 06, 2015 at 04:22:23PM +0000, Paul Moore wrote:

...

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

It seems awfully specialised to me, so I'm not sure that it belongs in itertools. But for what it's worth, here's my implementation which I think is easier to understand than yours, and covers the unfortunate case where you have unorderable items such as float NANs in your data. def comm(values1, values2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] >>> list(comm([1,2,3,4], [2,3,float('nan'),4])) [('<', 1), ('=', 2), ('=', 3), ('?', (4, nan)), ('>', 4)] """ MISSING = object() it1 = iter(values1) it2 = iter(values2) a = next(it1, MISSING) b = next(it2, MISSING) while True: if a is MISSING: if b is MISSING: return yield ('>', b) for x in it2: yield ('>', x) return if b is MISSING: assert a is not MISSING yield ('<', a) for x in it1: yield ('<', x) return if a == b: yield ('=', a) # Or could use b. a = next(it1, MISSING) b = next(it2, MISSING) elif a < b: yield ('<', a) a = next(it1, MISSING) elif a > b: yield ('>', b) b = next(it2, MISSING) else: # Unorderable a and b, like float NANs. # Maybe we should raise instead? yield ('?', (a, b)) a = next(it1, MISSING) b = next(it2, MISSING) If you prefer this version, feel free to just use it. If you need a licence, consider it being under the MIT licence. (Over the next day or so, time permitting, I'll try to put this on ActiveState's Python recipes and make the licence more official.) -- Steve

Paul Moore

6:22 p.m.

On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...

...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist. Its use cases are about as common as those of the Unix "comm" tool :-) But I take your point, it's not *that* common. Paul

Ethan Furman

6:28 p.m.

On 01/06/2015 10:00 AM, Steven D'Aprano wrote:

...

a = next(it1, MISSING) b = next(it2, MISSING)

Don't forget to guard your next calls -- RunTimeError would not be a friendly way to stop iterating. ;) -- ~Ethan~

Paul Moore

6:29 p.m.

On 6 January 2015 at 17:39, Andrew Barnert wrote:

...

...
2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.

Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.)

Well, I usually write my "little utility scripts" as simple .py files to be run with the system Python. I tend to use them on multiple machines. So unless a dependency is one of the modules I routinely install (things like requests) the process goes run script, oops, needed that distribution, pip install dist, run it again. Not a big issue certainly (and hardly a showstopper) but annoying. And it does mean I'd need to make a PyPI project for my personal utility functions, which doesn't really seem an appropriate use for PyPI, tbh. Requiring Python 3.5+ isn't a big deal, I routinely put the newest version of Python on all my machines. The ones I can't tend to be "secure", meaning I have no access to PyPI either :-(

...

And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.

Thanks for the suggestion, I might do that. Paul

Paul Moore

6:30 p.m.

On 6 January 2015 at 18:28, Ethan Furman wrote:

...

On 01/06/2015 10:00 AM, Steven D'Aprano wrote:

...
a = next(it1, MISSING) b = next(it2, MISSING)

Don't forget to guard your next calls -- RunTimeError would not be a friendly way to stop iterating. ;)

Doesn't the second argument get used instead of raising StopIteration? Paul

Antoine Pitrou

6:36 p.m.

On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore wrote:

...

On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.

Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs. Regards Antoine.

Paul Moore

6:40 p.m.

On 6 January 2015 at 18:00, Steven D'Aprano wrote:

...

But for what it's worth, here's my implementation which I think is easier to understand than yours,

Possibly - it's a trade-off between using a sentinel vs explicitly catching the exceptions. I'm on the fence, personally.

...

and covers the unfortunate case where you have unorderable items such as float NANs in your data.

Technically I don't think this is possible, as the input iterators need to be sorted, and that property will be violated if there are unorderable items in there... Paul

Ethan Furman

6:43 p.m.

On 01/06/2015 10:30 AM, Paul Moore wrote:

...

On 6 January 2015 at 18:28, Ethan Furman wrote:

...
On 01/06/2015 10:00 AM, Steven D'Aprano wrote:

...
a = next(it1, MISSING) b = next(it2, MISSING)

Don't forget to guard your next calls -- RunTimeError would not be a friendly way to stop iterating. ;)

Doesn't the second argument get used instead of raising StopIteration?

Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators. -- ~Ethan~

Paul Moore

6:44 p.m.

On 6 January 2015 at 18:36, Antoine Pitrou wrote:

...

On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore wrote:

...
On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.

Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.

For the case described you're right. I had a case the other day where one of the problems was that one list had duplicates, and I needed to see that ([1,1,2,2] vs [1,2] needed to show [1,2] as "only in the first list" and [1,2] as "in both"). I could probably have used counters. But I was starting from thinking that I had a pair of sorted lists, and "in the Unix shell I'd use comm"...) Paul

Guido van Rossum

6:47 p.m.

Folks, I realize I'm sounding a bit like Raymond here, but can you all please find a different forum to discuss algorithms and coding problems and pick each other's solutions apart? The focus on python-ideas should be to quickly validate ideas for the language or the stdlib, not to discuss arbitrary snippets of code. On Tue, Jan 6, 2015 at 10:44 AM, Paul Moore wrote:

...

On 6 January 2015 at 18:36, Antoine Pitrou wrote:

...
On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore wrote:

...
On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.

Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.

For the case described you're right. I had a case the other day where one of the problems was that one list had duplicates, and I needed to see that ([1,1,2,2] vs [1,2] needed to show [1,2] as "only in the first list" and [1,2] as "in both"). I could probably have used counters. But I was starting from thinking that I had a pair of sorted lists, and "in the Unix shell I'd use comm"...)

Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- --Guido van Rossum (python.org/~guido)

Paul Moore

6:47 p.m.

On 6 January 2015 at 18:43, Ethan Furman wrote:

...

Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators.

I should probably point out that when writing my code, I was conscious of the generator change and took special case of catching all of the StopIterations that might leak out. It's probably good that I was made to think carefully about it, but all those try...except blocks sure look ugly. (And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value - it's just a generally annoying algorithm with all the code paths for corner cases). Paul

Andrew Barnert

6:59 p.m.

On Jan 6, 2015, at 18:29, Paul Moore wrote:

...

On 6 January 2015 at 17:39, Andrew Barnert wrote:

...
...
2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal.

Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.)

Well, I usually write my "little utility scripts" as simple .py files to be run with the system Python. I tend to use them on multiple machines. So unless a dependency is one of the modules I routinely install (things like requests) the process goes run script, oops, needed that distribution, pip install dist, run it again.

I do the same thing; I just long ago got into the habit of including more-itertools as one of my routine installs alongside requests, etc. :) I wonder if there's a common-enough set of things that don't really belong in stdlib (often just because they update too frequently) but are often worth having for many people. If I could just "pip install extralib" on every machine, even at the cost of getting a few libs I don't actually need, it would be worth it. On the other hand, if the common set isn't common enough, so for many people extralib only had half of what they want and was 70% stuff they didn't care shouted it wouldn't be that useful...

...

Not a big issue certainly (and hardly a showstopper) but annoying. And it does mean I'd need to make a PyPI project for my personal utility functions, which doesn't really seem an appropriate use for PyPI, tbh.

Requiring Python 3.5+ isn't a big deal, I routinely put the newest version of Python on all my machines. The ones I can't tend to be "secure", meaning I have no access to PyPI either :-(

...
And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.

Thanks for the suggestion, I might do that.

Paul

Ethan Furman

7:02 p.m.

On 01/06/2015 10:47 AM, Paul Moore wrote:

...

On 6 January 2015 at 18:43, Ethan Furman wrote:

...
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators.

I should probably point out that when writing my code, I was conscious of the generator change and took special case of catching all of the StopIterations that might leak out. It's probably good that I was made to think carefully about it, but all those try...except blocks sure look ugly.

Having looked more closely at your code, I don't see any thing that wouldn't have had to be there even without the generator change because your exhausting two iterators.

...

(And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value

I think I prefer his, although I would rename MISSING to FINISHED. ;) -- ~Ethan~

Paul Moore

7:39 p.m.

On 6 January 2015 at 18:47, Guido van Rossum wrote:

...

Folks, I realize I'm sounding a bit like Raymond here, but can you all please find a different forum to discuss algorithms and coding problems and pick each other's solutions apart? The focus on python-ideas should be to quickly validate ideas for the language or the stdlib, not to discuss arbitrary snippets of code.

Agreed. I'm happy with the consensus (reached pretty quickly) that this isn't useful enough for the stdlib. Thanks all. Paul

Cameron Simpson

9:04 p.m.

On 06Jan2015 09:14, Raymond Hettinger wrote:

...

...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote: In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

Really? I do this on an ad hoc basis in shell scripts a lot. I think it might just be rare for you. In Python I would generally be choosing to use sets, but that is at least partially because sets are there and the stdlib doesn't have a comm. With sets, I inherently need finite sources, and I also don't get to yield results progressively from ordered iterators. Also, it needs to fit into memory. The most obvious Python use case I happen to actually have to hand would almost be an abuse of the suggested comm(): a log merge tool I wrote for merging multiple logs; in that case the "common" set is always empty, hence the "abuse" idea; it isn't really abuse, just a corner use case. It gets run many times every day. Reviewing the code, I notice it starts with: from heapq import merge It strikes me that it might be easier to write comm() as a wrapper to heapq.merge. Though that wouldn't handle Steven's "unorderable items" case. Cheers, Cameron Simpson If it sticks, force it. If it breaks, it needed replacing anyway.

Cameron Simpson

9:09 p.m.

On 06Jan2015 19:36, Antoine Pitrou wrote:

...

On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore wrote:

...
On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.

Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.

In my use case (an existing tool): 1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out. 2) I want ordered output, and my inputs are already ordered; why on earth would I impose a pointless sorting cost on my (currently linear) runtime? Sets are the "obvious" Python way to do this, because comm() is more or less a set intersection operation and sets are right there in Python. But for unbounded sorted inputs and progressive output, they are a _bad_ choice. Cheers, Cameron Simpson Yesterday, I was running a CNC plasma cutter that's controlled by Windows XP. This is a machine that moves around a plasma torch that cuts thick steel plate. �A "New Java update is available" window popped up while I was working. �Not good. - John Nagle

Antoine Pitrou

9:24 p.m.

On Wed, 7 Jan 2015 08:09:59 +1100 Cameron Simpson wrote:

...

On 06Jan2015 19:36, Antoine Pitrou wrote:

...
On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore wrote:

...
On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.

Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.

In my use case (an existing tool):

1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out.

I don't understand what your use case has to do with comm(). If you just want to merge sorted iterators you don't need all the complication this function has. Regards Antoine.

Cameron Simpson

7 Jan 7 Jan

2:14 a.m.

On 06Jan2015 22:24, Antoine Pitrou wrote:

...

On Wed, 7 Jan 2015 08:09:59 +1100 Cameron Simpson wrote:

...
On 06Jan2015 19:36, Antoine Pitrou wrote:

...
On Tue, 6 Jan 2015 18:22:44 +0000 Paul Moore wrote:

...
On 6 January 2015 at 17:14, Raymond Hettinger wrote:

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist.

Why don't you use sets for such things? Your iterator is really only useful for huge or unhashable inputs.

In my use case (an existing tool):

1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out.

I don't understand what your use case has to do with comm().

I've got two! I, like another poster, also very commonly compare two similar directory trees for commonality, and equivalent list comparisons.

...

If you just want to merge sorted iterators you don't need all the complication this function has.

Indeed not, but they are very similar tasks: you're pulling in 2 or more streams of sorted inputs and classifying them. In my direct use case, all into the same output stream, in order. In comm(), into three streams (well, being an iterator output: one stream with three classifications). Cheers, Cameron Simpson The double cam chain setup on the 1980's DOHC CB750 was another one of Honda's pointless engineering breakthroughs. You know the cycle (if you'll pardon the pun :-), Wonderful New Feature is introduced with much fanfare, WNF is fawned over by the press, WNF is copied by the other three Japanese makers (this step is sometimes optional), and finally, WNF is quietly dropped by Honda. - Blaine Gardner,

Jim Baker

3:30 a.m.

The comm functionality sounds very reasonable for inclusion. But please call it itertools.common.

Victor Stinner

2:19 p.m.

Hi, 2015-01-06 18:14 GMT+01:00 Raymond Hettinger :

...

...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

I never used the UNIX comm tool. I didn't know that it exists :-) Can you maybe explain the purpose the tool/your function? Show an example. difflib may be a better candidate to add such function than contextlib. Victor

Paul Moore

3:32 p.m.

On 7 January 2015 at 14:19, Victor Stinner wrote:

...

I never used the UNIX comm tool. I didn't know that it exists :-)

Can you maybe explain the purpose the tool/your function? Show an example.

The function has doctests included which may help. Basically, take 2 sorted lists (the Unix tool takes 2 sorted files of text), and merge them, preserving order, with each line marked as "only occurs in the first input", "only occurs in the second input" and "occurs in both inputs". So, for example, with inputs 1,1,2,4 and 1,3,4 we would get both: 1 first: 1 first: 2 second: 3 both: 4 The problem with difflib/diff is mainly that it does a *lot* more work than is needed - for sorted input, a fast single-pass algorithm is fine, whereas a general diff needs to work to find common subsequences. Paul

Steven D'Aprano

3:38 p.m.

On Wed, Jan 07, 2015 at 03:19:31PM +0100, Victor Stinner wrote:

...

Hi,

2015-01-06 18:14 GMT+01:00 Raymond Hettinger :

...
...
On Jan 6, 2015, at 8:22 AM, Paul Moore wrote:

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups.

As far as I can tell, this would be a very rare need.

I never used the UNIX comm tool. I didn't know that it exists :-)

Can you maybe explain the purpose the tool/your function? Show an example.

It might help to realise that comm is abbreviated from "common". It took me an embarrassingly long time to work that out. This example of using comm in Unix might help: http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html

...

difflib may be a better candidate to add such function than contextlib.

I think you mean itertools rather than contextlib. It seems to me that difflib would be a good place for it. -- Steve

Paul Moore

4:09 p.m.

On 7 January 2015 at 15:38, Steven D'Aprano wrote:

...

This example of using comm in Unix might help:

http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html

The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had). Paul

David Mertz

6:20 p.m.

Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable. However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me. I could plausibly see the tool raising some exception when the pre-condition was violated. Of course, in that case, we might well have already returned multiple elements on next(comm(it1, it2)) calls before the exception is encountered, and the initial values might be deceptive or retroactively invalidated. I could plausibly see it ignoring the non-sorted elements in some way. Maybe just pretend they are sorted rather than checking, returning somewhat ill-defined values, I'd think. Or maybe just discarding the missorted values? Or flagging them with some other indicator indicating the missorting? I dunno. In any case, this is not something that can be applied to iterators generically, but only to iterators about which one makes lots of assumptions in advance, and hence seems ill-suited for the stdlib itertools. On Wed, Jan 7, 2015 at 8:09 AM, Paul Moore wrote:

...

On 7 January 2015 at 15:38, Steven D'Aprano wrote:

...
This example of using comm in Unix might help:

http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html

The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had).

Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Eric V. Smith

6:24 p.m.

On 01/07/2015 01:20 PM, David Mertz wrote:

...

Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable.

However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me.

itertools.groupby has the same requirement for a pre-sorted input iterator. But I do agree that itertools isn't the best place in the stdlib for common(). difflib seems reasonable to me, if it's added at all. Eric.

David Mertz

6:26 p.m.

I don't know if this clarifies anything, but I tried out such a mis-sorted example using the actual 'comm' tool 511-tmp % comm comm1 comm2 A B C E F J A K M N T U V 512-tmp % cat comm[12] A B C E F J A K T U V A B M N T U V On Wed, Jan 7, 2015 at 10:20 AM, David Mertz wrote:

...

Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable.

However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me.

I could plausibly see the tool raising some exception when the pre-condition was violated. Of course, in that case, we might well have already returned multiple elements on next(comm(it1, it2)) calls before the exception is encountered, and the initial values might be deceptive or retroactively invalidated.

I could plausibly see it ignoring the non-sorted elements in some way. Maybe just pretend they are sorted rather than checking, returning somewhat ill-defined values, I'd think. Or maybe just discarding the missorted values? Or flagging them with some other indicator indicating the missorting? I dunno.

In any case, this is not something that can be applied to iterators generically, but only to iterators about which one makes lots of assumptions in advance, and hence seems ill-suited for the stdlib itertools.

On Wed, Jan 7, 2015 at 8:09 AM, Paul Moore wrote:

...
On 7 January 2015 at 15:38, Steven D'Aprano wrote:

...
This example of using comm in Unix might help:

http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html

The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had).

Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Devin Jeanpierre

9:04 p.m.

On Wed, Jan 7, 2015 at 12:24 PM, Eric V. Smith wrote:

...

On 01/07/2015 01:20 PM, David Mertz wrote:

...
However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me.

itertools.groupby has the same requirement for a pre-sorted input iterator.

groupby doesn't require sorted input, and behaves in an obvious way for jumbled input. But you're right, of all the functions, it's the most like comm() in terms of the kind of requirements placed on input. -- Devin

3396

Age (days ago)

3397

Last active (days ago)

List overview

Download

33 comments

15 participants

participants (15)

Andrew Barnert
Antoine Pitrou
Cameron Simpson
Chris Angelico
David Mertz
Devin Jeanpierre
Eric V. Smith
Ethan Furman
Guido van Rossum
Ian Cordasco
Jim Baker
Paul Moore
Raymond Hettinger
Steven D'Aprano
Victor Stinner

Possible new itertool: comm()

Cameron Simpson

Cameron Simpson

Cameron Simpson

tags

participants (15)