Possible new itertool: comm()

In writing a utility script today, I found myself needing to do something similar to what the Unix "comm" utility does - take two sorted iterators, and partition the values into "only in the first", "only in the second", and "in both" groups. I couldn't find an obvious implementation around, so I ended up having to write my own (see below). Would this be something useful to add to the itertools module? The reasons I think it might be worth including in the stdlib are: 1. The implementation is somewhat tricky to get right, especially the edge cases (I'm not 100% sure my implementation is completely right in this regard). So implementing it yourself each time is error-prone and time consuming. 2. The times I've needed this have been ad-hoc scripts (I'm on Windows, so while a Unix user might use a quick shell pipeline with comm, that's less convenient for me) where depending on a 3rd party distribution from PyPI is less ideal. There's plenty of room for API bikeshedding here, but the basic algorithm remains the same. What do people think? Paul def comm(it1, it2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] """ it1 = iter(it1) it2 = iter(it2) it1_finished = False it2_finished = False try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True while not (it1_finished or it2_finished): if v1 < v2: yield ('<', v1) try: v1 = next(it1) except StopIteration: it1_finished = True elif v1 > v2: yield ('>', v2) try: v2 = next(it2) except StopIteration: it2_finished = True else: yield ('=', v1) try: v1 = next(it1) except StopIteration: it1_finished = True try: v2 = next(it2) except StopIteration: it2_finished = True if it1_finished and not it2_finished: yield ('>', v2) for v2 in it2: yield ('>', v2) if it2_finished and not it1_finished: yield ('<', v1) for v1 in it1: yield ('<', v1) if __name__ == '__main__': import doctest doctest.testmod()

On 6 January 2015 at 16:33, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
The examples use lists because I was using doctest and needed checkable output. But I think it should work fine with any (finite) iterator. I don't see a problem with the algorithm if I use infinite iterators (the algorithm is one-pass and generates results on demand) but I haven't tested it explicitly. [pause, test...] Yep, looks OK: >>> import itertools >>> c1 = itertools.count() >>> c2 = itertools.count(3) >>> output = comm(c1, c2) >>> next(output) ('<', 0) >>> next(output) ('<', 1) >>> next(output) ('<', 2) >>> next(output) ('=', 3) Paul

On Wed, Jan 7, 2015 at 3:33 AM, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
I think it ought to work fine on infinite iterators, based on my reading of the code. Obviously it would itself be infinite in that case. With one infinite and one finite iterator, it'll eventually get into one of the loops at the end, and forever yield from the infinite iterator; if both are infinite, it'll never break out of the primary loop, and continue consuming values from one or the other and yielding tuples:
ChrisA

On Wed, Jan 7, 2015 at 3:46 AM, Chris Angelico <rosuav@gmail.com> wrote:
And thanks to the magic of sloppy editing, I just rewrote itertools.count(). In my own testing, there was a print call to show what the internal iterators were yielding, but that got left out, leaving this looking a little silly. Anyhow. ChrisA

On 6 January 2015 at 17:14, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist. Its use cases are about as common as those of the Unix "comm" tool :-) But I take your point, it's not *that* common. Paul

On 6 January 2015 at 18:36, Antoine Pitrou <solipsis@pitrou.net> wrote:
For the case described you're right. I had a case the other day where one of the problems was that one list had duplicates, and I needed to see that ([1,1,2,2] vs [1,2] needed to show [1,2] as "only in the first list" and [1,2] as "in both"). I could probably have used counters. But I was starting from thinking that I had a pair of sorted lists, and "in the Unix shell I'd use comm"...) Paul

Folks, I realize I'm sounding a bit like Raymond here, but can you all please find a different forum to discuss algorithms and coding problems and pick each other's solutions apart? The focus on python-ideas should be to quickly validate ideas for the language or the stdlib, not to discuss arbitrary snippets of code. On Tue, Jan 6, 2015 at 10:44 AM, Paul Moore <p.f.moore@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 06Jan2015 19:36, Antoine Pitrou <solipsis@pitrou.net> wrote:
In my use case (an existing tool): 1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out. 2) I want ordered output, and my inputs are already ordered; why on earth would I impose a pointless sorting cost on my (currently linear) runtime? Sets are the "obvious" Python way to do this, because comm() is more or less a set intersection operation and sets are right there in Python. But for unbounded sorted inputs and progressive output, they are a _bad_ choice. Cheers, Cameron Simpson <cs@zip.com.au> Yesterday, I was running a CNC plasma cutter that's controlled by Windows XP. This is a machine that moves around a plasma torch that cuts thick steel plate. �A "New Java update is available" window popped up while I was working. �Not good. - John Nagle

On 06Jan2015 22:24, Antoine Pitrou <solipsis@pitrou.net> wrote:
I've got two! I, like another poster, also very commonly compare two similar directory trees for commonality, and equivalent list comparisons.
Indeed not, but they are very similar tasks: you're pulling in 2 or more streams of sorted inputs and classifying them. In my direct use case, all into the same output stream, in order. In comm(), into three streams (well, being an iterator output: one stream with three classifications). Cheers, Cameron Simpson <cs@zip.com.au> The double cam chain setup on the 1980's DOHC CB750 was another one of Honda's pointless engineering breakthroughs. You know the cycle (if you'll pardon the pun :-), Wonderful New Feature is introduced with much fanfare, WNF is fawned over by the press, WNF is copied by the other three Japanese makers (this step is sometimes optional), and finally, WNF is quietly dropped by Honda. - Blaine Gardner, <blgardne@sim.es.com>

The comm functionality sounds very reasonable for inclusion. But please call it itertools.common.

On 06Jan2015 09:14, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
Really? I do this on an ad hoc basis in shell scripts a lot. I think it might just be rare for you. In Python I would generally be choosing to use sets, but that is at least partially because sets are there and the stdlib doesn't have a comm. With sets, I inherently need finite sources, and I also don't get to yield results progressively from ordered iterators. Also, it needs to fit into memory. The most obvious Python use case I happen to actually have to hand would almost be an abuse of the suggested comm(): a log merge tool I wrote for merging multiple logs; in that case the "common" set is always empty, hence the "abuse" idea; it isn't really abuse, just a corner use case. It gets run many times every day. Reviewing the code, I notice it starts with: from heapq import merge It strikes me that it might be easier to write comm() as a wrapper to heapq.merge. Though that wouldn't handle Steven's "unorderable items" case. Cheers, Cameron Simpson <cs@zip.com.au> If it sticks, force it. If it breaks, it needed replacing anyway.

Hi, 2015-01-06 18:14 GMT+01:00 Raymond Hettinger <raymond.hettinger@gmail.com>:
I never used the UNIX comm tool. I didn't know that it exists :-) Can you maybe explain the purpose the tool/your function? Show an example. difflib may be a better candidate to add such function than contextlib. Victor

On 7 January 2015 at 14:19, Victor Stinner <victor.stinner@gmail.com> wrote:
I never used the UNIX comm tool. I didn't know that it exists :-)
Can you maybe explain the purpose the tool/your function? Show an example.
The function has doctests included which may help. Basically, take 2 sorted lists (the Unix tool takes 2 sorted files of text), and merge them, preserving order, with each line marked as "only occurs in the first input", "only occurs in the second input" and "occurs in both inputs". So, for example, with inputs 1,1,2,4 and 1,3,4 we would get both: 1 first: 1 first: 2 second: 3 both: 4 The problem with difflib/diff is mainly that it does a *lot* more work than is needed - for sorted input, a fast single-pass algorithm is fine, whereas a general diff needs to work to find common subsequences. Paul

On Wed, Jan 07, 2015 at 03:19:31PM +0100, Victor Stinner wrote:
It might help to realise that comm is abbreviated from "common". It took me an embarrassingly long time to work that out. This example of using comm in Unix might help: http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
difflib may be a better candidate to add such function than contextlib.
I think you mean itertools rather than contextlib. It seems to me that difflib would be a good place for it. -- Steve

On 7 January 2015 at 15:38, Steven D'Aprano <steve@pearwood.info> wrote:
This example of using comm in Unix might help:
http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had). Paul

Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable. However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me. I could plausibly see the tool raising some exception when the pre-condition was violated. Of course, in that case, we might well have already returned multiple elements on next(comm(it1, it2)) calls before the exception is encountered, and the initial values might be deceptive or retroactively invalidated. I could plausibly see it ignoring the non-sorted elements in some way. Maybe just pretend they are sorted rather than checking, returning somewhat ill-defined values, I'd think. Or maybe just discarding the missorted values? Or flagging them with some other indicator indicating the missorting? I dunno. In any case, this is not something that can be applied to iterators generically, but only to iterators about which one makes lots of assumptions in advance, and hence seems ill-suited for the stdlib itertools. On Wed, Jan 7, 2015 at 8:09 AM, Paul Moore <p.f.moore@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Wed, Jan 7, 2015 at 12:24 PM, Eric V. Smith <eric@trueblade.com> wrote:
groupby doesn't require sorted input, and behaves in an obvious way for jumbled input. But you're right, of all the functions, it's the most like comm() in terms of the kind of requirements placed on input. -- Devin

I don't know if this clarifies anything, but I tried out such a mis-sorted example using the actual 'comm' tool 511-tmp % comm comm1 comm2 A B C E F J A K M N T U V 512-tmp % cat comm[12] A B C E F J A K T U V A B M N T U V On Wed, Jan 7, 2015 at 10:20 AM, David Mertz <mertz@gnosis.cx> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Jan 6, 2015, at 16:22, Paul Moore <p.f.moore@gmail.com> wrote:
Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.) And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.

On 6 January 2015 at 17:39, Andrew Barnert <abarnert@yahoo.com> wrote:
Well, I usually write my "little utility scripts" as simple .py files to be run with the system Python. I tend to use them on multiple machines. So unless a dependency is one of the modules I routinely install (things like requests) the process goes run script, oops, needed that distribution, pip install dist, run it again. Not a big issue certainly (and hardly a showstopper) but annoying. And it does mean I'd need to make a PyPI project for my personal utility functions, which doesn't really seem an appropriate use for PyPI, tbh. Requiring Python 3.5+ isn't a big deal, I routinely put the newest version of Python on all my machines. The ones I can't tend to be "secure", meaning I have no access to PyPI either :-(
And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.
Thanks for the suggestion, I might do that. Paul

On Jan 6, 2015, at 18:29, Paul Moore <p.f.moore@gmail.com> wrote:
I do the same thing; I just long ago got into the habit of including more-itertools as one of my routine installs alongside requests, etc. :) I wonder if there's a common-enough set of things that don't really belong in stdlib (often just because they update too frequently) but are often worth having for many people. If I could just "pip install extralib" on every machine, even at the cost of getting a few libs I don't actually need, it would be worth it. On the other hand, if the common set isn't common enough, so for many people extralib only had half of what they want and was 70% stuff they didn't care shouted it wouldn't be that useful...

On Tue, Jan 06, 2015 at 04:22:23PM +0000, Paul Moore wrote:
It seems awfully specialised to me, so I'm not sure that it belongs in itertools. But for what it's worth, here's my implementation which I think is easier to understand than yours, and covers the unfortunate case where you have unorderable items such as float NANs in your data. def comm(values1, values2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] >>> list(comm([1,2,3,4], [2,3,float('nan'),4])) [('<', 1), ('=', 2), ('=', 3), ('?', (4, nan)), ('>', 4)] """ MISSING = object() it1 = iter(values1) it2 = iter(values2) a = next(it1, MISSING) b = next(it2, MISSING) while True: if a is MISSING: if b is MISSING: return yield ('>', b) for x in it2: yield ('>', x) return if b is MISSING: assert a is not MISSING yield ('<', a) for x in it1: yield ('<', x) return if a == b: yield ('=', a) # Or could use b. a = next(it1, MISSING) b = next(it2, MISSING) elif a < b: yield ('<', a) a = next(it1, MISSING) elif a > b: yield ('>', b) b = next(it2, MISSING) else: # Unorderable a and b, like float NANs. # Maybe we should raise instead? yield ('?', (a, b)) a = next(it1, MISSING) b = next(it2, MISSING) If you prefer this version, feel free to just use it. If you need a licence, consider it being under the MIT licence. (Over the next day or so, time permitting, I'll try to put this on ActiveState's Python recipes and make the licence more official.) -- Steve

On 01/06/2015 10:30 AM, Paul Moore wrote:
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators. -- ~Ethan~

On 6 January 2015 at 18:43, Ethan Furman <ethan@stoneleaf.us> wrote:
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators.
I should probably point out that when writing my code, I was conscious of the generator change and took special case of catching all of the StopIterations that might leak out. It's probably good that I was made to think carefully about it, but all those try...except blocks sure look ugly. (And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value - it's just a generally annoying algorithm with all the code paths for corner cases). Paul

On 01/06/2015 10:47 AM, Paul Moore wrote:
Having looked more closely at your code, I don't see any thing that wouldn't have had to be there even without the generator change because your exhausting two iterators.
(And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value
I think I prefer his, although I would rename MISSING to FINISHED. ;) -- ~Ethan~

On 6 January 2015 at 18:00, Steven D'Aprano <steve@pearwood.info> wrote:
But for what it's worth, here's my implementation which I think is easier to understand than yours,
Possibly - it's a trade-off between using a sentinel vs explicitly catching the exceptions. I'm on the fence, personally.
and covers the unfortunate case where you have unorderable items such as float NANs in your data.
Technically I don't think this is possible, as the input iterators need to be sorted, and that property will be violated if there are unorderable items in there... Paul

On 6 January 2015 at 16:33, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
The examples use lists because I was using doctest and needed checkable output. But I think it should work fine with any (finite) iterator. I don't see a problem with the algorithm if I use infinite iterators (the algorithm is one-pass and generates results on demand) but I haven't tested it explicitly. [pause, test...] Yep, looks OK: >>> import itertools >>> c1 = itertools.count() >>> c2 = itertools.count(3) >>> output = comm(c1, c2) >>> next(output) ('<', 0) >>> next(output) ('<', 1) >>> next(output) ('<', 2) >>> next(output) ('=', 3) Paul

On Wed, Jan 7, 2015 at 3:33 AM, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
I think it ought to work fine on infinite iterators, based on my reading of the code. Obviously it would itself be infinite in that case. With one infinite and one finite iterator, it'll eventually get into one of the loops at the end, and forever yield from the infinite iterator; if both are infinite, it'll never break out of the primary loop, and continue consuming values from one or the other and yielding tuples:
ChrisA

On Wed, Jan 7, 2015 at 3:46 AM, Chris Angelico <rosuav@gmail.com> wrote:
And thanks to the magic of sloppy editing, I just rewrote itertools.count(). In my own testing, there was a print call to show what the internal iterators were yielding, but that got left out, leaving this looking a little silly. Anyhow. ChrisA

On 6 January 2015 at 17:14, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
It's come up for me a few times, usually when trying to check two lists of files to see which ones have been missed by a program, and which ones the program thinks are present but no longer exist. Its use cases are about as common as those of the Unix "comm" tool :-) But I take your point, it's not *that* common. Paul

On 6 January 2015 at 18:36, Antoine Pitrou <solipsis@pitrou.net> wrote:
For the case described you're right. I had a case the other day where one of the problems was that one list had duplicates, and I needed to see that ([1,1,2,2] vs [1,2] needed to show [1,2] as "only in the first list" and [1,2] as "in both"). I could probably have used counters. But I was starting from thinking that I had a pair of sorted lists, and "in the Unix shell I'd use comm"...) Paul

Folks, I realize I'm sounding a bit like Raymond here, but can you all please find a different forum to discuss algorithms and coding problems and pick each other's solutions apart? The focus on python-ideas should be to quickly validate ideas for the language or the stdlib, not to discuss arbitrary snippets of code. On Tue, Jan 6, 2015 at 10:44 AM, Paul Moore <p.f.moore@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 06Jan2015 19:36, Antoine Pitrou <solipsis@pitrou.net> wrote:
In my use case (an existing tool): 1) I'm merging log files of arbitrary size; I am _not_ going to suck them into memory. A comm()-like function has a tiny and fixed memory footprint, versus an unbounded out. 2) I want ordered output, and my inputs are already ordered; why on earth would I impose a pointless sorting cost on my (currently linear) runtime? Sets are the "obvious" Python way to do this, because comm() is more or less a set intersection operation and sets are right there in Python. But for unbounded sorted inputs and progressive output, they are a _bad_ choice. Cheers, Cameron Simpson <cs@zip.com.au> Yesterday, I was running a CNC plasma cutter that's controlled by Windows XP. This is a machine that moves around a plasma torch that cuts thick steel plate. �A "New Java update is available" window popped up while I was working. �Not good. - John Nagle

On 06Jan2015 22:24, Antoine Pitrou <solipsis@pitrou.net> wrote:
I've got two! I, like another poster, also very commonly compare two similar directory trees for commonality, and equivalent list comparisons.
Indeed not, but they are very similar tasks: you're pulling in 2 or more streams of sorted inputs and classifying them. In my direct use case, all into the same output stream, in order. In comm(), into three streams (well, being an iterator output: one stream with three classifications). Cheers, Cameron Simpson <cs@zip.com.au> The double cam chain setup on the 1980's DOHC CB750 was another one of Honda's pointless engineering breakthroughs. You know the cycle (if you'll pardon the pun :-), Wonderful New Feature is introduced with much fanfare, WNF is fawned over by the press, WNF is copied by the other three Japanese makers (this step is sometimes optional), and finally, WNF is quietly dropped by Honda. - Blaine Gardner, <blgardne@sim.es.com>

The comm functionality sounds very reasonable for inclusion. But please call it itertools.common.

On 06Jan2015 09:14, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
Really? I do this on an ad hoc basis in shell scripts a lot. I think it might just be rare for you. In Python I would generally be choosing to use sets, but that is at least partially because sets are there and the stdlib doesn't have a comm. With sets, I inherently need finite sources, and I also don't get to yield results progressively from ordered iterators. Also, it needs to fit into memory. The most obvious Python use case I happen to actually have to hand would almost be an abuse of the suggested comm(): a log merge tool I wrote for merging multiple logs; in that case the "common" set is always empty, hence the "abuse" idea; it isn't really abuse, just a corner use case. It gets run many times every day. Reviewing the code, I notice it starts with: from heapq import merge It strikes me that it might be easier to write comm() as a wrapper to heapq.merge. Though that wouldn't handle Steven's "unorderable items" case. Cheers, Cameron Simpson <cs@zip.com.au> If it sticks, force it. If it breaks, it needed replacing anyway.

Hi, 2015-01-06 18:14 GMT+01:00 Raymond Hettinger <raymond.hettinger@gmail.com>:
I never used the UNIX comm tool. I didn't know that it exists :-) Can you maybe explain the purpose the tool/your function? Show an example. difflib may be a better candidate to add such function than contextlib. Victor

On 7 January 2015 at 14:19, Victor Stinner <victor.stinner@gmail.com> wrote:
I never used the UNIX comm tool. I didn't know that it exists :-)
Can you maybe explain the purpose the tool/your function? Show an example.
The function has doctests included which may help. Basically, take 2 sorted lists (the Unix tool takes 2 sorted files of text), and merge them, preserving order, with each line marked as "only occurs in the first input", "only occurs in the second input" and "occurs in both inputs". So, for example, with inputs 1,1,2,4 and 1,3,4 we would get both: 1 first: 1 first: 2 second: 3 both: 4 The problem with difflib/diff is mainly that it does a *lot* more work than is needed - for sorted input, a fast single-pass algorithm is fine, whereas a general diff needs to work to find common subsequences. Paul

On Wed, Jan 07, 2015 at 03:19:31PM +0100, Victor Stinner wrote:
It might help to realise that comm is abbreviated from "common". It took me an embarrassingly long time to work that out. This example of using comm in Unix might help: http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
difflib may be a better candidate to add such function than contextlib.
I think you mean itertools rather than contextlib. It seems to me that difflib would be a good place for it. -- Steve

On 7 January 2015 at 15:38, Steven D'Aprano <steve@pearwood.info> wrote:
This example of using comm in Unix might help:
http://www.theunixschool.com/2011/03/comm-beautiful-comparison.html
The example on that page is the exact reason I wrote my own implementation (the Windows version of comm that I have locally got confused by non-ASCII content in some of the filenames I had). Paul

Adding a itertools.comm() seems unnecessary, although adding it to the widely used more-itertools feels more reasonable. However, the problem with the functionality is that it assumes that the iterables given to it return pre-sorted values. Which is fine it that assumption is satisfied, but one really can't know in advance of exhausting the iterators whether that will prove true. The desired behavior in the absence of this pre-condition being satisfied is unclear to me. I could plausibly see the tool raising some exception when the pre-condition was violated. Of course, in that case, we might well have already returned multiple elements on next(comm(it1, it2)) calls before the exception is encountered, and the initial values might be deceptive or retroactively invalidated. I could plausibly see it ignoring the non-sorted elements in some way. Maybe just pretend they are sorted rather than checking, returning somewhat ill-defined values, I'd think. Or maybe just discarding the missorted values? Or flagging them with some other indicator indicating the missorting? I dunno. In any case, this is not something that can be applied to iterators generically, but only to iterators about which one makes lots of assumptions in advance, and hence seems ill-suited for the stdlib itertools. On Wed, Jan 7, 2015 at 8:09 AM, Paul Moore <p.f.moore@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Wed, Jan 7, 2015 at 12:24 PM, Eric V. Smith <eric@trueblade.com> wrote:
groupby doesn't require sorted input, and behaves in an obvious way for jumbled input. But you're right, of all the functions, it's the most like comm() in terms of the kind of requirements placed on input. -- Devin

I don't know if this clarifies anything, but I tried out such a mis-sorted example using the actual 'comm' tool 511-tmp % comm comm1 comm2 A B C E F J A K M N T U V 512-tmp % cat comm[12] A B C E F J A K T U V A B M N T U V On Wed, Jan 7, 2015 at 10:20 AM, David Mertz <mertz@gnosis.cx> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Jan 6, 2015, at 16:22, Paul Moore <p.f.moore@gmail.com> wrote:
Why? Current standard Windows installers include pip, and being pure Python you won't need a compiler, so what's wrong with requiring a PyPI distribution? (Of course that means you need to be able to count on a relatively recent Python 3.4+/2.7+, but it's hard to see how that's worse than something in a future version of the stdlib, which would mean you need to be able to count on 3.5+.) And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.

On 6 January 2015 at 17:39, Andrew Barnert <abarnert@yahoo.com> wrote:
Well, I usually write my "little utility scripts" as simple .py files to be run with the system Python. I tend to use them on multiple machines. So unless a dependency is one of the modules I routinely install (things like requests) the process goes run script, oops, needed that distribution, pip install dist, run it again. Not a big issue certainly (and hardly a showstopper) but annoying. And it does mean I'd need to make a PyPI project for my personal utility functions, which doesn't really seem an appropriate use for PyPI, tbh. Requiring Python 3.5+ isn't a big deal, I routinely put the newest version of Python on all my machines. The ones I can't tend to be "secure", meaning I have no access to PyPI either :-(
And I'll bet if you submit this as a pull request to more-itertools, it'll be accepted, meaning you don't even have to create or maintain a PyPI project.
Thanks for the suggestion, I might do that. Paul

On Jan 6, 2015, at 18:29, Paul Moore <p.f.moore@gmail.com> wrote:
I do the same thing; I just long ago got into the habit of including more-itertools as one of my routine installs alongside requests, etc. :) I wonder if there's a common-enough set of things that don't really belong in stdlib (often just because they update too frequently) but are often worth having for many people. If I could just "pip install extralib" on every machine, even at the cost of getting a few libs I don't actually need, it would be worth it. On the other hand, if the common set isn't common enough, so for many people extralib only had half of what they want and was 70% stuff they didn't care shouted it wouldn't be that useful...

On Tue, Jan 06, 2015 at 04:22:23PM +0000, Paul Moore wrote:
It seems awfully specialised to me, so I'm not sure that it belongs in itertools. But for what it's worth, here's my implementation which I think is easier to understand than yours, and covers the unfortunate case where you have unorderable items such as float NANs in your data. def comm(values1, values2): """Partition 2 sorted iterators into common, only in 1st, and only in 2nd. >>> list(comm([1,2,3], [2,3,4])) [('<', 1), ('=', 2), ('=', 3), ('>', 4)] >>> list(comm([0,1,2,3], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5)] >>> list(comm([0,1,2,3,6], [2,3,4,5])) [('<', 0), ('<', 1), ('=', 2), ('=', 3), ('>', 4), ('>', 5), ('<', 6)] >>> list(comm([0,1], [])) [('<', 0), ('<', 1)] >>> list(comm([], [0,1])) [('>', 0), ('>', 1)] >>> list(comm([0,1], [0,1])) [('=', 0), ('=', 1)] >>> list(comm([], [])) [] >>> list(comm([1,2,3,4], [2,3,float('nan'),4])) [('<', 1), ('=', 2), ('=', 3), ('?', (4, nan)), ('>', 4)] """ MISSING = object() it1 = iter(values1) it2 = iter(values2) a = next(it1, MISSING) b = next(it2, MISSING) while True: if a is MISSING: if b is MISSING: return yield ('>', b) for x in it2: yield ('>', x) return if b is MISSING: assert a is not MISSING yield ('<', a) for x in it1: yield ('<', x) return if a == b: yield ('=', a) # Or could use b. a = next(it1, MISSING) b = next(it2, MISSING) elif a < b: yield ('<', a) a = next(it1, MISSING) elif a > b: yield ('>', b) b = next(it2, MISSING) else: # Unorderable a and b, like float NANs. # Maybe we should raise instead? yield ('?', (a, b)) a = next(it1, MISSING) b = next(it2, MISSING) If you prefer this version, feel free to just use it. If you need a licence, consider it being under the MIT licence. (Over the next day or so, time permitting, I'll try to put this on ActiveState's Python recipes and make the licence more official.) -- Steve

On 01/06/2015 10:30 AM, Paul Moore wrote:
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators. -- ~Ethan~

On 6 January 2015 at 18:43, Ethan Furman <ethan@stoneleaf.us> wrote:
Yup, it sure does. Guess I should have paid more attention to that niggling feeling that I was missing something. ;) On the other hand, I figured even if I was wrong it wouldn't hurt to increase awareness about the change to generators.
I should probably point out that when writing my code, I was conscious of the generator change and took special case of catching all of the StopIterations that might leak out. It's probably good that I was made to think carefully about it, but all those try...except blocks sure look ugly. (And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value - it's just a generally annoying algorithm with all the code paths for corner cases). Paul

On 01/06/2015 10:47 AM, Paul Moore wrote:
Having looked more closely at your code, I don't see any thing that wouldn't have had to be there even without the generator change because your exhausting two iterators.
(And Steven's version with its sentinel value doesn't look much better, with all those comparisons against a magic value
I think I prefer his, although I would rename MISSING to FINISHED. ;) -- ~Ethan~

On 6 January 2015 at 18:00, Steven D'Aprano <steve@pearwood.info> wrote:
But for what it's worth, here's my implementation which I think is easier to understand than yours,
Possibly - it's a trade-off between using a sentinel vs explicitly catching the exceptions. I'm on the fence, personally.
and covers the unfortunate case where you have unorderable items such as float NANs in your data.
Technically I don't think this is possible, as the input iterators need to be sorted, and that property will be violated if there are unorderable items in there... Paul
participants (15)
-
Andrew Barnert
-
Antoine Pitrou
-
Cameron Simpson
-
Chris Angelico
-
David Mertz
-
Devin Jeanpierre
-
Eric V. Smith
-
Ethan Furman
-
Guido van Rossum
-
Ian Cordasco
-
Jim Baker
-
Paul Moore
-
Raymond Hettinger
-
Steven D'Aprano
-
Victor Stinner