[Python-ideas] Possible new itertool: comm()

Cameron Simpson cs at zip.com.au
Tue Jan 6 22:09:59 CET 2015


On 06Jan2015 19:36, Antoine Pitrou <solipsis at pitrou.net> wrote:
>On Tue, 6 Jan 2015 18:22:44 +0000
>Paul Moore <p.f.moore at gmail.com> wrote:
>> On 6 January 2015 at 17:14, Raymond Hettinger
>> <raymond.hettinger at gmail.com> wrote:
>> >> On Jan 6, 2015, at 8:22 AM, Paul Moore <p.f.moore at gmail.com> wrote:
>> >>
>> >> In writing a utility script today, I found myself needing to do
>> >> something similar to what the Unix "comm" utility does - take two
>> >> sorted iterators, and partition the values into "only in the first",
>> >> "only in the second", and "in both" groups.
>> >
>> > As far as I can tell, this would be a very rare need.
>>
>> It's come up for me a few times, usually when trying to check two
>> lists of files to see which ones have been missed by a program, and
>> which ones the program thinks are present but no longer exist.
>
>Why don't you use sets for such things? Your iterator is really only
>useful for huge or unhashable inputs.

In my use case (an existing tool):

1) I'm merging log files of arbitrary size; I am _not_ going to suck them into
memory. A comm()-like function has a tiny and fixed memory footprint, versus an
unbounded one.

2) I want ordered output, and my inputs are already ordered; why on earth would 
I impose a pointless sorting cost on my (currently linear) runtime?
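
To illustrate both points, here is a minimal sketch of a comm()-like
generator. The name comm and the (tag, value) output shape are assumptions
made for this example, not an existing itertools API. It holds one pending
item per input and makes a single pass, so memory is constant and runtime is
linear in the combined input length.

```python
def comm(left, right):
    """Yield (tag, value) pairs from two sorted iterables.

    tag is 1 (only in left), 2 (only in right) or 3 (in both).
    Hypothetical helper: the name and output shape are illustrative.
    """
    _end = object()  # sentinel marking an exhausted iterator
    it1, it2 = iter(left), iter(right)
    a, b = next(it1, _end), next(it2, _end)
    # Advance whichever side holds the smaller item; advance both on a match.
    while a is not _end and b is not _end:
        if a < b:
            yield 1, a
            a = next(it1, _end)
        elif b < a:
            yield 2, b
            b = next(it2, _end)
        else:
            yield 3, a
            a, b = next(it1, _end), next(it2, _end)
    # At most one side still has items; drain it.
    while a is not _end:
        yield 1, a
        a = next(it1, _end)
    while b is not _end:
        yield 2, b
        b = next(it2, _end)
```

Because it is a generator, results come out progressively and in sorted
order, e.g. `for tag, line in comm(open(f1), open(f2)): ...` for two sorted
files (f1 and f2 being placeholder paths).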

Sets are the "obvious" Python way to do this, because comm() is more or less a 
set intersection operation and sets are right there in Python. But for 
unbounded sorted inputs and progressive output, they are a _bad_ choice.
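
For contrast, a rough sketch of the set-based approach follows. The function
name comm_with_sets is made up for this example. It has to read both inputs
completely before producing anything, keeps every item in memory, and needs
an explicit sort to restore the ordering the inputs already had.

```python
def comm_with_sets(left, right):
    """Set-based partition into (only_left, only_right, both).

    Hypothetical helper for comparison only. Requires hashable items,
    consumes both inputs fully, and re-sorts each group.
    """
    s1, s2 = set(left), set(right)  # memory grows with input size
    return sorted(s1 - s2), sorted(s2 - s1), sorted(s1 & s2)
```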

Cheers,
Cameron Simpson <cs at zip.com.au>

Yesterday, I was running a CNC plasma cutter that's controlled by Windows XP.
This is a machine that moves around a plasma torch that cuts thick steel
plate.  A "New Java update is available" window popped up while I was
working.  Not good. - John Nagle

