Syncing up iterators with gaps

Terry Reedy tjreedy at udel.edu
Wed Sep 28 15:43:22 EDT 2016


On 9/28/2016 3:10 PM, Tim Chase wrote:
> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key.  I've simplified the data a bit here, but it would
> be something like
>
>   data1 = [ # key, data1
>     (1, "one A"),
>     (1, "one B"),
>     (2, "two"),
>     (5, "five"),
>     ]
>
>   data2 = [ # key, data1
>     (1, "uno"),
>     (2, "dos"),
>     (3, "tres x"),
>     (3, "tres y"),
>     (3, "tres z"),
>     (4, "cuatro"),
>     ]
>
>   data3 = [ # key, data1, data2
>     (2, "ii", "extra alpha"),
>     (4, "iv", "extra beta"),
>     (5, "v", "extra gamma"),
>     ]
>
> And I'd like to do something like
>
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
>     for row in d1:
>       process_a(common_key, row)
>     for thing in d2:
>       process_b(common_key, thing)
>     for thing in d3:
>       process_c(common_key, thing)
>
> which would yield the common_key, along with enough of each of those
> iterators (note that gaps can happen, but the sortable order should
> remain the same).  So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5], and each of
> [d1, d2, d3] being an iterator that deals with just that data.

You just need d1, d2, d3 to be iterables, such as lists.  Write a magic 
generator that opens the three files and reads one line of each (with 
next()).  Then, in a while True loop, find the minimum key and make 3 
lists (up to 2 of them possibly empty) of the items in each file with 
that key.  This will require up to 3 inner loops.  The read-ahead makes 
this slightly messy.  If any list is not empty, yield the key and the 3 
lists; otherwise break the outer loop.
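
Something along these lines (a rough, untested sketch; it assumes each 
source has already been set up as an iterable of rows sorted by key, 
with the key in position 0, such as a csv.reader whose key column has 
been converted; the file-opening part is left out):

  def magic_happens_here(*sources):
      iters = [iter(src) for src in sources]
      # Read-ahead buffer: the next unconsumed row from each source,
      # or None once that source is exhausted.
      heads = [next(it, None) for it in iters]
      while True:
          keys = [row[0] for row in heads if row is not None]
          if not keys:        # every source exhausted
              break
          key = min(keys)
          groups = []
          for i, it in enumerate(iters):
              group = []
              # Inner loop: consume rows from source i that match key.
              while heads[i] is not None and heads[i][0] == key:
                  group.append(heads[i])
                  heads[i] = next(it, None)
              groups.append(group)
          yield (key, *groups)

With your sample data, the first yield would be 
(1, [(1, 'one A'), (1, 'one B')], [(1, 'uno')], []); the third list is 
empty because key 1 never appears in data3.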

> My original method was hauling everything into memory and making
> multiple passes filtering on the data. However, the actual sources
> are CSV-files, some of which are hundreds of megs in size, and my
> system was taking a bit of a hit.  So I was hoping for a way to do
> this with each iterator making only one complete pass through each
> source (since they're sorted by common key).
>
> It's somewhat similar to the *nix "join" command, only dealing with
> N files.

It is also somewhat similar to the merge step of a 3-way mergesort.
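
If you do not mind leaning on the stdlib for that merge instead of the 
hand-rolled read-ahead loop, heapq.merge (its key parameter needs 
3.5+) plus itertools.groupby can do the k-way merge for you.  Another 
untested sketch, with made-up names, that tags each row with its 
source index so the groups can be split back out per source:

  import heapq
  import itertools
  import operator

  def tagged(src, i):
      # Decorate each row as (key, source_index, row).
      for row in src:
          yield (row[0], i, row)

  def merge_by_key(*sources):
      streams = [tagged(src, i) for i, src in enumerate(sources)]
      merged = heapq.merge(*streams, key=operator.itemgetter(0))
      for key, group in itertools.groupby(merged,
                                          key=operator.itemgetter(0)):
          buckets = [[] for _ in sources]
          for _key, i, row in group:
              buckets[i].append(row)
          yield (key, *buckets)

This should yield the same (key, list1, ..., listN) tuples as the 
generator above.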

-- 
Terry Jan Reedy



