Syncing up iterators with gaps
Terry Reedy
tjreedy at udel.edu
Wed Sep 28 15:43:22 EDT 2016
On 9/28/2016 3:10 PM, Tim Chase wrote:
> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key. I've simplified the data a bit here, but it would
> be something like
>
> data1 = [ # key, data1
> (1, "one A"),
> (1, "one B"),
> (2, "two"),
> (5, "five"),
> ]
>
> data2 = [ # key, data1
> (1, "uno"),
> (2, "dos"),
> (3, "tres x"),
> (3, "tres y"),
> (3, "tres z"),
> (4, "cuatro"),
> ]
>
> data3 = [ # key, data1, data2
> (2, "ii", "extra alpha"),
> (4, "iv", "extra beta"),
> (5, "v", "extra gamma"),
> ]
>
> And I'd like to do something like
>
> for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
> for row in d1:
> process_a(common_key, row)
> for thing in d2:
> process_b(common_key, row)
> for thing in d3:
> process_c(common_key, row)
>
> which would yield the common_key, along with enough of each of those
> iterators (note that gaps can happen, but the sortable order should
> remain the same). So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5], and each of
> [d1, d2, d3] being an iterator that deals with just that data.
You just need d1, d2, d3 to be iterables, such as a list. Write a magic
generator that opens the three files and reads one line of each (with
next()). Then in while True loop, find minimum key and make 3 lists (up
to 2 possibly empty) of the items in each file with that key. This will
require up to 3 inner loops. The read-ahead makes this slightly messy.
If any list is not empty, yield the key and 3 lists. Otherwise break
the outer loop.
> My original method was hauling everything into memory and making
> multiple passes filtering on the data. However, the actual sources
> are CSV-files, some of which are hundreds of megs in size, and my
> system was taking a bit of a hit. So I was hoping for a way to do
> this with each iterator making only one complete pass through each
> source (since they're sorted by common key).
>
> It's somewhat similar to the *nix "join" command, only dealing with
> N files.
It is also somewhat similar to a 3-way mergesort.
--
Terry Jan Reedy
More information about the Python-list
mailing list