Syncing up iterators with gaps

Wed Sep 28 16:35:53 EDT 2016

Tim Chase wrote:

> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key.  I've simplified the data a bit here, but it would
> be something like
> 
>   data1 = [ # key, data1
>     (1, "one A"),
>     (1, "one B"),
>     (2, "two"),
>     (5, "five"),
>     ]
> 
>   data2 = [ # key, data1
>     (1, "uno"),
>     (2, "dos"),
>     (3, "tres x"),
>     (3, "tres y"),
>     (3, "tres z"),
>     (4, "cuatro"),
>     ]
> 
>   data3 = [ # key, data1, data2
>     (2, "ii", "extra alpha"),
>     (4, "iv", "extra beta"),
>     (5, "v", "extra gamma"),
>     ]
> 
> And I'd like to do something like
> 
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
>     for row in d1:
>       process_a(common_key, row)
>     for thing in d2:
>       process_b(common_key, row)
>     for thing in d3:
>       process_c(common_key, row)
> 
> which would yield the common_key, along with enough of each of those
> iterators (note that gaps can happen, but the sortable order should
> remain the same).  So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5], and each of
> [d1, d2, d3] being an iterator that deals with just that data.
> 
> My original method was hauling everything into memory and making
> multiple passes filtering on the data. However, the actual sources
> are CSV-files, some of which are hundreds of megs in size, and my
> system was taking a bit of a hit.  So I was hoping for a way to do
> this with each iterator making only one complete pass through each
> source (since they're sorted by common key).
> 
> It's somewhat similar to the *nix "join" command, only dealing with
> N files.
> 
> Thanks for any hints.
> 
> -tkc

A bit messy, might try replacing groups list with a dict:

$ cat merge.py      
from itertools import groupby
from operator import itemgetter

first = itemgetter(0)
rest = itemgetter(slice(1, None))

def magic_happens_here(*columns):
    grouped = [groupby(column, key=first) for column in columns]
    missing = object()

    def getnext(g):
        nonlocal n
        try:
            k, g = next(g)
        except StopIteration:
            n -= 1
            return (missing, None)
        return k, g

    n = len(grouped)
    groups = [getnext(g) for g in grouped]
    while n:
        minkey = min(k for k, g in groups if k is not missing)
        yield (minkey,) + tuple(
            map(rest, g) if k == minkey else ()
            for k, g in groups)
        for i, (k, g) in enumerate(groups):
            if k == minkey:
                groups[i] = getnext(grouped[i])

if __name__ == "__main__":
    data1 = [  # key, data1
        (1, "one A"),
        (1, "one B"),
        (2, "two"),
        (5, "five"),
    ]

    data2 = [  # key, data1
        (1, "uno"),
        (2, "dos"),
        (3, "tres x"),
        (3, "tres y"),
        (3, "tres z"),
        (4, "cuatro"),
    ]

    data3 = [  # key, data1, data2
        (2, "ii", "extra alpha"),
        (4, "iv", "extra beta"),
        (5, "v", "extra gamma"),
    ]

    for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
        print(common_key)
        for d in d1, d2, d3:
            print("    ", list(d))
$ python3 merge.py 
1
     [('one A',), ('one B',)]
     [('uno',)]
     []
2
     [('two',)]
     [('dos',)]
     [('ii', 'extra alpha')]
3
     []
     [('tres x',), ('tres y',), ('tres z',)]
     []
4
     []
     [('cuatro',)]
     [('iv', 'extra beta')]
5
     [('five',)]
     []
     [('v', 'extra gamma')]