Syncing up iterators with gaps
Peter Otten
__peter__ at web.de
Wed Sep 28 16:35:53 EDT 2016
Tim Chase wrote:
> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key. I've simplified the data a bit here, but it would
> be something like
>
> data1 = [ # key, data1
> (1, "one A"),
> (1, "one B"),
> (2, "two"),
> (5, "five"),
> ]
>
> data2 = [ # key, data1
> (1, "uno"),
> (2, "dos"),
> (3, "tres x"),
> (3, "tres y"),
> (3, "tres z"),
> (4, "cuatro"),
> ]
>
> data3 = [ # key, data1, data2
> (2, "ii", "extra alpha"),
> (4, "iv", "extra beta"),
> (5, "v", "extra gamma"),
> ]
>
> And I'd like to do something like
>
> for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
> for row in d1:
> process_a(common_key, row)
> for thing in d2:
> process_b(common_key, row)
> for thing in d3:
> process_c(common_key, row)
>
> which would yield the common_key, along with enough of each of those
> iterators (note that gaps can happen, but the sortable order should
> remain the same). So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5], and each of
> [d1, d2, d3] being an iterator that deals with just that data.
>
> My original method was hauling everything into memory and making
> multiple passes filtering on the data. However, the actual sources
> are CSV-files, some of which are hundreds of megs in size, and my
> system was taking a bit of a hit. So I was hoping for a way to do
> this with each iterator making only one complete pass through each
> source (since they're sorted by common key).
>
> It's somewhat similar to the *nix "join" command, only dealing with
> N files.
>
> Thanks for any hints.
>
> -tkc
A bit messy, might try replacing groups list with a dict:
$ cat merge.py
from itertools import groupby
from operator import itemgetter
first = itemgetter(0)
rest = itemgetter(slice(1, None))
def magic_happens_here(*columns):
grouped = [groupby(column, key=first) for column in columns]
missing = object()
def getnext(g):
nonlocal n
try:
k, g = next(g)
except StopIteration:
n -= 1
return (missing, None)
return k, g
n = len(grouped)
groups = [getnext(g) for g in grouped]
while n:
minkey = min(k for k, g in groups if k is not missing)
yield (minkey,) + tuple(
map(rest, g) if k == minkey else ()
for k, g in groups)
for i, (k, g) in enumerate(groups):
if k == minkey:
groups[i] = getnext(grouped[i])
if __name__ == "__main__":
data1 = [ # key, data1
(1, "one A"),
(1, "one B"),
(2, "two"),
(5, "five"),
]
data2 = [ # key, data1
(1, "uno"),
(2, "dos"),
(3, "tres x"),
(3, "tres y"),
(3, "tres z"),
(4, "cuatro"),
]
data3 = [ # key, data1, data2
(2, "ii", "extra alpha"),
(4, "iv", "extra beta"),
(5, "v", "extra gamma"),
]
for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
print(common_key)
for d in d1, d2, d3:
print(" ", list(d))
$ python3 merge.py
1
[('one A',), ('one B',)]
[('uno',)]
[]
2
[('two',)]
[('dos',)]
[('ii', 'extra alpha')]
3
[]
[('tres x',), ('tres y',), ('tres z',)]
[]
4
[]
[('cuatro',)]
[('iv', 'extra beta')]
5
[('five',)]
[]
[('v', 'extra gamma')]
More information about the Python-list
mailing list