Syncing up iterators with gaps
Tim Chase
python.list at tim.thechases.com
Wed Sep 28 21:38:50 EDT 2016
On 2016-09-29 10:20, Steve D'Aprano wrote:
> On Thu, 29 Sep 2016 05:10 am, Tim Chase wrote:
> > data1 = [ # key, data1
> >     (1, "one A"),
> >     (1, "one B"),
> >     (2, "two"),
> >     (5, "five"),
> >     ]
>
> So data1 has keys 1, 1, 2, 5.
> Likewise data2 has keys 1, 2, 3, 3, 3, 4 and data3 has keys 2, 4, 5.
Correct
> (data3 also has *two* values, not one, which is an additional
> complication.)
As commented towards the end, the source is a set of CSV files, so
each row is a list where a particular (identifiable) item is the key.
Assume that one can use something like get_key(row) to return the key,
which in the above could be implemented as
get_key = lambda row: row[0]
and for my csv.DictReader data, would be something like
get_key = lambda row: row["Account Number"]
> > And I'd like to do something like
> >
> > for common_key, d1, d2, d3 in magic_happens_here(data1, data2,
> > data3):
>
> What's common_key? In particular, given that data1, data2 and data3
> have the first key each of 1, 1 and 2 respectively, how do you get:
>
> > So in the above data, the outer FOR loop would
> > happen 5 times with common_key being [1, 2, 3, 4, 5]
>
> I'm confused. Is common_key a *constant* [1, 2, 3, 4, 5] or are you
> saying that it iterates over 1, 2, 3, 4, 5?
Your interpretation later is correct: it iterates over each unique
key once, in order. So if you do
data1.append((17, "seventeen"))
the outer loop would iterate over [1,2,3,4,5,17]
(so not constant, to hopefully answer that part of your question)
The actual keys are account-numbers, so they're ascii-sorted strings
of the form "1234567-8901", ascending in order through the files.
But for equality/less-than/greater-than comparisons, they work
effectively as integers in my example.
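For instance (made-up account numbers, but the same fixed-width
shape), lexicographic string comparison agrees with numeric order:

```python
# Hypothetical account numbers in the fixed "1234567-8901" shape.
a, b = "1234567-8901", "1234568-0001"
assert a < b  # plain string comparison
# Stripping the dash and comparing as integers gives the same answer,
# since the strings are all the same width.
assert int(a.replace("-", "")) < int(b.replace("-", ""))
```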
> If the later, it sounds like you want something like a cross between
> itertools.groupby and the "merge" stage of mergesort.
That's a pretty good description at some level. I looked into
groupby() but was having trouble getting it to do what I wanted.
> Note that I have modified data3 so instead of three columns, (key
> value value), it has two (key value) and value is a 2-tuple.
I'm cool with that. Since they're CSV rows, you can imagine the
source data then as a generator something like
data1 = ( (get_key(row), row) for row in my_csv_iter1 )
to get the data to look like your example input data.
> So first you want an iterator that does an N-way merge:
>
> merged = [(1, "one A"), (1, "one B"), (1, "uno"),
> (2, "two"), (2, "dos"), (2, ("ii", "extra alpha")),
> (3, "tres x"), (3, "tres y"), (3, "tres z"),
> (4, "cuatro"), (4, ("iv", "extra beta")),
> (5, "five"), (5, ("v", "extra gamma")),
> ]
This seems to discard the data's origin (data1/data2/data3) which is
how I determine whether to use process_a(), process_b(), or
process_c() in my original example where N iterators were returned,
one for each input iterator. So the desired output would be akin to
(converting everything to tuples as you suggest below)
[
 (1, [("one A",), ("one B",)], [("uno",)], []),
 (2, [("two",)], [("dos",)], [("ii", "extra alpha")]),
 (3, [], [("tres x",), ("tres y",), ("tres z",)], []),
 (4, [], [("cuatro",)], [("iv", "extra beta")]),
 (5, [("five",)], [], [("v", "extra gamma")]),
]
only instead of N list()s, having N generators that are smart enough
to yield the corresponding data.
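A rough sketch of what I'm after (using heapq.merge plus
itertools.groupby as you suggested; buckets are lists here rather
than the lazy per-source generators I'd ultimately want, and
magic_happens_here is just my placeholder name):

```python
import heapq
import itertools

def magic_happens_here(*sources):
    # Each source yields (key, value) pairs already sorted by key.
    # Yields (key, bucket_0, ..., bucket_N-1) where bucket_i holds the
    # values from source i carrying that key.
    def tagged(i, src):
        # Tag each pair with its source index so origin survives the merge.
        for key, value in src:
            yield (key, i, value)

    merged = heapq.merge(*(tagged(i, src) for i, src in enumerate(sources)),
                         key=lambda t: t[:2])  # sort by (key, source index)
    for key, group in itertools.groupby(merged, key=lambda t: t[0]):
        buckets = [[] for _ in sources]
        for _, i, value in group:
            buckets[i].append(value)
        yield (key, *buckets)
```

Each unique key comes out exactly once, in order, with an empty
bucket for any source that lacks that key.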
> You might find it easier to have *all* the iterators yield (key,
> tuple) pairs, where data1 and data2 yield a 1-tuple and data3
> yields a 2-tuple.
Right. Sorry my example obscured that shoulda-obviously-been-used
simplification.
-tkc