Syncing up iterators with gaps
Tim Chase
python.list at tim.thechases.com
Wed Sep 28 21:38:50 EDT 2016
On 2016-09-29 10:20, Steve D'Aprano wrote:
> On Thu, 29 Sep 2016 05:10 am, Tim Chase wrote:
> > data1 = [ # key, data1
> >     (1, "one A"),
> >     (1, "one B"),
> >     (2, "two"),
> >     (5, "five"),
> >     ]
>
> So data1 has keys 1, 1, 2, 5.
> Likewise data2 has keys 1, 2, 3, 3, 3, 4 and data3 has keys 2, 4, 5.
Correct
> (data3 also has *two* values, not one, which is an additional
> complication.)
As commented towards the end, the source is a set of CSV files, so
each row is a list where a particular (identifiable) item is the key.
Assume that one can use something like get_key(row) to return the key,
which in the above could be implemented as
get_key = lambda row: row[0]
and for my csv.DictReader data, would be something like
get_key = lambda row: row["Account Number"]
> > And I'd like to do something like
> >
> > for common_key, d1, d2, d3 in magic_happens_here(data1, data2,
> > data3):
>
> What's common_key? In particular, given that data1, data2 and data3
> have the first key each of 1, 1 and 2 respectively, how do you get:
>
> > So in the above data, the outer FOR loop would
> > happen 5 times with common_key being [1, 2, 3, 4, 5]
>
> I'm confused. Is common_key a *constant* [1, 2, 3, 4, 5] or are you
> saying that it iterates over 1, 2, 3, 4, 5?
Your interpretation later is correct: it iterates over each unique
key once, in order. So if you do
data1.append((17, "seventeen"))
the outer loop would iterate over [1,2,3,4,5,17]
(so not constant, to hopefully answer that part of your question)
The actual keys are account-numbers, so they're ascii-sorted strings
of the form "1234567-8901", ascending in order through the files.
But for equality/less-than/greater-than comparisons, they work
effectively as integers in my example.
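For instance (made-up account numbers, but the same fixed-width
shape), lexicographic string comparison agrees with numeric order:

```python
# Hypothetical account numbers in the fixed "1234567-8901" shape.
a, b = "1234567-8901", "1234568-0001"
assert a < b  # plain string comparison
# Stripping the dash and comparing as integers gives the same answer,
# since the strings are all the same width.
assert int(a.replace("-", "")) < int(b.replace("-", ""))
```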
> If the later, it sounds like you want something like a cross between
> itertools.groupby and the "merge" stage of mergesort.
That's a pretty good description at some level. I looked into
groupby() but was having trouble getting it to do what I wanted.
> Note that I have modified data3 so instead of three columns, (key
> value value), it has two (key value) and value is a 2-tuple.
I'm cool with that. Since they're CSV rows, you can imagine the
source data then as a generator something like
data1 = ( (get_key(row), row) for row in my_csv_iter1 )
to get the data to look like your example input data.
> So first you want an iterator that does an N-way merge:
>
> merged = [(1, "one A"), (1, "one B"), (1, "uno"),
> (2, "two"), (2, "dos"), (2, ("ii", "extra alpha")),
> (3, "tres x"), (3, "tres y"), (3, "tres z"),
> (4, "cuatro"), (4, ("iv", "extra beta")),
> (5, "five"), (5, ("v", "extra gamma")),
> ]
This seems to discard the data's origin (data1/data2/data3) which is
how I determine whether to use process_a(), process_b(), or
process_c() in my original example where N iterators were returned,
one for each input iterator. So the desired output would be akin to
(converting everything to tuples as you suggest below)
[
 (1, [("one A",), ("one B",)], [("uno",)], []),
 (2, [("two",)], [("dos",)], [("ii", "extra alpha")]),
 (3, [], [("tres x",), ("tres y",), ("tres z",)], []),
 (4, [], [("cuatro",)], [("iv", "extra beta")]),
 (5, [("five",)], [], [("v", "extra gamma")]),
]
only instead of N list()s, having N generators that are smart enough
to yield the corresponding data.
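A rough sketch of what I'm after (using heapq.merge plus
itertools.groupby as you suggested; buckets are lists here rather
than the lazy per-source generators I'd ultimately want, and
magic_happens_here is just my placeholder name):

```python
import heapq
import itertools

def magic_happens_here(*sources):
    # Each source yields (key, value) pairs already sorted by key.
    # Yields (key, bucket_0, ..., bucket_N-1) where bucket_i holds the
    # values from source i carrying that key.
    def tagged(i, src):
        # Tag each pair with its source index so origin survives the merge.
        for key, value in src:
            yield (key, i, value)

    merged = heapq.merge(*(tagged(i, src) for i, src in enumerate(sources)),
                         key=lambda t: t[:2])  # sort by (key, source index)
    for key, group in itertools.groupby(merged, key=lambda t: t[0]):
        buckets = [[] for _ in sources]
        for _, i, value in group:
            buckets[i].append(value)
        yield (key, *buckets)
```

Each unique key comes out exactly once, in order, with an empty
bucket for any source that lacks that key.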
> You might find it easier to have *all* the iterators yield (key,
> tuple) pairs, where data1 and data2 yield a 1-tuple and data3
> yields a 2-tuple.
Right. Sorry my example obscured that shoulda-obviously-been-used
simplification.
-tkc