
On Tue, Jul 03, 2018 at 09:23:07AM -0400, David Mertz wrote:
But before putting it on auto-archive, the BDFL said (1) NO GO on getting a new builtin; (2) NO OBJECTION to putting it in itertools.
My problem with the second idea is that *I* find it very wrong to have something in itertools that does not return an iterator. It wrecks the combinatorial algebra of the module.
That seems like a reasonable objection to me.
That said, it's easy to fix... and I believe independently useful. Just make grouping() a generator function rather than a plain function. This lets us get an incremental grouping of an iterable.
We already have something which lazily groups an iterable, returning groups as they are seen: groupby.
What makes grouping() different from groupby() is that it accumulates ALL of the subgroups rather than just consecutive subgroupings. To make it clear with a simulated example (ignoring the keys for brevity):
groupby("aaAAbbCaAB", key=str.upper) => groups "aaAA", "bb", "C", "aA", "B"
grouping("aaAAbbCaAB", key=str.upper) => groups "aaAAaA", "bbB", "C"
So grouping() cannot even begin returning values until it has processed the entire data set. In that regard, it is like sorted() -- it cannot be lazy, it is a fundamentally eager operation.
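The difference can be checked with a short sketch. The `grouped()` helper below is hypothetical (it is not an existing builtin or itertools function), and it assumes the result is a plain dict mapping each key to a list of items:

```python
from collections import defaultdict
from itertools import groupby


def grouped(iterable, key=None):
    """Hypothetical eager grouping: collect every item under its key."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
    # Nothing can be returned until the whole iterable has been consumed.
    return dict(groups)


data = "aaAAbbCaAB"

# groupby() only merges consecutive runs that share a key.
consecutive = ["".join(g) for _, g in groupby(data, key=str.upper)]

# grouped() accumulates all items for a key, wherever they appear.
accumulated = {k: "".join(v) for k, v in grouped(data, key=str.upper).items()}

print(consecutive)   # ['aaAA', 'bb', 'C', 'aA', 'B']
print(accumulated)   # {'A': 'aaAAaA', 'B': 'bbB', 'C': 'C'}
```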
I propose that a better name which indicates the non-lazy nature of this function is *grouped* rather than grouping, like sorted().
As for where it belongs, perhaps the collections module is the least worst fit.
This can be useful if the iterable is slow or infinite, but the partial groupings are useful in themselves.
Under what circumstances would the partial groupings be useful? Given the example above:
grouping("aaAAbbCaAB", key=str.upper)
when would you want to see the accumulated partial groups?
    # again, ignoring the keys for brevity
    "aaAA"
    "aaAA", "bb"
    "aaAA", "bb", "C"
    "aaAAaA", "bb", "C"
    "aaAAaA", "bbB", "C"
I don't see any practical use for this -- if you start processing the partial groupings immediately, you end up double-processing some of the items; if you wait until the last, what's the point of the intermediate values?
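To put a rough number on the double-processing, here is a sketch of what a generator version might look like. The name `grouping_incremental` is made up for illustration, and it assumes one snapshot of all groups is yielded after every item, which is only one possible reading of the proposal:

```python
from collections import defaultdict


def grouping_incremental(iterable, key=None):
    """Hypothetical generator: yield a snapshot of all groups after each item."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
        # Copy the lists so earlier snapshots are not mutated later.
        yield {name: list(members) for name, members in groups.items()}


data = "aaAAbbCaAB"
snapshots = list(grouping_incremental(data, key=str.upper))

# A consumer that processes every snapshot in full touches each item
# once for every snapshot it appears in.
items_handed_out = sum(len(m) for snap in snapshots for m in snap.values())

print(len(data), items_handed_out)   # 10 55
```

Ten input items turn into 55 item visits for a consumer that handles each snapshot naively, and only the final snapshot holds the complete groups.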
As you say yourself:
This isn't so useful for the concrete sequence, but for this it would be great:
    for grouped in grouping(data_over_wire()):
        process_partial_groups(grouped)
And that demonstrates exactly why this would be a terrible bug magnet, suckering people into doing what you just did, and ending up processing values more than once.
To avoid that, your process_partial_groups would need to remember which values it has seen before for each key it has seen before.
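As a sketch of the bookkeeping that implies: the signature below is an assumption (the original call takes only the groups), with a `seen_counts` dict and a `handle` callback added purely for illustration. It also assumes each group is a list that only ever grows at the end.

```python
def process_partial_groups(groups, seen_counts, handle):
    """Hypothetical consumer: pass only not-yet-seen items to handle()."""
    for k, members in groups.items():
        start = seen_counts.get(k, 0)      # how many items of this key were handled
        for item in members[start:]:
            handle(k, item)
        seen_counts[k] = len(members)      # remember progress for the next snapshot


# The partial groupings listed earlier, written out with their keys.
snapshots = [
    {"A": list("aaAA")},
    {"A": list("aaAA"), "B": list("bb")},
    {"A": list("aaAA"), "B": list("bb"), "C": ["C"]},
    {"A": list("aaAAaA"), "B": list("bb"), "C": ["C"]},
    {"A": list("aaAAaA"), "B": list("bbB"), "C": ["C"]},
]

processed = []
seen = {}
for snap in snapshots:
    process_partial_groups(snap, seen, lambda k, item: processed.append(item))

print("".join(processed))   # aaAAbbCaAB
```

With that state carried around, every item is handled exactly once, but the consumer has effectively rebuilt a plain loop over the input, which is the point of the objection.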