Re: [Python-ideas] Where should grouping() live (was: grouping / dict of lists)

3 Jul 2018

On Tue, Jul 03, 2018 at 09:23:07AM -0400, David Mertz wrote:
...
> But before putting it on auto-archive, the BDFL said (1) NO GO on getting a
> new builtin; (2) NO OBJECTION to putting it in itertools.
> My problem with the second idea is that *I* find it very wrong to have
> something in itertools that does not return an iterator.  It wrecks the
> combinatorial algebra of the module.

That seems like a reasonable objection to me.
...
> That said, it's easy to fix... and I believe independently useful.  Just
> make grouping() a generator function rather than a plain function.  This
> lets us get an incremental grouping of an iterable.

We already have something which lazily groups an iterable, returning 
groups as they are seen: groupby.

What makes grouping() different from groupby() is that it accumulates 
ALL of the subgroups rather than just consecutive subgroupings. To make 
it clear with a simulated example (ignoring the keys for brevity):

groupby("aaAAbbCaAB", key=str.upper)
=> groups "aaAA", "bb", "C", "aA", "B"

grouping("aaAAbbCaAB", key=str.upper)
=> groups "aaAAaA", "bbB", "C"

So grouping() cannot even begin returning values until it has processed 
the entire data set. In that regard, it is like sorted() -- it cannot be 
lazy, it is a fundamentally eager operation.
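
To make the contrast concrete, here is a minimal sketch. The helper 
grouped() and its signature are assumptions made up for illustration 
(nothing in this thread fixes an API); only itertools.groupby and 
collections.defaultdict are real.

```python
from collections import defaultdict
from itertools import groupby


def grouped(iterable, key=None):
    """Hypothetical eager grouping: map each key to a list of ALL its items."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
    # Only now, after consuming the whole iterable, is any group complete.
    return dict(groups)


data = "aaAAbbCaAB"

# groupby() is lazy: it only merges *consecutive* items with equal keys.
consecutive = ["".join(g) for _, g in groupby(data, key=str.upper)]

# grouped() is eager: it collects every item for a key, wherever it occurs.
everything = {k: "".join(v) for k, v in grouped(data, key=str.upper).items()}
```

Here consecutive holds the five runs shown above, while everything holds 
one accumulated string per key.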

I propose *grouped* rather than grouping as a better name: like 
sorted(), it signals the non-lazy nature of this function.

As for where it belongs, perhaps the collections module is the least 
worst fit.
...
> This can be useful if
> the iterable is slow or infinite, but the partial groupings are useful in
> themselves.

Under what circumstances would the partial groupings be useful? Given 
the example above:

grouping("aaAAbbCaAB", key=str.upper)

when would you want to see the accumulated partial groups?

# again, ignoring the keys for brevity
"aaAA"
"aaAA", "bb"
"aaAA", "bb", "C"
"aaAAaA", "bb", "C"
"aaAAaA", "bbB", "C"

I don't see any practical use for this -- if you start processing the 
partial groupings immediately, you end up double-processing some 
of the items; if you wait until the last, what's the point of the 
intermediate values?

As you say yourself:
...
> This isn't so useful for the concrete sequence, but for this it would be
> great:
>
> for grouped in grouping(data_over_wire()):
>     process_partial_groups(grouped)

And that demonstrates exactly why this would be a terrible bug magnet, 
suckering people into doing what you just did, and ending up processing 
values more than once.

To avoid that, your process_partial_groups would need to remember which 
values it has seen before for each key it has seen before.
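
A rough sketch of the problem, under one assumption: that the generator 
version yields the accumulated groups after every item (the proposal 
doesn't pin this down). The names incremental_grouping, naive_work and 
deduplicated are hypothetical, made up for this illustration.

```python
from collections import defaultdict


def incremental_grouping(iterable, key=None):
    """Hypothetical generator: yield the groups accumulated so far."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
        yield groups  # the same, growing mapping each time


def naive_work(data, key=None):
    """Count item visits if every partial grouping is processed in full."""
    visits = 0
    for groups in incremental_grouping(data, key):
        visits += sum(len(items) for items in groups.values())
    return visits


def deduplicated(data, key=None):
    """Process each item once by remembering how far we got per key."""
    seen = defaultdict(int)  # key -> number of items already handled
    handled = []
    for groups in incremental_grouping(data, key):
        for k, items in groups.items():
            handled.extend(items[seen[k]:])  # only the new tail
            seen[k] = len(items)
    return handled


visits = naive_work("aaAAbbCaAB", key=str.upper)
once = "".join(deduplicated("aaAAbbCaAB", key=str.upper))
```

For the ten-item example the naive consumer visits items 55 times 
(1 + 2 + ... + 10), and the bookkeeping version only gets back to ten by 
carrying exactly the per-key state described above.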

-- 
Steve

Steven D'Aprano