[Python-ideas] Where should grouping() live (was: grouping / dict of lists)

Tue Jul 3 11:24:14 EDT 2018

On Tue, Jul 03, 2018 at 09:23:07AM -0400, David Mertz wrote:

> But before putting it on auto-archive, the BDFL said (1) NO GO on getting a
> new builtin; (2) NO OBJECTION to putting it in itertools.
> 
> My problem with the second idea is that *I* find it very wrong to have
> something in itertools that does not return an iterator.  It wrecks the
> combinatorial algebra of the module.

That seems like a reasonable objection to me.

> That said, it's easy to fix... and I believe independently useful.  Just
> make grouping() a generator function rather than a plain function.  This
> lets us get an incremental grouping of an iterable.

We already have something which lazily groups an iterable, returning 
groups as they are seen: groupby.

What makes grouping() different from groupby() is that it accumulates 
ALL of the subgroups rather than just consecutive subgroupings. To make 
it clear with a simulated example (ignoring the keys for brevity):

groupby("aaAAbbCaAB", key=str.upper)
=> groups "aaAA", "bb", "C", "aA", "B"

grouping("aaAAbbCaAB", key=str.upper)
=> groups "aaAAaA", "bbB", "C"

So grouping() cannot even begin returning values until it has processed 
the entire data set. In that regard, it is like sorted() -- it cannot be 
lazy, it is a fundamentally eager operation.
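
To make the contrast concrete, here is a minimal sketch. The grouping() below is only an assumed, hypothetical implementation (a plain function returning a dict of lists); the proposal under discussion may differ in signature and return type. Only itertools.groupby is the real stdlib function.

```python
from collections import defaultdict
from itertools import groupby


def grouping(iterable, key=None):
    """Hypothetical eager grouping: collect ALL items under each key."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
    # Nothing can be returned until the whole iterable is consumed.
    return dict(groups)


data = "aaAAbbCaAB"

# groupby() only merges consecutive items that share a key.
consecutive = ["".join(g) for _, g in groupby(data, key=str.upper)]

# grouping() merges every item that shares a key, wherever it appears.
accumulated = {k: "".join(v) for k, v in grouping(data, key=str.upper).items()}

print(consecutive)   # ['aaAA', 'bb', 'C', 'aA', 'B']
print(accumulated)   # {'A': 'aaAAaA', 'B': 'bbB', 'C': 'C'}
```

Note that the last "B" lands in the same group as the earlier "bb", which is why no group is known to be complete before the input ends.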

I propose *grouped* rather than grouping as a better name: like
sorted(), it indicates the non-lazy nature of this function.

As for where it belongs, perhaps the collections module is the least 
worst fit.

> This can be useful if
> the iterable is slow or infinite, but the partial groupings are useful in
> themselves.

Under what circumstances would the partial groupings be useful? Given 
the example above:

grouping("aaAAbbCaAB", key=str.upper)

when would you want to see the accumulated partial groups?

# again, ignoring the keys for brevity
"aaAA"
"aaAA", "bb"
"aaAA", "bb", "C"
"aaAAaA", "bb", "C"
"aaAAaA", "bbB", "C"

I don't see any practical use for this -- if you start processing the
partial groupings immediately, you end up double-processing some
of the items; if you wait until the last one, what's the point of the
intermediate values?
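
A rough sketch of the double-processing problem, assuming a hypothetical generator version (called grouping_incremental here, not a real or proposed name) that yields the accumulated groups after every item. The snapshots listed above are per run of equal keys; per-item yielding is just the simplest variant to write down.

```python
from collections import defaultdict


def grouping_incremental(iterable, key=None):
    """Hypothetical generator: yield the accumulated groups after each item."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
        # The same dict object is yielded each time, mutated in place.
        yield groups


data = "aaAAbbCaAB"
item_count = len(data)

# A naive consumer that handles everything in each partial grouping.
total_processed = 0
for partial in grouping_incremental(data, key=str.upper):
    total_processed += sum(len(values) for values in partial.values())

print(item_count)       # 10 items in the input
print(total_processed)  # 55 items handled: 1 + 2 + ... + 10
```

Ten input items turn into 55 processing steps, because every earlier item is handled again on each later yield.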

As you say yourself:

> This isn't so useful for the concrete sequence, but for this it would be
> great:
> 
> for grouped in grouping(data_over_wire()):
>     process_partial_groups(grouped)

And that demonstrates exactly why this would be a terrible bug magnet,
suckering people into doing what you just did, and ending up processing
values more than once.

To avoid that, your process_partial_groups would need to remember, for
each key, which values it has already seen.
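
As a sketch of how much bookkeeping that takes: the consumer below keeps a per-key count of values it has already handled. Both grouping_incremental and process_new_items are hypothetical names for illustration, and the generator is an assumed per-item variant of the proposed grouping().

```python
from collections import defaultdict


def grouping_incremental(iterable, key=None):
    """Hypothetical generator: yield the accumulated groups after each item."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
        yield groups


def process_new_items(partials):
    """Handle each value once by remembering how far we got per key."""
    consumed = defaultdict(int)  # key -> number of values already handled
    handled = []
    for partial in partials:
        for k, values in partial.items():
            # Only the tail of each list is new since the previous yield.
            handled.extend(values[consumed[k]:])
            consumed[k] = len(values)
    return handled


handled = process_new_items(grouping_incremental("aaAAbbCaAB", key=str.upper))
print(handled)  # ['a', 'a', 'A', 'A', 'b', 'b', 'C', 'a', 'A', 'B']
```

Each value now appears exactly once, but the consumer has re-implemented the grouping state that the generator was supposed to manage for it.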

-- 
Steve