[Python-ideas] Where should grouping() live (was: grouping / dict of lists)
Steven D'Aprano
steve at pearwood.info
Tue Jul 3 11:24:14 EDT 2018
On Tue, Jul 03, 2018 at 09:23:07AM -0400, David Mertz wrote:
> But before putting it on auto-archive, the BDFL said (1) NO GO on getting a
> new builtin; (2) NO OBJECTION to putting it in itertools.
>
> My problem with the second idea is that *I* find it very wrong to have
> something in itertools that does not return an iterator. It wrecks the
> combinatorial algebra of the module.
That seems like a reasonable objection to me.
> That said, it's easy to fix... and I believe independently useful. Just
> make grouping() a generator function rather than a plain function. This
> lets us get an incremental grouping of an iterable.
We already have something which lazily groups an iterable, returning
groups as they are seen: groupby.
What makes grouping() different from groupby() is that it accumulates
ALL of the subgroups rather than just consecutive subgroupings. To make
it clear with a simulated example (ignoring the keys for brevity):
    groupby("aaAAbbCaAB", key=str.upper)
    => groups "aaAA", "bb", "C", "aA", "B"

    grouping("aaAAbbCaAB", key=str.upper)
    => groups "aaAAaA", "bbB", "C"
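The difference can be checked with a minimal sketch. The `grouping()` function here is a hypothetical stand-in (a plain dict of lists), not the implementation proposed in the thread; `itertools.groupby` is the real standard-library function.

```python
from collections import defaultdict
from itertools import groupby


def grouping(iterable, key=None):
    """Hypothetical eager grouping: collect every item under its key."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
    return dict(groups)


data = "aaAAbbCaAB"

# groupby() starts a new group every time the key changes.
consecutive = ["".join(g) for _, g in groupby(data, key=str.upper)]

# grouping() gathers all items with the same key, wherever they occur.
accumulated = {k: "".join(v) for k, v in grouping(data, key=str.upper).items()}

print(consecutive)  # ['aaAA', 'bb', 'C', 'aA', 'B']
print(accumulated)  # {'A': 'aaAAaA', 'B': 'bbB', 'C': 'C'}
```

Note that the "B" group is only complete once the very last character has been read, which is why the result cannot be finalised early.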
So grouping() cannot even begin returning values until it has processed
the entire data set. In that regard, it is like sorted() -- it cannot be
lazy, it is a fundamentally eager operation.
I propose a better name that signals the non-lazy nature of this
function: *grouped* rather than grouping, by analogy with sorted().
As for where it belongs, perhaps the collections module is the least
worst fit.
> This can be useful if
> the iterable is slow or infinite, but the partial groupings are useful in
> themselves.
Under what circumstances would the partial groupings be useful? Given
the example above:
    grouping("aaAAbbCaAB", key=str.upper)
when would you want to see the accumulated partial groups?
    # again, ignoring the keys for brevity
    "aaAA"
    "aaAA", "bb"
    "aaAA", "bb", "C"
    "aaAAaA", "bb", "C"
    "aaAAaA", "bbB", "C"
I don't see any practical use for this -- if you start processing the
partial groupings immediately, you end up double-processing some
of the items; if you wait until the last one, what's the point of the
intermediate values?
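The double-processing can be made concrete with a rough sketch. `incremental_grouping()` is a hypothetical generator variant, assumed here to yield the groups accumulated so far after every item; the thread does not specify how often it would yield.

```python
from collections import defaultdict


def incremental_grouping(iterable, key=None):
    """Hypothetical generator variant: yield the groups-so-far after each item."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
        yield groups


data = "aaAAbbCaAB"

# A naive consumer that handles every member of every partial grouping.
processed = 0
for snapshot in incremental_grouping(data, key=str.upper):
    processed += sum(len(members) for members in snapshot.values())

print(len(data), processed)  # 10 55
```

Ten input items lead to 55 item visits, because each snapshot repeats everything already yielded in earlier snapshots.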
As you say yourself:
> This isn't so useful for the concrete sequence, but for this it would be
> great:
>
>     for grouped in grouping(data_over_wire()):
>         process_partial_groups(grouped)
And that demonstrates exactly why this would be a terrible bug magnet,
suckering people into doing what you just did, and ending up processing
values more than once.
To avoid that, your process_partial_groups would need to remember, for
each key it has seen, which values it has already processed.
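A sketch of the bookkeeping this would require, under the same assumption that a hypothetical `incremental_grouping()` yields the groups-so-far after each item. The extra `seen_counts` and `handle` parameters on `process_partial_groups` are illustrative additions, not part of the quoted example.

```python
from collections import defaultdict


def incremental_grouping(iterable, key=None):
    """Hypothetical generator variant: yield the groups-so-far after each item."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
        yield groups


def process_partial_groups(snapshot, seen_counts, handle):
    """Handle only the members that were not present in earlier snapshots."""
    for k, members in snapshot.items():
        for item in members[seen_counts[k]:]:
            handle(k, item)
        # Remember how far into this key's group we have got.
        seen_counts[k] = len(members)


handled = []
seen_counts = defaultdict(int)  # per-key count of members already handled
for snapshot in incremental_grouping("aaAAbbCaAB", key=str.upper):
    process_partial_groups(
        snapshot, seen_counts, lambda k, item: handled.append((k, item))
    )

print(len(handled))  # 10
```

With the per-key counters each item is handled exactly once, but at that point the consumer is doing the same work as a plain loop over the original iterable.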
--
Steve