[Python-ideas] Where should grouping() live

Wed Jul 4 01:09:51 EDT 2018

This ended up being a long post, so here's the TL;DR:

* There are types of data well suited to the key function approach, and
other data not so well suited to it. If you want to support the less
well-suited use cases, you should have a value function as well and/or take
(key, value) pairs.

* There are some nice advantages in flexibility to having a Grouping class,
rather than simply a function.

So: I propose a best of all worlds version: a Grouping class (subclass of
dict):

* The constructor takes an iterable of (key, value) pairs by default.

* The constructor takes an optional key_func -- when not None, it is used
to determine the keys in the iterable instead.

* The constructor also takes a value_func -- when specified, it processes
the items to determine the values.

* a_grouping[key] = value

  adds the value to the list corresponding to the key.

* a_grouping.add(item) -- applies the key_func and value_func to add a new
value to the appropriate group.

Prototype code here:

https://github.com/PythonCHB/grouper
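
For anyone reading without following the link, here is a minimal sketch of
the class described in the bullets above. It is not the prototype's actual
code: the keyword spellings key_fun / value_fun follow the examples later in
this post, and applying value_fun to the whole item, plus the simplified
update(), are assumptions made for illustration.

```python
class Grouping(dict):
    """Sketch only: a dict that maps each key to a list of values."""

    def __init__(self, iterable=(), *, key_fun=None, value_fun=None):
        super().__init__()
        self.key_fun = key_fun
        self.value_fun = value_fun
        self.update(iterable)

    def __setitem__(self, key, value):
        # Append to the key's group instead of overwriting it.
        if key not in self:
            super().__setitem__(key, [])
        super().__getitem__(key).append(value)

    def add(self, item):
        # Without a key_fun, items are assumed to be (key, value) pairs.
        if self.key_fun is None:
            key, value = item
        else:
            key, value = self.key_fun(item), item
        # Assumption: value_fun, when given, is applied to the whole item.
        if self.value_fun is not None:
            value = self.value_fun(item)
        self[key] = value

    def update(self, iterable):
        # Simplified: accepts only an iterable of items, unlike dict.update.
        for item in iterable:
            self.add(item)


by_pairs = Grouping([("a", 1), ("b", 2), ("a", 3)])
by_initial = Grouping(["Ann", "Bob", "Al"], key_fun=lambda name: name[0])
by_initial.add("Bea")
```

Here by_pairs groups the values 1 and 3 under "a", and by_initial groups the
names under their first letter, with "Bea" added incrementally.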

Now the lengthy commentary and examples:

On Tue, Jul 3, 2018 at 5:21 PM, Steven D'Aprano <steve at pearwood.info> wrote:

> On Wed, Jul 04, 2018 at 10:44:17AM +1200, Greg Ewing wrote:
> > Steven D'Aprano wrote:
>
> > Unless we *make* it a data type. Then not only would it fit
> > well in collections, it would also make it fairly easy to do
> > incremental grouping if you really wanted that.

indeed -- that was one of the motivations for my prototype:

https://github.com/PythonCHB/grouper

(Did none of my messages get to this list??)

> > Usual case:
> >
> >    g = groupdict((key(val), val) for val in things)
>
>
> How does groupdict differ from regular defaultdicts, aside from the
> slightly different constructor?
>

* You don't need to declare the defaultdict (and what the default is) first

* You don't need to call .append() yourself

* It can have a custom .__init__() and .update()

* It can have a .add() method

* It can (optionally) use a key function.

* And you can have other methods that do useful things with the groupings.

> >    g = groupdict()
> >    for key(val), val in things:
> >       g.add(key, val)
> >       process_partial_grouping(g)
>
> I don't think that syntax works. I get:
>
> SyntaxError: can't assign to function call
>

looks like untested code :-)

with my prototype it would be:

g = groupdict()
for key, val in things:
    g[key] = val
process_partial_grouping(g)

(this assumes your things are (key, value) pairs)

Again, IF your data are a sequence of items, and the value is the item
itself, and the key is a simple function of the item, THEN the key function
method makes more sense, which for the incremental adding of data would be:

g = groupdict(key_fun=a_fun)
for thing in things:
    g.add(thing)
process_partial_grouping(g)

> Even if it did work, it's hardly any simpler than
>
>     d = defaultdict(list)
>     for val in things:
>         d[key(val)].append(val)
>
> But then Counter is hardly any simpler than a regular dict too.
>

exactly -- and Counter is actually, annoyingly, a little too much like a
regular dict, in my mind :-)

In the latest version of my prototype, __init__ expects an iterable of
(key, value) pairs by default, but you can also pass in a key_func, and then
it will process the iterable passed in as (key_func(item), item) pairs.

And the update() method will also use the key_func if one was provided.

So a best of both worlds -- pick your API.
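
The same "pick your API" idea can be sketched as a plain function. The name
group_items is hypothetical and does not come from the prototype or the PEP;
it only illustrates switching between (key, value) pairs and
(key_fun(item), item) pairs.

```python
def group_items(iterable, key_fun=None):
    """Sketch: group into a dict of lists, with an optional key function."""
    groups = {}
    for item in iterable:
        if key_fun is None:
            # Default: each item is already a (key, value) pair.
            key, value = item
        else:
            # With a key function, the item itself is the value.
            key, value = key_fun(item), item
        groups.setdefault(key, []).append(value)
    return groups


from_pairs = group_items([("a", 1), ("a", 2), ("b", 3)])
from_key_fun = group_items(range(6), key_fun=lambda n: n % 2)
```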

In this thread, and in the PEP, there are various ways of accomplishing this
task presented -- none of them (except using a raw itertools.groupby in
some cases) is all that onerous.

But I do think a custom function or even better, custom class, would create
a "one obvious" way to do a common manipulation.
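
On the raw itertools.groupby caveat above: groupby only merges *consecutive*
items with equal keys, so unsorted input has to be sorted by the same key
first. A small illustration with made-up data:

```python
from itertools import groupby

words = ["apple", "bob", "avocado", "banana"]


def first_letter(word):
    return word[0]


# Without sorting, the same key shows up in several separate runs.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=first_letter)]

# Sorting by the same key first gives one group per key.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), key=first_letter)
]
```

The unsorted version yields four one-element groups (a, b, a, b); the sorted
version yields two groups, one per letter.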

A final (repeated) point:

Some data are better suited to a (key, value) pair style, and some to a key
function style. All of the examples in the PEP are well suited to the key
function style. But the example that kicked off this discussion was about
data already in (key, value) pairs (actually, in that case, (value, key) pairs).
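
To make the difference concrete, here is a small sketch using plain dicts and
invented (value, key) data, not the data from the original thread. Unpacking
the pairs gives clean values, while a key function alone leaves the whole
tuple as the value:

```python
from operator import itemgetter

# Hypothetical data, already in (value, key) order.
pairs = [("Ada", "math"), ("Bob", "physics"), ("Cy", "math")]

# Pair style: unpack and swap, so only the value lands in the group.
by_unpacking = {}
for value, key in pairs:
    by_unpacking.setdefault(key, []).append(value)

# Key function style: the key is right, but each value is the whole tuple.
get_key = itemgetter(1)
by_key_fun = {}
for item in pairs:
    by_key_fun.setdefault(get_key(item), []).append(item)
```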

And there are other examples. Here's a good one for how one might want to
use a Grouping dict more like a regular dict -- or maybe like a simple
function constructor:

(code in: https://github.com/PythonCHB/grouper/blob/master/examples/trigrams.py)

#!/usr/bin/env python3

"""
Demo of processing "trigrams" from Dave Thomas' Coding Kata site:

http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

This is only addressing the part of the problem of building up the trigrams.

This is showing various ways of doing it with the Grouping object.
"""

from grouper import Grouping
from operator import itemgetter

words = "I wish I may I wish I might".split()

# using setdefault with a regular dict:
# how I might do it without a Grouping class
trigrams = {}
for i in range(len(words) - 2):
    pair = tuple(words[i:i + 2])
    follower = words[i + 2]
    trigrams.setdefault(pair, []).append(follower)

print(trigrams)

# using a Grouping with a regular loop:

trigrams = Grouping()
for i in range(len(words) - 2):
    pair = tuple(words[i:i + 2])
    follower = words[i + 2]
    trigrams[pair] = follower

print(trigrams)

# using a Grouping with zip

trigrams = Grouping()
for w1, w2, w3 in zip(words[:], words[1:], words[2:]):
    trigrams[(w1, w2)] = w3

print(trigrams)

# Now we can do it in one expression:

trigrams = Grouping(((w1, w2), w3)
                    for w1, w2, w3 in zip(words[:], words[1:], words[2:]))
print(trigrams)

# Now with the key function:
# in this case it needs to be in a sequence, so we can't use a simple loop

trigrams = Grouping(zip(words[:], words[1:], words[2:]),
                    key_fun=itemgetter(0, 1))

print(trigrams)

# Darn! that got the key right, but the value is not right.
# we can post process:
trigrams = {key: [t[2] for t in value] for key, value in trigrams.items()}

print(trigrams)

# But THAT is a lot harder to wrap your head around than the original
# setdefault() loop!
# And it mixes key function style and comprehension style -- so no good.

# Adding a value_func helps a lot:
trigrams = Grouping(zip(words[:], words[1:], words[2:]),
                    key_fun=itemgetter(0, 1),
                    value_fun=itemgetter(2))

print(trigrams)

# that works fine, but I, at least, find it klunkier than the comprehension
# style

# Finally, we can use a regular loop with the functions

trigrams = Grouping(key_fun=itemgetter(0, 1),
                    value_fun=itemgetter(2))
for triple in zip(words[:], words[1:], words[2:]):
    trigrams.add(triple)

print(trigrams)

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov