[Python-ideas] grouping / dict of lists

Chris Barker chris.barker at noaa.gov
Sun Jul 1 01:18:09 EDT 2018


On Fri, Jun 29, 2018 at 10:53 AM, Michael Selik <mike at selik.org> wrote:

> I've drafted a PEP for an easier way to construct groups of elements from
> a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
>
> I'm really warming to the:

Alternate: collections.Grouping

version -- I really like this as a kind of custom mapping, rather than
"just a function" (or alternate constructor) -- and I like your point that
it can have a bit of functionality built in other than on construction.

But I think it should be more like the other collection classes -- i.e. a
general purpose class that can be used for grouping, but also used more
general-purpose-y as well. That way people can do their "custom" stuff (key
function, etc.) with comprehensions.

The big differences are a custom __setitem__:

    def __setitem__(self, key, value):
        self.setdefault(key, []).append(value)

And the __init__ and update would take an iterable of (key, value) pairs,
rather than a single sequence.

This would get away from the itertools.groupby approach, which I find kinda
awkward:

* How often do you have your data in a single sequence?

* Do you need your keys (and values!) to be sortable???)

* Do we really want folks to have to be writing custom key functions and/or
lambdas for really simple stuff?

* and you may need to "transform" both your keys and values

I've enclosed an example implementation, borrowing heavily from Michael's
code.

The test code has a couple examples of use, but I'll put them here for the
sake of discussion.

Michael had:

Grouping('AbBa', key=c.casefold))

with my code, that would be:

Grouping(((c.casefold(), c) for c in 'AbBa'))

Note that the key function is applied outside the Grouping object, it
doesn't need to know anything about it -- and then users can use an
expression in a comprehension rather than a key function.

This looks a tad clumsier with my approach, but this is a pretty contrived
example -- in the more common case [*], you'd be writing a bunch of
lambdas, etc, and I'm not sure there is a way to get the values customized
as well, if you want that. (without applying a map later on)

Here is the example that the OP posted that kicked off this thread:

In [37]: student_school_list = [('Fred', 'SchoolA'),
    ...:                        ('Bob', 'SchoolB'),
    ...:                        ('Mary', 'SchoolA'),
    ...:                        ('Jane', 'SchoolB'),
    ...:                        ('Nancy', 'SchoolC'),
    ...:                        ]

In [38]: Grouping(((item[1], item[0]) for item in student_school_list))
Out[38]: Grouping({'SchoolA': ['Fred', 'Mary'],
                   'SchoolB': ['Bob', 'Jane'],
                   'SchoolC': ['Nancy']})

or

In [40]: Grouping((reversed(item) for item in student_school_list))
Out[40]: Grouping({'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'],
'SchoolC': ['Nancy']})

(note that if those keys and values were didn't have to be reversed, you
could just pass the list in raw.

I really like how I can use a generator expression and simple expressions
to transform the data in the way I need, rather than having to make key
functions.

And with Michael's approach, I think you'd need to call .map() after
generating the grouping -- a much klunkier way to do it. (and you'd get
plain dict rather than a Grouping that you could add stuff too later...)

I'm sure there are ways to improve my code, and maybe Grouping isn't the
best name, but I think something like this would be a nice addition to the
collections module.

-CHB

[*] -- before making any decisions about the best API, it would probably be
a good idea to collect examples of the kind of data that people really do
need to group like this. Does it come in (key, value) pairs naturally? or
in one big sequence with a key function that's easy to write? who knows
without examples of real world use cases.

I will show one "real world" example here:

In my Python classes, I like to use Dave Thomas' trigrams: "code kata":

http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

A key piece of this is building up a data structure with word pairs, and a
list of all the words that follow the pair in a piece of text.

This is a nice exercise to help people think about how to use dicts, etc.
Currently the most clean code uses .setdefault:

    word_pairs = {}
    # loop through the words
    # (rare case where using the index to loop is easiest)
    for i in range(len(words) - 2):  # minus 2, 'cause you need a pair
        pair = tuple(words[i:i + 2])  # a tuple so it can be a key in the
dict
        follower = words[i + 2]
        word_pairs.setdefault(pair, []).append(follower)

if this were done with my Grouping class, it would be:

In [53]: word_pairs = Grouping()

In [54]: for i in range(len(words) - 2):
    ...:     pair = tuple(words[i:i + 2])  # a tuple so it can be a key in
the dict
    ...:     follower = words[i + 2]
    ...:     word_pairs[pair] = follower
    ...:

In [55]: word_pairs
Out[55]: Grouping({('I', 'wish'): ['I', 'I'], ('wish', 'I'): ['may',
'might'], ('I', 'may'): ['I'], ('may', 'I'): ['wish']})

Not that different, really, but it saves folks from having to find and
understand setdefault. But you could also make it into a generator
expression like so:

In [56]: Grouping(((w1, w2), w3) for w1, w2, w3, in zip(words[:],
words[1:], words[2:]))
Out[56]: Grouping({('I', 'wish'): ['I', 'I'], ('wish', 'I'): ['may',
'might'], ('I', 'may'): ['I'], ('may', 'I'): ['wish']})

which I think is pretty slick. And satisfies the OP's desire for a
comprehension-like approach, rather than the:

- create an empty dict
- loop through the iterable
- use setdefault in the loop

approach.

> As a teacher, I've found that grouping is one of the most awkward tasks
for beginners to learn in Python. While this proposal requires
understanding a key-function, in my experience that's easier to teach than
the nuances of setdefault or defaultdict.

well, yes, and no -- as above, I use an example of this in teaching so that
I CAN teach the nuances of setdefault -- or at least dicts themselves (most
student use a if key in dict" construct before I tell them about setdefault)

So if you are teaching, say data analysis with Python -- it might be nice
to have this builtin, but if you are teaching "programming with Python" I'd
probably encourage them to do it by hand first anyway :-)

> Defaultdict requires passing a factory function or class, similar to a
key-function. Setdefault is awkwardly named and requires a discussion of
references and mutability.

I agree that the naming is awkward, but I haven't found confusion with
references an mutabilty from this --though I do keep hammering those points
throughout the class anyway :-)

and my approach doesn't require any key functions either :-)




-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20180630/64261996/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: grouper.py
Type: text/x-python-script
Size: 3429 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20180630/64261996/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_grouper.py
Type: text/x-python-script
Size: 2420 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20180630/64261996/attachment-0003.bin>


More information about the Python-ideas mailing list