On Fri, Jun 29, 2018 at 10:53 AM, Michael Selik <mike@selik.org> wrote:

I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst

I'm really warming to the:

Alternate: collections.Grouping

version -- I really like this as a kind of custom mapping, rather than "just a function" (or alternate constructor) -- and I like your point that it can have a bit of functionality built in other than on construction.

But I think it should be more like the other collection classes -- i.e. a general purpose class that can be used for grouping, but also used more general-purpose-y as well. That way people can do their "custom" stuff (key function, etc.) with comprehensions.

The big differences are a custom __setitem__:

def __setitem__(self, key, value):
self.setdefault(key, []).append(value)

And the __init__ and update would take an iterable of (key, value) pairs, rather than a single sequence.

This would get away from the itertools.groupby approach, which I find kinda awkward:

* How often do you have your data in a single sequence?

* Do you need your keys (and values!) to be sortable???)

* Do we really want folks to have to be writing custom key functions and/or lambdas for really simple stuff?

* and you may need to "transform" both your keys and values

I've enclosed an example implementation, borrowing heavily from Michael's code.

The test code has a couple examples of use, but I'll put them here for the sake of discussion.

Michael had:

Grouping('AbBa', key=c.casefold))

with my code, that would be:

Grouping(((c.casefold(), c) for c in 'AbBa'))

Note that the key function is applied outside the Grouping object, it doesn't need to know anything about it -- and then users can use an expression in a comprehension rather than a key function.

This looks a tad clumsier with my approach, but this is a pretty contrived example -- in the more common case [*], you'd be writing a bunch of lambdas, etc, and I'm not sure there is a way to get the values customized as well, if you want that. (without applying a map later on)

Here is the example that the OP posted that kicked off this thread:

In [37]: student_school_list = [('Fred', 'SchoolA'),
    ...:                        ('Bob', 'SchoolB'),
    ...:                        ('Mary', 'SchoolA'),
    ...:                        ('Jane', 'SchoolB'),
    ...:                        ('Nancy', 'SchoolC'),
    ...:                        ]

In [38]: Grouping(((item[1], item[0]) for item in student_school_list))
Out[38]: Grouping({'SchoolA': ['Fred', 'Mary'],
   'SchoolB': ['Bob', 'Jane'],
                   'SchoolC': ['Nancy']})

or

In [40]: Grouping((reversed(item) for item in student_school_list))
Out[40]: Grouping({'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']})

(note that if those keys and values were didn't have to be reversed, you could just pass the list in raw.

I really like how I can use a generator expression and simple expressions to transform the data in the way I need, rather than having to make key functions.

And with Michael's approach, I think you'd need to call .map() after generating the grouping -- a much klunkier way to do it. (and you'd get plain dict rather than a Grouping that you could add stuff too later...)

I'm sure there are ways to improve my code, and maybe Grouping isn't the best name, but I think something like this would be a nice addition to the collections module.

-CHB

[*] -- before making any decisions about the best API, it would probably be a good idea to collect examples of the kind of data that people really do need to group like this. Does it come in (key, value) pairs naturally? or in one big sequence with a key function that's easy to write? who knows without examples of real world use cases.

I will show one "real world" example here:

In my Python classes, I like to use Dave Thomas' trigrams: "code kata":

http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

A key piece of this is building up a data structure with word pairs, and a list of all the words that follow the pair in a piece of text.

This is a nice exercise to help people think about how to use dicts, etc. Currently the most clean code uses .setdefault:

    word_pairs = {}
    # loop through the words
    # (rare case where using the index to loop is easiest)
    for i in range(len(words) - 2): # minus 2, 'cause you need a pair
        pair = tuple(words[i:i + 2]) # a tuple so it can be a key in the dict
        follower = words[i + 2]
        word_pairs.setdefault(pair, []).append(follower)

if this were done with my Grouping class, it would be:

In [53]: word_pairs = Grouping()

In [54]: for i in range(len(words) - 2):
    ...:     pair = tuple(words[i:i + 2]) # a tuple so it can be a key in the dict
    ...:     follower = words[i + 2]
    ...:     word_pairs[pair] = follower
    ...:

In [55]: word_pairs
Out[55]: Grouping({('I', 'wish'): ['I', 'I'], ('wish', 'I'): ['may', 'might'], ('I', 'may'): ['I'], ('may', 'I'): ['wish']})

Not that different, really, but it saves folks from having to find and understand setdefault. But you could also make it into a generator expression like so:

In [56]: Grouping(((w1, w2), w3) for w1, w2, w3, in zip(words[:], words[1:], words[2:]))
Out[56]: Grouping({('I', 'wish'): ['I', 'I'], ('wish', 'I'): ['may', 'might'], ('I', 'may'): ['I'], ('may', 'I'): ['wish']})

which I think is pretty slick. And satisfies the OP's desire for a comprehension-like approach, rather than the:

- create an empty dict

- loop through the iterable

- use setdefault in the loop

approach.

> As a teacher, I've found that grouping is one of the most awkward tasks for beginners to learn in Python. While this proposal requires understanding a key-function, in my experience that's easier to teach than the nuances of setdefault or defaultdict.

well, yes, and no -- as above, I use an example of this in teaching so that I CAN teach the nuances of setdefault -- or at least dicts themselves (most student use a if key in dict" construct before I tell them about setdefault)

So if you are teaching, say data analysis with Python -- it might be nice to have this builtin, but if you are teaching "programming with Python" I'd probably encourage them to do it by hand first anyway :-)

> Defaultdict requires passing a factory function or class, similar to a key-function. Setdefault is awkwardly named and requires a discussion of references and mutability.

I agree that the naming is awkward, but I haven't found confusion with references an mutabilty from this --though I do keep hammering those points throughout the class anyway :-)

and my approach doesn't require any key functions either :-)

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115   (206) 526-6317   main reception

Chris.Barker@noaa.gov