[Python-ideas] grouping / dict of lists

Tue Jul 3 02:50:49 EDT 2018

On Mon, Jul 2, 2018 at 9:39 AM, Michael Selik <mike at selik.org> wrote:

> On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin at tiime.fr>
>>> wrote:
>>>
>>>> For example the default could be such that grouping unpack tuples (key,
>>>> value) from the iterator and do what's expected with it (group value by
>>>> key). It is quite reasonable, and you have one example with (key, value) in
>>>> your example, and no example with the current default.
>>>>
>>>
I agree, the default should do something that has some chance of being
useful on its own, and ideally, the most "common" use, if we can identify
that.

> On Mon, Jul 2, 2018 at 3:22 AM Nicolas Rolin <nicolas.rolin at tiime.fr>
> wrote:
>
>> My question would be: does it have to be a key function? Can't we just
>> remove the "key" argument?
>>
>
> In the examples or from the parameters? A key function is necessary to
> support a wide variety of uses.
>

not if you have the expectation of an iterable of (key, value) pairs as the
input -- then any processing required to get a different key happens before
hand, allowing folks to use comprehension syntax.

as so: :-)

Because for pretty much all the given examples, I would find my default as
>> readable and nearly as short as the "key" syntax :
>>
>> > grouping(words, key=len)
>> grouping((len(word), word for word in words))
>>
>
> I think the fact that you misplaced a closing parenthesis demonstrates how
> the key-function pattern can be more clear.
>

I think it demonstrates that you shouldn't post untested code :-) -- the
missing paren is a syntax error -- it would be caught right away in real
life.

>
> The code is slightly more verbose, but it is akin to filter(iterable,
>> function) vs (i for i in iterable if function(i)).
>>
>
> Sometimes I prefer ``map`` and sometimes I prefer a list comprehension.
>

That is a "problem" with python: there are two ways to do things like map
and filter, and one way is not always the clearest choice.

But I wonder if map and filter would exist if they didn't pre-date
comprehensions..... That aside, the comprehension approach is pretty well
liked and useful. And almost always prefer it -- an expression is simple on
the eyes to me :-)

But when it's really a win is when you don't have a handy built-in function
to do what you want, even though it's simple expression.

With the map, filter, key approach, you have to write a bunch of little
utility functions or lambdas, which can really clutter up the code.

If so, I like to write out the comprehension to provide that extra variable
> name for clarity.
>
> I'd write:
>     map(len, words)
>
> But I'd also write
>     [len(fullname) for fullname in contacts]
>

A key (pun intended) issue is that passing functions around looks so neat
and  clean when it's a simple built in function like "len" -- but if not,
the it gets uglier, like:

map(lambda name: name.first_name, all_names)

vs

[name.first_name for nam in all names]

I really like the comprehension form much better when what you really want
is a simple expression like an attribute access or index or simple
calculation, or ....

I appreciate that defaulting the grouping key-function to ``itemgetter(0)``
> would enable a pleasant flexibility for people to make that same choice for
> each use. I haven't fully come around to that, yet, because so many other
> tools use the equality function as the default.
>

well, kinda, but maybe those aren't "pythonic" :-)

(and yes, itemgetter() is probably a better option than lambda in many
cases -- but I always forget that exists)

I started out in this topic answering a question about how to do a grouping
for a list of tuples, in that case the OP wanted a comprehension. I don't
think there's a good way to get a direct comprehension, but I do think we
can make a class of function that takes something you could build with a
comprehension.

And I took a look at itertools.groupby, and found it very, very awkward,
ultimately arriving at:

student_school_list =  [('Fred', 'SchoolA'),
                        ('Bob', 'SchoolB'),
                        ('Mary', 'SchoolA'),
                        ('Jane', 'SchoolB'),
                        ('Nancy', 'SchoolC')]

grouped = {a:[t[0] for t in b] for a,b in groupby(sorted(student_school_list,
key=lambda t: t[1]), key=lambda t: t[1])}

{'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC':
['Nancy']}

So why is that so painful?

("it" is itertools.groupby)

a) it  returns an iterable of tuples, so to get a dict you need to do the
dict comp
b) it requires the data to be sorted -- so you ned to sort it first
c) I need to provide a key function to sort by
d) I need to provide (the same, always?) key function to group by
e) when I make the dict, I need to make the list, and use an expression to
get the value I want.
f) because I need those key functions, I need to use lambda for what could
be a simple expression

So the current proposal in the PEP makes that a lot better:

a) It makes a dict, so that step is done
b) It doesn't require the data to be sorted

but:

d) I still need the key function to do anything useful
e) If my data aren't clean, I need to do some post-processing to get the
value I want.

So -- the above example with the (current version of the) PEP's function:

In [35]: grouping(student_school_list, key=lambda t: t[1])
Out[35]:
{'SchoolA': [('Fred', 'SchoolA'), ('Mary', 'SchoolA')],
 'SchoolB': [('Bob', 'SchoolB'), ('Jane', 'SchoolB')],
 'SchoolC': [('Nancy', 'SchoolC')]}

Darn! that's not right -- I need to clean up the values, too:

gr = { k: [t[0] for t in l] for k, l in gr.items()}

OK, but pretty painful really, so I guess I should clean it up first, but I
can't actually see any clean way to do that! Am I missing something?

Ah -- I see it in your PEP: "Sequences of values that are already paired
with their keys can be easily transformed after grouping." -- sorry that
code is not "easily" -- having to write that kind of code makes this whole
thing pretty much useless compared to using, say, setdefault() in the first
place.

One option would be to add a value function, to unpack the value, analogous
to the key function:

In [44]: gr = grouping(student_school_list, key=lambda t: t[1],
value=lambda t: t[0])

Out[45]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'],
'SchoolC': ['Nancy']}

That's pretty slick.

However, I still much prefer an API that assumes an iterator of (key,value)
pairs:

def grouping(iterable):
    groups = {}
    for k, g in iterable:
        groups.setdefault(k, []).append(value)
        return groups

(easy to impliment :-) )

And now you get something that "just works" for at least one case:

In [54]: def grouping(iterable):
    ...:     groups = {}
    ...:     for key, value in iterable:
    ...:         groups.setdefault(key, []).append(value)
    ...:     return groups
    ...:

In [55]: school_student_list
Out[55]:
[('SchoolA', 'Fred'),
 ('SchoolB', 'Bob'),
 ('SchoolA', 'Mary'),
 ('SchoolB', 'Jane'),
 ('SchoolC', 'Nancy')]

In [56]: grouping(school_student_list)
Out[56]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'],
'SchoolC': ['Nancy']}

And if you need to massage the data you can do so with a generator
expression:

In [58]: grouping((reversed(t) for t in student_school_list))

Out[58]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'],
'SchoolC': ['Nancy']}

And here are the examples from the PEP:
(untested -- I may hav missed some brackets, etc)

# Group words by length:

grouping(((len(word), word) for word in words))

# Group names by first initial:

grouping((name[0], name) for name in names))

# Group people by city:

grouping((contact.city, contact) for contact in contacts)

# Group employees by department:

grouping((employee['department'] for employee in employees)

# Group files by extension:

grouping((os.path.splitext(filepath)[1] for filepath in os.listdir('.')))

# Group transactions by type:

grouping(( 'debit' if v > 0 else 'credit' for v in transactions))

# Invert a dictionary, d, without discarding repeated values:

grouping(((v, k) for v, k in d.items()))

So that was an interesting exercise -- many of those are a bit clearer (or
more compact) with the key function. But I also notice a pattern -- all
thos examples fit very well into the key function pattern:

you want the entire item stored in your iterable.
you want to group by some quality of the item itself.

Perhaps those ARE the most common use cases -- I honestly don't know, but
from an earlier post:

" In practice, instead of (key, value) pairs, it's usually either
individual values or n-tuple rows. In the latter case, sometimes the key
should be dropped from the row when grouping, sometimes kept in the row,
and sometimes the key must be computed from multiple values within the row."

It seems that the comprehension style I'm suggesting would be a win for
the  case of n-tuple rows.

Doing this exercise has expanded my view, so I suggest that:

- keep the key function optional parameter.
- add a value function optional parameter. -- it really makes any case
where you don't want to store the whole item a lot easier.

- Have the default key function be itemgetter(0) and the default value
function be itemgetter(1) (or some similar way to implement default support
for processing an iterable of (key, value) pairs.

Having no value function and an equality default for the key function may
be "common", but it's also pretty much useless -- why have a useless
default?

Thinking this through now I do see that having key and value default to to
the pair method means that if you specify  key function, you will probably
have to specify a value function as well -- so maybe that's not ideal.

hmm.

A couple other comments:

implementation detail: Do you gain anything by using the itertools groupby?
over, say:

groups = {}
for item in iterable:
    groups.setdefault(key(item), []).append(item)

Final point:

I still prefer the class idea over a utility function, because:

* with a class, you can ad stuff to the grouping later:

a_grouping['key'] = value

or maybe a_grouping.add(item)

* with a class, you can add utility methods -- I kinda liked that in your
original PEP.

I see this in the section about a custom class: "Merging groupings is not a
one-liner," -- I'm pretty sure the update() method on my prototype was a
merge operation -- so very easy :-) -- and another argument for a custom
class.

final thought about custon class. If you want to be abel to do:

a_grouping['key'] = value

then that really reinforces the "natural" mapping of keys to values -- it's
kind of another way to think about this -- rather than thinking about it as
"groupby" function that creates a dict, think of it as a dict-like object,
that, instead of writing over an existing key, accumulates the multiple
values.

which means you really want the "normal" constructor to take an iterable of
(key, value) pairs, to make it more dict-like.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20180702/0b2a12c3/attachment-0001.html>