On Mon, Jul 2, 2018 at 9:39 AM, Michael Selik <mike@selik.org> wrote:
On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
For example the default could be such that grouping unpack tuples (key, value) from the iterator and do what's expected with it (group value by key). It is quite reasonable, and you have one example with (key, value) in your example, and no example with the current default.

I agree, the default should do something that has some chance of being useful on its own, and ideally, the most "common" use, if we can identify that.
 
On Mon, Jul 2, 2018 at 3:22 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
My question would be: does it have to be a key function? Can't we just remove the "key" argument?

In the examples or from the parameters? A key function is necessary to support a wide variety of uses.

not if you have the expectation of an iterable of (key, value) pairs as the input -- then any processing required to get a different key happens before hand, allowing folks to use comprehension syntax.
 
as so: :-)

Because for pretty much all the given examples, I would find my default as readable and nearly as short as the "key" syntax :

> grouping(words, key=len)
grouping((len(word), word for word in words))

I think the fact that you misplaced a closing parenthesis demonstrates how the key-function pattern can be more clear.

I think it demonstrates that you shouldn't post untested code :-) -- the missing paren is a syntax error -- it would be caught right away in real life.

The code is slightly more verbose, but it is akin to filter(iterable, function) vs (i for i in iterable if function(i)).

Sometimes I prefer ``map`` and sometimes I prefer a list comprehension. 

That is a "problem" with python: there are two ways to do things like map and filter, and one way is not always the clearest choice.

But I wonder if map and filter would exist if they didn't pre-date comprehensions..... That aside, the comprehension approach is pretty well liked and useful. And almost always prefer it -- an expression is simple on the eyes to me :-)

But when it's really a win is when you don't have a handy built-in function to do what you want, even though it's simple expression. 

With the map, filter, key approach, you have to write a bunch of little utility functions or lambdas, which can really clutter up the code.

If so, I like to write out the comprehension to provide that extra variable name for clarity.

I'd write: 
    map(len, words)

But I'd also write
    [len(fullname) for fullname in contacts]

A key (pun intended) issue is that passing functions around looks so neat and  clean when it's a simple built in function like "len" -- but if not, the it gets uglier, like:

map(lambda name: name.first_name, all_names)

vs

[name.first_name for nam in all names]

I really like the comprehension form much better when what you really want is a simple expression like an attribute access or index or simple calculation, or ....

I appreciate that defaulting the grouping key-function to ``itemgetter(0)`` would enable a pleasant flexibility for people to make that same choice for each use. I haven't fully come around to that, yet, because so many other tools use the equality function as the default.

well, kinda, but maybe those aren't "pythonic" :-)

(and yes, itemgetter() is probably a better option than lambda in many cases -- but I always forget that exists)

I started out in this topic answering a question about how to do a grouping for a list of tuples, in that case the OP wanted a comprehension. I don't think there's a good way to get a direct comprehension, but I do think we can make a class of function that takes something you could build with a comprehension.

And I took a look at itertools.groupby, and found it very, very awkward, ultimately arriving at:

student_school_list =  [('Fred', 'SchoolA'),
                        ('Bob', 'SchoolB'),
                        ('Mary', 'SchoolA'),
                        ('Jane', 'SchoolB'),
                        ('Nancy', 'SchoolC')]

grouped = {a:[t[0] for t in b] for a,b in groupby(sorted(student_school_list, key=lambda t: t[1]), key=lambda t: t[1])}

{'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

So why is that so painful?
 
("it" is itertools.groupby)

a) it  returns an iterable of tuples, so to get a dict you need to do the dict comp
b) it requires the data to be sorted -- so you ned to sort it first
c) I need to provide a key function to sort by
d) I need to provide (the same, always?) key function to group by
e) when I make the dict, I need to make the list, and use an expression to get the value I want.
f) because I need those key functions, I need to use lambda for what could be a simple expression


So the current proposal in the PEP makes that a lot better:

a) It makes a dict, so that step is done
b) It doesn't require the data to be sorted

but:

d) I still need the key function to do anything useful
e) If my data aren't clean, I need to do some post-processing to get the value I want.

So -- the above example with the (current version of the) PEP's function:

In [35]: grouping(student_school_list, key=lambda t: t[1])
Out[35]:
{'SchoolA': [('Fred', 'SchoolA'), ('Mary', 'SchoolA')],
 'SchoolB': [('Bob', 'SchoolB'), ('Jane', 'SchoolB')],
 'SchoolC': [('Nancy', 'SchoolC')]}

Darn! that's not right -- I need to clean up the values, too:

gr = { k: [t[0] for t in l] for k, l in gr.items()}

OK, but pretty painful really, so I guess I should clean it up first, but I can't actually see any clean way to do that! Am I missing something?

Ah -- I see it in your PEP: "Sequences of values that are already paired with their keys can be easily transformed after grouping." -- sorry that code is not "easily" -- having to write that kind of code makes this whole thing pretty much useless compared to using, say, setdefault() in the first place.

One option would be to add a value function, to unpack the value, analogous to the key function:

In [44]: gr = grouping(student_school_list, key=lambda t: t[1], value=lambda t: t[0])

Out[45]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

That's pretty slick.

However, I still much prefer an API that assumes an iterator of (key,value) pairs:


def grouping(iterable):
    groups = {}
    for k, g in iterable:
        groups.setdefault(k, []).append(value)
        return groups

(easy to impliment :-) )

And now you get something that "just works" for at least one case:

In [54]: def grouping(iterable):
    ...:     groups = {}
    ...:     for key, value in iterable:
    ...:         groups.setdefault(key, []).append(value)
    ...:     return groups
    ...: 

In [55]: school_student_list
Out[55]: 
[('SchoolA', 'Fred'),
 ('SchoolB', 'Bob'),
 ('SchoolA', 'Mary'),
 ('SchoolB', 'Jane'),
 ('SchoolC', 'Nancy')]

In [56]: grouping(school_student_list)
Out[56]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

And if you need to massage the data you can do so with a generator expression:

In [58]: grouping((reversed(t) for t in student_school_list))

Out[58]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

And here are the examples from the PEP:
(untested -- I may hav missed some brackets, etc)

# Group words by length:


grouping(((len(word), word) for word in words))

# Group names by first initial:

grouping((name[0], name) for name in names))

# Group people by city:

grouping((contact.city, contact) for contact in contacts) 

# Group employees by department:

grouping((employee['department'] for employee in employees)

# Group files by extension:

grouping((os.path.splitext(filepath)[1] for filepath in os.listdir('.')))

# Group transactions by type:

grouping(( 'debit' if v > 0 else 'credit' for v in transactions))

# Invert a dictionary, d, without discarding repeated values:

grouping(((v, k) for v, k in d.items()))

So that was an interesting exercise -- many of those are a bit clearer (or more compact) with the key function. But I also notice a pattern -- all thos examples fit very well into the key function pattern:

you want the entire item stored in your iterable.
you want to group by some quality of the item itself.

Perhaps those ARE the most common use cases -- I honestly don't know, but from an earlier post:

" In practice, instead of (key, value) pairs, it's usually either individual values or n-tuple rows. In the latter case, sometimes the key should be dropped from the row when grouping, sometimes kept in the row, and sometimes the key must be computed from multiple values within the row."

It seems that the comprehension style I'm suggesting would be a win for the  case of n-tuple rows.

Doing this exercise has expanded my view, so I suggest that:

- keep the key function optional parameter.
- add a value function optional parameter. -- it really makes any case where you don't want to store the whole item a lot easier.

- Have the default key function be itemgetter(0) and the default value function be itemgetter(1) (or some similar way to implement default support for processing an iterable of (key, value) pairs.

Having no value function and an equality default for the key function may be "common", but it's also pretty much useless -- why have a useless default?

Thinking this through now I do see that having key and value default to to the pair method means that if you specify  key function, you will probably have to specify a value function as well -- so maybe that's not ideal.

hmm.


A couple other comments:

implementation detail: Do you gain anything by using the itertools groupby? over, say:

groups = {}
for item in iterable:
    groups.setdefault(key(item), []).append(item)

Final point:

I still prefer the class idea over a utility function, because:

* with a class, you can ad stuff to the grouping later:

a_grouping['key'] = value

or maybe a_grouping.add(item)

* with a class, you can add utility methods -- I kinda liked that in your original PEP.

I see this in the section about a custom class: "Merging groupings is not a one-liner," -- I'm pretty sure the update() method on my prototype was a merge operation -- so very easy :-) -- and another argument for a custom class.

final thought about custon class. If you want to be abel to do:

a_grouping['key'] = value

then that really reinforces the "natural" mapping of keys to values -- it's kind of another way to think about this -- rather than thinking about it as "groupby" function that creates a dict, think of it as a dict-like object, that, instead of writing over an existing key, accumulates the multiple values.

which means you really want the "normal" constructor to take an iterable of (key, value) pairs, to make it more dict-like.

-CHB














--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov