Allow a group by operation for dict comprehension
Hi,

I use list and dict comprehensions a lot, and a problem I often have is doing the equivalent of a group_by operation (to use SQL terminology). For example, if I have a list of tuples (student, school) and I want the list of students by school, the only option I'm left with is to write:

    student_by_school = defaultdict(list)
    for student, school in student_school_list:
        student_by_school[school].append(student)

What I would expect is a comprehension syntax allowing me to write something along the lines of:

    student_by_school = {group_by(school): student for school, student in student_school_list}

or any other syntax that allows me to regroup items from an iterable.

Small FAQ:

Q: Why include something in comprehensions when you can do it in a small number of lines?
A: A really appreciable part of list and dict comprehensions is that they let the developer be explicit about what a given line does. If you see a comprehension, you know the developer wanted to produce an iterable without any side effect other than depleting the iterator (assuming reasonable code guidelines). Initializing an object and filling it in a for loop is both longer and less explicit about the intent. That pattern should be reserved for intrinsically complex operations, not for one of the basic things one wants to do with lists and dicts.

Q: Why group by in particular?
A: If we take SQL queries (https://en.wikipedia.org/wiki/SQL_syntax#Queries) as a reasonable picture of how people need to manipulate data day to day, dict comprehensions already cover most of the base operations; the only missing ones are GROUP BY and HAVING.

Q: Why not use it on lists, with syntax such as

    student_by_school = [school, student for school, student in student_school_list group by school]

A: It would create either a discrepancy with iterators or a perhaps misleading semantic (the one from itertools.groupby, which requires the iterable to be sorted in order to be useful). Having the option to do it with a dict removes any ambiguity and should be enough to cover most "group by" applications.

Examples:

    edible_list = [('fruit', 'orange'), ('meat', 'eggs'), ('meat', 'spam'),
                   ('fruit', 'apple'), ('vegetable', 'fennel'),
                   ('fruit', 'pineapple'), ('fruit', 'pineapple'),
                   ('vegetable', 'carrot')]
    edible_list_by_food_type = {group_by(food_type): edible for food_type, edible in edible_list}
    print(edible_list_by_food_type)
    {'fruit': ['orange', 'apple', 'pineapple', 'pineapple'], 'meat': ['eggs', 'spam'], 'vegetable': ['fennel', 'carrot']}

    bank_transactions = [200.0, -357.0, -9.99, -15.6, 4320.0, -1200.0]
    splited_bank_transactions = {group_by('credit' if amount > 0 else 'debit'): amount for amount in bank_transactions}
    print(splited_bank_transactions)
    {'credit': [200.0, 4320.0], 'debit': [-357.0, -9.99, -15.6, -1200.0]}

-- Nicolas Rolin
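[For reference, the grouping above can already be written with collections.defaultdict; this is a minimal runnable version of the edible_list example, i.e. the status quo the proposal wants to shorten (the group_by comprehension syntax itself is hypothetical and does not run on current Python):]

    from collections import defaultdict

    edible_list = [('fruit', 'orange'), ('meat', 'eggs'), ('meat', 'spam'),
                   ('fruit', 'apple'), ('vegetable', 'fennel'),
                   ('fruit', 'pineapple'), ('fruit', 'pineapple'),
                   ('vegetable', 'carrot')]

    # The explicit loop the proposal wants to replace with a comprehension.
    edible_list_by_food_type = defaultdict(list)
    for food_type, edible in edible_list:
        edible_list_by_food_type[food_type].append(edible)

    print(dict(edible_list_by_food_type))
    # {'fruit': ['orange', 'apple', 'pineapple', 'pineapple'],
    #  'meat': ['eggs', 'spam'],
    #  'vegetable': ['fennel', 'carrot']}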
On Thu, Jun 28, 2018 at 8:25 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
    student_by_school = defaultdict(list)
    for student, school in student_school_list:
        student_by_school[school].append(student)
Thank you for bringing this up. I've been drafting a proposal for a better grouping / group-by operation for a little while. I'm not quite ready to share it, as I'm still researching use cases. I'm +1 that this task needs improvement, but -1 on this particular solution.
Hello,

I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst

As a teacher, I've found that grouping is one of the most awkward tasks for beginners to learn in Python. While this proposal requires understanding a key-function, in my experience that's easier to teach than the nuances of setdefault or defaultdict. Defaultdict requires passing a factory function or class, similar to a key-function. Setdefault is awkwardly named and requires a discussion of references and mutability. Those topics are important and should be covered, but I'd like to let them sink in gradually. Grouping often comes up as a question on the first or second day, especially for folks transitioning from Excel.

I've tested this proposal on actual students (no students were harmed during experimentation) and found that the majority appreciate it. Some are even able to guess what it does (would do) without any priming.

Thanks for your time,
-- Michael

On Thu, Jun 28, 2018 at 8:38 AM Michael Selik <mike@selik.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
    student_by_school = defaultdict(list)
    for student, school in student_school_list:
        student_by_school[school].append(student)
Thank you for bringing this up. I've been drafting a proposal for a better grouping / group-by operation for a little while. I'm not quite ready to share it, as I'm still researching use cases.
I'm +1 that this task needs improvement, but -1 on this particular solution.
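[For concreteness, the grouping patterns mentioned in the announcement above -- setdefault, defaultdict, and a key-function helper -- might look like this side by side. The `grouping` function here is only a stand-in sketch for whatever the draft PEP specifies, not its actual API:]

    from collections import defaultdict

    student_school_list = [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA')]

    # 1. setdefault -- awkwardly named, and needs a word about references and mutability.
    by_school = {}
    for student, school in student_school_list:
        by_school.setdefault(school, []).append(student)

    # 2. defaultdict -- needs a factory function or class.
    by_school = defaultdict(list)
    for student, school in student_school_list:
        by_school[school].append(student)

    # 3. A key-function helper in the spirit of the draft (stand-in implementation).
    def grouping(iterable, key):
        groups = {}
        for item in iterable:
            groups.setdefault(key(item), []).append(item)
        return groups

    by_school = grouping(student_school_list, key=lambda pair: pair[1])
    # {'SchoolA': [('Fred', 'SchoolA'), ('Mary', 'SchoolA')], 'SchoolB': [('Bob', 'SchoolB')]}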
On a quick skim I see nothing particularly objectionable or controversial in your PEP, except I'm unclear why it needs to be a class method on `dict`. Adding something to a builtin like this is rather heavy-handed. Is there a really good reason why it can't be a function in `itertools`? (I don't think that it's relevant that it doesn't return an iterator -- it takes in an iterator.)

Also, your pure-Python implementation appears to be O(N log N) if key is None but O(N) otherwise; and the version for key is None uses an extra temporary array of size N. Is that intentional?

Finally, the first example under "Group and Aggregate" is described as a dict of sets but it actually returns a dict of (sorted) lists.

On Fri, Jun 29, 2018 at 10:54 AM Michael Selik <mike@selik.org> wrote:
Hello,
I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
As a teacher, I've found that grouping is one of the most awkward tasks for beginners to learn in Python. While this proposal requires understanding a key-function, in my experience that's easier to teach than the nuances of setdefault or defaultdict. Defaultdict requires passing a factory function or class, similar to a key-function. Setdefault is awkwardly named and requires a discussion of references and mutability. Those topics are important and should be covered, but I'd like to let them sink in gradually. Grouping often comes up as a question on the first or second day, especially for folks transitioning from Excel.
I've tested this proposal on actual students (no students were harmed during experimentation) and found that the majority appreciate it. Some are even able to guess what it does (would do) without any priming.
Thanks for your time, -- Michael
On Thu, Jun 28, 2018 at 8:38 AM Michael Selik <mike@selik.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
    student_by_school = defaultdict(list)
    for student, school in student_school_list:
        student_by_school[school].append(student)
Thank you for bringing this up. I've been drafting a proposal for a better grouping / group-by operation for a little while. I'm not quite ready to share it, as I'm still researching use cases.
I'm +1 that this task needs improvement, but -1 on this particular solution.
-- --Guido van Rossum (python.org/~guido)
On Fri, Jun 29, 2018 at 2:43 PM Guido van Rossum <guido@python.org> wrote:
On a quick skim I see nothing particularly objectionable or controversial in your PEP, except I'm unclear why it needs to be a class method on `dict`.
Since it constructs a basic dict, I thought it belongs best as a dict constructor like dict.fromkeys. It seemed to match other classmethods like datetime.now.
Adding something to a builtin like this is rather heavy-handed.
I included an alternate solution of a new class, collections.Grouping, which has some advantages. In addition to having less of that "heavy-handed" feel to it, the class can have a few utility methods that help handle more use cases.
Is there a really good reason why it can't be a function in `itertools`? (I don't think that it's relevant that it doesn't return an iterator -- it takes in an iterator.)
I considered placing it in the itertools module, but decided against because it doesn't return an iterator. I'm open to that if that's the consensus.
Also, your pure-Python implementation appears to be O(N log N) if key is None but O(N) otherwise; and the version for key is None uses an extra temporary array of size N. Is that intentional?
Unintentional. I've been drafting pieces of this over the last year and wasn't careful enough with proofreading. I'll fix that momentarily...
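[For reference, a single-pass sketch that stays O(N) whether or not a key function is given -- an illustration of the point above, not the PEP's actual implementation:]

    def grouping(iterable, key=None):
        """Group items into a dict of lists in one pass, without sorting."""
        if key is None:
            key = lambda x: x  # group equal elements together
        groups = {}
        for item in iterable:
            groups.setdefault(key(item), []).append(item)
        return groups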
Finally, the first example under "Group and Aggregate" is described as a dict of sets but it actually returns a dict of (sorted) lists.
Doctest complained at the set ordering, so I sorted for printing. You're not the only one to make that point, so I'll use sets for the example and ignore doctest.

Thanks for reading!
-- Michael

PS. I just pushed an update to the GitHub repo, as per these comments.
On Fri, Jun 29, 2018 at 10:54 AM Michael Selik <mike@selik.org> wrote:
Hello,
I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
As a teacher, I've found that grouping is one of the most awkward tasks for beginners to learn in Python. While this proposal requires understanding a key-function, in my experience that's easier to teach than the nuances of setdefault or defaultdict. Defaultdict requires passing a factory function or class, similar to a key-function. Setdefault is awkwardly named and requires a discussion of references and mutability. Those topics are important and should be covered, but I'd like to let them sink in gradually. Grouping often comes up as a question on the first or second day, especially for folks transitioning from Excel.
I've tested this proposal on actual students (no students were harmed during experimentation) and found that the majority appreciate it. Some are even able to guess what it does (would do) without any priming.
Thanks for your time, -- Michael
On Thu, Jun 28, 2018 at 8:38 AM Michael Selik <mike@selik.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
    student_by_school = defaultdict(list)
    for student, school in student_school_list:
        student_by_school[school].append(student)
Thank you for bringing this up. I've been drafting a proposal for a better grouping / group-by operation for a little while. I'm not quite ready to share it, as I'm still researching use cases.
I'm +1 that this task needs improvement, but -1 on this particular solution.
-- --Guido van Rossum (python.org/~guido)
On Fri, Jun 29, 2018 at 3:23 PM Michael Selik <mike@selik.org> wrote:
On Fri, Jun 29, 2018 at 2:43 PM Guido van Rossum <guido@python.org> wrote:
On a quick skim I see nothing particularly objectionable or controversial in your PEP, except I'm unclear why it needs to be a class method on `dict`.
Since it constructs a basic dict, I thought it belongs best as a dict constructor like dict.fromkeys. It seemed to match other classmethods like datetime.now.
It doesn't strike me as important enough. Surely not every stdlib function that returns a fresh dict needs to be a class method on dict!
Adding something to a builtin like this is rather heavy-handed.
I included an alternate solution of a new class, collections.Grouping, which has some advantages. In addition to having less of that "heavy-handed" feel to it, the class can have a few utility methods that help handle more use cases.
Hm, this actually feels heavier to me. But then again I never liked or understood the need for Counter -- I prefer basic data types and helper functions over custom abstractions. (Also your description doesn't do it justice: you describe a class using a verb phrase, "consume a sequence and construct a Mapping". The key to Grouping seems to me that it is a dict subclass with a custom constructor. But you don't explain why a subclass is needed, and in that sense I like the other approach better.)

But I still think it is much better off as a helper function in itertools.
Is there a really good reason why it can't be a function in `itertools`?
(I don't think that it's relevant that it doesn't return an iterator -- it takes in an iterator.)
I considered placing it in the itertools module, but decided against because it doesn't return an iterator. I'm open to that if that's the consensus.
You'll never get consensus on anything here, but you have my blessing for this without consensus.
Also, your pure-Python implementation appears to be O(N log N) if key is
None but O(N) otherwise; and the version for key is None uses an extra temporary array of size N. Is that intentional?
Unintentional. I've been drafting pieces of this over the last year and wasn't careful enough with proofreading. I'll fix that momentarily...
Such are the dangers of premature optimization. :-)
Finally, the first example under "Group and Aggregate" is described as a
dict of sets but it actually returns a dict of (sorted) lists.
Doctest complained at the set ordering, so I sorted for printing. You're not the only one to make that point, so I'll use sets for the example and ignore doctest.
Thanks for reading! -- Michael
PS. I just pushed an update to the GitHub repo, as per these comments.
Good luck with your PEP. If it is to go into itertools the biggest hurdle will be convincing Raymond, and I'm not going to overrule him on this: you and he are the educators here so hopefully you two can agree. --Guido
On Fri, Jun 29, 2018 at 10:54 AM Michael Selik <mike@selik.org> wrote:
Hello,
I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
As a teacher, I've found that grouping is one of the most awkward tasks for beginners to learn in Python. While this proposal requires understanding a key-function, in my experience that's easier to teach than the nuances of setdefault or defaultdict. Defaultdict requires passing a factory function or class, similar to a key-function. Setdefault is awkwardly named and requires a discussion of references and mutability. Those topics are important and should be covered, but I'd like to let them sink in gradually. Grouping often comes up as a question on the first or second day, especially for folks transitioning from Excel.
I've tested this proposal on actual students (no students were harmed during experimentation) and found that the majority appreciate it. Some are even able to guess what it does (would do) without any priming.
Thanks for your time, -- Michael
On Thu, Jun 28, 2018 at 8:38 AM Michael Selik <mike@selik.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
Thank you for bringing this up. I've been drafting a proposal for a better grouping / group-by operation for a little while. I'm not quite ready to share it, as I'm still researching use cases.
I'm +1 that this task needs improvement, but -1 on this particular solution.
-- --Guido van Rossum (python.org/~guido)
-- --Guido van Rossum (python.org/~guido)
On 30 June 2018 at 16:25, Guido van Rossum <guido@python.org> wrote:
On Fri, Jun 29, 2018 at 3:23 PM Michael Selik <mike@selik.org> wrote:
I included an alternate solution of a new class, collections.Grouping, which has some advantages. In addition to having less of that "heavy-handed" feel to it, the class can have a few utility methods that help handle more use cases.
Hm, this actually feels heavier to me. But then again I never liked or understood the need for Counter -- I prefer basic data types and helper functions over custom abstractions. (Also your description doesn't do it justice: you describe a class using a verb phrase, "consume a sequence and construct a Mapping". The key to Grouping seems to me that it is a dict subclass with a custom constructor. But you don't explain why a subclass is needed, and in that sense I like the other approach better.)
I'm not sure if the draft was updated since you looked at it, but it does mention that one benefit of the collections.Grouping approach is being able to add native support for mapping a callable across every individual item in the collection (ignoring the group structure), as well as for applying aggregate functions to reduce the groups to single values in a standard dict. Delegating those operations to the container API that way then means that other libraries can expose classes that implement the grouping API, but with a completely different backend storage model.
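[A rough sketch of the kind of class being described here -- a dict subclass with map/aggregate conveniences. This is purely illustrative: the constructor and method names are assumptions, not the draft PEP's actual API.]

    class Grouping(dict):
        """Illustrative only: a dict whose values are lists of grouped items."""

        @classmethod
        def grouped(cls, iterable, key=None):
            key = key or (lambda x: x)
            self = cls()
            for item in iterable:
                self.setdefault(key(item), []).append(item)
            return self

        def map(self, func):
            """Apply func to every individual item, keeping the group structure."""
            return Grouping((k, [func(v) for v in vs]) for k, vs in self.items())

        def aggregate(self, func):
            """Reduce each group to a single value, returning a plain dict."""
            return {k: func(vs) for k, vs in self.items()}

    # e.g. longest word for each initial letter:
    # Grouping.grouped(words, key=lambda w: w[0]).aggregate(max)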
But I still think it is much better off as a helper function in itertools.
I thought we actually had an open enhancement proposal for adding a "defaultdict.freeze" operation that switched it over to raising KeyError the same way a normal dict does, but I can't seem to find it now.

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I made some heavy revisions to the PEP. Linking again for convenience. https://github.com/selik/peps/blob/master/pep-9999.rst

Replying to Guido, Nick, David, Chris, and Ivan in 4 sections below.

[Guido]
On Fri, Jun 29, 2018 at 11:25 PM Guido van Rossum <guido@python.org> wrote:
On Fri, Jun 29, 2018 at 3:23 PM Michael Selik <mike@selik.org> wrote:
On Fri, Jun 29, 2018 at 2:43 PM Guido van Rossum <guido@python.org> wrote:
On a quick skim I see nothing particularly objectionable or controversial in your PEP, except I'm unclear why it needs to be a class method on `dict`.
Since it constructs a basic dict, I thought it belongs best as a dict constructor like dict.fromkeys. It seemed to match other classmethods like datetime.now.
It doesn't strike me as important enough. Surely not every stdlib function that returns a fresh dict needs to be a class method on dict!
Thinking back, I may have chosen the name "groupby" first, following `itertools.groupby`, SQL, and other languages, and I wanted to make a clear distinction from `itertools.groupby`. Putting it on the `dict` namespace clarified that it's returning a dict. However, naming it `grouping` allows it to be a stand-alone function.

But I still think it is much better off as a helper function in itertools.
I considered placing it in the itertools module, but decided against
because it doesn't return an iterator. I'm open to that if that's the consensus.
You'll never get consensus on anything here, but you have my blessing for this without consensus.
That feels like a success, but I'm going to be a bit more ambitious and try to persuade you that `grouping` belongs in the built-ins. I revised my draft to streamline the examples and make a clearer comparison with existing tools.

[Nick]
On Sat, Jun 30, 2018 at 2:01 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
I'm not sure if the draft was updated since [Guido] looked at it, but it
does mention that one benefit of the collections.Grouping approach is being able to add native support for mapping a callable across every individual item in the collection (ignoring the group structure), as well as for applying aggregate functions to reduce the groups to single values in a standard dict.
Delegating those operations to the container API that way then means that other libraries can expose classes that implement the grouping API, but with a completely different backend storage model.
While it'd be nice to create a standard interface as you point out, my primary goal is to create an "obvious" way for both beginners and experts to group, classify, categorize, bucket, demultiplex, taxonomize, etc. I started revising the PEP last night and found myself getting carried away with adding methods to the Grouping class that were more distracting than useful. Since the most important thing is to make this as accessible and easy as possible, I re-focused the proposal on the core idea of grouping.

[Ivan, Chris, David]
On Sun, Jul 1, 2018 at 7:29 PM David Mertz <mertz@gnosis.cx> wrote:
    {k:set(v) for k,v in deps.items()}
    {k:Counter(v) for k,v in deps.items()}
I had dropped those specific examples in favor of generically "func(g)", but added them back. Your discussion with Ivan and Chris showed that it was useful to be specific.

[Chris]
On Sat, Jun 30, 2018 at 10:18 PM Chris Barker <chris.barker@noaa.gov> wrote:
I'm really warming to the: Alternate: collections.Grouping version -- I really like this as a kind of custom mapping, rather than "just a function" (or alternate constructor) -- and I like your point that it can have a bit of functionality built in other than on construction.
I moved ``collections.Grouping`` to the "Rejected Alternatives" section, but that's more like a "personal 2nd choices" instead of "rejected". [...]
__init__ and update would take an iterable of (key, value) pairs, rather than a single sequence.
I added a better demonstration in the PEP for handling that kind of input. You have one of two strategies with my proposed function.

Either create a reverse lookup dict:

    d = {v: k for k, v in items}
    grouping(d, key=lambda k: d[k])

Or discard the keys after grouping:

    groups = grouping(items, key=lambda t: t[0])
    groups = {k: [v for _, v in g] for k, g in groups.items()}

While thinking of examples for this PEP, it's tempting to use overly-simplified data. In practice, instead of (key, value) pairs, it's usually either individual values or n-tuple rows. In the latter case, sometimes the key should be dropped from the row when grouping, sometimes kept in the row, and sometimes the key must be computed from multiple values within the row.

[...] building up a data structure with word pairs, and a list of all the
words that follow the pair in a piece of text. [...example code...]
I provided a similar example in my first draft, showing the creation of a Markov chain data structure. A few folks gave the feedback that it was more distracting from the PEP than useful. It's still there in the "stateful key-function" example, but it's now just a few lines. [...] if you are teaching, say data analysis with Python -- it might be
nice to have this builtin, but if you are teaching "programming with Python" I'd probably encourage them to do it by hand first anyway :-)
I agree, but users in both cases will appreciate the proposed built-in.

On Sun, Jul 1, 2018 at 10:35 PM Chris Barker <chris.barker@noaa.gov> wrote:
Though maybe list, set and Counter are the [aggregation collections] you'd want to use?
I've been searching the standard library and popular community libraries for use of setdefault, defaultdict, groupby, and the word "group" or "groups" periodically over the past year or so. I admit I haven't been as systematic as maybe I should have been, but I feel like I've been pretty thorough. The majority of grouping uses a list. A significant portion use a set. A handful use a Counter. And that's basically it. Sometimes there's a specialized container class, but they are generally composed of a list, set, or Counter. There may have been other types, but if it was interesting, I think I'd have written down an example of it in my notes. Most other languages with a similar tool have decided to return a mapping of lists or the equivalent for that language. If we make that choice, we're in good company. [...]
before making any decisions about the best API, it would probably be a good idea to collect examples of the kind of data that people really do need to group like this. Does it come in (key, value) pairs naturally? or in one big sequence with a key function that's easy to write? who knows without examples of real world use cases.
It may not come across in the PEP how much research I've put into this. I'll take some time to compile the evidence, but I'm confident that it's more common to need a key-function than to have (key, value) pairs. I'll get back to you soon(ish) with data.

-- Michael

PS. Not to bikeshed, but a Grouper is a kind of fish. :-)
I think the current default is quite weird, as it pretty much amounts to a count() of each key (which can be useful, but not really what I expect from a grouping). I would prefer a default that might return an error over a default that says OK and outputs something that is not what I might want. For example, the default could be such that grouping unpacks (key, value) tuples from the iterator and does what's expected with them (groups values by key). That is quite reasonable, and you have one example with (key, value) pairs in your draft, and no example with the current default. It also allows syntax of the kind
grouping((food_type, food_name for food_type, food_name in foods))
which is pretty nice to have. -- Nicolas Rolin 2018-07-02 9:43 GMT+02:00 Michael Selik <mike@selik.org>:
I made some heavy revisions to the PEP. Linking again for convenience. https://github.com/selik/peps/blob/master/pep-9999.rst
Replying to Guido, Nick, David, Chris, and Ivan in 4 sections below.
[Guido] On Fri, Jun 29, 2018 at 11:25 PM Guido van Rossum <guido@python.org> wrote:
On Fri, Jun 29, 2018 at 3:23 PM Michael Selik <mike@selik.org> wrote:
On Fri, Jun 29, 2018 at 2:43 PM Guido van Rossum <guido@python.org> wrote:
On a quick skim I see nothing particularly objectionable or controversial in your PEP, except I'm unclear why it needs to be a class method on `dict`.
Since it constructs a basic dict, I thought it belongs best as a dict constructor like dict.fromkeys. It seemed to match other classmethods like datetime.now.
It doesn't strike me as important enough. Surely not every stdlib function that returns a fresh dict needs to be a class method on dict!
Thinking back, I may have chosen the name "groupby" first, following `itertools.groupby`, SQL, and other languages, and I wanted to make a clear distinction from `itertools.groupby`. Putting it on the `dict` namespace clarified that it's returning a dict.
However, naming it `grouping` allows it to be a stand-alone function.
But I still think it is much better off as a helper function in itertools.
I considered placing it in the itertools module, but decided against
because it doesn't return an iterator. I'm open to that if that's the consensus.
You'll never get consensus on anything here, but you have my blessing for this without consensus.
That feels like a success, but I'm going to be a bit more ambitious and try to persuade you that `grouping` belongs in the built-ins. I revised my draft to streamline the examples and make a clearer comparison with existing tools.
[Nick] On Sat, Jun 30, 2018 at 2:01 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
I'm not sure if the draft was updated since [Guido] looked at it, but it
does mention that one benefit of the collections.Grouping approach is being able to add native support for mapping a callable across every individual item in the collection (ignoring the group structure), as well as for applying aggregate functions to reduce the groups to single values in a standard dict.
Delegating those operations to the container API that way then means that other libraries can expose classes that implement the grouping API, but with a completely different backend storage model.
While it'd be nice to create a standard interface as you point out, my primary goal is to create an "obvious" way for both beginners and experts to group, classify, categorize, bucket, demultiplex, taxonomize, etc. I started revising the PEP last night and found myself getting carried away with adding methods to the Grouping class that were more distracting than useful. Since the most important thing is to make this as accessible and easy as possible, I re-focused the proposal on the core idea of grouping.
[Ivan, Chris, David] On Sun, Jul 1, 2018 at 7:29 PM David Mertz <mertz@gnosis.cx> wrote:
    {k:set(v) for k,v in deps.items()}
    {k:Counter(v) for k,v in deps.items()}
I had dropped those specific examples in favor of generically "func(g)", but added them back. Your discussion with Ivan and Chris showed that it was useful to be specific.
[Chris] On Sat, Jun 30, 2018 at 10:18 PM Chris Barker <chris.barker@noaa.gov> wrote:
I'm really warming to the: Alternate: collections.Grouping version -- I really like this as a kind of custom mapping, rather than "just a function" (or alternate constructor) -- and I like your point that it can have a bit of functionality built in other than on construction.
I moved ``collections.Grouping`` to the "Rejected Alternatives" section, but that's more like a "personal 2nd choices" instead of "rejected".
[...]
__init__ and update would take an iterable of (key, value) pairs, rather than a single sequence.
I added a better demonstration in the PEP for handling that kind of input. You have one of two strategies with my proposed function.
Either create a reverse lookup dict:

    d = {v: k for k, v in items}
    grouping(d, key=lambda k: d[k])
Or discard the keys after grouping:

    groups = grouping(items, key=lambda t: t[0])
    groups = {k: [v for _, v in g] for k, g in groups.items()}
While thinking of examples for this PEP, it's tempting to use overly-simplified data. In practice, instead of (key, value) pairs, it's usually either individual values or n-tuple rows. In the latter case, sometimes the key should be dropped from the row when grouping, sometimes kept in the row, and sometimes the key must be computed from multiple values within the row.
[...] building up a data structure with word pairs, and a list of all the
words that follow the pair in a piece of text. [...example code...]
I provided a similar example in my first draft, showing the creation of a Markov chain data structure. A few folks gave the feedback that it was more distracting from the PEP than useful. It's still there in the "stateful key-function" example, but it's now just a few lines.
[...] if you are teaching, say data analysis with Python -- it might be
nice to have this builtin, but if you are teaching "programming with Python" I'd probably encourage them to do it by hand first anyway :-)
I agree, but users in both cases will appreciate the proposed built-in.
On Sun, Jul 1, 2018 at 10:35 PM Chris Barker <chris.barker@noaa.gov> wrote:
Though maybe list, set and Counter are the [aggregation collections] you'd want to use?
I've been searching the standard library and popular community libraries for use of setdefault, defaultdict, groupby, and the word "group" or "groups" periodically over the past year or so. I admit I haven't been as systematic as maybe I should have been, but I feel like I've been pretty thorough.
The majority of grouping uses a list. A significant portion use a set. A handful use a Counter. And that's basically it. Sometimes there's a specialized container class, but they are generally composed of a list, set, or Counter. There may have been other types, but if it was interesting, I think I'd have written down an example of it in my notes.
Most other languages with a similar tool have decided to return a mapping of lists or the equivalent for that language. If we make that choice, we're in good company.
[...]
before making any decisions about the best API, it would probably be a good idea to collect examples of the kind of data that people really do need to group like this. Does it come in (key, value) pairs naturally? or in one big sequence with a key function that's easy to write? who knows without examples of real world use cases.
It may not come across in the PEP how much research I've put into this. I'll take some time to compile the evidence, but I'm confident that it's more common to need a key-function than to have (key, value) pairs. I'll get back to you soon(ish) with data.
-- Michael
PS. Not to bikeshed, but a Grouper is a kind of fish. :-)
--
Nicolas Rolin | Data Scientist
+33 631992617 - nicolas.rolin@tiime.fr
15 rue Auber, 75009 Paris
www.tiime.fr
On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I think the current default is quite weird, as it pretty much amounts to a count() of each key (which can be useful, but not really what I expect from a grouping). I would prefer a default that might return an error over a default that says OK and outputs something that is not what I might want. For example, the default could be such that grouping unpacks (key, value) tuples from the iterator and does what's expected with them (groups values by key). That is quite reasonable, and you have one example with (key, value) pairs in your draft, and no example with the current default. It also allows syntax of the kind
grouping((food_type, food_name for food_type, food_name in foods))
which is pretty nice to have.
I'm of two minds on this point. First, I agree that it'd be nice to handle the (key, value) pair case more elegantly. It comes to mind often when writing examples, even if proportionally less in practice.

Second, I'll paraphrase "Jakob's Law of the Internet User Experience" -- users spend most of their time using *other* functions. Because itertools.groupby and other functions in Python established a standard for the behavior of key-functions, I want to keep that standard.

Third, some classes might have a rich equality method that allows many interesting values to all wind up in the same group even if using the default "identity" key-function.

Thanks for the suggestion. I'll include it in the PEP, at least for documenting all reasonable options.
On Mon, Jul 2, 2018 at 2:52 AM Michael Selik <mike@selik.org> wrote:
On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I think the current default is quite weird, as it pretty much amounts to a count() of each key (which can be useful, but not really what I expect from a grouping). I would prefer a default that might return an error over a default that says OK and outputs something that is not what I might want. For example, the default could be such that grouping unpacks (key, value) tuples from the iterator and does what's expected with them (groups values by key). That is quite reasonable, and you have one example with (key, value) pairs in your draft, and no example with the current default. It also allows syntax of the kind
grouping((food_type, food_name for food_type, food_name in foods))
which is pretty nice to have.
I'm of two minds on this point. First, I agree that it'd be nice to handle the (key, value) pair case more elegantly. It comes to mind often when writing examples, even if proportionally less in practice.
Second, I'll paraphrase "Jakob's Law of the Internet User Experience" -- users spend most of their time using *other* functions. Because itertools.groupby and other functions in Python established a standard for the behavior of key-functions, I want to keep that standard.
Third, some classes might have a rich equality method that allows many interesting values to all wind up in the same group even if using the default "identity" key-function.
Thanks for the suggestion. I'll include it in the PEP, at least for documenting all reasonable options.
It might not be pure (does not default to an identity key-function), but it sure seems practical. ::

    from itertools import groupby as _groupby
    from operator import itemgetter

    def grouping(iterable, key=itemgetter(0)):
        '''
        Group elements of an iterable into a dict of lists.

        The ``key`` is a function computing a key value for each element.
        Each key corresponds to a group -- a list of elements in the same
        order as encountered. By default, the key-function gets the 0th
        index of each element.

            >>> grouping(['apple', 'banana', 'aardvark'])
            {'a': ['apple', 'aardvark'], 'b': ['banana']}

        '''
        groups = {}
        for k, g in _groupby(iterable, key):
            groups.setdefault(k, []).extend(g)
        return groups
My question would be: does it have to be a key function? Can't we just remove the "key" argument? Because for pretty much all the given examples, I would find my default as readable and nearly as short as the "key" syntax:
grouping(words, key=len) grouping((len(word), word for word in words))
grouping(names, key=itemgetter(0)) grouping((name_initial, name_initial+_name for name_initial, *_name in names)
grouping(contacts, key=itemgetter('city') grouping((contact['city'], contact for contact in contacts)
grouping(employees, key=itemgetter('department')) grouping((employee['department'], employee for employee in employees)
grouping(os.listdir('.'), key=lambda filepath: os.path.splitext(filepath)[1]) grouping((os.path.splitext(filepath)[1]), filepath for filepath in os.listdir('.'))
grouping(transactions, key=lambda v: 'debit' if v > 0 else 'credit') grouping(('debit' if v > 0 else 'credit', transaction_amount for transaction_amount in transactions))
The code is slightly more verbose, but it is akin to filter(iterable, function) vs (i for i in iterable if function(i)). -- Nicolas Rolin 2018-07-02 11:52 GMT+02:00 Michael Selik <mike@selik.org>:
On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I think the current default is quite weird, as it pretty much amounts to a count() of each key (which can be useful, but not really what I expect from a grouping). I would prefer a default that might return an error over a default that says OK and outputs something that is not what I might want. For example, the default could be such that grouping unpacks (key, value) tuples from the iterator and does what's expected with them (groups values by key). That is quite reasonable, and you have one example with (key, value) pairs in your draft, and no example with the current default. It also allows syntax of the kind
grouping((food_type, food_name for food_type, food_name in foods))
which is pretty nice to have.
I'm of two minds on this point. First, I agree that it'd be nice to handle the (key, value) pair case more elegantly. It comes to mind often when writing examples, even if proportionally less in practice.
Second, I'll paraphrase "Jakob's Law of the Internet User Experience" -- users spend most of their time using *other* functions. Because itertools.groupby and other functions in Python established a standard for the behavior of key-functions, I want to keep that standard.
Third, some classes might have a rich equality method that allows many interesting values to all wind up in the same group even if using the default "identity" key-function.
Thanks for the suggestion. I'll include it in the PEP, at least for documenting all reasonable options.
--
Nicolas Rolin | Data Scientist
+33 631992617 - nicolas.rolin@tiime.fr
15 rue Auber, 75009 Paris
www.tiime.fr
On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin@tiime.fr>
wrote:
For example, the default could be such that grouping unpacks (key, value) tuples from the iterator and does what's expected with them (groups values by key). That is quite reasonable, and you have one example with (key, value) pairs in your draft, and no example with the current default.
On Mon, Jul 2, 2018 at 3:22 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
My question would be: does it have to be a key function? Can't we just remove the "key" argument?
In the examples or from the parameters? A key function is necessary to support a wide variety of uses.

Because for pretty much all the given examples, I would find my default as
readable and nearly as short as the "key" syntax :
grouping(words, key=len) grouping((len(word), word for word in words))
I think the fact that you misplaced a closing parenthesis demonstrates how the key-function pattern can be more clear.

The code is slightly more verbose, but it is akin to filter(iterable,
function) vs (i for i in iterable if function(i)).
Sometimes I prefer ``map`` and sometimes I prefer a list comprehension. It usually hinges on whether I think the reader might get confused over what one of the elements is. If so, I like to write out the comprehension to provide that extra variable name for clarity.

I'd write:

    map(len, words)

But I'd also write:

    [len(fullname) for fullname in contacts]

I appreciate that defaulting the grouping key-function to ``itemgetter(0)`` would enable a pleasant flexibility for people to make that same choice for each use. I haven't fully come around to that, yet, because so many other tools use the equality function as the default.

On Mon, Jul 2, 2018 at 3:48 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Jul 02, 2018 at 02:52:03AM -0700, Michael Selik wrote:
Third, some classes might have a rich equality method that allows many interesting values to all wind up in the same group even if using the default "identity" key-function.
I would expect an identity key function to group by *identity* (is), not equality. But I would expect the default grouper to group by *equality*.
Yep, I should have been saying "equality function" instead of "identity function." Thanks for the clarification.
On Mon, Jul 2, 2018 at 2:32 AM Nicolas Rolin <nicolas.rolin@tiime.fr>
wrote:
For example, the default could be such that grouping unpacks (key, value) tuples from the iterator and does what's expected with them (groups values by key). That is quite reasonable, and you have one example with (key, value) pairs in your draft, and no example with the current default.
I agree, the default should do something that has some chance of being useful on its own, and ideally, the most "common" use, if we can identify that.

On Mon, Jul 2, 2018 at 9:39 AM, Michael Selik <mike@selik.org> wrote:
On Mon, Jul 2, 2018 at 3:22 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
My question would be: does it have to be a key function? Can't we just remove the "key" argument?
In the examples or from the parameters? A key function is necessary to support a wide variety of uses.
Not if you have the expectation of an iterable of (key, value) pairs as the input -- then any processing required to get a different key happens beforehand, allowing folks to use comprehension syntax, as so: :-)

Because for pretty much all the given examples, I would find my default as
readable and nearly as short as the "key" syntax :
grouping(words, key=len) grouping((len(word), word for word in words))
I think the fact that you misplaced a closing parenthesis demonstrates how the key-function pattern can be more clear.
I think it demonstrates that you shouldn't post untested code :-) -- the missing paren is a syntax error -- it would be caught right away in real life.
The code is slightly more verbose, but it is akin to filter(iterable,
function) vs (i for i in iterable if function(i)).
Sometimes I prefer ``map`` and sometimes I prefer a list comprehension.
That is a "problem" with python: there are two ways to do things like map and filter, and one way is not always the clearest choice. But I wonder if map and filter would exist if they didn't pre-date comprehensions..... That aside, the comprehension approach is pretty well liked and useful. And almost always prefer it -- an expression is simple on the eyes to me :-) But when it's really a win is when you don't have a handy built-in function to do what you want, even though it's simple expression. With the map, filter, key approach, you have to write a bunch of little utility functions or lambdas, which can really clutter up the code. If so, I like to write out the comprehension to provide that extra variable
name for clarity.
I'd write: map(len, words)
But I'd also write [len(fullname) for fullname in contacts]
A key (pun intended) issue is that passing functions around looks so neat and clean when it's a simple built-in function like "len" -- but if not, then it gets uglier, like:

    map(lambda name: name.first_name, all_names)

vs

    [name.first_name for name in all_names]

I really like the comprehension form much better when what you really want is a simple expression like an attribute access or index or simple calculation, or ....

I appreciate that defaulting the grouping key-function to ``itemgetter(0)``
would enable a pleasant flexibility for people to make that same choice for each use. I haven't fully come around to that, yet, because so many other tools use the equality function as the default.
Well, kinda, but maybe those aren't "pythonic" :-) (And yes, itemgetter() is probably a better option than lambda in many cases -- but I always forget that it exists.)

I started out in this topic answering a question about how to do a grouping for a list of tuples; in that case the OP wanted a comprehension. I don't think there's a good way to get a direct comprehension, but I do think we can make a class or function that takes something you could build with a comprehension.

And I took a look at itertools.groupby, and found it very, very awkward, ultimately arriving at:

    student_school_list = [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'),
                           ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]
    grouped = {a: [t[0] for t in b]
               for a, b in groupby(sorted(student_school_list, key=lambda t: t[1]),
                                   key=lambda t: t[1])}

    {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

So why is that so painful? ("it" is itertools.groupby)

a) it returns an iterable of tuples, so to get a dict you need to do the dict comp
b) it requires the data to be sorted -- so you need to sort it first
c) I need to provide a key function to sort by
d) I need to provide (the same, always?) key function to group by
e) when I make the dict, I need to make the list, and use an expression to get the value I want.
f) because I need those key functions, I need to use lambda for what could be a simple expression

So the current proposal in the PEP makes that a lot better:

a) It makes a dict, so that step is done
b) It doesn't require the data to be sorted

but:

d) I still need the key function to do anything useful
e) If my data aren't clean, I need to do some post-processing to get the value I want.

So -- the above example with the (current version of the) PEP's function:

    In [35]: grouping(student_school_list, key=lambda t: t[1])
    Out[35]:
    {'SchoolA': [('Fred', 'SchoolA'), ('Mary', 'SchoolA')],
     'SchoolB': [('Bob', 'SchoolB'), ('Jane', 'SchoolB')],
     'SchoolC': [('Nancy', 'SchoolC')]}

Darn! that's not right -- I need to clean up the values, too:

    gr = {k: [t[0] for t in l] for k, l in gr.items()}

OK, but pretty painful really, so I guess I should clean it up first, but I can't actually see any clean way to do that! Am I missing something?

Ah -- I see it in your PEP: "Sequences of values that are already paired with their keys can be easily transformed after grouping." -- sorry, that code is not "easily" -- having to write that kind of code makes this whole thing pretty much useless compared to using, say, setdefault() in the first place.

One option would be to add a value function, to unpack the value, analogous to the key function:

    In [44]: gr = grouping(student_school_list, key=lambda t: t[1], value=lambda t: t[0])
    Out[45]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

That's pretty slick.
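[A minimal sketch of that key-plus-value-function variant -- an illustration of the idea above, not the PEP's API and not an actual prototype:]

    def grouping(iterable, key, value=lambda item: item):
        # Group by key(item); store value(item) in each group.
        groups = {}
        for item in iterable:
            groups.setdefault(key(item), []).append(value(item))
        return groups

    student_school_list = [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA')]
    grouping(student_school_list, key=lambda t: t[1], value=lambda t: t[0])
    # {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob']}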
However, I still much prefer an API that assumes an iterator of (key, value) pairs:

    def grouping(iterable):
        groups = {}
        for key, value in iterable:
            groups.setdefault(key, []).append(value)
        return groups

(easy to implement :-) )

And now you get something that "just works" for at least one case:

    In [54]: def grouping(iterable):
        ...:     groups = {}
        ...:     for key, value in iterable:
        ...:         groups.setdefault(key, []).append(value)
        ...:     return groups
        ...:

    In [55]: school_student_list
    Out[55]:
    [('SchoolA', 'Fred'), ('SchoolB', 'Bob'), ('SchoolA', 'Mary'),
     ('SchoolB', 'Jane'), ('SchoolC', 'Nancy')]

    In [56]: grouping(school_student_list)
    Out[56]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

And if you need to massage the data you can do so with a generator expression:

    In [58]: grouping((reversed(t) for t in student_school_list))
    Out[58]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

And here are the examples from the PEP (untested -- I may have missed some brackets, etc.):

    # Group words by length:
    grouping(((len(word), word) for word in words))

    # Group names by first initial:
    grouping((name[0], name) for name in names))

    # Group people by city:
    grouping((contact.city, contact) for contact in contacts)

    # Group employees by department:
    grouping((employee['department'] for employee in employees)

    # Group files by extension:
    grouping((os.path.splitext(filepath)[1] for filepath in os.listdir('.')))

    # Group transactions by type:
    grouping(( 'debit' if v > 0 else 'credit' for v in transactions))

    # Invert a dictionary, d, without discarding repeated values:
    grouping(((v, k) for v, k in d.items()))

So that was an interesting exercise -- many of those are a bit clearer (or more compact) with the key function. But I also notice a pattern -- all those examples fit very well into the key function pattern: you want the entire item stored in your iterable, and you want to group by some quality of the item itself.

Perhaps those ARE the most common use cases -- I honestly don't know, but from an earlier post: "In practice, instead of (key, value) pairs, it's usually either individual values or n-tuple rows. In the latter case, sometimes the key should be dropped from the row when grouping, sometimes kept in the row, and sometimes the key must be computed from multiple values within the row."

It seems that the comprehension style I'm suggesting would be a win for the case of n-tuple rows.

Doing this exercise has expanded my view, so I suggest that:

- keep the key function optional parameter.
- add a value function optional parameter -- it really makes any case where you don't want to store the whole item a lot easier.
- have the default key function be itemgetter(0) and the default value function be itemgetter(1) (or some similar way to implement default support for processing an iterable of (key, value) pairs).

Having no value function and an equality default for the key function may be "common", but it's also pretty much useless -- why have a useless default?

Thinking this through now I do see that having key and value default to the pair method means that if you specify a key function, you will probably have to specify a value function as well -- so maybe that's not ideal. hmm.

A couple other comments:

Implementation detail: do you gain anything by using the itertools groupby
over, say:

    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)

Final point: I still prefer the class idea over a utility function, because:

* with a class, you can add stuff to the grouping later:

      a_grouping['key'] = value

  or maybe

      a_grouping.add(item)

* with a class, you can add utility methods -- I kinda liked that in your original PEP.

I see this in the section about a custom class: "Merging groupings is not a one-liner" -- I'm pretty sure the update() method on my prototype was a merge operation, so very easy :-) -- and another argument for a custom class.

Final thought about the custom class: if you want to be able to do

    a_grouping['key'] = value

then that really reinforces the "natural" mapping of keys to values -- it's kind of another way to think about this. Rather than thinking about it as a "groupby" function that creates a dict, think of it as a dict-like object that, instead of writing over an existing key, accumulates the multiple values. Which means you really want the "normal" constructor to take an iterable of (key, value) pairs, to make it more dict-like.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception

Chris.Barker@noaa.gov
On Mon, Jul 2, 2018 at 11:50 PM, Chris Barker <chris.barker@noaa.gov> wrote:
- keep the key function optional parameter. - add a value function optional parameter. -- it really makes any case where you don't want to store the whole item a lot easier.
- Have the default key function be itemgetter(0) and the default value function be itemgetter(1) (or some similar way to implement default support for processing an iterable of (key, value) pairs.
Having no value function and an equality default for the key function may be "common", but it's also pretty much useless -- why have a useless default?
Thinking this through now I do see that having key and value default to the pair method means that if you specify a key function, you will probably have to specify a value function as well -- so maybe that's not ideal.
OK, I prototyped a class solution that defaults to (key, value) pairs, but you can specify a key and/or value function. And, with convoluted logic, if you specify just a key, then the value defaults to the entire item. So folks will pretty much get what they expect. I think it's actually pretty slick -- best of both worlds?

Code here: https://github.com/PythonCHB/grouper/blob/master/grouper/grouper.py

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception

Chris.Barker@noaa.gov
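[For readers who don't follow the link, here is a rough reconstruction of the behaviour described above. This is a sketch based only on the description in this thread, not the code in that repository; the class name Grouper and the exact defaulting rules are assumptions.]

    class Grouper(dict):
        """Sketch of a dict-like accumulator: assignment appends to a key's group."""

        def __init__(self, iterable=(), key=None, value=None):
            super().__init__()
            if key is None and value is None:
                # Default: treat the input as an iterable of (key, value) pairs.
                key, value = (lambda kv: kv[0]), (lambda kv: kv[1])
            elif value is None:
                # A key function alone: store the whole item in its group.
                value = lambda item: item
            elif key is None:
                key = lambda kv: kv[0]
            for item in iterable:
                self[key(item)] = value(item)

        def __setitem__(self, group_key, item):
            # Accumulate instead of overwriting.
            self.setdefault(group_key, []).append(item)

        def update(self, other):
            # Merge another grouping, extending existing groups.
            for group_key, items in other.items():
                self.setdefault(group_key, []).extend(items)

    # Defaults to (key, value) pairs:
    Grouper([('SchoolA', 'Fred'), ('SchoolB', 'Bob'), ('SchoolA', 'Mary')])
    # {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob']}

    # With just a key function, whole items are stored:
    Grouper(['apple', 'banana', 'aardvark'], key=lambda word: word[0])
    # {'a': ['apple', 'aardvark'], 'b': ['banana']}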
On Tue, Jul 3, 2018 at 2:52 AM Chris Barker via Python-ideas < python-ideas@python.org> wrote:
I'd write:
map(len, words)
But I'd also write [len(fullname) for fullname in contacts]
map(lambda name: name.first_name, all_names) vs [name.first_name for name in all_names]
I really like the comprehension form much better when what you really want is a simple expression like an attribute access or index or simple calculation, or ....
Why not `map(attrgetter('first_name'), all_names)`?
    In [56]: grouping(school_student_list)
    Out[56]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}
This one case is definitely nice. However... And here are the examples from the PEP:
(untested -- I may hav missed some brackets, etc)
What you've missed, in *several* examples is the value part of the tuple in your API. You've pulled out the key, and forgotten to include anything in the actual groups. I have a hunch that if your API were used, this would be a common pitfall. I think this argues against your API and for Michael's that simply deals with "sequences of groupable things." That's much more like what one deals with in SQL, and is familiar that way. If the things grouped are compound object such as dictionaries, objects with common attributes, named tuples, etc. then the list of things in a group usually *does not* want the grouping attribute removed.
grouping(((len(word), word) for word in words))
grouping((name[0], name) for name in names))
grouping((contact.city, contact) for contact in contacts)
Good so far, but a lot of redundancy in always spelling tuple of `(derived-key, object)`.
grouping((employee['department'] for employee in employees)
grouping((os.path.splitext(filepath)[1] for filepath in os.listdir('.')))
grouping(('debit' if v > 0 else 'credit' for v in transactions))
And here you forget about the object itself 3 times in a row (or also forget some derived "value" that you might want in your other comments).
grouping(((v, k) for v, k in d.items()))
This is nice, and spelled correctly.
So that was an interesting exercise -- many of those are a bit clearer (or more compact) with the key function. But I also notice a pattern -- all those examples fit very well into the key function pattern:
Yep. I also think that the row-style "list of data" where you want to discard the key from the values is nicely spelled (in the PEP) as:

INDEX = 0
grouping(sequence, key=lambda row: row.pop(INDEX))

groups = {}
for item in iterable:
    groups.setdefault(key(item), []).append(item)
I agree this seems better as an implementation.
I still prefer the class idea over a utility function, because: * with a class, you can add stuff to the grouping later:
a_grouping['key'] = value
or maybe a_grouping.add(item) * with a class, you can add utility methods -- I kinda liked that in your original PEP.
I agree still (after all, I proposed it to Michael). But this seems minor, and Guido seems not to like `collections` that much (or at least he commented on not using Counter ... which I personally love to use and to teach). That said, a 'grouping()' function seems fine to me also... with a couple of utility functions (that need not be builtin, or even standard library necessarily) in place of methods. A lot of what the methods would do can easily be done using comprehensions as well; some examples are shown in the PEP. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
Guido said he has mooted this discussion, so it's probably not reaching him. It took one thousand fewer messages for him to stop following this than with PEP 572, for some reason :-). But before putting it on auto-archive, the BDFL said (1) NO GO on getting a new builtin; (2) NO OBJECTION to putting it in itertools. My problem with the second idea is that *I* find it very wrong to have something in itertools that does not return an iterator. It wrecks the combinatorial algebra of the module. That said, it's easy to fix... and I believe independently useful. Just make grouping() a generator function rather than a plain function. This lets us get an incremental grouping of an iterable. This can be useful if the iterable is slow or infinite, but the partial groupings are useful in themselves. Python 3.7.0 (default, Jun 28 2018, 07:39:16) [Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin Type "help", "copyright", "credits" or "license" for more information.
from grouping import grouping
grouped = grouping('AbBa', key=str.casefold)
for dct in grouped:
    print(dct)
...
{'a': ['A']}
{'a': ['A'], 'b': ['b']}
{'a': ['A'], 'b': ['b', 'B']}
{'a': ['A', 'a'], 'b': ['b', 'B']}
This isn't so useful for the concrete sequence, but for this it would be great:

for grouped in grouping(data_over_wire()):
    process_partial_groups(grouped)

The implementation need not and should not rely on "pre-grouping" with itertools.groupby:

def grouping(iterable, key=None):
    groups = {}
    key = key or (lambda x: x)
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
        yield groups

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Tue, Jul 3, 2018 at 9:23 AM David Mertz <mertz@gnosis.cx> wrote:
Guido said he has mooted this discussion, so it's probably not reaching him.
I meant 'muted'. Hopefully he hasn't 'mooted' it. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
I'd prefer to simply write an example for the documentation or clarify the existing ones, then add good answers to StackOverflow questions. On Tue, Jul 3, 2018, 6:23 AM David Mertz <mertz@gnosis.cx> wrote:
Guido said he has mooted this discussion, so it's probably not reaching him. It took one thousand fewer messages for him to stop following this than with PEP 572, for some reason :-).
But before putting it on auto-archive, the BDFL said (1) NO GO on getting a new builtin; (2) NO OBJECTION to putting it in itertools.
On Tue, Jul 03, 2018 at 09:23:07AM -0400, David Mertz wrote:
But before putting it on auto-archive, the BDFL said (1) NO GO on getting a new builtin; (2) NO OBJECTION to putting it in itertools.
My problem with the second idea is that *I* find it very wrong to have something in itertools that does not return an iterator. It wrecks the combinatorial algebra of the module.
That seems like a reasonable objection to me.
That said, it's easy to fix... and I believe independently useful. Just make grouping() a generator function rather than a plain function. This lets us get an incremental grouping of an iterable.
We already have something which lazily groups an iterable, returning groups as they are seen: groupby. What makes grouping() different from groupby() is that it accumulates ALL of the subgroups rather than just consecutive subgroupings. To make it clear with a simulated example (ignoring the keys for brevity):

groupby("aaAAbbCaAB", key=str.upper)
=> groups "aaAA", "bb", "C", "aA", "B"

grouping("aaAAbbCaAB", key=str.upper)
=> groups "aaAAaA", "bbB", "C"

So grouping() cannot even begin returning values until it has processed the entire data set. In that regard, it is like sorted() -- it cannot be lazy, it is a fundamentally eager operation. I propose that a better name which indicates the non-lazy nature of this function is *grouped* rather than grouping, like sorted(). As for where it belongs, perhaps the collections module is the least worst fit.
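A quick runnable illustration of that difference, using a minimal eager stand-in for the proposed function (the stdlib has no such helper today):

from itertools import groupby

def grouped(iterable, key=None):
    # eager: accumulates *all* groups before returning, like sorted()
    key = key or (lambda x: x)
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups

data = "aaAAbbCaAB"
[''.join(g) for _, g in groupby(data, key=str.upper)]
# ['aaAA', 'bb', 'C', 'aA', 'B']   -- only consecutive runs
grouped(data, key=str.upper)
# {'A': ['a', 'a', 'A', 'A', 'a', 'A'], 'B': ['b', 'b', 'B'], 'C': ['C']}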
This can be useful if the iterable is slow or infinite, but the partial groupings are useful in themselves.
Under what circumstances would the partial groupings be useful? Given the example above:

grouping("aaAAbbCaAB", key=str.upper)

when would you want to see the accumulated partial groups?

# again, ignoring the keys for brevity
"aaAA"
"aaAA", "bb"
"aaAA", "bb", "C"
"aaAAaA", "bb", "C"
"aaAAaA", "bbB", "C"

I don't see any practical use for this -- if you start processing the partial groupings immediately, you end up double-processing some of the items; if you wait until the last, what's the point of the intermediate values? As you say yourself:
This isn't so useful for the concrete sequence, but for this it would be great:
for grouped in grouping(data_over_wire()):
    process_partial_groups(grouped)
And that demonstrated exactly why this would be a terrible bug magnet, suckering people into doing what you just did, and ending up processing values more than once. To avoid that, your process_partial_groups would need to remember which values it has seen before for each key it has seen before. -- Steve
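For what it's worth, a sketch of the bookkeeping such a consumer would need (restating the incremental generator from David's message, with a per-key "already seen" count so values aren't reprocessed):

def grouping(iterable, key=None):
    # incremental generator, as sketched earlier in the thread
    key = key or (lambda x: x)
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
        yield groups

def process_partial_groups(groups, seen):
    # only handle values that were not seen in an earlier partial grouping
    for k, values in groups.items():
        for value in values[seen.get(k, 0):]:
            print("handling", k, value)
        seen[k] = len(values)

seen = {}
for partial in grouping("aaAAbbCaAB", key=str.upper):
    process_partial_groups(partial, seen)

Which rather supports the point above: the caller has to carry extra state just to use the partial results safely.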
On Tue, Jul 3, 2018 at 8:24 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Jul 03, 2018 at 09:23:07AM -0400, David Mertz wrote:
My problem with the second idea is that *I* find it very wrong to have something in itertools that does not return an iterator. It wrecks the combinatorial algebra of the module.
hmm -- that seems to be a pretty pedantic approach -- practicality beats purity, after all :-) I think we should first decide if a grouping() function is a useful addition to the standard library (after all: "not every two line function needs to be in the stdlib"), and if so, then we can find a home for it. personally, I'm wondering if a "dicttools" or something module would make sense -- I imagine there are all sorts of other handy utilities for working with dicts that could go there. (though, yeah, we'd want to actually have a handful of these before creating a new module :-) )
That said, it's easy to fix... and I believe independently useful. Just
make grouping() a generator function rather than a plain function. This lets us get an incremental grouping of an iterable.
We already have something which lazily groups an iterable, returning groups as they are seen: groupby.
What makes grouping() different from groupby() is that it accumulates ALL of the subgroups rather than just consecutive subgroupings.
well, yeah, but it won't actually get you those until you exhaust the iterator -- so while it's different from itertools.groupby, is it different from itertools.groupby(sorted(iterable))? In short, this wouldn't really solve the problems that itertools.groupby has for this sort of task -- so what's the point?
As for where it belongs, perhaps the collections module is the least worst fit.
That depends some on whether we go with a simple function, in which case collections is a pretty bad fit (but maybe still the least worst). Personally I still like the idea of having this be a special type of dict, rather than "just a function" -- and then it's really obvious where to put it :-) -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
It seems a really stupid reason to make this choice, but:

If we make a Grouping class, it has an obvious home in the collections module.

If we make a grouping (or grouped) function, we don't know where to put it.

But since I like the Grouping class idea anyway, it's one more reason... -CHB

On Tue, Jul 3, 2018 at 9:15 AM, Chris Barker <chris.barker@noaa.gov> wrote:
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
I admit a hypothetical itertools.grouping that returned incrementally built dictionaries doesn't fill any simple need I have often encountered. I can be hand-wavy about "stateful bucketing of streams" and looking at windowing/tails, but I don't have a clean and simple example where I need this. The "run to exhaustion" interface has more obvious uses (albeit, they *must* be technically a subset of the incremental ones). I think I will also concede that an incrementally built and yielded dictionary isn't *really* in the spirit of itertools either. I suppose tee() can grow unboundedly if only one tine is utilized... but in general, itertools is meant to provide iterators that keep memory usage limited to a few elements in memory at a time (yes, groupby, takewhile, or dropwhile have pathological cases that could be unbounded... but usually they're not). So maybe we really do need a dicttools or mappingtools module, with this as the first function to put inside it. ... but I STILL like a new collections.Grouping (or collections.Grouper) the best. It might overcome Guido's reluctance... and what goes there is really delegated by him, not his own baby. On Tue, Jul 3, 2018 at 12:19 PM Chris Barker via Python-ideas <python-ideas@python.org> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Tue, Jul 3, 2018 at 12:01 PM, David Mertz <mertz@gnosis.cx> wrote:
... but I STILL like a new collections.Grouping (or collections.Grouper) the best.
me too.
It might overcome Guido's reluctance... and what goes there is really delegated by him, not his own baby.
Is collections anyone in particular's baby? like itertools "belongs" to Raymond? -CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
Steven D'Aprano wrote:
I propose that a better name which indicates the non-lazy nature of this function is *grouped* rather than grouping, like sorted().
+1
As for where it belongs, perhaps the collections module is the least worst fit.
But then there's the equally strong purist argument that it's not a data type, just a function. Unless we *make* it a data type. Then not only would it fit well in collections, it would also make it fairly easy to do incremental grouping if you really wanted that.

Usual case:

g = groupdict((key(val), val) for val in things)

Incremental case:

g = groupdict()
for key(val), val in things:
    g.add(key, val)
    process_partial_grouping(g)

-- Greg
On Wed, Jul 04, 2018 at 10:44:17AM +1200, Greg Ewing wrote:
Steven D'Aprano wrote:
I propose that a better name which indicates the non-lazy nature of this function is *grouped* rather than grouping, like sorted().
+1
As for where it belongs, perhaps the collections module is the least worst fit.
But then there's the equally strong purist argument that it's not a data type, just a function.
Yes, I realised that after I posted my earlier comment.
Unless we *make* it a data type. Then not only would it fit well in collections, it would also make it fairly easy to do incremental grouping if you really wanted that.
Usual case:
g = groupdict((key(val), val) for val in things)
How does groupdict differ from regular defaultdicts, aside from the slightly different constructor?
Incremental case:
g = groupdict() for key(val), val in things: g.add(key, val) process_partial_grouping(g)
I don't think that syntax works. I get:

SyntaxError: can't assign to function call

Even if it did work, it's hardly any simpler than

d = defaultdict(list)
for val in things:
    d[key(val)].append(val)

But then Counter is hardly any simpler than a regular dict too. -- Steve
So this ended up being a long post, so the TL;DR:

* There are types of data well suited to the key function approach, and other data not so well suited to it. If you want to support the not as well suited use cases, you should have a value function as well and/or take a (key, value) pair.
* There are some nice advantages in flexibility to having a Grouping class, rather than simply a function.

So: I propose a best of all worlds version: a Grouping class (subclass of dict):

* The constructor takes an iterable of (key, value) pairs by default.
* The constructor takes an optional key_func -- when not None, it is used to determine the keys in the iterable instead.
* The constructor also takes a value_func -- when specified, it processes the items to determine the values.
* a_grouping[key] = value adds the value to the list corresponding to the key.
* a_grouping.add(item) -- applies the key_func and value_func to add a new value to the appropriate group.

Prototype code here: https://github.com/PythonCHB/grouper

Now the lengthy commentary and examples:

On Tue, Jul 3, 2018 at 5:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Jul 04, 2018 at 10:44:17AM +1200, Greg Ewing wrote:
Steven D'Aprano wrote:
Unless we *make* it a data type. Then not only would it fit well in collections, it would also make it fairly easy to do incremental grouping if you really wanted that.
indeed -- one of the motivations for my prototype: https://github.com/PythonCHB/grouper (Did none of my messages get to this list??)
Usual case:
g = groupdict((key(val), val) for val in things)
How does groupdict differ from regular defaultdicts, aside from the slightly different constructor?
* You don't need to declare the defaultdict (and what the default is) first
* You don't need to call .append() yourself
* It can have a custom __init__() and .update()
* It can have a .add() method
* It can (optionally) use a key function.
* And you can have other methods that do useful things with the groupings.
g = groupdict()
for key(val), val in things: g.add(key, val) process_partial_grouping(g)
I don't think that syntax works. I get:
SyntaxError: can't assign to function call
looks like untested code :-) with my prototype it would be:

g = groupdict()
for key, val in things:
    g[key] = val
    process_partial_grouping(g)

(this assumes your things are (key, value) pairs)

Again, IF your data are a sequence of items, and the value is the item itself, and the key is a simple function of the item, THEN the key function method makes more sense, which for the incremental adding of data would be:

g = groupdict(key_fun=a_fun)
for thing in things:
    g.add(thing)
    process_partial_grouping(g)

Even if it did work, it's hardly any simpler than
d = defaultdict(list)
for val in things:
    d[key(val)].append(val)
But then Counter is hardly any simpler than a regular dict too.
exactly -- and counter is actually a little annoyingly too much like a regular dict, in my mind :-)

In the latest version of my prototype, the __init__ expects a (key, value) pair by default, but you can also pass in a key_func, and then it will process the iterable passed in as (key_func(item), item) pairs. And the update() method will also use the key_func if one was provided. So a best of both worlds -- pick your API.

In this thread, and in the PEP, there are various ways of accomplishing this task presented -- none of them (except using a raw itertools.groupby in some cases) is all that onerous. But I do think a custom function, or even better, a custom class, would create a "one obvious" way to do a common manipulation.

A final (repeated) point: Some data are better suited to a (key, value) pair style, and some to a key function style. All of the examples in the PEP are well suited to the key function style. But the example that kicked off this discussion was about data already in (key, value) pairs (actually in that case, (value, key) pairs). And there are other examples.

Here's a good one for how one might want to use a Grouping dict more like a regular dict -- or maybe like a simple function constructor: (code in: https://github.com/PythonCHB/grouper/blob/master/examples/trigrams.py)

#!/usr/bin/env python3
"""
Demo of processing "trigrams" from Dave Thomas' Coding Kata site:
http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

This is only addressing the part of the problem of building up the trigrams.

This is showing various ways of doing it with the Grouping object.
"""

from grouper import Grouping
from operator import itemgetter

words = "I wish I may I wish I might".split()

# using setdefault with a regular dict:
# how I might do it without a Grouping class
trigrams = {}
for i in range(len(words) - 2):
    pair = tuple(words[i:i + 2])
    follower = words[i + 2]
    trigrams.setdefault(pair, []).append(follower)
print(trigrams)

# using a Grouping with a regular loop:
trigrams = Grouping()
for i in range(len(words) - 2):
    pair = tuple(words[i:i + 2])
    follower = words[i + 2]
    trigrams[pair] = follower
print(trigrams)

# using a Grouping with zip
trigrams = Grouping()
for w1, w2, w3 in zip(words[:], words[1:], words[2:]):
    trigrams[(w1, w2)] = w3
print(trigrams)

# Now we can do it in one expression:
trigrams = Grouping(((w1, w2), w3) for w1, w2, w3 in zip(words[:], words[1:], words[2:]))
print(trigrams)

# Now with the key function:
# in this case it needs to be in a sequence, so we can't use a simple loop
trigrams = Grouping(zip(words[:], words[1:], words[2:]), key_fun=itemgetter(0, 1))
print(trigrams)

# Darn! that got the key right, but the value is not right.
# we can post process:
trigrams = {key: [t[2] for t in value] for key, value in trigrams.items()}
print(trigrams)
# But THAT is a lot harder to wrap your head around than the original setdefault() loop!
# And it mixes key function style and comprehension style -- so no good.

# Adding a value_func helps a lot:
trigrams = Grouping(zip(words[:], words[1:], words[2:]),
                    key_fun=itemgetter(0, 1),
                    value_fun=itemgetter(2))
print(trigrams)
# that works fine, but I, at least, find it klunkier than the comprehension style

# Finally, we can use a regular loop with the functions
trigrams = Grouping(key_fun=itemgetter(0, 1), value_fun=itemgetter(2))
for triple in zip(words[:], words[1:], words[2:]):
    trigrams.add(triple)
print(trigrams)

-CHB

-- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, Jul 3, 2018 at 10:11 PM Chris Barker via Python-ideas < python-ideas@python.org> wrote:
* There are types of data well suited to the key function approach, and other data not so well suited to it. If you want to support the not as well suited use cases, you should have a value function as well and/or take a (key, value) pair.
* There are some nice advantages in flexibility to having a Grouping class, rather than simply a function.
The tri-grams example is interesting and shows some clever things you can do. The bi-grams example I wrote in my draft PEP could be extended to handle tri-grams with just a key-function, no value-function. However, because this example is fun it may be distracting from the core value of ``grouped`` or ``grouping``. I don't think we need a nicer API for complex grouping tasks. As the tasks get increasingly sophisticated, any general-purpose API will be less nice than something built for that specific task. Instead, I want the easiest possible interface for making groups for every-day use cases. The wide range of situations that ``sorted`` covers with just a key-function suggests that ``grouped`` should follow the same pattern. I do think that the default, key=None, could be set to handle (key, value) pairs. But I'm still reluctant to break the standard of sorted, min, max, and groupby.
On Fri, Jul 6, 2018 at 5:13 PM, Michael Selik <mike@selik.org> wrote:
On Tue, Jul 3, 2018 at 10:11 PM Chris Barker via Python-ideas < python-ideas@python.org> wrote:
* There are types of data well suited to the key function approach, and other data not so well suited to it. If you want to support the not as well suited use cases, you should have a value function as well and/or take a (key, value) pair.
* There are some nice advantages in flexibility to having a Grouping class, rather than simply a function.
The tri-grams example is interesting and shows some clever things you can do. The bi-grams example I wrote in my draft PEP could be extended to handle tri-grams with just a key-function, no value-function.
hmm, I'll take a look -- 'cause I found that I was really limited to only a certain class of problems without a way to get "custom" values. Do you mean the "foods" example?
foods = [
...     ('fruit', 'apple'),
...     ('vegetable', 'broccoli'),
...     ('fruit', 'clementine'),
...     ('vegetable', 'daikon')
... ]
groups = grouping(foods, key=lambda pair: pair[0])
{k: [v for _, v in g] for k, g in groups.items()}
{'fruit': ['apple', 'clementine'], 'vegetable': ['broccoli', 'daikon']}
Because that one, I think, makes my point well. To get what you want, you have to post-process the Grouping with a (somewhat complex) comprehension. If someone is that adept with comprehensions, and wants to do it that way, the grouping function isn't really buying them much at all, over setdefault, or defaultdict, or roll your own.

Contrast this with:

groups = grouping(foods, key=lambda pair: pair[0], value=lambda pair: pair[1])

and you're done. or:

groups = grouping(foods, key=itemgetter(0), value=itemgetter(1))

Or even better:

groups = grouping(foods)

:-)

However, because this example is fun it may be distracting from the core
value of ``grouped`` or ``grouping``.
Actually, I think it's the opposite -- it opens up the concept to be more general purpose -- I guess I'm thinking of this as a "dict with lists as the values" that has many purposes beyond strictly "groupby". Maybe that's because I'm a general python programmer, and not a database guy, but if something is going to be added to the stdlib, why not add a more general purpose class?
I don't think we need a nicer API for complex grouping tasks. As the tasks get increasingly sophisticated, any general-purpose API will be less nice than something built for that specific task.
I guess this is where we disagree -- I think we've found an API that is general purpose, and cleanly supports multiple tasks. Instead, I want the easiest possible interface for making groups for
every-day use cases. The wide range of situations that ``sorted`` covers with just a key-function suggests that ``grouped`` should follow the same pattern.
not at all -- sorted() is about, well, sorting -- which means rearranging items. I certainly don't expect it to break up the items for me. Again, this is a matter of perspective -- if you start with "groupby" as a concept, then I can see how you see the parallel with sorted -- you are rearranging the items, but this time into groups. But if you start with "a dict of lists", then you take a wider perspective:

- It can naturally and easily be used to group things
- It can do other nifty things
- And as a "dict of something", it's natural to think of keys AND values, and to want a dict-like API -- i.e. pass in (key, value) pairs.

I do think that the default, key=None, could be set to handle (key, value)
pairs.
OK, so for my part, if you provide the (key, value) pair API, then you don't really need a value_func. But as the "pass in a function to process the data" model IS well suited to some tasks, and some people simply like the style, why not? And it creates an asymmetry: if you have a (key, the_item) problem, you can use either the key function API or the (key, value) API -- but if you have a (key, value) problem, you can only use the (key, value) API.

But I'm still reluctant to break the standard of sorted, min, max, and
groupby.
This is the power of Python's keyword parameters -- anyone coming to this from a perspective of "I expect this to be like sorted, min, max, and groupby" can simply ignore the value parameter :-)

One more argument :-) There have been comments about how maybe some of the classes in collections are not needed -- Counter, in particular. I tend to agree, but I think the reason Counter is not-that-useful is because it doesn't do enough -- not that it isn't useful -- it's just such a thin wrapper around a dict that I hardly see the point.

Example:

In [12]: c = Counter()
In [13]: c['f'] += 1
In [14]: c['g'] = "some random thing"
In [15]: c
Out[15]: Counter({'f': 1, 'g': 'some random thing'})

Is that really that useful? I need to do the counting by hand, and can easily use the regular dict interface to make a mess of it. It has a handy constructor, but that's about it.

Anyway, I think we've got this nailed down to a handful of options / decisions:

1) a dict subclass vs a function that constructs a dict-of-lists
- I think a dict subclass offers some real value -- but it comes down a bit to goals: Do we want a general purpose special dict? or a function to perform the "usual" groupby operation?

2) Do we have a value function keyword parameter?
- I think this adds real value without taking anything away from the convenience of the simpler key-only API

3) Do we accept an iterable of (key, value) pairs if no key function is provided?
- I think yes, also because why not? a default of the identity function for key and value is pretty useless.

So it comes down to what the community thinks.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
David Mertz wrote:
Just make grouping() a generator function rather than a plain function. This lets us get an incremental grouping of an iterable. This can be useful if the iterable is slow or infinite, but the partial groupings are useful in themselves.
Do you have any real-world examples? I'm having trouble thinking of any. Even if there are a few, it seems like the vast majority of the time you *won't* want the intermediate groupings, just the final one, and then what do you do? It would be annoying to have to write code to exhaust the iterator just to get the result you're after. Also, were you intending it to return a series of independent objects, or does it just return the same object every time, adding stuff to it? The former would be very inefficient for the majority of uses, whereas the latter doesn't seem to be in keeping with the spirit of itertools. This idea seems like purity beating practicality to me. -- Greg
Replying to the question in the subject: I think it would be better in collections as a class. Having it just as a function doesn't buy much, because one can do the same with three lines and a defaultdict. However, if this is a class it can support adding new elements, merging the groupeddicts, etc. -- Ivan
On Wed, Jul 04, 2018 at 11:08:05AM +0100, Ivan Levkivskyi wrote:
Replying to the question in subject, I think it would be better in collections as a class. Having it just as a function doesn't buy much, because one can do the same with three lines and a defaultdict. However, if this is a class it can support adding new elements, merge the groupeddicts, etc.
defaultdicts support adding new elements, and they have an update method same as regular dicts :-) -- Steve
On 4 July 2018 at 11:25, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Jul 04, 2018 at 11:08:05AM +0100, Ivan Levkivskyi wrote:
Replying to the question in subject, I think it would be better in collections as a class. Having it just as a function doesn't buy much, because one can do the same with three lines and a defaultdict. However, if this is a class it can support adding new elements, merge the groupeddicts, etc.
defaultdicts support adding new elements, and they have an update method same as regular dicts :-)
Except that updating will not do what I want. Merging two groupeddicts is not just `one.update(other)`. Moreover, using just an update with regular dicts will do something bug-prone: it will add every group from `other` as an element to the corresponding group in `one`. -- Ivan
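A sketch of what a grouping-aware merge has to do (plain dict-of-lists for illustration; note that a bare one.update(other) on plain dicts would simply replace the lists):

def merge_groupings(one, other):
    merged = {key: list(values) for key, values in one.items()}
    for key, values in other.items():
        # extend the existing group instead of replacing it
        merged.setdefault(key, []).extend(values)
    return merged

a = {'fruit': ['apple'], 'veg': ['kale']}
b = {'fruit': ['pear'], 'grain': ['rye']}
merge_groupings(a, b)
# {'fruit': ['apple', 'pear'], 'veg': ['kale'], 'grain': ['rye']}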
On Wed, Jul 4, 2018, 3:11 AM Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
Replying to the question in subject, I think it would be better in collections as a class. Having it just as a function doesn't buy much, because one can do the same with three lines and a defaultdict.
Four lines. You'll need to convert from defaultdict back to a basic dict to avoid mistaken inserts. For some use cases. However, if this is a class it can support adding new elements, merge the
groupeddicts, etc.
-- Ivan
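For reference, the "four lines" mentioned above look roughly like this (converting back to a plain dict so that a later lookup of a missing key raises KeyError instead of silently inserting an empty group):

from collections import defaultdict

def grouped(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

grouped([('a', 1), ('b', 2), ('a', 3)])
# {'a': [1, 3], 'b': [2]}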
I'm -1 on adding it to the stdlib. But if it happens, I'm -1 on functools and collections. They are used very much. Every Python tool imports them regardless of how much of their contents is used. On the other hand, itertools contains random stuff that is very rarely used. If you really want to add it to collections, I suggest from collections.groupdict import GroupDict. Regards, On Tue, Jul 3, 2018 at 10:23 PM David Mertz <mertz@gnosis.cx> wrote:
-- INADA Naoki <songofacandy@gmail.com>
On Wed, Jul 4, 2018 at 3:53 AM, INADA Naoki <songofacandy@gmail.com> wrote:
But if it happens, I'm -1 on functools and collections. They are used very much. Every Python tool import them regardless how much of their contents are used.
really? collections? what for? I'm guessing namedtuple and maybe deque. But collections already has 9 classes (well, things) in it so we'd be adding a bit less than 10% more to it. what is the concern? import time, memory? In either case, it seems like the wrong driver for deciding where to put new things.
If you really want to add it in collections, I suggests from collections.groupdict import GroupDict.
Perhaps the stdlib should have deeper namespaces in general -- if that is established as a policy, then this could be the first thing to follow that policy. But I thought "flat is better than nested" -- sigh. So maybe we need to bite the bullet and solve the problem at another level:

1) if, say, namedtuple has gotten very popular, maybe it should move to builtins.

2) Whatever happened to the proposals to make it easier to lazy-load stuff in modules? If that gets implemented, then we can speed up startup in general, and not have to be too worried about adding "too much" to a module because one thing in it is in common use.

-CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, Jul 3, 2018 at 6:23 AM, David Mertz <mertz@gnosis.cx> wrote:
Guido said he has mooted this discussion
... But before putting it on auto-archive, the BDFL said (1) NO GO on getting a new builtin; (2) NO OBJECTION to putting it in itertools. I don't recall him offering an opinion on a class in collections, did he? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
Yes, he said a definite no to a built-in. But he expressed a less specific lack of enthusiasm for collections classes (including Counter, which exists and which I personally use often). On Thu, Jul 5, 2018, 1:16 AM Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Jul 5, 2018 at 3:26 AM, David Mertz <mertz@gnosis.cx> wrote:
Yes, he said a definite no to a built-in. But he expressed a less specific lack of enthusiasm for collections classes (including Counter, which exists and which I personally use often).
And a Grouping class would do more than Counter, which I find trivial enough that I generally don't bother to use it. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
The way I see grouping is as an aggregation operation. As such, in my head, grouping is similar to min/max. However, if builtins are a no-go, then I feel I need to think a little outside the box: Is there a possibility that many more aggregate functions will be desired in the near future? Is there a case for collecting aggregate functions into another top-level module? Also, I would consider statistics to have similarities - median, mean etc. are aggregate functions. Histograms are also doing something similar to grouping. Apologies I have not offered any concrete suggestions, but I just thought I should offer my thoughts. On Thu, 5 Jul 2018, 22:24 Chris Barker via Python-ideas, <python-ideas@python.org> wrote:
On Fri, Jul 06, 2018 at 09:49:37AM +0100, Cammil Taank wrote:
I would consider statistics to have similarities - median, mean etc are aggregate functions. Histograms are also doing something similar to grouping.
I was thinking the same thing, but I don't think it is a good fit. Grouping records with arbitrary structure is very different from the numerically-focused statistics module. (Yes, a few statistics apply to nominal and ordinal data too, but the primary focus is on numbers.) -- Steve
On Jul 6, 2018, at 2:10 AM, Steven D'Aprano <steve@pearwood.info> wrote:

I would consider statistics to have similarities - median, mean etc are aggregate functions.

Not really, more like reduce, actually -- you get a single result.

Histograms are also doing something similar to grouping.

(Yes, a few statistics apply to nominal and ordinal data too,

And for that, a generic grouping function could be used. In fact, allowing Counter to be used as the accumulator was one suggestion in this thread, and would build a histogram.

Now that I think about it, you could write a key function that built a histogram for continuous data as well. Though that might be a bit klunky. But if someone thinks that's a good idea, a PR for an example would be accepted: https://github.com/PythonCHB/grouper

-CHB

-- Steve
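A rough sketch of that idea -- a binning key function plus Counter as the accumulator (bin_of and the bin width are made up for illustration, not part of any prototype):

from collections import Counter

def bin_of(value, width=10.0):
    # bucket continuous data into fixed-width bins
    low = (value // width) * width
    return (low, low + width)

data = [3.2, 7.7, 12.0, 18.5, 41.3, 44.4]
hist = Counter(bin_of(x) for x in data)
# bins (0.0, 10.0), (10.0, 20.0) and (40.0, 50.0), each with a count of 2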
(Fixing quote and attribution.) On Fri, Jul 6, 2018, 11:32 Chris Barker - NOAA Federal via Python-ideas <python-ideas@python.org> wrote:
On Jul 6, 2018, at 2:10 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Jul 06, 2018 at 09:49:37AM +0100, Cammil Taank wrote:
I would consider statistics
to have similarities - median, mean etc are aggregate functions.
Not really, more like reduce, actually -- you get a single result.
Histograms are also doing something similar to grouping.
.(Yes, a few statistics apply to nominal and ordinal data too,
And for that, a generic grouping function could be used.
In fact, allowing Counter to be used as the accumulator was one suggestion in this thread, and would build a histogram.
Now that I think about it, you could write a key function that built a histogram for continuous data as well.
Though that might be a bit klunky.
But if someone thinks that’s a good idea, a PR for an example would be accepted:
+1 for `collections`, because it's where you look for something similar to Counter. -1 for `statistics`, because the need isn't specific to statistics. It'd be like putting `commonprefix`, which is a general string operation, into `os.path`. It's hacky to import a domain-specific module to use one of its non-domain-specific helpers for a different domain. Someone can argue for functools, as that's the functional programming module, containing `reduce`.
2018-07-03 14:58 GMT+02:00 David Mertz <mertz@gnosis.cx>:
On Tue, Jul 3, 2018 at 2:52 AM Chris Barker via Python-ideas <python-ideas@python.org> wrote:
What you've missed, in *several* examples is the value part of the tuple in your API. You've pulled out the key, and forgotten to include anything in the actual groups. I have a hunch that if your API were used, this would be a common pitfall.
I think this argues against your API and for Michael's that simply deals with "sequences of groupable things." That's much more like what one deals with in SQL, and is familiar that way. If the things grouped are compound object such as dictionaries, objects with common attributes, named tuples, etc. then the list of things in a group usually *does not* want the grouping attribute removed.
I agree the examples have lisp-levels of brackets. However, by using the fact that tuples don't need brackets, and the fact that we can use a list instead of an iterable (the grouper will have to store the whole object in memory anyway, and if it is really big, use itertools.groupby, which is designed exactly for that), for example

grouping(((len(word), word) for word in words))

can be written

grouping([len(word), word for word in words])

which is less "bracket issue prone". The main advantage of having this syntax is that it gives a definition very close to the one of a dict comprehension, which is nice considering what we obtain is a dict (without that feature I'm not sure I would ever attempt to use this function). And that allows us to have the same construction syntax as a dictionary, with an iterable of (key, value) pairs (https://docs.python.org/3.7/library/stdtypes.html#dict).
So that was an interesting exercise -- many of those are a bit clearer
(or more compact) with the key function. But I also notice a pattern -- all those examples fit very well into the key function pattern:
Yep.
Well, those were the examples used to showcase the key function in the PEP. This is as bad as it gets for the "initialization by comprehension" syntax.
I agree still (after all, I proposed it to Michael). But this seems minor, and Guido seems not to like `collections` that much (or at least he commented on not using Counter ... which I personally love to use and to teach).
Actually Counter is very, very close to grouping (replace the append with starting value [] in the for loop by a += with starting value 0 and grouping becomes a counter), so adding it to collections makes the most sense by a long shot. As far as I'm concerned, CHB's semantics and syntax for the grouper object does everything that is needed, and even a little bit too much. It could be called AppendDict and just accept a (key, value) iterable as input, and instead of doing dict[key] = value as a dict does, do dict[key] = [value] if key not in dict else dict[key] + [value] (and it should be coded in C I guess) -- Nicolas Rolin
On Tue, Jul 03, 2018 at 04:12:14PM +0200, Nicolas Rolin wrote:
I agree the examples have lisp-level of brackets. However by using the fact tuples don't need brackets and the fact we can use a list instead of an iterable (the grouper will have to stock the whole object in memory anyway, and if it is really big, use itertools.groupby who is designed exactly for that) For example grouping(((len(word), word) for word in words)) can be written grouping([len(word), word for word in words])
which is less "bracket issue prone".
Did you try this? It is a syntax error. Generator expressions must be surrounded by round brackets:

grouping([len(word), (word for word in words)])

Or perhaps you meant this:

grouping([(len(word), word) for word in words])

but now it seems pointless to use a list comprehension instead of a generator expression:

grouping((len(word), word) for word in words)

but why are we using key values by hand when grouping ought to do it for us, as Michael Selik's version does?

grouping(words, key=len)

-- Steve
On Tue, Jul 3, 2018 at 8:33 AM, Steven D'Aprano <steve@pearwood.info> wrote:
but why are we using key values by hand when grouping ought to do it for
us, as Michael Selik's version does?
grouping(words, key=len)
because supplying a key function is sometimes cleaner, and sometimes uglier, than building up a comprehension -- which I think comes down to:

1) taste (style?)
2) whether the key function is as simple as the expression
3) whether you need to transform the value in any way.

This argument is pretty much the same as whether you should use a comprehension or map:

map(len, words) vs (len(word) for word in words)

In that case, map() looks cleaner and easier, but when you have something less simple:

map(operator.attrgetter('something'), some_objects) vs (object.something for object in some_objects)

I like the comprehension better. Add a filter, and comps really get nicer -- after all, they were added to the language for a reason. Then when you add the additional complication of needing to "transform" the value as well, it's easy to do with the comprehension, but there is no way to do it with only a key function.

I think the "conflict" here is that Michael started with a bunch of examples that are all well suited to the key_function approach, and Nicolas started with a use-case that is better suited to the comprehension / (key, value) approach. However, while the (key, value) approach can be reasonably (if a bit klunkily) used everywhere the key function approach can, the opposite is not true (for when the value needs to be transformed as well).

But in the spirit of "Python has both map and comprehensions", I say let's use both!

* The default behavior is to process a (key, value) pair.
* A key function can be provided, in which case it is used, and the value is the full item.
* A value function can be provided, in which case it is used to "process" the value.

If this is too confusing an interface, we could forget the value function, and folks would have to use the (key, value) interface if they need to transform the value.

What makes no sense to me is having the identity function as the default key (and yes, it is the identity function -- it would return the actual object, or not be there at all; the grouping would be done by the hash of the key after passing through the key function). That's because having a default that is (almost) completely useless makes no sense -- it might as well be a required parameter. (unless there was a value function as well, in which case, it's not a completely useless default).

- CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
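To make the contrast concrete, here is a small stand-in that supports both styles (a hypothetical API along the lines discussed above -- pairs by default, or a key function that keeps the whole item as the value):

def grouping(iterable, key=None):
    groups = {}
    for item in iterable:
        # pair mode: item is already (key, value); key-function mode: value is the item
        k, v = (key(item), item) if key else item
        groups.setdefault(k, []).append(v)
    return groups

words = ["apple", "avocado", "banana", "cherry", "citrus"]

grouping(words, key=lambda w: w[0])
# {'a': ['apple', 'avocado'], 'b': ['banana'], 'c': ['cherry', 'citrus']}

grouping((w[0], len(w)) for w in words)   # the value is transformed, so pair style is needed
# {'a': [5, 7], 'b': [6], 'c': [6, 6]}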
On Tue, Jul 03, 2018 at 10:33:55AM -0700, Chris Barker wrote:
On Tue, Jul 3, 2018 at 8:33 AM, Steven D'Aprano <steve@pearwood.info> wrote:
but why are we using key values by hand when grouping ought to do it for
us, as Michael Selik's version does?
grouping(words, key=len)
because supplying a key function is sometimes cleaner, and sometimes uglier than building up a comprehension -- which I think comes down to:
1) taste (style?)
2) whether the key function is as simple as the expression
3) whether you need to transform the value in any way.
Of course you can prepare the sequence any way you like, but these are not equivalent:

grouping(words, keyfunc=len)

grouping((len(word), word) for word in words)

The first groups words by their length; the second groups pairs of (length, word) tuples by equality.

py> grouping("a bb ccc d ee fff".split(), keyfunc=len)
{1: ['a', 'd'], 2: ['bb', 'ee'], 3: ['ccc', 'fff']}

py> grouping((len(w), w) for w in "a bb ccc d ee fff".split())
{(3, 'ccc'): [(3, 'ccc')], (1, 'd'): [(1, 'd')], (2, 'ee'): [(2, 'ee')], (3, 'fff'): [(3, 'fff')], (1, 'a'): [(1, 'a')], (2, 'bb'): [(2, 'bb')]}

Don't worry, it wasn't obvious to me at 1am (my local time) either :-) -- Steve
On Tue, Jul 3, 2018, 6:32 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Jul 03, 2018 at 10:33:55AM -0700, Chris Barker wrote:
On Tue, Jul 3, 2018 at 8:33 AM, Steven D'Aprano <steve@pearwood.info> wrote:
but why are we using key values by hand when grouping ought to do it for
us, as Michael Selik's version does?
grouping(words, key=len)
because supplying a key function is sometimes cleaner, and sometimes uglier than building up a comprehension -- which I think comes down to:
1) taste (style?)
2) whether the key function is as simple as the expression
3) whether you need to transform the value in any way.
Of course you can prepare the sequence any way you like, but these are not equivalent:
grouping(words, keyfunc=len)
grouping((len(word), word) for word in words)
The first groups words by their length; the second groups pairs of (length, word) tuples by equality.
py> grouping("a bb ccc d ee fff".split(), keyfunc=len) {1: ['a', 'd'], 2: ['bb', 'ee'], 3: ['ccc', 'fff']}
py> grouping((len(w), w) for w in "a bb ccc d ee fff".split()) {(3, 'ccc'): [(3, 'ccc')], (1, 'd'): [(1, 'd')], (2, 'ee'): [(2, 'ee')], (3, 'fff'): [(3, 'fff')], (1, 'a'): [(1, 'a')], (2, 'bb'): [(2, 'bb')]}
This handles the case that someone is passing in n-tuple rows and wants to keep the rows unchanged.
On 2018-07-03 23:20, Greg Ewing wrote:
Nicolas Rolin wrote:
grouping(((len(word), word) for word in words))
That actually has one more level of parens than are needed, you can just write
grouping((len(word), word) for word in words)
FWIW, here's my opinion. I much prefer something like:

grouped(words, key=len)

I think that building an iterable of 2-tuples to pass to 'grouped' is much like following a decorate-sort-undecorate pattern when sorting, or something similar when using 'min' or 'max'. Passing an iterable of items and optionally a key function is simpler, IMHO.

Why would you pass 2-tuples, anyway? Maybe it's because 'grouped' returns a dict and a dict can be built from an iterable of 2-tuples, but that's OK because a dict needs key/value pairs.

When 'Counter' was being proposed, it was suggested that one could be created from an iterable of 2-tuples, which sort of made sense because a Counter is like a dict, but, then, how would you count 2-tuples? Fortunately, Counter counts items, so you can do things like:

counts = Counter(list_of_words)

I think it's the same thing here. 'grouped' returns a dict, so passing 2-tuples initially seems reasonable, but, as in the case with Counter, I think it would be a mistake.

It would be nice to be able to say:

grouped(words, key=str.casefold)

rather than:

grouped((word.casefold(), word) for word in words)

It would match the pattern of sorted, min and max.
MRAB wrote:
I think that building an iterable of 2-tuples to pass to 'grouped' is much like following a decorate-sort-undecorate pattern when sorting, or something similar when using 'min' or 'max'. Passing an iterable of items and optionally a key function is simpler, IMHO.
It should certainly be an option, but I don't think it should be the only one. Like with map() vs. comprehensions, sometimes one way is more convenient, sometimes the other. -- Greg
On Mon, Jul 02, 2018 at 02:52:03AM -0700, Michael Selik wrote:
Third, some classes might have a rich equality method that allows many interesting values to all wind up in the same group even if using the default "identity" key-function.
I would expect an identity key function to group by *identity* (is), not equality. But I would expect the default grouper to group by *equality*. -- Steve
On Mon, Jul 2, 2018 at 12:50 AM Michael Selik <mike@selik.org> wrote:
[Guido]
You'll never get consensus on anything here, but you have my blessing for this without consensus.
That feels like a success, but I'm going to be a bit more ambitious and try to persuade you that `grouping` belongs in the built-ins. I revised my draft to streamline the examples and make a clearer comparison with existing tools.
Sorry, I'm not biting. This will not be a builtin nor a method on a builtin. I'm going to mute this thread because it's getting too noisy. -- --Guido van Rossum (python.org/~guido)
On Fri, Jun 29, 2018 at 11:25 PM, Guido van Rossum <guido@python.org> wrote:
Hm, this actually feels heavier to me. But then again I never liked or understood the need for Counter --
actually, me neither -- and partly because it's too lightweight -- that is, it's still a regular dict, and you pretty much have to know that to use it. That is, it provides a nice counting constructor, but after that, it's just a key:integer dict :-)

But in this case, I think there is more of an argument for a custom class -- if all it were was a dict with a custom constructor (and update) method, then yeah, better to have a function. But there is more that could be built on top of a grouping class, one that happened to be a dict under the hood, but really is its own thing, with a handful of interfaces and methods that are specific to it. More detail elsewhere in the discussion.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Mon, Jul 2, 2018 at 8:49 AM Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jun 29, 2018 at 11:25 PM, Guido van Rossum <guido@python.org> wrote:
Hm, this actually feels heavier to me. But then again I never liked or understood the need for Counter --
actually, me neither -- and partly because it's too lightweight -- that is, it's still a regular dict, and you pretty much have to know that to use it. That is, it provides a nice counting constructor, but after that, it's just a key:integer dict :-)
Counter provides ``most_common`` which is often implemented inefficiently if written from scratch. People mistakenly use ``sorted`` instead of ``heapq.nlargest``.
Counter also considers that any missing key has the value 0. With the constructor (accepting any iterable) and most_common(n), it's a very handy set of features if you need to count anything. On 13/07/2018 19:45, Michael Selik wrote:
On Mon, Jul 2, 2018 at 8:49 AM Chris Barker <chris.barker@noaa.gov <mailto:chris.barker@noaa.gov>> wrote:
On Fri, Jun 29, 2018 at 11:25 PM, Guido van Rossum <guido@python.org <mailto:guido@python.org>> wrote:
Hm, this actually feels heavier to me. But then again I never liked or understood the need for Counter --
actually, me neither -- and partly because it's too lightweight -- that is, it's still a regular dict, and you pretty much have to know that to use it. That is, it provides a nice counting constructor, but after that, it's just a key:integer dict :-)
Counter provides ``most_common`` which is often implemented inefficiently if written from scratch. People mistakenly use ``sorted`` instead of ``heapq.nlargest``.
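To illustrate the most_common point quoted above, a small stdlib-only sketch. (That Counter.most_common(n) relies on heapq.nlargest internally is to the best of my understanding of CPython, hence the hedge in the comment.)

import heapq
from collections import Counter

words = "spam spam eggs spam bacon eggs".split()
counts = Counter(words)

counts.most_common(2)                                          # [('spam', 3), ('eggs', 2)]

# A from-scratch version often sorts everything, O(n log n):
sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]

# heapq.nlargest only tracks the top n items, which is (as far as I know)
# what Counter.most_common uses when n is given:
heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])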
On 30.06.18 00:42, Guido van Rossum wrote:
On a quick skim I see nothing particularly objectionable or controversial in your PEP, except I'm unclear why it needs to be a class method on `dict`. Adding something to a builtin like this is rather heavy-handed. Is there a really good reason why it can't be a function in `itertools`? (I don't think that it's relevant that it doesn't return an iterator -- it takes in an iterator.)
Also, your pure-Python implementation appears to be O(N log N) if key is None but O(N) otherwise; and the version for key is None uses an extra temporary array of size N. Is that intentional?
And it adds a requirement that keys be orderable. I think there should be two functions with different requirements: one for hashable and one for orderable keys. The latter should return a list of pairs, or a sorted dict if one were supported by the stdlib. I'm not sure they fit well in the itertools module. Maybe the proposed algorithms module would be a better place. Or maybe just keep them as recipes in the documentation (they are just a few lines). A concrete implementation can be simpler than the general implementation.
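A rough sketch of the two variants being described, with illustrative names (these are not actual stdlib functions):

import itertools
from collections import defaultdict

def grouped_hashable(iterable, key):
    """Group items under hashable keys; returns a plain dict of lists."""
    groups = defaultdict(list)
    for item in iterable:
        groups[key(item)].append(item)
    return dict(groups)

def grouped_orderable(iterable, key):
    """Group items under orderable (possibly unhashable) keys;
    returns a sorted list of (key, [items]) pairs."""
    items = sorted(iterable, key=key)
    return [(k, list(g)) for k, g in itertools.groupby(items, key=key)]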
On Fri, Jun 29, 2018 at 10:53 AM, Michael Selik <mike@selik.org> wrote:
I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
I'm really warming to the:
As a teacher, I've found that grouping is one of the most awkward tasks for beginners to learn in Python. While this proposal requires understanding a key-function, in my experience that's easier to teach than
Alternate: collections.Grouping version -- I really like this as a kind of custom mapping, rather than "just a function" (or alternate constructor) -- and I like your point that it can have a bit of functionality built in other than on construction.

But I think it should be more like the other collection classes -- i.e. a general purpose class that can be used for grouping, but also used more general-purpose-y as well. That way people can do their "custom" stuff (key function, etc.) with comprehensions.

The big differences are a custom __setitem__:

def __setitem__(self, key, value):
    self.setdefault(key, []).append(value)

And the __init__ and update would take an iterable of (key, value) pairs, rather than a single sequence.

This would get away from the itertools.groupby approach, which I find kinda awkward:

* How often do you have your data in a single sequence?
* Do you need your keys (and values!) to be sortable???
* Do we really want folks to have to be writing custom key functions and/or lambdas for really simple stuff?
* and you may need to "transform" both your keys and values

I've enclosed an example implementation, borrowing heavily from Michael's code. The test code has a couple of examples of use, but I'll put them here for the sake of discussion.

Michael had:

Grouping('AbBa', key=str.casefold)

with my code, that would be:

Grouping(((c.casefold(), c) for c in 'AbBa'))

Note that the key function is applied outside the Grouping object, which doesn't need to know anything about it -- and then users can use an expression in a comprehension rather than a key function.

This looks a tad clumsier with my approach, but this is a pretty contrived example -- in the more common case [*], you'd be writing a bunch of lambdas, etc., and I'm not sure there is a way to get the values customized as well, if you want that (without applying a map later on).

Here is the example that the OP posted that kicked off this thread:

In [37]: student_school_list = [('Fred', 'SchoolA'),
    ...:                        ('Bob', 'SchoolB'),
    ...:                        ('Mary', 'SchoolA'),
    ...:                        ('Jane', 'SchoolB'),
    ...:                        ('Nancy', 'SchoolC'),
    ...:                        ]

In [38]: Grouping(((item[1], item[0]) for item in student_school_list))
Out[38]: Grouping({'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']})

or

In [40]: Grouping((reversed(item) for item in student_school_list))
Out[40]: Grouping({'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']})

(note that if those keys and values didn't have to be reversed, you could just pass the list in raw.)

I really like how I can use a generator expression and simple expressions to transform the data in the way I need, rather than having to make key functions. And with Michael's approach, I think you'd need to call .map() after generating the grouping -- a much clunkier way to do it (and you'd get a plain dict rather than a Grouping that you could add stuff to later...).

I'm sure there are ways to improve my code, and maybe Grouping isn't the best name, but I think something like this would be a nice addition to the collections module.

-CHB

[*] -- before making any decisions about the best API, it would probably be a good idea to collect examples of the kind of data that people really do need to group like this. Does it come in (key, value) pairs naturally? Or in one big sequence with a key function that's easy to write? Who knows without examples of real world use cases.
I will show one "real world" example here: In my Python classes, I like to use Dave Thomas' trigrams "code kata": http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

A key piece of this is building up a data structure with word pairs, and a list of all the words that follow the pair in a piece of text. This is a nice exercise to help people think about how to use dicts, etc. Currently the cleanest code uses .setdefault:

word_pairs = {}
# loop through the words
# (rare case where using the index to loop is easiest)
for i in range(len(words) - 2):  # minus 2, 'cause you need a pair
    pair = tuple(words[i:i + 2])  # a tuple so it can be a key in the dict
    follower = words[i + 2]
    word_pairs.setdefault(pair, []).append(follower)

if this were done with my Grouping class, it would be:

In [53]: word_pairs = Grouping()

In [54]: for i in range(len(words) - 2):
    ...:     pair = tuple(words[i:i + 2])  # a tuple so it can be a key in the dict
    ...:     follower = words[i + 2]
    ...:     word_pairs[pair] = follower
    ...:

In [55]: word_pairs
Out[55]: Grouping({('I', 'wish'): ['I', 'I'], ('wish', 'I'): ['may', 'might'], ('I', 'may'): ['I'], ('may', 'I'): ['wish']})

Not that different, really, but it saves folks from having to find and understand setdefault. But you could also make it into a generator expression like so:

In [56]: Grouping(((w1, w2), w3) for w1, w2, w3 in zip(words[:], words[1:], words[2:]))
Out[56]: Grouping({('I', 'wish'): ['I', 'I'], ('wish', 'I'): ['may', 'might'], ('I', 'may'): ['I'], ('may', 'I'): ['wish']})

which I think is pretty slick, and satisfies the OP's desire for a comprehension-like approach, rather than the:

- create an empty dict
- loop through the iterable
- use setdefault in the loop

approach.

"... easier to teach than the nuances of setdefault or defaultdict."

well, yes, and no -- as above, I use an example of this in teaching so that I CAN teach the nuances of setdefault -- or at least dicts themselves (most students use an "if key in dict" construct before I tell them about setdefault). So if you are teaching, say, data analysis with Python it might be nice to have this builtin, but if you are teaching "programming with Python" I'd probably encourage them to do it by hand first anyway :-)
Defaultdict requires passing a factory function or class, similar to a key-function. Setdefault is awkwardly named and requires a discussion of references and mutability.
I agree that the naming is awkward, but I haven't found confusion with references and mutability from this -- though I do keep hammering those points throughout the class anyway :-) and my approach doesn't require any key functions either :-)

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
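For concreteness, a minimal sketch of the Grouping mapping described above. The behaviour follows the description in this message (a dict whose __setitem__ appends into per-key lists); it is not an existing collections class, and the simplified update() signature is an assumption:

class Grouping(dict):
    """A dict whose __setitem__ accumulates values into per-key lists."""

    def __init__(self, iterable=()):
        super().__init__()
        self.update(iterable)

    def __setitem__(self, key, value):
        # dict.setdefault works at the C level and does not call this
        # __setitem__ again, so there is no recursion here.
        self.setdefault(key, []).append(value)

    def update(self, iterable=()):
        # Simplified: takes an iterable of (key, value) pairs only.
        for key, value in iterable:
            self[key] = value


pairs = [('SchoolA', 'Fred'), ('SchoolB', 'Bob'), ('SchoolA', 'Mary')]
print(Grouping(pairs))   # {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob']}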
On 1 July 2018 at 15:18, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
On Fri, Jun 29, 2018 at 10:53 AM, Michael Selik <mike@selik.org> wrote:
I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
I'm really warming to the:
Alternate: collections.Grouping
version -- I really like this as a kind of custom mapping, rather than "just a function" (or alternate constructor) -- and I like your point that it can have a bit of functionality built in other than on construction.
But I think it should be more like the other collection classes -- i.e. a general purpose class that can be used for grouping, but also used more general-purpose-y as well. That way people can do their "custom" stuff (key function, etc.) with comprehensions.
The big differences are a custom __setitem__:
def __setitem__(self, key, value): self.setdefault(key, []).append(value)
And the __init__ and update would take an iterable of (key, value) pairs, rather than a single sequence.
This would get away from the itertools.groupby approach, which I find kinda awkward:
* How often do you have your data in a single sequence?
* Do you need your keys (and values!) to be sortable???)
* Do we really want folks to have to be writing custom key functions and/or lambdas for really simple stuff?
* and you may need to "transform" both your keys and values
I've enclosed an example implementation, borrowing heavily from Michael's code.
The test code has a couple examples of use, but I'll put them here for the sake of discussion.
Michael had:
Grouping('AbBa', key=str.casefold)
with my code, that would be:
Grouping(((c.casefold(), c) for c in 'AbBa'))
Note that the key function is applied outside the Grouping object, it doesn't need to know anything about it -- and then users can use an expression in a comprehension rather than a key function.
This looks a tad clumsier with my approach, but this is a pretty contrived example -- in the more common case [*], you'd be writing a bunch of lambdas, etc, and I'm not sure there is a way to get the values customized as well, if you want that. (without applying a map later on)
Here is the example that the OP posted that kicked off this thread:
In [37]: student_school_list = [('Fred', 'SchoolA'), ...: ('Bob', 'SchoolB'), ...: ('Mary', 'SchoolA'), ...: ('Jane', 'SchoolB'), ...: ('Nancy', 'SchoolC'), ...: ]
In [38]: Grouping(((item[1], item[0]) for item in student_school_list)) Out[38]: Grouping({'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']})
Unpacking and repacking the tuple would also work:

Grouping(((school, student) for student, school in student_school_list))

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Jun 29, 2018 at 10:53:34AM -0700, Michael Selik wrote:
Hello,
I've drafted a PEP for an easier way to construct groups of elements from a sequence. https://github.com/selik/peps/blob/master/pep-9999.rst
Seems useful, but I suggest that since it has to process the entire data set eagerly, the name ought to be grouped() following the precedent set by sorted(). I also suggest using keyfunc as the second parameter, following the same convention as itertools.groupby. That gives this possible implementation:

import itertools

def grouped(iterable, keyfunc=None):
    groups = {}
    for k, g in itertools.groupby(iterable, keyfunc):
        groups.setdefault(k, []).extend(g)
    return groups

Since Guido has ruled out making this a built-in, there's no really comfortable place in the standard library for it:

- it doesn't return an iterator (since it is eager, it would be pointless to yield key/items pairs instead of just returning the dict), so itertools is not a good fit;
- it doesn't return a specialist class, so collections is not a good fit;
- there's currently no "useful utilities which aren't useful enough to be built-in" module.

I fear that this proposal will fall into that awkward position of being doomed by not having somewhere to put it.

(Your suggestion to consider this an alternate constructor of dicts seems more sensible all the time... but again Guido disagrees.)

-- Steve
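If it helps, a couple of quick interactive checks of the grouped() sketch above (grouped is the hypothetical name proposed in this message, not an existing function):

>>> grouped("a bb ccc d ee fff".split(), len)
{1: ['a', 'd'], 2: ['bb', 'ee'], 3: ['ccc', 'fff']}
>>> grouped("a bb ccc d ee fff".split())   # keyfunc=None just groups equal items together
{'a': ['a'], 'bb': ['bb'], 'ccc': ['ccc'], 'd': ['d'], 'ee': ['ee'], 'fff': ['fff']}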
On Thu, Jun 28, 2018 at 8:25 AM, Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
I don't know from SQL, so "group by" doesn't mean anything to me, but this:
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
seems to me that the issue here is that there is no way to have a "defaultdict comprehension"

I can't think of a syntactically clean way to make that possible, though.

Could itertools.groupby help here? It seems to work, but boy! it's ugly:

In [*45*]: student_school_list
Out[*45*]:
[('Fred', 'SchoolA'),
('Bob', 'SchoolB'),
('Mary', 'SchoolA'),
('Jane', 'SchoolB'),
('Nancy', 'SchoolC')]

In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted(student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[
    ...: 1])}

Out[*46*]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
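For readability, here is the same groupby-based comprehension with the mail-client formatting artifacts (the asterisks and stray prompts) stripped out -- a sketch of the identical approach, self-contained with its sample data:

from itertools import groupby

student_school_list = [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'),
                       ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]

students_by_school = {
    school: [t[0] for t in grp]
    for school, grp in groupby(sorted(student_school_list, key=lambda t: t[1]),
                               key=lambda t: t[1])
}
# {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}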
PyToolz, Pandas, Dask .groupby() toolz.itertoolz.groupby does this succinctly without any new/magical/surprising syntax. https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.groupby
From https://github.com/pytoolz/toolz/blob/master/toolz/itertoolz.py :
""" def groupby(key, seq): """ Group a collection by a key function >>> names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'] >>> groupby(len, names) # doctest: +SKIP {3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']} >>> iseven = lambda x: x % 2 == 0 >>> groupby(iseven, [1, 2, 3, 4, 5, 6, 7, 8]) # doctest: +SKIP {False: [1, 3, 5, 7], True: [2, 4, 6, 8]} Non-callable keys imply grouping on a member. >>> groupby('gender', [{'name': 'Alice', 'gender': 'F'}, ... {'name': 'Bob', 'gender': 'M'}, ... {'name': 'Charlie', 'gender': 'M'}]) # doctest:+SKIP {'F': [{'gender': 'F', 'name': 'Alice'}], 'M': [{'gender': 'M', 'name': 'Bob'}, {'gender': 'M', 'name': 'Charlie'}]} See Also: countby """ if not callable(key): key = getter(key) d = collections.defaultdict(lambda: [].append) for item in seq: d[key(item)](item) rv = {} for k, v in iteritems(d): rv[k] = v.__self__ return rv """ If you're willing to install Pandas (and NumPy, and ...), there's pandas.DataFrame.groupby: https://pandas.pydata.org/pandas-docs/stable/generated/ pandas.DataFrame.groupby.html https://github.com/pandas-dev/pandas/blob/v0.23.1/pandas/ core/generic.py#L6586-L6659 Dask has a different groupby implementation: https://gist.github.com/darribas/41940dfe7bf4f987eeaa# file-pandas_dask_test-ipynb https://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFram... On Thursday, June 28, 2018, Chris Barker via Python-ideas < python-ideas@python.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM, Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
I don't know from SQL, so "group by" doesn't mean anything to me, but this:
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
seems to me that the issue here is that there is not way to have a "defaultdict comprehension"
I can't think of syntactically clean way to make that possible, though.
Could itertools.groupby help here? It seems to work, but boy! it's ugly:
In [*45*]: student_school_list
Out[*45*]:
[('Fred', 'SchoolA'),
('Bob', 'SchoolB'),
('Mary', 'SchoolA'),
('Jane', 'SchoolB'),
('Nancy', 'SchoolC')]
In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted (student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[
...: 1])}
...:
...:
...:
...:
...:
...:
...:
Out[*46*]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}
-CHB
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
I agree with these recommendations. There are excellent 3rd party tools that do what you want. This is way too much to try to shoehorn into a comprehension.

I'd add one more option. You want something that behaves like SQL. Right in the standard library is sqlite3, and you can create an in-memory DB to hold the data you expect to group (sketched below).

On Thu, Jun 28, 2018, 3:48 PM Wes Turner <wes.turner@gmail.com> wrote:
PyToolz, Pandas, Dask .groupby()
toolz.itertoolz.groupby does this succinctly without any new/magical/surprising syntax.
https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.groupby
From https://github.com/pytoolz/toolz/blob/master/toolz/itertoolz.py :
""" def groupby(key, seq): """ Group a collection by a key function >>> names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'] >>> groupby(len, names) # doctest: +SKIP {3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']} >>> iseven = lambda x: x % 2 == 0 >>> groupby(iseven, [1, 2, 3, 4, 5, 6, 7, 8]) # doctest: +SKIP {False: [1, 3, 5, 7], True: [2, 4, 6, 8]} Non-callable keys imply grouping on a member. >>> groupby('gender', [{'name': 'Alice', 'gender': 'F'}, ... {'name': 'Bob', 'gender': 'M'}, ... {'name': 'Charlie', 'gender': 'M'}]) # doctest:+SKIP {'F': [{'gender': 'F', 'name': 'Alice'}], 'M': [{'gender': 'M', 'name': 'Bob'}, {'gender': 'M', 'name': 'Charlie'}]} See Also: countby """ if not callable(key): key = getter(key) d = collections.defaultdict(lambda: [].append) for item in seq: d[key(item)](item) rv = {} for k, v in iteritems(d): rv[k] = v.__self__ return rv """
If you're willing to install Pandas (and NumPy, and ...), there's pandas.DataFrame.groupby:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.grou...
https://github.com/pandas-dev/pandas/blob/v0.23.1/pandas/core/generic.py#L65...
Dask has a different groupby implementation:
https://gist.github.com/darribas/41940dfe7bf4f987eeaa#file-pandas_dask_test-...
https://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFram...
On Thursday, June 28, 2018, Chris Barker via Python-ideas < python-ideas@python.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM, Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
I don't know from SQL, so "group by" doesn't mean anything to me, but this:
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
seems to me that the issue here is that there is not way to have a "defaultdict comprehension"
I can't think of syntactically clean way to make that possible, though.
Could itertools.groupby help here? It seems to work, but boy! it's ugly:
In [*45*]: student_school_list
Out[*45*]:
[('Fred', 'SchoolA'),
('Bob', 'SchoolB'),
('Mary', 'SchoolA'),
('Jane', 'SchoolB'),
('Nancy', 'SchoolC')]
In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted(student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[
...: 1])}
...:
...:
...:
...:
...:
...:
...:
Out[*46*]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}
-CHB
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
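A small sketch of the sqlite3 suggestion above. The table and column names are made up for illustration; group_concat is SQLite's built-in string aggregate:

import sqlite3

student_school_list = [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'),
                       ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE enrolment (student TEXT, school TEXT)")
con.executemany("INSERT INTO enrolment VALUES (?, ?)", student_school_list)

students_by_school = {
    school: names.split(',')
    for school, names in con.execute(
        "SELECT school, group_concat(student) FROM enrolment GROUP BY school")
}
con.close()
# {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}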
On Thu, Jun 28, 2018 at 1:34 PM, David Mertz <mertz@gnosis.cx> wrote:
I'd add one more option. You want something that behaves like SQL. Right in the standard library is sqlite3, and you can create an in-memory DB to hold the data you expect to group.
There are also packages designed to make DB-style queries easier. Here's one I found with a quick google. -CHB
On Thu, Jun 28, 2018, 3:48 PM Wes Turner <wes.turner@gmail.com> wrote:
PyToolz, Pandas, Dask .groupby()
toolz.itertoolz.groupby does this succinctly without any new/magical/surprising syntax.
https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.groupby
From https://github.com/pytoolz/toolz/blob/master/toolz/itertoolz.py :
""" def groupby(key, seq): """ Group a collection by a key function >>> names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'] >>> groupby(len, names) # doctest: +SKIP {3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']} >>> iseven = lambda x: x % 2 == 0 >>> groupby(iseven, [1, 2, 3, 4, 5, 6, 7, 8]) # doctest: +SKIP {False: [1, 3, 5, 7], True: [2, 4, 6, 8]} Non-callable keys imply grouping on a member. >>> groupby('gender', [{'name': 'Alice', 'gender': 'F'}, ... {'name': 'Bob', 'gender': 'M'}, ... {'name': 'Charlie', 'gender': 'M'}]) # doctest:+SKIP {'F': [{'gender': 'F', 'name': 'Alice'}], 'M': [{'gender': 'M', 'name': 'Bob'}, {'gender': 'M', 'name': 'Charlie'}]} See Also: countby """ if not callable(key): key = getter(key) d = collections.defaultdict(lambda: [].append) for item in seq: d[key(item)](item) rv = {} for k, v in iteritems(d): rv[k] = v.__self__ return rv """
If you're willing to install Pandas (and NumPy, and ...), there's pandas.DataFrame.groupby:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://github.com/pandas-dev/pandas/blob/v0.23.1/pandas/core/generic.py#L6586-L6659
Dask has a different groupby implementation: https://gist.github.com/darribas/41940dfe7bf4f987eeaa#file-pandas_dask_test-ipynb
https://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.groupby
On Thursday, June 28, 2018, Chris Barker via Python-ideas < python-ideas@python.org> wrote:
On Thu, Jun 28, 2018 at 8:25 AM, Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
I don't know from SQL, so "group by" doesn't mean anything to me, but this:
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
seems to me that the issue here is that there is not way to have a "defaultdict comprehension"
I can't think of syntactically clean way to make that possible, though.
Could itertools.groupby help here? It seems to work, but boy! it's ugly:
In [*45*]: student_school_list
Out[*45*]:
[('Fred', 'SchoolA'),
('Bob', 'SchoolB'),
('Mary', 'SchoolA'),
('Jane', 'SchoolB'),
('Nancy', 'SchoolC')]
In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted (student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[
...: 1])}
...:
...:
...:
...:
...:
...:
...:
Out[*46*]: {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}
-CHB
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Thu, Jun 28, 2018 at 3:17 PM, Chris Barker <chris.barker@noaa.gov> wrote:
There are also packages designed to make DB-style queries easier.
Here's one I found with a quick google.
oops -- hit send too soon:

http://178.62.194.22/
https://github.com/pythonql/pythonql

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
2018-06-28 22:34 GMT+02:00 David Mertz <mertz@gnosis.cx>:
I agree with these recommendations. There are excellent 3rd party tools that do what you want. This is way too much to try to shoehorn into a comprehension.
There are actually no 3rd party tools that can "do what I want", because if I wanted to have a function to do a group by, I would have taken the 5 minutes and 7 lines necessary to do so (or not use a function and just do my 3-liner).

My main point is that comprehensions in python are very powerful and you can do pretty much any basic data manipulation that you want with them, EXCEPT when you want to "split" a list into sublists, in which case you have either to use functions or a for loop. You can note that with list comprehension you can flatten an iterable (from sublists to a single list) with the [a for b in c for a in b] syntax, but doing the inverse operation is impossible.

The questions I should have asked in my original post were:

- Is splitting lists into sublists (by grouping elements) a high level enough construction to be worthy of a nice integration in the comprehension syntax?
- In which case, is there a way to find a simple syntax that is not too confusing?

My personal answer would be respectively "yes" and "maybe I don't know". I was hoping to have some views on the topic, and it seems to have gotten a bit sidetracked :)

-- Nicolas Rolin
On Thu, Jun 28, 2018, 6:46 PM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
The questions I should have asked In my original post was : - Is splitting lists into sublists (by grouping elements) a high level enough construction to be worthy of a nice integration in the comprehension syntax ?
My intuition is no, it's not important enough to alter the syntax, despite being an important task. - In which case, is there a way to find a simple syntax that is not too
confusing ?
If you'd like to give it a shot, try to find something which is currently invalid syntax, but does not break compatibility. The latter criteria means no new keywords. The syntax should look nice as a single line with reasonably verbose variable names. One issue is that Python code is mostly 1-dimensional, characters in a line, and you're trying to express something which is 2-dimensional, in a sense. There's only so much you can do without newlines and indentation.
A syntax that would work (which atm is a syntax error, and requires no new keyword) would be

student_by_school = {school: [student] for school, student in student_school_list, grouped=True}

with grouped=True being a modifier on the dict comprehension so that at each iteration of the loop

current_dict[key] = value if key not in current_dict else current_dict[key] + value

(a rough expansion of this is sketched right after this message). This is an extremely borderline syntax (as it is perfectly legal to put **{'grouped': True} in a dict comprehension), but it works. It even keeps the extremely important "should look like a template of the final object" property. But it doesn't require me to define 2 lambda functions just to do the job of a comprehension.

-- Nicolas Rolin

2018-06-29 4:57 GMT+02:00 Michael Selik <mike@selik.org>:
On Thu, Jun 28, 2018, 6:46 PM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
The questions I should have asked In my original post was : - Is splitting lists into sublists (by grouping elements) a high level enough construction to be worthy of a nice integration in the comprehension syntax ?
My intuition is no, it's not important enough to alter the syntax, despite being an important task.
- In which case, is there a way to find a simple syntax that is not too
confusing ?
If you'd like to give it a shot, try to find something which is currently invalid syntax, but does not break compatibility. The latter criteria means no new keywords. The syntax should look nice as a single line with reasonably verbose variable names.
One issue is that Python code is mostly 1-dimensional, characters in a line, and you're trying to express something which is 2-dimensional, in a sense. There's only so much you can do without newlines and indentation.
--
Nicolas Rolin | Data Scientist
+ 33 631992617 - nicolas.rolin@tiime.fr
15 rue Auber, 75009 Paris
www.tiime.fr
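For clarity, roughly what the proposed grouped=True comprehension above would have to expand to, following the per-iteration rule Nicolas describes. The pairs list below is hypothetical sample data shaped to match his (school, student) unpacking:

student_school_list = [('SchoolA', 'Fred'), ('SchoolB', 'Bob'), ('SchoolA', 'Mary')]

student_by_school = {}
for school, student in student_school_list:
    key, value = school, [student]
    if key not in student_by_school:
        student_by_school[key] = value
    else:
        student_by_school[key] = student_by_school[key] + value

print(student_by_school)   # {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob']}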
Can I make a plea for people to not post code with source highlighting as HTML please? It is rendered like this for some of us: On Thu, Jun 28, 2018 at 10:01:00AM -0700, Chris Barker via Python-ideas wrote: In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted(student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[ ... (Aside from the iPython prompt, the rest ought to be legal Python but isn't because of the extra asterisks added.) And in the archives: https://mail.python.org/pipermail/python-ideas/2018-June/051723.html Gmail, I believe, has a "Paste As Plain Text" command in the right-click menu. Or possibly find a way to copy the text without formatting in the first case. Thanks, -- Steve
On Thu, Jun 28, 2018 at 4:59 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Can I make a plea for people to not post code with source highlighting as HTML please? It is rendered like this for some of us:
On Thu, Jun 28, 2018 at 10:01:00AM -0700, Chris Barker via Python-ideas wrote:
In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted(student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[
Oh god -- yeach!! -- sorry about that -- that was copied and pasted from iPython -- I was assuming it would strip out the formatting and give reasonable plain text -- but apparently not. I'll stop that. -CHB
Ctrl-Shift-V pastes without HTML formatting. On Thursday, June 28, 2018, Steven D'Aprano <steve@pearwood.info> wrote:
Can I make a plea for people to not post code with source highlighting as HTML please? It is rendered like this for some of us:
On Thu, Jun 28, 2018 at 10:01:00AM -0700, Chris Barker via Python-ideas wrote:
In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted(student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[ ...
(Aside from the iPython prompt, the rest ought to be legal Python but isn't because of the extra asterisks added.)
And in the archives:
https://mail.python.org/pipermail/python-ideas/2018-June/051723.html
Gmail, I believe, has a "Paste As Plain Text" command in the right-click menu. Or possibly find a way to copy the text without formatting in the first case.
Thanks,
-- Steve
Steven D'Aprano wrote:
On Thu, Jun 28, 2018 at 10:01:00AM -0700, Chris Barker via Python-ideas wrote:
In [*46*]: {a:[t[0] *for* t *in* b] *for* a,b *in* groupby(sorted(student_school_list, key=*lambda* t: t[1]), key=*lambda* t: t[ ...
the rest ought to be legal Python but isn't
We should *make* it legal Python code! Then there would be no difficulty with adding new keywords! -- Greg
On 28/06/18 16:25, Nicolas Rolin wrote:
Hi,
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
What I would expect would be a syntax with comprehension allowing me to write something along the lines of:
student_by_school = {group_by(school): student for school, student in student_school_list}
or any other syntax that allows me to regroup items from an iterable.
Sorry, I don't like the extra load on comprehensions here. You are doing something inherently somewhat complicated and then attempting to hide the magic. Worse, you are hiding it by pretending to be something else (an ordinary comprehension), which will break people's intuition about what is being produced. -- Rhodri James *-* Kynesim Ltd
Why not write a helper function? Something like

def group_by(iterable, groupfunc, itemfunc=lambda x: x, sortfunc=lambda x: x):  # Python 2 & 3 compatible!
    D = {}
    for x in iterable:
        group = groupfunc(x)
        D[group] = D.get(group, []) + [itemfunc(x)]
    if sortfunc is not None:
        for group in D:
            D[group] = sorted(D[group], key=sortfunc)
    return D

Then:

student_list = [
    ('james', 'Dublin'),
    ('jim', 'Cork'),
    ('mary', 'Cork'),
    ('fred', 'Dublin'),
]
student_by_school = group_by(student_list,
                             lambda stu_sch: stu_sch[1],
                             lambda stu_sch: stu_sch[0])
print(student_by_school)
{'Dublin': ['fred', 'james'], 'Cork': ['jim', 'mary']}

Regards
Rob Cliffe

On 28/06/2018 16:25, Nicolas Rolin wrote:
Hi,
I use list and dict comprehension a lot, and a problem I often have is to do the equivalent of a group_by operation (to use sql terminology).
For example if I have a list of tuples (student, school) and I want to have the list of students by school the only option I'm left with is to write
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
What I would expect would be a syntax with comprehension allowing me to write something along the lines of:
student_by_school = {group_by(school): student for school, student in student_school_list}
or any other syntax that allows me to regroup items from an iterable.
Small FAQ:
Q: Why include something in comprehensions when you can do it in a small number of lines ?
A: A really appreciable part of the list and dict comprehension is the fact that it allows the developer to be really explicit about what he wants to do at a given line. If you see a comprehension, you know that the developer wanted to have an iterable and not have any side effect other than depleting the iterator (if he respects reasonable code guidelines). Initializing an object and doing a for loop to construct it is both too long and not explicit enough about what is intended. It should be reserved for intrinsically complex operations, not one of the base operation one can want to do with lists and dicts.
Q: Why group by in particular ?
A: If we take SQL queries (https://en.wikipedia.org/wiki/SQL_syntax#Queries) as a reasonable way of seeing how people need to manipulate data on a day-to-day basis, we can see that dict comprehensions already covers most of the base operations, the only missing operations being group by and having.
Q: Why not use it on list with syntax such as student_by_school = [ school, student for school, student in student_school_list group by school ] ?
A: It would create either a discrepancy with iterators or a perhaps misleading semantic (the one from itertools.groupby, which requires the iterable to be sorted in order to be useful). Having the option do do it with a dict remove any ambiguity and should be enough to cover most "group by" applications.
Examples:
edible_list = [('fruit', 'orange'), ('meat', 'eggs'), ('meat', 'spam'), ('fruit', 'apple'), ('vegetable', 'fennel'), ('fruit', 'pineapple'), ('fruit', 'pineapple'), ('vegetable', 'carrot')] edible_list_by_food_type = {group_by(food_type): edible for food_type, edible in edible_list}
print(edible_list_by_food_type) {'fruit': ['orange', 'pineapple'], 'meat': ['eggs', 'spam'], 'vegetable': ['fennel', 'carrot']}
bank_transactions = [200.0, -357.0, -9.99, -15.6, 4320.0, -1200.0] splited_bank_transactions = {group_by('credit' if amount > 0 else 'debit'): amount for amount in bank_transactions}
print(splited_bank_transactions) {'credit': [200.0, 4320.0], 'debit': [-357.0, -9.99, -15.6, -1200.0]}
-- Nicolas Rolin
On Thu, Jun 28, 2018 at 10:24 AM Rob Cliffe via Python-ideas < python-ideas@python.org> wrote:
def group_by(iterable, groupfunc, itemfunc=lambda x:x, sortfunc=lambda x:x): # Python 2 & 3 compatible!
D = {} for x in iterable: group = groupfunc(x) D[group] = D.get(group, []) + [itemfunc(x)] if sortfunc is not None: for group in D: D[group] = sorted(D[group], key=sortfunc) return D
The fact that you didn't use ``setdefault`` here, opting for repeatedly constructing new lists via concatenation, demonstrates the need for a built-in or standard library tool that is easier to use. I'll submit a proposal for your review soon.
On Thu, Jun 28, 2018 at 11:23:49AM -0700, Michael Selik wrote:
The fact that you didn't use ``setdefault`` here, opting for repeatedly constructing new lists via concatenation, demonstrates the need for a built-in or standard library tool that is easier to use.
That would be setdefault :-) What it indicates to me is the need for people to learn to use setdefault, rather than new syntax :-) -- Steve
On Thu, Jun 28, 2018 at 4:23 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Nicolas Rolin wrote:
student_by_school = {group_by(school): student for school, student in student_school_list}
In the spirit of making the target expression look like a template for the generated elements,
{school: [student...] for school, student in student_school_list}
hmm -- this seems a bit non-general -- would this only work for a list? maybe you would want a set, or???

so could we get a defaultdict comprehension with something like:

{ school: (default_factory=list, student) for school, student in student_school_list }

But I can't think of a reasonable syntax to make that work.

-CHB
-- Greg
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Thu, Jun 28, 2018 at 4:34 PM Chris Barker via Python-ideas < python-ideas@python.org> wrote:
On Thu, Jun 28, 2018 at 4:23 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Nicolas Rolin wrote:
student_by_school = {group_by(school): student for school, student in student_school_list}
In the spirit of making the target expression look like a template for the generated elements,
{school: [student...] for school, student in student_school_list}
hmm -- this seems a bit non-general -- would this only work for a list? maybe you would want a set, or???
so could we get a defaultdict comprehension with something like:
{ school: (default_factory=list, student) for school, student in student_school_list }
But I can't think of a reasonable syntax to make that work.
Many languages with a group-by or grouping function choose to return a mapping of sequences, requiring any reduction, aggregation, or transformation of those sequences to be performed after the grouping.
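For example, once the grouping returns a mapping of sequences, any reduction or aggregation is just another comprehension over it (a small sketch with illustrative data):

groups = {'SchoolA': ['Fred', 'Mary'], 'SchoolB': ['Bob', 'Jane'], 'SchoolC': ['Nancy']}

# Reduce each group after the fact, e.g. count the students per school:
class_sizes = {school: len(students) for school, students in groups.items()}
# {'SchoolA': 2, 'SchoolB': 2, 'SchoolC': 1}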
Hold the phone! On Thu, Jun 28, 2018 at 8:25 AM, Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
student_by_school = defaultdict(list) for student, school in student_school_list: student_by_school[school].append(student)
What I would expect would be a syntax with comprehension allowing me to write something along the lines of:
student_by_school = {group_by(school): student for school, student in student_school_list}
OK -- I agreed that this could/should be easier, and pretty much like using setdefault, but did like the single expression thing, so went to "there should be a way to make a defaultdict comprehension" -- and played with itertools.groupby (which is really really awkward for this), but then light dawned on Marblehead: I've noticed (and taught) that dict comprehensions are kinda redundant with the dict() constructor, and _think_, in fact, that they were added before the current dict() constructor was added. so, if you think "dict constructor" rather than dict comprehensions, you realize that defaultdict takes the same arguments as the dict(), so the above is: defaultdict(list, student_by_school) which really couldn't be any cleaner and neater..... Here it is in action: In [97]: student_school_list Out[97]: [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'), ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')] In [98]: result = defaultdict(list, student_by_school) In [99]: result.items() Out[99]: dict_items([('SchoolA', ['Fred', 'Mary']), ('SchoolB', ['Bob', 'Jane']), ('SchoolC', ['Nancy'])]) So: <small voice> never mind </small voice> -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Thu, Jun 28, 2018 at 5:12 PM Chris Barker via Python-ideas < python-ideas@python.org> wrote:
In [97]: student_school_list Out[97]: [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'), ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]
In [98]: result = defaultdict(list, student_by_school)
In [99]: result.items() Out[99]: dict_items([('SchoolA', ['Fred', 'Mary']), ('SchoolB', ['Bob', 'Jane']), ('SchoolC', ['Nancy'])])
Wait, wha... In [1]: from collections import defaultdict In [2]: students = [('Fred', 'SchoolA'), ...: ('Bob', 'SchoolB'), ...: ('Mary', 'SchoolA'), ...: ('Jane', 'SchoolB'), ...: ('Nancy', 'SchoolC')] ...: In [3]: defaultdict(list, students) Out[3]: defaultdict(list, {'Fred': 'SchoolA', 'Bob': 'SchoolB', 'Mary': 'SchoolA', 'Jane': 'SchoolB', 'Nancy': 'SchoolC'}) In [4]: defaultdict(list, students).items() Out[4]: dict_items([('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'), ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]) I think you accidentally swapped variables there: student_school_list vs student_by_school
I think you accidentally swapped variables there: student_school_list vs student_by_school
Oops, yeah. That's what I get for whipping out a message before catching a bus. (And on a phone now)

But maybe you could wrap the defaultdict constructor around a generator expression that transforms the list first. That would get the keys right, though it still wouldn't call append for you.

So maybe a solution is an accumulator special case of defaultdict -- it uses a list by default and appends by default.

Almost like Counter...

-CHB
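A quick sketch of why wrapping defaultdict around a transforming generator expression still falls short -- the dict constructor assigns rather than appends (illustrative data matching the thread's example):

from collections import defaultdict

student_school_list = [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'),
                       ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]

result = defaultdict(list, ((school, student)
                            for student, school in student_school_list))
print(result)
# defaultdict(<class 'list'>, {'SchoolA': 'Mary', 'SchoolB': 'Jane', 'SchoolC': 'Nancy'})
# The keys are the schools, but each value is just the *last* student seen:
# the constructor assigns each (key, value) pair, it never appends.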
On Jun 28, 2018, at 5:30 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
So maybe a solution is an accumulator special case of defaultdict -- it uses a list by default and appends by default.
Almost like counter...
Which, of course, is pretty much what your proposal is. Which makes me think — a new classmethod on the builtin dict is a pretty heavy lift compared to a new type of dict in the collections module. -CHB
There are a few tools that can accomplish these map-reduce/transformation tasks. See Options A, B, C below.

# Given
>>> import itertools as it
>>> import collections as ct
>>> import more_itertools as mit
>>> student_school_list = [
...     ("Albert", "Prospectus"), ("Max", "Smallville"), ("Nikola", "Shockley"), ("Maire", "Excelsior"),
...     ("Neils", "Smallville"), ("Ernest", "Tabbicage"), ("Michael", "Shockley"), ("Stephen", "Prospectus")
... ]
>>> kfunc = lambda x: x[1]
>>> vfunc = lambda x: x[0]
>>> sorted_iterable = sorted(student_school_list, key=kfunc)

# Example (see OP)
>>> student_by_school = ct.defaultdict(list)
>>> for student, school in student_school_list:
...     student_by_school[school].append(student)
>>> student_by_school
defaultdict(list, {'Prospectus': ['Albert', 'Stephen'], 'Smallville': ['Max', 'Neils'], 'Shockley': ['Nikola', 'Michael'], 'Excelsior': ['Maire'], 'Tabbicage': ['Ernest']})

---

# Options

# A: itertools.groupby
>>> {k: [x[0] for x in v] for k, v in it.groupby(sorted_iterable, key=kfunc)}
{'Excelsior': ['Maire'], 'Prospectus': ['Albert', 'Stephen'], 'Shockley': ['Nikola', 'Michael'], 'Smallville': ['Max', 'Neils'], 'Tabbicage': ['Ernest']}

# B: more_itertools.groupby_transform
>>> {k: list(v) for k, v in mit.groupby_transform(sorted_iterable, keyfunc=kfunc, valuefunc=vfunc)}
{'Excelsior': ['Maire'], 'Prospectus': ['Albert', 'Stephen'], 'Shockley': ['Nikola', 'Michael'], 'Smallville': ['Max', 'Neils'], 'Tabbicage': ['Ernest']}

# C: more_itertools.map_reduce
>>> mit.map_reduce(student_school_list, keyfunc=kfunc, valuefunc=vfunc)
defaultdict(None, {'Prospectus': ['Albert', 'Stephen'], 'Smallville': ['Max', 'Neils'], 'Shockley': ['Nikola', 'Michael'], 'Excelsior': ['Maire'], 'Tabbicage': ['Ernest']})

---

# Summary
- Option A: standard library, sorted iterable, some manual value transformations (via list comprehension)
- Option B: third-party tool, sorted iterable, accepts a value transformation function
- Option C: third-party tool, any iterable, accepts transformation function(s)

I have grown to like `itertools.groupby`, but I understand it can be odd at first. Perhaps something like the `map_reduce` tool (or approach) may help? It's simple, does not require a sorted iterable as in A and B, and you have control over how you want your keys, values and aggregated/reduced values to be (see docs for more details).

# Documentation
- Option A: https://docs.python.org/3/library/itertools.html#itertools.groupby
- Option B: https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.grou...
- Option C: https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.map_...

On Thu, Jun 28, 2018 at 8:37 PM, Chris Barker - NOAA Federal via Python-ideas <python-ideas@python.org> wrote:
On Jun 28, 2018, at 5:30 PM, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
So maybe a solution is an accumulator special case of defaultdict -- it uses a list by default and appends by default.
Almost like counter...
Which, of course, is pretty much what your proposal is.
Which makes me think — a new classmethod on the builtin dict is a pretty heavy lift compared to a new type of dict in the collections module.
-CHB
I think you cheated a little in your cut-and-paste. `student_by_school` is not defined in the code you've shown. What you **did** define, `student_school_list`, doesn't give you what you want if you use `defaultdict(list, student_school_list)`.

I thought for a moment I might just use:

[(b, a) for a, b in student_school_list]

But that's wrong for reasons that are probably obvious to everyone else. I'm not really sure what `student_by_school` could possibly be to make this work as shown.

On Thu, Jun 28, 2018 at 8:13 PM Chris Barker via Python-ideas <python-ideas@python.org> wrote:
In [97]: student_school_list Out[97]: [('Fred', 'SchoolA'), ('Bob', 'SchoolB'), ('Mary', 'SchoolA'), ('Jane', 'SchoolB'), ('Nancy', 'SchoolC')]
In [98]: result = defaultdict(list, student_by_school)
In [99]: result.items() Out[99]: dict_items([('SchoolA', ['Fred', 'Mary']), ('SchoolB', ['Bob', 'Jane']), ('SchoolC', ['Nancy'])])
So: <small voice> never mind </small voice>
-CHB
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
participants (20)
- Cammil Taank
- Chris Barker
- Chris Barker - NOAA Federal
- David Mertz
- Franklin? Lee
- Greg Ewing
- Guido van Rossum
- INADA Naoki
- Ivan Levkivskyi
- Michael Selik
- Michel Desmoulin
- MRAB
- Nick Coghlan
- Nicolas Rolin
- pylang
- Rhodri James
- Rob Cliffe
- Serhiy Storchaka
- Steven D'Aprano
- Wes Turner