Fwd: grouping / dict of lists

Steven: You've misunderstood part of the discussion. There are two different signatures being discussed/proposed for a grouping() function. The one you show we might call grouping_michael(). The alternate API we might call grouping_chris(). These two calls will produce the same result (the first output you show):

    grouping_michael(words, keyfunc=len)
    grouping_chris((len(word), word) for word in words)

I happen to prefer grouping_michael(), but recognize they each make slightly different things obvious. Absolutely no one wants the behavior in your second output. On Tue, Jul 3, 2018, 9:32 PM Steven D'Aprano <steve@pearwood.info> wrote:
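[For readers following along: a minimal sketch of the two proposed signatures. The names are just thread shorthand, not real APIs, and these implementations are illustrative only.]

```python
def grouping_michael(iterable, keyfunc):
    # Group elements by the result of a key function.
    groups = {}
    for item in iterable:
        groups.setdefault(keyfunc(item), []).append(item)
    return groups

def grouping_chris(pairs):
    # Group an iterable of (key, value) pairs by key.
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

words = ['a', 'bb', 'cc', 'ddd']
assert grouping_michael(words, keyfunc=len) == \
       grouping_chris((len(w), w) for w in words)
# both give {1: ['a'], 2: ['bb', 'cc'], 3: ['ddd']}
```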

There are some cases when that's the correct behavior. It mimics pandas.DataFrame.groupby. For example, what if you have a sequence of (key, v1, v2) triples? Grouping by key while keeping the triples intact is sometimes the right choice. On Wed, Jul 4, 2018, 6:39 AM David Mertz <mertz@gnosis.cx> wrote:
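[A quick sketch of the triples case described above, with invented data: group by the key field while keeping each row whole.]

```python
rows = [('a', 1, 2), ('b', 3, 4), ('a', 5, 6)]

groups = {}
for row in rows:
    # Group on the first field, but store the entire triple.
    groups.setdefault(row[0], []).append(row)

assert groups == {'a': [('a', 1, 2), ('a', 5, 6)],
                  'b': [('b', 3, 4)]}
```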

On Wed, Jul 4, 2018 at 6:34 AM, David Mertz <mertz@gnosis.cx> wrote:
I started thinking grouping_chris was the obvious and natural thing to do, but this discussion has made it clear that grouping_michael is more natural for some kinds of data. And in some cases, it really comes down to taste. After all, who's to say which of these is "better":

    map(func, iterable)

or

    (expression for item in iterable)

Given that map existed in Python when comprehensions were added, I tend to see the latter as more "Pythonic", but that's just me. So I'm currently lobbying for both :-) The default is an iterable of (key, value) pairs, but the user can specify a key function if they want to do it that way. While a bit of a schizophrenic API, it makes sense (to me), because grouping_michael isn't useful with a default key function anyway. The other enhancement I suggest is that an (optional) value function be added, as there are use cases where that would be really helpful. -CHB

NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
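[The hybrid API being lobbied for here might look roughly like this. This is a sketch, not an existing builtin; the name `grouped` is borrowed from later in the thread.]

```python
def grouped(iterable, key=None, value=None):
    # key=None: treat the iterable as (key, value) pairs.
    # key given: apply it to each item; value (optional) transforms the item.
    groups = {}
    if key is None:
        for k, v in iterable:
            groups.setdefault(k, []).append(v)
    else:
        for item in iterable:
            v = value(item) if value is not None else item
            groups.setdefault(key(item), []).append(v)
    return groups

words = ['apple', 'fig', 'kiwi']
# Key-function style and pairs style give the same result:
assert grouped(words, key=len) == grouped((len(w), w) for w in words)
# An optional value function transforms items as they are grouped:
assert grouped(words, key=len, value=str.upper) == \
       {5: ['APPLE'], 3: ['FIG'], 4: ['KIWI']}
```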

On Thu, Jul 5, 2018 at 1:23 AM, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
I use this kind of function in several different projects over the years, and I rewrote it many times as needed. I added several options, such as:
- key function
- value function
- "ignore": Skip values with these keys.
- "postprocess": Apply a function to each group after completion.
- Pass in the container to store in. For example, create an OrderedDict and pass it in. It may already hold items.
- Specify the container for each group.
- Specify how to add to the container for each group.

Then I cut it down to two optional parameters:
- key function. If not provided, the iterable is considered to have key-value pairs.
- The storage container.

Finally, I removed the key function, and only took pairs and an optional container. However, I don't remember why I removed the key function. It may be that I was writing throwaway lambdas, and I decided I might as well just write the transformation into the comprehension. I think a key function is worth having.

One thing I didn't do is create a class for this container. I used defaultdict, then used a defaultdict but converted it to a plain dict, and finally used a plain dict.

Aside: I also wrote a lazy grouping function, which returned a lazy container of generators. Asking for next(grouping[k]) will cause the container to iterate through the original iterable until it has something to add to group k. I have never used it, but it was fun to try and make it.

    class IterBuckets(dict):
        def __init__(self, pairs):
            self.pairs = iter(pairs)

        def __missing__(self, key):
            return self.setdefault(key, IterBucket(self.advance))

        def advance(self):
            k, v = next(self.pairs)
            self[k].append(v)

    class IterBucket(collections.deque):
        def __init__(self, more):
            self.more = more

        def __iter__(self):
            return self

        def __next__(self):
            while not self:
                self.more()
            return self.popleft()

On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee wrote:

I use this kind of function in several different projects over the years, and I rewrote it many times as needed.
interesting...
OK -- seems we're all converging on that :-)
- The storage container.
so this means you're passing in a full set of storage containers? I'm a bit confused by that -- if they might be pre-populated, then they would need to be instances, and you'd need to have one for every key -- how would you know in advance what you needed??? I played around with passing in an optional storage object: https://github.com/PythonCHB/grouper/commit/d986816905406ec402724beaed2b88c9... but as we might want a list or a set, or a Counter, or ??? it got pretty ugly, as lists and sets and Counters all have different APIs for adding stuff. So I gave up and figured just saying "it's always a list" made the most sense.
exactly -- but I suspect that may be because you were writing a comprehension anyway, as you needed to manipulate the values, too -- so if there were a value function, you could use either API.
I think a key function is worth having.
I think there's more or less consensus on that too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

In my mind, I *rarely* (which is more than never) have my data in the form of a sequence of key/value pairs. The version of the API that assumes data starts that way feels like either a niche case, or demands preprocessing before it's ready to pass to grouping() or collections.Grouping(). That said, an identity key is rarely interesting either. So I think having key=None mean "assume we get key/val pairs" is harmless to the more common case where we give an explicit key function. The uncommon need for grouping on equality can be handled with 'key=lambda x: x'. On Mon, Jul 9, 2018, 12:22 PM Chris Barker <chris.barker@noaa.gov> wrote:

On Mon, Jul 9, 2018 at 3:38 PM, David Mertz <mertz@gnosis.cx> wrote:
sure, but it makes it easy to use a different approach -- i.e. a comprehension with expressions rather than a key (and maybe value) function. At least two people have expressed a preference for that.

That said, an identity key is rarely interesting either.
is it EVER interesting?? wouldn't it be essentially a Counter, without the counting? :-)
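[For illustration of the quip above: identity-key grouping really is a Counter that keeps the items instead of counting them. Data invented.]

```python
from collections import Counter

data = ['a', 'b', 'a', 'c', 'a']

# Identity-key grouping: each item is its own key.
groups = {}
for x in data:
    groups.setdefault(x, []).append(x)

# Same keys as a Counter, and each group's length is exactly the count:
assert groups == {'a': ['a', 'a', 'a'], 'b': ['b'], 'c': ['c']}
assert {k: len(v) for k, v in groups.items()} == dict(Counter(data))
```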
So I think having key=None mean "assume we get key/val pairs" is harmless to the more common case where we give an explicit key function.
I'm not sure we'll know what's more common till it's loose in the wild -- any problem can be solved either way -- which one will people prefer? Given that there is a preference for comprehensions over map() -- I think it will see a lot of use. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, Jul 9, 2018 at 12:22 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee wrote:

I use this kind of function in several different projects over the years, and I rewrote it many times as needed.
No, I mean the mapping (outer) container. For example, I can pass in an empty OrderedDict, or a dict that already contains some groups from a previous call to the grouping function.

I took out the option for the per-group (inner) containers. I never found it necessary to scrimp on the memory, when I could postprocess the lists after grouping. A mapvalues function would make postprocessing more convenient, and lend weight to a dicttools suggestion.

    # Unfortunate double meaning of 'map' in the function signature:
    def mapvalues(f, mapping):
        try:
            items = mapping.items()
        except AttributeError:
            items = mapping
        return {k: f(v) for k, v in items}
My solution at the time was to add another parameter to specify how to add to the container. In fuller generality, the option for the per-group container may consist of specifying a monad (if I remember monads correctly). You need to at least specify a per-group container constructor and a binary function that adds to it. In the case of `Counter`, the constructor is `int` and the binary function is `int.__add__`, and the Counter constructor effectively runs concurrent `reduce`.
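[A sketch of that generalization, with hypothetical names: parameterize the grouping by a per-group constructor and a binary "add one element" function, so that lists, sets, and Counter-style sums all fall out of the same loop.]

```python
import operator

def grouped_into(pairs, make, add):
    # make(): construct an empty per-group accumulator.
    # add(group, value): return the accumulator with value added.
    groups = {}
    for k, v in pairs:
        if k not in groups:
            groups[k] = make()
        groups[k] = add(groups[k], v)
    return groups

def append(lst, v):
    lst.append(v)
    return lst

pairs = [('a', 1), ('b', 2), ('a', 3)]

# list + append gives ordinary grouping:
assert grouped_into(pairs, list, append) == {'a': [1, 3], 'b': [2]}
# int + addition is exactly the Counter-style per-key reduce:
assert grouped_into(pairs, int, operator.add) == {'a': 4, 'b': 2}
```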

On Mon, Jul 9, 2018 at 5:55 PM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
Sure -- that's what my prototype does if you pass a Mapping in (or use .update() ) why not? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

I noticed recently that *all* examples for collections.defaultdict ( https://docs.python.org/3.7/library/collections.html#collections.defaultdict ) are cases of grouping (for an int, a list and a set) from an iterator with a key, value output. I wondered how common those constructions were, and what else defaultdict is used for. So I took a little dive into a few libs to see (stdlib, pypy, pandas, tensorflow, ...), and I saw essentially:

A) basic cases of "grouping" with a simple for loop and a default_dict[key].append(value). I saw many kinds of default factory used, with list, int, set, dict, and even defaultdict(list). ex: https://frama.link/UtNqvpvb, https://frama.link/o3Hb3-4U, https://frama.link/dw92yJ1q, https://frama.link/1Gqoa7WM, https://frama.link/bWswbHsU, https://frama.link/SZh2q8pS

B) cases of grouping, but where the for loop used was feeding more than one "grouper". Pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a, https://frama.link/bZakUR33, https://frama.link/MwJFqh5o

C) class attribute initialization (grouping is done by repeatedly calling a function, so any grouping constructor will be useless here). ex: https://frama.link/GoGWuQwR, https://frama.link/BugcS8wU

D) Sometimes you just want a defaultdict inside a defaultdict inside a dict and just have fun: https://frama.link/asBNLr1g, https://frama.link/8j7gzfA5
Sample code would be:

    from collections import defaultdict

    class groupingdict(defaultdict):
        def group_by_iterator(self, iterator):
            empty_element = self.default_factory()
            if hasattr(empty_element, "__add__"):
                for key, element in iterator:
                    self[key] += element
            elif hasattr(empty_element, "update"):
                for key, element in iterator:
                    self[key].update(element)
            elif hasattr(empty_element, "add"):
                for key, element in iterator:
                    self[key].add(element)
            else:
                raise TypeError('default factory does not support iteration')
            return self

So that for example:
groupingdict(dict, {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}})
My implementation is garbage and there should be two methods, one returning the object and one modifying it, but I think it gives more leeway than just a function returning a dict.

2018-07-13 7:11 GMT+02:00 Chris Barker via Python-ideas <python-ideas@python.org>:
-- -- *Nicolas Rolin* | Data Scientist + 33 631992617 - nicolas.rolin@tiime.fr <prenom.nom@tiime.fr> *15 rue Auber, **75009 Paris* *www.tiime.fr <http://www.tiime.fr>*

Thanks for linking to these. I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately. Throughout, I'm going to use my proposed ``grouped`` builtin to demonstrate possible revisions. Note that I am *not* suggesting a replacement to defaultdict. The proposal is to make a common task easier and more reliable. It does not satisfy all uses of defaultdict. https://github.com/selik/peps/blob/master/pep-9999.rst On Fri, Jul 13, 2018 at 6:17 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
    facesizes = collections.defaultdict(int)
    for face in cubofaces:
        facesizes[len(face)] += 1

This is more specifically a Counter. It might look better as:

    facesizes = collections.Counter(len(face) for face in cubofaces)

https://frama.link/o3Hb3-4U,
    accum = defaultdict(list)
    garbageitems = []

    for item in root:
        filename = findfile(opts.fileroot, item.attrib['classname'])
        accum[filename].append(float(item.attrib['time']))
        if filename is None:
            garbageitems.append(item)

This might be more clear if separated into two parts.

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped(root, keyfunc)
    groups = {k: [float(v.attrib['time']) for v in g]
              for k, g in groups.items()}
    garbage = groups.pop(None, [])

Interestingly, this code later makes a distinction between ``garbage`` and ``garbageitems`` even though they appear to be the same. The programmer might not have made that error (?) if using the ``grouped`` function, because the act of grouping becomes single-purpose.
In this case, the grouping is being performed by Pandas, not by the defaultdict.

    expected = defaultdict(dict)
    for n1, gp1 in data.groupby('A'):
        for n2, gp2 in gp1.groupby('B'):
            expected[n1][n2] = op(gp2.loc[:, ['C', 'D']])

However, this is still relevant to our discussion, because the defaultdict had to be converted to a regular dict afterward. Presumably to ensure that missing keys cause KeyErrors.

    expected = dict((k, DataFrame(v)) for k, v in compat.iteritems(expected))

If you felt like one-lining this, it might look like

    expected = {n1: DataFrame({n2: op(gp2.loc[:, ['C', 'D']])
                               for n2, gp2 in gp1.groupby('B')})
                for n1, gp1 in data.groupby('A')}

I'm not saying that's more clear, but the fact that it's possible emphasizes that the grouping is performed by Pandas. Grouping cannot be one-lined if restricted to the Python standard library.
    results = defaultdict(dict)
    for i, k1 in enumerate(arg1.columns):
        for j, k2 in enumerate(arg2.columns):
            if j < i and arg2 is arg1:
                # Symmetric case
                results[i][j] = results[j][i]
            else:
                results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
                                                arg2.iloc[:, j]))

The nested loop looks like a cross-product. It might be clearer using itertools, but I wouldn't use ``grouped`` here.

    cross = itertools.product(enumerate(arg1.columns),
                              enumerate(arg2.columns))
    self.mapping = collections.defaultdict(set)
    for op in (op for op in graph.get_operations()):
        if op.name.startswith(common.SKIPPED_PREFIXES):
            continue
        for op_input in op.inputs:
            self.mapping[op_input].add(op)

This is a case of a single element being added to multiple groups, which is your section B, below. The loop and filter could be better. It looks like someone intended to convert if/continue to a comprehension, but stopped partway through the revision.

    ops = (op for op in graph.get_operations()
           if not op.name.startswith(common.SKIPPED_PREFIXES))

I'm not sure how I like it, but here's a revision to use ``grouped``:

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = grouped(inputs, key=itemgetter(0))
    groups = {k: set(v for _, v in g) for k, g in groups.items()}

https://frama.link/SZh2q8pS
    handler_results = collections.defaultdict(list)
    for handler in per_handler_ops.keys():
        for op in per_handler_ops[handler]:
            handler_results[handler].append(op_results[op])
    return handler_results

Here it looks like the results are already grouped by handler and this loop is constructing a parallel dict of lists for the results instead of the ops themselves.

    return {handler: [op_results[op] for op in group]
            for handler, group in per_handler_ops.items()}

B) cases of grouping, but where the for loop used was feeding more than one "grouper". Pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a,
    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        head_nodes = collections.defaultdict(list)
        every = []
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads = _get_head_types(fixer.pattern)
                except _EveryNode:
                    every.append(fixer)
                else:
                    for node_type in heads:
                        head_nodes[node_type].append(fixer)
            else:
                if fixer._accept_type is not None:
                    head_nodes[fixer._accept_type].append(fixer)
                else:
                    every.append(fixer)
        for node_type in chain(pygram.python_grammar.symbol2number.values(),
                               pygram.python_grammar.tokens):
            head_nodes[node_type].extend(every)
        return dict(head_nodes)

Again this ends with a conversion from defaultdict to basic dict: ``return dict(head_nodes)``. This is somewhat common in usage of defaultdict, because after the creation of groups, the later usage wants missing keys to raise KeyError, not insert a new, empty group. The association of a single element to multiple groups makes this a little awkward again.

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        nodetypes = set(chain(pygram.python_grammar.symbol2number.values(),
                              pygram.python_grammar.tokens))
        heads = {}
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads[fixer] = _get_head_types(fixer.pattern)
                except _EveryNode:
                    heads[fixer] = nodetypes
            elif fixer._accept_type is not None:
                heads[fixer] = [fixer._accept_type]
            else:
                heads[fixer] = nodetypes

        pairs = ((head, fixer)
                 for fixer, types in heads.items()
                 for head in types)
        groups = grouped(pairs, key=itemgetter(0))
        return {head: [fixer for _, fixer in g]
                for head, g in groups.items()}

https://frama.link/bZakUR33,
    directories = defaultdict(set)
    cache_directories = defaultdict(set)
    groups = defaultdict(list)
    for source, target, group, disk_id, condition in files:
        target = PureWindowsPath(target)
        groups[group].append((source, target, disk_id, condition))
        if target.suffix.lower() in {".py", ".pyw"}:
            cache_directories[group].add(target.parent)
        for dirname in target.parents:
            parent = make_id(dirname.parent)
            if parent and parent != '.':
                directories[parent].add(dirname.name)

This one has 3 different groupings created simultaneously. We could separate the 3 tasks, but they all rely on a converted ``target`` path, so it'd really end up being 4 loops. At that point, perhaps it's best to do just 1 loop as written.
    paramdicts = {}
    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
                d = {}
                for x in attr:
                    if not hasattr(x, '__iter__'):
                        x = (x,)
                    n = '_'.join(str(v) for v in x).replace(' ', '_')
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
        if '_as_' in name:
            testers[name.split('_as_')[0] + '_as_'].append(name)

This one isn't a multiple-groups issue. We can separate the tasks of paramdicts and testers.

    as_names = (name for name in cls.__dict__ if '_as_' in name)
    testers = grouped(as_names,
                      key=lambda name: name.split('_as_')[0] + '_as_')

The paramdicts isn't a dict of lists, so it's not really appropriate for ``grouped``.
Instead of ``defaultdict(lambda: 0)`` it seems better to use ``Counter()``. It's called "invocation_counts" after all.
This seems like one of the ideal uses for defaultdict. It even has bits of code made to handle the possibly accidental insert of new keys by popping them off afterwards.

D) Sometimes you just want a defaultdict inside a defaultdict inside a dict and just have fun: https://frama.link/asBNLr1g,
I only have a tenuous grasp of git, so the fact that this file isn't in the master branch is stretching my mind. But again, this doesn't feel like a grouping task to me. It probably could be rewritten as such, but not trivially.
It's hard to tell at a glance if this should truly be a dict of dicts of sets or a dict of sets where the keys are 2-tuples. I'll assume the former is correct. On a side note, it looks like they're doing some silly things to fit into the 80 characters per line restriction, like looping over keys and looking up values inside the loop instead of using ``.items()``. I understand that the character limit is strict, but... https://github.com/tensorflow/tensorflow/blob/b202db076ecdc8a0fc80a49252d4a8...
Let's take a look at these two examples:

    grouped(contacts, key=itemgetter('city'))
    grouped(os.listdir('.'), key=lambda filepath: os.path.splitext(filepath)[1])

How would those be rewritten using your proposal?
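[For comparison, the pairs-based style discussed earlier in the thread would express those two groupings roughly as follows. A plain helper stands in for the proposed function, and the `contacts` and file data are invented.]

```python
import os

def grouped_pairs(pairs):
    # Group an iterable of (key, value) pairs into a dict of lists.
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return groups

contacts = [{'name': 'Ann', 'city': 'Oslo'},
            {'name': 'Bo', 'city': 'Oslo'},
            {'name': 'Cy', 'city': 'Lima'}]
by_city = grouped_pairs((c['city'], c) for c in contacts)
assert sorted(by_city) == ['Lima', 'Oslo']

files = ['a.py', 'b.txt', 'c.py']  # stand-in for os.listdir('.')
by_ext = grouped_pairs((os.path.splitext(f)[1], f) for f in files)
assert by_ext == {'.py': ['a.py', 'c.py'], '.txt': ['b.txt']}
```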

On Fri, Jul 13, 2018 at 12:38 PM, Michael Selik <mike@selik.org> wrote:
Thanks for linking to these.
yup -- real use cases are really helpful. Though the other paradigm for grouping is use of setdefault() rather than defaultdict. So it would be nice to look for those, too.
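[For anyone grepping for that other paradigm: the setdefault flavor of grouping typically looks like this, side by side with the defaultdict version. Data invented for illustration.]

```python
from collections import defaultdict

words = ['apple', 'fig', 'kiwi', 'pear']

# defaultdict style:
dd = defaultdict(list)
for w in words:
    dd[len(w)].append(w)

# setdefault style -- same result, no import, plain dict out:
groups = {}
for w in words:
    groups.setdefault(len(w), []).append(w)

assert groups == dict(dd) == {5: ['apple'], 3: ['fig'], 4: ['kiwi', 'pear']}
```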
I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately.
agreed -- and it shouldn't. I'd like to see how some of these pan out with my proposed API: either a Grouping class, or at least (key, value) iterables and/or a value function. I don't have time now to do them all, but for the moment:

I noticed recently that *all* examples for collection.defaultdict (
and yet others on this thread think a (key, value) input would be rare -- I guess it depends on whether you are thinking dict-like already....
so this one is a prime case for a value function -- I think post-processing the groups is a pretty common case -- why make people post-process it?

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    def valuefunc(item):
        return float(item.attrib['time'])

    groups = grouped(root, keyfunc, valuefunc)
    garbage = groups.pop(None, [])

And the post-processing is then mixing comprehension style with key function style (what to call that -- "functional" style?), so why not use a (key, value) iterable:

    groups = grouped((findfile(opts.fileroot, item.attrib['classname']),
                      item.attrib['time'])
                     for item in root)

OK -- that's packing a bit too much into a line, so how about:

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped((keyfunc(item), item.attrib['time']) for item in root)
yeah, this is weird -- but it does make a case for having a class with the option of using a set to collect (which I have in an older commit of my prototype):

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = Grouping(inputs, key=itemgetter(0), collection=set)

otherwise, you could have a method to do it:

    groups.map_on_groups(set)

(not sure I like that method name, but I hope you get the idea) OK, back to work. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

There are some cases when that's the correct behavior. It mimics pandas.DataFrame.groupby. For example, what if you have a sequence of (key, v1, v2) triples? Group by key, then keep the triples intact is the right choice sometimes. On Wed, Jul 4, 2018, 6:39 AM David Mertz <mertz@gnosis.cx> wrote:

On Wed, Jul 4, 2018 at 6:34 AM, David Mertz <mertz@gnosis.cx> wrote:
I starting thinking grouping_chris was the obvious and natural thing to do, but his discussion has made it clear that grouping_michael is more natural for some kinds of data. and in some cases, it really comes down to taste, after all, who's to say which of these is "better" map(func, iterable) or (expression for item in iterable) given that map existed in Python when comprehensions were added, I tend to see the latter as more "Pythonic" but that's just me. So I'm currently lobbying for both :-) The default is iterable of (key. value) pairs, but the use can specify a key function is they want to do it that way. While a bit of a schizophrenic API, it makes sens (to me), because grouping_mikael isn't useful with a default key function anyway. The other enhancement I suggest is that an (optional) value function be added, as there are use cases where that would be really helpful. -CHB NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Thu, Jul 5, 2018 at 1:23 AM, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
I use this kind of function in several different projects over the years, and I rewrote it many times as needed. I added several options, such as: - key function - value function - "ignore": Skip values with these keys. - "postprocess": Apply a function to each group after completion. - Pass in the container to store in. For example, create an OrderedDict and pass it in. It may already hold items. - Specify the container for each group. - Specify how to add to the container for each group. Then I cut it down to two optional parameters: - key function. If not provided, the iterable is considered to have key-value pairs. - The storage container. Finally, I removed the key function, and only took pairs and an optional container. However, I don't remember why I removed the key function. It may be that I was writing throwaway lambdas, and I decided I might as well just write the transformation into the comprehension. I think a key function is worth having. One thing I didn't do is create a class for this container. I used defaultdict, then used a default dict but converted it to a plain dict, and finally used to a plain dict. Aside: I also wrote a lazy grouping function, which returned a lazy container of generators. Asking for next(grouping[k]) will cause the container to iterate through the original iterable until it had something to add to group k. I have never used it, but it was fun to try and make it. class IterBuckets(dict): def __init__(self, pairs): self.pairs = iter(pairs) def __missing__(self, key): return self.setdefault(key, IterBucket(self.advance)) def advance(self): k, v = next(self.pairs) self[k].append(v) class IterBucket(collections.deque): def __init__(self, more): self.more = more def __iter__(self): return self def __next__(self): while not self: self.more() return self.popleft()

On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee I use this kind of function in several different projects over the
years, and I rewrote it many times as needed.
interesting...
OK -- seems we're all converging on that :-)
- The storage container.
so this means you'r passing in a full set of storage containers? I'm a vit confused by that -- if they might be pre-populated, then they would need to be instance,s an you'd need to have one for every key -- how would you know in advance aht you needed??? I played around with passing in a optional storage object: https://github.com/PythonCHB/grouper/commit/d986816905406ec402724beaed2b88c9... but as we might want a list or a set, or a Counter, or ??? it got pretty ugly, as lists and sets and Counters all have different APIs for adding stuff. So I gave up and figured just saying "it's always a list) made the most sense.
exactly -- but I suspect hat may be because you where writing a comprehension anyway, as you needed to manipulate the values, also -- so if there were a value function, you could use either API.
I think a key function is worth having.
I think there's more or less consensus on that too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

In my mind, I *rarely* (which is more than never) have my data in the form of a sequence of key/value pairs. The version of the API that assumes data starts that way feels like either a niche case, or demands preprocessing before it's ready to pass to grouping() or collections.Grouping(). That said, an identity key is rarely interesting either. So I think have key=None mean "assume we get key/val pairs is harmless to the more common case where we give an explicit key function. The uncommon need for grouping on equality can be handled with 'key=lambda x: x'. On Mon, Jul 9, 2018, 12:22 PM Chris Barker <chris.barker@noaa.gov> wrote:

On Mon, Jul 9, 2018 at 3:38 PM, David Mertz <mertz@gnosis.cx> wrote:
sure, but it makes it easy to use a different approach -- i.e. a comprehension with expressions rather than a key (and maybe value) function. AT least two people have expressed a preference for that. That said, an identity key is rarely interesting either.
is it EVER interesting?? wouldn't it be essentially a Counter, without the counting? :-)
So I think have key=None mean "assume we get key/val pairs is harmless to the more common case where we give an explicit key function.
I'm not sure we'll know what's more common 'till it's loose in the wild -- any problem can be solved either way -- which one will people prefer? Given that there Is a preference for comprehensions over map() -- I think it will see a lot of use. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, Jul 9, 2018 at 12:22 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee I use this kind of function
No, I mean the mapping (outer) container. For example, I can pass in an empty OrderedDict, or a dict that already contained some groups from a previous call to the grouping function. I took out the option for the per-group (inner) containers. I never found it necessary to scrooge (scrooge on?) the memory, when I could postprocessed the lists after grouping. A mapvalues function will make postprocessing more convenient, and lend weight to a dicttools suggestion. # Unfortunate double meaning of 'map' in the function signature: def mapvalues(f, mapping): try: items = mapping.items() except AttributeError: items = mapping return {k: f(v) for k, v in items}
My solution at the time was to add another parameter to specify how to add to the container. In fuller generality, the option for the per-group container may consist of specifying a monad (if I remember monads correctly). You need to at least specify a per-group container constructor and a binary function that adds to it. In the case of `Counter`, the constructor is `int` and the binary function is `int.__add__`, and the Counter constructor effectively runs concurrent `reduce`.

On Mon, Jul 9, 2018 at 5:55 PM, Franklin? Lee <leewangzhong+python@gmail.com
wrote:
Sure -- that's what my prototype does if you pass a Mapping in (or use .update() ) why not? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

I noticed recently that *all* examples for collection.defaultdict ( https://docs.python.org/3.7/library/collections.html#collections.defaultdict) are cases of grouping (for an int, a list and a set) from an iterator with a key, value output. I wondered how common those constructions were, and what are defaultdict used for else. So I took a little dive into a few libs to see it (std lib, pypy, pandas, tensorflow, ..), and I saw essentially : A) basic cases of "grouping" with a simple for loop and a default_dict[key].append(value). I saw many kind of default factory utilized, with list, int, set, dict, and even defaultdict(list). ex : https://frama.link/UtNqvpvb, https://frama.link/o3Hb3-4U, https://frama.link/dw92yJ1q, https://frama.link/1Gqoa7WM, https://frama.link/bWswbHsU, https://frama.link/SZh2q8pS B) cases of grouping, but where the for loop used was alimenting more than one "grouper". pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a, https://frama.link/bZakUR33, https://frama.link/MwJFqh5o, C) classes attributes initialization (grouping is done by repeatably calling a function, so any grouping constructor will be useless here). ex : https://frama.link/GoGWuQwR, https://frama.link/BugcS8wU D) Sometimes you just want to defautdict inside a defauldict inside a dict and just have fun : https://frama.link/asBNLr1g, https://frama.link/8j7gzfA5
A sample code would be : from collections import defaultdict class groupingdict(defaultdict): def group_by_iterator(self, iterator): empty_element = self.default_factory() if hasattr(empty_element, "__add__"): for key, element in iterator: self[key] += element elif hasattr(empty_element, "update"): for key, element in iterator: self[key].update(element) elif hasattr(empty_element, "add"): for key, element in iterator: self[key].add(element) else: raise TypeError('default factory does not support iteration') return self So that for example :
    groupingdict(dict, {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}})
My implementation is garbage, and there should be two methods, one returning the object and one modifying it in place, but I think it gives more leeway than just a function returning a dict.

2018-07-13 7:11 GMT+02:00 Chris Barker via Python-ideas <python-ideas@python.org>:
--
Nicolas Rolin | Data Scientist
+ 33 631992617 - nicolas.rolin@tiime.fr
15 rue Auber, 75009 Paris
www.tiime.fr

Thanks for linking to these. I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately. Throughout, I'm going to use my proposed ``grouped`` builtin to demonstrate possible revisions. Note that I am *not* suggesting a replacement to defaultdict. The proposal is to make a common task easier and more reliable. It does not satisfy all uses of defaultdict. https://github.com/selik/peps/blob/master/pep-9999.rst On Fri, Jul 13, 2018 at 6:17 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
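For reference while reading the revisions below, here is a minimal sketch of a ``grouped`` function with the dict-of-lists behavior discussed in this thread. This is an illustration, not the PEP draft's actual implementation, and the identity default for ``key`` is an assumption:

```python
def grouped(iterable, key=None):
    """Group elements into a dict of lists, keyed by key(element).

    Sketch only: the proposed builtin may differ in signature.
    """
    if key is None:
        key = lambda x: x  # assumed default: group elements by themselves
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups
```

For example, ``grouped(['a', 'bb', 'c'], key=len)`` produces ``{1: ['a', 'c'], 2: ['bb']}``.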
    facesizes = collections.defaultdict(int)
    for face in cubofaces:
        facesizes[len(face)] += 1

This is more specifically a Counter. It might look better as:

    facesizes = collections.Counter(len(face) for face in cubofaces)

https://frama.link/o3Hb3-4U,
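A runnable sketch of that revision, with hypothetical stand-in data for ``cubofaces`` (the original's variable), showing the two versions produce the same counts:

```python
import collections

# hypothetical faces, each a tuple of vertex ids (stand-in data)
cubofaces = [(1, 2, 3), (1, 2, 3, 4), (5, 6, 7, 8), (1, 3, 5)]

# defaultdict version from the original code
facesizes = collections.defaultdict(int)
for face in cubofaces:
    facesizes[len(face)] += 1

# Counter version: same counts, and the intent (counting) is explicit
counted = collections.Counter(len(face) for face in cubofaces)

assert dict(facesizes) == dict(counted)
```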
    accum = defaultdict(list)
    garbageitems = []

    for item in root:
        filename = findfile(opts.fileroot, item.attrib['classname'])
        accum[filename].append(float(item.attrib['time']))
        if filename is None:
            garbageitems.append(item)

This might be more clear if separated into two parts.

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped(root, keyfunc)
    groups = {k: [float(v.attrib['time']) for v in g]
              for k, g in groups.items()}
    garbage = groups.pop(None, [])

Interestingly, this code later makes a distinction between ``garbage`` and ``garbageitems`` even though they appear to be the same. The programmer might not have made that error (?) if using the ``grouped`` function, because the act of grouping becomes single-purpose.
In this case, the grouping is being performed by Pandas, not by the defaultdict.

    expected = defaultdict(dict)
    for n1, gp1 in data.groupby('A'):
        for n2, gp2 in gp1.groupby('B'):
            expected[n1][n2] = op(gp2.loc[:, ['C', 'D']])

However, this is still relevant to our discussion, because the defaultdict had to be converted to a regular dict afterward, presumably to ensure that missing keys cause KeyErrors.

    expected = dict((k, DataFrame(v)) for k, v in compat.iteritems(expected))

If you felt like one-lining this, it might look like

    expected = {n1: DataFrame({n2: op(gp2.loc[:, ['C', 'D']])
                               for n2, gp2 in gp1.groupby('B')})
                for n1, gp1 in data.groupby('A')}

I'm not saying that's more clear, but the fact that it's possible emphasizes that the grouping is performed by Pandas. Grouping cannot be one-lined if restricted to the Python standard library.
    results = defaultdict(dict)
    for i, k1 in enumerate(arg1.columns):
        for j, k2 in enumerate(arg2.columns):
            if j < i and arg2 is arg1:
                # Symmetric case
                results[i][j] = results[j][i]
            else:
                results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
                                                arg2.iloc[:, j]))

The nested loop looks like a cross-product. It might be clearer using itertools, but I wouldn't use ``grouped`` here.

    cross = itertools.product(enumerate(arg1.columns), enumerate(arg2.columns))
    self.mapping = collections.defaultdict(set)
    for op in (op for op in graph.get_operations()):
        if op.name.startswith(common.SKIPPED_PREFIXES):
            continue
        for op_input in op.inputs:
            self.mapping[op_input].add(op)

This is a case of a single element being added to multiple groups, which is your section B, below. The loop and filter could be better. It looks like someone intended to convert if/continue to a comprehension, but stopped partway through the revision.

    ops = (op for op in graph.get_operations()
           if not op.name.startswith(common.SKIPPED_PREFIXES))

I'm not sure how I like it, but here's a revision to use ``grouped``:

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = grouped(inputs, key=itemgetter(0))
    groups = {k: set(v for _, v in g) for k, g in groups.items()}

https://frama.link/SZh2q8pS
    handler_results = collections.defaultdict(list)
    for handler in per_handler_ops.keys():
        for op in per_handler_ops[handler]:
            handler_results[handler].append(op_results[op])
    return handler_results

Here it looks like the results are already grouped by handler, and this loop is constructing a parallel dict of lists for the results instead of the ops themselves.

    return {handler: [op_results[op] for op in group]
            for handler, group in per_handler_ops.items()}

B) cases of grouping, but where the for loop was feeding more than one "grouper" -- pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a,
    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary of head node
        type --> fixer list.  """
        head_nodes = collections.defaultdict(list)
        every = []
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads = _get_head_types(fixer.pattern)
                except _EveryNode:
                    every.append(fixer)
                else:
                    for node_type in heads:
                        head_nodes[node_type].append(fixer)
            else:
                if fixer._accept_type is not None:
                    head_nodes[fixer._accept_type].append(fixer)
                else:
                    every.append(fixer)
        for node_type in chain(pygram.python_grammar.symbol2number.values(),
                               pygram.python_grammar.tokens):
            head_nodes[node_type].extend(every)
        return dict(head_nodes)

Again this ends with a conversion from defaultdict to basic dict: ``return dict(head_nodes)``. This is somewhat common in usage of defaultdict, because after the creation of groups, the later usage wants missing keys to raise KeyError, not insert a new, empty group.

The association of a single element with multiple groups makes this a little awkward again.

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary of head node
        type --> fixer list.  """
        nodetypes = set(chain(pygram.python_grammar.symbol2number.values(),
                              pygram.python_grammar.tokens))
        heads = {}
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads[fixer] = _get_head_types(fixer.pattern)
                except _EveryNode:
                    heads[fixer] = nodetypes
            elif fixer._accept_type is not None:
                heads[fixer] = [fixer._accept_type]
            else:
                heads[fixer] = nodetypes
        pairs = ((head, fixer) for fixer, types in heads.items()
                 for head in types)
        groups = grouped(pairs, key=itemgetter(0))
        return {head: [fixer for _, fixer in g]
                for head, g in groups.items()}

https://frama.link/bZakUR33,
    directories = defaultdict(set)
    cache_directories = defaultdict(set)
    groups = defaultdict(list)
    for source, target, group, disk_id, condition in files:
        target = PureWindowsPath(target)
        groups[group].append((source, target, disk_id, condition))
        if target.suffix.lower() in {".py", ".pyw"}:
            cache_directories[group].add(target.parent)
        for dirname in target.parents:
            parent = make_id(dirname.parent)
            if parent and parent != '.':
                directories[parent].add(dirname.name)

This one has 3 different groupings created simultaneously. We could separate the 3 tasks, but they all rely on a converted ``target`` path, so it'd really end up being 4 loops. At that point, perhaps it's best to do just 1 loop as written.
    paramdicts = {}
    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
                d = {}
                for x in attr:
                    if not hasattr(x, '__iter__'):
                        x = (x,)
                    n = '_'.join(str(v) for v in x).replace(' ', '_')
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
        if '_as_' in name:
            testers[name.split('_as_')[0] + '_as_'].append(name)

This one isn't a multiple-groups issue. We can separate the tasks of paramdicts and testers.

    as_names = (name for name in cls.__dict__ if '_as_' in name)
    testers = grouped(as_names, key=lambda name: name.split('_as_')[0] + '_as_')

The paramdicts isn't a dict of lists, so it's not really appropriate for ``grouped``.
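A self-contained sketch of that testers revision, using a stand-in ``grouped`` helper (the builtin is only proposed) and hypothetical attribute names in the style of the original code:

```python
def grouped(iterable, key):
    # stand-in for the proposed builtin: dict of lists keyed by key(item)
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups

# hypothetical class-dict names (stand-ins for cls.__dict__ keys)
names = ['int_as_float', 'int_as_str', 'list_as_tuple', 'setUp']

as_names = (name for name in names if '_as_' in name)
testers = grouped(as_names, key=lambda name: name.split('_as_')[0] + '_as_')
# testers == {'int_as_': ['int_as_float', 'int_as_str'],
#             'list_as_': ['list_as_tuple']}
```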
Instead of ``defaultdict(lambda: 0)`` it seems better to use ``Counter()``. It's called "invocation_counts" after all.
This seems like one of the ideal uses for defaultdict. It even has bits of code made to handle the possibly accidental insert of new keys by popping them off afterward.

D) Sometimes you just want a defaultdict inside a defaultdict inside a dict and just have fun: https://frama.link/asBNLr1g,
I only have a tenuous grasp of git, so the fact that this file isn't in the master branch is stretching my mind. But again, this doesn't feel like a grouping task to me. It probably could be rewritten as such, but not trivially.
It's hard to tell at a glance if this should truly be a dict of dicts of sets or a dict of sets where the keys are 2-tuples. I'll assume the former is correct. On a side note, it looks like they're doing some silly things to fit into the 80 characters per line restriction, like looping over keys and looking up values inside the loop instead of using ``.items()``. I understand that the character limit is strict, but... https://github.com/tensorflow/tensorflow/blob/b202db076ecdc8a0fc80a49252d4a8...
Let's take a look at these two examples:

    grouped(contacts, key=itemgetter('city'))
    grouped(os.listdir('.'), key=lambda filepath: os.path.splitext(filepath)[1])

How would those be rewritten using your proposal?
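For concreteness, here is how the first example runs with a stand-in ``grouped`` (the builtin is only proposed) and hypothetical contact records:

```python
from operator import itemgetter

def grouped(iterable, key):
    # stand-in for the proposed builtin: dict of lists keyed by key(item)
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups

# hypothetical contact records (stand-in data)
contacts = [
    {'name': 'Ann', 'city': 'Seattle'},
    {'name': 'Bob', 'city': 'Paris'},
    {'name': 'Cid', 'city': 'Seattle'},
]

by_city = grouped(contacts, key=itemgetter('city'))
# by_city['Seattle'] holds Ann's and Cid's records; by_city['Paris'] holds Bob's
```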

On Fri, Jul 13, 2018 at 12:38 PM, Michael Selik <mike@selik.org> wrote:
Thanks for linking to these.
yup -- real use cases are really helpful. Though the other paradigm for grouping is use of setdefault() rather than defaultdict. So it would be nice to look for those, too.
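For comparison, a minimal example of the ``setdefault`` paradigm mentioned here, which builds the same dict of lists without defaultdict (the word data is a made-up stand-in):

```python
words = ['apple', 'ant', 'bear', 'bee']  # stand-in data

# setdefault inserts the empty list only when the key is missing
groups = {}
for word in words:
    groups.setdefault(word[0], []).append(word)

assert groups == {'a': ['apple', 'ant'], 'b': ['bear', 'bee']}
```

Unlike defaultdict, the result is a plain dict, so there's no conversion step needed to make missing keys raise KeyError afterward.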
I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately.
agreed -- and it shouldn't. I'd like to see how some of these pan out with my proposed API: either a Grouping class, or at least (key, value) iterables and/or a value function. I don't have time now to do them all, but for the moment:

I noticed recently that *all* examples for collections.defaultdict (
and yet others on this thread think a (key, value) input would be rare -- I guess it depends on whether you are thinking dict-like already....
so this one is a prime case for a value function -- I think post-processing the groups is a pretty common case -- why make people post-process it?

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    def valuefunc(item):
        return float(item.attrib['time'])

    groups = grouped(root, keyfunc, valuefunc)
    garbage = groups.pop(None, [])

And the post-processing is then mixing comprehension style with key-function style (what to call that -- "functional" style?), so why not use a (key, value) iterable:

    groups = grouped((findfile(opts.fileroot, item.attrib['classname']),
                      item.attrib['time'])
                     for item in root)

OK -- that's packing a bit too much into a line, so how about:

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped((keyfunc(item), item.attrib['time']) for item in root)
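A sketch of the value-function variant being suggested here; the ``valuefunc`` parameter is part of this proposal, not an existing API, and the record data is a made-up stand-in:

```python
def grouped(iterable, keyfunc, valuefunc=None):
    # hypothetical signature: the optional valuefunc transforms each
    # element before it is appended to its group
    groups = {}
    for item in iterable:
        value = item if valuefunc is None else valuefunc(item)
        groups.setdefault(keyfunc(item), []).append(value)
    return groups

# stand-in (label, numeric-string) records
records = [('a', '1.5'), ('a', '2.5'), ('b', '3.0')]
groups = grouped(records,
                 keyfunc=lambda r: r[0],
                 valuefunc=lambda r: float(r[1]))
assert groups == {'a': [1.5, 2.5], 'b': [3.0]}
```

The value function saves the dict-comprehension post-processing pass that the key-function-only API requires.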
yeah, this is weird -- but it does make a case for having a class with the option of using a set to collect (which I have in an older commit of my prototype):

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = Grouping(inputs, key=itemgetter(0), collection=set)

otherwise, you could have a method to do it:

    groups.map_on_groups(set)

(not sure I like that method name, but I hope you get the idea)

OK, back to work.

-CHB
participants (5)

- Chris Barker
- David Mertz
- Franklin? Lee
- Michael Selik
- Nicolas Rolin