Fwd: grouping / dict of lists

Steven: You've misunderstood part of the discussion. There are two different signatures being discussed/proposed for a grouping() function. The one you show we might call grouping_michael(). The alternate API we might call grouping_chris(). These two calls will produce the same result (the first output you show):

    grouping_michael(words, keyfunc=len)
    grouping_chris((len(word), word) for word in words)

I happen to prefer grouping_michael(), but recognize they each make slightly different things obvious. Absolutely no one wants the behavior in your second output. On Tue, Jul 3, 2018, 9:32 PM Steven D'Aprano <steve@pearwood.info> wrote:
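[For readers following along: a minimal sketch of the two proposed signatures. The names are just thread shorthand, not real APIs, and these implementations are illustrative only.]

```python
def grouping_michael(iterable, keyfunc):
    # Group elements by the result of a key function.
    groups = {}
    for item in iterable:
        groups.setdefault(keyfunc(item), []).append(item)
    return groups

def grouping_chris(pairs):
    # Group an iterable of (key, value) pairs by key.
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

words = ['a', 'bb', 'cc', 'ddd']
assert grouping_michael(words, keyfunc=len) == \
       grouping_chris((len(w), w) for w in words)
# both give {1: ['a'], 2: ['bb', 'cc'], 3: ['ddd']}
```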

There are some cases when that's the correct behavior. It mimics pandas.DataFrame.groupby. For example, what if you have a sequence of (key, v1, v2) triples? Grouping by key while keeping the triples intact is sometimes the right choice. On Wed, Jul 4, 2018, 6:39 AM David Mertz <mertz@gnosis.cx> wrote:
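[A quick sketch of the triples case described above, with invented data: group by the key field while keeping each row whole.]

```python
rows = [('a', 1, 2), ('b', 3, 4), ('a', 5, 6)]

groups = {}
for row in rows:
    # Group on the first field, but store the entire triple.
    groups.setdefault(row[0], []).append(row)

assert groups == {'a': [('a', 1, 2), ('a', 5, 6)],
                  'b': [('b', 3, 4)]}
```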

On Wed, Jul 4, 2018 at 6:34 AM, David Mertz <mertz@gnosis.cx> wrote:
I started thinking grouping_chris was the obvious and natural thing to do, but this discussion has made it clear that grouping_michael is more natural for some kinds of data. And in some cases, it really comes down to taste. After all, who's to say which of these is "better":

    map(func, iterable)

or

    (expression for item in iterable)

Given that map existed in Python when comprehensions were added, I tend to see the latter as more "Pythonic", but that's just me. So I'm currently lobbying for both :-) The default is an iterable of (key, value) pairs, but the user can specify a key function if they want to do it that way. While a bit of a schizophrenic API, it makes sense (to me), because grouping_michael isn't useful with a default key function anyway. The other enhancement I suggest is that an (optional) value function be added, as there are use cases where that would be really helpful. -CHB

NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
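[The hybrid API being lobbied for here might look roughly like this. This is a sketch, not an existing builtin; the name `grouped` is borrowed from later in the thread.]

```python
def grouped(iterable, key=None, value=None):
    # key=None: treat the iterable as (key, value) pairs.
    # key given: apply it to each item; value (optional) transforms the item.
    groups = {}
    if key is None:
        for k, v in iterable:
            groups.setdefault(k, []).append(v)
    else:
        for item in iterable:
            v = value(item) if value is not None else item
            groups.setdefault(key(item), []).append(v)
    return groups

words = ['apple', 'fig', 'kiwi']
# Key-function style and pairs style give the same result:
assert grouped(words, key=len) == grouped((len(w), w) for w in words)
# An optional value function transforms items as they are grouped:
assert grouped(words, key=len, value=str.upper) == \
       {5: ['APPLE'], 3: ['FIG'], 4: ['KIWI']}
```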

On Thu, Jul 5, 2018 at 1:23 AM, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
I use this kind of function in several different projects over the years, and I rewrote it many times as needed. I added several options, such as:
- key function
- value function
- "ignore": Skip values with these keys.
- "postprocess": Apply a function to each group after completion.
- Pass in the container to store in. For example, create an OrderedDict and pass it in. It may already hold items.
- Specify the container for each group.
- Specify how to add to the container for each group.

Then I cut it down to two optional parameters:
- key function. If not provided, the iterable is considered to have key-value pairs.
- The storage container.

Finally, I removed the key function, and only took pairs and an optional container. However, I don't remember why I removed the key function. It may be that I was writing throwaway lambdas, and I decided I might as well just write the transformation into the comprehension. I think a key function is worth having.

One thing I didn't do is create a class for this container. I used defaultdict, then used a defaultdict but converted it to a plain dict, and finally used a plain dict.

Aside: I also wrote a lazy grouping function, which returned a lazy container of generators. Asking for next(grouping[k]) will cause the container to iterate through the original iterable until it has something to add to group k. I have never used it, but it was fun to try and make it.

    class IterBuckets(dict):
        def __init__(self, pairs):
            self.pairs = iter(pairs)

        def __missing__(self, key):
            return self.setdefault(key, IterBucket(self.advance))

        def advance(self):
            k, v = next(self.pairs)
            self[k].append(v)

    class IterBucket(collections.deque):
        def __init__(self, more):
            self.more = more

        def __iter__(self):
            return self

        def __next__(self):
            while not self:
                self.more()
            return self.popleft()

On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee wrote:

I use this kind of function in several different projects over the years, and I rewrote it many times as needed.
interesting...
OK -- seems we're all converging on that :-)
- The storage container.
so this means you're passing in a full set of storage containers? I'm a bit confused by that -- if they might be pre-populated, then they would need to be instances, and you'd need to have one for every key -- how would you know in advance what you needed??? I played around with passing in an optional storage object: https://github.com/PythonCHB/grouper/commit/d986816905406ec402724beaed2b88c9... but as we might want a list or a set, or a Counter, or ??? it got pretty ugly, as lists and sets and Counters all have different APIs for adding stuff. So I gave up and figured just saying "it's always a list" made the most sense.
exactly -- but I suspect that may be because you were writing a comprehension anyway, as you needed to manipulate the values, too -- so if there were a value function, you could use either API.
I think a key function is worth having.
I think there's more or less consensus on that too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

In my mind, I *rarely* (which is more than never) have my data in the form of a sequence of key/value pairs. The version of the API that assumes data starts that way feels like either a niche case, or demands preprocessing before it's ready to pass to grouping() or collections.Grouping(). That said, an identity key is rarely interesting either. So I think having key=None mean "assume we get key/val pairs" is harmless to the more common case where we give an explicit key function. The uncommon need for grouping on equality can be handled with 'key=lambda x: x'. On Mon, Jul 9, 2018, 12:22 PM Chris Barker <chris.barker@noaa.gov> wrote:

On Mon, Jul 9, 2018 at 3:38 PM, David Mertz <mertz@gnosis.cx> wrote:
sure, but it makes it easy to use a different approach -- i.e. a comprehension with expressions rather than a key (and maybe value) function. At least two people have expressed a preference for that.

That said, an identity key is rarely interesting either.
is it EVER interesting?? wouldn't it be essentially a Counter, without the counting? :-)
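[For illustration of the quip above: identity-key grouping really is a Counter that keeps the items instead of counting them. Data invented.]

```python
from collections import Counter

data = ['a', 'b', 'a', 'c', 'a']

# Identity-key grouping: each item is its own key.
groups = {}
for x in data:
    groups.setdefault(x, []).append(x)

# Same keys as a Counter, and each group's length is exactly the count:
assert groups == {'a': ['a', 'a', 'a'], 'b': ['b'], 'c': ['c']}
assert {k: len(v) for k, v in groups.items()} == dict(Counter(data))
```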
So I think having key=None mean "assume we get key/val pairs" is harmless to the more common case where we give an explicit key function.
I'm not sure we'll know what's more common till it's loose in the wild -- any problem can be solved either way -- which one will people prefer? Given that there is a preference for comprehensions over map() -- I think it will see a lot of use. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, Jul 9, 2018 at 12:22 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee wrote:

I use this kind of function in several different projects over the years, and I rewrote it many times as needed.
No, I mean the mapping (outer) container. For example, I can pass in an empty OrderedDict, or a dict that already contains some groups from a previous call to the grouping function.

I took out the option for the per-group (inner) containers. I never found it necessary to scrimp on the memory, when I could postprocess the lists after grouping. A mapvalues function would make postprocessing more convenient, and lend weight to a dicttools suggestion.

    # Unfortunate double meaning of 'map' in the function signature:
    def mapvalues(f, mapping):
        try:
            items = mapping.items()
        except AttributeError:
            items = mapping
        return {k: f(v) for k, v in items}
My solution at the time was to add another parameter to specify how to add to the container. In fuller generality, the option for the per-group container may consist of specifying a monad (if I remember monads correctly). You need to at least specify a per-group container constructor and a binary function that adds to it. In the case of `Counter`, the constructor is `int` and the binary function is `int.__add__`, and the Counter constructor effectively runs concurrent `reduce`.
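[A sketch of that generalization, with hypothetical names: parameterize the grouping by a per-group constructor and a binary "add one element" function, so that lists, sets, and Counter-style sums all fall out of the same loop.]

```python
import operator

def grouped_into(pairs, make, add):
    # make(): construct an empty per-group accumulator.
    # add(group, value): return the accumulator with value added.
    groups = {}
    for k, v in pairs:
        if k not in groups:
            groups[k] = make()
        groups[k] = add(groups[k], v)
    return groups

def append(lst, v):
    lst.append(v)
    return lst

pairs = [('a', 1), ('b', 2), ('a', 3)]

# list + append gives ordinary grouping:
assert grouped_into(pairs, list, append) == {'a': [1, 3], 'b': [2]}
# int + addition is exactly the Counter-style per-key reduce:
assert grouped_into(pairs, int, operator.add) == {'a': 4, 'b': 2}
```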

On Mon, Jul 9, 2018 at 5:55 PM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
Sure -- that's what my prototype does if you pass a Mapping in (or use .update() ) why not? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

I noticed recently that *all* examples for collections.defaultdict ( https://docs.python.org/3.7/library/collections.html#collections.defaultdict ) are cases of grouping (for an int, a list and a set) from an iterator with a key, value output. I wondered how common those constructions were, and what else defaultdict is used for. So I took a little dive into a few libs to see (stdlib, pypy, pandas, tensorflow, ...), and I saw essentially:

A) basic cases of "grouping" with a simple for loop and a default_dict[key].append(value). I saw many kinds of default factory used, with list, int, set, dict, and even defaultdict(list). ex: https://frama.link/UtNqvpvb, https://frama.link/o3Hb3-4U, https://frama.link/dw92yJ1q, https://frama.link/1Gqoa7WM, https://frama.link/bWswbHsU, https://frama.link/SZh2q8pS

B) cases of grouping, but where the for loop used was feeding more than one "grouper". Pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a, https://frama.link/bZakUR33, https://frama.link/MwJFqh5o

C) class attribute initialization (grouping is done by repeatedly calling a function, so any grouping constructor will be useless here). ex: https://frama.link/GoGWuQwR, https://frama.link/BugcS8wU

D) Sometimes you just want a defaultdict inside a defaultdict inside a dict and just have fun: https://frama.link/asBNLr1g, https://frama.link/8j7gzfA5
Sample code would be:

    from collections import defaultdict

    class groupingdict(defaultdict):
        def group_by_iterator(self, iterator):
            empty_element = self.default_factory()
            if hasattr(empty_element, "__add__"):
                for key, element in iterator:
                    self[key] += element
            elif hasattr(empty_element, "update"):
                for key, element in iterator:
                    self[key].update(element)
            elif hasattr(empty_element, "add"):
                for key, element in iterator:
                    self[key].add(element)
            else:
                raise TypeError('default factory does not support iteration')
            return self

So that for example:
groupingdict(dict, {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}})
My implementation is garbage and there should be two methods, one returning the object and one modifying it, but I think it gives more leeway than just a function returning a dict.

2018-07-13 7:11 GMT+02:00 Chris Barker via Python-ideas <python-ideas@python.org>:
-- -- *Nicolas Rolin* | Data Scientist + 33 631992617 - nicolas.rolin@tiime.fr <prenom.nom@tiime.fr> *15 rue Auber, **75009 Paris* *www.tiime.fr <http://www.tiime.fr>*

Thanks for linking to these. I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately. Throughout, I'm going to use my proposed ``grouped`` builtin to demonstrate possible revisions. Note that I am *not* suggesting a replacement to defaultdict. The proposal is to make a common task easier and more reliable. It does not satisfy all uses of defaultdict. https://github.com/selik/peps/blob/master/pep-9999.rst On Fri, Jul 13, 2018 at 6:17 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
    facesizes = collections.defaultdict(int)
    for face in cubofaces:
        facesizes[len(face)] += 1

This is more specifically a Counter. It might look better as:

    facesizes = collections.Counter(len(face) for face in cubofaces)

https://frama.link/o3Hb3-4U,
    accum = defaultdict(list)
    garbageitems = []

    for item in root:
        filename = findfile(opts.fileroot, item.attrib['classname'])
        accum[filename].append(float(item.attrib['time']))
        if filename is None:
            garbageitems.append(item)

This might be more clear if separated into two parts.

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped(root, keyfunc)
    groups = {k: [float(v.attrib['time']) for v in g]
              for k, g in groups.items()}
    garbage = groups.pop(None, [])

Interestingly, this code later makes a distinction between ``garbage`` and ``garbageitems`` even though they appear to be the same. The programmer might not have made that error (?) if using the ``grouped`` function, because the act of grouping becomes single-purpose.
In this case, the grouping is being performed by Pandas, not by the defaultdict.

    expected = defaultdict(dict)
    for n1, gp1 in data.groupby('A'):
        for n2, gp2 in gp1.groupby('B'):
            expected[n1][n2] = op(gp2.loc[:, ['C', 'D']])

However, this is still relevant to our discussion, because the defaultdict had to be converted to a regular dict afterward. Presumably to ensure that missing keys cause KeyErrors.

    expected = dict((k, DataFrame(v)) for k, v in compat.iteritems(expected))

If you felt like one-lining this, it might look like

    expected = {n1: DataFrame({n2: op(gp2.loc[:, ['C', 'D']])
                               for n2, gp2 in gp1.groupby('B')})
                for n1, gp1 in data.groupby('A')}

I'm not saying that's more clear, but the fact that it's possible emphasizes that the grouping is performed by Pandas. Grouping cannot be one-lined if restricted to the Python standard library.
    results = defaultdict(dict)
    for i, k1 in enumerate(arg1.columns):
        for j, k2 in enumerate(arg2.columns):
            if j < i and arg2 is arg1:
                # Symmetric case
                results[i][j] = results[j][i]
            else:
                results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
                                                arg2.iloc[:, j]))

The nested loop looks like a cross-product. It might be clearer using itertools, but I wouldn't use ``grouped`` here.

    cross = itertools.product(enumerate(arg1.columns),
                              enumerate(arg2.columns))
    self.mapping = collections.defaultdict(set)
    for op in (op for op in graph.get_operations()):
        if op.name.startswith(common.SKIPPED_PREFIXES):
            continue
        for op_input in op.inputs:
            self.mapping[op_input].add(op)

This is a case of a single element being added to multiple groups, which is your section B, below. The loop and filter could be better. It looks like someone intended to convert if/continue to a comprehension, but stopped partway through the revision.

    ops = (op for op in graph.get_operations()
           if not op.name.startswith(common.SKIPPED_PREFIXES))

I'm not sure how I like it, but here's a revision to use ``grouped``:

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = grouped(inputs, key=itemgetter(0))
    groups = {k: set(v for _, v in g) for k, g in groups.items()}

https://frama.link/SZh2q8pS
    handler_results = collections.defaultdict(list)
    for handler in per_handler_ops.keys():
        for op in per_handler_ops[handler]:
            handler_results[handler].append(op_results[op])
    return handler_results

Here it looks like the results are already grouped by handler and this loop is constructing a parallel dict of lists for the results instead of the ops themselves.

    return {handler: [op_results[op] for op in group]
            for handler, group in per_handler_ops.items()}

B) cases of grouping, but where the for loop used was feeding more than one "grouper". Pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a,
    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        head_nodes = collections.defaultdict(list)
        every = []
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads = _get_head_types(fixer.pattern)
                except _EveryNode:
                    every.append(fixer)
                else:
                    for node_type in heads:
                        head_nodes[node_type].append(fixer)
            else:
                if fixer._accept_type is not None:
                    head_nodes[fixer._accept_type].append(fixer)
                else:
                    every.append(fixer)
        for node_type in chain(pygram.python_grammar.symbol2number.values(),
                               pygram.python_grammar.tokens):
            head_nodes[node_type].extend(every)
        return dict(head_nodes)

Again this ends with a conversion from defaultdict to basic dict: ``return dict(head_nodes)``. This is somewhat common in usage of defaultdict, because after the creation of groups, the later usage wants missing keys to raise KeyError, not insert a new, empty group. The association of a single element to multiple groups makes this a little awkward again.

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        nodetypes = set(chain(pygram.python_grammar.symbol2number.values(),
                              pygram.python_grammar.tokens))
        heads = {}
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads[fixer] = _get_head_types(fixer.pattern)
                except _EveryNode:
                    heads[fixer] = nodetypes
            elif fixer._accept_type is not None:
                heads[fixer] = [fixer._accept_type]
            else:
                heads[fixer] = nodetypes

        pairs = ((head, fixer)
                 for fixer, types in heads.items()
                 for head in types)
        groups = grouped(pairs, key=itemgetter(0))
        return {head: [fixer for _, fixer in g]
                for head, g in groups.items()}

https://frama.link/bZakUR33,
    directories = defaultdict(set)
    cache_directories = defaultdict(set)
    groups = defaultdict(list)
    for source, target, group, disk_id, condition in files:
        target = PureWindowsPath(target)
        groups[group].append((source, target, disk_id, condition))
        if target.suffix.lower() in {".py", ".pyw"}:
            cache_directories[group].add(target.parent)
        for dirname in target.parents:
            parent = make_id(dirname.parent)
            if parent and parent != '.':
                directories[parent].add(dirname.name)

This one has 3 different groupings created simultaneously. We could separate the 3 tasks, but they all rely on a converted ``target`` path, so it'd really end up being 4 loops. At that point, perhaps it's best to do just 1 loop as written.
    paramdicts = {}
    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
                d = {}
                for x in attr:
                    if not hasattr(x, '__iter__'):
                        x = (x,)
                    n = '_'.join(str(v) for v in x).replace(' ', '_')
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
        if '_as_' in name:
            testers[name.split('_as_')[0] + '_as_'].append(name)

This one isn't a multiple-groups issue. We can separate the tasks of paramdicts and testers.

    as_names = (name for name in cls.__dict__ if '_as_' in name)
    testers = grouped(as_names,
                      key=lambda name: name.split('_as_')[0] + '_as_')

The paramdicts isn't a dict of lists, so it's not really appropriate for ``grouped``.
Instead of ``defaultdict(lambda: 0)`` it seems better to use ``Counter()``. It's called "invocation_counts" after all.
This seems like one of the ideal uses for defaultdict. It even has bits of code made to handle the possibly accidental insert of new keys by popping them off afterwards.

D) Sometimes you just want a defaultdict inside a defaultdict inside a dict and just have fun: https://frama.link/asBNLr1g,
I only have a tenuous grasp of git, so the fact that this file isn't in the master branch is stretching my mind. But again, this doesn't feel like a grouping task to me. It probably could be rewritten as such, but not trivially.
It's hard to tell at a glance if this should truly be a dict of dicts of sets or a dict of sets where the keys are 2-tuples. I'll assume the former is correct. On a side note, it looks like they're doing some silly things to fit into the 80 characters per line restriction, like looping over keys and looking up values inside the loop instead of using ``.items()``. I understand that the character limit is strict, but... https://github.com/tensorflow/tensorflow/blob/b202db076ecdc8a0fc80a49252d4a8...
Let's take a look at these two examples:

    grouped(contacts, key=itemgetter('city'))
    grouped(os.listdir('.'), key=lambda filepath: os.path.splitext(filepath)[1])

How would those be rewritten using your proposal?
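[For comparison, the pairs-based style discussed earlier in the thread would express those two groupings roughly as follows. A plain helper stands in for the proposed function, and the `contacts` and file data are invented.]

```python
import os

def grouped_pairs(pairs):
    # Group an iterable of (key, value) pairs into a dict of lists.
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return groups

contacts = [{'name': 'Ann', 'city': 'Oslo'},
            {'name': 'Bo', 'city': 'Oslo'},
            {'name': 'Cy', 'city': 'Lima'}]
by_city = grouped_pairs((c['city'], c) for c in contacts)
assert sorted(by_city) == ['Lima', 'Oslo']

files = ['a.py', 'b.txt', 'c.py']  # stand-in for os.listdir('.')
by_ext = grouped_pairs((os.path.splitext(f)[1], f) for f in files)
assert by_ext == {'.py': ['a.py', 'c.py'], '.txt': ['b.txt']}
```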

On Fri, Jul 13, 2018 at 12:38 PM, Michael Selik <mike@selik.org> wrote:
Thanks for linking to these.
yup -- real use cases are really helpful. Though the other paradigm for grouping is use of setdefault() rather than defaultdict. So it would be nice to look for those, too.
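[For anyone grepping for that other paradigm: the setdefault flavor of grouping typically looks like this, side by side with the defaultdict version. Data invented for illustration.]

```python
from collections import defaultdict

words = ['apple', 'fig', 'kiwi', 'pear']

# defaultdict style:
dd = defaultdict(list)
for w in words:
    dd[len(w)].append(w)

# setdefault style -- same result, no import, plain dict out:
groups = {}
for w in words:
    groups.setdefault(len(w), []).append(w)

assert groups == dict(dd) == {5: ['apple'], 3: ['fig'], 4: ['kiwi', 'pear']}
```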
I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately.
agreed -- and it shouldn't. I'd like to see how some of these pan out with my proposed API: either a Grouping class, or at least (key, value) iterables and/or a value function. I don't have time now to do them all, but for the moment:

I noticed recently that *all* examples for collection.defaultdict (
and yet others on this thread think a (key, value) input would be rare -- I guess it depends on whether you are thinking dict-like already....
so this one is a prime case for a value function -- I think post-processing the groups is a pretty common case -- why make people post-process it?

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    def valuefunc(item):
        return float(item.attrib['time'])

    groups = grouped(root, keyfunc, valuefunc)
    garbage = groups.pop(None, [])

And the post-processing is then mixing comprehension style with key function style (what to call that -- "functional" style?), so why not use a (key, value) iterable:

    groups = grouped((findfile(opts.fileroot, item.attrib['classname']),
                      item.attrib['time'])
                     for item in root)

OK -- that's packing a bit too much into a line, so how about:

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped((keyfunc(item), item.attrib['time']) for item in root)
yeah, this is weird -- but it does make a case for having a class with the option of using a set to collect (which I have in an older commit of my prototype):

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = Grouping(inputs, key=itemgetter(0), collection=set)

otherwise, you could have a method to do it:

    groups.map_on_groups(set)

(not sure I like that method name, but I hope you get the idea) OK, back to work. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

There are some cases when that's the correct behavior. It mimics pandas.DataFrame.groupby. For example, what if you have a sequence of (key, v1, v2) triples? Group by key, then keep the triples intact is the right choice sometimes. On Wed, Jul 4, 2018, 6:39 AM David Mertz <mertz@gnosis.cx> wrote:

On Wed, Jul 4, 2018 at 6:34 AM, David Mertz <mertz@gnosis.cx> wrote:
I starting thinking grouping_chris was the obvious and natural thing to do, but his discussion has made it clear that grouping_michael is more natural for some kinds of data. and in some cases, it really comes down to taste, after all, who's to say which of these is "better" map(func, iterable) or (expression for item in iterable) given that map existed in Python when comprehensions were added, I tend to see the latter as more "Pythonic" but that's just me. So I'm currently lobbying for both :-) The default is iterable of (key. value) pairs, but the use can specify a key function is they want to do it that way. While a bit of a schizophrenic API, it makes sens (to me), because grouping_mikael isn't useful with a default key function anyway. The other enhancement I suggest is that an (optional) value function be added, as there are use cases where that would be really helpful. -CHB NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Thu, Jul 5, 2018 at 1:23 AM, Chris Barker via Python-ideas <python-ideas@python.org> wrote:
I use this kind of function in several different projects over the years, and I rewrote it many times as needed. I added several options, such as: - key function - value function - "ignore": Skip values with these keys. - "postprocess": Apply a function to each group after completion. - Pass in the container to store in. For example, create an OrderedDict and pass it in. It may already hold items. - Specify the container for each group. - Specify how to add to the container for each group. Then I cut it down to two optional parameters: - key function. If not provided, the iterable is considered to have key-value pairs. - The storage container. Finally, I removed the key function, and only took pairs and an optional container. However, I don't remember why I removed the key function. It may be that I was writing throwaway lambdas, and I decided I might as well just write the transformation into the comprehension. I think a key function is worth having. One thing I didn't do is create a class for this container. I used defaultdict, then used a default dict but converted it to a plain dict, and finally used to a plain dict. Aside: I also wrote a lazy grouping function, which returned a lazy container of generators. Asking for next(grouping[k]) will cause the container to iterate through the original iterable until it had something to add to group k. I have never used it, but it was fun to try and make it. class IterBuckets(dict): def __init__(self, pairs): self.pairs = iter(pairs) def __missing__(self, key): return self.setdefault(key, IterBucket(self.advance)) def advance(self): k, v = next(self.pairs) self[k].append(v) class IterBucket(collections.deque): def __init__(self, more): self.more = more def __iter__(self): return self def __next__(self): while not self: self.more() return self.popleft()

On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee I use this kind of function in several different projects over the
years, and I rewrote it many times as needed.
interesting...
OK -- seems we're all converging on that :-)
- The storage container.
so this means you'r passing in a full set of storage containers? I'm a vit confused by that -- if they might be pre-populated, then they would need to be instance,s an you'd need to have one for every key -- how would you know in advance aht you needed??? I played around with passing in a optional storage object: https://github.com/PythonCHB/grouper/commit/d986816905406ec402724beaed2b88c9... but as we might want a list or a set, or a Counter, or ??? it got pretty ugly, as lists and sets and Counters all have different APIs for adding stuff. So I gave up and figured just saying "it's always a list) made the most sense.
exactly -- but I suspect hat may be because you where writing a comprehension anyway, as you needed to manipulate the values, also -- so if there were a value function, you could use either API.
I think a key function is worth having.
I think there's more or less consensus on that too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

In my mind, I *rarely* (which is more than never) have my data in the form of a sequence of key/value pairs. The version of the API that assumes data starts that way feels like either a niche case, or demands preprocessing before it's ready to pass to grouping() or collections.Grouping(). That said, an identity key is rarely interesting either. So I think have key=None mean "assume we get key/val pairs is harmless to the more common case where we give an explicit key function. The uncommon need for grouping on equality can be handled with 'key=lambda x: x'. On Mon, Jul 9, 2018, 12:22 PM Chris Barker <chris.barker@noaa.gov> wrote:

On Mon, Jul 9, 2018 at 3:38 PM, David Mertz <mertz@gnosis.cx> wrote:
sure, but it makes it easy to use a different approach -- i.e. a comprehension with expressions rather than a key (and maybe value) function. AT least two people have expressed a preference for that. That said, an identity key is rarely interesting either.
is it EVER interesting?? wouldn't it be essentially a Counter, without the counting? :-)
So I think have key=None mean "assume we get key/val pairs is harmless to the more common case where we give an explicit key function.
I'm not sure we'll know what's more common 'till it's loose in the wild -- any problem can be solved either way -- which one will people prefer? Given that there Is a preference for comprehensions over map() -- I think it will see a lot of use. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, Jul 9, 2018 at 12:22 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jul 6, 2018 at 12:26 PM, Franklin? Lee I use this kind of function
No, I mean the mapping (outer) container. For example, I can pass in an empty OrderedDict, or a dict that already contained some groups from a previous call to the grouping function. I took out the option for the per-group (inner) containers. I never found it necessary to scrooge (scrooge on?) the memory, when I could postprocessed the lists after grouping. A mapvalues function will make postprocessing more convenient, and lend weight to a dicttools suggestion. # Unfortunate double meaning of 'map' in the function signature: def mapvalues(f, mapping): try: items = mapping.items() except AttributeError: items = mapping return {k: f(v) for k, v in items}
My solution at the time was to add another parameter to specify how to add to the container. In fuller generality, the option for the per-group container may consist of specifying a monad (if I remember monads correctly). You need to at least specify a per-group container constructor and a binary function that adds to it. In the case of `Counter`, the constructor is `int` and the binary function is `int.__add__`, and the Counter constructor effectively runs concurrent `reduce`.

On Mon, Jul 9, 2018 at 5:55 PM, Franklin? Lee <leewangzhong+python@gmail.com
wrote:
Sure -- that's what my prototype does if you pass a Mapping in (or use .update() ) why not? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

I noticed recently that *all* examples for collection.defaultdict ( https://docs.python.org/3.7/library/collections.html#collections.defaultdict) are cases of grouping (for an int, a list and a set) from an iterator with a key, value output. I wondered how common those constructions were, and what are defaultdict used for else. So I took a little dive into a few libs to see it (std lib, pypy, pandas, tensorflow, ..), and I saw essentially : A) basic cases of "grouping" with a simple for loop and a default_dict[key].append(value). I saw many kind of default factory utilized, with list, int, set, dict, and even defaultdict(list). ex : https://frama.link/UtNqvpvb, https://frama.link/o3Hb3-4U, https://frama.link/dw92yJ1q, https://frama.link/1Gqoa7WM, https://frama.link/bWswbHsU, https://frama.link/SZh2q8pS B) cases of grouping, but where the for loop used was alimenting more than one "grouper". pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a, https://frama.link/bZakUR33, https://frama.link/MwJFqh5o, C) classes attributes initialization (grouping is done by repeatably calling a function, so any grouping constructor will be useless here). ex : https://frama.link/GoGWuQwR, https://frama.link/BugcS8wU D) Sometimes you just want to defautdict inside a defauldict inside a dict and just have fun : https://frama.link/asBNLr1g, https://frama.link/8j7gzfA5
A sample code would be : from collections import defaultdict class groupingdict(defaultdict): def group_by_iterator(self, iterator): empty_element = self.default_factory() if hasattr(empty_element, "__add__"): for key, element in iterator: self[key] += element elif hasattr(empty_element, "update"): for key, element in iterator: self[key].update(element) elif hasattr(empty_element, "add"): for key, element in iterator: self[key].add(element) else: raise TypeError('default factory does not support iteration') return self So that for example :
    groupingdict(dict, {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}})
My implementation is garbage, and there should be two methods, one returning the object and one modifying it in place, but I think it gives more leeway than just a function returning a dict.

2018-07-13 7:11 GMT+02:00 Chris Barker via Python-ideas <python-ideas@python.org>:
--
Nicolas Rolin | Data Scientist
+ 33 631992617 - nicolas.rolin@tiime.fr
15 rue Auber, 75009 Paris
www.tiime.fr

Thanks for linking to these. I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately. Throughout, I'm going to use my proposed ``grouped`` builtin to demonstrate possible revisions. Note that I am *not* suggesting a replacement to defaultdict. The proposal is to make a common task easier and more reliable. It does not satisfy all uses of defaultdict. https://github.com/selik/peps/blob/master/pep-9999.rst On Fri, Jul 13, 2018 at 6:17 AM Nicolas Rolin <nicolas.rolin@tiime.fr> wrote:
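For reference while reading the revisions below, here is a minimal sketch of a ``grouped`` function with the dict-of-lists behavior discussed in this thread. This is an illustration, not the PEP draft's actual implementation, and the identity default for ``key`` is an assumption:

```python
def grouped(iterable, key=None):
    """Group elements into a dict of lists, keyed by key(element).

    Sketch only: the proposed builtin may differ in signature.
    """
    if key is None:
        key = lambda x: x  # assumed default: group elements by themselves
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups
```

For example, ``grouped(['a', 'bb', 'c'], key=len)`` produces ``{1: ['a', 'c'], 2: ['bb']}``.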
    facesizes = collections.defaultdict(int)
    for face in cubofaces:
        facesizes[len(face)] += 1

This is more specifically a Counter. It might look better as:

    facesizes = collections.Counter(len(face) for face in cubofaces)

https://frama.link/o3Hb3-4U,
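A runnable sketch of that revision, with hypothetical stand-in data for ``cubofaces`` (the original's variable), showing the two versions produce the same counts:

```python
import collections

# hypothetical faces, each a tuple of vertex ids (stand-in data)
cubofaces = [(1, 2, 3), (1, 2, 3, 4), (5, 6, 7, 8), (1, 3, 5)]

# defaultdict version from the original code
facesizes = collections.defaultdict(int)
for face in cubofaces:
    facesizes[len(face)] += 1

# Counter version: same counts, and the intent (counting) is explicit
counted = collections.Counter(len(face) for face in cubofaces)

assert dict(facesizes) == dict(counted)
```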
    accum = defaultdict(list)
    garbageitems = []

    for item in root:
        filename = findfile(opts.fileroot, item.attrib['classname'])
        accum[filename].append(float(item.attrib['time']))
        if filename is None:
            garbageitems.append(item)

This might be more clear if separated into two parts.

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped(root, keyfunc)
    groups = {k: [float(v.attrib['time']) for v in g]
              for k, g in groups.items()}
    garbage = groups.pop(None, [])

Interestingly, this code later makes a distinction between ``garbage`` and ``garbageitems`` even though they appear to be the same. The programmer might not have made that error (?) if using the ``grouped`` function, because the act of grouping becomes single-purpose.
In this case, the grouping is being performed by Pandas, not by the defaultdict.

    expected = defaultdict(dict)
    for n1, gp1 in data.groupby('A'):
        for n2, gp2 in gp1.groupby('B'):
            expected[n1][n2] = op(gp2.loc[:, ['C', 'D']])

However, this is still relevant to our discussion, because the defaultdict had to be converted to a regular dict afterward, presumably to ensure that missing keys cause KeyErrors.

    expected = dict((k, DataFrame(v)) for k, v in compat.iteritems(expected))

If you felt like one-lining this, it might look like

    expected = {n1: DataFrame({n2: op(gp2.loc[:, ['C', 'D']])
                               for n2, gp2 in gp1.groupby('B')})
                for n1, gp1 in data.groupby('A')}

I'm not saying that's more clear, but the fact that it's possible emphasizes that the grouping is performed by Pandas. Grouping cannot be one-lined if restricted to the Python standard library.
    results = defaultdict(dict)
    for i, k1 in enumerate(arg1.columns):
        for j, k2 in enumerate(arg2.columns):
            if j < i and arg2 is arg1:
                # Symmetric case
                results[i][j] = results[j][i]
            else:
                results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
                                                arg2.iloc[:, j]))

The nested loop looks like a cross-product. It might be clearer using itertools, but I wouldn't use ``grouped`` here.

    cross = itertools.product(enumerate(arg1.columns), enumerate(arg2.columns))
    self.mapping = collections.defaultdict(set)
    for op in (op for op in graph.get_operations()):
        if op.name.startswith(common.SKIPPED_PREFIXES):
            continue
        for op_input in op.inputs:
            self.mapping[op_input].add(op)

This is a case of a single element being added to multiple groups, which is your section B, below. The loop and filter could be better. It looks like someone intended to convert if/continue to a comprehension, but stopped partway through the revision.

    ops = (op for op in graph.get_operations()
           if not op.name.startswith(common.SKIPPED_PREFIXES))

I'm not sure how I like it, but here's a revision to use ``grouped``:

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = grouped(inputs, key=itemgetter(0))
    groups = {k: set(v for _, v in g) for k, g in groups.items()}

https://frama.link/SZh2q8pS
    handler_results = collections.defaultdict(list)
    for handler in per_handler_ops.keys():
        for op in per_handler_ops[handler]:
            handler_results[handler].append(op_results[op])
    return handler_results

Here it looks like the results are already grouped by handler, and this loop is constructing a parallel dict of lists for the results instead of the ops themselves.

    return {handler: [op_results[op] for op in group]
            for handler, group in per_handler_ops.items()}

B) cases of grouping, but where the for loop was feeding more than one "grouper" -- pretty annoying if we want to group something. ex: https://frama.link/Db-Ny49a,
    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary of head node
        type --> fixer list.  """
        head_nodes = collections.defaultdict(list)
        every = []
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads = _get_head_types(fixer.pattern)
                except _EveryNode:
                    every.append(fixer)
                else:
                    for node_type in heads:
                        head_nodes[node_type].append(fixer)
            else:
                if fixer._accept_type is not None:
                    head_nodes[fixer._accept_type].append(fixer)
                else:
                    every.append(fixer)
        for node_type in chain(pygram.python_grammar.symbol2number.values(),
                               pygram.python_grammar.tokens):
            head_nodes[node_type].extend(every)
        return dict(head_nodes)

Again this ends with a conversion from defaultdict to basic dict: ``return dict(head_nodes)``. This is somewhat common in usage of defaultdict, because after the creation of groups, the later usage wants missing keys to raise KeyError, not insert a new, empty group.

The association of a single element with multiple groups makes this a little awkward again.

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary of head node
        type --> fixer list.  """
        nodetypes = set(chain(pygram.python_grammar.symbol2number.values(),
                              pygram.python_grammar.tokens))
        heads = {}
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads[fixer] = _get_head_types(fixer.pattern)
                except _EveryNode:
                    heads[fixer] = nodetypes
            elif fixer._accept_type is not None:
                heads[fixer] = [fixer._accept_type]
            else:
                heads[fixer] = nodetypes
        pairs = ((head, fixer) for fixer, types in heads.items()
                 for head in types)
        groups = grouped(pairs, key=itemgetter(0))
        return {head: [fixer for _, fixer in g]
                for head, g in groups.items()}

https://frama.link/bZakUR33,
    directories = defaultdict(set)
    cache_directories = defaultdict(set)
    groups = defaultdict(list)
    for source, target, group, disk_id, condition in files:
        target = PureWindowsPath(target)
        groups[group].append((source, target, disk_id, condition))
        if target.suffix.lower() in {".py", ".pyw"}:
            cache_directories[group].add(target.parent)
        for dirname in target.parents:
            parent = make_id(dirname.parent)
            if parent and parent != '.':
                directories[parent].add(dirname.name)

This one has 3 different groupings created simultaneously. We could separate the 3 tasks, but they all rely on a converted ``target`` path, so it'd really end up being 4 loops. At that point, perhaps it's best to do just 1 loop as written.
    paramdicts = {}
    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
                d = {}
                for x in attr:
                    if not hasattr(x, '__iter__'):
                        x = (x,)
                    n = '_'.join(str(v) for v in x).replace(' ', '_')
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
        if '_as_' in name:
            testers[name.split('_as_')[0] + '_as_'].append(name)

This one isn't a multiple-groups issue. We can separate the tasks of paramdicts and testers.

    as_names = (name for name in cls.__dict__ if '_as_' in name)
    testers = grouped(as_names, key=lambda name: name.split('_as_')[0] + '_as_')

The paramdicts isn't a dict of lists, so it's not really appropriate for ``grouped``.
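A self-contained sketch of that testers revision, using a stand-in ``grouped`` helper (the builtin is only proposed) and hypothetical attribute names in the style of the original code:

```python
def grouped(iterable, key):
    # stand-in for the proposed builtin: dict of lists keyed by key(item)
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups

# hypothetical class-dict names (stand-ins for cls.__dict__ keys)
names = ['int_as_float', 'int_as_str', 'list_as_tuple', 'setUp']

as_names = (name for name in names if '_as_' in name)
testers = grouped(as_names, key=lambda name: name.split('_as_')[0] + '_as_')
# testers == {'int_as_': ['int_as_float', 'int_as_str'],
#             'list_as_': ['list_as_tuple']}
```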
Instead of ``defaultdict(lambda: 0)`` it seems better to use ``Counter()``. It's called "invocation_counts" after all.
This seems like one of the ideal uses for defaultdict. It even has bits of code made to handle the possibly accidental insert of new keys by popping them off afterward.

D) Sometimes you just want a defaultdict inside a defaultdict inside a dict and just have fun: https://frama.link/asBNLr1g,
I only have a tenuous grasp of git, so the fact that this file isn't in the master branch is stretching my mind. But again, this doesn't feel like a grouping task to me. It probably could be rewritten as such, but not trivially.
It's hard to tell at a glance if this should truly be a dict of dicts of sets or a dict of sets where the keys are 2-tuples. I'll assume the former is correct. On a side note, it looks like they're doing some silly things to fit into the 80 characters per line restriction, like looping over keys and looking up values inside the loop instead of using ``.items()``. I understand that the character limit is strict, but... https://github.com/tensorflow/tensorflow/blob/b202db076ecdc8a0fc80a49252d4a8...
Let's take a look at these two examples:

    grouped(contacts, key=itemgetter('city'))
    grouped(os.listdir('.'), key=lambda filepath: os.path.splitext(filepath)[1])

How would those be rewritten using your proposal?
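For concreteness, here is how the first example runs with a stand-in ``grouped`` (the builtin is only proposed) and hypothetical contact records:

```python
from operator import itemgetter

def grouped(iterable, key):
    # stand-in for the proposed builtin: dict of lists keyed by key(item)
    groups = {}
    for item in iterable:
        groups.setdefault(key(item), []).append(item)
    return groups

# hypothetical contact records (stand-in data)
contacts = [
    {'name': 'Ann', 'city': 'Seattle'},
    {'name': 'Bob', 'city': 'Paris'},
    {'name': 'Cid', 'city': 'Seattle'},
]

by_city = grouped(contacts, key=itemgetter('city'))
# by_city['Seattle'] holds Ann's and Cid's records; by_city['Paris'] holds Bob's
```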

On Fri, Jul 13, 2018 at 12:38 PM, Michael Selik <mike@selik.org> wrote:
Thanks for linking to these.
yup -- real use cases are really helpful. Though the other paradigm for grouping is use of setdefault() rather than defaultdict. So it would be nice to look for those, too.
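For comparison, a minimal example of the ``setdefault`` paradigm mentioned here, which builds the same dict of lists without defaultdict (the word data is a made-up stand-in):

```python
words = ['apple', 'ant', 'bear', 'bee']  # stand-in data

# setdefault inserts the empty list only when the key is missing
groups = {}
for word in words:
    groups.setdefault(word[0], []).append(word)

assert groups == {'a': ['apple', 'ant'], 'b': ['bear', 'bee']}
```

Unlike defaultdict, the result is a plain dict, so there's no conversion step needed to make missing keys raise KeyError afterward.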
I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately.
agreed -- and it shouldn't. I'd like to see how some of these pan out with my proposed API: either a Grouping class, or at least (key, value) iterables and/or a value function. I don't have time now to do them all, but for the moment:

I noticed recently that *all* examples for collections.defaultdict (
and yet others on this thread think a (key, value) input would be rare -- I guess it depends on whether you are thinking dict-like already....
so this one is a prime case for a value function -- I think post-processing the groups is a pretty common case -- why make people post-process it?

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    def valuefunc(item):
        return float(item.attrib['time'])

    groups = grouped(root, keyfunc, valuefunc)
    garbage = groups.pop(None, [])

And the post-processing is then mixing comprehension style with key-function style (what to call that -- "functional" style?), so why not use a (key, value) iterable:

    groups = grouped((findfile(opts.fileroot, item.attrib['classname']),
                      item.attrib['time'])
                     for item in root)

OK -- that's packing a bit too much into a line, so how about:

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped((keyfunc(item), item.attrib['time']) for item in root)
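A sketch of the value-function variant being suggested here; the ``valuefunc`` parameter is part of this proposal, not an existing API, and the record data is a made-up stand-in:

```python
def grouped(iterable, keyfunc, valuefunc=None):
    # hypothetical signature: the optional valuefunc transforms each
    # element before it is appended to its group
    groups = {}
    for item in iterable:
        value = item if valuefunc is None else valuefunc(item)
        groups.setdefault(keyfunc(item), []).append(value)
    return groups

# stand-in (label, numeric-string) records
records = [('a', '1.5'), ('a', '2.5'), ('b', '3.0')]
groups = grouped(records,
                 keyfunc=lambda r: r[0],
                 valuefunc=lambda r: float(r[1]))
assert groups == {'a': [1.5, 2.5], 'b': [3.0]}
```

The value function saves the dict-comprehension post-processing pass that the key-function-only API requires.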
yeah, this is weird -- but it does make a case for having a class with the option of using a set to collect (which I have in an older commit of my prototype):

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = Grouping(inputs, key=itemgetter(0), collection=set)

otherwise, you could have a method to do it:

    groups.map_on_groups(set)

(not sure I like that method name, but I hope you get the idea)

OK, back to work.

-CHB
participants (5)

- Chris Barker
- David Mertz
- Franklin? Lee
- Michael Selik
- Nicolas Rolin