Support floating-point values in collections.Counter
It would be useful in many scenarios for values in collections.Counter to be allowed to be floating point. I know that Counter nominally emulates a multiset, which would suggest only integer values, but in a more general sense, it could be an accumulator of either floating-point or integer data. As near as I can tell, Counter already supports float values in both Python 2.7 and 3.6, and given the way the code is implemented, this change should be a no-op. All that is required is to update the documentation to say floating-point values are allowed, as it currently says only integers are allowed.
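[A quick interactive check, not from the original mail, illustrating the claim above: Counter does not enforce integer values today, so float accumulation already works. The keys and weights here are made up for the example.]

```python
from collections import Counter

# Counter.__missing__ returns 0 for absent keys, so += works for floats too
weights = Counter()
for key, w in [('a', 0.5), ('b', 1.25), ('a', 0.25)]:
    weights[key] += w

print(weights['a'])  # 0.75
print(weights['b'])  # 1.25
```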
On Tue, Dec 19, 2017 at 12:51 AM, Joel Croteau <jcroteau@gmail.com> wrote:
It would be useful in many scenarios for values in collections.Counter to be allowed to be floating point. I know that Counter nominally emulates a multiset, which would suggest only integer values, but in a more general sense, it could be an accumulator of either floating-point or integer data. As near as I can tell, Counter already supports float values in both Python 2.7 and 3.6, and given the way the code is implemented, this change should be a no-op. All that is required is to update the documentation to say floating-point values are allowed, as it currently says only integers are allowed.
How should the `elements` method work? Currently it raises "TypeError: integer argument expected, got float". At the least, it should be documented that the method only works when all counts are integers. The error message could also state exactly which key it failed on.
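[A short illustration of the failure mode described above, with a made-up key. Note that the error is only raised lazily, when the iterator is consumed:]

```python
from collections import Counter

c = Counter({'a': 2.5})
it = c.elements()  # no error yet: elements() returns a lazy iterator
try:
    list(it)  # counts must be integers here, so this raises
    print("no error")
except TypeError:
    print("TypeError raised")
```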
On Mon, Dec 18, 2017 at 11:51:46PM +0000, Joel Croteau wrote:
It would be useful in many scenarios for values in collections.Counter to be allowed to be floating point.
Can you give a concrete example?
I know that Counter nominally emulates a multiset, which would suggest only integer values, but in a more general sense, it could be an accumulator of either floating point or integer data.
As near as I can tell, Counter already supports float values in both Python 2.7 and 3.6, and given the way the code is implemented, this change should be a no-op. All that is required is to update the documentation to say floating-point values are allowed, as it currently says only integers are allowed.
I don't think it's that simple. What should the elements() method do when an element has a "count" of 2.5, say? What happens if the count is a NaN? There are operations that discard negative and zero counts, or positive and zero counts. How should they treat -0.0 and NaNs?

I am intrigued by this suggestion, but I'm not quite sure where I would use such an accumulator, or whether a Counter is the right solution for it. Perhaps some concrete use-cases would convince me.

-- Steve
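[Steven's edge cases can be probed directly. As a sketch of current behaviour, with made-up keys: the unary-plus operation, which keeps only positive counts, drops both -0.0 and NaN, because neither compares greater than zero.]

```python
import math
from collections import Counter

c = Counter({'a': 2.5, 'b': -0.0, 'c': math.nan})
# +c keeps only items whose count compares > 0:
# 2.5 > 0 is True; -0.0 > 0 is False; nan > 0 is False
print(+c)  # only 'a' survives
```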
On 18 December 2017 at 23:51, Joel Croteau <jcroteau@gmail.com> wrote:
It would be useful in many scenarios for values in collections.Counter to be allowed to be floating point.
Do you have any evidence of this? Code examples that would be significantly improved by such a change? I can't think of any myself. I might consider writing

    totals = defaultdict(float)
    for ...:
        totals[something] += calculation(something)

but using a counter is neither noticeably easier, nor clearer... One way of demonstrating such a need would be if your proposed behaviour were available on PyPI and getting used a lot - I'm not aware of any such module if it is.

Paul
Well, here is some code I wrote recently to build a histogram over a weighted graph, before becoming aware that Counter existed (score is a float here):

    from collections import defaultdict

    total_score_by_depth = defaultdict(float)
    total_items_by_depth = defaultdict(int)
    num_nodes_by_score = defaultdict(int)
    num_nodes_by_log_score = defaultdict(int)
    num_edges_by_score = defaultdict(int)
    for state in iter_graph_components():
        try:
            # There is probably some overlap here
            ak = state['ak']
            _, c = ak.score_paths(max_depth=15)
            for edge in state['graph'].edges:
                num_edges_by_score[np.ceil(20.0 * edge.score) / 20.0] += 1
            for node in c.nodes:
                total_score_by_depth[node.depth] += node.score
                total_items_by_depth[node.depth] += 1
                num_nodes_by_score[np.ceil(20.0 * node.score) / 20.0] += 1
                num_nodes_by_log_score[np.ceil(-np.log10(node.score))] += 1
            num_nodes_by_score[0.0] += len(state['graph'].nodes) - len(c.nodes)
            num_nodes_by_log_score[100.0] += len(state['graph'].nodes) - len(c.nodes)
        except MemoryError:
            print("Skipped massive.")

Without going too much into what this does, note that I could replace the other defaultdicts with Counters, but I can't do the same thing with total_score_by_depth, at least not without violating the API. I would suggest that with a name like Counter, treating a class like a Counter should be the more common use case. If it's meant to be a multiset, we should call it a Multiset.

Here is an example from Stack Overflow of someone else also wanting a float counter, and the only suggestion being to use defaultdict: https://stackoverflow.com/questions/10900207/any-way-to-tackle-float-counter...

On Tue, Dec 19, 2017 at 3:08 AM Paul Moore <p.f.moore@gmail.com> wrote:
On 18 December 2017 at 23:51, Joel Croteau <jcroteau@gmail.com> wrote:
It would be useful in many scenarios for values in collections.Counter to be allowed to be floating point.
Do you have any evidence of this? Code examples that would be significantly improved by such a change? I can't think of any myself.
I might consider writing
    totals = defaultdict(float)
    for ...:
        totals[something] += calculation(something)
but using a counter is neither noticeably easier, nor clearer...
One way of demonstrating such a need would be if your proposed behaviour were available on PyPI and getting used a lot - I'm not aware of any such module if it is.
Paul
On 20 December 2017 at 03:09, Joel Croteau <jcroteau@gmail.com> wrote:
Well here is some code I wrote recently to build a histogram over a weighted graph, before becoming aware that Counter existed (score is a float here):
from collections import defaultdict
    total_score_by_depth = defaultdict(float)
    total_items_by_depth = defaultdict(int)
    num_nodes_by_score = defaultdict(int)
    num_nodes_by_log_score = defaultdict(int)
    num_edges_by_score = defaultdict(int)
    for state in iter_graph_components():
        try:
            # There is probably some overlap here
            ak = state['ak']
            _, c = ak.score_paths(max_depth=15)
            for edge in state['graph'].edges:
                num_edges_by_score[np.ceil(20.0 * edge.score) / 20.0] += 1
            for node in c.nodes:
                total_score_by_depth[node.depth] += node.score
                total_items_by_depth[node.depth] += 1
                num_nodes_by_score[np.ceil(20.0 * node.score) / 20.0] += 1
                num_nodes_by_log_score[np.ceil(-np.log10(node.score))] += 1
            num_nodes_by_score[0.0] += len(state['graph'].nodes) - len(c.nodes)
            num_nodes_by_log_score[100.0] += len(state['graph'].nodes) - len(c.nodes)
        except MemoryError:
            print("Skipped massive.")
Without going too much into what this does, note that I could replace the other defaultdicts with Counters, but I can't do the same thing with a total_score_by_depth, at least not without violating the API.
Hmm, OK. I can't see any huge benefit from switching to a Counter, though. You're not using any features of a Counter that aren't shared by a defaultdict, nor is there any code here that could be simplified or replaced by using such features...
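[For what it's worth, the features that would distinguish Counter from defaultdict here - arithmetic and most_common() - do already happen to accept float values in current CPython. A minimal sketch with made-up keys, not an API guarantee:]

```python
from collections import Counter

a = Counter({'x': 1.5, 'y': 2.25})
b = Counter({'x': 1.0})

print(a + b)                   # per-key sums: x -> 2.5, y -> 2.25
print((a + b).most_common(1))  # [('x', 2.5)]
```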
I would suggest that with a name like Counter, treating a class like a Counter should be the more common use case. If it's meant to be a multiset, we should call it a Multiset.
Personally, I consider "counting" to be something we do with integers (whole numbers), not with floats. So for me the name Counter clearly implies an integer. Multiset would be a reasonable alternative name, but Python has a tradition of using "natural language" names over "computer science" names, so I'm not surprised Counter was chosen instead. I guess it's ultimately a matter of opinion whether a float-based Counter is a natural extension or not.

Paul
On 12/20/17 5:05 AM, Paul Moore wrote:
Well here is some code I wrote recently to build a histogram over a weighted graph, before becoming aware that Counter existed (score is a float here):
from collections import defaultdict
    total_score_by_depth = defaultdict(float)
    total_items_by_depth = defaultdict(int)
    num_nodes_by_score = defaultdict(int)
    num_nodes_by_log_score = defaultdict(int)
    num_edges_by_score = defaultdict(int)
    for state in iter_graph_components():
        try:
            # There is probably some overlap here
            ak = state['ak']
            _, c = ak.score_paths(max_depth=15)
            for edge in state['graph'].edges:
                num_edges_by_score[np.ceil(20.0 * edge.score) / 20.0] += 1
            for node in c.nodes:
                total_score_by_depth[node.depth] += node.score
                total_items_by_depth[node.depth] += 1
                num_nodes_by_score[np.ceil(20.0 * node.score) / 20.0] += 1
                num_nodes_by_log_score[np.ceil(-np.log10(node.score))] += 1
            num_nodes_by_score[0.0] += len(state['graph'].nodes) - len(c.nodes)
            num_nodes_by_log_score[100.0] += len(state['graph'].nodes) - len(c.nodes)
        except MemoryError:
            print("Skipped massive.")
Without going too much into what this does, note that I could replace the other defaultdicts with Counters, but I can't do the same thing with total_score_by_depth, at least not without violating the API.

Hmm, OK. I can't see any huge benefit from switching to a Counter, though. You're not using any features of a Counter that aren't shared by a defaultdict, nor is there any code here that could be simplified or replaced by using such features...
I would suggest that with a name like Counter, treating a class like a Counter should be the more common use case. If it's meant to be a multiset, we should call it a Multiset.

Personally, I consider "counting" to be something we do with integers (whole numbers), not with floats. So for me the name Counter clearly implies an integer. Multiset would be a reasonable alternative name, but Python has a tradition of using "natural language" names over "computer science" names, so I'm not surprised Counter was chosen instead.
I guess it's ultimately a matter of opinion whether a float-based Counter is a natural extension or not.
One thing to note is that Counter supports negative numbers, so we are already outside the natural numbers :)

    Python 3.6.4 (default, Dec 19 2017, 08:11:42)
    >>> from collections import Counter
    >>> c = Counter(a=4, b=2, c=0, d=-2)
    >>> d = Counter(a=1, b=2, c=3, d=4)
    >>> c.subtract(d)
    >>> c
    Counter({'a': 3, 'b': 0, 'c': -3, 'd': -6})
    >>> list(c.elements())
    ['a', 'a', 'a']

--Ned.
On Mon, Dec 18, 2017 at 6:51 PM, Joel Croteau <jcroteau@gmail.com> wrote:
It would be useful in many scenarios for values in collections.Counter to be allowed to be floating point. I know that Counter nominally emulates a multiset, which would suggest only integer values, but in a more general sense, it could be an accumulator of either floating-point or integer data. As near as I can tell, Counter already supports float values in both Python 2.7 and 3.6, and given the way the code is implemented, this change should be a no-op. All that is required is to update the documentation to say floating-point values are allowed, as it currently says only integers are allowed.
That's beyond the scope of Counter. I think what you really want is a generalization of Counter which represents a key'd number bag: a dict of key => number which supports arithmetic operations, like Numpy arrays are to lists.

Example methods:

    __init__(...): Like dict's version, but it will combine the values of
        duplicate keys in its params.
    update(...): Similar to __init__.
    fromkeys(...): Like dict's version, but uses 0 or 1 as the default value,
        and combines values like the constructor. With value=1, this is
        roughly equivalent to the Counter constructor.
    <arithmetic operators>: Arithmetic with other number bags, with dicts,
        and with number-like values.
    clearsmall(tolerance=0): Removes keys whose values are close to 0.

Other methods may take inspiration from Numpy. The class should probably be a third-party package (and probably already is), so that the method list can be solidified first.
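[A rough sketch of what such a "number bag" class could look like. All names here, including clearsmall, follow the hypothetical method list above; this is not an existing package, just an illustration of the idea.]

```python
class NumberBag(dict):
    """Hypothetical keyed number bag: a dict of key -> number with
    elementwise arithmetic, sketched from the proposal above."""

    def __missing__(self, key):
        return 0  # absent keys act as 0, as in Counter

    def __add__(self, other):
        result = NumberBag(self)
        for key, value in other.items():
            result[key] = result.get(key, 0) + value
        return result

    def __mul__(self, scalar):
        # scalar multiplication, loosely analogous to NumPy broadcasting
        return NumberBag({key: value * scalar for key, value in self.items()})

    def clearsmall(self, tolerance=0):
        """Remove keys whose values are within `tolerance` of 0."""
        for key in [k for k, v in self.items() if abs(v) <= tolerance]:
            del self[key]


bag = NumberBag({'x': 1.5}) + NumberBag({'x': 0.5, 'y': 1e-12})
bag.clearsmall(tolerance=1e-9)
print(bag)  # {'x': 2.0}
```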
participants (6)
-
Franklin? Lee
-
Joel Croteau
-
Ned Batchelder
-
Paul Moore
-
Steven D'Aprano
-
Søren Pilgård