Mailman 3 collections.Counter should implement __mul__, __rmul__ - Python-ideas

newer
Re: [Python-ideas] Change magic...

collections.Counter should implement mul, rmul

Peter Norvig

April 15, 2018

2:05 p.m.

For most types that implement __add__, `x + x` is equal to `2 * x`. That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight. It would be worthwhile to implement multiplication because, among other reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition. Here's an implementation: def __mul__(self, scalar): "Multiply each entry by a scalar." result = Counter() for key in self: result[key] = self[key] * scalar return result def __rmul__(self, scalar): "Multiply each entry by a scalar." result = Counter() for key in self: result[key] = scalar * self[key] return result

Attachments:

attachment.htm (text/html — 1.6 KB)

Show replies by date

Wes Turner

April 2018

3:45 p.m.

Good call. Is it any faster to initialize Counter with a dict comprehension? return Counter({k: v*scalar for (k, v) in self.items()) On Sun, Apr 15, 2018 at 5:05 PM, Peter Norvig <peter@norvig.com> wrote:

...

Peter Norvig

4:02 p.m.

That's actually how I coded it myself the first time. But I worried it would be wasteful to create an intermediate dict and discard it. `timeit` results: 3.79 µs for the for-loop, 5.08 µs for the dict-comprehension with a 10-key Counter 257 µs for the for-loop, 169 µs for the dict-comprehension with a 1000-key Counter So results are mixed, but you are probably right. On Sun, Apr 15, 2018 at 3:46 PM Wes Turner <wes.turner@gmail.com> wrote:

...

Raymond Hettinger

5:05 p.m.

...

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-) From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...

It would be worthwhile to implement multiplication because, among other reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/issue25478 One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function. If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations. Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward. Raymond

Peter Norvig

5:44 p.m.

If you think of a Counter as a multiset, then it should support __or__, not __add__, right? I do think it would have been fine if Counter did not support "+" at all (and/or if Counter was limited to integer values). But given where we are now, it feels like we should preserve `c + c == 2 * c`. As to the "doesn't really add any new capabilities" argument, that's true, but it is also true for Counter as a whole: it doesn't add much over defaultdict(int), but it is certainly convenient to have a standard way to do what it does. I agree with your intuition that low level is better. `total` would be useful. If you have total and mul, then as you and others have pointed out, normalize is just c *= 1/c.total. I can also see the argument for a new FrequencyTable class in the statistics module. (By the way, I refactored my https://github.com/norvig/pytudes/blob/master/ipynb/Probability.ipynb a bit, and now I no longer need a `normalize` function.) On Sun, Apr 15, 2018 at 5:06 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...

...
On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter@norvig.com> wrote:

For most types that implement __add__, `x + x` is equal to `2 * x`.

...

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-)

From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...
It would be worthwhile to implement multiplication because, among other

reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/issue25478

One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function.

If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations.

Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward.

Raymond

Wes Turner

6:34 p.m.

On Sunday, April 15, 2018, Peter Norvig <peter@norvig.com> wrote:

...

nltk.probability.FreqDist(collections.Counter) doesn't have a __mul__ either http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist numpy.unique(, return_counts=True).unique_counts returns an array sorted by value with a __mul__. https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html scipy.stats.itemfreq returns an array sorted by value with a __mul__ and the items in the first column. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.itemfreq.ht... pandas.Series.value_counts(, normalize=False) returns a Series sorted by descending frequency. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_c...

...

On Sun, Apr 15, 2018 at 5:06 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...
...
On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter@norvig.com> wrote:

For most types that implement __add__, `x + x` is equal to `2 * x`.

...

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-)

From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...
It would be worthwhile to implement multiplication because, among other

reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/ issue25478

One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function.

If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations.

Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward.

Raymond

Wes Turner

7:18 p.m.

tf.bincount() returns a vector with integer counts. https://www.tensorflow.org/api_docs/python/tf/bincount Keras calls np.bincount in an mnist example. np.bincount returns an array with a __mul__ https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.bincount.h... - sklearn.preprocessing.normalize http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-norm... http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.norma... featuretools.primitives.NUnique has a normalize method. https://docs.featuretools.com/generated/featuretools.primitives.NUnique.html... And I'm done sharing non-pure-python solutions for this problem, I promise On Sunday, April 15, 2018, Wes Turner <wes.turner@gmail.com> wrote:

...

On Sunday, April 15, 2018, Peter Norvig <peter@norvig.com> wrote:

...
If you think of a Counter as a multiset, then it should support __or__, not __add__, right?

I do think it would have been fine if Counter did not support "+" at all (and/or if Counter was limited to integer values). But given where we are now, it feels like we should preserve `c + c == 2 * c`.

As to the "doesn't really add any new capabilities" argument, that's true, but it is also true for Counter as a whole: it doesn't add much over defaultdict(int), but it is certainly convenient to have a standard way to do what it does.

I agree with your intuition that low level is better. `total` would be useful. If you have total and mul, then as you and others have pointed out, normalize is just c *= 1/c.total.

I can also see the argument for a new FrequencyTable class in the statistics module. (By the way, I refactored my https://github.com/norvig/pytudes/blob/master/ipynb/Probability.ipynb a bit, and now I no longer need a `normalize` function.)

nltk.probability.FreqDist(collections.Counter) doesn't have a __mul__ either http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist

numpy.unique(, return_counts=True).unique_counts returns an array sorted by value with a __mul__. https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html

scipy.stats.itemfreq returns an array sorted by value with a __mul__ and the items in the first column. https://docs.scipy.org/doc/scipy/reference/generated/ scipy.stats.itemfreq.html

pandas.Series.value_counts(, normalize=False) returns a Series sorted by descending frequency. https://pandas.pydata.org/pandas-docs/stable/generated/ pandas.Series.value_counts.html

...
On Sun, Apr 15, 2018 at 5:06 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...
...
On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter@norvig.com> wrote:

For most types that implement __add__, `x + x` is equal to `2 * x`.

...

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-)

From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...
It would be worthwhile to implement multiplication because, among

other reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/issue25478

One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function.

If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations.

Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward.

Raymond

Raymond Hettinger

8:41 p.m.

...

On Apr 15, 2018, at 7:18 PM, Wes Turner <wes.turner@gmail.com> wrote:

And I'm done sharing non-pure-python solutions for this problem, I promise

Keep them coming :-) Thanks for the research. It helps to remind ourselves that almost none of our problems are new :-) Raymond

Raymond Hettinger

8:39 p.m.

...

On Apr 15, 2018, at 5:44 PM, Peter Norvig <peter@norvig.com> wrote:

If you think of a Counter as a multiset, then it should support __or__, not __add__, right?

FWIW, Counter is explicitly documented to support the four multiset-style mathematical operations discussed in Knuth TAOCP Volume II section 4.6.3 exercise 19:

...

...
...
c = Counter(a=3, b=1) d = Counter(a=1, b=2) c + d # add two counters together: c[x] + d[x] Counter({'a': 4, 'b': 3}) c - d # saturating subtraction (keeping only positive counts) Counter({'a': 2}) c & d # intersection: min(c[x], d[x]) Counter({'a': 1, 'b': 1}) c | d # union: max(c[x], d[x]) Counter({'a': 3, 'b': 2})

The wikipedia article on Multisets lists a further operation, inclusion, that is not currently supported: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations

...

I do think it would have been fine if Counter did not support "+" at all (and/or if Counter was limited to integer values). But given where we are now, it feels like we should preserve `c + c == 2 * c`.

The + operation has legitimate use cases (it is perfectly reasonable to want to combine the results two separate counts). And, as you pointed out, it is what we already have and cannot change :-) So, the API design issue that confronts us is that it would be a bit weird and disorienting for the arithmetic operators to have two different signatures: <counter> += <counter> <counter> -= <counter> <counter> *= <scalar> <counter> /= <scalar> Also, we should respect the comments given by others on the tracker issue. In particular, there is a preference to not have an in-place operation and only allow a new counter instance to be created. That will help people avoid data structure modality problems: . c[category] += 1 # Makes sense during the frequency counting or accumulation phase c /= c.total # Covert to a probability mass function c[category] += 1 # This code looks correct but no longer makes any sense

...

As to the "doesn't really add any new capabilities" argument, that's true, but it is also true for Counter as a whole: it doesn't add much over defaultdict(int), but it is certainly convenient to have a standard way to do what it does.

IIRC, the defaultdict(int) in your first version triggered a bug because the model inadvertently changed during the analysis phase rather than being frozen after the training phase. The Counter doesn't suffer from the same issue (modifying the dict on a failed lookup). Also, the Counter class does have a few value added features: Counter(iterable), c.most_common(), c.elements(), etc. But yes, at its heart the counter is mostly just a specialized dictionary. The thought I was trying to express is that suggestions to build out Counter API are a little less compelling when we already have a way to do it that is flexible, fast, clear, and standard (i.e. dict comprehensions).

...

I agree with your intuition that low level is better. `total` would be useful. If you have total and mul, then as you and others have pointed out, normalize is just c *= 1/c.total.

I fully support adding some functionality for scaling to support probability distributions, bayesian update steps, chi-square tests, and whatnot. The people who need convincing are the other respondents on the tracker. They had a strong mental model for the Counter class that is somewhat at odds with this proposal. Raymond

Chris Angelico

8:42 p.m.

On Mon, Apr 16, 2018 at 1:39 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:

...

This needn't be a blocker. Strings can be added to strings, and strings can be multiplied by integers. If it's of practical value to multiply a Counter by a number, by all means do it. ChrisA

Peter Norvig

9:04 p.m.

On Sun, Apr 15, 2018 at 8:39 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...

Wow, I never noticed "&" and "|" -- I guess when I got to "Common patterns for working with" in the documentation, I figured that there wouldn't be any new methods introduced after that and I stopped reading.

...

Is it weird and disorienting to have: <str> += <str> <str> *= <scalar>

Raymond Hettinger

9:39 p.m.

...

Yes, there is a precedent that does seem to have worked out well in practice :-) It isn't exactly parallel because strings aren't containers of numbers, they don't have & and |, and there isn't a reason to want a / operation, but it does suggest that signature variation might not be problematic. BTW, do you just want __mul__ and __rmul__? If those went in, presumably there will be a request to support __imul__ because otherwise c*=3 would still work but would be inefficient (that was the rationale for adding inplace variants for all the current arithmetic operators). Likewise, presumably someone would legitimately want __div__ to support the normalization use case. Perhaps less likely, there would be also be a request for __floordiv__ to allow exactly scaled results to stay in the domain of integers. Which if any of these makes sense to you? Also, any thoughts on the cleanest way to express the computation of a chi-squared statistic (for example, to compare observed first digit frequencies to the frequencies predicted by Benford's Law)? This isn't an arbitrary question (it came up when a professor first proposed a variant of this idea a few years ago). Raymond

Peter Norvig

9:55 p.m.

I don't have strong feelings, but I would say yes to __imul__, no to __div__ and __floordiv__ (with str/list/tuple as the precedent). For chisquare, I would be perfectly happy with: digit_counts = Counter(...) scipy.stats.chisquare(list(digit_counts.values())) On Sun, Apr 15, 2018 at 9:39 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...

Wes Turner

10:24 p.m.

On Monday, April 16, 2018, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:

...

https://en.wikipedia.org/wiki/Chi-squared_distribution https://en.wikipedia.org/wiki/Chi-squared_test https://en.wikipedia.org/wiki/Benford%27s_law (How might one test this with e.g. *double* SHA256?) proportions_chisquare(count, nobs, value=None) https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.propo... https://www.statsmodels.org/dev/genindex.html?highlight=chi scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0) https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.stats.chis... sklearn.feature_selection.chi2(X, y) http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.c... kernel_approximation.AdditiveChi2Sampler kernel_approximation.SkewedChi2Sampler http://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_ap... has sklearn.metrics.pairwise.chi2_kernel(X, Y=None, gamma=1.0) http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.ch... sklearn.metrics.pairwise.additive_chi2_kernel(X, Y=None) http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.ad... ... FreqDist(collections.Counter(odict)) ... sparse-coding ... One-Hot / Binarization http://contrib.scikit-learn.org/categorical-encoding/ StandardScalar (for standardization) refuses to work with sparse matrices: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Stand...

...

Tim Peters

10:07 p.m.

[Peter Norvig]

...

Adding Counter * integer doesn't bother me a bit, but the definition of what that should compute isn't obvious. In particular, that implementation doesn't preserve that `x+x == 2*x` if x has any negative values:

...

It would be strange if x+x != 2*x, and if x*-1 != -x:

...

Etc. Then again, it's already the case that, e.g., x-y isn't always the same as x + -y:

...

So screw obvious formal identities ;-) I'm not clear on why "+" and "-" discard keys with values <= 0 to begin with. For "-" it's natural enough viewing "-" as being multiset difference, but for "+"? That's just made up ;-) In any case, despite the oddities, I think your implementation would be least surprising overall (ignore the sign of the resulting values). At least for Counters that actually make sense as multisets (have no values <= 0), and for a positive integer multiplier `n > 0`, it does preserve that `x*n` = `x + x + ... + x` (with `n` instances of `x`).

Steve Barnes

10:22 p.m.

On 16/04/2018 06:07, Tim Peters wrote:

...

[Peter Norvig]

...
For most types that implement __add__, `x + x` is equal to `2 * x`.

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

... Here's an implementation:

def __mul__(self, scalar): "Multiply each entry by a scalar." result = Counter() for key in self: result[key] = self[key] * scalar return result

def __rmul__(self, scalar): "Multiply each entry by a scalar." result = Counter() for key in self: result[key] = scalar * self[key] return result

Adding Counter * integer doesn't bother me a bit, but the definition of what that should compute isn't obvious. In particular, that implementation doesn't preserve that `x+x == 2*x` if x has any negative values:

...
...
...
x = Counter(a=-1) x Counter({'a': -1}) x+x Counter()

It would be strange if x+x != 2*x, and if x*-1 != -x:

...
...
...
y = Counter(a=1) y Counter({'a': 1}) -y Counter()

Etc.

Then again, it's already the case that, e.g., x-y isn't always the same as x + -y:

...
...
...
x = Counter(a=1) y = Counter(a=2) x - y Counter() x + -y Counter({'a': 1})

So screw obvious formal identities ;-)

I'm not clear on why "+" and "-" discard keys with values <= 0 to begin with. For "-" it's natural enough viewing "-" as being multiset difference, but for "+"? That's just made up ;-)

In any case, despite the oddities, I think your implementation would be least surprising overall (ignore the sign of the resulting values). At least for Counters that actually make sense as multisets (have no values <= 0), and for a positive integer multiplier `n > 0`, it does preserve that `x*n` = `x + x + ... + x` (with `n` instances of `x`). _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Wouldn't it make sense to have the current counter behaviour, (negative counts not allowed), and also a counter that did allow negative values (my bank doesn't seem to have a problem with my balance being able to go below negative), and possibly at the same time a counter class that allowed fractional counts? Then:

...

-- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. --- This email has been checked for viruses by AVG. http://www.avg.com

Steven D'Aprano

2:41 a.m.

On Mon, Apr 16, 2018 at 05:22:54AM +0000, Steve Barnes wrote:

...

I wish my bank was as understanding, they keep telling me I'm overdrawn... *wink*

...

and possibly at the same time a counter class that allowed fractional counts?

I understand the idea of counting in fractions (1/3, 2/3, 1, 1+1/3, ...) but I don't understand what fractional frequencies would mean. What's your use-case for fractional frequencies? -- Steve

Tim Peters

9:21 a.m.

[Steve Barnes <gadgetsteve@live.co.uk>]

...

We already have all that: a counter's values can be anything at all:

...

It's four specific binary "multiset" _operations_ that purge values <= 0 from their results:

...

OK, also unary prefix "+" and "-". Other methods do not:

...

As above, results are a property not of the counter objects, but of the specific operations performed. some_counter / scalar would most obviously work like:the current:

...

Raymond Hettinger

10:37 p.m.

...

Any thoughts on Counter * float? A key use case for what is being proposed is: c *= 1 / c.total Raymond

Tim Peters

10:51 p.m.

[Tim]

...

...
Adding Counter * integer doesn't bother me a bit, but the definition of what that should compute isn't obvious.

[Raymond]

...

Any thoughts on Counter * float? A key use case for what is being proposed is:

c *= 1 / c.total

Ah, I thought I had already addressed that, but looks like my fingers forgot to type it ;-) By all mean, yes! Indeed, that strengthens the "argument" for why `Counter * int` should ignore the signs of the values - if we allow multiplying by anything supporting __mul__, that clearly says we view multiplication as being outside the "multiset" view, and so there's no reason at all to suppress values <= 0. I also have no problem with inplace operators. Or with adding `Counter /= scalar", for that matter. Perhaps whining could be reduced by rearranging the docs some: clearly separate operations designed to support the multiset view from the others. Then "but that operation makes no sense for multisets!" can be answered with "so don't use it on multisets - like the docs told you" ;-)

Raymond Hettinger

11:58 p.m.

...

But surely __rdiv__() would be over the top, harmonic means be damned ;-) Raymond

Tim Peters

5:43 p.m.

[Tim]

...

...
I also have no problem with inplace operators. Or with adding `Counter /= scalar", for that matter.

[Raymond]

...

But surely __rdiv__() would be over the top, harmonic means be damned ;-)

Agreed - itertools.Counter is the poster child for "practicality beats purity" :-) In context, the c *= 1 / c.total example would clearly be useful at times. But it's a strained way to spell c /= c.total and, for float values, the former also introduces a needless rounding error (to compute the reciprocal). BTW, if _`Counter * scalar` is added, we should think more about oddball cases. While everyone knows what _they_ mean by "scalar", Python doesn't. The obvious implementation (Peter already gave it) would lead to things like `Counter * Counter`, where both multiplicands have integer values, yielding a Counter whose values are also Counters. That is, if c = Counter(a=1, b=2) d = Counter(a=3, b=4) then c*d would yield a Counter mapping 'a` to 1 * d == d, and 'b' to 2 * d == Counter(a=6, b=8). That's "bad", because the next suggestion will be that c*d return Counter(a=3, b=8) instead. That is, map a shared key to the product of the values associated with that key. For example, `c` is a Counter tabulating category counts, and `d` a Counter giving category weights. I don't suggest doing that now, but it would be nice to dream up a way to stop "things like" Counter * Counter at the start so that backward compatibility doesn't preclude adding sensible meanings later.

Raymond Hettinger

8:33 a.m.

...

I've started working on an implementation and several choices arise: 1) Reject scalar with a TypeError if scalar is a Counter 2) Reject scalar with a TypeError if scalar is a Mapping 3) Reject scalar with a TypeError if scalar is a Collection 4) Reject scalar with a TypeError if scalar is Sized (has a __len__ method). I lean toward rejecting all things Sized because _everyone_ knows that scalars aren't sized ;-) Raymond

Tim Peters

9:05 a.m.

[Raymond]

...

Hard to know how gonzo to get :-( _Scalar = (Sized, Container, Iterable)): # has __len__, __getitem__, or __iter__ ... if isinstance(arg, _Scalar): raise TypeError ... would also reject things like generator expressions. But ... those would blow up anyway, when multiplication was attempted. So, ya! Sticking to Sized sounds good :-)

Petr Viktorin

9:13 a.m.

On 04/18/18 17:33, Raymond Hettinger wrote:

...

Why is Iterable (__iter__) not on the list? (Apologies if I missed this somewhere in the conversation.)

...

Tim Peters

11:45 a.m.

[Raymond]

...

[Petr Viktorin <encukou@gmail.com>]

...

Why is Iterable (__iter__) not on the list?

(Apologies if I missed this somewhere in the conversation.)

I believe Raymond implicitly meant "test one of the above", not "test all of the above", and he's leaning toward Sized alone. What we're trying to stop is things like "Counter * Counter" for now, because the obvious implementation(*) of Counter.__mul__ would do a strange thing with that, where a quite different thing is plausibly wanted (and may - or may not - be added later - but, due to backward compatibility, cannot be added later if the initial implementation does the strange thing). Rejecting a Sized argument for now would stop that. Piling on additional tests could stop other things "early", but every test added slows the normal case (the argument is fine). In the case of an Iterable `arg` that's not Sized, it seems highly unlikely that arg.__mul__ or arg.__rmul__ exist, so the obvious implementation would blow up later without bothering to check in advance: >>> x = (i for i in range(10)) >>> 3 * x Traceback (most recent call last): ... TypeError: unsupported operand type(s) for *: 'int' and 'generator' (*) The obvious implementation: def __mul__(self, other): if isinstance(other, Sized): raise TypeError("cannot multiply Counter by Sized type %s" % type(other)) result = Counter() for k, v in self.items(): result[k] = v * other return result

Wes Turner

12:33 p.m.

The use cases these TypeErrors would exclude include weightings (whether from a generator without an intermediate tuple/list or from a dict) where the need is to do elementwise multiplication: if len(self) != len(other): raise ValueError("tensors are multiplicable") if self.keys() != other.keys(): #odict if set(self.keys()).difference(other.keys()): # is this contract # is it this function's responsibility to check it for (k1, v1), (k2, v2) in zip(self.items(), other.items()): self[k1] = v1*v2 At which point we might as well just introduce a sparse labeled array type that's interface-compatible with np.array and/or pd.Series(index=) with a @classmethod Counter initializer that works like collections.Counter. (see links above) On Wednesday, April 18, 2018, Tim Peters <tim.peters@gmail.com> wrote:

...

[Raymond]

...
...
I've started working on an implementation and several choices arise:

1) Reject scalar with a TypeError if scalar is a Counter 2) Reject scalar with a TypeError if scalar is a Mapping 3) Reject scalar with a TypeError if scalar is a Collection 4) Reject scalar with a TypeError if scalar is Sized (has a __len__ method).

[Petr Viktorin <encukou@gmail.com>]

...
Why is Iterable (__iter__) not on the list?

(Apologies if I missed this somewhere in the conversation.)

I believe Raymond implicitly meant "test one of the above", not "test all of the above", and he's leaning toward Sized alone.

What we're trying to stop is things like "Counter * Counter" for now, because the obvious implementation(*) of Counter.__mul__ would do a strange thing with that, where a quite different thing is plausibly wanted (and may - or may not - be added later - but, due to backward compatibility, cannot be added later if the initial implementation does the strange thing).

Rejecting a Sized argument for now would stop that. Piling on additional tests could stop other things "early", but every test added slows the normal case (the argument is fine).

In the case of an Iterable `arg` that's not Sized, it seems highly unlikely that arg.__mul__ or arg.__rmul__ exist, so the obvious implementation would blow up later without bothering to check in advance:

>>> x = (i for i in range(10)) >>> 3 * x Traceback (most recent call last): ... TypeError: unsupported operand type(s) for *: 'int' and 'generator'

(*) The obvious implementation:

def __mul__(self, other): if isinstance(other, Sized): raise TypeError("cannot multiply Counter by Sized type %s" % type(other)) result = Counter() for k, v in self.items(): result[k] = v * other return result _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Tim Peters

12:42 p.m.

[Wes Turner <wes.turner@gmail.com>]

...

Yes. That's exactly what I had in mind when I wrote:

...

The obvious implementation (already given (*)) of Counter.__mul__ would NOT AT ALL do elementwise multiplication. Nobody has asked for that yet either, so nobody is proposing to add it now either. But it's predictable that someone _will_ ask for it when __mul__ is defined for "scalars". It may or may not be added at that time. But it will be flat-out impossible to add it later (because of backward compatibility) _if_ the initial implementation does something entirely different. So, at first, we want to raise an exception for a non-'scalar" argument, so that it remains _possible_ to do something sane with it later. ...

...

Serhiy Storchaka

1:29 p.m.

18.04.18 21:45, Tim Peters пише:

...

Wouldn't be better to return NotImplemented here?

...

If discard non-positive values, this will automatically make multiplying Counter by Counter (or by sequence) invalid, because they are not comparable with 0. for k, v in self.items(): v = v * other if v > 0: result[k] = v * other

Tim Peters

2:39 p.m.

[Tim]

...

...
(*) The obvious implementation:

def __mul__(self, other): if isinstance(other, Sized): raise TypeError("cannot multiply Counter by Sized type %s" % type(other))

[Serhiy]

...

Wouldn't be better to return NotImplemented here?

Probably, but that's up to Raymond. What I like about the above is that the error message is very explicit about why the argument was rejected. A generic, e.g., TypeError: unsupported operand type(s) for *: 'Counter' and 'generator' isn't as informative. OTOH, ya, it's better practice to give other.__rmul__ a chance at it.

...

...
result = Counter() for k, v in self.items(): result[k] = v * other return result

...

If discard non-positive values, this will automatically make multiplying Counter by Counter (or by sequence) invalid, because they are not comparable with 0.

for k, v in self.items(): v = v * other if v > 0: result[k] = v * other

I'd start to buy that only if you promised to answer all Stackoverflow questions of the form: Hey, I tried multiplying two Counters and got TypeError: '>' not supported between instances of 'Counter' and 'int' What the hell is that supposed to mean? ;-)

Serhiy Storchaka

9:26 a.m.

16.04.18 08:07, Tim Peters пише:

...

There are methods update() and subtract() which are similar to operators "+" and "-", but don't discard non-positive values. I expect that "*" and "/" discard non-positive values for consistency with "+" and "-". And a new method should be added which does multiplication without discarding non-positive values.

Tim Peters

12:34 p.m.

[Serhiy Storchaka <storchaka@gmail.com>]

...

There are methods update() and subtract() which are similar to operators "+" and "-", but don't discard non-positive values.

Yup.

...

Counter supports a wonderfully weird mix of methods driven by use cases, not by ideology. + (binary) - (binary) | & have semantics driven by viewing a Counter as a multiset implementation. That's why they discard values <= 0. They correspond, respectively, to "the standard" multiset operations of sum (disjoint union), difference, union, and intersection. That the unary versions of '+' and '-' also discard values <= 0 is justified by saying "because they're shorthand for what the binary operator does when given an empty Counter as the left argument", but they're not standard multiset operations on their own. Nothing else in Counter is trying to cater to the multiset view, but to other use cases. And that's why "*" and "/" should do what everyone _expects_ them to do ;-) There are no analogous multiset operations to justify them caring at all what the values are. If Raymond had it to do over again, I'd suggest that only "-" discard values <= 0. The other operators deliver legit(*) multisets _given_ that their arguments are legit multisets - only "-" has to care about creating an illegitimate (for a multiset) value from legit multiset arguments. But there there's no good reason for "*" or "/" to care at all. They don't make sense for multisets. After, e.g., c /= sum(c.values()) it's sane to expect that the new sum(c.values()) is close to 1 regardless of the numeric types or signs of the original values. Indeed, normalizing values so that their sum _is_ close to 1 is a primary use case motivating the current change. Note I suggested before rearranging the docs to make clear that the multiset view is just a part of what Counter is intended to be used for, and that only a handful of specific operations are intended to support it. (*) "legit" meaning that all values are integers > 0

Serhiy Storchaka

1:24 p.m.

18.04.18 22:34, Tim Peters пише:

...

Counter supports a wonderfully weird mix of methods driven by use cases, not by ideology.

+ (binary) - (binary) | &

have semantics driven by viewing a Counter as a multiset implementation. That's why they discard values <= 0. They correspond, respectively, to "the standard" multiset operations of sum (disjoint union), difference, union, and intersection.

This explains only why binary "-" discards non-positive values and "&" discards keys that are only in one Counter. Multisets contain only positive counts.

...

Nothing else in Counter is trying to cater to the multiset view, but to other use cases. And that's why "*" and "/" should do what everyone _expects_ them to do ;-) There are no analogous multiset operations to justify them caring at all what the values are.

Isn't everyone expect that x*2 == x + x? Isn't this the definition of multiplication? And when we have a multiplication, it can be generalized to division.

...

But there there's no good reason for "*" or "/" to care at all. They don't make sense for multisets.

I disagree. "+" and "*" are defined for sequences, and these operations can be defined for multisets in terms of sequences of their elements. x + y = multiset(x.elements() + y.elements()) x * n = multiset(x.elements() * n)

...

After, e.g.,

c /= sum(c.values())

it's sane to expect that the new sum(c.values()) is close to 1 regardless of the numeric types or signs of the original values. Indeed, normalizing values so that their sum _is_ close to 1 is a primary use case motivating the current change.

If there are negative values, then their sum can be very small, and the relative error of the sum can be large. Dividing by it can results in values with large magnitude, significantly larger than 1, and large errors. What is the use case for division a Counter with negative values by the sum of its values?

Tim Peters

1:55 p.m.

[Tim]

...

...
Counter supports a wonderfully weird mix of methods driven by use cases, not by ideology.

+ (binary) - (binary) | &

have semantics driven by viewing a Counter as a multiset implementation. That's why they discard values <= 0. They correspond, respectively, to "the standard" multiset operations of sum (disjoint union), difference, union, and intersection.

[Serhiy Storchaka <storchaka@gmail.com>]

...

This explains only why binary "-" discards non-positive values and "&" discards keys that are only in one Counter. Multisets contain only positive counts.

As I said later, if Raymond had it to do over again, I'd suggest that only "-" special-case values <= 0. We have what we have now. Perhaps he had other use cases in mind too - I don't know about that.

...

...
Nothing else in Counter is trying to cater to the multiset view, but to other use cases. And that's why "*" and "/" should do what everyone _expects_ them to do ;-) There are no analogous multiset operations to justify them caring at all what the values are.

...

Isn't everyone expect that x*2 == x + x?

As shown in earlier messages, it's already the case that, e.g., "x - y" isn't always the same as "x + -y" for multisets now. It's already too late to stress about satisfying "obvious" formal identities ;-) Again, Counter isn't driven by ideology, but by use cases, and it caters to all kinds of use cases now.

...

Isn't this the definition of multiplication?

In some algebraic structures, yes. Same as, e.g., "x - y" can be "defined by" "x + -y".

...

And when we have a multiplication, it can be generalized to division.

In some algebraic structures, yes.

...

...
But there there's no good reason for "*" or "/" to care at all. They don't make sense for multisets.

...

I disagree. "+" and "*" are defined for sequences, and these operations can be defined for multisets in terms of sequences of their elements.

Ya, but you're just making that up because it suits your current argument. The mathematical definition of multisets says nothing at all about "sequences". I used "the standard" earlier as shorthand for "use Google to find a standard account"; e.g., here: http://planetmath.org/operationsonmultisets Counter implements all and only the multiset operations spelled out there (or in any number of other standard accounts).

...

...
After, e.g.,

c /= sum(c.values())

it's sane to expect that the new sum(c.values()) is close to 1 regardless of the numeric types or signs of the original values. Indeed, normalizing values so that their sum _is_ close to 1 is a primary use case motivating the current change.

...

If there are negative values, then their sum can be very small, and the relative error of the sum can be large.

So?

...

Dividing by it can results in values with large magnitude, significantly larger than 1, and large errors.

Likewise: so what? There's no reason to assume that the values aren't, e.g.. fractions.Fractions, where arithmetic is exact. If they're floats, then _of course_ all kinds of numeric surprises are possible. But unless you want to claim that float surprises go away if values <= 0 are thrown away, it's just irrelevant to the case you _were_ arguing. Do you seriously want to argue that c /= sum(c.values()) should include negative values in the sum, but then throw away keys with quotients <= 0 when the division is performed? That's pretty much incomprehensible.

...

What is the use case for division a Counter with negative values by the sum of its values?

Avoiding the incomprehensible behavior noted just above - for which I'd like to see a use case too ;-) But, seriously, no, I don't have a good use for that. It _follows_ from the obvious implementation Peter gave in the thread's first message, which is in fact obvious to just about everyone else so far. I can't count _against_ it that c /= sum(c.values()) assert sum(c.values()) == 1 would succeed if the values support exact arithmetic and the original sum isn't 0.

Tim Peters

6:48 p.m.

[Serhiy]

...

Isn't everyone expect that x*2 == x + x? Isn't this the definition of multiplication?

Note: in my first message in this thread, I made the same objection. And that, e.g., it would also be weird if x*-1 is not the same as -x. Obviously, I don't care enough about those to make up a meaning for "multiset * scalar" that preserves those in all cases. And we have to make up a meaning, because there is no generally accepted _mathematical_ meaning for what "multiset times a scalar" should do. It's not that there's controversy about what it should do, it's that nobody asks the question. It's like asking "what should the logarithm of matrix to a base that's a string mean?" This isn't Perl ;-) For a scalar that's a non-negative integer, I agree repeated addition makes _most_ sense. But, as I also noted in my first message, Peter's "obvious" implementation satisfies that! Provided that the Counter represents a legit (all values are strictly positive integers) multiset to begin with, `Counter * n` is the same as adding the Counter n times. If the Counter doesn't represent a legit multiset to begin with, who cares? It's not "a multiset operation" at all then. Or if the scalar isn't a non-negative integer. What on Earth is "multiset * math.pi" supposed to do that gives a legit multiset result? For each value v, return math.floor(v * math.pi)? That's just nuts - _any_ answer to that would be nuts, utterly arbitrary just to preserve a useless notion of "consistency". The use cases Peter have in mind appear to do with manipulating discrete probability distributions represented as Counters, which really have nothing to do with multisets. Has values will always be non-negative, but almost never integers, so almost never _sanely_ viewed as being multisets. Counter doesn't care. And I don't want to get in his way by insisting on some notion of formal consistency in new operations that have nothing to do with multisets except by accident, or by artificially forced contrivance. I've used Counters to represent multisets a fair bit, but have never had the slightest desire to use "times an integer" as a shorthand for repeated addition, and can't even dream up a meaning for what dividing a multiset by an integer (let alone a float!) could return that would make a lick of sense. "Scalar broadcast" makes instant sense in both cases, though, just by dropping the illusion that people using Counters _as multisets_ are interested in these operations at all.

Steven D'Aprano

7:32 p.m.

On Wed, Apr 18, 2018 at 11:24:13PM +0300, Serhiy Storchaka wrote:

...

Isn't everyone expect that x*2 == x + x? Isn't this the definition of multiplication?

I can't think of any counter-examples, but it wouldn't surprise me even a tiny bit to learn of some.

...

And when we have a multiplication, it can be generalized to division.

Not always. A counter-example is matrix multiplication, where multiplication is non-commutative: A*B ≠ B*A and division is not defined at all. Instead, we multiply by the inverse matrix, in the appropriate order if A*B = C then A = C*inv(B) and B = inv(A)*C And of course, when you talk about data types that aren't numeric, division makes even less sense: string multiplication means repetition, but string division has no obvious meaning. -- Steve

Tim Peters

9:10 p.m.

[Serhiy]

...

...
Isn't everyone expect that x*2 == x + x? Isn't this the definition of multiplication?

[Steven D'Aprano <steve@pearwood.info>]

...

I can't think of any counter-examples, but it wouldn't surprise me even a tiny bit to learn of some.

Sure you can: explain -3.1 * -2.7 = 8.37 in terms of repeated addition ;-) It's not even necessarily true in finite fields. For example, look at the "+" and "*" tables for the 4-element finite field GF(4) here: https://en.wikipedia.org/wiki/Finite_field#Field_with_four_elements For every element x of that field, x+x = 0, so a*(1+a) = 1 in that field can't be obtained by adding any number of a's (or any number of 1+a's).. Adding x to itself just gives x again if an odd number of x's are added together, or 0 if an even number. The sense in which it's always true in fields is technical: first, x*y is guaranteed to be equivalent to repeated addition of x only if y is "an integer"; and, more subtly, "integer" is defined as meaning the result of adding the multiplicative identity any number of times to the additive identity. In GF(4), under that definition, only 0 (the additive identity) and 1 (the multiplicative identity (1) added to the additive identity (0)) are "integers". Attempting to add 1 more times just continues alternating between 0 and 1. `a` and `1+a` can't be reached that way, so are not integers, and none of a*(1+a), a*a, (1+a)*a, or (1+a)*(1+a) is the same as adding either operand any number of times. You could nevertheless define x*n as _meaning_ x+x+...+x (n times) for cardinal numbers n, but then n is outside the field. What's the application to Counter.__mul__? Not much ;-) The current '+' and '-' map _pairs_ of Counters to a Counter. The proposed '*' instead maps a Counter and "a scalar" to a Counter. It's always nice if x*n is the same as repeated addition of x when n is a cardinal integer, but because Counter.__add__ does special stuff specific to the multiset interpretation, the proposed '*' doesn't always do the same _unless_ the Counter is a legitimate multiset. So it still works fine _in_ the multiset view, provided that you're actually working with multiset Counters. But applications seeking to mix Counters and scalars with "*" and "/" don't have multisets in mind at all, so I'd be fine with `Counter * n` not being equivalent to repeated multiset addition even when Counter is a legitimate multiset and `n` is a cardinal. It's just gravy that it _is_ equivalent in that case. Where it breaks down is that, e.g, >>> a = Counter(b=-100) >>> a Counter({'b': -100}) >>> a + a Counter() but the proposed a*2 would return Counter(a=-200). But, in that case, `a` wasn't a legit multiset to begin with.

Franklin? Lee

3:53 p.m.

Instead of extending Counter to fit fancier usecases, why not have a new class that is designed for arithmetic? I, for one, would love Numpy-style list and dict classes in the standard library. And they wouldn't be confusingly called Counter, and have strange behaviors with negative values. I only saw discussion about whether or not we want Counter to support Peter's use, but there isn't much talk of supporting it with a new class. If we can get behind a new class, there wouldn't be as much conflict about what to do with the old one. On Sun, Apr 15, 2018 at 5:05 PM, Peter Norvig <peter@norvig.com> wrote:

...

Wes Turner

April 2018

3:45 p.m.

...

Peter Norvig

4:02 p.m.

...

Raymond Hettinger

5:05 p.m.

...

It would be worthwhile to implement multiplication because, among other reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

Peter Norvig

5:44 p.m.

...

...
On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter@norvig.com> wrote:

For most types that implement __add__, `x + x` is equal to `2 * x`.

...

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-)

From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...
It would be worthwhile to implement multiplication because, among other

reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/issue25478

One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function.

If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations.

Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward.

Raymond

Wes Turner

6:34 p.m.

On Sunday, April 15, 2018, Peter Norvig <peter@norvig.com> wrote:

...

On Sun, Apr 15, 2018 at 5:06 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...
...
On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter@norvig.com> wrote:

For most types that implement __add__, `x + x` is equal to `2 * x`.

...

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-)

From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...
It would be worthwhile to implement multiplication because, among other

reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/ issue25478

One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function.

If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations.

Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward.

Raymond

Wes Turner

7:18 p.m.

...

On Sunday, April 15, 2018, Peter Norvig <peter@norvig.com> wrote:

...
If you think of a Counter as a multiset, then it should support __or__, not __add__, right?

I do think it would have been fine if Counter did not support "+" at all (and/or if Counter was limited to integer values). But given where we are now, it feels like we should preserve `c + c == 2 * c`.

As to the "doesn't really add any new capabilities" argument, that's true, but it is also true for Counter as a whole: it doesn't add much over defaultdict(int), but it is certainly convenient to have a standard way to do what it does.

I agree with your intuition that low level is better. `total` would be useful. If you have total and mul, then as you and others have pointed out, normalize is just c *= 1/c.total.

I can also see the argument for a new FrequencyTable class in the statistics module. (By the way, I refactored my https://github.com/norvig/pytudes/blob/master/ipynb/Probability.ipynb a bit, and now I no longer need a `normalize` function.)

nltk.probability.FreqDist(collections.Counter) doesn't have a __mul__ either http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist

numpy.unique(, return_counts=True).unique_counts returns an array sorted by value with a __mul__. https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html

scipy.stats.itemfreq returns an array sorted by value with a __mul__ and the items in the first column. https://docs.scipy.org/doc/scipy/reference/generated/ scipy.stats.itemfreq.html

pandas.Series.value_counts(, normalize=False) returns a Series sorted by descending frequency. https://pandas.pydata.org/pandas-docs/stable/generated/ pandas.Series.value_counts.html

...
On Sun, Apr 15, 2018 at 5:06 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...
...
On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter@norvig.com> wrote:

For most types that implement __add__, `x + x` is equal to `2 * x`.

...

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

If you view the Counter as a sparse associative array of numeric values, it does seem like an oversight. If you view the Counter as a Multiset or Bag, it doesn't make sense at all ;-)

From an implementation point of view, Counter is just a kind of dict that has a __missing__() method that returns zero. That makes it trivially easy to subclass Counter to add new functionality or just use dictionary comprehensions for bulk updates.

...
It would be worthwhile to implement multiplication because, among

other reasons, Counters are a nice representation for discrete probability distributions, for which multiplication is an even more fundamental operation than addition.

There is an open issue on this topic. See: https://bugs.python.org/issue25478

One stumbling point is that a number of commenters are fiercely opposed to non-integer uses of Counter. Also, some of the use cases (such as those found in Allen Downey's "Think Stats" and "Think Bayes" books) also need division and rescaling to a total (i.e. normalizing the total to 1.0) for a probability mass function.

If the idea were to go forward, it still isn't clear whether the correct API should be low level (__mul__ and __div__ and a "total" property) or higher level (such as a normalize() or rescale() method that produces a new Counter instance). The low level approach has the advantage that it is simple to understand and that it feels like a logical extension of the __add__ and __sub__ methods. The downside is that doesn't really add any new capabilities (being just short-cuts for a simple dict comprehension or call to c.values()). And, it starts to feature creep the Counter class further away from its core mission of counting and ventures into the realm of generic sparse arrays with numeric values. There is also a learnability/intelligibility issue in __add__ and __sub__ correspond to "elementwise" operations while __mul__ and __div__ would be "scalar broadcast" operations.

Peter, I'm really glad you chimed in. My advocacy lacked sufficient weight to move this idea forward.

Raymond

Raymond Hettinger

April 2018

8:41 p.m.

...

On Apr 15, 2018, at 7:18 PM, Wes Turner <wes.turner@gmail.com> wrote:

And I'm done sharing non-pure-python solutions for this problem, I promise

Keep them coming :-) Thanks for the research. It helps to remind ourselves that almost none of our problems are new :-) Raymond

Raymond Hettinger

8:39 p.m.

...

On Apr 15, 2018, at 5:44 PM, Peter Norvig <peter@norvig.com> wrote:

If you think of a Counter as a multiset, then it should support __or__, not __add__, right?

FWIW, Counter is explicitly documented to support the four multiset-style mathematical operations discussed in Knuth TAOCP Volume II section 4.6.3 exercise 19:

...

...
...
c = Counter(a=3, b=1) d = Counter(a=1, b=2) c + d # add two counters together: c[x] + d[x] Counter({'a': 4, 'b': 3}) c - d # saturating subtraction (keeping only positive counts) Counter({'a': 2}) c & d # intersection: min(c[x], d[x]) Counter({'a': 1, 'b': 1}) c | d # union: max(c[x], d[x]) Counter({'a': 3, 'b': 2})

The wikipedia article on Multisets lists a further operation, inclusion, that is not currently supported: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations

...

I do think it would have been fine if Counter did not support "+" at all (and/or if Counter was limited to integer values). But given where we are now, it feels like we should preserve `c + c == 2 * c`.

...

As to the "doesn't really add any new capabilities" argument, that's true, but it is also true for Counter as a whole: it doesn't add much over defaultdict(int), but it is certainly convenient to have a standard way to do what it does.

...

I agree with your intuition that low level is better. `total` would be useful. If you have total and mul, then as you and others have pointed out, normalize is just c *= 1/c.total.

Chris Angelico

8:42 p.m.

On Mon, Apr 16, 2018 at 1:39 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:

...

This needn't be a blocker. Strings can be added to strings, and strings can be multiplied by integers. If it's of practical value to multiply a Counter by a number, by all means do it. ChrisA

Peter Norvig

9:04 p.m.

On Sun, Apr 15, 2018 at 8:39 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:

...

Is it weird and disorienting to have: <str> += <str> <str> *= <scalar>

Raymond Hettinger

9:39 p.m.

...

Peter Norvig

9:55 p.m.

...

Wes Turner

April 2018

10:24 p.m.

On Monday, April 16, 2018, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:

...

Tim Peters

10:07 p.m.

[Peter Norvig]

...

It would be strange if x+x != 2*x, and if x*-1 != -x:

...

Etc. Then again, it's already the case that, e.g., x-y isn't always the same as x + -y:

...

Steve Barnes

10:22 p.m.

On 16/04/2018 06:07, Tim Peters wrote:

...

[Peter Norvig]

...
For most types that implement __add__, `x + x` is equal to `2 * x`.

That is true for all numbers, list, tuple, str, timedelta, etc. -- but not for collections.Counter. I can add two Counters, but I can't multiply one by a scalar. That seems like an oversight.

... Here's an implementation:

def __mul__(self, scalar): "Multiply each entry by a scalar." result = Counter() for key in self: result[key] = self[key] * scalar return result

def __rmul__(self, scalar): "Multiply each entry by a scalar." result = Counter() for key in self: result[key] = scalar * self[key] return result

Adding Counter * integer doesn't bother me a bit, but the definition of what that should compute isn't obvious. In particular, that implementation doesn't preserve that `x+x == 2*x` if x has any negative values:

...
...
...
x = Counter(a=-1) x Counter({'a': -1}) x+x Counter()

It would be strange if x+x != 2*x, and if x*-1 != -x:

...
...
...
y = Counter(a=1) y Counter({'a': 1}) -y Counter()

Etc.

Then again, it's already the case that, e.g., x-y isn't always the same as x + -y:

...
...
...
x = Counter(a=1) y = Counter(a=2) x - y Counter() x + -y Counter({'a': 1})

So screw obvious formal identities ;-)

I'm not clear on why "+" and "-" discard keys with values <= 0 to begin with. For "-" it's natural enough viewing "-" as being multiset difference, but for "+"? That's just made up ;-)

In any case, despite the oddities, I think your implementation would be least surprising overall (ignore the sign of the resulting values). At least for Counters that actually make sense as multisets (have no values <= 0), and for a positive integer multiplier `n > 0`, it does preserve that `x*n` = `x + x + ... + x` (with `n` instances of `x`). _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

...

-- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. --- This email has been checked for viruses by AVG. http://www.avg.com

Steven D'Aprano

2:41 a.m.

On Mon, Apr 16, 2018 at 05:22:54AM +0000, Steve Barnes wrote:

...

I wish my bank was as understanding, they keep telling me I'm overdrawn... *wink*

...

and possibly at the same time a counter class that allowed fractional counts?

I understand the idea of counting in fractions (1/3, 2/3, 1, 1+1/3, ...) but I don't understand what fractional frequencies would mean. What's your use-case for fractional frequencies? -- Steve

Tim Peters

9:21 a.m.

[Steve Barnes <gadgetsteve@live.co.uk>]

...

We already have all that: a counter's values can be anything at all:

...

It's four specific binary "multiset" _operations_ that purge values <= 0 from their results:

...

OK, also unary prefix "+" and "-". Other methods do not:

...

As above, results are a property not of the counter objects, but of the specific operations performed. some_counter / scalar would most obviously work like:the current:

...

Raymond Hettinger

10:37 p.m.

...

Any thoughts on Counter * float? A key use case for what is being proposed is: c *= 1 / c.total Raymond

Tim Peters

April 2018

10:51 p.m.

[Tim]

...

...
Adding Counter * integer doesn't bother me a bit, but the definition of what that should compute isn't obvious.

[Raymond]

...

Any thoughts on Counter * float? A key use case for what is being proposed is:

c *= 1 / c.total

Raymond Hettinger

11:58 p.m.

...

But surely __rdiv__() would be over the top, harmonic means be damned ;-) Raymond

Tim Peters

5:43 p.m.

[Tim]

...

...
I also have no problem with inplace operators. Or with adding `Counter /= scalar", for that matter.

[Raymond]

...

But surely __rdiv__() would be over the top, harmonic means be damned ;-)

Raymond Hettinger

8:33 a.m.

...

Tim Peters

9:05 a.m.

[Raymond]

...

Petr Viktorin

9:13 a.m.

On 04/18/18 17:33, Raymond Hettinger wrote:

...

Why is Iterable (__iter__) not on the list? (Apologies if I missed this somewhere in the conversation.)

...

Tim Peters

April 2018

11:45 a.m.

[Raymond]

...

[Petr Viktorin <encukou@gmail.com>]

...

Why is Iterable (__iter__) not on the list?

(Apologies if I missed this somewhere in the conversation.)

Wes Turner

12:33 p.m.

...

[Raymond]

...
...
I've started working on an implementation and several choices arise:

1) Reject scalar with a TypeError if scalar is a Counter 2) Reject scalar with a TypeError if scalar is a Mapping 3) Reject scalar with a TypeError if scalar is a Collection 4) Reject scalar with a TypeError if scalar is Sized (has a __len__ method).

[Petr Viktorin <encukou@gmail.com>]

...
Why is Iterable (__iter__) not on the list?

(Apologies if I missed this somewhere in the conversation.)

I believe Raymond implicitly meant "test one of the above", not "test all of the above", and he's leaning toward Sized alone.

What we're trying to stop is things like "Counter * Counter" for now, because the obvious implementation(*) of Counter.__mul__ would do a strange thing with that, where a quite different thing is plausibly wanted (and may - or may not - be added later - but, due to backward compatibility, cannot be added later if the initial implementation does the strange thing).

Rejecting a Sized argument for now would stop that. Piling on additional tests could stop other things "early", but every test added slows the normal case (the argument is fine).

In the case of an Iterable `arg` that's not Sized, it seems highly unlikely that arg.__mul__ or arg.__rmul__ exist, so the obvious implementation would blow up later without bothering to check in advance:

>>> x = (i for i in range(10)) >>> 3 * x Traceback (most recent call last): ... TypeError: unsupported operand type(s) for *: 'int' and 'generator'

(*) The obvious implementation:

def __mul__(self, other): if isinstance(other, Sized): raise TypeError("cannot multiply Counter by Sized type %s" % type(other)) result = Counter() for k, v in self.items(): result[k] = v * other return result _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Tim Peters

12:42 p.m.

[Wes Turner <wes.turner@gmail.com>]

...

Yes. That's exactly what I had in mind when I wrote:

...

Serhiy Storchaka

1:29 p.m.

18.04.18 21:45, Tim Peters пише:

...

Wouldn't be better to return NotImplemented here?

...

Tim Peters

2:39 p.m.

[Tim]

...

...
(*) The obvious implementation:

def __mul__(self, other): if isinstance(other, Sized): raise TypeError("cannot multiply Counter by Sized type %s" % type(other))

[Serhiy]

...

Wouldn't be better to return NotImplemented here?

...

...
result = Counter() for k, v in self.items(): result[k] = v * other return result

...

If discard non-positive values, this will automatically make multiplying Counter by Counter (or by sequence) invalid, because they are not comparable with 0.

for k, v in self.items(): v = v * other if v > 0: result[k] = v * other

Serhiy Storchaka

9:26 a.m.

16.04.18 08:07, Tim Peters пише:

...

Tim Peters

April 2018

12:34 p.m.

[Serhiy Storchaka <storchaka@gmail.com>]

...

There are methods update() and subtract() which are similar to operators "+" and "-", but don't discard non-positive values.

Yup.

...

Serhiy Storchaka

1:24 p.m.

18.04.18 22:34, Tim Peters пише:

...

Counter supports a wonderfully weird mix of methods driven by use cases, not by ideology.

+ (binary) - (binary) | &

have semantics driven by viewing a Counter as a multiset implementation. That's why they discard values <= 0. They correspond, respectively, to "the standard" multiset operations of sum (disjoint union), difference, union, and intersection.

This explains only why binary "-" discards non-positive values and "&" discards keys that are only in one Counter. Multisets contain only positive counts.

...

Nothing else in Counter is trying to cater to the multiset view, but to other use cases. And that's why "*" and "/" should do what everyone _expects_ them to do ;-) There are no analogous multiset operations to justify them caring at all what the values are.

Isn't everyone expect that x*2 == x + x? Isn't this the definition of multiplication? And when we have a multiplication, it can be generalized to division.

...

But there there's no good reason for "*" or "/" to care at all. They don't make sense for multisets.

...

After, e.g.,

c /= sum(c.values())

it's sane to expect that the new sum(c.values()) is close to 1 regardless of the numeric types or signs of the original values. Indeed, normalizing values so that their sum _is_ close to 1 is a primary use case motivating the current change.

Tim Peters

1:55 p.m.

[Tim]

...

...
Counter supports a wonderfully weird mix of methods driven by use cases, not by ideology.

+ (binary) - (binary) | &

have semantics driven by viewing a Counter as a multiset implementation. That's why they discard values <= 0. They correspond, respectively, to "the standard" multiset operations of sum (disjoint union), difference, union, and intersection.

[Serhiy Storchaka <storchaka@gmail.com>]

...

This explains only why binary "-" discards non-positive values and "&" discards keys that are only in one Counter. Multisets contain only positive counts.

...

...
Nothing else in Counter is trying to cater to the multiset view, but to other use cases. And that's why "*" and "/" should do what everyone _expects_ them to do ;-) There are no analogous multiset operations to justify them caring at all what the values are.

...

Isn't everyone expect that x*2 == x + x?

...

Isn't this the definition of multiplication?

In some algebraic structures, yes. Same as, e.g., "x - y" can be "defined by" "x + -y".

...

And when we have a multiplication, it can be generalized to division.

In some algebraic structures, yes.

...

...
But there there's no good reason for "*" or "/" to care at all. They don't make sense for multisets.

...

I disagree. "+" and "*" are defined for sequences, and these operations can be defined for multisets in terms of sequences of their elements.

...

...
After, e.g.,

c /= sum(c.values())

it's sane to expect that the new sum(c.values()) is close to 1 regardless of the numeric types or signs of the original values. Indeed, normalizing values so that their sum _is_ close to 1 is a primary use case motivating the current change.

...