[Python-ideas] collections.Counter should implement mul, rmul

Sun Apr 15 23:39:45 EDT 2018

> On Apr 15, 2018, at 5:44 PM, Peter Norvig <peter at norvig.com> wrote:
> 
> If you think of a Counter as a multiset, then it should support __or__, not __add__, right?

FWIW, Counter is explicitly documented to support the four multiset-style mathematical operations discussed in Knuth TAOCP Volume II section 4.6.3 exercise 19:

>>> c = Counter(a=3, b=1)
>>> d = Counter(a=1, b=2)
>>> c + d                       # add two counters together:  c[x] + d[x]
Counter({'a': 4, 'b': 3})
>>> c - d                       # saturating subtraction (keeping only positive counts)
Counter({'a': 2})
>>> c & d                       # intersection:  min(c[x], d[x]) 
Counter({'a': 1, 'b': 1})
>>> c | d                       # union:  max(c[x], d[x])
Counter({'a': 3, 'b': 2})

The wikipedia article on Multisets lists a further operation, inclusion, that is not currently supported:  https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations

> I do think it would have been fine if Counter did not support "+" at all (and/or if Counter was limited to integer values). But  given where we are now, it feels like we should preserve `c + c == 2 * c`. 

The + operation has legitimate use cases (it is perfectly reasonable to want to combine the results two separate counts).  And, as you pointed out, it is what we already have and cannot change :-)

So, the API design issue that confronts us is that it would be a bit weird and disorienting for the arithmetic operators to have two different signatures:

    <counter> += <counter>
    <counter> -= <counter>
    <counter> *= <scalar>
    <counter> /= <scalar>

Also, we should respect the comments given by others on the tracker issue.  In particular, there is a preference to not have an in-place operation and only allow a new counter instance to be created.  That will help people avoid data structure modality problems:
.  
    c[category] += 1       # Makes sense during the frequency counting or accumulation phase
    c /= c.total           # Covert to a probability mass function
    c[category] += 1       # This code looks correct but no longer makes any sense

> As to the "doesn't really add any new capabilities" argument, that's true, but it is also true for Counter as a whole: it doesn't add much over defaultdict(int), but it is certainly convenient to have a standard way to do what it does.

IIRC, the defaultdict(int) in your first version triggered a bug because the model inadvertently changed during the analysis phase rather than being frozen after the training phase.  The Counter doesn't suffer from the same issue (modifying the dict on a failed lookup).  Also, the Counter class does have a few value added features:  Counter(iterable), c.most_common(), c.elements(), etc.   But yes, at its heart the counter is mostly just a specialized dictionary.  The thought I was trying to express is that suggestions to build out Counter API are a little less compelling when we already have a way to do it that is flexible, fast, clear, and standard (i.e. dict comprehensions).

> I agree with your intuition that low level is better. `total` would be useful. If you have total and mul, then as you and others have pointed out, normalize is just c *= 1/c.total.

I fully support adding some functionality for scaling to support probability distributions, bayesian update steps, chi-square tests, and whatnot.  The people who need convincing are the other respondents on the tracker.  They had a strong mental model for the Counter class that is somewhat at odds with this proposal.

Raymond

[Python-ideas] collections.Counter should implement __mul__, __rmul__

[Python-ideas] collections.Counter should implement mul, rmul