Additions to collections.Counter and a Counter-derived class

Hi all,

I have been using the Counter class recently and came across several things that I was hoping to get feedback on. (This is my first time mailing this list, so any advice is greatly appreciated.)

1) Addition of a Counter.least_common method: This would add a method to Counter that is basically the opposite of the pre-existing Counter.most_common method. In this case, the least common elements are the elements in c with the lowest (non-zero) frequency. This was addressed in https://bugs.python.org/issue16994, but it was never resolved and is still open (since Jan. 2013). This is a small change, but I think it is useful to include in the stdlib. I have written a patch for this, but have not submitted a PR yet. It can be found at https://github.com/mcognetta/cpython/tree/collections_counter_least_common

2) Undefined behavior when using Counter.most_common: Consider the case c = Counter([1, 1, 2, 2, 3, 3, 'a', 'a', 'b', 'b', 'c', 'c']). When calling c.most_common(3), there are more than 3 "most common" elements in c, and c.most_common(3) will not always return the same list, since there is no defined total order on the elements in c. Should this be mentioned in the documentation? Additionally, perhaps there is room for a method that produces all of the elements with the n highest frequencies, in order of their frequencies. For example, in the case of c = Counter([1, 1, 1, 2, 2, 3, 3, 4, 4, 5]), c.aforementioned_method(2) would return [(1, 3), (2, 2), (3, 2), (4, 2)], since the two highest frequencies are 3 and 2.

3) Addition of a collections.Frequency or collections.Proportion class derived from collections.Counter: This is sort of discussed in https://bugs.python.org/issue25478. The idea behind this would be a dictionary that, instead of returning the integer frequency of an element, would return its proportional representation in the iterable. So, for example, f = Frequency('aabbcc') would hold Frequency({'a': 0.3333333333333333, 'b': 0.3333333333333333, 'c': 0.3333333333333333}). To address

The pitfall I imagine here is that if you continue adding elements after normalize() is called, the results will be nonsensical.

from the issue, this would not be a problem because we could just build it entirely on top of a Counter, keep a count of the total number of elements in the Counter, and divide by that every time we output or return the object or any of its elements. I think that this would be a pretty useful addition, especially for code related to discrete probability distributions (which is what motivated this in the first place).

Thanks in advance,

-Marco
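[Editorial note: for concreteness, here is a minimal sketch of what a least_common helper could look like, built only on the public Counter API. This is an illustration, not the contents of the linked patch branch.]

    from collections import Counter
    import heapq
    from operator import itemgetter

    def least_common(counter, n=None):
        # Mirrors most_common: all items sorted by ascending count if n is None,
        # otherwise the n items with the smallest (non-zero) counts.
        if n is None:
            return sorted(counter.items(), key=itemgetter(1))
        return heapq.nsmallest(n, counter.items(), key=itemgetter(1))

    c = Counter('abracadabra')
    print(least_common(c, 2))   # e.g. [('c', 1), ('d', 1)]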

On Tue, Mar 14, 2017 at 2:38 AM, Marco Cognetta <cognetta.marco@gmail.com> wrote:
-1 on adding this. I read the issue, and do not find a convincing use case that is common enough to merit a new method. As some people noted in the issue, the "least common" is really the infinitely many keys not in the collection at all. But I can imagine an occasional need to, e.g., "find outliers." However, that is not hard to spell as `mycounter.most_common()[-1*N:]`. Or if your program does this often, write a utility function `find_outliers(...)`.

2) Undefined behavior when using Counter.most_common:
Should this be mentioned in the documentation?
+1. I'd definitely support adding this point to the documentation.
-0 on this. I can see wanting it, but I'm not sure it comes up often enough to add to the promise of the class. The utility function to do this would be somewhat less trivial to write than `find_outliers(...)`, but not enormously hard. I think I'd be +0 on adding a recipe to the documentation for a utility function.
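[Editorial note: one possible shape for such a documentation recipe, sketched purely as an illustration; most_common_groups is a made-up name, not a proposed API.]

    from collections import Counter

    def most_common_groups(counter, n):
        # Return (element, count) pairs for every element whose count is
        # among the n largest distinct counts, highest counts first.
        top_counts = sorted(set(counter.values()), reverse=True)[:n]
        cutoff = top_counts[-1] if top_counts else 0
        return [pair for pair in counter.most_common() if pair[1] >= cutoff]

    c = Counter([1, 1, 1, 2, 2, 3, 3, 4, 4, 5])
    print(most_common_groups(c, 2))   # [(1, 3), (2, 2), (3, 2), (4, 2)]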
One could write a subclass easily enough. The essential feature in my mind would be to keep an attribute Counter.total around to perform the normalization. I'm +1 on adding that to collections.Counter itself.

I'm not sure if this would be better as an attribute kept directly or as a property that called `sum(self.values())` when accessed. I believe that having `mycounter.total` would provide the right normalization in a clean API, and also expose easy access to other questions one would naturally ask (e.g. "How many observations were made?").

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Thanks for the reply. I will add it to the documentation and will work on a use case and recipe for the n highest frequency problem. As for
I'm not sure if this would be better as an attribute kept directly or as a property that called `sum(self.values())` when accessed. I believe that having `mycounter.total` would provide the right normalization in a clean API, and also expose easy access to other questions one would naturally ask (e.g. "How many observations were made?")
I am not quite sure what you mean, especially the observations part. For the attribute part, do you mean we would just have a hidden attribute like num_values that is incremented or decremented whenever something is added or removed, so that we have O(1) size queries instead of O(n) (where n is the number of keys)? Then there could be a method like 'normalize' that printed all elements with their frequencies divided by the total count?

On Wed, Mar 15, 2017 at 12:52 AM, David Mertz <mertz@gnosis.cx> wrote:

I added a couple of comments at https://bugs.python.org/issue25478 about what I mean. Raymond replied as well, so it feels like we should use that thread there.

In a scientific context I often think of a Counter as a way to count observations of a categorical variable: "I saw 3 As, then 7 Bs, etc." My proposed attribute/property would let you ask `num_observations = mycounter.total`. As I comment on the issue, we could ask for frequency by spelling it like:

    freqs = {k: v/c.total for k, v in c.items()}

And yes, the question in my mind is whether we should define:

    @property
    def total(self):
        return sum(self.values())

or keep a running count in all the mutator methods. Probably the property, actually, since we don't want to slow down all the existing code.

I don't really want .normalize() as a method, but if we did, the spelling should be `.normalized()` instead. This parallels reverse/reversed and sort/sorted, which we already have to reflect mutators versus copies. But notice that the copying versions are almost always functions rather than methods. The thing is, once something is normalized, it really IS NOT a Counter anymore. My dict comprehension emphasizes that by creating a plain dictionary. Even if .normalized() existed, I don't believe it should return a Counter object (instead either a plain dict, or some new specialized subclass like Frequencies).

On Wed, Mar 15, 2017 at 2:14 AM, Marco Cognetta <cognetta.marco@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
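[Editorial note: to make the two spellings concrete, here is a rough, illustrative sketch of both; FreqCounter, RunningTotalCounter, and normalized are hypothetical names, not agreed API.]

    from collections import Counter

    class FreqCounter(Counter):
        @property
        def total(self):
            # Recomputed on access; existing mutators stay exactly as fast as before.
            return sum(self.values())

    def normalized(counter):
        # Deliberately returns a plain dict, not a Counter.
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    class RunningTotalCounter(Counter):
        # The alternative Marco asked about: keep a running count so that asking
        # for the total is O(1). A complete version would also need to cover
        # pop(), popitem(), clear(), and the copying arithmetic operators.
        def __init__(self, *args, **kwargs):
            self.total = 0
            super().__init__(*args, **kwargs)

        def __setitem__(self, key, value):
            self.total += value - self.get(key, 0)
            super().__setitem__(key, value)

        def __delitem__(self, key):
            self.total -= self.get(key, 0)
            super().__delitem__(key)

    c = FreqCounter('aabbcc')
    print(c.total)           # 6
    print(normalized(c))     # {'a': 0.333..., 'b': 0.333..., 'c': 0.333...}

    rc = RunningTotalCounter()
    for ch in 'aabbcc':
        rc[ch] += 1          # goes through the overridden __setitem__
    print(rc.total)          # 6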

On Tue, Mar 14, 2017 at 08:52:52AM -0700, David Mertz wrote:
That's not how you find outliers :-) Just because a data point is uncommon doesn't mean it is an outlier. I don't think there's any good reason to want to find the "least common" values in a statistics context, but there might be other use-cases for it. For example, suppose we are interested in the *least* popular products being sold:

    Counter(order.item for order in orders)

We can get the best-selling products easily, but not the duds that don't sell much at all. However, the problem is that what we really need to see is the items that don't sell at all (count=0), and they won't show up! So I think that this is not actually a useful feature.
The docs already say that "Elements with equal counts are ordered arbitrarily" so I'm not sure what more is needed. -- Steve
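[Editorial note: to make the count=0 point concrete, a tiny illustration; the order and catalog data are made up for the example.]

    from collections import Counter

    orders = ['apple', 'apple', 'banana', 'apple', 'cherry']
    catalog = ['apple', 'banana', 'cherry', 'durian']       # 'durian' never sold

    sales = Counter(orders)
    print(sales.most_common()[-2:])                         # e.g. [('banana', 1), ('cherry', 1)]
    print([item for item in catalog if sales[item] == 0])   # the real duds: ['durian']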

On Wed, Mar 15, 2017 at 10:39 AM, Steven D'Aprano <steve@pearwood.info> wrote:
That's kinda *by definition* what an outlier is in categorical data! E.g.:

In [1]: from glob import glob
In [2]: from collections import Counter
In [3]: names = Counter()
In [4]: for fname in glob('babynames/yob*.txt'):
   ...:     for line in open(fname):
   ...:         name, sex, num = line.strip().split(',')
   ...:         num = int(num)
   ...:         names[name] += num
   ...:
In [5]: names.most_common(3)
Out[5]: [('James', 5086540), ('John', 5073452), ('Robert', 4795444)]
In [6]: rare_names = names.most_common()[-3:]
In [7]: rare_names
Out[7]: [('Zyerre', 5), ('Zylas', 5), ('Zytavion', 5)]
In [8]: sum(names.values())  # nicer would be `names.total`
Out[8]: 326086290

This isn't exactly statistics, but it's like your product example. There are infinitely many random strings that occurred zero times among US births. But a "rare name" is one that occurred at least once, not one of these zero-occurring possible strings.

I realize from my example, however, that I'm probably more interested in the actual uncommonality, not the specific `.least_common()`. I.e., I'd like to know which names occurred fewer than 10 times... but I don't know how many items that will include. Or as a percentage, which names occur in fewer than 0.01% of births?

I don't think there's any good reason to want to find the "least common"
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
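[Editorial note: the threshold-style queries described above are straightforward comprehensions; a quick sketch, using a tiny stand-in counter rather than the real babynames data.]

    from collections import Counter

    # stand-in for the babynames counter built in the session above
    names = Counter({'James': 5086540, 'John': 5073452, 'Zyerre': 5, 'Zylas': 5})

    total = sum(names.values())                 # `names.total` in the proposed API
    fewer_than_10 = [(n, c) for n, c in names.items() if c < 10]
    under_001_pct = [(n, c) for n, c in names.items() if c / total < 0.0001]
    print(fewer_than_10)      # [('Zyerre', 5), ('Zylas', 5)]
    print(under_001_pct)      # [('Zyerre', 5), ('Zylas', 5)]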

On 2017-03-15 11:06, David Mertz wrote:
Just because a data point is uncommon doesn't mean it is an outlier.
That's kinda *by definition* what an outlier is in categorical data!
Not really. Or rather, it depends what you mean by "uncommon". But this thread is about adding "least_common", and just because a data point is among the least frequent doesn't mean it's an outlier. You explained why yourself:
I realize from my example, however, that I'm probably more interested in the actual uncommonality, not the specific `.least_common()`.
Exactly. If you have one data point that occurs once, another that occurs twice, another that occurs three times, and so on up to 10, then the "least common" one (or two or three) isn't an outlier. To be an outlier, it would have to be "much less common than the rest". That is, what matters is not the frequency rank but the magnitude of the separation in frequency between the outliers and the nonoutliers. But that's a much subtler notion than just "least common". -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
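[Editorial note: one way to make "separation in frequency" concrete is to look for the largest gap in the sorted counts; a rough sketch, purely illustrative and not a statistically rigorous outlier test.]

    from collections import Counter

    def low_frequency_outliers(counter):
        # Split at the largest drop between consecutive sorted counts and
        # return the items below it, if they form a small minority.
        pairs = counter.most_common()            # highest counts first
        counts = [c for _, c in pairs]
        if len(counts) < 3:
            return []
        gaps = [counts[i] - counts[i + 1] for i in range(len(counts) - 1)]
        cut = max(range(len(gaps)), key=gaps.__getitem__)
        below = pairs[cut + 1:]
        return below if len(below) < len(pairs) / 2 else []

    c = Counter({'a': 50, 'b': 48, 'c': 47, 'd': 3})
    print(low_frequency_outliers(c))   # [('d', 3)]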

On Wed, Mar 15, 2017 at 11:14 AM, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
OK. Fair enough. Although exactly what separation in frequencies makes an outlier is pretty fuzzy, especially in large samples where there are likely to be few or no gaps at all per se. It does tend to convince me that what we need is a more specialized class in the `statistics` module.

In my large babynames dataset (a common one available from the US Census), the distribution on a linear scale is basically a vertical line followed by a horizontal line. It's much starker than a Zipf distribution. On a semilog scale, it has some choppiness in the tail, where you might define some values as "outliers", but it's not obvious what that cutoff would be.

    import matplotlib.pyplot as plt

    x = range(len(names))
    y = [t[1] for t in names.most_common()]
    plt.plot(x, y)
    plt.title("Baby name frequencies USA 1880-2011")
    plt.semilogy()

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Wed, Mar 15, 2017 at 11:06:20AM -0700, David Mertz wrote:
I'm not sure that "outlier" is defined for non-numeric data, or at least not formally defined. You'd need a definition of central location (which would be the mode) and a definition of spread, and I'm not sure how you would measure spread for categorical data. What's the spread of this data? ["Jack", "Jack", "Jill", "Jack"] The mode is clearly "Jack", but beyond that I'm not sure what can be said except to give the frequencies themselves. One commonly used definition of outlier (due to John Tukey) is: - divide your data into four equal quarters; - the points between each quarter are known as quartiles, and there are three of them: Q1, Q2 (the median), Q3; - define the Interquartile Range IQR = Q3 - Q2; - define lower and upper fences as Q1 - 1.5*IQR and Q3 + 1.5*IQR; - anything not between the lower and upper fences is an outlier. Or to be precise, a *suspected* outlier, since for very long tailed distributions, rare values are to be expected and should not be discarded without good reason. If your data is Gaussian, that corresponds to discarding roughly 1% of the most extreme values.
Indeed. While the frequencies themselves are useful, the least_common(count) (by analogy with Counter.most_common) is not so useful. -- Steve
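[Editorial note: for reference, a small sketch applying Tukey's rule to a Counter's frequency values; illustrative only, quartile conventions vary, and statistics.quantiles requires Python 3.8+.]

    import statistics
    from collections import Counter

    c = Counter({'a': 60, 'b': 55, 'c': 52, 'd': 50, 'e': 48, 'f': 2})
    counts = sorted(c.values())

    q1, q2, q3 = statistics.quantiles(counts, n=4)     # quartiles
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    suspected = {k: v for k, v in c.items() if not lo_fence <= v <= hi_fence}
    print(suspected)   # {'f': 2} with these made-up counts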
