[Python-ideas] Additions to collections.Counter and a Counter derived class

David Mertz mertz at gnosis.cx
Wed Mar 15 14:06:20 EDT 2017


On Wed, Mar 15, 2017 at 10:39 AM, Steven D'Aprano <steve at pearwood.info>
wrote:

> > But I can imagine an occasional need to, e.g. "find outliers."  However,
> > that is not hard to spell as `mycounter.most_common()[-1*N:]`.  Or if your
> > program does this often, write a utility function `find_outliers(...)`
>
> That's not how you find outliers :-)
> Just because a data point is uncommon doesn't mean it is an outlier.
>

That's kinda *by definition* what an outlier is in categorical data!

E.g.:

In [1]: from glob import glob

In [2]: from collections import Counter

In [3]: names = Counter()

In [4]: for fname in glob('babynames/yob*.txt'):
   ...:     for line in open(fname):
   ...:         name, sex, num = line.strip().split(',')
   ...:         num = int(num)
   ...:         names[name] += num
   ...:

In [5]: names.most_common(3)
Out[5]: [('James', 5086540), ('John', 5073452), ('Robert', 4795444)]

In [6]: rare_names = names.most_common()[-3:]

In [7]: rare_names
Out[7]: [('Zyerre', 5), ('Zylas', 5), ('Zytavion', 5)]

In [8]: sum(names.values()) # nicer would be `names.total`
Out[8]: 326086290

This isn't exactly statistics, but it's like your product example.  There
are infinitely many random strings that occurred zero times among US
births.  But a "rare name" is one that occurred at least once, not one of
these zero-occurring possible strings.

I realize from my example, however, that I'm probably more interested in
the actual uncommonality, not the specific `.least_common()`.  I.e. I'd
like to know which names occurred fewer than 10 times... but I don't know
how many items that will include.  Or as a percentage, which names occur in
fewer than 0.01% of births?
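
A minimal sketch of that kind of threshold query, for illustration only.  The
helper name `rarer_than` and its parameters are hypothetical (nothing like it
exists on `collections.Counter`), and the counts below are made-up toy data,
not the real baby-name totals.  It keeps only keys that occurred at least once
and that fall below an absolute count and/or a share of the total:

```python
from collections import Counter


def rarer_than(counts, max_count=None, max_share=None):
    """Return {key: count} for keys seen at least once but below the thresholds.

    Hypothetical helper, not part of the standard library.
    max_count: keep keys whose count is strictly less than this number.
    max_share: keep keys whose fraction of the total is strictly less than this.
    """
    total = sum(counts.values())
    result = {}
    for key, n in counts.items():
        if n <= 0:
            # Zero or negative entries are not "rare", they never occurred.
            continue
        if max_count is not None and n >= max_count:
            continue
        if max_share is not None and n / total >= max_share:
            continue
        result[key] = n
    return result


# Toy counts (invented for the example); they sum to 1,000,000.
names = Counter({'James': 900000, 'John': 99795,
                 'Zyerre': 150, 'Zylas': 5, 'Zytavion': 50})

print(rarer_than(names, max_count=10))      # names seen fewer than 10 times
print(rarer_than(names, max_share=0.0001))  # names in fewer than 0.01% of births
```

Unlike slicing `most_common()`, this does not need to know in advance how many
items fall under the threshold, at the cost of one pass over the whole counter.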


> I don't think there's any good reason to want to find the "least common"
> values in a statistics context, but there might be other use-cases for
> it. For example, suppose we are interested in the *least* popular
> products being sold:
>
> Counter(order.item for order in orders)
>
>
> We can get the best selling products easily, but not the duds that don't
> sell much at all.
>
> However, the problem is that what we really need to see is the items
> that don't sell at all (count=0), and they won't show up! So I think
> that this is not actually a useful feature.
>
>
> > 2) Undefined behavior when using Counter.most_common:
> > > 'c', 'c']), when calling c.most_common(3), there are more than 3 "most
> > > common" elements in c and c.most_common(3) will not always return the
> > > same list, since there is no defined total order on the elements in c.
> > >
> > Should this be mentioned in the documentation?
> > >
> >
> > +1. I'd definitely support adding this point to the documentation.
>
> The docs already say that "Elements with equal counts are ordered
> arbitrarily" so I'm not sure what more is needed.
>
>
> --
> Steve



-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.