[Numpy-discussion] Automatic number of bins for numpy histograms

Varun nayyarv at gmail.com
Sun Apr 12 03:19:20 EDT 2015


http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (matplotlib,
and nearly all the others) have poor default behaviour: I constantly have to
play around with the number of bins to get a good idea of what I'm looking at.
The default bins=10 works OK for up to about 1000 points or for very normal
data, but performs poorly for anything else, and it doesn't account for
variability either. I don't have a method easily available to scale the number
of bins to the data.

R doesn't suffer from these problems and provides bin-selection methods for
use with its hist function. I would like to provide similar functionality for
matplotlib, to at least give some kind of good starting point, as histograms
are very useful for initial data discovery.

The notebook above provides an explanation of the problem as well as some
proposed alternatives. Try different datasets (type and size) to see how the
suggestions perform. All of the methods proposed exist in R and in the
literature.
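
For readers who don't want to open the notebook, here is a rough sketch of
what data-driven bin selection can look like. It is not the code from my
branch. It assumes three rules that R's hist documents (Sturges, Scott and
Freedman-Diaconis), and all function names are made up for illustration. It
also assumes a non-empty 1-D sample.

```python
import numpy as np


def _bins_from_width(x, h):
    """Convert a bin width h into a bin count over the range of x."""
    data_range = x.max() - x.min()
    if h <= 0 or data_range == 0:
        # Degenerate data (zero spread): fall back to a single bin.
        return 1
    return max(1, int(np.ceil(data_range / h)))


def sturges_bins(x):
    """Sturges' rule: ceil(log2(n)) + 1 bins, depends only on sample size."""
    n = np.asarray(x).size
    return int(np.ceil(np.log2(n))) + 1


def scott_bins(x):
    """Scott's rule: bin width h = 3.5 * std / n**(1/3)."""
    x = np.asarray(x, dtype=float).ravel()
    h = 3.5 * x.std() / x.size ** (1.0 / 3.0)
    return _bins_from_width(x, h)


def fd_bins(x):
    """Freedman-Diaconis rule: bin width h = 2 * IQR / n**(1/3)."""
    x = np.asarray(x, dtype=float).ravel()
    q75, q25 = np.percentile(x, [75, 25])
    h = 2.0 * (q75 - q25) / x.size ** (1.0 / 3.0)
    return _bins_from_width(x, h)
```

Sturges only looks at the sample size, while Scott and Freedman-Diaconis also
use the spread of the data, which is the "variability" that a fixed bins=10
ignores. The result of any of these can be passed straight to
np.histogram(x, bins=...).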

I've put together an implementation to add this new functionality, but am
hesitant to make a pull request as I would like some feedback from a
maintainer before doing so.

https://github.com/numpy/numpy/compare/master...nayyarv:master

I've provided them as functions for easy refactoring, as it can be argued that
this should live in its own function/file/class, or alternatively it can be
turned into simple if/elif statements. I believe this belongs in numpy, as
that is where the histogram functionality that most libraries build on lives,
and it would be useful for them not to require scipy, for example.
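
To make the "functions versus if/elif" question concrete, here is a minimal
sketch of the function-based layout, with a dict lookup standing in for the
if/elif chain. Everything in it is hypothetical: histogram_auto, the registry
and the two rule names are placeholders for discussion, not numpy API and not
necessarily what is in my branch.

```python
import numpy as np


def _sturges(x):
    # Sturges' rule: ceil(log2(n)) + 1 bins.
    return int(np.ceil(np.log2(x.size))) + 1


def _sqrt(x):
    # Square-root rule: ceil(sqrt(n)) bins.
    return int(np.ceil(np.sqrt(x.size)))


# Hypothetical registry mapping a method name to its estimator function.
_ESTIMATORS = {"sturges": _sturges, "sqrt": _sqrt}


def histogram_auto(a, bins=10):
    """Like np.histogram, but `bins` may also name an estimator (sketch only)."""
    a = np.asarray(a).ravel()
    if isinstance(bins, str):
        if bins not in _ESTIMATORS:
            raise ValueError("unknown bin estimator: %r" % (bins,))
        # Replace the method name with the bin count it suggests for this data.
        bins = _ESTIMATORS[bins](a)
    return np.histogram(a, bins=bins)


counts, edges = histogram_auto(np.arange(100), bins="sqrt")
```

Adding another rule is then one function plus one dict entry, and integer or
array values of bins still go straight through to np.histogram unchanged.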

I will update the documentation accordingly before making a pull request, and
add more tests to show its functionality. I can adapt my ipython notebook
into a quick tutorial/help file if need be.

I already attempted to add this to matplotlib before being redirected here:
https://github.com/matplotlib/matplotlib/issues/4316



