[Numpy-discussion] Automatic number of bins for numpy histograms

Neil Girdhar mistersheik at gmail.com
Mon Apr 13 08:02:27 EDT 2015


Can I suggest that we instead add the P-square algorithm for the dynamic
calculation of histograms?  (
http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
)

This is already implemented in C++'s boost library (
http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
)

I implemented it in Boost Python as a module, which I'm happy to share.
This is much better than fixed-width histograms in practice.  Rather than
adjusting the number of bins, it adjusts what you really want, which is the
resolution of the bins throughout the domain.
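
As a rough illustration of resolution that adapts across the domain, the sketch below places bin edges at evenly spaced quantiles of the data. This is not the P-square algorithm: P-square estimates quantiles incrementally from a stream, whereas this sketch computes exact quantiles on an in-memory array with np.quantile. The helper name equal_frequency_edges and the exponential test sample are assumptions made for the example only.

```python
import numpy as np


def equal_frequency_edges(data, n_bins):
    """Return n_bins + 1 edges placed at evenly spaced quantiles of data.

    Hypothetical helper for illustration; exact quantiles, not P-square.
    """
    probs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(np.asarray(data, dtype=float), probs)


# Skewed sample: most of the mass sits near zero, with a long right tail.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)

# Quantile-based edges: narrow bins where data is dense, wide bins in the tail.
edges = equal_frequency_edges(sample, n_bins=10)
counts, _ = np.histogram(sample, bins=edges)

# Fixed-width bins over the same data, for comparison.
fixed_counts, fixed_edges = np.histogram(sample, bins=10)

print("quantile bin widths:", np.round(np.diff(edges), 3))
print("quantile bin counts:", counts)
print("fixed-width counts: ", fixed_counts)
```

With fixed-width bins nearly all points fall in the first few bins and the tail bins are almost empty; with quantile-based edges every bin holds roughly the same number of points and the bin width varies instead.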

Best,

Neil

On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers <ralf.gommers at gmail.com>
wrote:

>
>
> On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> On Sun, Apr 12, 2015 at 12:19 AM, Varun <nayyarv at gmail.com> wrote:
>>
>>>
>>> http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb
>>>
>>> Long story short, histogram visualisations that depend on numpy (such as
>>> matplotlib, or nearly all of them) have poor default behaviour: I have to
>>> constantly play around with the number of bins to get a good idea of what
>>> I'm looking at. The default bins=10 works OK for up to 1000 points or for
>>> very normal data, but performs poorly for anything else, and doesn't
>>> account for variability either. I don't have a method easily available to
>>> scale the number of bins given the data.
>>>
>>> R doesn't suffer from these problems and provides methods for use with its
>>> hist method. I would like to provide similar functionality for matplotlib,
>>> to at least provide some kind of good starting point, as histograms are
>>> very useful for initial data discovery.
>>>
>>> The notebook above provides an explanation of the problem as well as some
>>> proposed alternatives. Use different datasets (type and size) to see the
>>> performance of the suggestions. All of the methods proposed exist in R and
>>> in the literature.
>>>
>>> I've put together an implementation to add this new functionality, but am
>>> hesitant to make a pull request as I would like some feedback from a
>>> maintainer before doing so.
>>>
>>
>> +1 on the PR.
>>
>
> +1 as well.
>
> Unfortunately we can't change the default of 10, but a number of string
> methods, with a "bins=auto" or some such name prominently recommended in
> the docstring, would be very good to have.
>
> Ralf
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>