[Numpy-discussion] Automatic number of bins for numpy histograms

Antony Lee antony.lee at berkeley.edu
Tue Apr 14 17:02:05 EDT 2015


Another improvement would be to make sure, for integer-valued datasets,
that all bins cover the same number of integers; otherwise it is easy to
end up with some bins "effectively" wider than others:

hist(np.random.randint(11, size=10000))

shows a peak in the last bin, as it covers both 9 and 10.
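
A minimal sketch of the effect and of one possible fix, assuming plain
numpy.histogram is what sits underneath hist. The half-integer edges are
only an illustration of "every bin covers one integer", not a proposed
API; the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 11, size=10000)  # integers 0..10 inclusive, 11 distinct values

# Default: 10 equal-width bins over [0, 10]. Every bin is half-open except
# the last one, which is closed on the right, so it collects both 9 and 10.
default_counts, default_edges = np.histogram(data)

# Integer-aligned alternative: edges at half-integers, one bin per integer.
edges = np.arange(data.min() - 0.5, data.max() + 1.5)
aligned_counts, _ = np.histogram(data, bins=edges)

print(default_counts)
print(aligned_counts)
```

With the aligned edges each of the 11 values gets its own bin, so the
artificial peak at the right end goes away.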

Antony

2015-04-13 5:02 GMT-07:00 Neil Girdhar <mistersheik at gmail.com>:

> Can I suggest that we instead add the P-square algorithm for the dynamic
> calculation of histograms?  (
> http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
> )
>
> This is already implemented in C++'s boost library (
> http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
> )
>
> I implemented it in Boost Python as a module, which I'm happy to share.
> This is much better than fixed-width histograms in practice.  Rather than
> adjusting the number of bins, it adjusts what you really want, which is the
> resolution of the bins throughout the domain.
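
For readers unfamiliar with P-square, here is a rough pure-Python sketch of
the single-quantile variant, written from the published description of the
algorithm. It is not the Boost extended_p_square code and not the Boost
Python module mentioned above; the class name P2Quantile and its methods
are hypothetical. It keeps five markers and nudges their heights as
observations stream in, so the data never has to be stored.

```python
import numpy as np


class P2Quantile:
    """Streaming estimate of one quantile p with five markers (P-square sketch)."""

    def __init__(self, p):
        self.p = p
        self.initial = []   # buffer for the first five observations
        self.q = None       # marker heights
        self.n = None       # actual marker positions (1-based)
        self.desired = None  # desired marker positions
        self.step = np.array([0.0, p / 2, p, (1 + p) / 2, 1.0])

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def _linear(self, i, d):
        q, n = self.q, self.n
        j = i + int(d)
        return q[i] + d * (q[j] - q[i]) / (n[j] - n[i])

    def add(self, x):
        if self.q is None:
            # Initialisation: collect and sort the first five observations.
            self.initial.append(float(x))
            if len(self.initial) == 5:
                p = self.p
                self.q = np.sort(np.array(self.initial))
                self.n = np.arange(1.0, 6.0)
                self.desired = np.array([1.0, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5.0])
            return
        q, n = self.q, self.n
        # Find the cell k with q[k] <= x < q[k+1], extending the extremes if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = int(np.searchsorted(q, x, side="right")) - 1
        n[k + 1:] += 1
        self.desired += self.step
        # Move the three interior markers towards their desired positions.
        for i in (1, 2, 3):
            d = self.desired[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1.0 if d > 0 else -1.0
                candidate = self._parabolic(i, d)
                if not (q[i - 1] < candidate < q[i + 1]):
                    candidate = self._linear(i, d)
                q[i] = candidate
                n[i] += d

    def value(self):
        if self.q is None:
            if not self.initial:
                raise ValueError("no observations yet")
            return float(np.quantile(self.initial, self.p))
        return float(self.q[2])  # the middle marker tracks the p-quantile
```

Usage would look like creating P2Quantile(0.5), calling add(x) for each
incoming value, and reading value() at any time. The histogram form of the
algorithm uses more markers in the same way, which is what lets the bin
resolution follow the data.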
>
> Best,
>
> Neil
>
> On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>
>>
>>
>> On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río <
>> jaime.frio at gmail.com> wrote:
>>
>>> On Sun, Apr 12, 2015 at 12:19 AM, Varun <nayyarv at gmail.com> wrote:
>>>
>>>>
>>>> http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb
>>>>
>>>> Long story short, histogram visualisations that depend on numpy (such
>>>> as matplotlib, or nearly all of them) have poor default behaviour, as I
>>>> have to constantly play around with the number of bins to get a good
>>>> idea of what I'm looking at. The default bins=10 works OK for up to
>>>> 1000 points or very normal data, but performs poorly for anything else
>>>> and doesn't account for variability either. I don't have a method
>>>> easily available to scale the number of bins given the data.
>>>>
>>>> R doesn't suffer from these problems and provides methods for use with
>>>> its hist method. I would like to provide similar functionality for
>>>> matplotlib, to at least provide some kind of good starting point, as
>>>> histograms are very useful for initial data discovery.
>>>>
>>>> The notebook above provides an explanation of the problem as well as
>>>> some proposed alternatives. Use different datasets (type and size) to
>>>> see the performance of the suggestions. All of the methods proposed
>>>> exist in R and in the literature.
>>>>
>>>> I've put together an implementation to add this new functionality, but
>>>> am hesitant to make a pull request as I would like some feedback from
>>>> a maintainer before doing so.
>>>>
>>>
>>> +1 on the PR.
>>>
>>
>> +1 as well.
>>
>> Unfortunately we can't change the default of 10, but a number of string
>> methods, with a "bins=auto" or some such name prominently recommended in
>> the docstring, would be very good to have.
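
To make the "string methods" idea concrete, here is a small sketch of how
named rules could map to a bin count. It implements two rules that R's hist
also offers (Sturges and Freedman-Diaconis). The function names, the
choose_bins dispatcher, and the choice of "auto" as the larger of the two
counts are assumptions made for illustration, not the implementation
discussed in this thread.

```python
import numpy as np


def sturges_bins(x):
    """Sturges' rule: ceil(log2(n)) + 1 bins, depends only on sample size."""
    return int(np.ceil(np.log2(x.size))) + 1


def fd_bins(x):
    """Freedman-Diaconis rule: bin width 2 * IQR / n^(1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    width = 2.0 * (q75 - q25) / x.size ** (1.0 / 3.0)
    if width == 0:
        return 1  # degenerate data, fall back to a single bin
    return int(np.ceil((x.max() - x.min()) / width))


def choose_bins(x, method="auto"):
    """Hypothetical dispatcher from a rule name to a number of bins."""
    x = np.asarray(x, dtype=float).ravel()
    rules = {"sturges": sturges_bins, "fd": fd_bins}
    if method == "auto":
        # Assumption: take the finer of the two resolutions.
        return max(rule(x) for rule in rules.values())
    return rules[method](x)


rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
counts, edges = np.histogram(sample, bins=choose_bins(sample, "auto"))
```

Because the rules return an integer, the result can be passed straight to
the existing bins argument, which leaves the default of 10 untouched.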
>>
>> Ralf
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>