[Numpy-discussion] Automatic number of bins for numpy histograms
Neil Girdhar
mistersheik at gmail.com
Mon Apr 13 08:02:27 EDT 2015
Can I suggest that we instead add the P-square algorithm for the dynamic
calculation of histograms? (
http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
)
This is already implemented in C++'s boost library (
http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
)
I implemented it in Boost Python as a module, which I'm happy to share.
This is much better than fixed-width histograms in practice. Rather than
adjusting the number of bins, it adjusts what you really want, which is the
resolution of the bins throughout the domain.
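
To illustrate what "resolution of the bins throughout the domain" means, here is a minimal batch sketch. It is not the P-square algorithm, which updates its markers as each observation arrives without storing the data. It only places bin edges at evenly spaced quantiles with np.percentile, so bins come out narrow where the data is dense and wide where it is sparse. The function name equal_frequency_edges, the sample data and the choice of 20 bins are made up for illustration.

```python
import numpy as np


def equal_frequency_edges(data, n_bins):
    """Return n_bins + 1 edges placed at evenly spaced quantiles of data.

    Hypothetical helper for illustration only; a batch stand-in for the
    adaptive behaviour described above, not the P-square algorithm.
    """
    data = np.asarray(data, dtype=float)
    # Percentile levels 0, 100/n_bins, ..., 100 give n_bins + 1 edges.
    levels = np.linspace(0.0, 100.0, n_bins + 1)
    return np.percentile(data, levels)


rng = np.random.default_rng(0)
sample = rng.standard_normal(10_000)

edges = equal_frequency_edges(sample, 20)
# np.histogram accepts an explicit, monotonically increasing edge array.
counts, _ = np.histogram(sample, bins=edges)
widths = np.diff(edges)
```

With normally distributed input, counts is roughly constant across bins while widths shrinks toward the centre of the distribution, which is the opposite trade-off from fixed-width bins.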
Best,
Neil
On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers <ralf.gommers at gmail.com>
wrote:
>
>
> On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> On Sun, Apr 12, 2015 at 12:19 AM, Varun <nayyarv at gmail.com> wrote:
>>
>>>
>>> http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb
>>>
>>> Long story short, histogram visualisations that depend on numpy (such as
>>> matplotlib, or nearly all of them) have poor default behaviour: I
>>> constantly have to play around with the number of bins to get a good idea
>>> of what I'm looking at. The default bins=10 works OK for up to 1000 points
>>> or very normal data, but performs poorly for anything else, and doesn't
>>> account for variability either. I don't have an easily available method to
>>> scale the number of bins to the data.
>>>
>>> R doesn't suffer from these problems and provides methods for use with
>>> its hist method. I would like to provide similar functionality for
>>> matplotlib, to at least provide some kind of good starting point, as
>>> histograms are very useful for initial data discovery.
>>>
>>> The notebook above provides an explanation of the problem as well as some
>>> proposed alternatives. Use different datasets (type and size) to see the
>>> performance of the suggestions. All of the methods proposed exist in R
>>> and in the literature.
>>>
>>> I've put together an implementation to add this new functionality, but am
>>> hesitant to make a pull request as I would like some feedback from a
>>> maintainer before doing so.
>>>
>>
>> +1 on the PR.
>>
>
> +1 as well.
>
> Unfortunately we can't change the default of 10, but a number of string
> methods, with a "bins=auto" or some such name prominently recommended in
> the docstring, would be very good to have.
>
> Ralf
>
>
>
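
As a concrete example of scaling the number of bins to the data, as asked for in the quoted message, here is a minimal sketch of the Freedman-Diaconis rule, one of the rules R's hist accepts. The quoted message does not say which rules the notebook covers, so picking this one is an assumption, and the function name freedman_diaconis_bins is hypothetical.

```python
import numpy as np


def freedman_diaconis_bins(data):
    """Suggest a bin count from the Freedman-Diaconis width 2 * IQR / n**(1/3).

    Hypothetical helper for illustration only; falls back to a single bin
    when the rule is undefined (fewer than 2 points or zero IQR).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    if n < 2:
        return 1
    q75, q25 = np.percentile(data, [75, 25])
    iqr = q75 - q25
    if iqr == 0:
        return 1
    width = 2.0 * iqr / n ** (1.0 / 3.0)
    # Number of bins of that width needed to cover the data range.
    return int(np.ceil((data.max() - data.min()) / width))


rng = np.random.default_rng(0)
small = rng.standard_normal(100)
large = rng.standard_normal(100_000)

bins_small = freedman_diaconis_bins(small)
bins_large = freedman_diaconis_bins(large)
```

The suggested count grows with the sample size and shrinks as the spread (IQR) grows relative to the range, which is the data-dependent behaviour a fixed bins=10 lacks.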