[Numpy-discussion] Automatic number of bins for numpy histograms

Neil Girdhar mistersheik at gmail.com
Tue Apr 14 17:28:57 EDT 2015


Yes, you're right.  Although in practice, people almost always want
adaptive bins.
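
To make the distinction concrete, here is a minimal sketch contrasting
equally spaced bins with adaptive (equal-frequency) bins. It is not the
P-square algorithm, which estimates quantiles from a stream; it simply calls
np.quantile on the full array. The helper name equal_frequency_edges and the
lognormal test data are assumptions made for illustration only.

```python
import numpy as np


def equal_frequency_edges(data, n_bins):
    """Hypothetical helper: edges that put about the same number of
    samples into each bin (narrow bins where the data are dense)."""
    data = np.asarray(data, dtype=float)
    # Evenly spaced quantile levels 0, 1/n, ..., 1 give n_bins + 1 edges.
    levels = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(data, levels)


# Skewed sample data (an assumption, chosen to make the contrast visible).
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Equally spaced bins: most samples pile up in the first few bins.
fixed_counts, fixed_edges = np.histogram(samples, bins=10)

# Adaptive bins: edges follow the data, counts are roughly uniform.
adaptive_edges = equal_frequency_edges(samples, 10)
adaptive_counts, _ = np.histogram(samples, bins=adaptive_edges)

print("fixed-width counts:    ", fixed_counts)
print("equal-frequency counts:", adaptive_counts)
```

Downstream code that assumes a constant bin width would need the edges
array as well as the counts, which is why the two approaches are better
offered side by side.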

On Tue, Apr 14, 2015 at 5:08 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> On Mon, Apr 13, 2015 at 5:02 AM, Neil Girdhar <mistersheik at gmail.com>
> wrote:
>
>> Can I suggest that we instead add the P-square algorithm for the dynamic
>> calculation of histograms?  (
>> http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf
>> )
>>
>
> This looks like a great thing to have in numpy. However, I suspect that a
> lot of the downstream code that uses histogram expects equally-spaced bins.
>
> So this should probably be an "in addition to", rather than an "instead of".
>
> -CHB
>
>
>
>>
>> This is already implemented in C++'s boost library (
>> http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp
>> )
>>
>> I implemented it in Boost Python as a module, which I'm happy to share.
>> This is much better than fixed-width histograms in practice.  Rather than
>> adjusting the number of bins, it adjusts what you really want, which is the
>> resolution of the bins throughout the domain.
>>
>> Best,
>>
>> Neil
>>
>> On Sun, Apr 12, 2015 at 4:02 AM, Ralf Gommers <ralf.gommers at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río <
>>> jaime.frio at gmail.com> wrote:
>>>
>>>> On Sun, Apr 12, 2015 at 12:19 AM, Varun <nayyarv at gmail.com> wrote:
>>>>
>>>>>
>>>>> http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb
>>>>>
>>>>> Long story short, histogram visualisations that depend on numpy (such
>>>>> as matplotlib, or nearly all of them) have poor default behaviour: I
>>>>> have to constantly play around with the number of bins to get a good
>>>>> idea of what I'm looking at. The default bins=10 works OK for up to
>>>>> 1000 points or very normal data, but performs poorly for anything
>>>>> else, and doesn't account for variability either. I don't have a
>>>>> method easily available to scale the number of bins given the data.
>>>>>
>>>>> R doesn't suffer from these problems and provides methods for use with
>>>>> its hist method. I would like to provide similar functionality for
>>>>> matplotlib, to at least provide some kind of good starting point, as
>>>>> histograms are very useful for initial data discovery.
>>>>>
>>>>> The notebook above provides an explanation of the problem as well as
>>>>> some proposed alternatives. Use different datasets (type and size) to
>>>>> see the performance of the suggestions. All of the methods proposed
>>>>> exist in R and in the literature.
>>>>>
>>>>> I've put together an implementation to add this new functionality, but
>>>>> am hesitant to make a pull request as I would like some feedback from
>>>>> a maintainer before doing so.
>>>>>
>>>>
>>>> +1 on the PR.
>>>>
>>>
>>> +1 as well.
>>>
>>> Unfortunately we can't change the default of 10, but a number of string
>>> methods, with a "bins=auto" or some such name prominently recommended in
>>> the docstring, would be very good to have.
>>>
>>> Ralf
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>
>>
>>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>
>
>
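
The string-selected bin estimators discussed in the quoted messages could
look roughly like the sketch below. The function name estimate_bin_count and
its method strings are hypothetical and are not a numpy API; the formulas are
the textbook Sturges, Scott and Freedman-Diaconis rules, and the sketch
assumes equal-width bins over the full data range.

```python
import numpy as np


def estimate_bin_count(data, method="fd"):
    """Hypothetical helper: pick a number of equal-width bins from the data."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if n == 0:
        raise ValueError("need at least one sample")
    data_range = data.max() - data.min()

    if method == "sturges":
        # Sturges: depends only on the sample size.
        return int(np.ceil(np.log2(n))) + 1
    if method == "scott":
        # Scott: bin width proportional to the standard deviation.
        width = 3.5 * data.std() / n ** (1.0 / 3.0)
    elif method == "fd":
        # Freedman-Diaconis: bin width proportional to the IQR,
        # which makes it less sensitive to outliers.
        q75, q25 = np.percentile(data, [75, 25])
        width = 2.0 * (q75 - q25) / n ** (1.0 / 3.0)
    else:
        raise ValueError(f"unknown method: {method!r}")

    # Degenerate data (all values equal, or zero spread): use a single bin.
    if width <= 0 or data_range == 0:
        return 1
    return int(np.ceil(data_range / width))


rng = np.random.default_rng(0)
small = rng.normal(size=1_000)
large = rng.normal(size=100_000)

for name in ("sturges", "scott", "fd"):
    print(name, estimate_bin_count(small, name), estimate_bin_count(large, name))

# The estimate plugs straight into the existing integer `bins` argument.
counts, edges = np.histogram(large, bins=estimate_bin_count(large, "fd"))
```

Because the result is a plain integer, the default of 10 stays untouched and
callers opt in explicitly, in line with the "in addition to" approach.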