[Numpy-discussion] Automatic number of bins for numpy histograms

Neil Girdhar mistersheik at gmail.com
Tue Apr 14 22:05:18 EDT 2015


By the way, the P^2 algorithm still needs to know how many bins you want.
It just adapts the endpoints of the bins.  I like adaptive=True.  However,
you will have to find a way to return both the bin counts and their
calculated endpoints.
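
To make the interface question concrete, here is a rough sketch of "fixed
number of bins, adapted endpoints". It is not P^2: it computes exact
equal-frequency edges with numpy.quantile over the whole array, purely to
show what would be returned. The function name equal_frequency_histogram
is hypothetical. np.histogram already returns a (counts, edges) pair, so
an adaptive variant could plausibly keep that shape.

```python
import numpy as np


def equal_frequency_histogram(data, n_bins):
    """Hypothetical helper: n_bins bins whose edges follow the data.

    Stand-in for an adaptive histogram. The edges are exact quantiles
    of the full array, not a streaming P^2 estimate.
    """
    data = np.asarray(data, dtype=float)
    # n_bins + 1 edges placed at evenly spaced quantile levels.
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
    # np.histogram accepts explicit edges and returns (counts, edges).
    counts, edges = np.histogram(data, bins=edges)
    return counts, edges


rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
counts, edges = equal_frequency_histogram(samples, 10)
```

The bins come out narrow where the data is dense and wide in the tails,
which is the "resolution throughout the domain" behaviour discussed
in this thread.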

The P^2 algorithm can also give approximate answers for numpy.percentile
and numpy.median.  How approximate they are depends on the number of bins
you let it keep track of.  I believe the authors bound the error as a
function of the number of points and bins.
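
For reference, a minimal pure-Python sketch of the single-quantile form of
P^2, written from my reading of the published description: five markers,
moved with a parabolic formula and a linear fallback. The class and method
names are hypothetical; this is not the Boost implementation and not a
proposed numpy API. It only shows that a quantile can be tracked in
constant memory.

```python
import numpy as np


class P2Quantile:
    """Streaming estimate of one quantile p using five markers (sketch)."""

    def __init__(self, p):
        self.p = p
        self.initial = []   # first five observations, kept verbatim
        self.q = None       # marker heights
        self.n = None       # actual marker positions (0-based)
        self.desired = None  # desired marker positions
        self.step = [0.0, p / 2, p, (1 + p) / 2, 1.0]

    def _parabolic(self, i, s):
        q, n = self.q, self.n
        return q[i] + s / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def _linear(self, i, s):
        q, n = self.q, self.n
        return q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i])

    def add(self, x):
        x = float(x)
        if self.q is None:
            self.initial.append(x)
            if len(self.initial) == 5:
                p = self.p
                self.q = sorted(self.initial)
                self.n = [0, 1, 2, 3, 4]
                self.desired = [0.0, 2 * p, 4 * p, 2 + 2 * p, 4.0]
            return
        q, n = self.q, self.n
        # Find the cell k with q[k] <= x < q[k + 1], extending the extremes.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = 0
            while x >= q[k + 1]:
                k += 1
        # Markers above the cell shift right by one observation.
        for i in range(k + 1, 5):
            n[i] += 1
        for i in range(5):
            self.desired[i] += self.step[i]
        # Nudge the three interior markers toward their desired positions.
        for i in (1, 2, 3):
            d = self.desired[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                s = 1 if d > 0 else -1
                candidate = self._parabolic(i, s)
                if not q[i - 1] < candidate < q[i + 1]:
                    candidate = self._linear(i, s)
                q[i] = candidate
                n[i] += s

    def value(self):
        if self.q is None:
            # Fewer than five points seen: fall back to the exact answer.
            return float(np.quantile(self.initial, self.p))
        return self.q[2]


rng = np.random.default_rng(0)
data = rng.normal(size=20000)
estimator = P2Quantile(0.5)
for x in data:
    estimator.add(x)
approx_median = estimator.value()
exact_median = float(np.median(data))
```

The histogram variant keeps more markers (one per bin edge) and updates
them in the same way, which is why the bin count still has to be chosen
up front.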

On Tue, Apr 14, 2015 at 10:00 PM, Paul Hobson <pmhobson at gmail.com> wrote:

>
>
> On Tue, Apr 14, 2015 at 4:24 PM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> On Tue, Apr 14, 2015 at 4:12 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar <mistersheik at gmail.com>
>>> wrote:
>>> > Can I suggest that we instead add the P-square algorithm for the
>>> > dynamic calculation of histograms?
>>> > (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%20quantiles%20calcultation%20-%20P2%20Algorythm.pdf)
>>> >
>>> > This is already implemented in C++'s boost library
>>> > (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_p_square.hpp)
>>> >
>>> > I implemented it in Boost Python as a module, which I'm happy to share.
>>> > This is much better than fixed-width histograms in practice.  Rather
>>> > than adjusting the number of bins, it adjusts what you really want,
>>> > which is the resolution of the bins throughout the domain.
>>>
>>> This definitely sounds like a useful thing to have in numpy or scipy
>>> (though if it's possible to do without using Boost/C++ that would be
>>> nice). But yeah, we should leave the existing histogram alone (in this
>>> regard) and add a new name for this like "adaptive_histogram" or
>>> something. Then you can set about convincing matplotlib and friends to
>>> use it by default :-)
>>>
>>
>> Would having a negative number of bins mean "this many, but with
>> optimized boundaries" be too clever an interface?
>>
>
> As a user, I think so. Wouldn't np.histogram(..., adaptive=True) do well
> enough?
> -p