[Numpy-discussion] Automatic number of bins for numpy histograms

Jaime Fernández del Río jaime.frio at gmail.com
Wed Apr 15 12:40:58 EDT 2015


On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar <mistersheik at gmail.com> wrote:

> You got it.  I remember this from when I worked at Google and we would
> process (many many) logs.  With enough bins, the approximation is still
> really close.  It's great if you want to make an automatic plot of data.
> Calling numpy.partition a hundred times is probably slower than calling P^2
> with n=100 bins.  I don't think it does O(n) computations per point.  I
> think it's more like O(log(n)).
>

Looking at it again, it probably is O(n) after all: it does a binary
search, which is O(log n), but it then goes on to update all the n bin
counters and estimations, so O(n) I'm afraid. So there is no algorithmic
advantage over partition/percentile: if there are m samples and n bins, P-2
that O(n) m times, while partition does O(m) n times, so both end up being
O(m n). It seems to me that the big thing of P^2 is not having to hold the
full dataset in memory. Online statistics (is that the name for this?),
even if only estimations, is a cool thing, but I am not sure numpy is the
place for them. That's not to say that we couldn't eventually have P^2
implemented for histogram, but I would start off with a partition based one.

Would SciPy have a place for online statistics? Perhaps there's room for
yet another scikit?

Jaime

-- 
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20150415/d7c6779f/attachment.html>


More information about the NumPy-Discussion mailing list