[Numpy-discussion] Automatic number of bins for numpy histograms
Jaime Fernández del Río
jaime.frio at gmail.com
Wed Apr 15 12:40:58 EDT 2015
On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar <mistersheik at gmail.com> wrote:
> You got it. I remember this from when I worked at Google and we would
> process (many many) logs. With enough bins, the approximation is still
> really close. It's great if you want to make an automatic plot of data.
> Calling numpy.partition a hundred times is probably slower than calling P^2
> with n=100 bins. I don't think it does O(n) computations per point. I
> think it's more like O(log(n)).
>
Looking at it again, it probably is O(n) per point after all: it does a
binary search, which is O(log n), but it then goes on to update all n bin
counters and estimates, so O(n) I'm afraid. So there is no algorithmic
advantage over partition/percentile: if there are m samples and n bins, P^2
does that O(n) update m times, while partition does O(m) work n times, so
both end up being O(m n). It seems to me that the main advantage of P^2 is
not having to hold the full dataset in memory. Online statistics (is that
the name for this?), even if they are only estimates, are a cool thing, but
I am not sure numpy is the place for them. That's not to say that we
couldn't eventually have P^2 implemented for histogram, but I would start
off with a partition-based one.
Would SciPy have a place for online statistics? Perhaps there's room for
yet another scikit?
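
To make the partition-based option concrete, here is a minimal sketch of
equal-frequency bin edges, assuming the whole dataset fits in memory. The
helper name partition_bin_edges is hypothetical and not part of numpy. It
also asks np.partition for all the order statistics in a single call
rather than calling it once per bin, which is one possible way to write
it and not something settled in this thread.

```python
import numpy as np


def partition_bin_edges(data, n_bins):
    """Equal-frequency bin edges via np.partition (hypothetical helper)."""
    data = np.asarray(data, dtype=float).ravel()
    m = data.size
    if m == 0 or n_bins < 1:
        raise ValueError("need at least one sample and at least one bin")
    # Sorted-order positions of the edges: minimum, interior quantiles, maximum.
    # np.unique drops repeated positions, so fewer than n_bins + 1 edges come
    # back when there are more bins than samples.
    kth = np.unique(np.linspace(0, m - 1, n_bins + 1).round().astype(int))
    # After partitioning, each position in kth holds the value it would hold
    # in a fully sorted array; everything else is only partially ordered.
    part = np.partition(data, kth)
    return part[kth]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=10_000)
    edges = partition_bin_edges(sample, 10)
    counts, _ = np.histogram(sample, bins=edges)
    print(edges)
    print(counts)
```

Unlike P^2, this needs the full array at once, which is the memory
trade-off described above. Note that repeated values in the data can
produce repeated edges, which a real implementation would have to handle.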
Jaime
--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.