[Numpy-discussion] Automatic number of bins for numpy histograms

Neil Girdhar mistersheik at gmail.com
Wed Apr 15 18:46:40 EDT 2015


Cool, thanks for looking at this.  P2 might still be better even if the
whole dataset is in memory because of cache misses.  Partition, which I
guess is based on quickselect, is going to run over all of the data as many
times as there are bins roughly, whereas p2 only runs over it once.  From a
cache miss standpoint, I think p2 is better?  Anyway, it might be worth
maybe coding to verify any performance advantages?  Not sure if it should
be in numpy or not since it really should accept an iterable rather than a
numpy vector, right?

Best,

Neil

On Wed, Apr 15, 2015 at 12:40 PM, Jaime Fernández del Río <
jaime.frio at gmail.com> wrote:

> On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar <mistersheik at gmail.com>
> wrote:
>
>> You got it.  I remember this from when I worked at Google and we would
>> process (many many) logs.  With enough bins, the approximation is still
>> really close.  It's great if you want to make an automatic plot of data.
>> Calling numpy.partition a hundred times is probably slower than calling P^2
>> with n=100 bins.  I don't think it does O(n) computations per point.  I
>> think it's more like O(log(n)).
>>
>
> Looking at it again, it probably is O(n) after all: it does a binary
> search, which is O(log n), but it then goes on to update all the n bin
> counters and estimations, so O(n) I'm afraid. So there is no algorithmic
> advantage over partition/percentile: if there are m samples and n bins, P-2
> that O(n) m times, while partition does O(m) n times, so both end up being
> O(m n). It seems to me that the big thing of P^2 is not having to hold the
> full dataset in memory. Online statistics (is that the name for this?),
> even if only estimations, is a cool thing, but I am not sure numpy is the
> place for them. That's not to say that we couldn't eventually have P^2
> implemented for histogram, but I would start off with a partition based one.
>
> Would SciPy have a place for online statistics? Perhaps there's room for
> yet another scikit?
>
> Jaime
>
> --
> (\__/)
> ( O.o)
> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
> de dominación mundial.
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20150415/72ee3433/attachment.html>


More information about the NumPy-Discussion mailing list