Automatic number of bins for numpy histograms

http://nbviewer.ipython.org/github/nayyarv/matplotlib/blob/master/examples/statistics/Automating%20Binwidth%20Choice%20for%20Histogram.ipynb

Long story short, histogram visualisations that depend on numpy (such as matplotlib, or nearly all of them) have poor default behaviour: I have to constantly play around with the number of bins to get a good idea of what I'm looking at. The default of bins=10 works tolerably for up to about 1000 points or very normal data, but performs poorly for anything else, and doesn't account for variability either. I don't have a method easily available to scale the number of bins given the data.

R doesn't suffer from these problems and provides methods for use with its hist method. I would like to provide similar functionality for matplotlib, to at least provide some kind of good starting point, as histograms are very useful for initial data discovery. The notebook above provides an explanation of the problem as well as some proposed alternatives. Use different datasets (type and size) to see the performance of the suggestions. All of the methods proposed exist in R and in the literature.

I've put together an implementation to add this new functionality, but am hesitant to make a pull request, as I would like some feedback from a maintainer before doing so. https://github.com/numpy/numpy/compare/master...nayyarv:master

I've provided the estimators as functions for easy refactoring, since it can be argued that they should live in their own function/file/class, or alternatively they can be turned into simple if/elif statements. I believe this belongs in numpy, as that is where the histogram functionality that most libraries build on lives, and it would be useful for them not to require scipy, for example.

I will update the documentation accordingly before making a pull request, and add more tests to demonstrate the functionality. I can adapt my IPython notebook into a quick tutorial/help file if need be. I've already attempted to add this to matplotlib before being redirected here: https://github.com/matplotlib/matplotlib/issues/4316
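[As a concrete illustration of the kind of rule being discussed, here is a minimal sketch of the Freedman-Diaconis estimator, one of the standard literature methods; this is an editor's illustration, not the code in the branch linked above:]

    import numpy as np

    def fd_nbins(data):
        """Freedman-Diaconis rule: bin width h = 2 * IQR * n**(-1/3)."""
        data = np.asarray(data)
        iqr = np.subtract(*np.percentile(data, [75, 25]))  # interquartile range
        if iqr == 0:
            return 10  # degenerate data: fall back to the current default
        h = 2.0 * iqr * data.size ** (-1.0 / 3.0)
        return int(np.ceil((data.max() - data.min()) / h))

    x = np.random.exponential(size=10000)
    counts, edges = np.histogram(x, bins=fd_nbins(x))

[Because the rule uses the interquartile range rather than the standard deviation, it still gives sensible widths on skewed or heavy-tailed samples, where a fixed bins=10 smears out the structure.]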

On Sun, Apr 12, 2015 at 12:19 AM, Varun <nayyarv@gmail.com> wrote:
I've put together an implementation to add this new functionality, but am hesitant to make a pull request as I would like some feedback from a maintainer before doing so.
+1 on the PR.

Jaime

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him in his plans for world domination.

On Sun, Apr 12, 2015 at 9:45 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
+1 on the PR.
+1 as well.

Unfortunately we can't change the default of 10, but a number of string methods, with a "bins=auto" or some such name prominently recommended in the docstring, would be very good to have.

Ralf
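[If something along these lines lands, usage might look like the following sketch; the exact estimator names here ('auto', 'fd', 'sturges') are assumptions, and whatever is merged may differ:]

    import numpy as np

    data = np.random.randn(5000)

    # A named rule instead of the fixed default of 10 bins:
    counts, edges = np.histogram(data, bins='auto')

    # Or a specific estimator by name:
    counts, edges = np.histogram(data, bins='fd')       # Freedman-Diaconis
    counts, edges = np.histogram(data, bins='sturges')  # Sturges' formula

[The existing integer and sequence forms of bins would keep working unchanged, which is what makes the string spelling backwards compatible.]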

Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%2...)

This is already implemented in C++'s Boost library (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_...). I implemented it in Boost Python as a module, which I'm happy to share.

This is much better than fixed-width histograms in practice. Rather than adjusting the number of bins, it adjusts what you really want, which is the resolution of the bins throughout the domain.

Best,

Neil

Another improvement would be to make sure, for integer-valued datasets, that all bins cover the same number of integers, as it is otherwise easy to end up with some bins "effectively" wider than others: hist(np.random.randint(11, size=10000)) shows a peak in the last bin, as it covers both 9 and 10.

Antony
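[A short demonstration of the effect and one possible fix, centring unit-width bins on the integers with half-integer edges; np.histogram is used directly here, since matplotlib's hist delegates its binning to it:]

    import numpy as np

    x = np.random.randint(11, size=10000)  # integers 0..10

    # Default: 10 equal-width bins over [0, 10]. The last bin spans [9, 10]
    # and collects both the 9s and the 10s, so it shows a spurious peak.
    counts, edges = np.histogram(x, bins=10)

    # Fix: half-integer edges give every integer its own unit-width bin.
    counts, edges = np.histogram(x, bins=np.arange(-0.5, 11.5))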

On Mon, Apr 13, 2015 at 5:02 AM, Neil Girdhar <mistersheik@gmail.com> wrote:
Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? ( http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%2... )
This looks like a great thing to have in numpy. However, I suspect that a lot of the downstream code that uses histogram expects equally spaced bins. So this should probably be an "in addition to", rather than an "instead of".

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov

Yes, you're right. Although in practice, people almost always want adaptive bins.

On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar <mistersheik@gmail.com> wrote:
Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? (http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%2...)
This is already implemented in C++'s boost library (http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_...)
This definitely sounds like a useful thing to have in numpy or scipy (though if it's possible to do without using Boost/C++, that would be nice). But yeah, we should leave the existing histogram alone (in this regard) and add a new name for this like "adaptive_histogram" or something. Then you can set about convincing matplotlib and friends to use it by default :-)

-n

--
Nathaniel J. Smith -- http://vorpus.org

Would having a negative number of bins mean "this many, but with optimized boundaries" be too clever an interface?

I have taken a look at the paper linked, and the P-2 algorithm would not be too complicated to implement from scratch, although it would require writing some C code, I'm afraid.

Jaime

If you're going to C, is there a reason not to go to C++ and include the already-written Boost code? Otherwise, why not use Python?

On Tue, Apr 14, 2015 at 6:16 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
If you're going to C, is there a reason not to go to C++ and include the already-written Boost code? Otherwise, why not use Python?
I think we have an explicit rule against C++, although I may be wrong. Not sure how much of Boost we would have to make part of numpy to use that; the whole accumulators lib, I'm guessing? Seems like an awful lot given what we are after.

Jaime

Yeah, I'm not arguing, I'm just curious about your reasoning. That explains why not C++. Why would you want to do this in C and not Python?

On Wed, Apr 15, 2015 at 4:36 AM, Neil Girdhar <mistersheik@gmail.com> wrote:
Yeah, I'm not arguing, I'm just curious about your reasoning. That explains why not C++. Why would you want to do this in C and not Python?
Well, the algorithm has to iterate over all the inputs, updating the estimated percentile positions at every iteration. Because the estimated percentiles may change in every iteration, I don't think there is an easy way of vectorizing the calculation with numpy, so I think it would be very slow if done in Python.

Looking at this in some more detail, how is this typically used? It gives you approximate values that should split your sample into similarly filled bins, but because the values are approximate, to compute a proper histogram you would still need to do the binning to get exact results, right? Even with this drawback, P-2 does have an algorithmic advantage, so for huge inputs and many bins it should come out ahead. But for many medium-sized problems it may be faster to simply use np.partition, which gives you the whole thing in a single go, and it would be much simpler to implement.

Jaime
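[For illustration, a minimal sketch of that partition-based approach: place the bin edges at equally spaced sample quantiles (np.percentile selects them via partitioning internally) and then bin exactly. The helper name is made up:]

    import numpy as np

    def equal_fill_histogram(data, nbins):
        """Histogram whose edges sit at equally spaced sample quantiles,
        so every bin holds roughly the same number of points."""
        data = np.asarray(data)
        edges = np.percentile(data, np.linspace(0, 100, nbins + 1))
        edges = np.unique(edges)  # repeated values can collapse edges
        return np.histogram(data, bins=edges)

    x = np.random.lognormal(size=10000)
    counts, edges = equal_fill_histogram(x, 20)

[Unlike P^2 this needs the full dataset in memory, but the counts are exact rather than estimated.]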

You got it. I remember this from when I worked at Google and we would process (many many) logs. With enough bins, the approximation is still really close. It's great if you want to make an automatic plot of data.

Calling numpy.partition a hundred times is probably slower than calling P^2 with n=100 bins. I don't think it does O(n) computations per point. I think it's more like O(log(n)).

Best,

Neil

On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar <mistersheik@gmail.com> wrote:
Calling numpy.partition a hundred times is probably slower than calling P^2 with n=100 bins. I don't think it does O(n) computations per point. I think it's more like O(log(n)).
Looking at it again, it probably is O(n) after all: it does a binary search, which is O(log n), but it then goes on to update all the n bin counters and estimations, so O(n) I'm afraid. So there is no algorithmic advantage over partition/percentile: if there are m samples and n bins, P-2 does O(n) work m times, while partition does O(m) work n times, so both end up being O(m n).

It seems to me that the big thing of P^2 is not having to hold the full dataset in memory. Online statistics (is that the name for this?), even if only estimations, are a cool thing, but I am not sure numpy is the place for them. That's not to say that we couldn't eventually have P^2 implemented for histogram, but I would start off with a partition-based one.

Would SciPy have a place for online statistics? Perhaps there's room for yet another scikit?

Jaime

Cool, thanks for looking at this. P2 might still be better even if the whole dataset is in memory, because of cache misses. Partition, which I guess is based on quickselect, is going to run over all of the data roughly as many times as there are bins, whereas P2 only runs over it once. From a cache-miss standpoint, I think P2 is better? Anyway, it might be worth coding it up to verify any performance advantages.

Not sure if it should be in numpy or not, since it really should accept an iterable rather than a numpy vector, right?

Best,

Neil

On Tue, Apr 14, 2015 at 4:24 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Would having a negative number of bins mean "this many, but with optimized boundaries" be too clever an interface?
As a user, I think so. Wouldn't np.histogram(..., adaptive=True) do well enough? -p

By the way, the P^2 algorithm still needs to know how many bins you want; it just adapts the endpoints of the bins. I like adaptive=True. However, you will have to find a way to return both the bins and their calculated endpoints.

The P^2 algorithm can also give approximate answers to numpy.percentile and numpy.median. How approximate they are depends on the number of bins you let it keep track of. I believe the authors bound the error as a function of the number of points and bins.
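[For reference, a from-scratch Python sketch of the single-quantile form of the P^2 marker update from the Jain & Chlamtac paper; the histogram variant applies the same update rule to b+1 markers. An unoptimized illustration, not a proposed numpy implementation:]

    import numpy as np

    class P2Quantile:
        """Streaming estimate of the p-th quantile (Jain & Chlamtac, 1985)."""

        def __init__(self, p):
            self.q = []                            # marker heights
            self.n = np.arange(1.0, 6.0)           # actual marker positions
            self.target = np.array([1.0, 1 + 2*p, 1 + 4*p, 3 + 2*p, 5.0])
            self.step = np.array([0.0, p/2, p, (1 + p)/2, 1.0])

        def add(self, x):
            if len(self.q) < 5:                    # initialization phase
                self.q.append(x)
                self.q.sort()
                return
            q, n = self.q, self.n
            # 1. Find the cell k with q[k] <= x < q[k+1], widening extremes.
            if x < q[0]:
                q[0], k = x, 0
            elif x >= q[4]:
                q[4], k = x, 3
            else:
                k = int(np.searchsorted(q, x, side='right')) - 1
            # 2. Markers above the cell shift one position to the right;
            #    desired positions advance fractionally.
            n[k + 1:] += 1
            self.target += self.step
            # 3. Nudge each interior marker toward its desired position.
            for i in (1, 2, 3):
                d = self.target[i] - n[i]
                if (d >= 1 and n[i+1] - n[i] > 1) or (d <= -1 and n[i-1] - n[i] < -1):
                    d = 1.0 if d > 0 else -1.0
                    # Piecewise-parabolic (P^2) prediction of the new height.
                    qp = q[i] + d / (n[i+1] - n[i-1]) * (
                        (n[i] - n[i-1] + d) * (q[i+1] - q[i]) / (n[i+1] - n[i]) +
                        (n[i+1] - n[i] - d) * (q[i] - q[i-1]) / (n[i] - n[i-1]))
                    if q[i-1] < qp < q[i+1]:
                        q[i] = qp
                    else:                          # linear fallback
                        j = i + int(d)
                        q[i] += d * (q[j] - q[i]) / (n[j] - n[i])
                    n[i] += d

        @property
        def estimate(self):
            # Crude during initialization; the middle marker afterwards.
            return sorted(self.q)[len(self.q) // 2] if len(self.q) < 5 else self.q[2]

    est = P2Quantile(0.5)
    for x in np.random.randn(100000):
        est.add(x)
    print(est.estimate)  # lands close to the true median, 0.0

[Each observation touches only the five markers, which is what keeps the memory footprint constant; the paper bounds the estimation error as a function of the number of markers.]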

"Then you can set about convincing matplotlib and friends to use it by default" Just to note, this proposal was originally made over in the matplotlib project. We sent it over here where its benefits would have wider reach. Matplotlib's plan is not to change the defaults, but to offload as much as possible to numpy so that it can support these new features if they are available. We might need to do some input validation so that users running older version of numpy can get a sensible error message. Cheers! Ben Root On Tue, Apr 14, 2015 at 7:12 PM, Nathaniel Smith <njs@pobox.com> wrote:

This blog post, and the links within, also seem relevant; it appears to have Python code available to try things out as well.

https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantil...

-Eric
"Then you can set about convincing matplotlib and friends to use it by default"
Just to note, this proposal was originally made over in the matplotlib project. We sent it over here where its benefits would have wider reach. Matplotlib's plan is not to change the defaults, but to offload as much as possible to numpy so that it can support these new features if they are available. We might need to do some input validation so that users running older version of numpy can get a sensible error message.
Cheers! Ben Root
On Tue, Apr 14, 2015 at 7:12 PM, Nathaniel Smith <njs@pobox.com> wrote:
Can I suggest that we instead add the P-square algorithm for the dynamic calculation of histograms? ( http://pierrechainais.ec-lille.fr/Centrale/Option_DAD/IMPACT_files/Dynamic%2... )
This is already implemented in C++'s boost library ( http://www.boost.org/doc/libs/1_44_0/boost/accumulators/statistics/extended_... )
I implemented it in Boost Python as a module, which I'm happy to share. This is much better than fixed-width histograms in practice. Rather
adjusting the number of bins, it adjusts what you really want, which is
On Mon, Apr 13, 2015 at 8:02 AM, Neil Girdhar <mistersheik@gmail.com> wrote: than the
resolution of the bins throughout the domain.
This definitely sounds like a useful thing to have in numpy or scipy (though if it's possible to do without using Boost/C++ that would be nice). But yeah, we should leave the existing histogram alone (in this regard) and add a new name for this like "adaptive_histogram" or something. Then you can set about convincing matplotlib and friends to use it by default :-)
-n
-- Nathaniel J. Smith -- http://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

On Wed, Apr 15, 2015 at 9:14 AM, Eric Moore <ewm@redtetrahedron.org> wrote:
This blog post, and the links within also seem relevant. Appears to have python code available to try things out as well.
https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantil...
Very cool indeed... The original work is licensed under an Apache 2.0 license (https://github.com/tdunning/t-digest/blob/master/LICENSE). I am not fluent in legalese, so I'm not sure whether that means we can use it or not; it seems awfully more complicated than what we normally use.

Jaime

Using a URL shortener for the notebook to get around the 80-character width limit: http://goo.gl/JmfTRJ
participants (11)

- Antony Lee
- Benjamin Root
- Chris Barker
- Eric Moore
- Jaime Fernández del Río
- Nathaniel Smith
- Neil Girdhar
- Paul Hobson
- Ralf Gommers
- Sturla Molden
- Varun