Please note that I am a newbie and just a lurker. I noticed in a recent email that cKDTree was mentioned.

Q: What is the relationship, if any, between SciPy and scikit-learn when it comes to cKDTree?

The reasons I ask are the following two links:
https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-sear...
https://github.com/scikit-learn/scikit-learn/issues/3682
I did benchmarks a while ago, and the scikit-learn KDTree tends to be faster in higher-dimensional situations: http://gael-varoquaux.info/programming/scikit-learn-014-release-features-and...

Note that this is not a general rule: they implement different heuristics, and hence explore different tradeoffs.

G

On Sun, Sep 04, 2016 at 05:00:08PM -0400, Robert Lucente wrote:
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-dev
-- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
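Gael's linked benchmarks are the real comparison; as a minimal, hedged sketch of how one might cross-check scipy's cKDTree today (dataset sizes are arbitrary; the brute-force pass stands in for a second implementation such as scikit-learn's KDTree):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 3))    # 2000 points in 3-D
queries = rng.standard_normal((10, 3))

tree = cKDTree(data)
dist, idx = tree.query(queries, k=1)     # nearest neighbor of each query point

# Cross-check the tree answers against an explicit brute-force search.
brute_idx = np.argmin(
    np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=-1), axis=1
)
assert np.array_equal(idx, brute_idx)
```

Wrapping the `tree.query` call in `timeit` for varying dimensionality is essentially what the linked benchmarks do.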
On 4 September 2016 at 23:00, Robert Lucente <rlucente@pipeline.com> wrote:
Note that these benchmarks are from 2013 and 2014. SciPy's KDTree has recently had its performance improved, twice. scikit-learn's last update to its KDTree was in 2015. So we need to run the benchmarks again.

/David.
From my own casual benchmarks, the new scipy cKDTree is much faster than any of the scikit-learn options, though it still only supports axis-aligned Euclidean-like metrics (whereas sklearn's BallTree supports dozens of additional metrics). The cKDTree also has a limited range of query types compared to scikit-learn's trees.

Jake
Jake VanderPlas Senior Data Science Fellow Director of Research in Physical Sciences University of Washington eScience Institute On Mon, Sep 5, 2016 at 12:46 AM, Daπid <davidmenhur@gmail.com> wrote:
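Jake's point about metric support can be made concrete: cKDTree exposes only the Minkowski family through the `p` parameter of `query` (a sketch; anything beyond these, e.g. haversine or Mahalanobis, requires sklearn's BallTree):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
data = rng.random((500, 4))
tree = cKDTree(data)
q = rng.random(4)

# cKDTree's query only takes a Minkowski exponent p:
# p=1 (Manhattan), p=2 (Euclidean), p=np.inf (Chebyshev).
dists = {}
for p in (1, 2, np.inf):
    d, i = tree.query(q, k=1, p=p)
    dists[p] = d
    # cross-check against an explicit computation of the same metric
    brute = np.min(np.linalg.norm(data - q, ord=p, axis=1))
    assert np.isclose(d, brute)
```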
Would you guys consider it in scope for scipy to have implementations of nearest neighbor search methods faster than the k-d tree? Some methods are fairly simple, e.g. the principal axis tree, which uses the principal direction of the dataset to split it into smaller subsets. As soon as the intrinsic dimensionality is significantly smaller than the dimension of the space, it is significantly faster. Besides, only having to compute an (approximate) principal axis is much faster than doing an actual PCA.

On Tue, Sep 6, 2016 at 4:14 AM, Jacob Vanderplas <jakevdp@cs.washington.edu> wrote:
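The split Sylvain describes can be sketched as follows (a hypothetical illustration, not an actual principal-axis-tree implementation): a few power iterations approximate the leading principal direction without a full PCA, and one node split cuts the projected data at the median.

```python
import numpy as np

def approx_principal_axis(X, n_iter=50, seed=0):
    """Approximate the leading principal axis of X with power iteration,
    avoiding a full PCA/eigendecomposition."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    v = rng.standard_normal(X.shape[1])
    for _ in range(n_iter):
        v = Xc.T @ (Xc @ v)      # one multiplication by the scatter matrix
        v /= np.linalg.norm(v)
    return v

def split_by_principal_axis(X):
    """A single node split of a (hypothetical) principal axis tree:
    project onto the approximate principal direction, cut at the median."""
    v = approx_principal_axis(X)
    proj = (X - X.mean(axis=0)) @ v
    mask = proj <= np.median(proj)
    return X[mask], X[~mask]

rng = np.random.default_rng(2)
# data whose intrinsic dimension (2) is much lower than the ambient one (10)
X = rng.standard_normal((1000, 2)) @ rng.standard_normal((2, 10))
left, right = split_by_principal_axis(X)
```

Recursing on `left` and `right` would build the full tree; the point of the method is that each split adapts to where the data actually varies.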
As for adding more functionality along those lines, I would advocate creating a new package, or perhaps a scikit. As we've seen with scikit-learn, useful features can filter up into scipy (e.g. sparse.csgraph), and developing new features within an independent package is *much* easier than doing so from within a scipy PR.

Jake

Jake VanderPlas Senior Data Science Fellow Director of Research in Physical Sciences University of Washington eScience Institute

On Tue, Sep 6, 2016 at 11:03 AM, Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
I understand (especially about the scikits), although I was surprised by some recently added features in scipy, such as differential evolution.

DE is more of a "recipe" that has been studied empirically and for which there is no theoretical convergence rate. Even though it may "work well" for certain problems, this causes some issues, such as defining on which basis it should even be "improved", and what the foundation for a change should be. (Recently a result on convergence in probability (with no error rate) on a bounded domain has been published, but no result exists on the speed of convergence afaik...)

Evolutionary optimization is definitely cool and may work strikingly well for certain problems, but I was surprised that it got elected for inclusion in scipy. DE would have been a nice seed for a scikit-evolution.

For those interested in stochastic algorithms for global multivariate optimization for which we have proven convergence results and an abundant literature, there are at least the two following categories of methods:
- *MCMC* (Markov chain Monte Carlo) methods, which comprise simulated annealing, for which there are nice results on convergence in distribution.
- *Robbins-Monro* methods, which comprise stochastic gradient methods, for which we have almost-sure convergence and central-limit-type theorems.

This sort of raises the question of the scope of scipy. Should scipy go through the same sort of "big split" as Jupyter, or more à la d3js? Is the amount of specialized knowledge required to understand a method part of what defines the line of division between what should and should not be in scipy?

Sylvain

On Tue, Sep 6, 2016 at 8:08 PM, Jacob Vanderplas <jakevdp@cs.washington.edu> wrote:
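Of the two families Sylvain lists, the Robbins-Monro scheme is easy to sketch (a toy illustration, not anything in scipy): find the root of E[H(theta, X)] = 0 from one noisy sample per step, with step sizes a_n satisfying sum a_n = inf and sum a_n^2 < inf.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy Robbins-Monro: solve E[H(theta, X)] = 0 for H(theta, x) = theta - x,
# i.e. estimate the mean of X from one noisy observation per iteration.
true_mean = 1.5
theta = 0.0
for n in range(1, 20001):
    x = rng.normal(true_mean, 1.0)   # one noisy sample
    a_n = 1.0 / n                    # sum a_n diverges, sum a_n**2 converges
    theta -= a_n * (theta - x)       # stochastic approximation update

# theta converges almost surely to true_mean; with H(theta, x) = theta - x
# and a_n = 1/n this iteration is exactly the running sample mean.
```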
Tue, 06 Sep 2016 20:23:05 +0200, Sylvain Corlay wrote: [clip]
This sort of raises the question of the scope of scipy. Should scipy go through the same sort of "big split" as Jupyter, or more à la d3js? Is the amount of specialized knowledge required to understand a method part of what defines the line of division between what should and should not be in scipy?
I'm not sure a "big split" will help with defining the scope --- merely changing the name of "scipy.spatial" to "scikit-spatial" will likely not help in deciding what its scope is. Splitting may help with speeding up the release cycle, but if the team stays the same, it sounds likely to multiply the amount of release work. Rebuild times and other development conveniences are probably not a reason to split, as I think it's fast enough in practice currently.

The general decision rule for accepting something has so far been conditional on: (i) the method is applicable in many fields and "generally agreed" to be useful; (ii) it fits the topic of the subpackage and does not require extensive support frameworks to operate; (iii) the implementation looks sound and unlikely to need much tweaking in the future; and (iv) someone wants to do it. I don't remember being involved with DE, and can't immediately comment on how the PCA trees look here.

I think the original suggestion of scikits being something that will eventually, after maturing, end up in Scipy has not generally been the case in practice. To my memory, many of the recent new features were written specifically for Scipy from the beginning. (OTOH, I admit I did not go and look through merged pull requests with this eye, so maybe this claim is not accurate.)

I also don't really feel that the implied idea of Scipy as a "repository of algorithms", where you hand off code developed in some other project on PyPI for someone else to maintain, is workable in practice --- that doesn't sound like a living open source project. I would rather mentally substitute "scipy.interpolate" with "scikit-interpolate" and work according to that picture.

-- Pauli Virtanen
To write a bit more on this: although I think it's difficult to give hard rules on what "generally useful and generally agreed to work" means, I would perhaps weigh the following against each other:

- Is the method used/useful in different domains in practice? How much domain-specific background knowledge is needed to use it properly?

- Consider the stuff already in the module. Is what you are adding an omission? Does it solve a problem that you'd expect the module to be able to solve? Does it supplement an existing feature in a significant way?

- Consider the equivalence class of similar methods/features usually expected. Among them, what would in principle be the minimal set such that there is no glaring omission in the offered features? How much stuff would that be? Does including a representative one of them cover most use cases? Would it in principle sound reasonable to include everything from the minimal set in the module?

- Is what you are adding something that is well understood in the literature? If not, how sure are you that it will turn out well? Does the method perform well compared to other similar ones?

- Note that the twice-a-year release cycle and backward-compatibility policy make correcting things later on more difficult.

The scopes of the submodules also vary, so it's probably best to consider each as if it were a separate project --- "numerical evaluation of special functions" is relatively well-defined, but "commonly needed optimization algorithms" less so.

On a meta level, it's probably also bad to be too restrictive on the scope, as telling people to go away can result in just that.

-- Pauli Virtanen
On Wed, Sep 7, 2016 at 11:19 AM, Pauli Virtanen <pav@iki.fi> wrote:
Thanks Pauli, this and your other mail are the best summary yet of how to judge suitability for inclusion. I propose to stick this almost verbatim in the developer docs. Ralf
On Wed, Sep 7, 2016 at 6:23 AM, Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
I think it's fair to summarize the reason for the current status and inclusion of DE as: theoretical convergence rates and papers are nice, but code that actually works on a good set of benchmarks is way more important.

Some years ago we had only bad global optimizers. We did have simulated annealing, but it just didn't work for most problems. No one stepped up to improve it, so we deprecated and removed it.

DE was benchmarked quite thoroughly (on the benchmark functions from benchmarks/benchmarks/test_go_benchmark_functions.py) and came out looking good. It solved problems that the only other good optimizer we had (basinhopping) did not do well on. So there was value in adding it.

Cheers, Ralf
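For reference, the optimizer under discussion is scipy.optimize.differential_evolution; a minimal usage sketch on the Rosenbrock function, whose global minimum is f(1, 1) = 0 (the bounds and seed here are arbitrary):

```python
import numpy as np
from scipy.optimize import differential_evolution, rosen

# Global minimum of the Rosenbrock function is f(1, 1) = 0.
bounds = [(-5.0, 5.0), (-5.0, 5.0)]
result = differential_evolution(rosen, bounds, seed=1)

assert result.success
assert np.allclose(result.x, [1.0, 1.0], atol=1e-3)
```

By default the best DE candidate is polished with a local minimizer, so the returned point is accurate well beyond what the population search alone achieves.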
Hi Ralf, On Thu, Sep 8, 2016 at 10:18 AM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
It does raise the question of what counts as an "improvement" for a DE optimizer, since it is only empirical. Is an improvement something that makes it work better on a fixed set of examples? Is it something that makes it closer to the seminal article describing it, or to a later, more general description of this approach?

DE was added to Scipy at almost the same time as a first implementation of a simple LP solver. In the case of the LP solver, bug reports can easily be sorted out: "I have this reasonably sized and well-conditioned LP problem; the solver says it is infeasible while this set of values is in the feasible set" -> clear bug.

I find it a bit worrying that theoretical bounds or rates were never a question for a method to be elected as part of Scipy. I would have thought that some scientific foundation was a requirement. Just imagine: I have a new uniformly filling sequence, but no proof that it is pseudo-random, and I don't even know the discrepancy of the sequence, just a bunch of examples for which "it works"... Well, I doubt that anyone would want to use it for Monte Carlo simulation / cryptography etc...

In any case, I find this question of the need for a scientific justification worth answering in general - especially in the context of the discussions on scope and the 1.0 release.

Sylvain
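Sylvain's LP example is easy to make concrete: the solver in question is scipy.optimize.linprog, and a bug report of the kind he describes is mechanically checkable (the particular problem below is made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# minimize x0 + 2*x1  subject to  x0 + x1 >= 1,  x >= 0.
# linprog expects upper-bound form A_ub @ x <= b_ub, so negate the constraint.
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0, -1.0]])
b_ub = np.array([-1.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

# The point (1, 0) satisfies every constraint, so a report of infeasibility
# would be an unambiguous bug; here the optimum is x = (1, 0) with value 1.
assert res.status == 0
```

This is exactly the contrast Sylvain draws: for an LP solver, correctness has a crisp definition; for an empirical heuristic like DE it does not.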
Hi, Thu, 08 Sep 2016 12:02:35 +0200, Sylvain Corlay wrote:
Just imagine: I have a new uniformly filling sequence, but no proof that it is pseudo-random, and I don't even know the discrepancy of the sequence, just a bunch of examples for which "it works"... Well, I doubt that anyone would want to use it for Monte Carlo simulation / cryptography etc...
Are you presenting this as a fair analogy to DE and the literature about it?
In any case, I find this question of the need for a scientific justification worth answering in general - especially in the context of the discussions on scope and the 1.0 release.
Yes, justification that an approach works and that it is regarded as useful is required. This is not really the place to do completely original research. If I understand you correctly, you are saying that you would not generally recommend DE as a global optimization method, because there are superior choices? Or are you saying it's essentially crock? -- Pauli Virtanen
Hi Pauli, Ralf, On Thu, Sep 8, 2016 at 8:26 PM, Pauli Virtanen <pav@iki.fi> wrote:
If I understand you correctly, you are saying that you would not generally recommend DE as a global optimization method, because there are superior choices? Or, are you saying it's essentially crock?
That is not really it. I was mostly pointing to DE as an example that is sort of in a gray area, in order to raise the questions of scope, criteria for inclusion into scipy, etc.

When it was included in scipy, it caught my attention, since I had worked on flavors of DE in the past (it is used as an alternative to Lloyd's algorithm for optimal quantization). There is indeed some literature about the applications in this area. So I did find it useful, and found that it works well for the problems I used it for. (However, the sort of generic implementation proposed today in scipy would not have been a good fit in this case.)

My understanding of scipy's scope is that it is a collection of routines that are robust reference implementations of well-established numerical methods that are of general interest regardless of the area of application: linear algebra, interpolation, quadrature, random number generation, convex optimization, specialized optimizers for dense LP and QP, etc. In each of these areas, if you need something more specialized, you should probably use a specialized library or implement something ad hoc for your use case.

Evolutionary optimization algorithms don't seem to fall into this category, for the reasons that we discussed earlier. It is mostly a set of heuristics. It is cool, inspired by nature, etc. (however, a number of citations is probably not a substitute for a mathematical proof...). The other methods that I listed for stochastic optimization would have been more natural candidates for the "category" that I roughly described above, in that they are extremely well established and backed by theory.

I imagine that the inclusion of DE in Scipy could have been questioned at the time, but now that it is in there, it should probably not be removed without a good alternative. Finally, I am still curious about what can be considered a bug or a feature in the case of a method like this.
On the subject of using a faster flavor of k-d tree, as I was proposing, I was only gauging interest. The long discussion of DE in this thread is mostly coincidental; my main goal was to use it as an example for the question of scope - and also to ask about the "big split" idea. If there were to be a split, a potential scipy-incubator organization with proposals for inclusion as scipy subprojects would make sense...

Sylvain
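As an aside on the Lloyd's algorithm Sylvain contrasts DE with: the iteration is short enough to sketch in full (a toy 1-D, sample-based version; the starting codebook and sample sizes are arbitrary). It alternates nearest-codeword assignment with a centroid update of each cell.

```python
import numpy as np

def lloyd_1d(samples, codebook, n_iter=30):
    """One-dimensional Lloyd iteration for optimal quantization:
    alternate nearest-codeword assignment and centroid update."""
    codebook = np.array(codebook, dtype=float)
    for _ in range(n_iter):
        # assign each sample to its nearest codeword
        idx = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)
        # move each codeword to the mean of its cell
        for k in range(codebook.size):
            members = samples[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return np.sort(codebook)

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, 20000)
codebook = lloyd_1d(samples, [-2.0, -0.5, 0.5, 2.0])
```

Lloyd's method only finds a local optimum of the quantization error, which is precisely why global heuristics such as DE have been tried as alternatives in this area.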
On Fri, Sep 9, 2016 at 9:12 AM, Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
Hi Pauli, Ralf,
On Thu, Sep 8, 2016 at 8:26 PM, Pauli Virtanen <pav@iki.fi> wrote:
Just imagine: I have a new uniformly filling sequence, but no proof that it is pseudo-random, and I don't even know the discrepancy of the sequence, but a bunch of examples for which "it works"... Well, I doubt that anyone would want to use it for Monte-Carlo simulation / cryptography etc...
Are you presenting this as a fair analogy to DE and the literature about it?
In any case, I find this question on the need for a scientific justification worth answering in general - especially in the context of the discussions on scope and the 1.0 release.
Yes, justification that an approach works and that it is regarded as useful is required. This is not really the place to do completely original research.
If I understand you correctly, you are saying that you would not generally recommend DE as a global optimization method, because there are superior choices? Or, are you saying it's essentially crock?
That is not really it. I was mostly pointing to DE as an example which is sort of in a gray area, in order to ask the questions of scope, criteria for inclusion into scipy, etc.
When it was included into scipy, it triggered my attention since I had worked on flavors of DE in the past (it is used as an alternative to Lloyd's algorithm for optimal quantization). There is indeed some literature about the applications in this area. So I did find it useful, and found that it does work well for the problems I used it for. (However, the sort of generic implementation proposed today in scipy would not have been a good fit in this case.)
My understanding of scipy's scope is that it is a collection of routines that are robust reference implementations
"reference implementation" has a bit of a negative implication for me. Like we have reference BLAS/LAPACK, which are references for correctness but you should really be using something better for pretty much any application. That is definitely not the intention for anything in Scipy, nor is it the case for most modules. If we have an algorithm or data structure, we want to strike a good balance between features, maintainability, usability and performance.
of well-established numerical methods that are of general interest regardless of the area of application. Linear algebra, interpolation, quadrature, random number generation, convex optimization, specialized optimizers for dense LP and QP, etc... In each of these areas, if you need something more specialized, you should probably use a specialized library or implement something ad hoc for your use case.
For linear algebra there's really not that much more specialized that we want to not have in scope. Statistical distributions, special functions and distance metrics are other examples of where we go for comprehensive coverage. It can also depend on what other packages are out there; for example, for hypothesis tests we don't accept much anymore because statsmodels is the better place for comprehensive coverage.
Evolutionary optimization algorithms don't seem to fall into this category for the reasons that we discussed earlier. It is mostly a set of heuristics. It is cool, inspired by nature, etc. (however, a high citation count is probably not a substitute for a mathematical proof...)
We're just going to have to disagree about this one. You seem to attach an unusually high value to a "mathematical proof" (not that that's a black-or-white thing either), and a low value to things like realistic benchmarks and solving users' problems. It's not about the number of citations either. The AMPGO (http://infinity77.net/global_optimization/ampgo.html, https://github.com/andyfaff/ampgo) paper has only 14 citations, I see, but we'd probably add it if someone submits a good-quality PR.
The other methods that I listed for stochastic optimization would have been more natural candidates to fall into the "category" that I roughly described above, in that they are extremely well established and backed by theory. I imagine that the inclusion of DE into Scipy could have been questioned at the time,
I still don't see any reason for having had to question that inclusion.
but that now that it is in there, it should probably not be removed without a good alternative. Finally, I am still curious about what can be considered a bug or a feature in the case of a method like this.
I answered the feature part already, but I guess you didn't like the answer much :)
On the subject of the use of a faster flavor of KdTree as I was proposing, I was only gauging interest. The long discussion on DE on this specific thread is mostly coincidental. My main goal was to use it as an example for the question of the scope
I hope this discussion helped a bit. But if you want a fixed definition of scope and a recipe to apply so that you can know without discussion if a new feature will be in scope for Scipy - that's not possible I'm afraid. There's always some judgment involved.
- also to ask about the "big split" idea. If there was to be a split, a potential scipy-incubator organization with proposals for inclusion as scipy subprojects would make sense then...
There are no plans for splitting up SciPy. It was considered several times, but it's really not the best way to spend our time. It's possible we would get more development on some modules, but that's not guaranteed, and the maintenance and release overhead goes up significantly. Cheers, Ralf
Hey, On Sat, Sep 10, 2016 at 2:34 PM, Pauli Virtanen <pav@iki.fi> wrote:
Hi,
Thu, 08 Sep 2016 23:12:54 +0200, Sylvain Corlay kirjoitti: [clip]
On the subject of the use of a faster flavor of KdTree as I was proposing, I was only gauging interest. The long discussion on DE on this specific thread is mostly coincidental. My main goal was to use it as an example for the question of the scope - also to ask about the "big split" idea. If there was to be a split, a potential scipy-incubator organization with proposals for inclusion as scipy subprojects would make sense then...
Now that we're still derailed:
Several years back, there was a scipy.sandbox package which was nominally aligned with things intended for inclusion, but it was scrapped in the end: http://blog.jarrodmillman.com/2007/12/end-of-scipy-sandbox.html Some of these did end up as scipy subpackages, but many were also scrapped or turned into separate projects.
This was of course before Git/Github --- nowadays I think the role is mainly served by PRs, and for more complicated stuff, separate ad-hoc repositories plus issue tickets. Of course, scipy.sandbox had also overlap with completely separate projects.
With the "staging area", the main question is whether the number of significant new features proposed is large enough that there should be a separate "staging area", aside from PRs etc. Or would such a staging area encourage their creation? Or would it be useful for creating completely new projects? It's not really clear to me that this is the case.
In a more decentralized "big split" approach, "Scipy" would probably mainly stand as some sort of umbrella organization for more or less separate projects. Does such a central authority need to exist, beyond maintaining the specific projects? Is it better in some sense than the completely decentralized scikit approach?
The benefits / reasons of a split would be the same as usual: managing growth by a separation of concerns and focus, with lines of division defined by both technological and scientific criteria. Keeping things under a single umbrella would signal the intention of keeping the different components consistent on certain aspects. Experiments that imply significant changes, like a new type of language bindings, would have fewer implications for packaging since they would remain separate. Cheers, Sylvain
Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
On the subject of the use of a faster flavor of KdTree as I was proposing,
There are many flavours of spatial search trees. Some are fast to build, some are fast to query. Some support only a fixed set of distance measures, some support arbitrary distance measures. Some support periodic spaces, others do not. For example, SciPy's cKDTree supports periodic spaces for cosmology simulations. Scikit-learn's KDTree and BallTree do not, because their scope is different (machine learning only). In theory one could add different flavours of spatial search to scipy.spatial: kd-trees, ball-trees, PCA-trees, and vp-trees all have their merits. We must keep the API consistent. The code must be maintainable. The code must be maintained. The main thing, as I see it, is to avoid the code rot that brought down OpenSSL. SciPy is not a "dumping ground" for algorithms that someone has found remotely useful at some point in time. SciPy is not a collection of abandonware. Sturla
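[Editor's note] As an illustration of the periodic-space support mentioned above, here is a minimal sketch of a wraparound nearest-neighbour query, assuming a SciPy version recent enough to have the `boxsize` argument (roughly 0.17+):

```python
# Sketch: periodic (toroidal) nearest-neighbour queries with
# scipy.spatial.cKDTree via the boxsize argument.
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[0.1, 0.5], [0.5, 0.5]])
tree = cKDTree(points, boxsize=[1.0, 1.0])  # unit box with wraparound

# Without periodicity, the nearest neighbour of (0.98, 0.5) would be
# (0.5, 0.5) at distance 0.48; with wraparound, (0.1, 0.5) is only
# about 0.12 away across the boundary.
dist, idx = tree.query([0.98, 0.5])
print(idx, dist)  # nearest is index 0, distance ~0.12
```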
Sturla Molden <sturla.molden@gmail.com> wrote:
is different (only machine learning). In theory one could add different flavours of spatial search in scipy.spatial. kd-trees, ball-trees, PCA-trees, and vp-trees all have their merits. We must keep the API consistent. The code must be maintainable. The code must be maintained.
I am also going to be obnoxious about doing all the heavy lifting in C or C++, not Python or Cython, for the sake of releasing the GIL. Those who use spatial search trees in e.g. computer games are dependent on multithreading. Not necessarily for parallel processing, but the interpreter cannot be locked up for extended periods of time. This is the reason cKDTree is currently written in C++. Sturla
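[Editor's note] To make the GIL point concrete, a small sketch showing that, because `cKDTree.query` releases the GIL, queries issued from several Python threads can proceed concurrently (the threaded results are checked against a sequential run; timing is omitted since it is machine-dependent):

```python
# Sketch: concurrent cKDTree queries from Python threads.
# cKDTree.query releases the GIL, so the threads can run in parallel.
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
tree = cKDTree(rng.uniform(size=(10000, 3)))
query_batches = rng.uniform(size=(4, 1000, 3))

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(lambda q: tree.query(q)[1], query_batches))

# Results agree with a plain sequential run of the same queries.
sequential = [tree.query(q)[1] for q in query_batches]
assert all(np.array_equal(a, b) for a, b in zip(threaded, sequential))
```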
On Thu, Sep 8, 2016 at 10:02 PM, Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
Hi Ralf,
On Thu, Sep 8, 2016 at 10:18 AM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Wed, Sep 7, 2016 at 6:23 AM, Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
I understand (especially about the scikits), although I was surprised by some recently added features in scipy, such as differential evolution
DE is more of a "recipe" which has been studied empirically and for which there is no theoretical convergence rate. Even though it may "work well" for certain problems, this raises issues such as on what basis it should even be "improved", and what should be the foundation for a change. (Recently a result on convergence in probability (with no error rate) on a bounded domain has been published, but no result exists on the speed of convergence afaik...)
Evolutionary optimization is definitely cool and may work strikingly well for certain problems, but I was surprised that it got elected to inclusion in scipy. DE would have been a nice seed for a scikit-evolution.
For those interested in stochastic algorithms for global multivariate optimization for which we have proven convergence results and an abundant literature, we have at least the two following categories of methods:

- *MCMC* (Markov Chain Monte Carlo) methods, which comprise simulated annealing, and for which there are nice results of convergence in distribution.
- *Robbins-Monro* methods, which comprise stochastic gradient methods, and for which we have almost-sure convergence and central-limit-type theorems.
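[Editor's note] For concreteness, a toy Robbins-Monro iteration (purely illustrative, not a scipy API) with the classical step sizes a_n = a/n, which satisfy the conditions sum a_n = infinity and sum a_n^2 < infinity:

```python
# Sketch: a Robbins-Monro / stochastic gradient iteration that finds
# the root of an expected gradient from noisy observations only.
import random

def robbins_monro(noisy_grad, x0, a=1.0, n_steps=20000):
    x = x0
    for n in range(1, n_steps + 1):
        x -= (a / n) * noisy_grad(x)  # step sizes a/n: sum diverges, sum of squares converges
    return x

# Minimize f(x) = (x - 3)^2 / 2 given only noisy gradients
# g(x) = (x - 3) + noise; the iterates converge a.s. to x = 3.
random.seed(0)
x_star = robbins_monro(lambda x: (x - 3.0) + random.gauss(0.0, 1.0), x0=0.0)
print(x_star)  # close to 3
```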
I think it's fair to summarize the reason for the current status and inclusion of DE as: theoretical convergence rates and papers are nice, but code that actually works on a good set of benchmarks is way more important.
Some years ago we had only bad global optimizers. We did have simulated annealing, but it just didn't work for most problems. No one stepped up to improve it, so we deprecated and removed it.
DE was benchmarked quite thoroughly (on the benchmark functions from benchmarks/benchmarks/test_go_benchmark_functions.py) and came out looking good. It solved problems that the only other good optimizer we had (basinhopping) did not do well on. So that means there was value in adding it.
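[Editor's note] As a small illustration of the kind of problem DE handles well, a sketch using `scipy.optimize.differential_evolution` (available since scipy 0.15) on the multimodal Ackley benchmark function, one of the classic global-optimization test problems:

```python
# Sketch: differential evolution on the 2-D Ackley function, which has
# many local minima and a global minimum of 0 at the origin.
import numpy as np
from scipy.optimize import differential_evolution

def ackley(x):
    x = np.asarray(x)
    n = x.size
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / n))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / n) + 20.0 + np.e)

bounds = [(-5.0, 5.0)] * 2
result = differential_evolution(ackley, bounds, seed=1)
print(result.x, result.fun)  # x near the origin, fun near 0
```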
Cheers, Ralf
It does raise the question of what to define as an "improvement" for a DE optimizer, since it is only empirical. Is an improvement something that makes it work better on a fixed set of examples?
Unless the set of examples is too small or the improvement is tailored for those examples (which requires a bit of judgement by the reviewers), yes.
Is it something that makes it closer to the seminal article describing it or a later more general description of this approach?
That has to be judged on a case by case basis I'd say.
I find it a bit worrying that theoretical bounds or rates were never a consideration for a method to be elected as part of Scipy. I would have thought that some scientific foundation was a requirement.
There is a large body of peer-reviewed work for DE, and Google Scholar tells me that the original Storn and Price paper has 11652 citations. So that very, very clearly meets any requirement for a scientific foundation we have. I'm not sure if your questioning of the reasons for including DE has anything to do with your original question of adding new methods to KDTree, but on that topic: I would suggest sketching the picture of pros and cons for Jake's suggestion of a scikit. How wide would the scope of that scikit potentially be? Is there a smaller subset of methods that would make sense for scipy? Yu Feng and Sturla Molden have been improving cKDTree pretty much every release, so it's certainly not code that is in maintenance-only mode. Ralf
Tue, 06 Sep 2016 20:23:05 +0200, Sylvain Corlay kirjoitti:
I understand (especially about the scikits), although I was surprised by some recently added features in scipy, such as differential evolution
Also, it could be helpful to list these other added features that you think are problematic. -- Pauli Virtanen
Sylvain Corlay <sylvain.corlay@gmail.com> wrote:
Evolutionary optimization is definitely cool and may work strikingly well for certain problems, but I was surprised that it got elected to inclusion in scipy.
You are not the only one to wonder about this. But I pick my battles carefully, and this did not merit my protests. For the same reason I think ball-trees would be a more important addition than PCA-trees or vp-trees. Sturla
Robert Lucente <rlucente@pipeline.com> wrote:
I noticed in a recent email that cKDTree was mentioned.
Q: What is the relationship, if any, between SciPy and scikit-learn when it comes to cKDTree?
The scope of the projects is slightly different. E.g. SciPy allows periodic query spaces for cosmology simulations. Scikit-learn allows arbitrary distance measures for machine learning. SciPy only allows a small set of distance measures (Minkowski distances) for performance. SciPy releases the GIL for the benefit of user interfaces, games, parallel processing and computer graphics. For this reason all the heavy lifting is done in C++. I am not sure about scikit-learn, but it is written in Cython, which usually has somewhat less room for GIL release. SciPy supports parallel (multithreaded) queries in two of the methods. I think scikit-learn depends on joblib for parallel processing. There are different authors. The kd-tree and ball-tree in scikit-learn were written by Jake Vanderplas. cKDTree was mainly written by Anne Archibald, Yu Feng, and me. Sturla
participants (8)
- Daπid
- Gael Varoquaux
- Jacob Vanderplas
- Pauli Virtanen
- Ralf Gommers
- Robert Lucente
- Sturla Molden
- Sylvain Corlay