Proposal for Scikit-Signal - a SciPy toolbox for signal processing
Hi
I gave a talk at SciPy India 2011 about a Python implementation of the Hilbert-Huang Transform that I was working on. The HHT is a method used as an alternative to Fourier and Wavelet analyses of nonlinear and nonstationary data. Following the talk Gael Varoquaux said that there's room for a separate scikit for signal processing. He also gave a lightning talk about bootstrapping a SciPy community project soon after.
So with this list let us start working out what the project should be like.
For noobs like me, Gael's talk was quite a useful guide. Here's the link to a gist he made about it - https://gist.github.com/1433151
Here's the link to my SciPy talk: http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy
I personally am researching nonlinear and nonstationary signal processing; I'd love to know what others can bring to this project. Also, let's talk about the limitations of the current signal processing tools available in SciPy and other scikits. I think there's a lot of documentation to be worked out, and there is also a lack of physically meaningful examples in the documentation.
I've been playing with Empirical Mode Decomposition, and coded it up in numpy, and it is pretty neat, but I do believe that NASA has patented it, which would probably preclude distributing it in a scikit without a lot of legal effort.

From the NASA website: "An example of successfully brokering NASA technology through a no-cost brokerage partnership was the exclusive license for the Hilbert-Huang Transform, composed of 10 U.S. patents and one domestic patent application, which was part of a lot auctioned by Ocean Tomo Federal Services LLC, in October 2008."

Alan
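In case it helps the discussion, the sifting step at the heart of EMD is short enough to sketch in NumPy/SciPy. This is a minimal illustration only; the name `sift_once` and the crude endpoint anchoring are my own choices here, not Alan's implementation:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    """One sifting iteration: subtract the mean of the upper and
    lower cubic-spline envelopes through the local extrema."""
    max_idx = argrelextrema(x, np.greater)[0]
    min_idx = argrelextrema(x, np.less)[0]
    if len(max_idx) < 2 or len(min_idx) < 2:
        return x  # too few extrema to build envelopes; treat as residue
    # Anchor both splines at the endpoints (a crude edge-effect fix).
    max_idx = np.r_[0, max_idx, len(x) - 1]
    min_idx = np.r_[0, min_idx, len(x) - 1]
    upper = CubicSpline(t[max_idx], x[max_idx])(t)
    lower = CubicSpline(t[min_idx], x[min_idx])(t)
    return x - (upper + lower) / 2.0

t = np.linspace(0.0, 1.0, 512)
x = np.sin(2 * np.pi * 4 * t) + 0.5 * t   # oscillation plus a slow trend
h = sift_once(x, t)                        # candidate IMF after one pass
```

Repeating the sift on h, and then on the residue once an IMF is extracted, is what builds up the full decomposition; a real implementation needs a stopping criterion and much better boundary handling.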
Hi Alan
I've been playing with Empirical Mode Decomposition, and coded it up in numpy, and it is pretty neat, but I do believe that NASA has patented it, which would probably preclude distributing it in a scikit, without a lot of legal effort.
EMD and HHT are nowhere in the scikit plan right now (unless others decide to put them up); I simply mentioned them because they're among my interests. Let's start talking about a basic signal processing scikit first. I guess there will be enough room for adaptive methods like the HHT later.

Regards
Hi all,

I think a scikit-signal would be a nice project in the scipy ecosystem. My gut feeling is that there are already a lot of great pieces of code for signal processing in Python, but they are too fragmented. Most of it is in scipy.signal, but one may need some pieces from scipy.ndimage and from external projects, for example for wavelets. It would be neat to have a main entry point for signal processing in Python.

As demonstrated by scikit-learn, a small and great project can emerge from scipy/numpy. The benefit is that the entry cost can be much lower for a developer compared to contributing directly to scipy, and a small project can release more often and eventually backport new stuff into scipy core.

As I already said to Jaidev, I think one should start by defining the scope of such a project and listing/reviewing existing code to bootstrap the project.

Best,
Alex

On Tue, Dec 27, 2011 at 6:55 PM, Jaidev Deshpande <deshpande.jaidev@gmail.com> wrote:
[snip]

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev
Hi Jaidev, On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande <deshpande.jaidev@gmail.com> wrote:
[snip]
I think it would be a good addition to the ecosystem. We are, for example, missing a lot of core algorithms even for linear signal processing, and scipy.signal itself would benefit from some refactoring.

I myself started something (the talkbox scikit), but realistically, I won't have time to work on a full toolbox, so we can consolidate here. I would be willing to work on merging what I already have into what you have in mind:

- Linear prediction coding with a Levinson-Durbin implementation
- A start of a periodogram function

I could spend time to implement a few more things like MUSIC/PENCIL, and some basic matching pursuit algorithms.

cheers,

David
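For readers who haven't met it, the Levinson-Durbin recursion solves the Toeplitz normal equations of linear prediction in O(p^2) instead of O(p^3). A rough NumPy sketch of the algorithm (an illustration only, not the talkbox code):

```python
import numpy as np

def levinson_durbin(r, order):
    """Given autocorrelation lags r[0..order], find the prediction
    polynomial A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order and the
    final prediction error, via the classic Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # reflection coefficient from the current prediction error
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        # order-update: a_m[i] = a_{m-1}[i] + k * a_{m-1}[m-i]
        a[1:m + 1] += k * a[m - 1::-1]
        err *= (1.0 - k * k)
    return a, err

# AR(1) sanity check: r[k] = 0.5**k should give A(z) = 1 - 0.5 z^-1
r = np.array([1.0, 0.5, 0.25])
a, err = levinson_durbin(r, 2)
```

The `+=` update is safe despite the overlapping slices because `k * a[m-1::-1]` is materialized as a temporary before the in-place add.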
On Wed, Dec 28, 2011 at 1:16 PM, David Cournapeau <cournape@gmail.com> wrote:
[snip]
I think it would be a good addition to the ecosystem. We are for example missing a lot of core algorithms even for linear signal processing, and scipy.signal itself would benefit from some refactoring.
I myself started something (the talkbox scikit), but realistically, I won't have time to work on a full toolbox, so we can consolidate here. I would be willing to work on merging what I already have into what you have in mind:

- Linear prediction coding with a Levinson-Durbin implementation
- A start of a periodogram function
I could spend time to implement a few more things like MUSIC/PENCIL, and some basic matching pursuit algorithms.
Depending on your scope, nitime will also be interesting; it has much more in terms of multi-dimensional signals, e.g. a multivariate Levinson-Durbin and various cross-spectral functions. statsmodels has quite a bit of time series analysis now, but the focus and datasets are pretty different, although I benefited from reading the scipy.signal, talkbox and matplotlib code for some basic tools.

Josef
cheers,
David
depending on your scope, nitime will also be interesting which has much more in terms of multi-dimensional signals, e.g. a multivariate Levinson Durbin and various cross-spectral functions.
I fully agree. nitime hides some great pieces of code like Slepian tapers and various cross-spectrum/coherence tools. I feel this code should be visible to a broader audience, which would favor cross-fertilization between scientific domains and help nitime reach a critical mass of users/contributors.

Alex
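To give a flavour of what surfacing that code could look like: given DPSS (Slepian) tapers, a bare-bones multitaper spectrum estimate is only a few lines. This sketch assumes a SciPy that ships scipy.signal.windows.dpss; nitime's actual estimators are considerably more careful (adaptive taper weighting, etc.):

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, fs=1.0, NW=4.0):
    """Average the periodograms of DPSS-tapered copies of x
    (one-sided, density scaling; Nyquist-bin subtlety ignored)."""
    n = len(x)
    K = int(2 * NW - 1)                 # a common choice of taper count
    tapers = dpss(n, NW, Kmax=K)        # shape (K, n), unit-energy rows
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    psd = spectra.mean(axis=0) / fs
    psd[1:] *= 2.0                      # fold in negative frequencies
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

fs = 1024.0
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 100.0 * t)
freqs, psd = multitaper_psd(x, fs=fs)   # broad peak centred near 100 Hz
```

The peak is smeared over roughly the design bandwidth (about NW * fs / n on either side), which is the price paid for the reduced variance of the estimate.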
David Cournapeau wrote:
[snip]
I think it would be a good addition to the ecosystem. We are for example missing a lot of core algorithms even for linear signal processing, and scipy.signal itself would benefit from some refactoring.
I myself started something (the talkbox scikit), but realistically, I won't have time to work on a full toolbox, so we can consolidate here. I would be willing to work on merging what I already have into what you have in mind:

- Linear prediction coding with a Levinson-Durbin implementation
- A start of a periodogram function
I could spend time to implement a few more things like MUSIC/PENCIL, and some basic matching pursuit algorithms.
cheers,
David
I have code for a periodogram that you can use. I'm using my own FFT wrapper around FFTW (via PyUblas), but you can easily modify this to use some other FFT library.
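For comparison, here is what a plain one-sided periodogram looks like on top of numpy's built-in FFT (a sketch with the usual density scaling, not the FFTW-backed code mentioned above):

```python
import numpy as np

def periodogram(x, fs=1.0):
    """One-sided periodogram with density scaling: summing
    psd * (fs / n) over all bins recovers mean(x**2)."""
    n = len(x)
    X = np.fft.rfft(x)
    psd = (np.abs(X) ** 2) / (fs * n)
    # Double every bin that has a mirrored negative frequency
    # (all except DC, and except Nyquist when n is even).
    last = -1 if n % 2 == 0 else None
    psd[1:last] *= 2.0
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

fs, n = 128.0, 128
t = np.arange(n) / fs
freqs, psd = periodogram(np.sin(2 * np.pi * 8.0 * t), fs=fs)
```

For a unit sine exactly on a frequency bin, all the power (0.5 in density-times-bin-width terms) lands in the single bin at 8 Hz.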
On Mon, Dec 26, 2011 at 5:39 PM, Jaidev Deshpande <deshpande.jaidev@gmail.com> wrote:
[snip]
Jaidev,

at this point, I think we should just start with actual code. Could you register a scikit-signal organization on GitHub? I could then start populating a project skeleton, and then everyone can start adding actual code.

regards,

David
On Tue, Jan 3, 2012 at 1:14 AM, David Cournapeau <cournape@gmail.com> wrote:
[snip]
Jaidev,
at this point, I think we should just start with actual code. Could you register a scikit-signal organization on github ? I could then start populating a project skeleton, and then everyone can start adding actual code
This sounds like a great idea. Given that the 'learn', 'image' and 'statsmodels' projects have dropped (or will soon drop) the 'scikits' namespace, should the 'signal' project not bother using the 'scikits' namespace? Maybe you've already thought about this, but if not, it is something to consider. Warren
Given that the 'learn', 'image' and 'statsmodels' projects have dropped (or will soon drop) the 'scikits' namespace, should the 'signal' project not bother using the 'scikits' namespace? Maybe you've already thought about this, but if not, it is something to consider.
I would still vote for sksignal as the import name (like sklearn) and scikit-signal as the brand name. It's convenient to type sk then <TAB> to get the list of scikits with IPython autocompletion.

Alex
Hi David,
Could you register a scikit-signal organization on github ? I could then start populating a project skeleton, and then everyone can start adding actual code
The organization's up at https://github.com/scikit-signal

I've never done this before, by the way, so just let me know if you want any changes. Also, who'd like to be owners?

Thanks
On Tue, Jan 3, 2012 at 8:58 AM, Jaidev Deshpande <deshpande.jaidev@gmail.com> wrote:
[snip]
My GitHub account: cournape. I will start on the scikit-signal package as soon as you give me the privileges.

cheers,

David
I don't know if this has already been discussed or not, but I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.

I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.

I disagree with Gael that there should be a scikits-signal package. There are too many scikits already that should just be scipy projects (with scipy available in modular form). In my mind, almost every scikits- project should just be a scipy- project. There really was no need for the scikits namespace in the first place.

Signal processing was the main thing I started writing SciPy for in the first place. These are the tools that made Matlab famous, and I've always wanted Python to have the best-of-breed algorithms for them. To me, SciPy as a project has failed if general signal processing tools are being written in other high-level packages. I've watched this trend away from common development in SciPy in image processing, machine learning, optimization, and differential equation solution with some sadness over the past several years. Frankly, it makes me want to just pull out all of the individual packages I wrote that originally got pulled together into SciPy into separate projects and develop them individually from there, leaving it to packaging and distribution to pull them together again.

Hmm... perhaps that is not such a bad idea. What do others think? What should really be in core SciPy and what should be in other packages? Perhaps it doesn't matter now, and SciPy should just be maintained as it is, with new features added in other packages?

A lot has changed in the landscape since Pearu, Eric, and I released SciPy. Many people have contributed to the individual packages, but the vision has waned for the project as a whole. The SciPy community is vibrant and alive, but the SciPy project does not seem to have a coherent goal. I'd like to see that changed this year if possible.

In working on SciPy for .NET, I did a code.google search for open source packages that were relying on scipy imports. What I found was that almost all uses of scipy were: linalg, optimize, stats, special. It makes the case that scipy as a package should be limited to that core set of tools (and their dependencies). All the other modules should just be distributed as separate projects/packages.

What is your experience? What packages in scipy do you use?

Thanks,

-Travis

On Jan 3, 2012, at 1:14 AM, David Cournapeau wrote:
[snip]
Jaidev,
at this point, I think we should just start with actual code. Could you register a scikit-signal organization on github ? I could then start populating a project skeleton, and then everyone can start adding actual code
regards,
David
On 1/3/12 3:00 AM, Travis Oliphant wrote:
[snip]
What is your experience? what packages in scipy do you use?
Thanks,
-Travis
In my experience, I have not used scikits, and I mainly use the scipy.signal package. I don't have a strong opinion on whether .signal should be part of core scipy or an independent package, but it seems that there should be one package, and hopefully one development effort. In general, I favor extending and enhancing the current .signal (regardless of whether it is part of scipy or not) rather than fragmenting the signal processing code across multiple packages.

Regards,
Chris
On Tue, Jan 3, 2012 at 09:00, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
I disagree with Gael that there should be a scikits-signal package. There are too many scikits already that should just be scipy projects (with scipy available in modular form). In my mind, almost every scikits- project should just be a scipy- project. There really was no need for the scikits namespace in the first place.
To be fair, the idea of the scikits namespace formed when the landscape was quite different and may no longer be especially relevant, but it had its reasons. Some projects can't go into the monolithic scipy-as-it-is for license, build, or development cycle reasons. Saying that scipy shouldn't be monolithic then is quite reasonable by itself, but no one has stepped up to do the work (I took a stab at it once). It isn't a reasonable response to someone who wants to contribute something. Enthusiasm isn't a fungible quantity. Someone who just wants to contribute his wrapper for whatever and is told to first go refactor a mature package with a lot of users is going to walk away. As they should.

Instead, we tried to make it easier for people to contribute their code to the Python world. At the time, project hosting was limited, so Enthought's offer of sharing scipy's SVN/Trac/mailing list infrastructure was useful. Now, not so much. At the time, namespace packages seemed like a reasonable technology. Experience both inside and outside scikits has convinced most of us otherwise. One thing that does not seem to have changed is that some people still want some kind of branding to demonstrate that their package belongs to this community.

We used the name "scikits" instead of "scipy" because we anticipated confusion about what was in scipy-the-monolithic-package and what was available in separate packages (and, since we were using namespace packages, technical issues with namespace packages and the non-empty scipy/__init__.py file).

You don't say what you think "being a scipy- project" means, so it's hard to see what you are proposing as an alternative.
Signal processing was the main thing I started writing SciPy for in the first place. These are the tools that made Matlab famous and I've always wanted Python to have the best-of-breed algorithms for. To me SciPy as a project has failed if general signal processing tools are being written in other high-level packages. I've watched this trend away from common development in SciPy in image processing, machine learning, optimization, and differential equation solution with some sadness over the past several years. Frankly, it makes me want to just pull out all of the individual packages I wrote that originally got pulled together into SciPy into separate projects and develop them individually from there. Leaving it to packaging and distribution issues to pull them together again.
Hmm.. perhaps that is not such a bad idea. What do others think? What should really be in core SciPy and what should be in other packages? Perhaps it doesn't matter now and SciPy should just be maintained as it is with new features added in other packages? A lot has changed in the landscape since Pearu, Eric, and I released SciPy. Many people have contributed to the individual packages --- but the vision has waned for the project has a whole. The SciPy community is vibrant and alive, but the SciPy project does not seem to have a coherent goal. I'd like to see that changed this year if possible.
In working on SciPy for .NET, I did a code.google search for open source packages that were relying on scipy imports. What I found was that almost all cases of scipy were: linalg, optimize, stats, special. It makes the case that scipy as a packages should be limited to that core set of tools (and their dependencies). All the other modules should just be distributed as separate projects / packages.
As you say, the landscape has changed significantly. Monolithic packages are becoming less workable as the number of things we want to build/wrap is increasing. Building multiple packages that you want has also become marginally easier. At least, easier than trying to build a single package that wraps everything you don't want. It was a lot easier to envision everything under the sun being in scipy proper back in 2000.

I think it would be reasonable to remake scipy as a slimmed-down core package (with deprecated compatibility stubs for a while) with a constellation of top-level packages around it. We could open up the github.com/scipy organization to those other projects who want that kind of branding, though that still does invite the potential confusion that we tried to avoid with the "scikits" name. That said, since we don't need to fit it into a valid namespace package name, just using the branding of calling them a "scipy toolkit" or "scipy addon" would be fine. Breaking up scipy might help the individual packages develop and release at their own pace.

But mostly, I would like to encourage the idea that one should not be sad or frustrated when people contribute open source code to our community just because it's not in scipy or any particular package (or, for that matter, using the "right" DVCS). The important thing is that it is available to the Python community and that it works with the other tools that we have (i.e. talks with numpy). If your emotional response is anything but gratitude, then it's unworthy of you.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Jan 3, 2012, at 5:47 AM, Robert Kern wrote:
On Tue, Jan 3, 2012 at 09:00, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
I disagree with Gael that there should be a scikits-signal package. There are too many scikits already that should just be scipy projects (with scipy available in modular form). In my mind, almost every scikits-project should just be a scipy-project. There really was no need for the scikits namespace in the first place.
To be fair, the idea of the scikits namespace formed when the landscape was quite different and may no longer be especially relevant, but it had its reasons. Some projects can't go into the monolithic scipy-as-it-is for license, build, or development cycle reasons. Saying that scipy shouldn't be monolithic then is quite reasonable by itself, but no one has stepped up to do the work (I took a stab at it once). It isn't a reasonable response to someone who wants to contribute something. Enthusiasm isn't a fungible quantity. Someone who just wants to contribute his wrapper for whatever and is told to first go refactor a mature package with a lot of users is going to walk away. As they should.
This is an excellent point. I think SciPy suffers from the same issues that also affect the Python standard library. Like any organization, there is a dynamic balance between "working together" and "communication overhead" / dealing with legacy issues. I'm constantly grateful and inspired by the code that gets written and contributed by individuals. I would just like to see all of this code get more traction (and simple entry points are key for that). It's the main reason for my desire to see a Foundation that can sponsor the community. My previously mentioned sadness comes from my inability to contribute meaningfully over the past couple of years, and the missing full time effort that would help keep the SciPy project more cohesive. I'm hopeful this can change either directly or indirectly this year. Just to be clear, any sadness and frustration I feel is not with anyone in the community of people who are spending their free time writing code and contributing organizational efforts to making SciPy (both the package and the community) what it is. My frustration is directed squarely at myself for not being able to do more, both personally and in funding and sponsoring more. In the end, I would just like to see more resources devoted to these efforts. -Travis
Hi Travis, It is good that you are asking these questions. I think that they are important. Let me try to give my view on some of the points you raise.
There are too many scikits already that should just be scipy projects
I used to think pretty much as you did: I don't want to have to depend on too many packages. In addition, we are a community, so why so many packages? My initial vision when investing in scikit-learn was that we would merge it back into scipy after a while. The dynamic of the project has changed my way of seeing things a bit, and I now think that it is a good thing to have scikits-like packages that are more specialized than scipy, for the following reasons. 1. Development is technically easier in smaller packages. A developer working on a specific package does not need to tackle the complexity of the full scipy suite. Building can be made easier, as scipy must (for good reasons) depend on Fortran and C++ packages. It is well known that the complexity of developing a project grows super-linearly with the number of lines of code. It is also much easier to achieve short release cycles. Short release cycles are critical to the dynamic of a community-driven project (and I'd like to thank our current release manager, Ralf Gommers, for his excellent work). 2. Narrowing the application domain helps developers and users. It is much easier to make entry points, in the code and in the documentation, with a given application in mind. Also, best practices and conventions may vary between communities. While this is (IMHO) one of the tragedies of contemporary science, such domain specialization helps people feel comfortable. Computational trade-offs tend to be fairly specific to a given context. For instance, machine learning is more often interested in datasets with a large number of features and a (comparatively) small number of samples, whereas in statistics it is the opposite. Thus the same algorithm might be implemented differently. Catering for all needs tends to make the code much more complex, and may confuse the user by presenting too many options. Developers cannot be experts in everything.
If I specialize in machine learning, and follow the recent developments in the literature, chances are that I do not have time to be competitive in numerical integration. Having too wide a scope in a project means that each developer understands well only a small fraction of the code. That makes things really hard for the release manager, but also for day-to-day work, e.g. deciding what to do with a new broken test. 3. It is easier to build an application-specific community. An application-specific library is easier to brand. One can tailor a website, a user manual, and conference presentations or papers to an application. As a result, the project gains visibility in the community of scientists and engineers it targets. Also, having more focused mailing lists helps build enthusiasm, as they have less volume and stay focused on the questions that people are interested in. Finally, a sad but true statement is that people tend to get more credit when working on an application-specific project than on a core layer. Similarly, it is easier for me to get funding to develop an application-specific project. On a positive note, I would like to stress that I think scikit-learn has had a generally positive impact on the scipy ecosystem, including for those who do not use it, or who do not care at all about machine learning. First, it is drawing more users into the community, and as a result, there is more interest and money flying around. But more importantly, when I look at the latest release of scipy, I see many new contributors who are also scikit-learn contributors (not only Fabian). This can be partly explained by the fact that getting involved in scikit-learn was an easy, high-return-on-investment move for them, but they quickly grew to realize that the base layer could be improved. We have always had the vision to push into scipy any improvement that was general enough to be useful across application domains.
Remember, David Cournapeau was lured into the scipy business by working on the original scikit-learn.
Frankly, it makes me want to pull out all of the individual packages I wrote that originally got pulled together into SciPy into separate projects and develop them individually from there.
What you are proposing is interesting; that said, I think that the current status quo with scipy is a good one. Having a core collection of numerical tools is, IMHO, a key element of the Python scientific community for two reasons: * For the user, knowing that he will find the answer to most of his simple questions in a single library makes it easy to start. It also makes the ecosystem easier to document. * Different packages need to rely on a lot of common generic tools: linear algebra, sparse linear algebra, simple statistics and signal processing, a simple black-box optimizer, interpolation, ndimage-like processing. Indeed, you ask which packages in scipy people use. Actually, in scikit-learn we use all sub-packages apart from 'integrate'. I checked, and we even use 'io' in one of the examples. Any code doing high-end application-specific numerical computing will need at least a few of the packages of scipy. Of course, a package may need an optimizer tailored to a specific application, in which case they will roll their own, and this effort might be duplicated a bit. But having the common core helps consolidate the ecosystem. So the setup that I am advocating is a core library with many other satellite packages. Or rather, a constellation of packages that use each other, rather than a monolithic universe. Breaking a package up into parts that can be used independently is a common strategy to make them lighter and hopefully ease the development of the whole. For instance, this is what was done for ETS (the Enthought Tool Suite). And we have all seen this strategy go bad, for instance in the situation of 'dependency hell', in which all packages start depending on each other, installation becomes an issue, and there is a gridlock of version-compatibility bugs. This is why any such ecosystem must have an almost tree-like structure in its dependency graph.
Some packages must be at the top of the graph, more 'core' than others, and as we descend the graph, packages can reduce their dependencies. I think that we have more or less this situation with scipy, and I am quite happy about it. Now, I hear your frustration when this development happens a bit in the wild, with no visible construction of an ecosystem. This ecosystem does get constructed via the scipy mailing lists, conferences, and in general the community, but it may not be very clear to the external observer. One reason why my group decided to invest in scikit-learn was that it was the learning package that seemed the closest in terms of code and community connections. This was the virtue of the 'scikits' branding. For technical reasons, the different scikits have started getting rid of this namespace in the module import. You seem to think that the branding name 'scikits' does not reflect accurately the fact that they are tight members of the scipy constellation. While I must say that I am not a huge fan of the name 'scikits', we have now invested in it, and I don't think that we can easily move away. If the problem is a branding issue, it may be partly addressed with appropriate communication. A set of links across the different web pages of the ecosystem, and a central document explaining the relationships between the packages, might help. But this idea is not completely new, and it simply is waiting for someone to invest time in it. For instance, there was the project of reworking the scipy.org homepage. Another important problem is the question of what sits 'inside' this collection of tools and what is outside. The answer to this question will pretty much depend on who you ask. In practice, for the end user, it is very much conditioned by what meta-package they can download. EPD, Sage, Python(x,y), and many others give different answers.
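Gael's requirement that the ecosystem keep an "almost tree-like" dependency graph can be stated mechanically: the package graph must topologically sort without hitting a cycle. A small runnable sketch, with purely illustrative package names:

```python
# Illustrative dependency graph: package -> list of packages it depends on.
deps = {
    "scipy-core": [],
    "scipy-signal": ["scipy-core"],
    "scikit-learn": ["scipy-core", "scipy-signal"],
}

def topo_order(graph):
    """Return packages in build order; raise on a dependency cycle."""
    order, done, visiting = [], set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError("dependency cycle at %r" % node)
        visiting.add(node)
        for dep in graph.get(node, []):
            visit(dep)
        visiting.discard(node)
        done.add(node)
        order.append(node)

    for node in graph:
        visit(node)
    return order

print(topo_order(deps))  # prints ['scipy-core', 'scipy-signal', 'scikit-learn']
```

Adding a back-edge (say, making "scipy-core" depend on "scikit-learn") makes topo_order raise, which is exactly the 'dependency hell' gridlock described above.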
To conclude, I'd like to stress that, in my eyes, what really matters is a solution that gives us a vibrant community, with a good production of quality code and documentation. I think that the current set of small projects makes it easier to gather developers and users, and that it works well as long as they talk to each other and do not duplicate too much of each other's functionality. If on top of that they are BSD-licensed and use numpy as their data model, I am a happy man. What I am pushing for is a bazaar-like development model, in which it is easy for various approaches answering different needs to develop in parallel with different compromises. In such a context, I think that Jaidev could kick-start a successful and useful scikit-signal. Hopefully this would not preclude improvements to the docs, examples, and existing code in scipy.signal. Sorry for the long post, and thank you for reading. Gael
Hi Gael, Thanks for your email. I appreciate the detailed response. Please don't misinterpret my distaste for the scikit namespace as anything more than organizational. I'm very impressed with most of the scikits themselves, scikit-learn being a particular favorite. It is very clear to most people that smaller teams and projects are useful for diverse collaboration and very effective at involving more people in development. This is all very good, and I'm very encouraged by this development. Even in the SciPy package itself, active development happens on only a few packages which have received attention from small teams. Of course, the end user wants integration, so the more packages exist, the more we need tools like EPD, ActivePython, Python(x,y), and Sage (and corresponding repositories like CRAN). The landscape is much better in this direction than it was earlier, but packaging and distribution is still a major weak point in Python. I think the scientific computing community should continue to develop its own packaging solutions. I've been a big fan of David Cournapeau's work in this area (Bento being his latest effort). Your vision of a bazaar model is a good one. I just think we need to get scipy itself more into that model. I agree it's useful to have a core set of common functionality, but I am quite in favor of moving to a more tight-knit core for the main scipy package with additional scipy-*named* packages (e.g. scipy-odr), etc. These can install directly into the scipy package infrastructure (or use whatever import mechanisms the distributions desire). This move to more modular packages for SciPy itself has been on my mind for a long time, which is certainly why I see the scikits namespace as superfluous. But, I understand that branding means something.
So, my (off the top of my head) take on what should be core scipy is: fftpack, stats, io, special, optimize, linalg, lib.blas, lib.lapack, misc. I think the other packages should be maintained, built and distributed as: scipy-constants, scipy-integrate, scipy-cluster, scipy-ndimage, scipy-spatial, scipy-odr, scipy-sparse, scipy-maxentropy, scipy-signal, scipy-weave (actually, I think weave should be installed separately and/or merged with other foreign-code integration tools like fwrap, f2py, etc.). Then, we could create a scipy superpack to install it all together. What issues do people see with a plan like this? Obviously it takes time and effort to do this. But, I'm hoping to find time or sponsor people who will have time to do this work. Thus, I'd like to have the conversation to find out what people think *should* be done. There also may be students looking for a way to get involved, or people interested in working on Google Summer of Code projects. Thanks, -Travis On Jan 3, 2012, at 9:44 AM, Gael Varoquaux wrote:
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
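Travis's "superpack" above can be read as a meta-package that carries no code of its own and simply depends on the split-out pieces, so that one install restores today's monolithic experience. A minimal sketch of what its metadata might look like; the names are hypothetical, taken from the proposal, and not real PyPI packages:

```python
# Hypothetical metadata for a "scipy superpack" meta-package: installing
# it would pull in the slimmed-down core plus every split-out piece.
SUPERPACK = {
    "name": "scipy-superpack",      # illustrative name, not a real package
    "version": "0.1",
    "install_requires": [
        "scipy",                    # the slimmed-down core
        "scipy-constants",
        "scipy-integrate",
        "scipy-cluster",
        "scipy-ndimage",
        "scipy-spatial",
        "scipy-odr",
        "scipy-sparse",
        "scipy-maxentropy",
        "scipy-signal",
    ],
}

print(len(SUPERPACK["install_requires"]))  # prints 10
```

With setuptools, this would amount to little more than setup(**SUPERPACK) in the meta-package's setup.py.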
On Wed, Jan 4, 2012 at 06:37, Travis Oliphant <travis@continuum.io> wrote:
The main technical issue/decision is how to split up the "physical" packages themselves. Do we use namespace packages, such that scipy.signal will still be imported as "from scipy import signal", or do we rename the packages such that each one is its own top-level package? It's important to specify this when making a proposal, because each imposes different costs that we may want to factor into how we divide up the packages. I think the lesson we've learned from scikits (and ETS, for that matter) is that this community, at least, does not want to use namespace packages. Some of this derives from a distaste for setuptools, which is used in the implementation, but a lot of it derives from the very concept of namespace packages, independent of any implementation. Monitoring the scikit-learn and pystatsmodels mailing lists, I noticed that a number of installation problems stemmed just from having the top-level package being "scikits" and shared between several packages. This is something that can only be avoided by not using namespace packages altogether. There are also technical issues that cut across implementations. Namely, the scipy/__init__.py files would need to be identical between all of the packages. Maintaining non-empty identical __init__.py files is not feasible. We don't make many changes to it these days, but we won't be able to make *any* changes ever again. We could empty it out, if we are willing to make this break with backwards compatibility once. Going with unique top-level packages, do we use a convention like "scipy_signal", at least for the packages being broken out from the current monolithic scipy? Do we provide a proxy package hierarchy for backwards compatibility (e.g. having proxy modules like scipy/signal/signaltools.py that just import everything from scipy_signal/signaltools.py), like Enthought does with etsproxy after the ETS split?
-- Robert Kern
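The etsproxy-style compatibility layer Robert describes can be sketched in a few lines. Everything here is hypothetical (there is no scipy_signal package), so an in-memory module stands in for the relocated code to keep the mechanism runnable:

```python
import sys
import types

# Stand-in for the relocated package (in reality, an installed
# top-level package such as the hypothetical 'scipy_signal').
new_mod = types.ModuleType("scipy_signal_signaltools")
new_mod.hilbert = lambda x: x  # placeholder public function
sys.modules["scipy_signal_signaltools"] = new_mod

# Source of the proxy module that would live at the old import path
# (e.g. scipy/signal/signaltools.py): warn, then re-export everything.
proxy_src = """
import warnings
warnings.warn(
    "this module has moved; import the new top-level package instead",
    DeprecationWarning, stacklevel=2,
)
from scipy_signal_signaltools import *
"""

proxy = types.ModuleType("old_signaltools")
exec(proxy_src, proxy.__dict__)

# Old-style access still works, resolved via the new location.
print(proxy.hilbert(42))  # prints 42
```

The real cost Robert points to is that every such proxy file must be maintained in lockstep with the new hierarchy, which is why the choice between namespace packages and unique top-level names matters.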
On Wed, Jan 4, 2012 at 1:37 AM, Travis Oliphant <travis@continuum.io> wrote: <snip>
My first thought is that what is 'core' could use a little more discussion. We are using parts of integrate and signal in statsmodels so our dependencies almost double if these are split off as a separate installation. I'd suspect others might feel the same. This isn't a deal breaker though, and I like the idea of being more modular, depending on how it's implemented and how easy it is for users to grab and install different parts. Skipper
On Wed, Jan 4, 2012 at 9:30 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
I think that breaking up scipy just gives us a lot more installation problems, and if it's merged together again into a superpack, then it wouldn't change a whole lot, but it would increase the work of release management. I wouldn't mind if weave were split out, since it crashes and I never use it. The split-up is also difficult because of interdependencies: stats is a final-usage subpackage and doesn't need to be in the core, since it's not used by any other part, but AFAIK it itself uses at least integrate. optimize using sparse is at least one other case I know of. I've been in favor of cleaning up imports for a long time, but splitting up scipy means we can only rely on a smaller set of functions without increasing the number of packages that need to be installed. What if stats wants to use spatial or signal? Josef
josef.pktd@gmail.com wrote:
The splitup is also difficult because of interdependencies, stats is a final usage sub package and doesn't need to be in the core, it's not used by any other part, AFAIK it uses at least also integrate.
optimize uses sparse is at least one other case I know.
There could then be another level of split-up, per module, to circumvent these dependency problems. For instance the core optimize module would not include the nonlin module (the one depending on sparse) which would in turn be in scipy-optimize-nonlin, part of the "contrib" meta package. Also, somebody developing a new optimization solver would name their package scipy-optimize-$SOLVER so that it could be included in the contrib area.
What if stats wants to use spatial or signal?
The same would apply here. The bits from stats that want to use spatial would stay in the contrib area until spatial moves to core. -- Denis
On Wed, Jan 4, 2012 at 10:56 AM, Denis Laxalde <denis.laxalde@mcgill.ca> wrote:
josef.pktd@gmail.com wrote:
The splitup is also difficult because of interdependencies, stats is a final usage sub package and doesn't need to be in the core, it's not used by any other part, AFAIK it uses at least also integrate.
And interpolate, I think.
optimize uses sparse is at least one other case I know.
There could then be another level of split-up, per module, to circumvent these dependency problems. For instance the core optimize module would not include the nonlin module (the one depending on sparse) which would in turn be in scipy-optimize-nonlin, part of the "contrib" meta package. Also, somebody developing a new optimization solver would name their package scipy-optimize-$SOLVER so that it could be included in the contrib area.
What if stats wants to use spatial or signal?
The same would apply here. The bits from stats that want to use spatial would stay in the contrib area until spatial moves to core.
That sounds like it will be difficult to keep track of things. I don't see any clear advantages that would justify the additional installation problems. The advantage of the current scipy is that it is a minimal common set of functionality that we can assume a user has installed when we require scipy. scipy.stats, statsmodels and sklearn load large parts of scipy, but maybe not fully overlapping ones. If I want to use sklearn in addition to statsmodels, I don't have to worry about additional dependencies, since we try to stick with numpy and scipy as required dependencies (statsmodels also has pandas now). If we break up scipy, then we have to think about which additional sub- or sub-sub-packages users need to install before they can use the scikits, unless we require users to install a super-super-package that includes (almost) all of the current scipy. The next stage will be keeping track of versions. It sounds like a lot of fun if there are changes, and we not only have to check the numpy and scipy versions, but also the version of each sub-package. Nothing is impossible, I just don't see the advantage of moving away from the current one-click install that works very well on Windows. Josef
-- Denis _______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
On Wed, Jan 4, 2012 at 3:53 PM, <josef.pktd@gmail.com> wrote:
On Wed, Jan 4, 2012 at 9:30 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
On Wed, Jan 4, 2012 at 1:37 AM, Travis Oliphant <travis@continuum.io> wrote: <snip>
So, my (off the top of my head) take on what should be core scipy is:
fftpack stats io special optimize linalg lib.blas lib.lapack misc
I think the other packages should be maintained, built and distributed as
scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave (actually I think weave should be installed separately and/or merged with other foreign-code integration tools like fwrap, f2py, etc.)
Then, we could create a scipy superpack to install it all together. What issues do people see with a plan like this?
My first thought is that what is 'core' could use a little more discussion. We are using parts of integrate and signal in statsmodels so our dependencies almost double if these are split off as a separate installation. I'd suspect others might feel the same. This isn't a deal breaker though, and I like the idea of being more modular, depending on how it's implemented and how easy it is for users to grab and install different parts.
I think that breaking up scipy just gives us a lot more installation problems, and if it's merged together again into a superpack, then it wouldn't change a whole lot, but increase the work of the release management. I wouldn't mind if weave is split out, since it crashes and I never use it.
The split-up is also difficult because of interdependencies. stats is a final-usage subpackage and doesn't need to be in the core; it's not used by any other part. AFAIK it also uses at least integrate.
optimize using sparse is at least one other case I know of.
I've been in favor of cleaning up imports for a long time, but splitting up scipy means we can only rely on a smaller set of functions without increasing the number of packages that need to be installed.
What if stats wants to use spatial or signal?
I agree with Josef that splitting scipy will be difficult, and I suspect it's (a) not worth the pain and (b) that it doesn't solve the issue that I think Travis hopes it will solve (more development of the sub-packages). Installation, dependency problems and effort of releasing will probably get worse.

Looking at Travis' list of non-core packages I'd say that sparse certainly belongs in the core and integrate probably too. Looking at what's left:
- constants : very small and low cost to keep in core. Not much to improve there.
- cluster : low maintenance cost, small. not sure about usage, quality.
- ndimage : difficult one. hard to understand code, may not see much development either way.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
- maxentropy : is deprecated, will disappear.
- signal : not in great shape, could be viable independent package. On the other hand, if scikits-signal takes off and those developers take care to improve and build on scipy.signal when possible, that's OK too.
- weave : no point spending any effort on it. keep for backwards compatibility only, direct people to Cython instead.

Overall, I don't see many viable independent packages there. So here's an alternative to spending a lot of effort on reorganizing the package structure:
1. Formulate a coherent vision of what in principle belongs in scipy (current modules + what's missing).
2. Focus on making it easier to contribute to scipy. There are many ways to do this; having more accessible developer docs, having a list of "easy fixes", adding info to tickets on how to get started on the reported issues, etc. We can learn a lot from Sympy and IPython here.
3. Recognize that quality of code and especially documentation is important, and fill the main gaps.
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove them for scipy 1.0. I think that this applies only to maxentropy and weave.
5. Find a clear (group of) maintainer(s) for each sub-module. For people familiar with one module, responding to tickets and pull requests for that module would not cost so much time.

In my opinion, spending effort on improving code/documentation quality and attracting new developers (those go hand in hand) instead of reorganizing will have both more impact and be more beneficial for our users.

Cheers,
Ralf
Thanks for the feedback. My point was to generate discussion and start the ball rolling on exactly the kind of conversation that has started. Exactly as Ralf mentioned, the point is to get development on sub-packages --- something that the scikits effort and other individual efforts have done very, very well. In fact, it has worked so well that it taught me a great deal about what is important in open source.

My perhaps irrational dislike for the *name* "scikits" should not be interpreted as anything but a naming taste preference (and I am not known for my ability to choose names well anyway). I very much like and admire the community around scikits. I just would have preferred something easier to type (even just sci_* would have been better in my mind as high-level packages: sci_learn, sci_image, sci_statsmodels, etc.). I didn't feel like I was able to fully participate in that discussion when it happened, so you can take my comments now as simply historical and something I've been wanting to get off my chest for a while.

Without better packaging and dependency management systems (especially on Windows and Mac), splitting out code doesn't help those who are not distribution dependent (who themselves won't be impacted much). There are scenarios under which it could make sense to split out SciPy, but I agree that right now it doesn't make sense to completely split everything. However, I do think it makes sense to clean things up and move some things out in preparation for SciPy 1.0.

One thing that would be nice is a clear view of the state of documentation and examples for the different packages. Where is work there most needed?
Looking at Travis' list of non-core packages I'd say that sparse certainly belongs in the core and integrate probably too. Looking at what's left: - constants : very small and low cost to keep in core. Not much to improve there.
Agreed.
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
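For readers unfamiliar with what scipy.cluster currently offers, the K-means vector quantization referred to above is roughly the following. This is only an illustrative sketch; the synthetic two-blob data and the cluster count are made up for the example:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# two well-separated synthetic blobs in 2-D
data = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                  rng.normal(5.0, 0.3, size=(50, 2))])

# kmeans returns a codebook of centroids and the mean distortion;
# vq then assigns each observation to its nearest centroid
centroids, distortion = kmeans(data, 2)
labels, _ = vq(data, centroids)

print(centroids.shape)   # (2, 2)
```

This is essentially the functionality that scikits-learn's KMeans also provides, which is the overlap being described.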
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
Good to hear maintenance cost is low.
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
Agreed.
- maxentropy : is deprecated, will disappear.
Great.
- signal : not in great shape, could be viable independent package. On the other hand, if scikits-signal takes off and those developers take care to improve and build on scipy.signal when possible, that's OK too.
What are the needs of this package? What needs to be fixed / improved? It is a broad field, and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as it was one of the areas I was most interested in originally.
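For context, the filter-design functionality already in scipy.signal that is being pointed at can be sketched like this. The sampling rate, cutoff and test signal here are invented purely for illustration:

```python
import numpy as np
from scipy import signal

# design a 4th-order Butterworth low-pass filter with a 10 Hz cutoff
# for data sampled at 100 Hz (cutoff given as a fraction of Nyquist)
fs = 100.0
b, a = signal.butter(4, 10.0 / (fs / 2.0))

# a noisy 1 Hz sine; filtfilt applies the filter forward and backward
# for zero phase distortion
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.random.randn(t.size)
y = signal.filtfilt(b, a, x)
```

The design functions (butter, cheby1, firwin, ...) work; the complaints in the thread are more about documentation, API polish and missing higher-level tools than about this core.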
- weave : no point spending any effort on it. keep for backwards compatibility only, direct people to Cython instead.
Agreed. Anyway we can deprecate this for SciPy 1.0?
Overall, I don't see many viable independent packages there. So here's an alternative to spending a lot of effort on reorganizing the package structure: 1. Formulate a coherent vision of what in principle belongs in scipy (current modules + what's missing).
O.K., so SciPy should contain "basic" modules that are going to be needed for a lot of different kinds of analysis, so that it can serve as a dependency for other more advanced packages. This is somewhat vague, of course. What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days? http://gams.nist.gov/cgi-bin/serve.cgi
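To make "basic wavelets (dwt primarily)" concrete, here is a minimal single-level Haar DWT in plain numpy. This is only an illustration of the kind of building block being discussed, not a proposed API:

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform.

    Returns (approximation, detail) coefficients; x must have even length.
    """
    x = np.asarray(x, dtype=float)
    s = np.sqrt(2.0)
    approx = (x[0::2] + x[1::2]) / s   # pairwise averages (scaled)
    detail = (x[0::2] - x[1::2]) / s   # pairwise differences (scaled)
    return approx, detail

def haar_idwt(approx, detail):
    """Invert one level of the Haar transform (perfect reconstruction)."""
    s = np.sqrt(2.0)
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / s
    x[1::2] = (approx - detail) / s
    return x

x = np.array([4.0, 6.0, 10.0, 12.0])
a, d = haar_dwt(x)
assert np.allclose(haar_idwt(a, d), x)  # round-trips exactly
```

A real dwt implementation would add other wavelet families, multi-level decomposition and boundary handling, but the orthogonal analysis/synthesis pair above is the core idea.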
2. Focus on making it easier to contribute to scipy. There are many ways to do this; having more accessible developer docs, having a list of "easy fixes", adding info to tickets on how to get started on the reported issues, etc. We can learn a lot from Sympy and IPython here.
Definitely!
3. Recognize that quality of code and especially documentation is important, and fill the main gaps.
Is there a write-up of recognized gaps here that we can start with?
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove them for scipy 1.0. I think that this applies only to maxentropy and weave.
I think it also applies to cluster as described above.
5. Find a clear (group of) maintainer(s) for each sub-module. For people familiar with one module, responding to tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
In my opinion, spending effort on improving code/documentation quality and attracting new developers (those go hand in hand) instead of reorganizing will have both more impact and be more beneficial for our users.
Agreed. Thanks for the feedback. Best, -Travis
Hi all, On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:

Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines

Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

That list is biased towards the classes of codes used in supercomputing environments, and some of the topics are probably beyond the scope of scipy (say structured/unstructured grids, at least for now). But it can be a decent guiding outline to reason about what are the 'big areas' of scientific computing, so that scipy at least provides building blocks that would be useful in these directions.

One area that hasn't been directly mentioned too much is the situation with statistical tools. On the one hand, we have the phenomenal work of pandas, statsmodels and sklearn, which together are helping turn python into a great tool for statistical data analysis (understood in a broad sense). But it would probably be valuable to have enough of a statistical base directly in numpy/scipy so that the 'out of the box' experience for statistical work is improved. I know we have scipy.stats, but it seems like it needs some love.

Cheers,
f
On Wed, Jan 4, 2012 at 9:22 PM, Fernando Perez <fperez.net@gmail.com> wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in supercomputing environments, and some of the topics are probably beyond the scope of scipy (say structured/unstructured grids, at least for now).
But it can be a decent guiding outline to reason about what are the 'big areas' of scientific computing, so that scipy at least provides building blocks that would be useful in these directions.
One area that hasn't been directly mentioned too much is the situation with statistical tools. On the one hand, we have the phenomenal work of pandas, statsmodels and sklearn, which together are helping turn python into a great tool for statistical data analysis (understood in a broad sense). But it would probably be valuable to have enough of a statistical base directly in numpy/scipy so that the 'out of the box' experience for statistical work is improved. I know we have scipy.stats, but it seems like it needs some love.
(I didn't send something like the first part earlier, because I didn't want to talk so much.)

Every new piece of code and every sub-package needs additional topic-specific maintainers. Pauli, Warren and Ralf are doing a great job as default, general maintainers, and especially Warren and Ralf have been pushing bug-fixes and enhancements into stats (and I have been reviewing almost all of it).

If there is a well-defined set of enhancements that could go into stats, then I wouldn't mind, but I don't see much reason in duplicating code and maintenance work with statsmodels. Of course there are large parts that statsmodels doesn't cover either, and it is useful to extend the coverage of statistics in either package. However, adding code that is not low-maintenance (because fully tested) or that doesn't have committed maintainers doesn't make much sense in my opinion.

Cheers,
Josef
Cheers,
f
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in supercomputing environments, and some of the topics are probably beyond the scope of scipy (say structured/unstructured grids, at least for now).
But it can be a decent guiding outline to reason about what are the 'big areas' of scientific computing, so that scipy at least provides building blocks that would be useful in these directions.
Thanks for the links.
One area that hasn't been directly mentioned too much is the situation with statistical tools. On the one hand, we have the phenomenal work of pandas, statsmodels and sklearn, which together are helping turn python into a great tool for statistical data analysis (understood in a broad sense). But it would probably be valuable to have enough of a statistical base directly in numpy/scipy so that the 'out of the box' experience for statistical work is improved. I know we have scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work. A big question to me is the impact of data-frames as the underlying data-representation of the algorithms and the relationship between the data-frame and a NumPy array. -Travis
Cheers,
f
On Wed, Jan 4, 2012 at 7:29 PM, Travis Oliphant <travis@continuum.io> wrote:
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Well, I recently needed to do some simple linear modeling, and the stats glm docstring isn't very encouraging:

Docstring:
Calculates a linear model fit ... anova/ancova/lin-regress/t-test/etc. Taken from:
Peterson et al. Statistical limitations in functional neuroimaging I. Non-inferential methods and statistical models. Phil Trans Royal Soc Lond B 354: 1239-1260.

Returns
-------
statistic, p-value ???

### END of docstring

I turned to statsmodels, which had great examples and it was very easy to use (for an ignoramus on the matter like myself). But perhaps that happens to be an isolated point. I have to admit, I've just been using the pandas/statsmodels/sklearn combo directly. Part of that has to do also with the nice, long-form examples available for them, something which I think we still lack in numpy/scipy but where some of the new focused projects have done a great job (the matplotlib gallery blazed that trail, and others have followed with excellent results).

Cheers,
f
On Wed, Jan 4, 2012 at 10:46 PM, Fernando Perez <fperez.net@gmail.com> wrote:
On Wed, Jan 4, 2012 at 7:29 PM, Travis Oliphant <travis@continuum.io> wrote:
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Well, I recently needed to do some simple linear modeling, and the stats glm docstring isn't very encouraging:
Docstring: Calculates a linear model fit ... anova/ancova/lin-regress/t-test/etc. Taken from:
Peterson et al. Statistical limitations in functional neuroimaging I. Non-inferential methods and statistical models. Phil Trans Royal Soc Lond B 354: 1239-1260.
Returns
-------
statistic, p-value ???
### END of docstring
glm should have been removed a long time ago, since it doesn't make much sense. A basic OLS class might not be a bad fit for scipy, also judging from some of the questions I have seen on stackoverflow from users of the cookbook class.
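For concreteness, the kind of "basic OLS" mentioned can be sketched in a few lines with numpy.linalg.lstsq. The simulated data and the true coefficients (intercept 2, slope 3) are invented for the example:

```python
import numpy as np

# ordinary least squares: fit y = X @ beta + noise
rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, n)

beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])     # residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)         # parameter covariance
stderr = np.sqrt(np.diag(cov))                # standard errors of beta
```

A scipy-level OLS class would essentially wrap this with standard errors, t-statistics and a summary, which is what the cookbook class (and statsmodels' OLS) provide.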
I turned to statsmodels, which had great examples and it was very easy to use (for an ignoramus on the matter like myself).
But perhaps that happens to be an isolated point. I have to admit, I've just been using the pandas/statsmodels/sklearn combo directly. Part of that has to do also with the nice, long-form examples available for them, something which I think we still lack in numpy/scipy but where some of the new focused projects have done a great job (the matplotlib gallery blazed that trail, and others have followed with excellent results).
I'm not exactly unhappy about this :), especially once we get to the stage where you can type print modelresults.summary() and we either print diagnostic checks explaining why you shouldn't trust your model results, or print no warning comments because the diagnostic checks don't indicate anything is wrong. Of course I'm not so happy about the lack of examples in scipy. Josef
Cheers,
f
On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in supercomputing environments, and some of the topics are probably beyond the scope of scipy (say structured/unstructured grids, at least for now).
But it can be a decent guiding outline to reason about what are the 'big areas' of scientific computing, so that scipy at least provides building blocks that would be useful in these directions.
Thanks for the links.
One area that hasn't been directly mentioned too much is the situation with statistical tools. On the one hand, we have the phenomenal work of pandas, statsmodels and sklearn, which together are helping turn python into a great tool for statistical data analysis (understood in a broad sense). But it would probably be valuable to have enough of a statistical base directly in numpy/scipy so that the 'out of the box' experience for statistical work is improved. I know we have scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect skewness and kurtosis formulas for some distributions, and I now have very little confidence that any of the other distributions are correct. Of course, most of them probably *are* correct, but without tests, all are in doubt.

Warren

A big question to me is the impact of data-frames as the underlying
data-representation of the algorithms and the relationship between the data-frame and a NumPy array.
-Travis
Cheers,
f
On Jan 5, 2012, at 12:02 AM, Warren Weckesser wrote:
On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in supercomputing environments, and some of the topics are probably beyond the scope of scipy (say structured/unstructured grids, at least for now).
But it can be a decent guiding outline to reason about what are the 'big areas' of scientific computing, so that scipy at least provides building blocks that would be useful in these directions.
Thanks for the links.
One area that hasn't been directly mentioned too much is the situation with statistical tools. On the one hand, we have the phenomenal work of pandas, statsmodels and sklearn, which together are helping turn python into a great tool for statistical data analysis (understood in a broad sense). But it would probably be valuable to have enough of a statistical base directly in numpy/scipy so that the 'out of the box' experience for statistical work is improved. I know we have scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect skewness and kurtosis formulas for some distributions, and I now have very little confidence that any of the other distributions are correct. Of course, most of them probably *are* correct, but without tests, all are in doubt.
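As an illustration of the kind of distribution test being discussed, implemented moments can be checked against closed-form values; here the exponential distribution, whose skewness is 2 and excess kurtosis is 6:

```python
import numpy as np
from scipy import stats

# scipy.stats.expon should report mean 1, variance 1, skewness 2 and
# excess kurtosis 6; a wrong formula would fail this check.
mean, var, skew, kurt = stats.expon.stats(moments='mvsk')
assert np.allclose([mean, var, skew, kurt], [1.0, 1.0, 2.0, 6.0])

# An independent sampling-based cross-check of the skewness formula.
rng = np.random.default_rng(1)
sample = rng.exponential(size=200000)
assert abs(stats.skew(sample) - 2.0) < 0.2
```

A few lines like this per distribution would have caught the incorrect formulas mentioned above.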
There is such a thing as *over-reliance* on tests as well. Tests help, but it is not a black-or-white kind of thing, as seems to come across in many of the messages on this list about what part of scipy is in "good shape" or "easy to maintain" or "has love." Just because tests exist doesn't mean that you can trust the code --- you also then have to trust the tests. Ultimately, trust is built from successful *usage*. Tests are only a pseudo-substitute for that usage. It so happens that usage that comes along with the code itself makes it easier to iterate on changes and catch some of the errors that can happen on re-factoring.

In summary, tests are good! But they also add overhead and themselves must be maintained, and I don't think it helps to disparage working code. I've seen a lot of terrible code that has *great* tests, and seen projects fail because developers focus too much on the tests and not enough on what the code is actually doing. Great tests can catch many things, but they cannot make up for not paying attention when writing the code.

-Travis
Warren
A big question to me is the impact of data-frames as the underlying data-representation of the algorithms and the relationship between the data-frame and a NumPy array.
-Travis
Cheers,
f
On Thu, Jan 5, 2012 at 7:26 AM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 5, 2012, at 12:02 AM, Warren Weckesser wrote:
On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis@continuum.io>wrote:
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
That list is biased towards the classes of codes used in supercomputing environments, and some of the topics are probably beyond the scope of scipy (say structured/unstructured grids, at least for now).
But it can be a decent guiding outline to reason about what are the 'big areas' of scientific computing, so that scipy at least provides building blocks that would be useful in these directions.
Thanks for the links.
One area that hasn't been directly mentioned too much is the situation with statistical tools. On the one hand, we have the phenomenal work of pandas, statsmodels and sklearn, which together are helping turn python into a great tool for statistical data analysis (understood in a broad sense). But it would probably be valuable to have enough of a statistical base directly in numpy/scipy so that the 'out of the box' experience for statistical work is improved. I know we have scipy.stats, but it seems like it needs some love.
It seems like scipy stats has received quite a bit of attention. There is always more to do, of course, but I'm not sure what specifically you think is missing or needs work.
Test coverage, for example. I recently fixed several wildly incorrect skewness and kurtosis formulas for some distributions, and I now have very little confidence that any of the other distributions are correct. Of course, most of them probably *are* correct, but without tests, all are in doubt.
There is such a thing as *over-reliance* on tests as well.
True in principle, but we're so far from that point that you don't have to worry about that for the foreseeable future.
Tests help, but it is not a black-or-white kind of thing, as seems to come across in many of the messages on this list about what part of scipy is in "good shape" or "easy to maintain" or "has love." Just because tests exist doesn't mean that you can trust the code --- you also then have to trust the tests. Ultimately, trust is built from successful *usage*. Tests are only a pseudo-substitute for that usage. It so happens that usage that comes along with the code itself makes it easier to iterate on changes and catch some of the errors that can happen on re-factoring.
In summary, tests are good! But, they also add overhead and themselves must be maintained, and I don't think it helps to disparage working code. I've seen a lot of terrible code that has *great* tests and seen projects fail because developers focus too much on the tests and not enough on what the code is actually doing. Great tests can catch many things but they cannot make up for not paying attention when writing the code.
Certainly, but besides giving more confidence that code is correct, a major advantage is that tests are a massive help when working on existing code - especially for new developers. Right now we have to be extremely careful in reviewing patches to check that nothing gets broken (including backwards compatibility). In that respect tests are not a maintenance burden, but a time saver. As an example, last week I wanted to add a way to easily adjust the bandwidth of gaussian_kde. This was maybe 10 lines of code and didn't take long at all. Then I spent some time adding tests and improving the docs, and thought I was done. After sending the PR, I spent at least an equal amount of time reworking everything a couple of times so as not to break any of the existing subclasses that could be found. In addition it took a lot of Josef's time to review it all and convince me of the error of my ways. A few tests could have saved us a lot of time. Ralf
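For reference, a sketch of the kind of bandwidth-adjustment interface being described, assuming a scipy version where gaussian_kde accepts a bw_method keyword and has a set_bandwidth method (the exact names in the actual patch may differ):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Build a KDE and adjust its bandwidth, both at construction time and
# afterwards in place.
rng = np.random.RandomState(0)
data = rng.normal(size=500)

kde = gaussian_kde(data)                        # default: Scott's rule
kde_narrow = gaussian_kde(data, bw_method=0.1)  # explicit scalar factor
kde.set_bandwidth(bw_method='silverman')        # switch rule in place

x = np.linspace(-3, 3, 101)
density = kde(x)
assert density.shape == (101,)
assert kde_narrow.factor == 0.1   # the scalar is used directly as factor
```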
On Thu, Jan 5, 2012 at 1:47 AM, Ralf Gommers <ralf.gommers@googlemail.com> wrote:
On Thu, Jan 5, 2012 at 7:26 AM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 5, 2012, at 12:02 AM, Warren Weckesser wrote:
On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
Well, probably not something that fits these ideas for scipy one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View from Berkeley' paper on parallel computing is not a bad starting point; summarized here they are:
Dense Linear Algebra
Sparse Linear Algebra [1]
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
This is a nice list, thanks!
Descriptions of each can be found here: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is here:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Overall I also think that adding sufficient tests at the time of adding the code is a big time saver in the long run. It is a lot more difficult to figure out later why something is wrong and how to fix it. Without sufficient tests it's also difficult to tell whether code that looks good works as advertised, (my last mistake was a misplaced bracket that only showed up in cases that were not covered by the tests). And of course as Ralf mentioned, refactoring without test coverage is dangerous business even if the change looks "innocent. Josef
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
On Thu, Jan 5, 2012 at 7:10 AM, <josef.pktd@gmail.com> wrote:
On Thu, Jan 5, 2012 at 1:47 AM, Ralf Gommers <ralf.gommers@googlemail.com> wrote:
On Thu, Jan 5, 2012 at 7:26 AM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 5, 2012, at 12:02 AM, Warren Weckesser wrote:
On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
And sufficient means test everything. I always turn up bugs when I increase test coverage. It can be embarrassing. Chuck
On Thu, Jan 5, 2012 at 1:02 AM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
Hi all,
On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis@continuum.io> wrote:
Actually for this part it's not so much the test coverage. I have written some imperfect tests, but they are disabled because skew, kurtosis (3rd and 4th moments) and entropy still have several bugs for sure. One problem is that these are statistical tests with some false alarms, especially for distributions that are far away from the normal. But the main problem is that it requires a lot of work to fix those bugs: find the correct formulas (which is not so easy for some of the more exotic distributions) and then find out where the current calculations are wrong, as you have seen for the cases that you recently fixed. Variances (2nd moments) might be ok, but I'm not completely convinced anymore since I discovered that the corresponding test was a dummy. Better tests would be useful, but statistical tests based on random samples were the only ones I could come up with at the time that (mostly) worked across all ~100 distributions. Josef
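A sketch of the sample-based testing described here, using a Kolmogorov-Smirnov check of rvs against the distribution's own cdf (the listed distributions and parameters are an arbitrary illustrative subset, not the actual test suite):

```python
import numpy as np
from scipy import stats

# Draw a sample from .rvs() and compare it against the distribution's
# own .cdf with a K-S test.  Being statistical, such a test has false
# alarms, which is why the acceptance threshold is kept loose.
def ks_stat(name, args, n=2000, seed=42):
    dist = getattr(stats, name)
    rvs = dist.rvs(*args, size=n,
                   random_state=np.random.RandomState(seed))
    D, pval = stats.kstest(rvs, name, args=args)
    return D

# For a correct rvs/cdf pair the K-S statistic D should be small.
for name, args in [('norm', ()), ('expon', ()), ('gamma', (2.5,))]:
    assert ks_stat(name, args) < 0.05
```

Looping a check like this over all continuous distributions is roughly what a sample-based suite looks like; the hard part, as noted above, is distinguishing a genuine bug from a false alarm for the exotic cases.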
A big question to me is the impact of data-frames as the underlying data-representation of the algorithms and the relationship between the data-frame and a NumPy array.
-Travis
Cheers,
f
On Wed, Jan 4, 2012 at 6:43 PM, Travis Oliphant <travis@continuum.io> wrote:
Thanks for the feedback. My point was to generate discussion and start the ball rolling on exactly the kind of conversation that has started.
Exactly as Ralf mentioned, the point is to get development on sub-packages --- something that the scikits effort and other individual efforts have done very, very well. In fact, it has worked so well, that it taught me a great deal about what is important in open source. My perhaps irrational dislike for the *name* "scikits" should not be interpreted as anything but a naming taste preference (and I am not known for my ability to choose names well anyway). I very much like and admire the community around scikits. I just would have preferred something easier to type (even just sci_* would have been better in my mind as high-level packages: sci_learn, sci_image, sci_statsmodels, etc.). I didn't feel like I was able to fully participate in that discussion when it happened, so you can take my comments now as simply historical and something I've been wanting to get off my chest for a while.
Without better packaging and dependency management systems (especially on Windows and Mac), splitting out code doesn't help those who are not distribution dependent (who themselves won't be impacted much). There are scenarios under which it could make sense to split out SciPy, but I agree that right now it doesn't make sense to completely split everything. However, I do think it makes sense to clean things up and move some things out in preparation for SciPy 1.0
One thing that would be nice is what is the view of documentation and examples for the different packages. Where is work there most needed?
Looking at Travis' list of non-core packages I'd say that sparse certainly belongs in the core and integrate probably too. Looking at what's left: - constants : very small and low cost to keep in core. Not much to improve there.
Agreed.
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like that? It is hardly specific to machine learning. Same with various matrix factorizations.
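For a concrete sense of what scipy.cluster currently provides, a small sketch of the vector-quantization workflow (the data here is a synthetic two-blob example):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# scipy.cluster.vq in brief: whiten features to unit variance, find k
# centroids with kmeans, then assign observations with vq.
rng = np.random.RandomState(0)
obs = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                 rng.normal(5.0, 0.5, size=(100, 2))])

w = whiten(obs)                       # scale each column to unit variance
centroids, distortion = kmeans(w, 2)  # best of several random restarts
codes, dists = vq(w, centroids)       # nearest-centroid assignment

# the two well-separated blobs should end up in different clusters
assert len(set(codes[:100])) == 1 and len(set(codes[100:])) == 1
assert codes[0] != codes[100]
```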
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
Indexing of all sorts tends to be fundamental. But not everyone knows they want it ;) Good to hear maintenance cost is low.
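For a concrete picture of the kdtree usage mentioned, a minimal sketch with scipy.spatial's cKDTree (the random point set is just illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

# Nearest-neighbour lookup with a kd-tree: build once, query many times.
rng = np.random.RandomState(0)
points = rng.uniform(size=(1000, 2))
tree = cKDTree(points)

dist, idx = tree.query([0.5, 0.5], k=3)   # three nearest neighbours
assert len(idx) == 3
assert np.all(dist[:-1] <= dist[1:])      # results sorted by distance

pairs = tree.query_pairs(r=0.01)          # all pairs closer than r
```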
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
Agreed.
- maxentropy : is deprecated, will disappear.
Great.
- signal : not in great shape, could be viable independent package. On the other hand, if scikits-signal takes off and those developers take care to improve and build on scipy.signal when possible, that's OK too.
What are the needs of this package? What needs to be fixed / improved? It is a broad field and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as it was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that works for complex filter design that belongs somewhere.
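As a concrete picture of the existing filter-design entry point being discussed, a small sketch using the real-coefficient Parks-McClellan design already in scipy.signal (the band edges and tap count are an arbitrary illustrative choice):

```python
import numpy as np
from scipy import signal

# A 65-tap lowpass: passband up to 0.1, stopband from 0.15 (frequencies
# as fractions of the sampling rate; Nyquist = 0.5).
taps = signal.remez(65, [0.0, 0.1, 0.15, 0.5], [1.0, 0.0])

# Check the frequency response the design achieved.
w, h = signal.freqz(taps, worN=2048)
f = w / (2 * np.pi)                 # back to fractions of sampling rate
passband = np.abs(h[f <= 0.09])
stopband = np.abs(h[f >= 0.16])
assert passband.min() > 0.95 and stopband.max() < 0.05
```

The complex/Hermitian design mentioned above would generalize this to filters whose coefficients need not be real-symmetric.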
- weave : no point spending any effort on it. keep for backwards compatibility only, direct people to Cython instead.
Agreed. Any way we can deprecate this for SciPy 1.0?
Overall, I don't see many viable independent packages there. So here's an alternative to spending a lot of effort on reorganizing the package structure: 1. Formulate a coherent vision of what in principle belongs in scipy (current modules + what's missing).
OK, so SciPy should contain "basic" modules that are going to be needed for a lot of different kinds of analysis, so that it can serve as a dependency for other, more advanced packages. This is somewhat vague, of course.
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
http://gams.nist.gov/cgi-bin/serve.cgi
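Since a basic dwt is named as a gap, here is a dependency-free sketch of its simplest instance, a single-level orthonormal Haar transform (purely illustrative, not a proposed API):

```python
import numpy as np

# Single-level Haar DWT: split a signal into pairwise averages
# (approximation) and pairwise differences (detail), orthonormally scaled.
def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    assert x.size % 2 == 0
    s = np.sqrt(2.0)
    approx = (x[0::2] + x[1::2]) / s   # low-pass half
    detail = (x[0::2] - x[1::2]) / s   # high-pass half
    return approx, detail

def haar_idwt(approx, detail):
    s = np.sqrt(2.0)
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / s
    x[1::2] = (approx - detail) / s
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt(x)
assert np.allclose(haar_idwt(a, d), x)   # perfect reconstruction
```

A full dwt applies this recursively to the approximation coefficients; wavelets beyond Haar replace the two-tap averaging/differencing filters with longer filter banks.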
2. Focus on making it easier to contribute to scipy. There are many ways to do this; having more accessible developer docs, having a list of "easy fixes", adding info to tickets on how to get started on the reported issues, etc. We can learn a lot from Sympy and IPython here.
Definitely!
3. Recognize that quality of code and especially documentation is important, and fill the main gaps.
Is there a write-up of recognized gaps here that we can start with?
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove them for scipy 1.0. I think that this applies only to maxentropy and weave.
I think it also applies to cluster as described above.
5. Find a clear (group of) maintainer(s) for each sub-module. For people familiar with one module, responding to tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
In my opinion, spending effort on improving code/documentation quality and attracting new developers (those go hand in hand) instead of reorganizing will have both more impact and be more beneficial for our users.
Chuck
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like that? It is hardly specific to machine learning. Same with various matrix factorizations.
What is basic and what is not basic is the whole point of the discussion. I'm not sure that the functionality in cluster.vq and cluster.hierarchy can be considered "basic". But, it will certainly depend on the kinds of problems you tend to solve. I also don't understand your reference to matrix factorizations in this context. But, this isn't a big-deal to me, either, so if there are strong opinions wanting to keep it, then great.
What are the needs of this package? What needs to be fixed / improved? It is a broad field and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as it was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a remez algorithm that works for complex filter design that belongs somewhere.
It seems like this should go into scipy.signal next to the remez algorithm that is already there. -Travis
On Wed, Jan 4, 2012 at 8:07 PM, Travis Oliphant <travis@continuum.io> wrote:
What is basic and what is not basic is the whole point of the discussion. I'm not sure that the functionality in cluster.vq and cluster.hierarchy can be considered "basic". But, it will certainly depend on the kinds of problems you tend to solve. I also don't understand your reference to matrix factorizations in this context.
But, this isn't a big-deal to me, either, so if there are strong opinions wanting to keep it, then great.
Clustering is pretty basic to lots of things. That said, K-means might not be the one to keep. There are various matrix factorizations beyond the basic svd that are less common but potentially useful, such as those in partial least squares and positive matrix factorization. I think the scikits-learn folks use some of these and they might have an idea as to how useful they have been. ISTR someone posting about doing PLS for scipy a while back.
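As a sketch of the "positive matrix factorization" mentioned, here are the classic Lee-Seung multiplicative updates (illustrative only; the matrix sizes, iteration count and tolerance are arbitrary):

```python
import numpy as np

# Multiplicative-update NMF: approximate a nonnegative X as W @ H with
# W, H >= 0, by alternately rescaling H and W.
def nmf(X, rank, n_iter=500, seed=0):
    rng = np.random.RandomState(seed)
    m, n = X.shape
    W = rng.uniform(0.1, 1.0, size=(m, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, n))
    eps = 1e-12                      # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A matrix with exact nonnegative rank 3 should be recovered well.
rng = np.random.RandomState(42)
X = rng.uniform(size=(20, 3)) @ rng.uniform(size=(3, 15))
W, H = nmf(X, rank=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
assert rel_err < 0.05
```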
It seems like this should go into scipy.signal next to the remez algorithm that is already there.
I'd actually like it to replace the current one, since it is readable -- mostly python with a bit of Cython for finding extrema -- and does Hermitian filters, which covers both the symmetric and anti-symmetric filters that the current version does. Chuck
On Jan 4, 2012, at 9:53 PM, Charles R Harris wrote:
On Wed, Jan 4, 2012 at 8:07 PM, Travis Oliphant <travis@continuum.io> wrote:
Cool! That sounds even better :-) -Travis
Just one point here: one of the current shortcomings in scipy from my perspective is interpolation, which is spread between interpolate, signal, and ndimage, each package with strengths and inexplicable (to a new user) weaknesses. One trouble spot is the fact that it's not clear that ndimage is where one ought to turn for general interpolation/resampling of gridded data (a topic which comes up at least once every couple months on the list).
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
See above. The interpolation stuff is pretty useful for a lot of tasks that aren't really "imaging" per se, but which involve gridded data. (GIS, e.g.) Similarly, the code for convolutions and similar (median filtering, e.g.) seems pretty generally useful and in many ways better than what's in scipy.signal for certain tasks. I'm less certain about the morphological operations and the connected-components labeling, which might be more task-specific and fit better with scikits-image? (Probably after a re-write in Cython?) Zach
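To illustrate the point that ndimage is (non-obviously) where one turns for resampling gridded data, a small sketch with map_coordinates (the gridded field here is synthetic):

```python
import numpy as np
from scipy import ndimage

# Resampling gridded data: evaluate a regularly gridded field at
# arbitrary fractional (row, col) positions via spline interpolation.
y, x = np.mgrid[0:20, 0:30]
grid = np.sin(x / 5.0) + np.cos(y / 3.0)     # smooth field on the grid

# interior sample positions, deliberately between grid points
rows = np.array([5.5, 10.0, 14.25])
cols = np.array([6.0, 14.5, 20.75])
vals = ndimage.map_coordinates(grid, [rows, cols], order=3)

# cubic-spline resampling of a smooth field is very accurate
expected = np.sin(cols / 5.0) + np.cos(rows / 3.0)
assert np.allclose(vals, expected, atol=0.01)
```

Nothing about this is image-specific, which is exactly why its home in "ndimage" is hard for new users to discover.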
Great points. I agree that interpolation still needs love. I've had the exact same concern multiple times before. It comes up quite a bit in classes.
It looks like interpolate and signal are still areas where I can spend some free time. I know Warren has spent time in signal. Is anyone else working on interpolate? I can check this of course myself, but just in case someone is following this conversation who is interested in coordinating.
We may need to continue the conversation about ndimage.
I appreciate the patience with me after my being silent for a while. I'm technically between jobs as I recently left Enthought. I just re-did my mail account setup, so now I see all scipy-dev and numpy-discussion mails instead of having to remember to go look at the conversations.
Thanks,
-Travis
On Jan 4, 2012, at 9:16 PM, Zachary Pincus wrote:
Just one point here: one of the current shortcomings in scipy from my perspective is interpolation, which is spread between interpolate, signal, and ndimage, each package with strengths and inexplicable (to a new user) weaknesses.
One trouble spot is the fact that it's not clear that ndimage is where one ought to turn for general interpolation/resampling of gridded data (a topic which comes up at least once every couple months on the list).
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
See above. The interpolation stuff is pretty useful for a lot of tasks that aren't really "imaging" per se, but which involve gridded data. (GIS, e.g.) Similarly, the code for convolutions and similar (median filtering, e.g.) seems pretty generally useful and in many ways better than what's in scipy.signal for certain tasks.
I'm less certain about the morphological operations and the connected-components labeling, which might be more task-specific and fit better with scikits-image? (Probably after a re-write in Cython?)
Zach

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev
On Wed, Jan 4, 2012 at 10:36 PM, Travis Oliphant <travis@continuum.io> wrote:
Great points.
I agree that interpolation still needs love. I've had the exact same concern multiple times before. It comes up quite a bit in classes.
It looks like interpolate and signal are still areas that I can spend some free time. I know Warren has spent time in signal. Is anyone else working on interpolate --- I can check this of course myself, but just in case someone is following this conversation who is interested in coordinating.
There have been several starts on a control system toolbox that has some overlap with scipy.signal, but I haven't heard of any discussion in a while. The scipy wavelets look like a complete mystery: the docs are sparse, and with a google search I found only a single example of its usage.

Josef
We may need to continue the conversation about ndimage.
I appreciate the patience with me after my being silent for a while. I'm technically between jobs as I recently left Enthought. I just re-did my mail account setup so now I see all scipy-dev and numpy-discussion mails instead of having to remember to go look at the conversations.
Thanks,
-Travis
On Jan 4, 2012, at 9:16 PM, Zachary Pincus wrote:
05.01.2012 04:16, Zachary Pincus kirjoitti:
Just one point here: one of the current shortcomings in scipy from my perspective is interpolation, which is spread between interpolate, signal, and ndimage, each package with strengths and inexplicable (to a new user) weaknesses.
Interpolation and splines are indeed a weak point currently. What's missing is:

- interface for interpolating gridded data (unifying ndimage, RectBivariateSpline, and scipy.spline routines)
- the interface for `griddata` could be simplified a bit (-> allow variable number of arguments). Also, no natural neighbor interpolation so far.
- FITPACK is a quirky beast, especially its 2D routines (apart from RectBivariateSpline), which very often don't work for real data. I'm also not fully sure how far it and its smoothing can be trusted in 1D (see stackoverflow)
- There are two sets of incompatible spline routines in scipy.interpolate, which should be cleaned up. The *Spline class interfaces are also not very pretty, as there is __class__ changing magic going on. The interp2d interface is somewhat confusing, and IMO would be best deprecated.
- There is also a problem with large 1D data sets: FITPACK is slow, and the other set of spline routines try to invert a dense matrix, rather than e.g. using the band matrix routines.
- RBF sort of works, but uses dense matrices and is not suitable for large data sets. IDW interpolation could be a useful addition here.

And probably more: making a laundry list of what to fix could be helpful.
Also, as long as a list is being made: scipy.signal has matched functions [cq]spline1d() and [cq]spline1d_eval(), but only [cq]spline2d(), with no matching _eval function.

And as far as FITPACK goes, I agree it can be extremely, and possibly dangerously, "quirky": it's prone to almost arbitrarily bad ringing artifacts when the smoothing coefficient isn't large enough, and is very (very) sensitive to initial conditions in terms of what will and won't provoke the ringing. It has its uses, but it seems to me odd enough that it really shouldn't be the "default" 1D spline tool to direct people to.

Zach
On Mon, Jan 9, 2012 at 8:02 AM, Zachary Pincus <zachary.pincus@yale.edu> wrote:
Also, as long as a list is being made: scipy.signal has matched functions [cq]spline1d() and [cq]spline1d_eval(), but only [cq]spline2d(), with no matching _eval function.
And as far as FITPACK goes, I agree it can be extremely, and possibly dangerously, "quirky" -- it's prone to almost arbitrarily bad ringing artifacts when the smoothing coefficient isn't large enough, and is very (very) sensitive to initial conditions in terms of what will and won't provoke the ringing. It has its uses, but it seems to me odd enough that it really shouldn't be the "default" 1D spline tool to direct people to.
Do you have an example of "arbitrarily" bad ringing?
From what I was reading up on splines in the last weeks, I got the impression that this is a "feature" of interpolating splines, and that to be useful with a larger number of points we always need to smooth sufficiently (reduce knots or penalize). (I just read a comment that R with 5000 points only chooses about 200 knots.)
Josef
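Josef's point about knot counts can be seen directly with scipy's own routines. A minimal sketch (synthetic data, Python 3 syntax) showing how the number of knots FITPACK keeps shrinks as the smoothing factor s grows:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.randn(50)

# s=0 forces interpolation through every point (many knots, prone to ringing);
# a larger s lets FITPACK drop knots until the residual criterion is met.
interp_spline = UnivariateSpline(x, y, s=0)
smooth_spline = UnivariateSpline(x, y, s=0.5)

print(len(interp_spline.get_knots()), len(smooth_spline.get_knots()))
```

The interpolating fit keeps close to one knot per data point, while the smoothed fit gets by with far fewer, which is the knot-reduction behavior described above.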
On Jan 9, 2012, at 10:46 AM, josef.pktd@gmail.com wrote:
Do you have an example of "arbitrarily" bad ringing?
From what I was reading up on splines in the last weeks, I got the impression was that this is a "feature" of interpolating splines, and to be useful with a larger number of points we always need to smooth sufficiently (reduce knots or penalize). (I just read a comment that R with 5000 points only chooses about 200 knots).
Example below; it's using parametric splines because I have a simple interactive tool to draw them and notice occasional "blowing up" like what you see below. I *think* I've seen similar issues with normal splines, but haven't used them a lot lately. (For the record, treating the x and y values as separate and using the non-parametric spline fitting does NOT yield these crazy errors on *these data*...)

As far as the smoothing parameter, the "good" data will go crazy if s=3, but is fine with s=0.25 or s=4; similarly the "bad" data isn't prone to ringing if s=0.25 or s=5. So there's serious sensitivity both to the x,y positions of the data (as below) and to the smoothing parameter in a fairly small range.

I could probably come up with an example that goes crazy with even fewer input points, but this was the first thing I came up with. Small modifications to the input data seem to make it go even crazier, but the below illustrates the general point.

Zach

import numpy
import scipy.interpolate as interp

good = numpy.array(
    [[ 24.21162868, 28.75056713, 32.64108579, 36.85581434, 41.07054289,
       46.582111, 52.417889, 55.17367305, 57.92945711, 61.00945105,
       64.89996971, 72.19469221, 75.76100098, 83.21782842, 83.21782842,
       88.56729158, 86.29782236, 90.18834103, 86.62203225],
     [ 70.57364276, 71.22206254, 69.27680321, 72.5189021, 65.06207466,
       70.89785265, 67.33154388, 68.62838343, 69.92522299, 67.00733399,
       77.21994548, 68.30417354, 71.38416748, 71.38416748, 64.25154993,
       70.08732793, 61.00945105, 63.44102521, 56.47051261]])

bad = good.copy()
# now make a *small* change
bad[:,-1] = 87.432556973542049, 55.984197773255048

good_tck, good_u = interp.splprep(good, s=4)
bad_tck, bad_u = interp.splprep(bad, s=4)

print good.ptp(axis=1)
print numpy.array(interp.splev(numpy.linspace(good_u[0], good_u[-1], 300), good_tck)).ptp(axis=1)
print numpy.array(interp.splev(numpy.linspace(bad_u[0], bad_u[-1], 300), bad_tck)).ptp(axis=1)

And the output on my machine is:

[   65.97671235    20.74943287]
[   67.69845281    20.52518913]
[ 2868.98673621  450984.86622631]
On Mon, Jan 9, 2012 at 2:06 PM, Zachary Pincus <zachary.pincus@yale.edu> wrote:
(disclaimer: as mentioned, I only started very recently to read anything about smoothing splines, except for the scipy documentation)

I'm not so familiar with parametric splines, but I might have seen something similar with regular splines and ignored or worked around it without paying attention.

The local behavior around 4 (good: s>3.8 is fine; bad: s>4.3 is fine) might come about because there is no non-crazy spline with the given smoothness, and in this case increasing the smoothness factor makes the exploding behavior go away.

One impression I had when I tried this out a few weeks ago is that the spline smoothing factor s is imposed with equality, not inequality. In the examples that I tried with varying s, the reported error sum of squares always matched s to a few decimals. (I don't know how, because I didn't see the knots change in some examples.)

In your example it looks like the spline algorithm only looks for spline approximations in the neighborhood of those that give the specified s. It does not search for better-fitting splines with lower s. That would explain the strange behavior, given that there are "nice" splines at s=0.25.

In my recent examples, I used an information criterion (AIC, just because I had it available) to do a global search for the best s. It looks to me like the current spline implementation only does a local search with fixed s. What I didn't try to figure out is how to avoid recalculating everything for each different s.

In what I have been reading, they use either cross-validation or information criteria to choose the smoothing parameters, but I haven't read anything about whether the search needs to be global or can be just a local search.

Below I mainly add the code to plot to your example.

Josef
With plot and sum of squares fp:

import numpy
import scipy.interpolate as interp

good = numpy.array(
    [[ 24.21162868, 28.75056713, 32.64108579, 36.85581434, 41.07054289,
       46.582111, 52.417889, 55.17367305, 57.92945711, 61.00945105,
       64.89996971, 72.19469221, 75.76100098, 83.21782842, 83.21782842,
       88.56729158, 86.29782236, 90.18834103, 86.62203225],
     [ 70.57364276, 71.22206254, 69.27680321, 72.5189021, 65.06207466,
       70.89785265, 67.33154388, 68.62838343, 69.92522299, 67.00733399,
       77.21994548, 68.30417354, 71.38416748, 71.38416748, 64.25154993,
       70.08732793, 61.00945105, 63.44102521, 56.47051261]])

bad = good.copy()
# now make a *small* change
bad[:,-1] = 87.432556973542049, 55.984197773255048

(good_tck, good_u), good_fp, _, _ = interp.splprep(good, s=0.25, full_output=True)  # 3.8
(bad_tck, bad_u), bad_fp, _, _ = interp.splprep(bad, s=4.3, full_output=True)

print good.ptp(axis=1)

xg = numpy.linspace(good_u[0], good_u[-1], 300)
yg = numpy.array(interp.splev(xg, good_tck))
xb = numpy.linspace(bad_u[0], bad_u[-1], 300)
yb = numpy.array(interp.splev(xb, bad_tck))

print yg.ptp(axis=1)
print yb.ptp(axis=1)

# And the output on my machine is:
# [ 65.97671235  20.74943287]
# [ 67.69845281  20.52518913]
# [ 2868.98673621  450984.86622631]

print 'fp'
print good_fp
print bad_fp

import matplotlib.pyplot as plt
plt.plot(good[0], good[1], 'bo', alpha=0.75)
plt.plot(bad[0], bad[1], 'ro', alpha=0.75)
plt.plot(yg[0], yg[1], 'b-')
plt.plot(yb[0], yb[1], 'r-')
plt.show()
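The global search over s that Josef describes can be sketched without statsmodels. This is only a rough illustration (Python 3 syntax, synthetic data): the AIC-like score, penalizing by knot count, is an assumption standing in for whatever effective-degrees-of-freedom measure one prefers.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.RandomState(1)
x = np.linspace(0, 4 * np.pi, 80)
y = np.sin(x) + 0.2 * rng.randn(80)

# AIC-like criterion: n*log(SSE/n) + 2*k, with k = number of knots
# used as a crude complexity penalty (an assumption, not the FITPACK way).
def aic_score(s):
    spl = UnivariateSpline(x, y, s=s)
    sse = float(np.sum((y - spl(x)) ** 2))
    return len(x) * np.log(sse / len(x)) + 2 * len(spl.get_knots())

# Brute-force global search over a grid of candidate smoothing factors.
grid = np.linspace(0.5, 20.0, 40)
best_s = min(grid, key=aic_score)
print(best_s)
```

Each candidate s requires a full refit, which is exactly the recomputation cost Josef points out; a smarter implementation would reuse the knot placement between nearby values of s.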
09.01.2012 21:30, josef.pktd@gmail.com kirjoitti: [clip]
One impression I had when I tried this out a few weeks ago, is that the spline smoothing factor s is imposed with equality not inequality. In the examples that I tried with varying s, the reported error sum of squares always matched s to a few decimals. (I don't know how because I didn't see the knots change in some examples.)
As far as I understand the FITPACK code, it starts with a low number of knots in the spline, and then inserts new knots until the criterion given with `s` is satisfied for the LSQ spline. Then it adjusts k-th derivative discontinuities until the sum of squares of errors is equal to `s`.

Provided I understood this correctly (at least this is what was written in fppara.f): I'm not so sure that using the k-th derivative discontinuity as the smoothness term in the optimization is what people actually expect from "smoothing". A more likely candidate would be the curvature. However, the default value for the splines is k=3, cubic, which yields a somewhat strange "smoothness" constraint.

If this is indeed what FITPACK does, then it seems to me that the approach to smoothing is somewhat flawed. (However, it'd probably be best to read the book before making judgments here.)

Pauli
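The stopping behavior described above, and Josef's observation that the achieved error sum of squares lands on s, can be checked via splrep's full_output, which returns the achieved weighted residual sum `fp`. A sketch on synthetic data (Python 3 syntax; FITPACK's relative stopping tolerance means fp only matches s approximately):

```python
import numpy as np
from scipy.interpolate import splrep

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.randn(50)

s = 0.5
# full_output=True also returns fp, the weighted sum of squared
# residuals the fitted spline actually achieves for this s.
tck, fp, ier, msg = splrep(x, y, s=s, full_output=True)
print(fp, s, ier)
```

On well-behaved data fp comes out essentially equal to s, consistent with the "adjust discontinuities until the error equals s" step described above.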
09.01.2012 20:06, Zachary Pincus kirjoitti: [clip]
After a closer look at this, it seems to me that there could also be a numerical problem (or perhaps a bug) in the fitpack algorithm, i.e., the bad results are not necessarily due to a "wrong" smoothness metric. In the "bad" case it seems that the 3rd derivative discontinuities also explode. -- Pauli Virtanen
On Wed, Jan 4, 2012 at 9:33 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Wed, Jan 4, 2012 at 6:43 PM, Travis Oliphant <travis@continuum.io> wrote:
Thanks for the feedback. My point was to generate discussion and start the ball rolling on exactly the kind of conversation that has started.
Exactly as Ralf mentioned, the point is to get development on sub-packages --- something that the scikits effort and other individual efforts have done very, very well. In fact, it has worked so well that it taught me a great deal about what is important in open source.

My perhaps irrational dislike for the *name* "scikits" should not be interpreted as anything but a naming taste preference (and I am not known for my ability to choose names well anyway). I very much like and admire the community around scikits. I just would have preferred something easier to type (even just sci_* would have been better in my mind as high-level packages: sci_learn, sci_image, sci_statsmodels, etc.). I didn't feel like I was able to fully participate in that discussion when it happened, so you can take my comments now as simply historical and something I've been wanting to get off my chest for a while.
Without better packaging and dependency management systems (especially on Windows and Mac), splitting out code doesn't help those who are not distribution dependent (who themselves won't be impacted much). There are scenarios under which it could make sense to split out SciPy, but I agree that right now it doesn't make sense to completely split everything. However, I do think it makes sense to clean things up and move some things out in preparation for SciPy 1.0
One thing that would be nice to know is the view on documentation and examples for the different packages. Where is work most needed there?
Looking at Travis' list of non-core packages I'd say that sparse certainly belongs in the core and integrate probably too. Looking at what's left:

- constants : very small and low cost to keep in core. Not much to improve there.
Agreed.
- cluster : low maintenance cost, small. not sure about usage, quality.
I think cluster overlaps with scikits-learn quite a bit. It basically contains a K-means vector quantization code with functionality that I suspect exists in scikits-learn. I would recommend deprecation and removal while pointing people to scikits-learn for equivalent functionality (or moving it to scikits-learn).
I disagree. Why should I go to scikits-learn for basic functionality like that? It is hardly specific to machine learning. Same with various matrix factorizations.
- ndimage : difficult one. hard to understand code, may not see much development either way.
This overlaps with scikits-image but has quite a bit of useful functionality on its own. The package is fairly mature and just needs maintenance.
Again, pretty basic stuff in there, but I could be persuaded to go to scikits-image since it *is* image specific and might be better maintained.
- spatial : kdtree is widely used, of good quality. low maintenance cost.
Indexing of all sorts tends to be fundamental. But not everyone knows they want it ;)
Good to hear maintenance cost is low.
- odr : quite small, low cost to keep in core. pretty much done as far as I can tell.
Agreed.
- maxentropy : is deprecated, will disappear.
Great.
- signal : not in great shape, could be viable independent package. On the other hand, if scikits-signal takes off and those developers take care to improve and build on scipy.signal when possible, that's OK too.
What are the needs of this package? What needs to be fixed / improved? It is a broad field, and I could see fixing scipy.signal with a few simple algorithms (the filter design, for example), and then pushing a separate package to do more advanced signal processing algorithms. This sounds fine to me. It looks like I can put attention to scipy.signal then, as it was one of the areas I was most interested in originally.
Filter design could use improvement. I also have a Remez algorithm that works for complex filter design, which belongs somewhere.
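For reference, scipy.signal already exposes the real-coefficient Parks-McClellan design as signal.remez; the complex-capable variant Chuck mentions is not in scipy, so this sketch only shows the existing interface (band edges chosen arbitrarily, normalized to a sampling rate of 1):

```python
from scipy.signal import remez, freqz

# 72-tap equiripple lowpass: pass band 0-0.1, stop band 0.2-0.5
# (frequencies normalized so that Nyquist is 0.5).
taps = remez(72, [0, 0.1, 0.2, 0.5], [1, 0])

w, h = freqz(taps, worN=512)
passband_gain = abs(h[0])    # response at DC, should be close to 1
stopband_gain = abs(h[-1])   # response near Nyquist, should be tiny
print(passband_gain, stopband_gain)
```

The returned taps are then used with lfilter or convolve for the actual filtering; extending this interface to complex coefficients is where Chuck's algorithm would slot in.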
ltisys was pretty neglected, but Warren, I think, made quite big improvements. The discussion of whether MIMO works or should work came up several times; similarly, there was a discrete-time proposal, but I didn't keep up with what happened to it.

In statsmodels we are very happy with signal.lfilter, but I wish there were a multi-input version of it. Other basic things (periodograms, Burg, and Levinson-Durbin) are scipy algorithms, I think, but having them in a scikits.signal would be good also.

Josef
- weave : no point spending any effort on it. keep for backwards compatibility only, direct people to Cython instead.
Agreed. Anyway we can deprecate this for SciPy 1.0?
Overall, I don't see many viable independent packages there. So here's an alternative to spending a lot of effort on reorganizing the package structure:

1. Formulate a coherent vision of what in principle belongs in scipy (current modules + what's missing).
O.K. so SciPy should contain "basic" modules that are going to be needed for a lot of different kinds of analysis to be a dependency for other more advanced packages. This is somewhat vague, of course.
What do others think is missing? Off the top of my head: basic wavelets (dwt primarily) and more complete interpolation strategies (I'd like to finish the basic interpolation approaches I started a while ago). Originally, I used GAMS as an "overview" of the kinds of things needed in SciPy. Are there other relevant taxonomies these days?
http://gams.nist.gov/cgi-bin/serve.cgi
2. Focus on making it easier to contribute to scipy. There are many ways to do this; having more accessible developer docs, having a list of "easy fixes", adding info to tickets on how to get started on the reported issues, etc. We can learn a lot from Sympy and IPython here.
Definitely!
3. Recognize that quality of code and especially documentation is important, and fill the main gaps.
Is there a write-up of recognized gaps here that we can start with?
4. Deprecate sub-modules that don't belong in scipy (anymore), and remove them for scipy 1.0. I think that this applies only to maxentropy and weave.
I think it also applies to cluster as described above.
5. Find a clear (group of) maintainer(s) for each sub-module. For people familiar with one module, responding to tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
In my opinion, spending effort on improving code/documentation quality and attracting new developers (those go hand in hand) instead of reorganizing will have both more impact and be more beneficial for our users.
Chuck
On Wed, Jan 4, 2012 at 8:30 PM, <josef.pktd@gmail.com> wrote:
On Wed, Jan 4, 2012 at 9:33 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Wed, Jan 4, 2012 at 6:43 PM, Travis Oliphant <travis@continuum.io> wrote:
<snip>
ltisys was pretty neglected, but Warren, I think, made quite big improvements. There was several times the discussion whether MIMO works or should work, similar there was a discrete time proposal but I didn't keep up with what happened to it.
In statsmodels we are very happy with signal.lfilter but I wished there were a multi input version of it. Other things that are basic, periodograms, burg and levinson_durbin are scipy algorithms I think, but having them in a scikits.signal would be good also.
Those all sound like good additions. Burg and Levinson-Durbin would also be useful for folks making a maximum entropy package and would be a natural fit with lfilter. I've seen various approaches to image interpolation that could also make use of the lfilter functionality.

<snip>

Chuck
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy signal (like matlab) implement only a general purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
Just out of curiosity: why no FFT-based IIR filter? It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped FFT-based filtering.

Josef
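On the FFT-based FIR point: for a single block of samples, time-domain filtering with lfilter and FFT-domain filtering with fftconvolve are numerically interchangeable, as this sketch shows (real streaming implementations like the ones Neal describes would use overlap-add or overlap-save instead of one big FFT; Python 3 syntax):

```python
import numpy as np
from scipy.signal import lfilter, fftconvolve

rng = np.random.RandomState(42)
x = rng.randn(4096)

taps = np.hanning(64)
taps /= taps.sum()   # a simple 64-tap FIR lowpass (illustrative choice)

direct = lfilter(taps, [1.0], x)           # time-domain FIR
via_fft = fftconvolve(x, taps)[:len(x)]    # FFT-based, truncated to match

print(np.allclose(direct, via_fft))
```

Which path is faster depends on the tap count and block length, which is consistent with Josef's experience that FFT-based filtering loses for short series.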
On Jan 5, 2012, at 10:00 AM, josef.pktd@gmail.com wrote:
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT-based IIR filter?
It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped fft-based filtering.
I think he is talking about filter design, correct? lfilter can be used to implement FIR and IIR filters -- although an FIR filter is easily computed with convolve/correlate as well. FIR filter design is usually done in the FFT-domain. But this picks the coefficients; the actual filtering itself is done with something like convolve. If you *do* filtering in the FFT-domain, then it's usually going to be IIR. What are you referring to when you say "small change in the implementation"? -Travis
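As a quick illustration of the point that lfilter covers the FIR case too: a denominator of [1] (no feedback) reproduces plain convolution truncated to the input length (the coefficients below are arbitrary):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
x = rng.standard_normal(128)
b = np.array([0.25, 0.5, 0.25])   # FIR (feed-forward) coefficients

# FIR filtering with lfilter: denominator a = [1] means no feedback.
y_lfilter = lfilter(b, [1.0], x)

# The same FIR result via direct convolution, truncated to len(x).
y_conv = np.convolve(b, x)[:len(x)]

assert np.allclose(y_lfilter, y_conv)
```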
Josef
On Thu, Jan 5, 2012 at 11:14 AM, Travis Oliphant <travis@continuum.io> wrote:
On Jan 5, 2012, at 10:00 AM, josef.pktd@gmail.com wrote:
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT-based IIR filter?
It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped fft-based filtering.
I think he is talking about filter design, correct?
lfilter can be used to implement FIR and IIR filters -- although an FIR filter is easily computed with convolve/correlate as well.
FIR filter design is usually done in the FFT-domain. But this picks the coefficients; the actual filtering itself is done with something like convolve.
If you *do* filtering in the FFT-domain, then it's usually going to be IIR. What are you referring to when you say "small change in the implementation"?
Maybe I'm interpreting things wrongly, since I'm not so familiar with signal processing terminology. As far as I understand, fftconvolve(in1, in2) applies an FIR filter in2 to in1; however, it is possible to divide by the fft of an in3, which would give both the FIR and IIR filter terms as in lfilter. (I tried out different versions of fft-based time series analysis in the statsmodels sandbox.) I never looked very closely at filter design itself, because it is very different from the estimation procedures we use in time series analysis. Josef
-Travis
Josef
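A sketch of the division trick described above, checked against lfilter: for a well-damped filter and generous zero-padding, the circular wrap-around of the FFT is negligible (the one-pole filter and the padding length N below are illustrative choices):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)
x = rng.standard_normal(64)
b, a = [1.0], [1.0, -0.5]      # simple one-pole IIR, well damped

# Time-domain IIR filtering.
y_time = lfilter(b, a, x)

# FFT-domain version: multiply by fft(b) and divide by fft(a).
# With N much larger than the effective impulse-response length,
# the circular wrap-around is below numerical precision.
N = 256
Y = np.fft.fft(x, N) * np.fft.fft(b, N) / np.fft.fft(a, N)
y_fft = np.fft.ifft(Y)[:len(x)].real

assert np.allclose(y_time, y_fft)
```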
Travis Oliphant wrote:
On Jan 5, 2012, at 10:00 AM, josef.pktd@gmail.com wrote:
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT-based IIR filter?
It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped fft-based filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about filter design. As for FFT-based IIR filters, I had never heard of them. I was talking about the fact that the fft can be used to efficiently implement a linear convolution exactly (for the case of convolution of a finite or short sequence - the impulse response of the filter - with a long or infinite sequence, the overlap-add or overlap-save techniques are used).
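For illustration, overlap-add amounts to FFT-convolving the long signal block by block and summing the overlapping tails; the result matches direct convolution to numerical precision (block and FFT sizes here are arbitrary choices):

```python
import numpy as np

def overlap_add(x, h, block=64):
    """FIR-filter a long signal x with taps h by overlap-add."""
    # FFT size: next power of two holding one block's linear convolution.
    nfft = 1 << int(np.ceil(np.log2(block + len(h) - 1)))
    H = np.fft.rfft(h, nfft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        # Linear convolution of the block via zero-padded FFTs.
        yseg = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        n = len(seg) + len(h) - 1
        y[start:start + n] += yseg[:n]   # add the overlapping tail
    return y

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)
assert np.allclose(overlap_add(x, h), np.convolve(x, h))
```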
lfilter can be used to implement FIR and IIR filters -- although an FIR filter is easily computed with convolve/correlate as well.
FIR filter design is usually done in the FFT-domain. But this picks the coefficients; the actual filtering itself is done with something like convolve.
If you *do* filtering in the FFT-domain, then it's usually going to be IIR. What are you referring to when you say "small change in the implementation"?
-Travis
Josef
On Jan 5, 2012, at 1:19 PM, Neal Becker wrote:
Travis Oliphant wrote:
On Jan 5, 2012, at 10:00 AM, josef.pktd@gmail.com wrote:
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT-based IIR filter?
It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped fft-based filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about filter design.
As for FFT-based IIR filters, I had never heard of them. I was talking about the fact that the fft can be used to efficiently implement a linear convolution exactly (for the case of convolution of a finite or short sequence - the impulse response of the filter - with a long or infinite sequence, the overlap-add or overlap-save techniques are used).
Sure, of course. It's hard to know the way people are using terms. I agree that people don't usually use the term IIR when talking about an FFT-based filter (but there is an "effective" time-domain response for every filtering operation done in the Fourier domain --- as you noted). That's what I was referring to.
It's been a while since I wrote lfilter, but it transposes the filtering operation into Direct Form II, and then does a straightforward implementation of the feed-back and feed-forward equations. Here is some information on the approach: https://ccrma.stanford.edu/~jos/fp/Direct_Form_II.html
IIR filters implemented in the time-domain need something like lfilter. FIR filters are "just" convolution in the time domain --- and there are different approaches to doing that discrete-time convolution, as you've noted. IIR filters are *just* convolution as well (but convolution with an infinite sequence). Of course, if you use the FFT-domain to implement the filter, then you can just as well design in that space the filtering function you want to multiply the input signal with (it's just important to keep in mind the impact in the time-domain of what you are doing in the frequency domain --- i.e., sharp edges result in ringing, the basic time-frequency product limitations, etc.)
These same ideas come under different names and have different emphasis in different disciplines.
-Travis
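For illustration, the feed-back/feed-forward recurrence of the (transposed) Direct Form II structure can be written out in a few lines and checked against scipy.signal.lfilter (the helper name df2t is just for this sketch):

```python
import numpy as np
from scipy.signal import lfilter

def df2t(b, a, x):
    """Direct Form II transposed recurrence, zero initial conditions."""
    b = np.asarray(b, float) / a[0]   # normalize so a[0] == 1
    a = np.asarray(a, float) / a[0]
    n = max(len(b), len(a))
    b = np.pad(b, (0, n - len(b)))
    a = np.pad(a, (0, n - len(a)))
    z = np.zeros(n - 1)               # delay-line state
    y = np.empty(len(x))
    for i, xn in enumerate(x):
        yn = b[0] * xn + z[0]
        # Update the state: feed forward b, feed back -a.
        for k in range(n - 2):
            z[k] = b[k + 1] * xn + z[k + 1] - a[k + 1] * yn
        z[n - 2] = b[n - 1] * xn - a[n - 1] * yn
        y[i] = yn
    return y

rng = np.random.default_rng(4)
x = rng.standard_normal(50)
b, a = [0.2, 0.3, 0.1], [1.0, -0.4, 0.2]
assert np.allclose(df2t(b, a, x), lfilter(b, a, x))
```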
Travis Oliphant wrote:
On Jan 5, 2012, at 1:19 PM, Neal Becker wrote:
Travis Oliphant wrote:
On Jan 5, 2012, at 10:00 AM, josef.pktd@gmail.com wrote:
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT-based IIR filter?
It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped fft-based filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about filter design.
As for FFT-based IIR filters, I had never heard of them. I was talking about the fact that the fft can be used to efficiently implement a linear convolution exactly (for the case of convolution of a finite or short sequence - the impulse response of the filter - with a long or infinite sequence, the overlap-add or overlap-save techniques are used).
Sure, of course. It's hard to know the way people are using terms. I agree that people don't usually use the term IIR when talking about an FFT-based filter (but there is an "effective" time-domain response for every filtering operation done in the Fourier domain --- as you noted). That's what I was referring to.
It's been a while since I wrote lfilter, but it transposes the filtering operation into Direct Form II, and then does a straightforward implementation of the feed-back and feed-forward equations.
Here is some information on the approach: https://ccrma.stanford.edu/~jos/fp/Direct_Form_II.html
IIR filters implemented in the time-domain need something like lfilter. FIR filters are "just" convolution in the time domain --- and there are different approaches to doing that discrete-time convolution as you've noted. IIR filters are *just* convolution as well (but convolution with an infinite sequence). Of course, if you use the FFT-domain to implement the filter, then you can just as well design in that space the filtering-function you want to multiply the input signal with (it's just important to keep in mind the impact in the time-domain of what you are doing in the frequency domain --- i.e. sharp-edges result in ringing, the basic time-frequency product limitations, etc.)
These same ideas come under different names and have different emphasis in different disciplines.
-Travis
Here, I claim the best approach is to realize that:
1. Just making the coefficients in the freq domain be samples of a desired response gives you no exact result (as you noted), but
2. On the other hand, fft can be used to perform fast convolution, which is (can be) mathematically exactly the same as time-domain convolution.
Therefore:
* use your favorite FIR filter design tool (e.g., remez) to design the filter
* implement it with fast (FFT-based) convolution
Now the only approximation is in the FIR filter design step, and you should know precisely the nature of any approximation.
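That recipe can be sketched in two steps with scipy: design with remez, then apply with fast convolution, which agrees with time-domain convolution to numerical precision (the tap count and band edges below are arbitrary):

```python
import numpy as np
from scipy.signal import remez, fftconvolve

# Step 1: design the FIR filter (Parks-McClellan / remez).
# With the default fs = 1.0, bands run from 0 to 0.5:
# passband up to 0.2, stopband from 0.25.
taps = remez(65, [0.0, 0.2, 0.25, 0.5], [1.0, 0.0])

# Step 2: apply it with FFT-based fast convolution -- mathematically
# the same operation as time-domain convolution.
rng = np.random.default_rng(5)
x = rng.standard_normal(4096)
assert np.allclose(fftconvolve(x, taps), np.convolve(x, taps))
```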
On Thu, Jan 5, 2012 at 5:30 PM, Neal Becker <ndbecker2@gmail.com> wrote:
Travis Oliphant wrote:
On Jan 5, 2012, at 1:19 PM, Neal Becker wrote:
Travis Oliphant wrote:
On Jan 5, 2012, at 10:00 AM, josef.pktd@gmail.com wrote:
On Thu, Jan 5, 2012 at 10:32 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
just out of curiosity: why no FFT-based IIR filter?
It looks like a small change in the implementation, but it is slower than lfilter for shorter time series, so I mostly dropped fft-based filtering.
I think he is talking about filter design, correct?
The comments I made were all about efficient filter implementation, not about filter design.
As for FFT-based IIR filters, I had never heard of them. I was talking about the fact that the fft can be used to efficiently implement a linear convolution exactly (for the case of convolution of a finite or short sequence - the impulse response of the filter - with a long or infinite sequence, the overlap-add or overlap-save techniques are used).
Sure, of course. It's hard to know the way people are using terms. I agree that people don't usually use the term IIR when talking about an FFT-based filter (but there is an "effective" time-domain response for every filtering operation done in the Fourier domain --- as you noted). That's what I was referring to.
It's been a while since I wrote lfilter, but it transposes the filtering operation into Direct Form II, and then does a straightforward implementation of the feed-back and feed-forward equations.
Here is some information on the approach: https://ccrma.stanford.edu/~jos/fp/Direct_Form_II.html
IIR filters implemented in the time-domain need something like lfilter. FIR filters are "just" convolution in the time domain --- and there are different approaches to doing that discrete-time convolution as you've noted. IIR filters are *just* convolution as well (but convolution with an infinite sequence). Of course, if you use the FFT-domain to implement the filter, then you can just as well design in that space the filtering-function you want to multiply the input signal with (it's just important to keep in mind the impact in the time-domain of what you are doing in the frequency domain --- i.e. sharp-edges result in ringing, the basic time-frequency product limitations, etc.)
These same ideas come under different names and have different emphasis in different disciplines.
-Travis
Here, I claim the best approach is to realize that:
1. Just making the coefficients in the freq domain be samples of a desired response gives you no exact result (as you noted), but
2. On the other hand, fft can be used to perform fast convolution, which is (can be) mathematically exactly the same as time-domain convolution.
Therefore:
* use your favorite FIR filter design tool (e.g., remez) to design the filter
* implement it with fast (FFT-based) convolution
Now the only approximation is in the FIR filter design step, and you should know precisely the nature of any approximation.
Thanks. If I understand both of you correctly, then the difference comes down to whether we want a parsimonious IIR parameterization, with only a few parameters that can be estimated as in time series analysis (Box-Jenkins), or whether we want to design a filter where having a "long" FIR representation doesn't have any disadvantages (in the frequency domain, FFT, the filter might be full length anyway). Josef
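A sketch of that trade-off: a one-pole AR filter versus its truncated impulse response used as a long FIR. The two produce numerically identical output, but one is parameterized by 2 numbers and the other by 400 (the pole at 0.9 and the truncation length are illustrative choices):

```python
import numpy as np
from scipy.signal import lfilter

# Parsimonious IIR/AR(1) parameterization: two coefficients.
b, a = [1.0], [1.0, -0.9]

# Its impulse response, truncated to a "long" FIR.
nfir = 400                       # 0.9**400 is far below machine precision
impulse = np.zeros(nfir)
impulse[0] = 1.0
h = lfilter(b, a, impulse)       # truncated impulse response

rng = np.random.default_rng(6)
x = rng.standard_normal(512)
y_iir = lfilter(b, a, x)             # 2 parameters
y_fir = np.convolve(h, x)[:len(x)]   # 400 parameters, same output

assert np.allclose(y_iir, y_fir)
```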
On 1/5/2012 9:32 AM, Neal Becker wrote:
Some comments on signal processing:
Correct me if I'm wrong, but I think scipy.signal (like matlab) implements only a general-purpose filter, which is an IIR filter, single rate. Efficiency is very important in my work, so I implement many optimized variations.
Most of the time, FIR filters are used. These then come in variations for single rate, interpolation, and decimation (there is also another design for rational rate conversion). Then these have variants for scalar/complex input/output, as well as complex in/out with scalar coefficients.
IIR filters are separate.
FFT-based FIR filters are another type, and include both complex in/out as well as scalar in/out (taking advantage of the 'two channel' trick for fft).
This link, http://www.scipy.org/Cookbook/ApplyFIRFilter, describes the different "filter" methods currently implemented in scipy, not just lfilter. Regards, Chris
On Thu, Jan 5, 2012 at 8:25 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Charles R Harris wrote:
...
Filter design could use improvement. I also have a remez algorithm that works for complex filter design that belongs somewhere.
Can I get a copy of this please??
Sure, it's attached. It's pretty old at this point and I don't consider it finished. If you want to work on it I could put a repository up on github. I experimented with both fft and barycentric Lagrange interpolation (a la the original), and ended up using barycentric interpolation to generate evenly spaced sample points and then an fft for finer interpolation, allowing fine grids with less computation. Along with that, the band edges are all rounded to grid points, whereas the original used the exact values. I haven't looked at this for two years and it needs tests, a filter design front end, and probably some cleanup/refactoring. Chuck
On Thu, Jan 5, 2012 at 2:43 AM, Travis Oliphant <travis@continuum.io> wrote:
5. Find a clear (group of) maintainer(s) for each sub-module. For people familiar with one module, responding to tickets and pull requests for that module would not cost so much time.
Is there a list where this is kept?
Not really. The only way you can tell a little bit right now is the way Trac tickets get assigned. For example, Pauli gets documentation, Josef gets stats tickets.
We could have a list on Trac, linked to from the developers page on scipy.org, where we have a list of modules with, for each module, a (group of) people listed who are interested in it and would respond to tickets and PRs for that module. Not necessarily to fix everything asap, but at least to review patches, respond to tickets, and outline how bugs should be fixed or enhancements could best be added.
For PRs I think everyone can follow the RSS feed that Pauli set up. For Trac I'm not sure it's possible to send notifications to more than one person. If not, at least the tickets should get assigned to one person who could then forward them, until there's a better solution.
As administrative points I would propose:
- People should be able to add and remove themselves from this list.
- Commit rights are not necessary to be on the list (but of course can be asked for).
- Add a recommendation that no one person should be the Trac assignee for more than two modules, and preferably only one if it's a large one.
The group of people interested in a module could also compile a list of things to do to improve the quality of the module, and add tickets to an "easy fixes" list.
Ralf
I'll jump in the discussion. As author of the odes scikit, I'd like to note that we moved development to github for the normal reasons: https://github.com/bmcage/odes
We are working on a cython implementation of the sundials solvers we need (I discussed this with the pysundials author, and they effectively have no more time to work on it except to keep it working for what they use it for), and are experimenting with the API. When we finalize this work, I'll ask to remove the svn version from the old servers. My co-worker on this hates the scikit namespace, but for now it is still in. The reason for a scikit, rather than patches to scipy.integrate, is as before: the dependency on sundials. I do think the (c)vode solver in scipy is too old-fashioned and would better be replaced by the current vode solver of sundials, so I would urge that some thought is given to whether those parts of scipy.integrate really should make it into a 1.0 version.
Another issue with the odes scikit is that nobody seems to know how the API for ODE or DAE is best done; different fields have their own typical workflow. So just doing it in the way that is useful for my applications seems like the fastest way forward, and if a broader community is interested, we can discuss. Also, I can change the API of my own things, but finding time to change the ode class in scipy.integrate would be difficult (I don't have a fixed position).
Benny
PS: For those interested, you can see the API for DAE at https://github.com/bmcage/odes/blob/master/scikits/odes/sundials/ida.pyx . I would think the main annoyance is that the equations must be passed to the init method as a class ResFunction, due to performance/technical reasons, which is not very scipy-like. That however would be for another mail thread, which I'll start at another time. Odes does not have its own mailing list at the moment.
Ralf Gommers wrote:
For PRs I think everyone can follow the RSS feed that Pauli set up. For Trac I'm not sure it's possible to send notifications to more than one person.
Trac generates RSS feeds as well in the "Custom Query" tab based on filters (e.g. by component, status). -- Denis
I would like to give some feedback on my experience as a contributor to scikit-learn. Here are a few things I like:
- Contributing and following the project allows me to improve my knowledge of the field (I'm a graduate student in machine learning). The signal-to-noise ratio on the mailing-list is high, as the threads are usually directly related to my interests. It's also a valuable addition to my CV.
- The barrier to entry is very low: the code base is not too big, the code is clear and the API is simple. This partly explains why we get so many pull requests from occasional contributors.
- Contributors get push privileges (become part of the scikit-learn github organization) after just a few pull requests and are fully credited in the changelogs and file headers. We have never had any problem with this policy: people usually know when a commit can be pushed to master directly and when it warrants a pull request / review first.
- All important decisions are taken democratically and we now have well-identified workflows. The small size of the project probably helps a lot.
- The project is very dynamic and is moving fast!
I like the idea of a core scipy library and an ecosystem of scikits with a well-identified scope around it! The success of scikit-learn could be used as a model, so as to reproduce the successes and not repeat the failures (see Gael's document on bootstrapping a community). This is already happening in scikit-image, as far as I can see.
Why use the prefix scikit- rather than a top-level package name? Because scikit should be a brand name and a guarantee of quality.
My 2 cents, Mathieu
On Tue, Jan 3, 2012 at 9:00 AM, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
As mentioned by others, there are multiple reasons why one may not want to put something in scipy. I would note that putting something in scikits today means it cannot be integrated into scipy later. But putting things in scipy has (implicitly at least) much stronger requirements around API stability than a scikit, and a much slower release process (I think on average, we made one release per year). cheers, David
On Tue, Jan 3, 2012 at 20:18, David Cournapeau <cournape@gmail.com> wrote:
On Tue, Jan 3, 2012 at 9:00 AM, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
As mentioned by others, there are multiple reasons why one may not want to put something in scipy. I would note that putting something in scikits today means it cannot be integrated into scipy later.
Why not? We incorporate pre-existing code all of the time. What makes a scikits project any different from others? -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Tue, Jan 3, 2012 at 8:33 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Tue, Jan 3, 2012 at 20:18, David Cournapeau <cournape@gmail.com> wrote:
On Tue, Jan 3, 2012 at 9:00 AM, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
As mentioned by others, there are multiple reasons why one may not want to put something in scipy. I would note that putting something in scikits today means it cannot be integrated into scipy later.
Why not? We incorporate pre-existing code all of the time. What makes a scikits project any different from others?
Sorry, I meant the contrary of what I wrote: of course, putting something in scikits does not prevent it from being integrated into scipy later. David
On Tue, Jan 3, 2012 at 9:18 PM, David Cournapeau <cournape@gmail.com> wrote:
On Tue, Jan 3, 2012 at 9:00 AM, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
As mentioned by others, there are multiple reasons why one may not want to put something in scipy. I would note that putting something in scikits today means it cannot be integrated into scipy later. But putting things in scipy has (implicitly at least) much stronger requirements around API stability than a scikit, and a much slower release process (I think on average, we made one release per year).
Integrating code into scipy after initially developing it as a separate package is something that is not really happening right now though. In cases like scikits.image/learn/statsmodels, which are active, growing projects, that of course doesn't make sense, but for packages that are stable and see little active development it should happen more imho.
Example 1: numerical differentiation. Algopy and numdifftools are two mature packages that are general enough that it would make sense to integrate them. Especially algopy has quite good docs. Not much active development, and the respective authors would be in favor, see http://projects.scipy.org/scipy/ticket/1510.
Example 2: pywavelets. A nice, complete package with good docs, much better than scipy.signal.wavelets. Very little development activity for the package, and wavelets are of interest for a wide variety of applications. It would have helped with the recent peak finding additions by Jacob Silterra, for example. (Not sure how the author of pywavelets would feel about this, it's just an example.)
I'm sure it's not difficult to find more examples. Scipy is getting released more frequently now than before, and I hope we can keep it that way. Perhaps there are simple reasons that integrating code doesn't happen, like lack of time of the main developer. But on the other hand, maybe we as scipy developers aren't as welcoming as we should be, or should just go and ask developers how they would feel about incorporating their mature code?
Ralf
Perhaps that is a concrete thing that I can do over the next few months: follow up with the developers of different packages that might be interested in incorporating their code into SciPy as a module or as part of another module. Longer term, I would like to figure out how to make SciPy more modular. -Travis
On Jan 3, 2012, at 2:37 PM, Ralf Gommers wrote:
On Tue, Jan 3, 2012 at 9:18 PM, David Cournapeau <cournape@gmail.com> wrote:
On Tue, Jan 3, 2012 at 9:00 AM, Travis Oliphant <travis@continuum.io> wrote:
I don't know if this has already been discussed or not. But, I really don't understand the reasoning behind "yet-another-project" for signal processing. That is the whole-point of the signal sub-project under the scipy namespace. Why not just develop there? Github access is easy to grant.
I must admit, I've never been a fan of the scikits namespace. I would prefer that we just stick with the scipy namespace and work on making scipy more modular and easy to distribute as separate modules in the first place. If you don't want to do that, then just pick a top-level name and use it.
As mentioned by others, there are multiple reasons why one may not want to put something in scipy. I would note that putting something in scikits today does not mean it cannot be integrated into scipy later. But putting things in scipy carries (implicitly at least) much stronger requirements around API stability than a scikit, and a much slower release process (I think on average we made one release per year).
Integrating code into scipy after initially developing it as a separate package is something that is not really happening right now, though. In cases like scikits.image/learn/statsmodels, which are active, growing projects, that of course doesn't make sense; but for packages that are stable and see little active development, it should happen more imho.
Example 1: numerical differentiation. Algopy and numdifftools are two mature packages that are general enough that it would make sense to integrate them. Especially algopy has quite good docs. Not much active development, and the respective authors would be in favor, see http://projects.scipy.org/scipy/ticket/1510.
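For readers who haven't used those packages: the core idea is automatic or adaptive numerical differentiation. A minimal sketch of the simplest underlying technique, a central finite difference (this is illustrative only, not algopy's or numdifftools' actual API, which do considerably more):

```python
def derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x), accurate to O(h**2).

    Packages like numdifftools refine this basic idea with adaptive
    step-size selection and higher-order schemes; algopy instead uses
    algorithmic differentiation to get exact derivatives.
    """
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dt (t**3) at t = 2 is exactly 12; the estimate is close to 12.0
print(derivative(lambda t: t ** 3, 2.0))
```

The naive version above degrades for badly scaled functions, which is exactly why a well-tested dedicated package is worth integrating rather than reimplementing.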
Example 2: pywavelets. Nice complete package with good docs, much better than scipy.signal.wavelets. Very little development activity for the package, and wavelets are of interest for a wide variety of applications. Would have helped with the recent peak finding additions by Jacob Silterra for example. (Not sure how the author of pywavelets would feel about this, it's just an example).
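To make the comparison concrete: a single level of the simplest discrete wavelet transform (Haar) can be sketched in a few lines of pure Python. This is a toy illustration of what such a package provides; pywavelets' real entry point (`pywt.dwt`) supports many wavelet families, boundary modes, and multilevel decomposition:

```python
import math

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform.

    Returns (approximation, detail) coefficient lists; the input
    length must be even. Pairwise sums capture the smooth trend,
    pairwise differences capture local detail.
    """
    s = math.sqrt(2.0)
    evens, odds = signal[0::2], signal[1::2]
    approx = [(a + b) / s for a, b in zip(evens, odds)]
    detail = [(a - b) / s for a, b in zip(evens, odds)]
    return approx, detail

approx, detail = haar_dwt([4.0, 6.0, 10.0, 12.0])
```

The normalization by sqrt(2) makes the transform orthonormal, so the signal's energy (sum of squares) is preserved across the coefficients, which is one reason wavelets are useful for the kind of peak-finding work mentioned above.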
I'm sure it's not difficult to find more examples. Scipy is getting released more frequently now than before, and I hope we can keep it that way. Perhaps there are simple reasons that integrating code doesn't happen, like lack of time of the main developer. But on the other hand, maybe we as scipy developers aren't as welcoming as we should be, or should just go and ask developers how they would feel about incorporating their mature code?
Ralf
On Tue, Jan 03, 2012 at 09:37:10PM +0100, Ralf Gommers wrote:
Integrating code into scipy after initially developing it as a separate package is something that is not really happening right now though.
I would like to respectfully disagree :). With regards to large contributions, Jake VanderPlas's work on arpack started in scikit-learn. The discussion that we had recently on integrating the graph algorithms shows that such integration will continue. In addition, if I look at the commits in scipy, I see plenty that were initiated in scikit-learn (I see them because I follow the contributions of scikit-learn developers). That said, I know what you mean: a lot of worthwhile code is developed on its own and never gets merged into a major package. It's a pity, as it would be more useful there. It is also easy to see why this happens: the authors implemented that code to scratch an itch, and once that itch is scratched, they are done.
Example 1: numerical differentiation. Algopy and numdifftools are two mature packages that are general enough that it would make sense to integrate them. Especially algopy has quite good docs. Not much active development, and the respective authors would be in favor, see http://projects.scipy.org/scipy/ticket/1510.
OK, this sounds like an interesting project that could/should get funding. Time to make a list for next year's GSOC, if we can find somebody willing to mentor it.
Example 2: pywavelets. Nice complete package with good docs, much better than scipy.signal.wavelets. Very little development activity for the package, and wavelets are of interest for a wide variety of applications.
Yes, pywavelets is high on my list of code that should live in a bigger package. I find that it's actually fairly technical code, and I would be wary of merging it in if there is not somebody with good expertise to maintain it. [snip (reordered quoting of Ralf's email)]
In cases like scikits.image/learn/statsmodels, which are active, growing projects, that of course doesn't make sense
Well, actually, if people think that some of the algorithms we have in scikit-learn should be merged back into scipy, we are open to it. A few things to keep in mind: - We have gathered significant experience with techniques related to stochastic algorithms and big data. I wouldn't like to merge overly technical code into scipy, for fear of it 'dying' there. Some people say that code goes to the Python standard library to die [1] :). - For the reasons explained in my previous mail (i.e. the pros of having domain-specific packages when it comes to highly specialized features), I don't think it is desirable to see the full codebase of scikit-learn merged into scipy in the long run.
Scipy is getting released more frequently now than before, and I hope we can keep it that way.
This, plus the move to github, does make it much easier to contribute. I think that it is having a noticeable impact.
or should just go and ask developers how they would feel about incorporating their mature code?
That might actually be useful. Gael [1] http://frompythonimportpodcast.com/episode-004-dave-hates-decorators-where-c...
Participants (20)
- alan@ajackson.org
- Alexandre Gramfort
- Benny Malengier
- Charles R Harris
- Christopher Felton
- David Cournapeau
- Denis Laxalde
- Fernando Perez
- Gael Varoquaux
- Jaidev Deshpande
- josef.pktd@gmail.com
- Mathieu Blondel
- Neal Becker
- Pauli Virtanen
- Ralf Gommers
- Robert Kern
- Skipper Seabold
- Travis Oliphant
- Warren Weckesser
- Zachary Pincus