GSoC Project Proposal: Datasource and Jonathan Taylor's statistical models
Hello all, I am a first-year PhD student in Economics at American University, and I would very much like to participate in the GSoC with the NumPy/SciPy community. I am looking for some feedback and discussion before I submit a proposal. Judging by the ideas page and the discussion in this thread ( http://mail.scipy.org/pipermail/scipy-dev/2009-February/011373.html ), I think the following project proposal would be useful to the community.

My proposal would have two parts. The first would be to improve datasource and integrate it into the numpy/scipy io. I see this as a way to get my feet wet working on a project; I do not imagine that it would take more than 2-3 weeks' work on my end.

The second part would be to get Jonathan Taylor's statistical models from the NiPy project into scipy.stats. I think that I would be a good candidate for this work, as I am currently studying statistics and learning the ins and outs of NumPy/SciPy, so I don't mind doing some of the less appealing work, as this is also a great learning opportunity. I also see this as a great way to get involved in the SciPy community in an area that currently needs some attention. As a student, I would be able to help maintain the code, fix bugs, and address other areas of the statistical capabilities that need attention.

Below is a general outline of my proposal with some areas that I have identified as needing work. I am eager to discuss aspects of the projects with those who are interested and to work on the appropriate milestones.
1) Improve datasource and integrate it into all the numpy/scipy io
- Bug Fixes
  - Catch and handle malformed URLs
- Refactoring
- Enhancements
  - Improve findfile method
  - Improve cache method
  - Add zip archive, tar file handling capabilities
  - Improve networking interface to handle timeouts and proxies if there is sufficient interest
- Documentation
  - Document changes
- Tests
  - Implement test coverage for new changes
- Copy/Move to scipy.io

2) Integrate Jonathan Taylor's statistical models into scipy.stats
- These models are currently in the NiPy project
- Merge relevant branches (branch trunk-josef models has the most recent changes, I believe)
- I will focus mostly on bringing over the linear models, which I believe would include at the least: bspline.py, contrast.py, gam.py, glm.py, model.py, regression.py, utils.py
- Bug Fixes
  - Bug hunting
  - Improve existing test coverage
- Refactoring
  - Eliminate existing and created duplicate functionality
  - Make sure parameters are consistent, etc.
- Enhancements
- Documentation
  - Document changes
  - Make any necessary changes to stats/info.py
- Testing
  - Make sure test coverage is adequate
I think this proposal would be useful, and I would be willing to serve as a GSoC mentor in order to support it. (I was a mentor for the past two summers.) Alan Isaac
I should add that I know Skipper Seabold, who is an Economics PhD student at American University, where I work. Alan Isaac
On Fri, Mar 27, 2009 at 11:43 AM, Alan G Isaac <aisaac@american.edu> wrote:
I think this proposal would be useful, and I would be willing to serve as a GSoC mentor in order to support it. (I was a mentor for the past two summers.)
I also like this project and am happy to hear that you are interested in mentoring it. Jarrod
Skipper Seabold wrote:
Hello all,
I am a first year PhD student in Economics at American University, and I would very much like to participate in the GSoC with the NumPy/SciPy community. I am looking for some feedback and discussion before I submit a proposal.
Judging by the ideas page and the discussion in this thread ( http://mail.scipy.org/pipermail/scipy-dev/2009-February/011373.html ) I think the following project proposal would be useful to the community.
My proposal would have two parts. The first would be to improve datasource and integrate it into the numpy/scipy io. I see this as a way to get my feet wet working on a project; I do not imagine that it would take more than 2-3 weeks' work on my end.
Can you provide a link to datasource?
The second part would be to get Jonathan Taylor's statistical models from the NiPy project into scipy.stats. I think that I would be a good candidate for this work, as I am currently studying statistics and learning the ins and outs of NumPy/SciPy, so I don't mind doing some of the less appealing work, as this is also a great learning opportunity. I also see this as a great way to get involved in the SciPy community in an area that currently needs some attention. As a student, I would be able to help maintain the code, fix bugs, and address other areas of the statistical capabilities that need attention.
I would be willing to help to some degree. I would strongly suggest that the main emphasis be just to get Jonathan's code integrated into Scipy, and perhaps something from various places like Scikit learn (how many logistic regression or least squares codes do we really need?) and EconPy http://code.google.com/p/econpy/wiki/EconPy It is too complex to address anything more than this, and this would provide a very solid base for future development.
Below is a general outline of my proposal with some areas that I have identified as needing work. I am eager to discuss some aspects of the projects with those that are interested and to work on the appropriate milestones.
1) Improve datasource and integrate it into all the numpy/scipy io
Bug Fixes
- Catch and handle malformed URLs
Refactoring
Enhancements
- Improve findfile method
- Improve cache method
- Add zip archive, tar file handling capabilities
- Improve networking interface to handle timeouts and proxies if there is sufficient interest
Documentation
- Document changes
Tests
- Implement test coverage for new changes
Copy/Move to scipy.io
This looks like quite a lot of work for a short period, especially doing both parts. (I am also biased toward having the stats part finished.)
2) Integrate Jonathan Taylor's statistical models into scipy.stats
These models are currently in the NiPy project
- Merge relevant branches (branch trunk-josef models has the most recent changes, I believe)
I will focus mostly on bringing over the linear models, which I believe would include at the least: bspline.py, contrast.py, gam.py, glm.py, model.py, regression.py, utils.py
Not that it is really that important, but these are not all 'linear models' :-)
Bug Fixes
- Bug hunting
- Improve existing test coverage
Refactoring
- Eliminate existing and created duplicate functionality
- Make sure parameters are consistent, etc.
I would not be that concerned with duplicate functionality, because it is better to train people to use the new code and deprecate the old code. There are some cases where you may want different versions; for example, code that assumes normality will be faster than code for generalized linear models where non-normal distributions are allowed.
Enhancements
I would think that it is essential to get these to work with masked arrays (which allow missing observations) or record arrays (which enable the use of 'variable' names in model statements, as most statistics packages do).
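To make this point concrete, a minimal sketch of the two ideas with made-up data (illustrative only, not the models API):

```python
import numpy as np

# Masked arrays: missing observations are flagged rather than dropped,
# and reductions skip them automatically.
income = np.ma.masked_invalid([50.0, np.nan, 62.0, 47.0])
print(income.mean())  # -> 53.0, the mean over the 3 observed values

# Record arrays: columns are addressed by name, much like a model
# statement such as 'income ~ educ' in other statistics packages.
data = np.rec.fromarrays(
    [np.array([50.0, 55.0, 62.0]), np.array([12, 14, 16])],
    names="income,educ",
)
print(data.income)  # field access by 'variable' name
```

The open design question is whether model code should accept both kinds of input and dispatch internally, or whether masked-array support should live in a parallel namespace as stats/mstats do today.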
Documentation
- Document changes
- Make any necessary changes to stats/info.py
Actually the reference is the SciPy documentation marathon. I would also suggest that examples/tutorials are important here.
Testing
- Make sure test coverage is adequate
I would like to see the inclusion of Statistical Reference Datasets Project: http://www.itl.nist.gov/div898/strd/ The datasets would allow us to validate the accuracy of the code. Regards Bruce
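The validation pattern behind this suggestion is straightforward: fit a dataset and compare the estimates against certified values with a tight tolerance. The numbers below are constructed for illustration (the data are exactly linear, so the "certified" parameters are known by construction); a real test would load both the data and the certified values from a NIST StRD file.

```python
import numpy as np

# Illustrative data generated from y = 2*x + 1 exactly, so the
# reference parameters are known; a real test would read x, y and
# the certified estimates from a NIST StRD dataset instead.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
certified_slope, certified_intercept = 2.0, 1.0

# Ordinary least squares fit of a degree-1 polynomial.
slope, intercept = np.polyfit(x, y, 1)

# StRD certifies results to many significant digits; use a tight rtol.
np.testing.assert_allclose(slope, certified_slope, rtol=1e-10)
np.testing.assert_allclose(intercept, certified_intercept, rtol=1e-10)
```

Wrapping each StRD dataset in a test of this shape would give scipy.stats a regression suite against externally certified results rather than values produced by the code itself.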
On Fri, Mar 27, 2009 at 12:21 PM, Bruce Southey <bsouthey@gmail.com> wrote:
Can you provide a link to datasource?
http://projects.scipy.org/numpy/browser/trunk/numpy/lib/_datasource.py
Jarrod Millman wrote:
On Fri, Mar 27, 2009 at 12:21 PM, Bruce Southey <bsouthey@gmail.com> wrote:
Can you provide a link to datasource?
http://projects.scipy.org/numpy/browser/trunk/numpy/lib/_datasource.py
Thanks! Not getting into the merits of either part, I think you are asking for trouble doing both, because there is no clear connection between the two parts. Knowing one part is not going to help you with the other. (The argument that it helps get 'your feet wet' is rather lame.) But you could justify it by saying you want a solution for accessing online datasets like: UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml/ Datamob: http://datamob.org/datasets Bruce
Bruce Southey wrote:
Not getting into the merits of either part, I think you are asking for trouble doing both, because there is no clear connection between the two parts. Knowing one part is not going to help you with the other. (The argument that it helps get 'your feet wet' is rather lame.)
Your point is well taken. I think I will focus on the second part, as there seems to be much more interest in the statistical functionality. And my work would undoubtedly be better if focused.
I would strongly suggest that the main emphasis is just to get Jonathan's code integrated into Scipy and perhaps something from various places like the Scikit learn (how many logistic regression or least squares codes do we really need?) and EconPy http://code.google.com/p/econpy/wiki/EconPy
I will have a closer look through Scikit learn and econpy and revise.
I would think that it is essential to get these to work with masked arrays (which allow missing observations) or record arrays (which enable the use of 'variable' names in model statements, as most statistics packages do).
I agree. There has been some discussion of the most appropriate way to handle this in your previously mentioned thread (e.g., it would not always be appropriate to force conversion to a masked array; should stats and mstats be merged; etc.), and I would appreciate any direction that could be offered. I like the idea of the "usemask" flag here http://mail.scipy.org/pipermail/scipy-dev/2009-February/011414.html but would obviously defer to others for the best solution. Should I be spending most of my time looking through mstats rather than stats?
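The "usemask" idea from that thread could look roughly like the following. The function name and flag are hypothetical, purely to illustrate dispatching between a plain fast path and a masked path in one entry point:

```python
import numpy as np

def nanmean_flexible(a, usemask=False):
    """Mean that either runs on plain arrays (fast path) or routes
    through numpy.ma so masked/missing values are ignored.

    Hypothetical sketch of the 'usemask' flag idea, not scipy API.
    """
    if usemask:
        # Masked path: invalid entries (NaN, pre-existing mask) are skipped.
        return np.ma.masked_invalid(a).mean()
    # Plain path: assumes complete data, no masking overhead.
    return np.asarray(a).mean()
```

For example, `nanmean_flexible([1.0, float("nan"), 3.0], usemask=True)` returns 2.0, while the default path would propagate the NaN; the design question is whether one function with a flag beats the current stats/mstats split.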
I would like to see the inclusion of Statistical Reference Datasets Project: http://www.itl.nist.gov/div898/strd/
The datasets would allow us to validate the accuracy of the code.
Very good idea. Thanks for some initial feedback. I will take under advisement and revise my proposal as needed. Best, Skipper
I think it would be very good if you could improve some of the statistics in scipy. On Fri, Mar 27, 2009 at 6:09 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
Bruce Southey wrote:
Not getting into the merits of either part, I think you are asking for trouble doing both, because there is no clear connection between the two parts. Knowing one part is not going to help you with the other. (The argument that it helps get 'your feet wet' is rather lame.)
Your point is well taken. I think I will focus on the second part, as there seems to be much more interest in the statistical functionality. And my work would undoubtedly be better if focused.
I think there is enough to do in improving statistics that you don't need to add another side project. And as a warmup, increasing test coverage would be very useful.
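As an example of this kind of warmup test, here is a check of scipy.stats.linregress against values that can be verified by hand: on exactly linear data the slope, intercept, and correlation are known in closed form.

```python
import numpy as np
from scipy import stats

# y = 2*x + 1 exactly, so slope = 2, intercept = 1, and r = 1 by hand.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

result = stats.linregress(x, y)
np.testing.assert_allclose(result.slope, 2.0)
np.testing.assert_allclose(result.intercept, 1.0)
np.testing.assert_allclose(result.rvalue, 1.0)
```

Tests like this are trivial to write but catch regressions in the basic estimators, and they document the expected behavior for anyone porting the models code.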
I would strongly suggest that the main emphasis is just to get Jonathan's code integrated into Scipy and perhaps something from various places like the Scikit learn (how many logistic regression or least
From a help search in R, I would say 5 to 10 logistic regression implementations.
squares codes do we really need?) and EconPy http://code.google.com/p/econpy/wiki/EconPy
I will have a closer look through Scikit learn and econpy and revise.
One of my current favorites is pymvpa; it looks mostly well written and has quite good coverage of multivariate statistics. For distributions, pymc is the most complete (which will not concern you so much given the focus on models), and of course there is nipy (all MIT or BSD licensed). Most machine learning libraries have more restrictive licenses, which constrains how much we can look at them. For regression and implementation details I also look quite often at the Econometrics Toolbox by James P. LeSage http://www.spatial-econometrics.com/ in Matlab, which has "classical" econometrics algorithms in the public domain.
I would think that it is essential to get these to work with masked arrays (which allow missing observations) or record arrays (which enable the use of 'variable' names in model statements, as most statistics packages do).
I agree. There has been some discussion of the most appropriate way to handle this in your previously mentioned thread (e.g., it would not always be appropriate to force conversion to a masked array; should stats and mstats be merged; etc.), and I would appreciate any direction that could be offered. I like the idea of the "usemask" flag here http://mail.scipy.org/pipermail/scipy-dev/2009-February/011414.html but would obviously defer to others for the best solution. Should I be spending most of my time looking through mstats rather than stats?
I would like to see the inclusion of Statistical Reference Datasets Project: http://www.itl.nist.gov/div898/strd/
The datasets would allow us to validate the accuracy of the code.
Very good idea.
The problem is that it has very limited coverage. I recently scraped/parsed the ANOVA examples (it has only balanced designs) to check stats.f_oneway and the anova in pymvpa. I took a non-linear regression case to test optimize.curve_fit, and there are additional linear regression and descriptive statistics datasets, which would be more for numpy, and one more. I was looking for other benchmarks, but with only limited success.
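For stats.f_oneway specifically, the same hand-checkable style works without any external dataset: three equal-sized groups with means 2, 3, 4 and identical within-group spread give a between-group mean square of 3 and a within-group mean square of 1, so F = 3 exactly, and for an F(2, 6) distribution the p-value is (1 + 2F/6)^(-3) = 0.125.

```python
import numpy as np
from scipy import stats

# Three balanced groups, means 2, 3, 4, each with deviations -1, 0, 1
# from its own mean, so the within-group mean square is exactly 1.
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 3.0, 4.0]
g3 = [3.0, 4.0, 5.0]

F, p = stats.f_oneway(g1, g2, g3)
# Between-group MS = 3, within-group MS = 1, hence F = 3 and
# p = (1 + 2*F/6)**(-3) = 0.125 under F(2, 6), up to floating point.
print(F, p)
```

Cases like this complement the StRD data: they exercise the code path with values derivable by hand rather than relying solely on scraped reference output.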
Thanks for some initial feedback. I will take under advisement and revise my proposal as needed.
A straightforward port of "models" would not be a lot of work, mainly increasing test coverage and fixing any bugs the tests reveal. However, changes to the structure (refactoring) and completing missing pieces, such as additional test statistics, can be quite time consuming.
From my experience with stats, one of the biggest time sinks in checking code from someone else can be hunting for a reference to fix some numbers that are not quite right compared to R or matlab (e.g. tie handling or some "exotic" distributions). Being able to follow some good books is very helpful.
Some time will also be required on the design when pulling new code into scipy, because code that is written for a specialized package might not be in the right form for a general-purpose scipy. I assume we will have more discussion later, Josef
participants (5)
- Alan G Isaac
- Bruce Southey
- Jarrod Millman
- josef.pktd@gmail.com
- Skipper Seabold