[GSoC 2008] Machine learning package in SciPy
Hi all,

It might be a good idea to have a machine learning (ML) package in SciPy. As I understand it, there is some ML code in SciKits, but it is in a raw state?

There are a lot of machine learning projects, each with its own data format, set of classifiers, feature selection algorithms, and benchmarks. But if you want to compare your own algorithm with others, you have to convert your data to the input format of every tool you want to use, and after training you have to convert each tool's output back into a single format before the results can be compared (for example, you want to see which features were selected in common by the different tools).

Right now I am analyzing different ML approaches for the special case of text classification. I could not find an ML framework appropriate for my task. I have two simple requirements for such a framework: it should support a sparse data format, and it should include at least an SVM classifier. For example, Orange [1] is a very good data mining project, but it has poor support for sparse formats. PyML [2] has all the needed features, but it is hard to install on different platforms and its code design is not perfect.

I believe that creating a framework that scientists find convenient for integrating their own algorithms is a very useful challenge. Scientists often talk about standard machine learning software [3], and SciPy may be an appropriate platform for developing such software.

I can write a detailed proposal, but first I want to know whether anyone is interested. Any wishes or recommendations?

1. Orange http://magix.fri.uni-lj.si/orange/
2. PyML http://pyml.sourceforge.net/
3. The Need for Open Source Software in Machine Learning http://www.jmlr.org/papers/volume8/sonnenburg07a/sonnenburg07a.pdf

--
Anton Slesarev
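As a rough illustration of the sparse-data requirement described above (a made-up three-document corpus and a naive whitespace tokenizer, nothing more): a bag-of-words matrix for text classification is almost entirely zeros, so scipy.sparse is the natural container, and the resulting matrix is what one would then hand to whatever SVM wrapper the package exposes.

# A sketch only: tiny corpus, whitespace tokenizer, raw term counts.
import numpy as np
import scipy.sparse as sp

docs = [
    "machine learning in scipy",
    "sparse data and svm classifiers",
    "feature selection for text classification",
]

# build a vocabulary: term -> column index
vocab = {}
rows, cols, vals = [], [], []
for i, doc in enumerate(docs):
    for term in doc.split():
        j = vocab.setdefault(term, len(vocab))
        rows.append(i)
        cols.append(j)
        vals.append(1.0)

# duplicate (row, col) pairs are summed, giving raw term counts
X = sp.csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))

print(X.shape)   # (3, number of distinct terms)
print(X.nnz)     # number of stored non-zeros
# X can now be fed to any classifier that accepts scipy.sparse input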
Hi,

David Cournapeau is maintaining the learn scikit; that is the main place where machine learning code will be put. For instance, there are already classifiers (SVMs via libsvm), and the most commonly used manifold learning techniques will be added in the near future.

I didn't understand what you meant by "you want to see which features were selected in common by the different tools".

Sparse matrix support must be added at the C level for libsvm; you would have to ask Albert, who wrapped libsvm. For the manifold learning code, the techniques that can support sparse matrices do support them (for instance Laplacian Eigenmaps).

Matthieu
--
French PhD student
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
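To make the remark about Laplacian Eigenmaps and sparse matrices concrete, here is a rough, self-contained sketch using only NumPy and SciPy. It is not the scikit's implementation, just the textbook construction, with a naive O(n^2) neighbour search and a dense eigensolver at the end to keep the example short.

# Laplacian Eigenmaps sketch: sparse heat-kernel kNN graph, graph
# Laplacian, generalised eigenproblem.
import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_neighbors=10, n_components=2):
    """Embed the rows of X via the Laplacian of a kNN graph."""
    n = X.shape[0]
    # pairwise squared distances (dense and O(n^2); fine for a toy run)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    t = np.median(d2)                    # heat-kernel bandwidth
    rows, cols, vals = [], [], []
    for i in range(n):
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:   # skip self
            rows.append(i)
            cols.append(j)
            vals.append(np.exp(-d2[i, j] / t))
    W = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    W = W.maximum(W.T)                   # symmetrise the kNN graph
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    L = (D - W).tocsr()                  # graph Laplacian, still sparse
    # generalised eigenproblem L v = lambda D v; solved densely here for
    # simplicity, a large-scale version would keep L and D sparse and
    # use scipy.sparse.linalg.eigsh instead
    eigvals, eigvecs = eigh(L.toarray(), D.toarray())
    return eigvecs[:, 1:n_components + 1]   # drop the constant eigenvector

if __name__ == "__main__":
    X = np.random.rand(200, 5)
    print(laplacian_eigenmaps(X).shape)      # (200, 2)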
> David Cournapeau is maintaining the learn scikit; that is the main place where machine learning code will be put. For instance, there are already classifiers (SVMs via libsvm), and the most commonly used manifold learning techniques will be added in the near future.
>
> I didn't understand what you meant by "you want to see which features were selected in common by the different tools".
I mean that if we have a standard format for the different classifiers, we can compare their results; for example, we can look at the intersection of the features each of them selected. If we use different tools, we have to make tedious conversions between the different formats.
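A minimal sketch of that comparison, assuming each tool's output has already been converted to a plain set of selected feature names (the names below are invented for the example):

# Once every tool's selection is in a common form, the overlap is
# one line of Python.
selected = {
    "tool_a": {"free", "winner", "urgent", "click"},
    "tool_b": {"free", "urgent", "offer"},
    "tool_c": {"free", "urgent", "click", "unsubscribe"},
}

common = set.intersection(*selected.values())
print(sorted(common))   # features chosen by every tool: ['free', 'urgent']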
> Sparse matrix support must be added at the C level for libsvm; you would have to ask Albert, who wrapped libsvm.
I see. I am saying that it would be a good idea to write parsers for the different data formats.
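One such parser, sketched here for the svmlight/libsvm text format ("label index:value index:value ..."): a simplified reader that ignores comments and assumes well-formed lines, turning the file into a label vector plus a scipy.sparse matrix as a common in-memory representation.

import numpy as np
import scipy.sparse as sp

def load_svmlight(path):
    """Read lines like '+1 3:0.5 12:1.0' with 1-based feature indices."""
    labels, rows, cols, vals = [], [], [], []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue                    # skip blank lines
            row = len(labels)
            labels.append(float(fields[0]))
            for item in fields[1:]:
                idx, val = item.split(":")
                rows.append(row)
                cols.append(int(idx) - 1)   # convert to 0-based columns
                vals.append(float(val))
    n_features = max(cols) + 1 if cols else 0
    X = sp.csr_matrix((vals, (rows, cols)),
                      shape=(len(labels), n_features))
    return X, np.array(labels)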
> For the manifold learning code, the techniques that can support sparse matrices do support them (for instance Laplacian Eigenmaps).
-- Anton Slesarev
2008/3/11, Anton Slesarev <slesarev.anton@gmail.com>:
> I mean that if we have a standard format for the different classifiers, we can compare their results; for example, we can look at the intersection of the features each of them selected. If we use different tools, we have to make tedious conversions between the different formats.
Well, a classifier gives back the class a point belongs to, doesn't it? Do you mean adding probabilities? That would depend on the kind of classifier (strong or fuzzy). For the time being, David has proposed a data format, at least for inputs; the same format could be extended to output values, where that makes sense (see the sketch below).

>> Sparse matrix support must be added at the C level for libsvm; you would have to ask Albert, who wrapped libsvm.
> I see. I am saying that it would be a good idea to write parsers for the different data formats.
I agree. I hope that David will be able to finish the Scons build of NumPy and SciPy so that he can then enhance his data format proposal :)

Matthieu
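As promised above, here is one possible shape for a shared output structure, sketched as a plain dict of NumPy arrays. The field names are invented for this example and are not David's actual proposal; the idea is only that every classifier returns its predicted labels in the same container, with per-class probabilities filled in when the model can provide them.

import numpy as np

def make_result(labels, classes, proba=None):
    """Bundle predictions in a tool-independent structure (sketch)."""
    return {
        "classes": np.asarray(classes),     # the possible class values
        "labels": np.asarray(labels),       # predicted class per sample
        "proba": None if proba is None      # (n_samples, n_classes) array,
                 else np.asarray(proba),    # or None for hard output
    }

# a hard ("strong") classifier only fills in the labels:
hard = make_result([1, -1, 1], classes=[-1, 1])
# a fuzzy/probabilistic one also returns class probabilities:
soft = make_result([1, -1], classes=[-1, 1],
                   proba=[[0.2, 0.8], [0.7, 0.3]])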
Participants (2):
- Anton Slesarev
- Matthieu Brucher