From aniket.g.meshram at gmail.com Fri Dec 1 16:05:11 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Sat, 2 Dec 2017 02:35:11 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' Message-ID: Hi, I'm following the 'Ways to contribute' page. After forking and cloning, I ran the command 'python setup.py build_ext --inplace', which gives me the following error: cc1: some warnings being treated as errors error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with exit status 1 AnnGM -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Dec 2 06:30:02 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 2 Dec 2017 22:30:02 +1100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: There's not enough information there for us to help you. Please provide the full log if possible. Are you sure you want to build from source? On 2 December 2017 at 08:05, Aniket Meshram wrote: > Hi, > > I'm following the 'Ways to contribute' page. > > After forking and cloning, I ran the command 'python setup.py build_ext > --inplace', > which gives me the following error: > > cc1: some warnings being treated as errors > error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 > -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time > -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat > -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include > -I/usr/lib/python2.7/dist-packages/numpy/core/include > -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o > build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with > exit status 1 > > AnnGM > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Sun Dec 3 10:30:47 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Sun, 3 Dec 2017 21:00:47 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Hi Joel. Please find attached the full log for 'pip install --editable .' *Are you sure you want to build from source?* *[Well, I'm trying the development version, since I'd like to help fix bugs. Wait, so you mean we are experiencing an issue with the source install? Oh man! If yes, can you please suggest some other method (I mean apart from the stable install), or is it OK to go for the stable install?]* Thanks On Sat, Dec 2, 2017 at 5:00 PM, Joel Nothman wrote: > There's not enough information there for us to help you. Please provide > the full log if possible. Are you sure you want to build from source?
> > On 2 December 2017 at 08:05, Aniket Meshram > wrote: > >> hi, >> >> I'm following the 'ways to contribute page' >> >> After forking and cloning, I ran the command 'python setup.py build_ext >> --inplace' >> which is giving me the following error: >> >> cc1: some warnings being treated as errors >> error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 >> -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time >> -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat >> -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include >> -I/usr/lib/python2.7/dist-packages/numpy/core/include >> -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o >> build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with >> exit status 1 >> >> AnnGM >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- sumedh at sumedh-Inspiron-N4010:scikit-learn $ sudo -H pip install --editable . Obtaining file:///home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn Installing collected packages: scikit-learn Running setup.py develop for scikit-learn Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps: Partial import of sklearn during the build process. blas_opt_info: blas_mkl_info: libraries mkl,vml,guide not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE openblas_info: libraries openblas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_3_10_blas_threads_info: Setting PTATLAS=ATLAS libraries tatlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_3_10_blas_info: libraries satlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_blas_threads_info: Setting PTATLAS=ATLAS libraries ptf77blas,ptcblas,atlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_blas_info: libraries f77blas,cblas,atlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE /usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1640: UserWarning: Atlas (http://math-atlas.sourceforge.net/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [atlas]) or by setting the ATLAS environment variable. warnings.warn(AtlasNotFoundError.__doc__) blas_info: libraries blas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE /usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1649: UserWarning: Blas (http://www.netlib.org/blas/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [blas]) or by setting the BLAS environment variable. 
warnings.warn(BlasNotFoundError.__doc__) blas_src_info: NOT AVAILABLE /usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1652: UserWarning: Blas (http://www.netlib.org/blas/) sources not found. Directories to search for the sources can be specified in the numpy/distutils/site.cfg file (section [blas_src]) or by setting the BLAS_SRC environment variable. warnings.warn(BlasSrcNotFoundError.__doc__) NOT AVAILABLE sklearn/setup.py:72: UserWarning: Blas (http://www.netlib.org/blas/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [blas]) or by setting the BLAS environment variable. warnings.warn(BlasNotFoundError.__doc__) missing cimport in module 'sklearn.neighbors': sklearn/manifold/_barnes_hut_tsne.pyx running develop running build_scripts running egg_info running build_src build_src building library "libsvm-skl" sources building library "cblas" sources building extension "sklearn.__check_build._check_build" sources building extension "sklearn.cluster._dbscan_inner" sources building extension "sklearn.cluster._hierarchical" sources building extension "sklearn.cluster._k_means_elkan" sources building extension "sklearn.cluster._k_means" sources building extension "sklearn.datasets._svmlight_format" sources building extension "sklearn.decomposition._online_lda" sources building extension "sklearn.decomposition.cdnmf_fast" sources building extension "sklearn.ensemble._gradient_boosting" sources building extension "sklearn.feature_extraction._hashing" sources building extension "sklearn.manifold._utils" sources building extension "sklearn.manifold._barnes_hut_tsne" sources building extension "sklearn.metrics.pairwise_fast" sources building extension "sklearn.metrics/cluster.expected_mutual_info_fast" sources building extension "sklearn.neighbors.ball_tree" sources building extension "sklearn.neighbors.kd_tree" sources building extension "sklearn.neighbors.dist_metrics" sources building extension "sklearn.neighbors.typedefs" sources building extension "sklearn.neighbors.quad_tree" sources building extension "sklearn.tree._tree" sources building extension "sklearn.tree._splitter" sources building extension "sklearn.tree._criterion" sources building extension "sklearn.tree._utils" sources building extension "sklearn.svm.libsvm" sources building extension "sklearn.svm.liblinear" sources building extension "sklearn.svm.libsvm_sparse" sources building extension "sklearn._isotonic" sources building extension "sklearn.linear_model.cd_fast" sources building extension "sklearn.linear_model.sgd_fast" sources building extension "sklearn.linear_model.sag_fast" sources building extension "sklearn.utils.sparsefuncs_fast" sources building extension "sklearn.utils.arrayfuncs" sources building extension "sklearn.utils.murmurhash" sources building extension "sklearn.utils.lgamma" sources building extension "sklearn.utils.graph_shortest_path" sources building extension "sklearn.utils.fast_dict" sources building extension "sklearn.utils.seq_dataset" sources building extension "sklearn.utils.weight_vector" sources building extension "sklearn.utils._random" sources building extension "sklearn.utils._logistic_sigmoid" sources building data_files sources build_src: building npy-pkg config files writing requirements to scikit_learn.egg-info/requires.txt writing scikit_learn.egg-info/PKG-INFO writing top-level names to scikit_learn.egg-info/top_level.txt writing dependency_links to scikit_learn.egg-info/dependency_links.txt warning: 
manifest_maker: standard file '-c' not found reading manifest file 'scikit_learn.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'scikit_learn.egg-info/SOURCES.txt' running build_ext customize UnixCCompiler customize UnixCCompiler using build_clib customize UnixCCompiler customize UnixCCompiler using build_ext customize UnixCCompiler customize UnixCCompiler using build_ext building 'sklearn.neighbors.quad_tree' extension compiling C sources C compiler: x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC compile options: '-I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -I/usr/include/python2.7 -c' x86_64-linux-gnu-gcc: sklearn/neighbors/quad_tree.c In file included from /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1777:0, from /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18, from /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4, from sklearn/neighbors/quad_tree.c:259: /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp] #warning "Using deprecated NumPy API, disable it by " \ ^ sklearn/neighbors/quad_tree.c:2365:1: warning: function declaration isn't a prototype [-Wstrict-prototypes] static PyObject *__pyx_pf_7sklearn_9neighbors_9quad_tree_9_QuadTree_16test_summarize(); /* proto */ ^ sklearn/neighbors/quad_tree.c: In function '__pyx_f_7sklearn_9neighbors_9quad_tree_9_QuadTree_insert_point': sklearn/neighbors/quad_tree.c:3523:14: error: format not a string literal and no format arguments [-Werror=format-security] printf(__pyx_k_QuadTree_found_a_duplicate); ^ sklearn/neighbors/quad_tree.c: In function '__pyx_f_7sklearn_9neighbors_9quad_tree_9_QuadTree__get_cell_ndarray': sklearn/neighbors/quad_tree.c:6584:36: warning: passing argument 1 of '(PyObject * (*)(PyTypeObject *, PyArray_Descr *, int, npy_intp *, npy_intp *, void *, int, PyObject *))*(PyArray_API + 752u)' from incompatible pointer type [-Wincompatible-pointer-types] __pyx_t_2 = PyArray_NewFromDescr(((PyObject *)__pyx_ptype_5numpy_ndarray), ((PyArray_Descr *)__pyx_t_1), 1, __pyx_v_shape, __pyx_v_strides, ((void *)__pyx_v_self->cells), NPY_DEFAULT, Py_None); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 574; __pyx_clineno = __LINE__; goto __pyx_L1_error;} ^ sklearn/neighbors/quad_tree.c:6584:36: note: expected 'PyTypeObject * {aka struct _typeobject *}' but argument is of type 'PyObject * {aka struct _object *}'
sklearn/neighbors/quad_tree.c: At top level: sklearn/neighbors/quad_tree.c:7015:18: warning: function declaration isn't a prototype [-Wstrict-prototypes] static PyObject *__pyx_pf_7sklearn_9neighbors_9quad_tree_9_QuadTree_16test_summarize() { ^ cc1: some warnings being treated as errors error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with exit status 1 ---------------------------------------- Command "/usr/bin/python -c "import setuptools, tokenize;__file__='/home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps" failed with error code 1 in /home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn/ sumedh at sumedh-Inspiron-N4010:scikit-learn $ From pengyu.ut at gmail.com Sun Dec 3 15:54:08 2017 From: pengyu.ut at gmail.com (Peng Yu) Date: Sun, 3 Dec 2017 14:54:08 -0600 Subject: [scikit-learn] a dataset suitable for logistic regression Message-ID: Hi, iris is a three-class dataset. Is there a dataset in sklearn that is suitable for binary classification? Thanks. -- Regards, Peng From se.raschka at gmail.com Sun Dec 3 17:00:36 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 3 Dec 2017 17:00:36 -0500 Subject: [scikit-learn] a dataset suitable for logistic regression In-Reply-To: References: Message-ID: As far as I know, no. But you could simply truncate the iris dataset for binary classification, e.g., from sklearn import datasets iris = datasets.load_iris() X = iris.data[:100] y = iris.target[:100] Best, Sebastian > On Dec 3, 2017, at 3:54 PM, Peng Yu wrote: > > Hi, iris is a three-class dataset. Is there a dataset in sklearn that > is suitable for binary classification? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From francois.dion at gmail.com Mon Dec 4 08:02:09 2017 From: francois.dion at gmail.com (Francois Dion) Date: Mon, 4 Dec 2017 08:02:09 -0500 Subject: [scikit-learn] a dataset suitable for logistic regression In-Reply-To: References: Message-ID: There's at least one that is part of base.py in sklearn.datasets. from sklearn.datasets import load_breast_cancer load_breast_cancer? Signature: load_breast_cancer(return_X_y=False) Docstring: Load and return the breast cancer wisconsin dataset (classification). The breast cancer dataset is a classic and very easy binary classification dataset. ================= ============== Classes 2 Samples per class 212(M),357(B) Samples total 569 Dimensionality 30 Features real, positive ================= ============== It is a very small data set. If you need something much larger, you can easily create large (artificial) sets using make_classification. And you can augment that with faker or elizabeth (pypi modules, not part of scikit-learn) to create realistic looking data sets.
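For instance, a minimal sketch along those lines (the dataset sizes and parameter values below are only illustrative):

from sklearn.datasets import load_breast_cancer, make_classification

# Small, real-world binary dataset
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Larger artificial binary dataset; tune the sizes to your needs
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5,
                           n_classes=2, random_state=0)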
Francois about.me/francois.dion On Sun, Dec 3, 2017 at 3:54 PM, Peng Yu wrote: > Hi, iris is a three-class dataset. Is there a dataset in sklearn that > is suitable for binary classification? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From iacopo at lighton.io Mon Dec 4 08:09:09 2017 From: iacopo at lighton.io (Iacopo Poli) Date: Mon, 4 Dec 2017 14:09:09 +0100 Subject: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project Message-ID: Hello everyone, I'm working on a project that is implemented following quite strictly the scikit-learn API and I would like to use the scikit-learn Sphinx theme for the docs. I would do that only if I don't infringe any copyright or anything of the sort. What's your policy in this regard? Cheers, Iacopo -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Dec 4 07:37:49 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 4 Dec 2017 13:37:49 +0100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Maybe update your version of Cython? -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.hausamann at tum.de Mon Dec 4 09:21:12 2017 From: peter.hausamann at tum.de (Peter Hausamann) Date: Mon, 04 Dec 2017 14:21:12 +0000 Subject: [scikit-learn] Announcing sklearn-xarray Message-ID: Hi all, I'd like to announce *sklearn-xarray*, a new package that provides a scikit-learn interface for xarray users. For those not familiar with xarray (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays". The package makes it possible to apply sklearn estimators to xarray DataArrays and Datasets while keeping the labels (called coordinates in xarray) intact wherever possible. You can install the package via pip: pip install sklearn-xarray To get started, you can: - read the documentation: https://phausamann.github.io/sklearn-xarray and - check out the repository: https://github.com/phausamann/sklearn-xarray Note that the package is still in a very early development stage and there will probably be some major API changes in upcoming releases. Most notably, I'd like to replicate the complete sklearn module structure at some point by decorating all available estimators with the necessary wrappers. Feedback of any kind is appreciated. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Dec 4 09:47:51 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 4 Dec 2017 15:47:51 +0100 Subject: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project In-Reply-To: References: Message-ID: <20171204144751.GC2024654@phare.normalesup.org> You're not infringing copyright (this is BSD-licensed). The only thing is that we would like you to indicate clearly that the project is not scikit-learn, so that we don't receive support calls. For this, in addition to text pointing it out, you should use a different logo and a different icon in the browser's tab.
Cheers, Gaël On Mon, Dec 04, 2017 at 02:09:09PM +0100, Iacopo Poli wrote: > Hello everyone, > I'm working on a project that is implemented following quite strictly the > scikit-learn API and I would like to use the scikit-learn Sphinx theme for the > docs. > I would do that only if I don't infringe any copyright or anything of the sort. What's > your policy in this regard? > Cheers, > Iacopo > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From iacopo at lighton.io Mon Dec 4 10:08:09 2017 From: iacopo at lighton.io (Iacopo Poli) Date: Mon, 4 Dec 2017 16:08:09 +0100 Subject: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project In-Reply-To: <20171204144751.GC2024654@phare.normalesup.org> References: <20171204144751.GC2024654@phare.normalesup.org> Message-ID: Cool! Of course I will change the logo and icon :-) Thank you very much, Iacopo 2017-12-04 15:47 GMT+01:00 Gael Varoquaux : > You're not infringing copyright (this is BSD-licensed). The only thing is > that we would like you to indicate clearly that the project is not > scikit-learn, so that we don't receive support calls. For this, in > addition to text pointing it out, you should use a different logo and a > different icon in the browser's tab. > > Cheers, > > Gaël > > On Mon, Dec 04, 2017 at 02:09:09PM +0100, Iacopo Poli wrote: > > Hello everyone, > > > I'm working on a project that is implemented following quite strictly the > > scikit-learn API and I would like to use the scikit-learn Sphinx theme > for the > > docs. > > > I would do that only if I don't infringe any copyright or anything of the sort. > What's > > your policy in this regard? > > > Cheers, > > Iacopo > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Mon Dec 4 09:20:20 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Mon, 4 Dec 2017 19:50:20 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: I updated all the packages before running install. On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel wrote: > Maybe update your version of Cython? > > -- > Olivier > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Dec 4 10:03:12 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 4 Dec 2017 16:03:12 +0100 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: Interesting project!
BTW, do you know about dask-ml [1]? It might be interesting to think about generalizing the input validation of fit and predict / transform as a private method of the BaseEstimator class instead of directly calling into sklearn.utils.validation functions, so as to make it easier for third-party projects such as sklearn-xarray and dask-ml to subclass and override those methods to allow for specific input data structures without converting everything to a numpy array. [1] https://github.com/dask/dask-ml 2017-12-04 15:21 GMT+01:00 Peter Hausamann : > Hi all, > > I'd like to announce *sklearn-xarray*, a new package that provides a > scikit-learn interface for xarray users. For those not familiar with xarray > (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible > toolkit for analytics on multi-dimensional arrays". > > The package makes it possible to apply sklearn estimators to xarray > DataArrays and Datasets while keeping the labels (called coordinates in > xarray) intact wherever possible. > > You can install the package via pip: > > pip install sklearn-xarray > > To get started, you can: > > - read the documentation: https://phausamann.github.io/sklearn-xarray > and > - check out the repository: https://github.com/phausamann/sklearn-xarray > > Note that the package is still in a very early development stage and there > will probably be some major API changes in upcoming releases. Most notably, > I'd like to replicate the complete sklearn module structure at some point > by decorating all available estimators with the necessary wrappers. > > Feedback of any kind is appreciated. > > Peter > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Mon Dec 4 11:00:37 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Mon, 4 Dec 2017 10:00:37 -0600 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: I haven't looked at the implementation of `sklearn_xarray.dataarray.wrap` yet, but a simple test on `dask_ml.preprocessing.StandardScaler` failed with the (probably expected) `TypeError: 'int' object is not iterable` when dask-ml attempts an `X.mean(0)`. I'd be interested to hear what changes dask-ml would need to make to get things working on dask-backed xarray datasets, without reading everything into memory at once. The code: import sklearn_xarray.dataarray as da from sklearn_xarray.data import load_dummy_dataarray from dask_ml.preprocessing import StandardScaler X = load_dummy_dataarray() Xt = da.wrap(StandardScaler()).fit_transform(X) Tom On Mon, Dec 4, 2017 at 9:03 AM, Olivier Grisel wrote: > Interesting project! > > BTW, do you know about dask-ml [1]? > > It might be interesting to think about generalizing the input validation > of fit and predict / transform as a private method of the BaseEstimator > class instead of directly calling into sklearn.utils.validation functions, > so as to make it easier for third-party projects such as sklearn-xarray > and dask-ml to subclass and override those methods to allow for specific > input data structures without converting everything to a numpy array.
> > [1] https://github.com/dask/dask-ml > > > > 2017-12-04 15:21 GMT+01:00 Peter Hausamann : > >> Hi all, >> >> I'd like to announce *sklearn-xarray*, a new package that provides a >> scikit-learn interface for xarray users. For those not familiar with xarray >> (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible >> toolkit for analytics on multi-dimensional arrays". >> >> The package makes it possible to apply sklearn estimators to xarray >> DataArrays and Datasets while keeping the labels (called coordinates in >> xarray) intact whereever possible. >> >> You can install the package via pip: >> >> pip install sklearn-xarray >> >> To get started, you can: >> >> - read the documentation: https://phausamann.github.io/sklearn-xarray >> and >> - check out the repository: https://github.com >> /phausamann/sklearn-xarray >> >> Note that the package is still in a very early development stage and >> there will probably be some major API changes in upcoming releases. Most >> notably, I'd like to replicate the complete sklearn module structure at >> some point by decorating all available estimators with the necessary >> wrappers. >> >> Feedback of any kind is appreciated. >> >> Peter >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Olivier > http://twitter.com/ogrisel - http://github.com/ogrisel > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.hausamann at tum.de Mon Dec 4 11:25:16 2017 From: peter.hausamann at tum.de (Peter Hausamann) Date: Mon, 04 Dec 2017 16:25:16 +0000 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: Thanks everyone for your feedback. The reason you're getting the error is because the first argument of DataArray.mean() is the named dimension 'dim' and not 'axis'. So calling X.mean(axis=0) would probably solve the problem... but it might be easier (and more robust) to fix this on my end by always converting the data to a numpy array before passing it to the wrapped estimator. Regarding the question on how to avoid data being loaded into memory: I'm honestly not familiar enough with this subject to give you an answer just yet, but supporting too-big-for-memory datasets is definitely a feature that would be very important to me. Cheers Peter Tom Augspurger schrieb am Mo., 4. Dez. 2017 um 17:00 Uhr: > I haven't looked at the implementation of `sklearn_xarray.dataarray.wrap` > yet, but a simple test > on `dask_ml.preprocessing.StandardScaler` failed with the (probably > expected) `TypeError: 'int' object is not iterable` > when dask-ml attempts an `X.mean(0)`. > > I'd be interested to hear what changes dask-ml would need to make to get > things working on dask-back xarray datasets, > without reading everything into memory at once. > > The code: > > > import sklearn_xarray.dataarray as da > from sklearn_xarray.data import load_dummy_dataarray > from dask_ml.preprocessing import StandardScaler > > X = load_dummy_dataarray() > Xt = da.wrap(StandardScaler()).fit_transform(X) > > > Tom > > On Mon, Dec 4, 2017 at 9:03 AM, Olivier Grisel > wrote: > >> Interesting project! >> >> BTW, do you know about dask-ml [1]? 
>> >> It might be interesting to think about generalizing the input validation >> of fit and predict / transform as a private method of the BaseEstimator >> class instead of directly calling into sklearn.utils.validation functions >> so has to make it easier for third party projects such as sklearn-xarray >> and dask-ml to subclass and override those methods to allow for specific >> input data-structure without converting everyting to a numpy array. >> >> [1] https://github.com/dask/dask-ml >> >> >> >> 2017-12-04 15:21 GMT+01:00 Peter Hausamann : >> >>> Hi all, >>> >>> I'd like to announce *sklearn-xarray*, a new package that provides a >>> scikit-learn interface for xarray users. For those not familiar with xarray >>> (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible >>> toolkit for analytics on multi-dimensional arrays". >>> >>> The package makes it possible to apply sklearn estimators to xarray >>> DataArrays and Datasets while keeping the labels (called coordinates in >>> xarray) intact whereever possible. >>> >>> You can install the package via pip: >>> >>> pip install sklearn-xarray >>> >>> To get started, you can: >>> >>> - read the documentation: https://phausamann.github.io/sklearn-xarray >>> and >>> - check out the repository: >>> https://github.com/phausamann/sklearn-xarray >>> >>> Note that the package is still in a very early development stage and >>> there will probably be some major API changes in upcoming releases. Most >>> notably, I'd like to replicate the complete sklearn module structure at >>> some point by decorating all available estimators with the necessary >>> wrappers. >>> >>> Feedback of any kind is appreciated. >>> >>> Peter >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Olivier >> http://twitter.com/ogrisel - http://github.com/ogrisel >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.hausamann at tum.de Mon Dec 4 11:36:32 2017 From: peter.hausamann at tum.de (Peter Hausamann) Date: Mon, 04 Dec 2017 16:36:32 +0000 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: PS: obviously forcing conversion to numpy is not what we would want, rather passing the underlying array of the DataArray. Peter Hausamann schrieb am Mo., 4. Dez. 2017 um 17:25 Uhr: > Thanks everyone for your feedback. > > The reason you're getting the error is because the first argument of > DataArray.mean() is the named dimension 'dim' and not 'axis'. So calling > X.mean(axis=0) would probably solve the problem... but it might be easier > (and more robust) to fix this on my end by always converting the data to a > numpy array before passing it to the wrapped estimator. > > Regarding the question on how to avoid data being loaded into memory: I'm > honestly not familiar enough with this subject to give you an answer just > yet, but supporting too-big-for-memory datasets is definitely a feature > that would be very important to me. > > Cheers > > Peter > > > Tom Augspurger schrieb am Mo., 4. Dez. 
2017 > um 17:00 Uhr: > >> I haven't looked at the implementation of `sklearn_xarray.dataarray.wrap` >> yet, but a simple test >> on `dask_ml.preprocessing.StandardScaler` failed with the (probably >> expected) `TypeError: 'int' object is not iterable` >> when dask-ml attempts an `X.mean(0)`. >> >> I'd be interested to hear what changes dask-ml would need to make to get >> things working on dask-back xarray datasets, >> without reading everything into memory at once. >> >> The code: >> >> >> import sklearn_xarray.dataarray as da >> from sklearn_xarray.data import load_dummy_dataarray >> from dask_ml.preprocessing import StandardScaler >> >> X = load_dummy_dataarray() >> Xt = da.wrap(StandardScaler()).fit_transform(X) >> >> >> Tom >> >> On Mon, Dec 4, 2017 at 9:03 AM, Olivier Grisel >> wrote: >> >>> Interesting project! >>> >>> BTW, do you know about dask-ml [1]? >>> >>> It might be interesting to think about generalizing the input validation >>> of fit and predict / transform as a private method of the BaseEstimator >>> class instead of directly calling into sklearn.utils.validation functions >>> so has to make it easier for third party projects such as sklearn-xarray >>> and dask-ml to subclass and override those methods to allow for specific >>> input data-structure without converting everyting to a numpy array. >>> >>> [1] https://github.com/dask/dask-ml >>> >>> >>> >>> 2017-12-04 15:21 GMT+01:00 Peter Hausamann : >>> >>>> Hi all, >>>> >>>> I'd like to announce *sklearn-xarray*, a new package that provides a >>>> scikit-learn interface for xarray users. For those not familiar with xarray >>>> (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible >>>> toolkit for analytics on multi-dimensional arrays". >>>> >>>> The package makes it possible to apply sklearn estimators to xarray >>>> DataArrays and Datasets while keeping the labels (called coordinates in >>>> xarray) intact whereever possible. >>>> >>>> You can install the package via pip: >>>> >>>> pip install sklearn-xarray >>>> >>>> To get started, you can: >>>> >>>> - read the documentation: >>>> https://phausamann.github.io/sklearn-xarray and >>>> - check out the repository: >>>> https://github.com/phausamann/sklearn-xarray >>>> >>>> Note that the package is still in a very early development stage and >>>> there will probably be some major API changes in upcoming releases. Most >>>> notably, I'd like to replicate the complete sklearn module structure at >>>> some point by decorating all available estimators with the necessary >>>> wrappers. >>>> >>>> Feedback of any kind is appreciated. >>>> >>>> Peter >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Olivier >>> http://twitter.com/ogrisel - http://github.com/ogrisel >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... 
From albertthomas88 at gmail.com Mon Dec 4 12:16:29 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Mon, 04 Dec 2017 17:16:29 +0000 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Maybe run 'make clean' before running pip install ... Albert On Mon 4 Dec 2017 at 16:11, Aniket Meshram wrote: > I updated all the packages before running install. > > On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel > wrote: > >> Maybe update your version of Cython? >> >> -- >> Olivier >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Regards, > > Aniket G. Meshram > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From l.lomasto at innovationengineering.eu Mon Dec 4 13:30:55 2017 From: l.lomasto at innovationengineering.eu (Luigi Lomasto) Date: Mon, 4 Dec 2017 19:30:55 +0100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: You can try to use Python 3 with pip3. Sent from my iPhone > On 4 Dec 2017, at 18:16, Albert Thomas wrote: > > Maybe run 'make clean' before running pip install ... > > Albert >> On Mon 4 Dec 2017 at 16:11, Aniket Meshram wrote: >> I updated all the packages before running install. >> >>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel wrote: >>> Maybe update your version of Cython? >>> >>> -- >>> Olivier >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Dec 4 14:15:52 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 4 Dec 2017 14:15:52 -0500 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Please stay on the mailing list. That's not the current version. Please try updating as Olivier suggested. On 12/04/2017 01:52 PM, Aniket Meshram wrote: > $ cython --version > Cython version 0.23.4 > > On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller > wrote: > > What version of Cython are you using? > > > > On 12/04/2017 09:20 AM, Aniket Meshram wrote: >> I updated all the packages before running install. >> >> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel >> > wrote: >> >> Maybe update your version of Cython? >> >> -- >> Olivier >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> -- >> Regards, >> >> Aniket G.
Meshram >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > > -- > Regards, > > Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Tue Dec 5 02:28:05 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Tue, 5 Dec 2017 12:58:05 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the latest. https://packages.ubuntu.com/search?keywords=cython But yes, you are right, I checked on official Cython and I'll install the latest using PyPI. Thought Ubuntu gives the latest, but that isn't true anymore. Thanks Andreas. I'll let you guys know, once I update and rerun pip install ... Thanks On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller wrote: > Please stay on the mailing list. > That's not the current version. Please try updating as Olivier suggested. > > > On 12/04/2017 01:52 PM, Aniket Meshram wrote: > > $ cython --version > Cython version 0.23.4 > > On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller wrote: > >> What version of Cython are you using? >> >> >> >> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >> >> I updated all the packages before running install. >> >> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel >> wrote: >> >>> Maybe update your version of Cython? >>> >>> -- >>> Olivier >>> ? >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> > > > -- > Regards, > > Aniket G. Meshram > > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Tue Dec 5 12:01:46 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Tue, 5 Dec 2017 22:31:46 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Yeah. That did it. After updating Cython to latest 0.27.3, the issue is resolved now. Thanks all. I guess this should also be updated on the site / github as well. What'd you say? Best, Aniket On Tue, Dec 5, 2017 at 12:58 PM, Aniket Meshram wrote: > I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the > latest. > > > https://packages.ubuntu.com/search?keywords=cython > > > But yes, you are right, I checked on official Cython and I'll install the > latest using PyPI. Thought Ubuntu gives the latest, but that isn't true > anymore. > Thanks Andreas. > > I'll let you guys know, once I update and rerun pip install ... > Thanks > > On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller wrote: > >> Please stay on the mailing list. >> That's not the current version. Please try updating as Olivier suggested. >> >> >> On 12/04/2017 01:52 PM, Aniket Meshram wrote: >> >> $ cython --version >> Cython version 0.23.4 >> >> On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller >> wrote: >> >>> What version of Cython are you using? 
>>> >>> >>> >>> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >>> >>> I updated all the packages before running install. >>> >>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel >> > wrote: >>> >>>> Maybe update your version of Cython? >>>> >>>> -- >>>> Olivier >>>> ? >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Regards, >>> >>> Aniket G. Meshram >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> >> >> > > > -- > Regards, > > Aniket G. Meshram > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Dec 5 19:12:49 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 6 Dec 2017 11:12:49 +1100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: A PR is welcome if you can improve documentation. Thanks On 6 December 2017 at 04:01, Aniket Meshram wrote: > Yeah. That did it. After updating Cython to latest 0.27.3, the issue is > resolved now. > Thanks all. I guess this should also be updated on the site / github as > well. What'd you say? > > Best, > Aniket > > On Tue, Dec 5, 2017 at 12:58 PM, Aniket Meshram < > aniket.g.meshram at gmail.com> wrote: > >> I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the >> latest. >> >> >> https://packages.ubuntu.com/search?keywords=cython >> >> >> But yes, you are right, I checked on official Cython and I'll install the >> latest using PyPI. Thought Ubuntu gives the latest, but that isn't true >> anymore. >> Thanks Andreas. >> >> I'll let you guys know, once I update and rerun pip install ... >> Thanks >> >> On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller >> wrote: >> >>> Please stay on the mailing list. >>> That's not the current version. Please try updating as Olivier suggested. >>> >>> >>> On 12/04/2017 01:52 PM, Aniket Meshram wrote: >>> >>> $ cython --version >>> Cython version 0.23.4 >>> >>> On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller >>> wrote: >>> >>>> What version of Cython are you using? >>>> >>>> >>>> >>>> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >>>> >>>> I updated all the packages before running install. >>>> >>>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel < >>>> olivier.grisel at ensta.org> wrote: >>>> >>>>> Maybe update your version of Cython? >>>>> >>>>> -- >>>>> Olivier >>>>> ? >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> >>>> Aniket G. Meshram >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> >>> >>> >>> -- >>> Regards, >>> >>> Aniket G. Meshram >>> >>> >>> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> > > > > -- > Regards, > > Aniket G. 
Meshram > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fabio.sigrist at hslu.ch Wed Dec 6 07:15:28 2017 From: fabio.sigrist at hslu.ch (Fabio Sigrist) Date: Wed, 6 Dec 2017 13:15:28 +0100 Subject: [scikit-learn] Add Grabit model to gradient boosting Message-ID: Dear all, I added the Tobit loss function to gradient boosting, see https://github.com/scikit-learn/scikit-learn/pull/9961. Recently, I also added a reference to a preprint of an article with documentation on the methodology (https://arxiv.org/abs/1711.08695). What are the next steps in order to decide whether this feature will be added to sklearn? Thanks a lot in advance. Best regards, Fabio Sigrist *Lucerne University of Applied Sciences and Arts* Institute of Financial Services Zug IFZ Grafenauweg 10, CH-6300 Zug *Fabio Sigrist, PhD *Lecturer T +41 41 757 67 61 fabio.sigrist at hslu.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Wed Dec 6 13:30:54 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Thu, 7 Dec 2017 00:00:54 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Alright, I'll make a pull request. But let me tell you guys, I'm totally new to GitHub; this is my first contribution, and until a few days back I didn't even know what a pull request was. What I mean is, even though I make a request, it'll take time for me to understand this whole process of changing something and having it reflected in the master branch. Meanwhile, I'm doing my homework on this; any suggestions would really be appreciated. Thanks, Aniket On Wed, Dec 6, 2017 at 5:42 AM, Joel Nothman wrote: > A PR is welcome if you can improve documentation. Thanks > > On 6 December 2017 at 04:01, Aniket Meshram > wrote: > >> Yeah. That did it. After updating Cython to latest 0.27.3, the issue is >> resolved now. >> Thanks all. I guess this should also be updated on the site / github as >> well. What'd you say? >> >> Best, >> Aniket >> >> On Tue, Dec 5, 2017 at 12:58 PM, Aniket Meshram < >> aniket.g.meshram at gmail.com> wrote: >> >>> I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the >>> latest. >>> >>> >>> https://packages.ubuntu.com/search?keywords=cython >>> >>> >>> But yes, you are right, I checked on official Cython and I'll install >>> the latest using PyPI. Thought Ubuntu gives the latest, but that isn't true >>> anymore. >>> Thanks Andreas. >>> >>> I'll let you guys know, once I update and rerun pip install ... >>> Thanks >>> >>> On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller >>> wrote: >>> >>>> Please stay on the mailing list. >>>> That's not the current version. Please try updating as Olivier >>>> suggested. >>>> >>>> >>>> On 12/04/2017 01:52 PM, Aniket Meshram wrote: >>>> >>>> $ cython --version >>>> Cython version 0.23.4 >>>> >>>> On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller >>>> wrote: >>>> >>>>> What version of Cython are you using? >>>>> >>>>> >>>>> >>>>> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >>>>> >>>>> I updated all the packages before running install. >>>>> >>>>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel < >>>>> olivier.grisel at ensta.org> wrote: >>>>> >>>>>> Maybe update your version of Cython? >>>>>> >>>>>> -- >>>>>> Olivier
>>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Regards, >>>>> >>>>> Aniket G. Meshram >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> >>>> Aniket G. Meshram >>>> >>>> >>> >>> >>> -- >>> Regards, >>> >>> Aniket G. Meshram >>> >> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Dec 6 15:02:11 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 7 Dec 2017 07:02:11 +1100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: We're biased, but we reckon the skills to make a PR are (a) not insurmountable with a bit of homework; and (b) very worthwhile to have. So try to pick it up by yourself, but give us a shout if you're struggling. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Wed Dec 6 18:49:42 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 7 Dec 2017 00:49:42 +0100 Subject: [scikit-learn] MLPClassifier as a feature selector Message-ID: Greetings, I want to train an MLPClassifier with one hidden layer and use it as a feature selector for an MLPRegressor. Is it possible to get the values of the neurons from the last hidden layer of the MLPClassifier to pass them as input to the MLPRegressor? If it is not possible with scikit-learn, is anyone aware of any scikit-compatible NN library that offers this functionality? For example this one: http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html I wouldn't like to do this in TensorFlow because the MLP there is much slower than scikit-learn's implementation. Thomas -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Wed Dec 6 19:56:14 2017 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 7 Dec 2017 09:56:14 +0900 Subject: [scikit-learn] MLPClassifier as a feature selector In-Reply-To: References: Message-ID: I am also very interested in knowing if there is a sklearn cookbook solution for getting the weights of a one-hidden-layer MLPClassifier. J.B. 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > Greetings, > > I want to train an MLPClassifier with one hidden layer and use it as a > feature selector for an MLPRegressor.
> Is it possible to get the values of the neurons from the last hidden layer > of the MLPClassifier to pass them as input to the MLPRegressor? > > If it is not possible with scikit-learn, is anyone aware of any > scikit-compatible NN library that offers this functionality? For example > this one: > > http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html > > I wouldn't like to do this in TensorFlow because the MLP there is much > slower than scikit-learn's implementation. > > > Thomas > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manojkumarsivaraj334 at gmail.com Wed Dec 6 22:25:41 2017 From: manojkumarsivaraj334 at gmail.com (Manoj Kumar) Date: Wed, 6 Dec 2017 19:25:41 -0800 Subject: [scikit-learn] MLPClassifier as a feature selector In-Reply-To: References: Message-ID: Hi, The weights and intercepts are available in the coefs_ and intercepts_ attributes, respectively. See https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835 On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn < scikit-learn at python.org> wrote: > I am also very interested in knowing if there is a sklearn cookbook > solution for getting the weights of a one-hidden-layer MLPClassifier. > J.B. > > 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > >> Greetings, >> >> I want to train an MLPClassifier with one hidden layer and use it as a >> feature selector for an MLPRegressor. >> Is it possible to get the values of the neurons from the last hidden >> layer of the MLPClassifier to pass them as input to the MLPRegressor? >> >> If it is not possible with scikit-learn, is anyone aware of any >> scikit-compatible NN library that offers this functionality? For example >> this one: >> >> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html >> >> I wouldn't like to do this in TensorFlow because the MLP there is much >> slower than scikit-learn's implementation. >> >> >> Thomas >> >> >> -- >> >> ====================================================================== >> >> Dr Thomas Evangelidis >> >> Post-doctoral Researcher >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/2S049, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Manoj, http://github.com/MechCoder
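Putting Manoj's pointer together with the original question, here is a minimal sketch of recomputing the hidden-layer activations from coefs_ and intercepts_ and feeding them to an MLPRegressor. It assumes a single hidden layer and the default activation='relu'; the data is synthetic and purely illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Toy data; substitute your own features, class labels and regression targets.
X, y_class = make_classification(n_samples=200, n_features=10, random_state=0)
y_reg = X[:, 0] + 0.1 * np.random.RandomState(0).randn(200)

clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
clf.fit(X, y_class)

# Hidden-layer output: relu(X W + b) with the fitted weights and biases.
# Swap in np.tanh etc. if you train with a different activation.
hidden = np.maximum(0, np.dot(X, clf.coefs_[0]) + clf.intercepts_[0])

reg = MLPRegressor(hidden_layer_sizes=(20,), max_iter=1000, random_state=0)
reg.fit(hidden, y_reg)

For deeper networks, the same recurrence can be applied layer by layer over the coefs_ and intercepts_ lists.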
From aniket.g.meshram at gmail.com  Sat Dec  9 04:33:53 2017
From: aniket.g.meshram at gmail.com (Aniket Meshram)
Date: Sat, 9 Dec 2017 15:03:53 +0530
Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace'

hi All,

I've created a pull request for updating the README.rst.
https://github.com/scikit-learn/scikit-learn/pull/10276

Thanks,
Aniket

-- 
Regards,

Aniket G. Meshram
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dmitrii.ignatov at gmail.com  Sun Dec 10 06:33:59 2017
From: dmitrii.ignatov at gmail.com (Dmitry Ignatov)
Date: Sun, 10 Dec 2017 14:33:59 +0300
Subject: [scikit-learn] Grid search for multi-label task

Hi All,

I've tried GridSearchCV with RandomForestClassifier()

clf = GridSearchCV(RandomForestClassifier(), tuned_parameters, cv=5,
                   scoring='accuracy')

for a multi-label problem where the output is a list of lists of 20 zeros
or ones:

[[1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0],
 [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
 [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0],
 [0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1], ...

Even though it worked correctly in the usual way with

clf = RandomForestClassifier(max_depth=7, random_state=0)
clf.fit(Xtr, y)

with GridSearchCV I get the error below:

83 # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of
multiclass-multioutput and multilabel-indicator targets

Is it possible to perform GridSearchCV in scikit for the multilabel
setting (with an appropriate metric like averaged zero-one loss)? Any
hints?

Thank you and best regards,
Dmitry
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com  Sun Dec 10 15:09:19 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 11 Dec 2017 07:09:19 +1100
Subject: [scikit-learn] Grid search for multi-label task

For legacy reasons, multilabel targets need to be passed as an array (or a
sparse matrix if supported by the classifier). Lists of lists are not
supported, but may be in the near future.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dmitrii.ignatov at gmail.com  Sun Dec 10 15:46:38 2017
From: dmitrii.ignatov at gmail.com (Dmitry Ignatov)
Date: Sun, 10 Dec 2017 23:46:38 +0300
Subject: [scikit-learn] Grid search for multi-label task

Joel, thank you. It helps. One step forward.

-Dmitry

2017-12-10 23:09 GMT+03:00 Joel Nothman:
> For legacy reasons, multilabel targets need to be passed as an array (or a
> sparse matrix if supported by the classifier). Lists of lists are not
> supported, but may be in the near future.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
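A minimal sketch of the fix Joel describes, assuming X and the list of
20-element 0/1 lists (y_lists) from Dmitry's post are already defined; the
parameter grid is an illustrative assumption. Converting the lists to an
array makes the target a multilabel-indicator matrix, and 'accuracy' then
scores subset accuracy.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

Y = np.asarray(y_lists)  # shape (n_samples, 20): an indicator array, not a list of lists
tuned_parameters = {'max_depth': [5, 7, 9]}  # illustrative grid
clf = GridSearchCV(RandomForestClassifier(random_state=0),
                   tuned_parameters, cv=5, scoring='accuracy')
clf.fit(X, Y)
print(clf.best_params_)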
From t3kcit at gmail.com  Wed Dec 13 11:40:03 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 13 Dec 2017 11:40:03 -0500
Subject: [scikit-learn] SciPy 2018 tutorial
Message-ID: <3c3f6d7c-c845-5866-609a-42e21d62fdf5@gmail.com>

Hey folks.
Who is coming to SciPy 2018? They just sent out the CfP.
Does anyone want to co-teach a tutorial?
(If there are two other people that want to teach it, I'm also happy to
step back this year ;)

Andy

From joel.nothman at gmail.com  Wed Dec 13 19:38:42 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 14 Dec 2017 11:38:42 +1100
Subject: [scikit-learn] FYI: StratifiedKFold(..., shuffle=True) differs in 0.19

It has come to our attention in #10274 that we accidentally changed
shuffled StratifiedKFold behaviour in the 0.19.0 release from what had
come before. That is, for the same random state, you will get a different
cross-validation data partition. This change (merged in #7823) was not
documented in the 0.19 release notes. We will update the online docs to
mention it.

The change provided negligible benefit for users. The change shouldn't
have happened, but we likely won't revert it unless the community has a
strongly divergent opinion.

Cheers,

Joel and Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From iacopo at lighton.io  Fri Dec 15 10:32:58 2017
From: iacopo at lighton.io (Iacopo Poli)
Date: Fri, 15 Dec 2017 16:32:58 +0100
Subject: [scikit-learn] License for a package built on top of scikit-learn

Hello,

we've built a Python package for some ML-related application.
Scikit-learn is a requirement, and we have classes inheriting from sklearn
objects.

We have to decide the license, and we are choosing between Apache 2.0 and
BSD-3.

We would go with Apache 2.0, but we were wondering if we have to release
it under the same license as sklearn. It doesn't seem so from reading the
text of BSD-3, but asking before never hurts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com  Fri Dec 15 11:53:07 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 15 Dec 2017 11:53:07 -0500
Subject: [scikit-learn] License for a package built on top of scikit-learn
Message-ID: <6436ea2e-67d9-3971-7b5a-9aaca95ab75c@gmail.com>

Hi Iacopo.
Yes, you can do either (in my understanding).
If you just import sklearn, there's really nothing you need to worry about.
If you distribute sklearn, you should also distribute the BSD license file
with it and make it clear that that's the license that applies to that
part of the code.

Cheers,
Andy

On 12/15/2017 10:32 AM, Iacopo Poli wrote:
> We would go with Apache 2.0, but we were wondering if we have to
> release it under the same license as sklearn. It doesn't seem so from
> reading the text of BSD-3, but asking before never hurts.
> Thanks in advance,
> Iacopo Poli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Sat Dec 16 06:48:57 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Sat, 16 Dec 2017 12:48:57 +0100
Subject: [scikit-learn] SciPy 2018 tutorial
Message-ID: <20171216114857.5128271.71775.45130@gmail.com>

Hey Andy,

I'd be interested in coming to SciPy and co-teaching a tutorial.

Guillaume Lemaitre
INRIA Saclay Ile-de-France / Equipe PARIETAL
guillaume.lemaitre at inria.fr - https://glemaitre.github.io/

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From tevang3 at gmail.com  Mon Dec 18 09:19:13 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 18 Dec 2017 15:19:13 +0100
Subject: [scikit-learn] data augmentation following the underlying feature values distributions and correlations

Greetings,

I want to augment my training set but preserve at the same time the
correlations between feature values. More specifically, my features are
NMR resonances of the nuclei of a single amino acid. For example, for
Glutamic acid I have for each observation the following feature values:

[CA, HA, CB, HB, CG, HG]

where CA is the resonance of the alpha carbon, HA the resonance of the
alpha proton, and so forth. The complication here is that these feature
values are not independent. HA is covalently bonded to CA, CB to CA, and
so on. Therefore, if I sample a random CA value from the distribution of
experimental values of CA, I cannot pick ANY HA VALUE from the respective
experimental distribution, simply because CA and HA are correlated. The
same applies to CA and CB, CB and HB, CB and CG, and CG and HG. Is there
any algorithm that can generate [CA, HA, CB, HB, CG, HG] feature vectors
that comply with the atom distributions and their correlations?

I saw that Gaussian Mixture Models have a function to generate random
samples from the fitted Gaussian distribution
(sklearn.mixture.GaussianMixture.sample), but it is not clear if these
samples will retain the correlations between the features (nuclei in this
case). If there is no such algorithm in scikit-learn, could you please
point me to any other Python library which does that?

Thanks in advance.

Thomas

-- 
======================================================================
Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049, 62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
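On Thomas's question: with covariance_type='full', each fitted component
carries a full 6x6 covariance matrix, so GaussianMixture.sample does draw
vectors that reproduce the linear correlations between the nuclei within
each component. A minimal sketch, assuming X is the (n_observations, 6)
array of [CA, HA, CB, HB, CG, HG] values; n_components=3 is an arbitrary
assumption to be tuned, e.g. by BIC.

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

# sample() returns the generated vectors and the component each was drawn from
X_new, components = gmm.sample(n_samples=500)
print(gmm.bic(X))  # compare across n_components to pick the mixture size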
From l.lomasto at innovationengineering.eu  Tue Dec 19 03:36:42 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Tue, 19 Dec 2017 09:36:42 +0100
Subject: [scikit-learn] Feature selection with words.

Hi all.

I'm working on text classification of Wikipedia documents. I'm using a
word-count approach to extract features from my text, so after
lemmatization and stop-word removal I obtain a big vocabulary containing
all the words of the training documents. I now have 70000 features. I
think that for this kind of (word-based) problem it is not good to do
feature selection (with SVD or PCA). The current accuracy is 77%.

Do you think I need to do feature selection to improve the accuracy?

Thank you for your answers. Regards,

Luigi

From joel.nothman at gmail.com  Tue Dec 19 04:54:10 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 19 Dec 2017 20:54:10 +1100
Subject: [scikit-learn] Feature selection with words.

It depends what the set of classes is. The best way to find out is to try
it...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com  Tue Dec 19 07:44:54 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 19 Dec 2017 13:44:54 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Dear all,

Kudos to scikit-learn! Having said that, Pipeline is killing me by not
being able to transform anything other than X.

My current use case would need:
- Transformers being able to handle both X and y, e.g. clustering X and y
  concatenated;
- Pipeline being able to change other params, e.g. sample_weight.

Currently, I'm augmenting X through every step with the extra information,
which seems to work OK for my_pipe.fit_transform(X_train, y_train) but
breaks on my_pipe.transform(X_test) for the lack of the y parameter. OK, I
can inherit and modify a descendant of the Pipeline class to allow the y
parameter, which is not ideal but I guess it is an option. The gritty part
comes when having to adapt every regressor at the end of the ladder in
order to split the extra information from the raw data in X, and not being
able to generate more than one subproduct from each preprocessing step.

My current research involves clustering the data and using that
classification along with X in order to predict outliers, which generates
sample_weight info, and I would love to use that on the final regressor.
Currently there seems to be no option other than pasting that info on X.

All in all, I'm stuck with this API limitation and I would love to learn
some tricks from you if you could enlighten me.

Thanks in advance!
Manuel Castejón Limas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ichkoar at gmail.com  Tue Dec 19 08:15:12 2017
From: ichkoar at gmail.com (Christos Aridas)
Date: Tue, 19 Dec 2017 15:15:12 +0200
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Hey Manuel,

In imbalanced-learn we have an extra type of estimator, named samplers,
which are able to modify X and y at the same time, with the use of new API
methods, sample and fit_sample. Also, we have adopted a modified version
of scikit-learn's Pipeline class where we allow subsequent transformations
using samplers and transformers. Despite the fact that the package deals
with imbalanced datasets, the aforementioned objects may help your
pipeline.

Cheerz,
Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Tue Dec 19 08:18:32 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 19 Dec 2017 14:18:32 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

I think that you could use imbalanced-learn regarding the issue that you
have with the y. You should be able to wrap your clustering inside the
FunctionSampler
(https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we
are on the way to merge it).

-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From manuel.castejon at gmail.com  Tue Dec 19 08:33:42 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 19 Dec 2017 14:33:42 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Wow, that seems promising. I'll read the imbalanced-learn code with
interest.

Thanks for the info!
Manuel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com  Tue Dec 19 08:34:49 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 19 Dec 2017 14:34:49 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Eager to learn! Diving into the code right now!

Thanks for the tip!
Manuel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
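A minimal sketch of the sampler API Christos and Guillaume mention, with
RandomUnderSampler standing in for a custom sampler (the FunctionSampler
wrapper was still an open pull request at the time). It assumes
imbalanced-learn is installed and X, y are defined.

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('sampler', RandomUnderSampler(random_state=0)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)   # the sampler rewrites both X and y, but only during fit
pipe.predict(X)  # sampling steps are skipped automatically at predict time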
From ranjanagirish30 at gmail.com  Tue Dec 19 09:38:12 2017
From: ranjanagirish30 at gmail.com (Ranjana Girish)
Date: Tue, 19 Dec 2017 20:08:12 +0530
Subject: [scikit-learn] Text classification of large dataset

Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using:

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)
random.seed(20000)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")
dataset = pd.concat([trainset1, trainset2])
dataset = dataset.dropna()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))

I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified
sampling, and the training set was reduced to 2.3 million. I was still
unable to train on the 2.3 million documents: I got a memory error when I
used random forest (n_estimators=30), Naive Bayes and SVM.

I am stuck. Can anyone please tell me whether there is any memory leak in
my code, and how to use a system with 128 GB RAM effectively?

Thanks
Ranjana
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From johnmarktaylor at g.harvard.edu  Tue Dec 19 16:27:53 2017
From: johnmarktaylor at g.harvard.edu (Taylor, Johnmark)
Date: Tue, 19 Dec 2017 16:27:53 -0500
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Hello,

I am a researcher in fMRI and am using SVMs to analyze brain data. I am
doing decoding between two classes, each of which has 24 exemplars per
class. I am comparing two different methods of cross-validation for my
data: in one, I am training on 23 exemplars from each class and testing on
the remaining example from each class; in the other, I am training on 22
exemplars from each class and testing on the remaining two from each class
(in case it matters, the data is structured into different neuroimaging
"runs", with each "run" containing several "blocks"; the first
cross-validation method is leaving out one block at a time, the second is
leaving out one run at a time).

Now, I would have thought that these two CV methods would be very similar,
since the vast majority of the training data is the same; the only
difference is in adding two additional points. However, they are yielding
very different results: training on 23 per class yields 60% decoding
accuracy (averaged across several subjects, and statistically
significantly greater than chance); training on 22 per class yields chance
(50%) decoding. Leaving aside the particulars of fMRI in this case: is it
unusual for single points (amounting to less than 5% of the data) to have
such a big influence on SVM decoding? I am using a cost parameter of C=1.
I must say it is counterintuitive to me that just a couple of points out
of two dozen could make such a big difference.

Thank you very much, and cheers,

JohnMark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jakevdp at cs.washington.edu  Tue Dec 19 16:37:35 2017
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Tue, 19 Dec 2017 13:37:35 -0800
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Hi JohnMark,
SVMs, by design, are quite sensitive to the addition of single data points
- but only if those data points happen to lie near the margin. I wrote
about some of those types of details here:
https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

Hope that helps,
Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From l.lomasto at innovationengineering.eu  Tue Dec 19 17:07:57 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Tue, 19 Dec 2017 23:07:57 +0100
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?
Message-ID: <7B51C250-F836-4C5A-89EE-61C756093180@innovationengineering.eu>

Hi, you can try CV with a k-fold partition, so that you see all
training/test combinations (generally 90%/10% or 80%/20%). If you get very
different results across folds, you are probably overfitting.

Sent from my iPhone
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jeff1evesque at yahoo.com  Tue Dec 19 16:56:40 2017
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Tue, 19 Dec 2017 16:56:40 -0500
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Hi guys,
I'm currently developing a web interface and programmatic REST API for
sklearn. I currently have SVM and SVR available, with some parameters like
C and gamma exposed:

- https://github.com/jeff1evesque/machine-learning

I'm working a bit on improving the web interface at the moment. Since
you're working with SVMs, maybe you'd have time to try my project and give
me some feedback? I hope to expand the toolset to things like ensemble
learning and, as a long shot, neural networks. But this may take some
time.

Thank you,

Jeff Levesque
https://github.com/jeff1evesque
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
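A small experiment illustrating Jake's point that only points near the
margin matter: removing a non-support vector leaves the fitted SVM
essentially unchanged. The blob data is an illustrative assumption sized
like the thread's problem (48 points, 2 classes).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=48, centers=2, cluster_std=2.0, random_state=0)
svm = SVC(kernel='linear', C=1).fit(X, y)

# Drop one point that is *not* a support vector and refit
non_sv = np.setdiff1d(np.arange(len(X)), svm.support_)[0]
mask = np.arange(len(X)) != non_sv
svm2 = SVC(kernel='linear', C=1).fit(X[mask], y[mask])

# The separating hyperplane is unchanged up to solver tolerance;
# dropping a support vector instead can move it substantially.
print(np.abs(svm.coef_ - svm2.coef_).max())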
From gael.varoquaux at normalesup.org  Tue Dec 19 16:35:26 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 19 Dec 2017 22:35:26 +0100
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?
Message-ID: <20171219213525.GE360768@phare.normalesup.org>

With as few data points, there is a huge uncertainty in the estimation of
the prediction accuracy with cross-validation. This isn't a problem of the
method; it is a basic limitation of the small amount of data. I've written
a paper on this problem in the specific context of neuroimaging:
https://www.sciencedirect.com/science/article/pii/S1053811917305311
(preprint: https://hal.inria.fr/hal-01545002/).

I expect that what you are seeing is sampling noise: the result has
confidence intervals larger than 10%.

Gaël

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux

From joel.nothman at gmail.com  Tue Dec 19 19:09:37 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 20 Dec 2017 11:09:37 +1100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

At a glance, and perhaps not knowing imbalanced-learn well enough, I have
some doubts that it will provide an immediate solution for all your needs.
At the end of the day, the Pipeline keeps its scope relatively tight, but
it should not be so hard to implement something for your own needs if your
case does not fit what Pipeline supports.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From manuel.castejon at gmail.com  Wed Dec 20 10:33:19 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Wed, 20 Dec 2017 16:33:19 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Thank you all for your interest!

In order to clarify the case, allow me to try to synthesize the spirit of
what I'd like to put into the pipeline using this sequence of steps:

#%%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

np.random.seed(seed=42)

"""
Data preparation
"""
URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
data = pd.read_csv(URL, usecols=['V1', 'V2'])
X, y = data[['V1']], data[['V2']]

(data_train, data_test,
 X_train, X_test,
 y_train, y_test) = train_test_split(data, X, y)

"""
Parameters setup
"""
dbscan__eps = 0.06

mclust__n_components = 3

paella__noise_label = -1
paella__max_it = 20
paella__regular_size = 400
paella__minimum_size = 100
paella__width_r = 0.99
paella__n_neighbors = 5
paella__power = 30
paella__random_state = None

#%%
"""
DBSCAN clustering to detect noise suspects (label == -1)
"""
dbscan_input = data_train
dbscan_clustering = DBSCAN(eps=dbscan__eps)
dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
            c=np.int64(dbscan_output == -1))

#%%
"""
GaussianMixture fitted with the filtered data_train in order to help
locate the ellipsoids, but predict is applied to the whole data_train set.
"""
mclust_input = data_train[dbscan_output != -1]
mclust_clustering = GaussianMixture(n_components=mclust__n_components)
mclust_clustering.fit(mclust_input)
mclust_output = mclust_clustering.predict(data_train)
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
            c=mclust_output)

#%%
"""
mclust and dbscan results are combined.
"""
clustering_output = mclust_output.copy()
clustering_output[dbscan_output == -1] = -1
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
            c=clustering_output)

#%%
"""
The good old Paella paper:
https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c

The Paella algorithm calculates sample_weight to be used by the final-step
regressor (yes, it is an outlier detection algorithm, but we are focusing
now on this interesting collateral result). I am currently aggressively
changing the code in order to make it fit somehow with the pipeline.
"""
from paella import Paella

paella_input = pd.concat([data, clustering_output], axis=1)
paella_run = Paella(noise_label=paella__noise_label,
                    max_it=paella__max_it,
                    regular_size=paella__regular_size,
                    minimum_size=paella__minimum_size,
                    width_r=paella__width_r,
                    n_neighbors=paella__n_neighbors,
                    power=paella__power,
                    random_state=paella__random_state)
paella_output = paella_run.fit_predict(paella_input, y_train)
# paella_output is a vector with sample_weight

#%%
"""
Here we fit a regressor using sample_weight=paella_output
"""
from sklearn.linear_model import LinearRegression

regressor_input = X_train
lm = LinearRegression()
lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
regressor_output = lm.predict(X_train)

#...

In this example we can see that:
- A particular step might need results produced not necessarily by the
  immediately previous step.
- The X parameter is not sequentially transformed; sometimes we might need
  to skip back to a previous step.
- y is sometimes the target and sometimes not. For the regressor it is
  indeed, but for the Paella algorithm the prediction is expressed as a
  vector representing sample_weights.

All in all, the conclusion is that the chain of processes is not as linear
as imposed by the current API. I guess that all these difficulties could
be solved by:
- passing a dictionary through the different steps containing the partial
  results that the following steps will need;
- as a Christmas gift :-), a reference to the pipeline itself inserted in
  that dictionary, which could provide access to the internal status of
  the previous steps should it be needed.

Another interesting case study with similar needs would be a regressor
using a previous clustering step in order to fit one model per cluster. In
such a case, the clustering results would be needed during the fitting.

Thanks for your interest!
Manolo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From l.lomasto at innovationengineering.eu  Wed Dec 20 11:42:50 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Wed, 20 Dec 2017 17:42:50 +0100
Subject: [scikit-learn] Parallel MLP version

Hi all,

I have a computational problem training my neural network, so can you tell
me whether there is any parallel version of the MLP library?

From drraph at gmail.com  Wed Dec 20 11:44:21 2017
From: drraph at gmail.com (Raphael C)
Date: Wed, 20 Dec 2017 16:44:21 +0000
Subject: [scikit-learn] Parallel MLP version

I believe tensorflow will do what you want.

Raphael
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
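Should scikit-learn's MLPClassifier not suffice, a minimal Keras sketch of
an equivalent MLP (Keras 2 API with the TensorFlow backend, which handles
the parallelism); the data and layer sizes are placeholder assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)  # placeholder binary target

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))  # one hidden layer
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=5, batch_size=128)  # uses a GPU automatically if available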
From Jeremiah.Johnson at unh.edu  Wed Dec 20 12:35:14 2017
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Wed, 20 Dec 2017 17:35:14 +0000
Subject: [scikit-learn] Parallel MLP version
Message-ID: <1A2511F7-FB15-4E50-9E4B-ADD0E8A3989C@unh.edu>

For neural network training, try one of tensorflow, pytorch, chainer, or
mxnet. They'll all parallelize the computations and can run them on Nvidia
GPUs with CUDA.

Best regards,
Jeremiah

Sent from my iPhone

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rth.yurchak at gmail.com  Wed Dec 20 13:32:35 2017
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Wed, 20 Dec 2017 19:32:35 +0100
Subject: [scikit-learn] Text classification of large dataset
Message-ID: <7386f24b-61fe-3ea9-00f2-c6e8a8941902@gmail.com>

Ranjana, have a look at this example:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

Since you have a lot of RAM, you may not need to make the whole
classification pipeline out-of-core. A start with your current code could
be to write a generator that loads and pre-processes the text in chunks,
then feeds it one document at a time to CountVectorizer.fit (it accepts an
iterable). To reduce the memory usage, filtering too-frequent tokens
(instead of the infrequent ones) could help too.

Make sure you L2-normalize your data before the classifier. You could use
SGDClassifier(loss='log') or LogisticRegression with a sag or saga solver.
The multi_class="multinomial" parameter might also be worth trying,
particularly since you have so many classes.

-- 
Roman
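A minimal sketch of the out-of-core route from the linked example:
HashingVectorizer is stateless (no fitting pass over 10 million documents)
and L2-normalizes by default, and SGDClassifier learns incrementally via
partial_fit. Here doc_chunks() and all_classes are assumed helpers: a
generator yielding (texts, labels) batches read from disk, and the array
of the ~7k distinct labels.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2 ** 20)  # stateless, fixed memory footprint
clf = SGDClassifier(loss='log')              # logistic regression, trained incrementally

for texts, labels in doc_chunks():           # assumed generator over batches
    X_batch = vec.transform(texts)           # sparse, L2-normalized by default
    clf.partial_fit(X_batch, labels, classes=all_classes)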
tell me whether there is any memory leak in my code and how to
> use a system with 128 GB RAM effectively?
>
> Thanks
> Ranjana
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

From joel.nothman at gmail.com Wed Dec 20 15:13:06 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 21 Dec 2017 07:13:06 +1100
Subject: [scikit-learn] Text classification of large dataset
In-Reply-To: References: <7386f24b-61fe-3ea9-00f2-c6e8a8941902@gmail.com>
Message-ID: 

To clarify:

You have 2.3M samples.
How many features? How many active features on average per sample?
In 7k classes: multiclass or multilabel?

Have you tried limiting the depth of the forest?

Have you tried embedding your feature space into a smaller vector
(pre-trained embeddings, hashing, LDA, PCA or random projection)?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com Fri Dec 22 06:09:55 2017
From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=)
Date: Fri, 22 Dec 2017 12:09:55 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?
In-Reply-To: References: Message-ID: 

I'm currently thinking of a computational graph which can then be wrapped
as a pipeline-like object ... I'll try to make a toy example solving my
problem.

On 20 Dec 2017 16:33, "Manuel Castejón Limas" wrote:

> Thank you all for your interest!
>
> In order to clarify the case allow me to try to synthesize the spirit of
> what I'd like to put into the pipeline using this sequence of steps:
>
> #%%
> import pandas as pd
> import numpy as np
> import matplotlib.pyplot as plt
>
> from sklearn.cluster import DBSCAN
> from sklearn.mixture import GaussianMixture
> from sklearn.model_selection import train_test_split
>
> np.random.seed(seed=42)
>
> """
> Data preparation
> """
>
> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
> data = pd.read_csv(URL, usecols=['V1','V2'])
> X, y = data[['V1']], data[['V2']]
>
> (data_train, data_test,
>  X_train, X_test,
>  y_train, y_test) = train_test_split(data, X, y)
>
> """
> Parameters setup
> """
>
> dbscan__eps = 0.06
>
> mclust__n_components = 3
>
> paella__noise_label = -1
> paella__max_it = 20,
> paella__regular_size = 400,
> paella__minimum_size = 100,
> paella__width_r = 0.99,
> paella__n_neighbors = 5,
> paella__power = 30,
> paella__random_state = None
>
> #%%
> """
> DBSCAN clustering to detect noise suspects (label == -1)
> """
>
> dbscan_input = data_train
>
> dbscan_clustering = DBSCAN(eps = dbscan__eps)
>
> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=np.int64(dbscan_output == -1))
>
> #%%
> """
> GaussianMixture fitted with filtered data_train in order to help locate
> the ellipsoids
> but predict is applied to the whole data_train set.
> """
>
> mclust_input = data_train[ dbscan_output != 1]
>
> mclust_clustering = GaussianMixture(n_components = mclust__n_components)
> mclust_clustering.fit(mclust_input)
>
> mclust_output = mclust_clustering.predict(data_train)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=mclust_output)
>
> #%%
> """
> mclust and dbscan results are combined.
> """ > > clustering_output = mclust_output.copy() > clustering_output[dbscan_output == -1] = -1 > > plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, > c=clustering_output) > > #%% > """ > Old-good Paella paper: https://link.springer. > com/article/10.1023/B:DAMI.0000031630.50685.7c > > The Paella algorithm calculates sample_weight to be used by the final step > regressor > (Yes, it is an outlier detection algorithm but we are focusing now on this > interesting collateral result). I am currently aggressively changing the > code in order to make it fit somehow with the pipeline > """ > > from paella import Paella > > paella_input = pd.concat([data, clustering_output], axis=1, inplace=False) > > paella_run = Paella(noise_label = paella__noise_label, > max_it = paella__max_it, > regular_size = paella__regular_size, > minimum_size = paella__minimum_size, > width_r = paella__width_r, > n_neighbors = paella__n_neighbors, > power = paella__power, > random_state = paella__random_state) > > paella_output = paella_run.fit_predict(paella_input, y_train) > # paella_output is a vector with sample_weight > > #%% > """ > Here we fit a regressor using sample_weight=paella_output > """ > from sklearn.linear_model import LinearRegression > > regressor_input=X_train > lm = LinearRegression() > lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output) > regressor_output = lm.predict(X_train) > > #... > > In this example we can see that: > - A particular step might need results produced not necessarily from the > immediately previous step. > - The X parameter is not sequentially transformed. Sometimes we might need > to skip to a previous step > - y sometimes is the target, sometimes is not. For the regressor it is > indeed, but for the paella algorithm the prediction is expressed as a > vector representing sample_weights. > > All in all the conclusion is that the chain of processes is not as linear > as imposed by the current API. I guess that all these difficulties could be > solved by: > - Passing a dictionary through the different steps containing the partial > results that the following steps will need. > - As a christmas gift :-) , a reference to the pipeline itself inserted > in that dictionary could provide access to the internal status of the > previous steps should it be needed. > > Another interesting study case with similar needs would be a regressor > using a previous clustering step in order to fit one model per cluster. In > such case, the clustering results would be needed during the fitting. > > > Thanks for your interest! > Manolo > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Sylvain.Takerkart at univ-amu.fr Fri Dec 22 06:20:55 2017 From: Sylvain.Takerkart at univ-amu.fr (Sylvain Takerkart) Date: Fri, 22 Dec 2017 12:20:55 +0100 Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints? In-Reply-To: <20171219213525.GE360768@phare.normalesup.org> References: <20171219213525.GE360768@phare.normalesup.org> Message-ID: Hello, Yes, Gael's paper points out some fundamental issues! In your case, the practical question is to know what kind of cross validation scheme you used... If you originally used StratifiedKFold, try to re-run your experiments with StratifiedShuffleSplit and a large number of splits! Hopefully, increasing the number of splits should reduce the discrepancy you observe between the two mean accuracies... 
But as Gael says, the small sample size brings fundamental limitations to
what you can measure...

Sylvain

On Tue, Dec 19, 2017 at 10:35 PM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:

> With as few data points, there is a huge uncertainty in the estimation of
> the prediction accuracy with cross-validation. This isn't a problem of
> the method, it is a basic limitation of the small amount of data. I've
> written a paper on this problem in the specific context of neuroimaging:
> https://www.sciencedirect.com/science/article/pii/S1053811917305311
> (preprint: https://hal.inria.fr/hal-01545002/).
>
> I expect that what you are seeing is sampling noise: the result has
> confidence intervals larger than 10%.
>
> Gaël
>
>
> On Tue, Dec 19, 2017 at 04:27:53PM -0500, Taylor, Johnmark wrote:
> > Hello,
>
> > I am a researcher in fMRI and am using SVMs to analyze brain data. I am
> > doing decoding between two classes, each of which has 24 exemplars per
> > class. I am comparing two different methods of cross-validation for my
> > data: in one, I am training on 23 exemplars from each class, and testing
> > on the remaining example from each class, and in the other, I am
> > training on 22 exemplars from each class, and testing on the remaining
> > two from each class (in case it matters, the data is structured into
> > different neuroimaging "runs", with each "run" containing several
> > "blocks"; the first cross-validation method is leaving out one block at
> > a time, the second is leaving out one run at a time).
>
> > Now, I would've thought that these two CV methods would be very similar,
> > since the vast majority of the training data is the same; the only
> > difference is in adding two additional points. However, they are
> > yielding very different results: training on 23 per class is yielding
> > 60% decoding accuracy (averaged across several subjects, and
> > statistically significantly greater than chance), training on 22 per
> > class is yielding chance (50%) decoding. Leaving aside the particulars
> > of fMRI in this case: is it unusual for single points (amounting to
> > less than 5% of the data) to have such a big influence on SVM decoding?
> > I am using a cost parameter of C=1. I must say it is counterintuitive
> > to me that just a couple points out of two dozen could make such a big
> > difference.
>
> > Thank you very much, and cheers,
>
> > JohnMark
>
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> --
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-- 
Sylvain Takerkart
Institut des Neurosciences de la Timone (INT)
UMR 7289 CNRS-AMU
Marseille, France
tél: +33 (0)4 91 324 007
http://www.int.univ-amu.fr/_TAKERKART-Sylvain_?lang=en
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com Tue Dec 26 05:47:47 2017
From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=)
Date: Tue, 26 Dec 2017 11:47:47 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?
In-Reply-To: References: Message-ID: 

I'm elaborating on the graph idea. A dictionary to describe the graph, the
networkx package to support the graph and run it in topological order; and
some wrappers for scikit-learn models.

I'm currently thinking of putting some more effort into a contrib project.

It could be something inspired by this example.

Manolo

#-------------------------------------------------

graph_description = {
    'First':
        {'operation': First_Step,
         'input': {'X': X, 'y': y}},

    'Concatenate_Xy':
        {'operation': ConcatenateData_Step,
         'input': [('First', 'X'),
                   ('First', 'y')]},

    'Gaussian_Mixture':
        {'operation': Gaussian_Mixture_Step,
         'input': [('Concatenate_Xy', 'data')]},

    'Dbscan':
        {'operation': Dbscan_Step,
         'input': [('Concatenate_Xy', 'data')]},

    'CombineClustering':
        {'operation': CombineClustering_Step,
         'input': [('Dbscan', 'classification'),
                   ('Gaussian_Mixture', 'classification')]},

    'Paella':
        {'operation': Paella_Step,
         'input': [('First', 'X'),
                   ('First', 'y'),
                   ('Concatenate_Xy', 'data'),
                   ('CombineClustering', 'classification')]},

    'Regressor':
        {'operation': Regressor_Step,
         'input': [('First', 'X'),
                   ('First', 'y'),
                   ('Paella', 'sample_weight')]},

    'Last':
        {'operation': Last_Step,
         'input': [('Regressor', 'regressor')]},
}

#%%
def create_graph(description):
    cg = nx.DiGraph()
    cg.add_nodes_from(description)
    for current_name, info in description.items():
        current_node = cg.node[current_name]
        current_node['operation'] = info['operation'](graph=cg,
                                                      node_name=current_name)
        current_node['input'] = info['input']
        if current_name != 'First':
            for ascendant in set(name for name, attribute in info['input']):
                cg.add_edge(ascendant, current_name)
    return cg

#%%
cg = create_graph(graph_description)

node_pos = {'First'            : ( 0, 0),
            'Concatenate_Xy'   : ( 2, 4),
            'Gaussian_Mixture' : ( 6, 8),
            'Dbscan'           : ( 6, 6),
            'CombineClustering': ( 8, 7),
            'Paella'           : (10, 2),
            'Regressor'        : (12, 0),
            'Last'             : (16, 0)
            }

nx.draw(cg, pos=node_pos, with_labels=True)

#%%

print("=========================")
for name in nx.topological_sort(cg):
    print("Running: ", name)
    cg.node[name]['operation'].fit()

print("=========================")

########################

2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas :

> I'm currently thinking of a computational graph which can then be wrapped
> as a pipeline-like object ... I'll try to make a toy example solving my
> problem.
>
> On 20 Dec 2017 16:33, "Manuel Castejón Limas"
> wrote:
>
>> Thank you all for your interest!
>> >> In order to clarify the case allow me to try to synthesize the spirit of >> what I'd like to put into the pipeline using this sequence of steps: >> >> #%% >> import pandas as pd >> import numpy as np >> import matplotlib.pyplot as plt >> >> from sklearn.cluster import DBSCAN >> from sklearn.mixture import GaussianMixture >> from sklearn.model_selection import train_test_split >> >> np.random.seed(seed=42) >> >> """ >> Data preparation >> """ >> >> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/ >> sin_60_percent_noise.csv" >> data = pd.read_csv(URL, usecols=['V1','V2']) >> X, y = data[['V1']], data[['V2']] >> >> (data_train, data_test, >> X_train, X_test, >> y_train, y_test) = train_test_split(data, X, y) >> >> """ >> Parameters setup >> """ >> >> dbscan__eps = 0.06 >> >> mclust__n_components = 3 >> >> paella__noise_label = -1 >> paella__max_it = 20, >> paella__regular_size = 400, >> paella__minimum_size = 100, >> paella__width_r = 0.99, >> paella__n_neighbors = 5, >> paella__power = 30, >> paella__random_state = None >> >> #%% >> """ >> DBSCAN clustering to detect noise suspects (label == -1) >> """ >> >> dbscan_input = data_train >> >> dbscan_clustering = DBSCAN(eps = dbscan__eps) >> >> dbscan_output = dbscan_clustering.fit_predict(dbscan_input) >> >> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >> c=np.int64(dbscan_output == -1)) >> >> #%% >> """ >> GaussianMixture fitted with filtered data_train in order to help locate >> the ellipsoids >> but predict is applied to the whole data_train set. >> """ >> >> mclust_input = data_train[ dbscan_output != 1] >> >> mclust_clustering = GaussianMixture(n_components = mclust__n_components) >> mclust_clustering.fit(mclust_input) >> >> mclust_output = mclust_clustering.predict(data_train) >> >> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >> c=mclust_output) >> >> #%% >> """ >> mclust and dbscan results are combined. >> """ >> >> clustering_output = mclust_output.copy() >> clustering_output[dbscan_output == -1] = -1 >> >> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >> c=clustering_output) >> >> #%% >> """ >> Old-good Paella paper: https://link.springer.c >> om/article/10.1023/B:DAMI.0000031630.50685.7c >> >> The Paella algorithm calculates sample_weight to be used by the final >> step regressor >> (Yes, it is an outlier detection algorithm but we are focusing now on >> this interesting collateral result). I am currently aggressively changing >> the code in order to make it fit somehow with the pipeline >> """ >> >> from paella import Paella >> >> paella_input = pd.concat([data, clustering_output], axis=1, inplace=False) >> >> paella_run = Paella(noise_label = paella__noise_label, >> max_it = paella__max_it, >> regular_size = paella__regular_size, >> minimum_size = paella__minimum_size, >> width_r = paella__width_r, >> n_neighbors = paella__n_neighbors, >> power = paella__power, >> random_state = paella__random_state) >> >> paella_output = paella_run.fit_predict(paella_input, y_train) >> # paella_output is a vector with sample_weight >> >> #%% >> """ >> Here we fit a regressor using sample_weight=paella_output >> """ >> from sklearn.linear_model import LinearRegression >> >> regressor_input=X_train >> lm = LinearRegression() >> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output) >> regressor_output = lm.predict(X_train) >> >> #... 
>>
>> In this example we can see that:
>> - A particular step might need results produced not necessarily from the
>> immediately previous step.
>> - The X parameter is not sequentially transformed. Sometimes we might
>> need to skip to a previous step
>> - y sometimes is the target, sometimes is not. For the regressor it is
>> indeed, but for the paella algorithm the prediction is expressed as a
>> vector representing sample_weights.
>>
>> All in all the conclusion is that the chain of processes is not as linear
>> as imposed by the current API. I guess that all these difficulties could be
>> solved by:
>> - Passing a dictionary through the different steps containing the partial
>> results that the following steps will need.
>> - As a christmas gift :-) , a reference to the pipeline itself inserted
>> in that dictionary could provide access to the internal status of the
>> previous steps should it be needed.
>>
>> Another interesting study case with similar needs would be a regressor
>> using a previous clustering step in order to fit one model per cluster. In
>> such case, the clustering results would be needed during the fitting.
>>
>>
>> Thanks for your interest!
>> Manolo
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ranjanagirish30 at gmail.com Wed Dec 27 05:16:34 2017
From: ranjanagirish30 at gmail.com (Ranjana Girish)
Date: Wed, 27 Dec 2017 15:46:34 +0530
Subject: [scikit-learn] Text classification of large dataset
Message-ID: 

Hi all,

Thank you for your suggestions. But I am still getting a memory error
while doing feature selection:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
documenttermmatrix1 = fs.fit_transform(documenttermmatrix, y1)

documenttermmatrix is of shape (1594516, 232832); its type is a scipy
CSR matrix.

Am I doing anything wrong? Is there any better way of doing feature
selection?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tevang3 at gmail.com Fri Dec 29 06:09:00 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Fri, 29 Dec 2017 12:09:00 +0100
Subject: Re: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: References: Message-ID: 

Alright, with these attributes I can get the weights and biases, but what
about the values on the nodes of the last hidden layer? Do I have to work
them out myself or is there a straightforward way to get them?

On 7 December 2017 at 04:25, Manoj Kumar wrote:

> Hi,
>
> The weights and intercepts are available in the coefs_ and intercepts_
> attribute respectively.
>
> See https://github.com/scikit-learn/scikit-learn/blob/
> a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835
>
> On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> I am also very interested in knowing if there is a sklearn cookbook
>> solution for getting the weights of a one-hidde-layer MLPClassifier.
>> J.B.
>>
>> 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis :
>>
>>> Greetings,
>>>
>>> I want to train a MLPClassifier with one hidden layer and use it as a
>>> feature selector for an MLPRegressor.
>>> Is it possible to get the values of the neurons from the last hidden
>>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>>
>>> If it is not possible with scikit-learn, is anyone aware of any
>>> scikit-compatible NN library that offers this functionality?
For example >>> this one: >>> >>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html >>> >>> I wouldn't like to do this in Tensorflow because the MLP there is much >>> slower than scikit-learn's implementation. >>> >>> >>> Thomas >>> >>> >>> -- >>> >>> ====================================================================== >>> >>> Dr Thomas Evangelidis >>> >>> Post-doctoral Researcher >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/2S049, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Manoj, > http://github.com/MechCoder > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlopez at ende.cc Fri Dec 29 11:45:49 2017 From: jlopez at ende.cc (=?UTF-8?Q?Javier_L=C3=B3pez?=) Date: Fri, 29 Dec 2017 16:45:49 +0000 Subject: [scikit-learn] MLPClassifier as a feature selector In-Reply-To: References: Message-ID: Hi Thomas, it is possible to obtain the activation values of any hidden layer, but the procedure is not completely straight forward. If you look at the code of the `_predict` method of MLPs you can see the following: ```python def _predict(self, X): """Predict using the trained model Parameters ---------- X : {array-like, sparse matrix}, shape (n_samples, n_features) The input data. Returns ------- y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs) The decision function of the samples for each class in the model. """ X = check_array(X, accept_sparse=['csr', 'csc', 'coo']) # Make sure self.hidden_layer_sizes is a list hidden_layer_sizes = self.hidden_layer_sizes if not hasattr(hidden_layer_sizes, "__iter__"): hidden_layer_sizes = [hidden_layer_sizes] hidden_layer_sizes = list(hidden_layer_sizes) layer_units = [X.shape[1]] + hidden_layer_sizes + \ [self.n_outputs_] # Initialize layers activations = [X] for i in range(self.n_layers_ - 1): activations.append(np.empty((X.shape[0], layer_units[i + 1]))) # forward propagate self._forward_pass(activations) y_pred = activations[-1] return y_pred ``` the line `y_pred = activations[-1]` is responsible for extracting the values for the last layer, but the `activations` variable contains the values for all the neurons. You can make this function into your own external method (changing the `self` attribute by a proper parameter) and add an extra argument which specifies the layer(s) that you want. I have done this myself in order to make an AutoEncoderNetwork out of the MLP implementation. 
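In the meantime, the same idea can be sketched without touching the class
internals, using only the public `coefs_` and `intercepts_` attributes (a
minimal example; the helper name is made up, it assumes an already fitted
`mlp`, and only the tanh, logistic and relu activations are handled):

```python
import numpy as np

ACTIVATIONS = {
    'tanh': np.tanh,
    'logistic': lambda z: 1.0 / (1.0 + np.exp(-z)),
    'relu': lambda z: np.maximum(z, 0.0),
}

def hidden_layer_values(mlp, X, layer=-1):
    """Forward-propagate X and return the activations of one hidden layer.

    layer=-1 returns the last hidden layer, i.e. the representation
    that the output layer of the fitted estimator sees.
    """
    act = ACTIVATIONS[mlp.activation]
    A = np.asarray(X)
    hidden = []
    # coefs_[i] / intercepts_[i] map layer i to layer i + 1; the last
    # pair belongs to the output layer, so it is skipped here.
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        A = act(A.dot(W) + b)
        hidden.append(A)
    return hidden[layer]
```

Applied to a fitted MLPClassifier, `hidden_layer_values(clf, X)` yields a
feature matrix that can be passed straight to an MLPRegressor's `fit`,
which is what was asked at the start of this thread.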
This makes me wonder, would it be worth adding this to sklearn?
A very simple way would be to refactor the `_predict` method, with the
additional layer argument, to a new method `_predict_layer`, then we can
have the `_predict` method simply call `_predict_layer(..., layer=-1)`
and add a new method (perhaps a `transform`?) that allows to get
(raveled) values for an arbitrary subset of the layers.

I'd be happy to submit a PR if you guys think it would be interesting for
the project.

Javier

On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis wrote:

> Greetings,
>
> I want to train a MLPClassifier with one hidden layer and use it as a
> feature selector for an MLPRegressor.
> Is it possible to get the values of the neurons from the last hidden layer
> of the MLPClassifier to pass them as input to the MLPRegressor?
>
> If it is not possible with scikit-learn, is anyone aware of any
> scikit-compatible NN library that offers this functionality? For example
> this one:
>
> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>
> I wouldn't like to do this in Tensorflow because the MLP there is much
> slower than scikit-learn's implementation.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Fri Dec 29 12:14:15 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 29 Dec 2017 18:14:15 +0100
Subject: Re: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: References: Message-ID: <460c5520-3226-4aaf-bcbd-343d1e4a7e0e@normalesup.org>

I think that a transform method would be good. We would have to add a
parameter to the constructor to specify which layer is used for the
transform. It should default to "-1", in my opinion.

Cheers,

Gaël

"Sent from my phone. Please forgive typos and briefness."

On Dec 29, 2017, 17:48, at 17:48, "Javier López" wrote:
>Hi Thomas,
>
>it is possible to obtain the activation values of any hidden layer, but
>the
>procedure is not completely straight forward. If you look at the code
>of
>the `_predict` method of MLPs you can see the following:
>
>```python
>    def _predict(self, X):
>        """Predict using the trained model
>
>        Parameters
>        ----------
>        X : {array-like, sparse matrix}, shape (n_samples, n_features)
>            The input data.
>
>        Returns
>        -------
>        y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>            The decision function of the samples for each class in the
>model.
>        """
>        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>
>        # Make sure self.hidden_layer_sizes is a list
>        hidden_layer_sizes = self.hidden_layer_sizes
>        if not hasattr(hidden_layer_sizes, "__iter__"):
>            hidden_layer_sizes = [hidden_layer_sizes]
>        hidden_layer_sizes = list(hidden_layer_sizes)
>
>        layer_units = [X.shape[1]] + hidden_layer_sizes + \
>            [self.n_outputs_]
>
>        # Initialize layers
>        activations = [X]
>
>        for i in range(self.n_layers_ - 1):
>            activations.append(np.empty((X.shape[0],
>                                         layer_units[i + 1])))
>        # forward propagate
>        self._forward_pass(activations)
>        y_pred = activations[-1]
>
>        return y_pred
>```
>
>the line `y_pred = activations[-1]` is responsible for extracting the
>values for the last layer,
>but the `activations` variable contains the values for all the neurons.
>
>You can make this function into your own external method (changing the
>`self` attribute by
>a proper parameter) and add an extra argument which specifies the
>layer(s)
>that you want.
>I have done this myself in order to make an AutoEncoderNetwork out of
>the MLP implementation.
>
>This makes me wonder, would it be worth adding this to sklearn?
>A very simple way would be to refactor the `_predict` method, with the
>additional layer argument, to a new method `_predict_layer`, then we can
>have the `_predict` method simply call `_predict_layer(..., layer=-1)`
>and add a new method (perhaps a `transform`?) that allows to get
>(raveled) values for an arbitrary subset of the layers.
>
>I'd be happy to submit a PR if you guys think it would be interesting
>for
>the project.
>
>Javier
>
>
>
>On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis
>wrote:
>
>> Greetings,
>>
>> I want to train a MLPClassifier with one hidden layer and use it as a
>> feature selector for an MLPRegressor.
>> Is it possible to get the values of the neurons from the last hidden
>layer
>> of the MLPClassifier to pass them as input to the MLPRegressor?
>>
>> If it is not possible with scikit-learn, is anyone aware of any
>> scikit-compatible NN library that offers this functionality? For
>example
>> this one:
>>
>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>
>> I wouldn't like to do this in Tensorflow because the MLP there is
>much
>> slower than scikit-learn's implementation.
>>
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From info at orges-leka.de Fri Dec 29 13:19:23 2017
From: info at orges-leka.de (Orges Leka)
Date: Fri, 29 Dec 2017 19:19:23 +0100
Subject: Re: [scikit-learn] scikit-learn Digest, Vol 21, Issue 29
In-Reply-To: References: Message-ID: 

Hello,

You could use the following code:

import math
import numpy as np

X_weight = []
for x in X:
    for i in range(len(mlp.coefs_) - 1):
        x = np.array([math.tanh(v) for v in (x.dot(mlp.coefs_[i]) + mlp.intercepts_[i])])
    X_weight.append(x)

where it is assumed that mlp is your trained MLPClassifier and that you
trained with the tanh activation function. X is the matrix for which you
want to compute the features, and x iterates over the rows of this matrix.
X_weight is a list of vectors with the computed hidden-layer values.

Kind regards
Orges Leka

2017-12-29 17:46 GMT+01:00 :

> Send scikit-learn mailing list submissions to
>         scikit-learn at python.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-request at python.org
>
> You can reach the person managing the list at
>         scikit-learn-owner at python.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
>
>
> Today's Topics:
>
>    1. Re: MLPClassifier as a feature selector (Thomas Evangelidis)
>    2. Re: MLPClassifier as a feature selector (Javier López)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 29 Dec 2017 12:09:00 +0100
> From: Thomas Evangelidis
> To: Scikit-learn mailing list
> Subject: Re: [scikit-learn] MLPClassifier as a feature selector
> Message-ID:
>         GkbDe6dd9w at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Alright, with these attributes I can get the weights and biases, but what
> about the values on the nodes of the last hidden layer? Do I have to work
> them out myself or is there a straightforward way to get them?
> > On 7 December 2017 at 04:25, Manoj Kumar > wrote: > > > Hi, > > > > The weights and intercepts are available in the coefs_ and intercepts_ > > attribute respectively. > > > > See https://github.com/scikit-learn/scikit-learn/blob/ > > a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835 > > > > On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn < > > scikit-learn at python.org> wrote: > > > >> I am also very interested in knowing if there is a sklearn cookbook > >> solution for getting the weights of a one-hidde-layer MLPClassifier. > >> J.B. > >> > >> 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > >> > >>> Greetings, > >>> > >>> I want to train a MLPClassifier with one hidden layer and use it as a > >>> feature selector for an MLPRegressor. > >>> Is it possible to get the values of the neurons from the last hidden > >>> layer of the MLPClassifier to pass them as input to the MLPRegressor? > >>> > >>> If it is not possible with scikit-learn, is anyone aware of any > >>> scikit-compatible NN library that offers this functionality? For > example > >>> this one: > >>> > >>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html > >>> > >>> I wouldn't like to do this in Tensorflow because the MLP there is much > >>> slower than scikit-learn's implementation. > >>> > >>> > >>> Thomas > >>> > >>> > >>> -- > >>> > >>> ====================================================================== > >>> > >>> Dr Thomas Evangelidis > >>> > >>> Post-doctoral Researcher > >>> CEITEC - Central European Institute of Technology > >>> Masaryk University > >>> Kamenice 5/A35/2S049, > >>> 62500 Brno, Czech Republic > >>> > >>> email: tevang at pharm.uoa.gr > >>> > >>> tevang3 at gmail.com > >>> > >>> > >>> website: https://sites.google.com/site/thomasevangelidishomepage/ > >>> > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > > > > > > -- > > Manoj, > > http://github.com/MechCoder > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: attachments/20171229/40eaa98c/attachment-0001.html> > > ------------------------------ > > Message: 2 > Date: Fri, 29 Dec 2017 16:45:49 +0000 > From: Javier L?pez > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] MLPClassifier as a feature selector > Message-ID: > gmail.com> > Content-Type: text/plain; charset="utf-8" > > Hi Thomas, > > it is possible to obtain the activation values of any hidden layer, but the > procedure is not completely straight forward. 
If you look at the code of > the `_predict` method of MLPs you can see the following: > > ```python > def _predict(self, X): > """Predict using the trained model > > Parameters > ---------- > X : {array-like, sparse matrix}, shape (n_samples, n_features) > The input data. > > Returns > ------- > y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs) > The decision function of the samples for each class in the > model. > """ > X = check_array(X, accept_sparse=['csr', 'csc', 'coo']) > > # Make sure self.hidden_layer_sizes is a list > hidden_layer_sizes = self.hidden_layer_sizes > if not hasattr(hidden_layer_sizes, "__iter__"): > hidden_layer_sizes = [hidden_layer_sizes] > hidden_layer_sizes = list(hidden_layer_sizes) > > layer_units = [X.shape[1]] + hidden_layer_sizes + \ > [self.n_outputs_] > > # Initialize layers > activations = [X] > > for i in range(self.n_layers_ - 1): > activations.append(np.empty((X.shape[0], > layer_units[i + 1]))) > # forward propagate > self._forward_pass(activations) > y_pred = activations[-1] > > return y_pred > ``` > > the line `y_pred = activations[-1]` is responsible for extracting the > values for the last layer, > but the `activations` variable contains the values for all the neurons. > > You can make this function into your own external method (changing the > `self` attribute by > a proper parameter) and add an extra argument which specifies the layer(s) > that you want. > I have done this myself in order to make an AutoEncoderNetwork out of the > MLP > implementation. > > This makes me wonder, would it be worth adding this to sklearn? > A very simple way would be to refactor the `_predict` method, with the > additional layer > argument, to a new method `_predict_layer`, then we can have the `_predict` > method > simply call `_predict_layer(..., layer=-1)` and add a new method (perhaps a > `transform`?) > that allows to get (raveled) values for an arbitrary subset of the layers. > > I'd be happy to submit a PR if you guys think it would be interesting for > the project. > > Javier > > > > On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis > wrote: > > > Greetings, > > > > I want to train a MLPClassifier with one hidden layer and use it as a > > feature selector for an MLPRegressor. > > Is it possible to get the values of the neurons from the last hidden > layer > > of the MLPClassifier to pass them as input to the MLPRegressor? > > > > If it is not possible with scikit-learn, is anyone aware of any > > scikit-compatible NN library that offers this functionality? For example > > this one: > > > > http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html > > > > I wouldn't like to do this in Tensorflow because the MLP there is much > > slower than scikit-learn's implementation. > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: attachments/20171229/47c835c7/attachment.html> > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 21, Issue 29 > ******************************************** > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From tevang3 at gmail.com Sat Dec 30 03:55:03 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sat, 30 Dec 2017 09:55:03 +0100
Subject: Re: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: References: Message-ID: 

Javier, thank you for the detailed explanation. Indeed, it would be very
useful to add such a function in the official scikit-learn bundle instead
of keeping our own modified versions of the MLP. It would be good for
transferability of our code.

On 29. 12. 2017 at 17:47, "Javier López" wrote:

> Hi Thomas,
>
> it is possible to obtain the activation values of any hidden layer, but the
> procedure is not completely straight forward. If you look at the code of
> the `_predict` method of MLPs you can see the following:
>
> ```python
>     def _predict(self, X):
>         """Predict using the trained model
>
>         Parameters
>         ----------
>         X : {array-like, sparse matrix}, shape (n_samples, n_features)
>             The input data.
>
>         Returns
>         -------
>         y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>             The decision function of the samples for each class in the
> model.
>         """
>         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>
>         # Make sure self.hidden_layer_sizes is a list
>         hidden_layer_sizes = self.hidden_layer_sizes
>         if not hasattr(hidden_layer_sizes, "__iter__"):
>             hidden_layer_sizes = [hidden_layer_sizes]
>         hidden_layer_sizes = list(hidden_layer_sizes)
>
>         layer_units = [X.shape[1]] + hidden_layer_sizes + \
>             [self.n_outputs_]
>
>         # Initialize layers
>         activations = [X]
>
>         for i in range(self.n_layers_ - 1):
>             activations.append(np.empty((X.shape[0],
>                                          layer_units[i + 1])))
>         # forward propagate
>         self._forward_pass(activations)
>         y_pred = activations[-1]
>
>         return y_pred
> ```
>
> the line `y_pred = activations[-1]` is responsible for extracting the
> values for the last layer,
> but the `activations` variable contains the values for all the neurons.
>
> You can make this function into your own external method (changing the
> `self` attribute by
> a proper parameter) and add an extra argument which specifies the layer(s)
> that you want.
> I have done this myself in order to make an AutoEncoderNetwork out of the
> MLP
> implementation.
>
> This makes me wonder, would it be worth adding this to sklearn?
> A very simple way would be to refactor the `_predict` method, with the
> additional layer
> argument, to a new method `_predict_layer`, then we can have the
> `_predict` method
> simply call `_predict_layer(..., layer=-1)` and add a new method (perhaps
> a `transform`?)
> that allows to get (raveled) values for an arbitrary subset of the layers.
>
> I'd be happy to submit a PR if you guys think it would be interesting for
> the project.
>
> Javier
>
>
>
> On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis
> wrote:
>
>> Greetings,
>>
>> I want to train a MLPClassifier with one hidden layer and use it as a
>> feature selector for an MLPRegressor.
>> Is it possible to get the values of the neurons from the last hidden
>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>
>> If it is not possible with scikit-learn, is anyone aware of any
>> scikit-compatible NN library that offers this functionality? For example
>> this one:
>>
>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>
>> I wouldn't like to do this in Tensorflow because the MLP there is much
>> slower than scikit-learn's implementation.
>>
> > _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From frederic.bastien at gmail.com Sat Dec 30 09:34:54 2017
From: frederic.bastien at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBCYXN0aWVu?=)
Date: Sat, 30 Dec 2017 14:34:54 +0000
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?
In-Reply-To: References: Message-ID: 

This is starting to look like the dask project. Do you know it?

On Tue, 26 Dec 2017 05:49, Manuel Castejón Limas wrote:

> I'm elaborating on the graph idea. A dictionary to describe the graph, the
> networkx package to support the graph and run it in topological order; and
> some wrappers for scikit-learn models.
>
> I'm currently thinking of putting some more effort into a contrib project.
>
> It could be something inspired by this example.
>
> Manolo
>
> #-------------------------------------------------
>
> graph_description = {
>     'First':
>         {'operation': First_Step,
>          'input': {'X': X, 'y': y}},
>
>     'Concatenate_Xy':
>         {'operation': ConcatenateData_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y')]},
>
>     'Gaussian_Mixture':
>         {'operation': Gaussian_Mixture_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'Dbscan':
>         {'operation': Dbscan_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'CombineClustering':
>         {'operation': CombineClustering_Step,
>          'input': [('Dbscan', 'classification'),
>                    ('Gaussian_Mixture', 'classification')]},
>
>     'Paella':
>         {'operation': Paella_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Concatenate_Xy', 'data'),
>                    ('CombineClustering', 'classification')]},
>
>     'Regressor':
>         {'operation': Regressor_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Paella', 'sample_weight')]},
>
>     'Last':
>         {'operation': Last_Step,
>          'input': [('Regressor', 'regressor')]},
> }
>
> #%%
> def create_graph(description):
>     cg = nx.DiGraph()
>     cg.add_nodes_from(description)
>     for current_name, info in description.items():
>         current_node = cg.node[current_name]
>         current_node['operation'] = info['operation'](graph=cg,
>                                                       node_name=current_name)
>         current_node['input'] = info['input']
>         if current_name != 'First':
>             for ascendant in set(name for name, attribute in info['input']):
>                 cg.add_edge(ascendant, current_name)
>     return cg
>
> #%%
> cg = create_graph(graph_description)
>
> node_pos = {'First'            : ( 0, 0),
>             'Concatenate_Xy'   : ( 2, 4),
>             'Gaussian_Mixture' : ( 6, 8),
>             'Dbscan'           : ( 6, 6),
>             'CombineClustering': ( 8, 7),
>             'Paella'           : (10, 2),
>             'Regressor'        : (12, 0),
>             'Last'             : (16, 0)
>             }
>
> nx.draw(cg, pos=node_pos, with_labels=True)
>
> #%%
>
> print("=========================")
> for name in nx.topological_sort(cg):
>     print("Running: ", name)
>     cg.node[name]['operation'].fit()
>
> print("=========================")
>
> ########################
>
>
> 2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <
> manuel.castejon at gmail.com>:
>
>> I'm currently thinking of a computational graph which can then be wrapped
>> as a pipeline-like object ... I'll try to make a toy example solving my
>> problem.
>>
>> On 20 Dec 2017 16:33, "Manuel Castejón Limas"
>> wrote:
>>
>>> Thank you all for your interest!
>>> >>> In order to clarify the case allow me to try to synthesize the spirit >>> of what I'd like to put into the pipeline using this sequence of steps: >>> >>> #%% >>> import pandas as pd >>> import numpy as np >>> import matplotlib.pyplot as plt >>> >>> from sklearn.cluster import DBSCAN >>> from sklearn.mixture import GaussianMixture >>> from sklearn.model_selection import train_test_split >>> >>> np.random.seed(seed=42) >>> >>> """ >>> Data preparation >>> """ >>> >>> URL = " >>> https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv >>> " >>> data = pd.read_csv(URL, usecols=['V1','V2']) >>> X, y = data[['V1']], data[['V2']] >>> >>> (data_train, data_test, >>> X_train, X_test, >>> y_train, y_test) = train_test_split(data, X, y) >>> >>> """ >>> Parameters setup >>> """ >>> >>> dbscan__eps = 0.06 >>> >>> mclust__n_components = 3 >>> >>> paella__noise_label = -1 >>> paella__max_it = 20, >>> paella__regular_size = 400, >>> paella__minimum_size = 100, >>> paella__width_r = 0.99, >>> paella__n_neighbors = 5, >>> paella__power = 30, >>> paella__random_state = None >>> >>> #%% >>> """ >>> DBSCAN clustering to detect noise suspects (label == -1) >>> """ >>> >>> dbscan_input = data_train >>> >>> dbscan_clustering = DBSCAN(eps = dbscan__eps) >>> >>> dbscan_output = dbscan_clustering.fit_predict(dbscan_input) >>> >>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >>> c=np.int64(dbscan_output == -1)) >>> >>> #%% >>> """ >>> GaussianMixture fitted with filtered data_train in order to help locate >>> the ellipsoids >>> but predict is applied to the whole data_train set. >>> """ >>> >>> mclust_input = data_train[ dbscan_output != 1] >>> >>> mclust_clustering = GaussianMixture(n_components = mclust__n_components) >>> mclust_clustering.fit(mclust_input) >>> >>> mclust_output = mclust_clustering.predict(data_train) >>> >>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >>> c=mclust_output) >>> >>> #%% >>> """ >>> mclust and dbscan results are combined. >>> """ >>> >>> clustering_output = mclust_output.copy() >>> clustering_output[dbscan_output == -1] = -1 >>> >>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >>> c=clustering_output) >>> >>> #%% >>> """ >>> Old-good Paella paper: >>> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c >>> >>> The Paella algorithm calculates sample_weight to be used by the final >>> step regressor >>> (Yes, it is an outlier detection algorithm but we are focusing now on >>> this interesting collateral result). 
I am currently aggressively changing
>>> the code in order to make it fit somehow with the pipeline
>>> """
>>>
>>> from paella import Paella
>>>
>>> paella_input = pd.concat([data, clustering_output], axis=1,
>>> inplace=False)
>>>
>>> paella_run = Paella(noise_label = paella__noise_label,
>>>                     max_it = paella__max_it,
>>>                     regular_size = paella__regular_size,
>>>                     minimum_size = paella__minimum_size,
>>>                     width_r = paella__width_r,
>>>                     n_neighbors = paella__n_neighbors,
>>>                     power = paella__power,
>>>                     random_state = paella__random_state)
>>>
>>> paella_output = paella_run.fit_predict(paella_input, y_train)
>>> # paella_output is a vector with sample_weight
>>>
>>> #%%
>>> """
>>> Here we fit a regressor using sample_weight=paella_output
>>> """
>>> from sklearn.linear_model import LinearRegression
>>>
>>> regressor_input=X_train
>>> lm = LinearRegression()
>>> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
>>> regressor_output = lm.predict(X_train)
>>>
>>> #...
>>>
>>> In this example we can see that:
>>> - A particular step might need results produced not necessarily from the
>>> immediately previous step.
>>> - The X parameter is not sequentially transformed. Sometimes we might
>>> need to skip to a previous step
>>> - y sometimes is the target, sometimes is not. For the regressor it is
>>> indeed, but for the paella algorithm the prediction is expressed as a
>>> vector representing sample_weights.
>>>
>>> All in all the conclusion is that the chain of processes is not as
>>> linear as imposed by the current API. I guess that all these difficulties
>>> could be solved by:
>>> - Passing a dictionary through the different steps containing the
>>> partial results that the following steps will need.
>>> - As a christmas gift :-) , a reference to the pipeline itself inserted
>>> in that dictionary could provide access to the internal status of the
>>> previous steps should it be needed.
>>>
>>> Another interesting study case with similar needs would be a regressor
>>> using a previous clustering step in order to fit one model per cluster. In
>>> such case, the clustering results would be needed during the fitting.
>>>
>>>
>>> Thanks for your interest!
>>> Manolo
>>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gauravdhingra.gxyd at gmail.com Sun Dec 31 05:48:31 2017
From: gauravdhingra.gxyd at gmail.com (Gaurav Dhingra)
Date: Sun, 31 Dec 2017 16:18:31 +0530
Subject: [scikit-learn] Topic for thesis work on scikit learn
In-Reply-To: <9641a578-194f-183c-fa2c-22cb45a7c76d@gmail.com>
References: <069bd230-f2e0-e5ef-4fdf-7d0c529c5d5f@gmail.com>
 <9641a578-194f-183c-fa2c-22cb45a7c76d@gmail.com>
Message-ID: <66ab3e6f-4102-3fd0-5828-7af593eab581@gmail.com>

Hi Andreas,

I think I'll get access to a local mentor from my college, so I think I
can rule that issue out. For the technical side, though, I would still
like to rely on feedback from the scikit-learn community, since my aim
wouldn't be to make something for my own use but rather something that is
useful for the scikit-learn community, so that it eventually gets merged
into master.
Do you have some topic in mind that could be useful for addition to scikit-learn? Even if you could direct me to appropriate links I would be happy to look into those. On Wednesday 01 November 2017 01:43 AM, Andreas Mueller wrote: > Hi Gaurav. > > Do you have a local mentor? I think having a mentor that can guide you > during a thesis is very important. > You could get some feedback from the community for a contribution, but > that can be slow, > and is entirely on volunteer basis, so there is no guarantee that > you'll get the necessary feedback in time > to finish your thesis. > > Mentoring a thesis - in particular without knowing you - is a serious > commitment, so I'm not sure someone > from inside the project will want to do this. I saw you already made a > contribution in https://github.com/scikit-learn/scikit-learn/pull/10005 > but that's a very different scope than doing what I expect would be > several month of work. > Though in this regard I've made a few more contributions, here is the link https://github.com/scikit-learn/scikit-learn/pulls/gxyd, though I know none of them is a big contribution. If you think I should work on a big enough PR, can you please suggest me some issue in that regard? Thanks > Best, > Andy > > On 10/31/2017 03:31 PM, Gaurav Dhingra wrote: >> Hi everyone, >> >> I am a final year (5th year) undergraduate Applied Mathematics >> student in India. I am thinking of doing my final year thesis by >> doing some work (coding part) on scikit learn, so I was thinking if >> anyone could tell me if there are available topics (not necessarily >> names of those topics) that I could work on being an undergraduate >> student? I would want to expand upon this in December when my exams >> will be over. But in the mean time would want to take a step in that >> direction by just knowing if there will be available topics that I >> could work on. >> >> It could be the case that available topics are not so easy for an >> undergraduate, still in that case I would like to do some research on >> the topics first. >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gaurav Dhingra (sent from Thunderbird email client) -------------- next part -------------- An HTML attachment was scrubbed... URL: From gauravdhingra.gxyd at gmail.com Sun Dec 31 05:50:31 2017 From: gauravdhingra.gxyd at gmail.com (Gaurav Dhingra) Date: Sun, 31 Dec 2017 16:20:31 +0530 Subject: [scikit-learn] Topic for thesis work on scikit learn In-Reply-To: <66ab3e6f-4102-3fd0-5828-7af593eab581@gmail.com> References: <069bd230-f2e0-e5ef-4fdf-7d0c529c5d5f@gmail.com> <9641a578-194f-183c-fa2c-22cb45a7c76d@gmail.com> <66ab3e6f-4102-3fd0-5828-7af593eab581@gmail.com> Message-ID: Sorry Andreas, I didn't intend to send last mail to you. I've sent a copy of last mail to scikit-learn mailing list. On Sunday 31 December 2017 04:18 PM, Gaurav Dhingra wrote: > > Hi Andreas, > > I think I'll get access to a local mentor from my college, so I think > I rule that issue out, though for technicalities still I would /like/ > to be more dependent on feedback from the scikit-learn community, > since my aim wouldn't be to make something for my own use but rather > something that would be more useful for the scikit-learn community, so > that it eventually gets merged into master. 
> I'm currently looking for a topic that I can take up. I tried looking
> into the scikit-learn wiki, but it doesn't list what I'm looking for
> (no topics are mentioned). Do you have some topic in mind that could
> be a useful addition to scikit-learn? Even if you could direct me to
> appropriate links I would be happy to look into those.
>
>
> On Wednesday 01 November 2017 01:43 AM, Andreas Mueller wrote:
>> Hi Gaurav.
>>
>> Do you have a local mentor? I think having a mentor that can guide
>> you during a thesis is very important.
>> You could get some feedback from the community for a contribution,
>> but that can be slow,
>> and is entirely on volunteer basis, so there is no guarantee that
>> you'll get the necessary feedback in time
>> to finish your thesis.
>>
>> Mentoring a thesis - in particular without knowing you - is a serious
>> commitment, so I'm not sure someone
>> from inside the project will want to do this. I saw you already made
>> a contribution in
>> https://github.com/scikit-learn/scikit-learn/pull/10005
>> but that's a very different scope than doing what I expect would be
>> several months of work.
>>

> Though in this regard I've made a few more contributions; here is the
> link: https://github.com/scikit-learn/scikit-learn/pulls/gxyd, though I
> know none of them is a big contribution. If you think I should work on
> a big enough PR, can you please suggest an issue in that regard?
>
> Thanks
>
>> Best,
>> Andy
>>
>> On 10/31/2017 03:31 PM, Gaurav Dhingra wrote:
>>> Hi everyone,
>>>
>>> I am a final year (5th year) undergraduate Applied Mathematics
>>> student in India. I am thinking of doing my final year thesis by
>>> doing some work (coding part) on scikit learn, so I was thinking if
>>> anyone could tell me if there are available topics (not necessarily
>>> names of those topics) that I could work on being an undergraduate
>>> student? I would want to expand upon this in December when my exams
>>> will be over. But in the mean time would want to take a step in that
>>> direction by just knowing if there will be available topics that I
>>> could work on.
>>>
>>> It could be the case that available topics are not so easy for an
>>> undergraduate, still in that case I would like to do some research
>>> on the topics first.
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gaurav Dhingra
(sent from Thunderbird email client)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: