From aniket.g.meshram at gmail.com Fri Dec 1 16:05:11 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Sat, 2 Dec 2017 02:35:11 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' Message-ID: Hi, I'm following the 'Ways to contribute' page. After forking and cloning, I ran the command 'python setup.py build_ext --inplace', which gives me the following error: cc1: some warnings being treated as errors error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with exit status 1 AnnGM -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Dec 2 06:30:02 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 2 Dec 2017 22:30:02 +1100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: There's not enough information there for us to help you. Please provide the full log if possible. Are you sure you want to build from source? On 2 December 2017 at 08:05, Aniket Meshram wrote: > Hi, > > I'm following the 'Ways to contribute' page. > > After forking and cloning, I ran the command 'python setup.py build_ext > --inplace', > which gives me the following error: > > cc1: some warnings being treated as errors > error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 > -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time > -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat > -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include > -I/usr/lib/python2.7/dist-packages/numpy/core/include > -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o > build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with > exit status 1 > > AnnGM > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Sun Dec 3 10:30:47 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Sun, 3 Dec 2017 21:00:47 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Hi Joel. Please find attached the full log for 'pip install --editable .' *Are you sure you want to build from source?* *[Well, I'm trying the development version, since I'd like to help fix bugs. Wait, so you mean we are experiencing an issue with the source install? Oh man! If yes, can you please suggest some other method (I mean apart from the stable install), or is it OK to go for the stable install?]* Thanks On Sat, Dec 2, 2017 at 5:00 PM, Joel Nothman wrote: > There's not enough information there for us to help you. Please provide > the full log if possible. Are you sure you want to build from source?
> > On 2 December 2017 at 08:05, Aniket Meshram > wrote: > >> hi, >> >> I'm following the 'ways to contribute page' >> >> After forking and cloning, I ran the command 'python setup.py build_ext >> --inplace' >> which is giving me the following error: >> >> cc1: some warnings being treated as errors >> error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 >> -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time >> -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat >> -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include >> -I/usr/lib/python2.7/dist-packages/numpy/core/include >> -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o >> build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with >> exit status 1 >> >> AnnGM >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- sumedh at sumedh-Inspiron-N4010:scikit-learn $ sudo -H pip install --editable . Obtaining file:///home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn Installing collected packages: scikit-learn Running setup.py develop for scikit-learn Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps: Partial import of sklearn during the build process. blas_opt_info: blas_mkl_info: libraries mkl,vml,guide not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE openblas_info: libraries openblas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_3_10_blas_threads_info: Setting PTATLAS=ATLAS libraries tatlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_3_10_blas_info: libraries satlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_blas_threads_info: Setting PTATLAS=ATLAS libraries ptf77blas,ptcblas,atlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE atlas_blas_info: libraries f77blas,cblas,atlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE /usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1640: UserWarning: Atlas (http://math-atlas.sourceforge.net/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [atlas]) or by setting the ATLAS environment variable. warnings.warn(AtlasNotFoundError.__doc__) blas_info: libraries blas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu'] NOT AVAILABLE /usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1649: UserWarning: Blas (http://www.netlib.org/blas/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [blas]) or by setting the BLAS environment variable. 
warnings.warn(BlasNotFoundError.__doc__) blas_src_info: NOT AVAILABLE /usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1652: UserWarning: Blas (http://www.netlib.org/blas/) sources not found. Directories to search for the sources can be specified in the numpy/distutils/site.cfg file (section [blas_src]) or by setting the BLAS_SRC environment variable. warnings.warn(BlasSrcNotFoundError.__doc__) NOT AVAILABLE sklearn/setup.py:72: UserWarning: Blas (http://www.netlib.org/blas/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [blas]) or by setting the BLAS environment variable. warnings.warn(BlasNotFoundError.__doc__) missing cimport in module 'sklearn.neighbors': sklearn/manifold/_barnes_hut_tsne.pyx running develop running build_scripts running egg_info running build_src build_src building library "libsvm-skl" sources building library "cblas" sources building extension "sklearn.__check_build._check_build" sources building extension "sklearn.cluster._dbscan_inner" sources building extension "sklearn.cluster._hierarchical" sources building extension "sklearn.cluster._k_means_elkan" sources building extension "sklearn.cluster._k_means" sources building extension "sklearn.datasets._svmlight_format" sources building extension "sklearn.decomposition._online_lda" sources building extension "sklearn.decomposition.cdnmf_fast" sources building extension "sklearn.ensemble._gradient_boosting" sources building extension "sklearn.feature_extraction._hashing" sources building extension "sklearn.manifold._utils" sources building extension "sklearn.manifold._barnes_hut_tsne" sources building extension "sklearn.metrics.pairwise_fast" sources building extension "sklearn.metrics/cluster.expected_mutual_info_fast" sources building extension "sklearn.neighbors.ball_tree" sources building extension "sklearn.neighbors.kd_tree" sources building extension "sklearn.neighbors.dist_metrics" sources building extension "sklearn.neighbors.typedefs" sources building extension "sklearn.neighbors.quad_tree" sources building extension "sklearn.tree._tree" sources building extension "sklearn.tree._splitter" sources building extension "sklearn.tree._criterion" sources building extension "sklearn.tree._utils" sources building extension "sklearn.svm.libsvm" sources building extension "sklearn.svm.liblinear" sources building extension "sklearn.svm.libsvm_sparse" sources building extension "sklearn._isotonic" sources building extension "sklearn.linear_model.cd_fast" sources building extension "sklearn.linear_model.sgd_fast" sources building extension "sklearn.linear_model.sag_fast" sources building extension "sklearn.utils.sparsefuncs_fast" sources building extension "sklearn.utils.arrayfuncs" sources building extension "sklearn.utils.murmurhash" sources building extension "sklearn.utils.lgamma" sources building extension "sklearn.utils.graph_shortest_path" sources building extension "sklearn.utils.fast_dict" sources building extension "sklearn.utils.seq_dataset" sources building extension "sklearn.utils.weight_vector" sources building extension "sklearn.utils._random" sources building extension "sklearn.utils._logistic_sigmoid" sources building data_files sources build_src: building npy-pkg config files writing requirements to scikit_learn.egg-info/requires.txt writing scikit_learn.egg-info/PKG-INFO writing top-level names to scikit_learn.egg-info/top_level.txt writing dependency_links to scikit_learn.egg-info/dependency_links.txt warning: 
manifest_maker: standard file '-c' not found reading manifest file 'scikit_learn.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'scikit_learn.egg-info/SOURCES.txt' running build_ext customize UnixCCompiler customize UnixCCompiler using build_clib customize UnixCCompiler customize UnixCCompiler using build_ext customize UnixCCompiler customize UnixCCompiler using build_ext building 'sklearn.neighbors.quad_tree' extension compiling C sources C compiler: x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC compile options: '-I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -I/usr/include/python2.7 -c' x86_64-linux-gnu-gcc: sklearn/neighbors/quad_tree.c In file included from /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1777:0, from /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18, from /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4, from sklearn/neighbors/quad_tree.c:259: /usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp] #warning "Using deprecated NumPy API, disable it by " \ ^ sklearn/neighbors/quad_tree.c:2365:1: warning: function declaration isn't a prototype [-Wstrict-prototypes] static PyObject *__pyx_pf_7sklearn_9neighbors_9quad_tree_9_QuadTree_16test_summarize(); /* proto */ ^ sklearn/neighbors/quad_tree.c: In function '__pyx_f_7sklearn_9neighbors_9quad_tree_9_QuadTree_insert_point': sklearn/neighbors/quad_tree.c:3523:14: error: format not a string literal and no format arguments [-Werror=format-security] printf(__pyx_k_QuadTree_found_a_duplicate); ^ sklearn/neighbors/quad_tree.c: In function '__pyx_f_7sklearn_9neighbors_9quad_tree_9_QuadTree__get_cell_ndarray': sklearn/neighbors/quad_tree.c:6584:36: warning: passing argument 1 of '(PyObject * (*)(PyTypeObject *, PyArray_Descr *, int, npy_intp *, npy_intp *, void *, int, PyObject *))*(PyArray_API + 752u)' from incompatible pointer type [-Wincompatible-pointer-types] __pyx_t_2 = PyArray_NewFromDescr(((PyObject *)__pyx_ptype_5numpy_ndarray), ((PyArray_Descr *)__pyx_t_1), 1, __pyx_v_shape, __pyx_v_strides, ((void *)__pyx_v_self->cells), NPY_DEFAULT, Py_None); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 574; __pyx_clineno = __LINE__; goto __pyx_L1_error;} ^ sklearn/neighbors/quad_tree.c:6584:36: note: expected 'PyTypeObject * {aka struct _typeobject *}' but argument is of type 'PyObject * {aka struct _object *}'
sklearn/neighbors/quad_tree.c: At top level: sklearn/neighbors/quad_tree.c:7015:18: warning: function declaration isn't a prototype [-Wstrict-prototypes] static PyObject *__pyx_pf_7sklearn_9neighbors_9quad_tree_9_QuadTree_16test_summarize() { ^ cc1: some warnings being treated as errors error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o" failed with exit status 1 ---------------------------------------- Command "/usr/bin/python -c "import setuptools, tokenize;__file__='/home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps" failed with error code 1 in /home/sumedh/Downloads/Programming/OpenSourceContributions/scikit-learn/ sumedh at sumedh-Inspiron-N4010:scikit-learn $ From pengyu.ut at gmail.com Sun Dec 3 15:54:08 2017 From: pengyu.ut at gmail.com (Peng Yu) Date: Sun, 3 Dec 2017 14:54:08 -0600 Subject: [scikit-learn] a dataset suitable for logistic regression Message-ID: Hi, iris is a three-class dataset. Is there a dataset in sklearn that is suitable for binary classification? Thanks. -- Regards, Peng From se.raschka at gmail.com Sun Dec 3 17:00:36 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 3 Dec 2017 17:00:36 -0500 Subject: [scikit-learn] a dataset suitable for logistic regression In-Reply-To: References: Message-ID: As far as I know, no. But you could simply truncate the iris dataset for binary classification, e.g., from sklearn import datasets iris = datasets.load_iris() X = iris.data[:100] y = iris.target[:100] Best, Sebastian > On Dec 3, 2017, at 3:54 PM, Peng Yu wrote: > > Hi, iris is a three-class dataset. Is there a dataset in sklearn that > is suitable for binary classification? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From francois.dion at gmail.com Mon Dec 4 08:02:09 2017 From: francois.dion at gmail.com (Francois Dion) Date: Mon, 4 Dec 2017 08:02:09 -0500 Subject: [scikit-learn] a dataset suitable for logistic regression In-Reply-To: References: Message-ID: There's at least one that is part of base.py in sklearn.datasets. from sklearn.datasets import load_breast_cancer load_breast_cancer? Signature: load_breast_cancer(return_X_y=False) Docstring: Load and return the breast cancer wisconsin dataset (classification). The breast cancer dataset is a classic and very easy binary classification dataset. ================= ============== Classes 2 Samples per class 212(M),357(B) Samples total 569 Dimensionality 30 Features real, positive ================= ============== It is a very small data set. If you need something much larger, you can easily create large (artificial) sets using make_classification. And you can augment that with faker or elizabeth (pypi modules, not part of scikit-learn) to create realistic looking data sets.
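For instance, a minimal sketch along those lines (the dataset sizes and parameter values below are only illustrative):

from sklearn.datasets import load_breast_cancer, make_classification

# Small, real-world binary dataset
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Larger artificial binary dataset; tune the sizes to your needs
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5,
                           n_classes=2, random_state=0)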
Francois about.me/francois.dion On Sun, Dec 3, 2017 at 3:54 PM, Peng Yu wrote: > Hi, iris is a three-class dataset. Is there a dataset in sklearn that > is suitable for binary classification? Thanks. > > -- > Regards, > Peng > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From iacopo at lighton.io Mon Dec 4 08:09:09 2017 From: iacopo at lighton.io (Iacopo Poli) Date: Mon, 4 Dec 2017 14:09:09 +0100 Subject: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project Message-ID: Hello everyone, I'm working on a project that is implemented following quite strictly the scikit-learn API and I would like to use the scikit-learn Sphinx theme for the docs. I would do that only if I don't infringe any copyright or anything of the sort. What's your policy in this regard? Cheers, Iacopo -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Dec 4 07:37:49 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 4 Dec 2017 13:37:49 +0100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Maybe update your version of Cython? -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.hausamann at tum.de Mon Dec 4 09:21:12 2017 From: peter.hausamann at tum.de (Peter Hausamann) Date: Mon, 04 Dec 2017 14:21:12 +0000 Subject: [scikit-learn] Announcing sklearn-xarray Message-ID: Hi all, I'd like to announce *sklearn-xarray*, a new package that provides a scikit-learn interface for xarray users. For those not familiar with xarray (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays". The package makes it possible to apply sklearn estimators to xarray DataArrays and Datasets while keeping the labels (called coordinates in xarray) intact wherever possible. You can install the package via pip: pip install sklearn-xarray To get started, you can: - read the documentation: https://phausamann.github.io/sklearn-xarray and - check out the repository: https://github.com/phausamann/sklearn-xarray Note that the package is still in a very early development stage and there will probably be some major API changes in upcoming releases. Most notably, I'd like to replicate the complete sklearn module structure at some point by decorating all available estimators with the necessary wrappers. Feedback of any kind is appreciated. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Dec 4 09:47:51 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 4 Dec 2017 15:47:51 +0100 Subject: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project In-Reply-To: References: Message-ID: <20171204144751.GC2024654@phare.normalesup.org> You're not infringing copyright (this is BSD-licensed). The only thing is that we would like you to indicate clearly that the project is not scikit-learn, so that we don't receive support calls. For this, in addition to text pointing it out, you should use a different logo and a different icon in the browser's tab.
Cheers, Gaël On Mon, Dec 04, 2017 at 02:09:09PM +0100, Iacopo Poli wrote: > Hello everyone, > I'm working on a project that is implemented following quite strictly the > scikit-learn API and I would like to use the scikit-learn Sphinx theme for the > docs. > I would do that only if I don't infringe any copyright or anything of the sort. What's > your policy in this regard? > Cheers, > Iacopo > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From iacopo at lighton.io Mon Dec 4 10:08:09 2017 From: iacopo at lighton.io (Iacopo Poli) Date: Mon, 4 Dec 2017 16:08:09 +0100 Subject: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project In-Reply-To: <20171204144751.GC2024654@phare.normalesup.org> References: <20171204144751.GC2024654@phare.normalesup.org> Message-ID: Cool! Of course I will change the logo and icon :-) Thank you very much, Iacopo 2017-12-04 15:47 GMT+01:00 Gael Varoquaux : > You're not infringing copyright (this is BSD-licensed). The only thing is > that we would like you to indicate clearly that the project is not > scikit-learn, so that we don't receive support calls. For this, in > addition to text pointing it out, you should use a different logo and a > different icon in the browser's tab. > > Cheers, > > Gaël > > On Mon, Dec 04, 2017 at 02:09:09PM +0100, Iacopo Poli wrote: > > Hello everyone, > > > I'm working on a project that is implemented following quite strictly the > > scikit-learn API and I would like to use the scikit-learn Sphinx theme > for the > > docs. > > > I would do that only if I don't infringe any copyright or anything of the sort. > What's > > your policy in this regard? > > > Cheers, > > Iacopo > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Mon Dec 4 09:20:20 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Mon, 4 Dec 2017 19:50:20 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: I updated all the packages before running install. On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel wrote: > Maybe update your version of Cython? > > -- > Olivier > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Dec 4 10:03:12 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 4 Dec 2017 16:03:12 +0100 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: Interesting project!
BTW, do you know about dask-ml [1]? It might be interesting to think about generalizing the input validation of fit and predict / transform as a private method of the BaseEstimator class instead of directly calling into sklearn.utils.validation functions, so as to make it easier for third-party projects such as sklearn-xarray and dask-ml to subclass and override those methods to allow for specific input data structures without converting everything to a numpy array. [1] https://github.com/dask/dask-ml 2017-12-04 15:21 GMT+01:00 Peter Hausamann : > Hi all, > > I'd like to announce *sklearn-xarray*, a new package that provides a > scikit-learn interface for xarray users. For those not familiar with xarray > (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible > toolkit for analytics on multi-dimensional arrays". > > The package makes it possible to apply sklearn estimators to xarray > DataArrays and Datasets while keeping the labels (called coordinates in > xarray) intact wherever possible. > > You can install the package via pip: > > pip install sklearn-xarray > > To get started, you can: > > - read the documentation: https://phausamann.github.io/sklearn-xarray > and > - check out the repository: https://github.com/phausamann/sklearn-xarray > > Note that the package is still in a very early development stage and there > will probably be some major API changes in upcoming releases. Most notably, > I'd like to replicate the complete sklearn module structure at some point > by decorating all available estimators with the necessary wrappers. > > Feedback of any kind is appreciated. > > Peter > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Mon Dec 4 11:00:37 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Mon, 4 Dec 2017 10:00:37 -0600 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: I haven't looked at the implementation of `sklearn_xarray.dataarray.wrap` yet, but a simple test on `dask_ml.preprocessing.StandardScaler` failed with the (probably expected) `TypeError: 'int' object is not iterable` when dask-ml attempts an `X.mean(0)`. I'd be interested to hear what changes dask-ml would need to make to get things working on dask-backed xarray datasets, without reading everything into memory at once. The code: import sklearn_xarray.dataarray as da from sklearn_xarray.data import load_dummy_dataarray from dask_ml.preprocessing import StandardScaler X = load_dummy_dataarray() Xt = da.wrap(StandardScaler()).fit_transform(X) Tom On Mon, Dec 4, 2017 at 9:03 AM, Olivier Grisel wrote: > Interesting project! > > BTW, do you know about dask-ml [1]? > > It might be interesting to think about generalizing the input validation > of fit and predict / transform as a private method of the BaseEstimator > class instead of directly calling into sklearn.utils.validation functions, > so as to make it easier for third-party projects such as sklearn-xarray > and dask-ml to subclass and override those methods to allow for specific > input data structures without converting everything to a numpy array.
> > [1] https://github.com/dask/dask-ml > > > > 2017-12-04 15:21 GMT+01:00 Peter Hausamann : > >> Hi all, >> >> I'd like to announce *sklearn-xarray*, a new package that provides a >> scikit-learn interface for xarray users. For those not familiar with xarray >> (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible >> toolkit for analytics on multi-dimensional arrays". >> >> The package makes it possible to apply sklearn estimators to xarray >> DataArrays and Datasets while keeping the labels (called coordinates in >> xarray) intact whereever possible. >> >> You can install the package via pip: >> >> pip install sklearn-xarray >> >> To get started, you can: >> >> - read the documentation: https://phausamann.github.io/sklearn-xarray >> and >> - check out the repository: https://github.com >> /phausamann/sklearn-xarray >> >> Note that the package is still in a very early development stage and >> there will probably be some major API changes in upcoming releases. Most >> notably, I'd like to replicate the complete sklearn module structure at >> some point by decorating all available estimators with the necessary >> wrappers. >> >> Feedback of any kind is appreciated. >> >> Peter >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Olivier > http://twitter.com/ogrisel - http://github.com/ogrisel > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.hausamann at tum.de Mon Dec 4 11:25:16 2017 From: peter.hausamann at tum.de (Peter Hausamann) Date: Mon, 04 Dec 2017 16:25:16 +0000 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: Thanks everyone for your feedback. The reason you're getting the error is because the first argument of DataArray.mean() is the named dimension 'dim' and not 'axis'. So calling X.mean(axis=0) would probably solve the problem... but it might be easier (and more robust) to fix this on my end by always converting the data to a numpy array before passing it to the wrapped estimator. Regarding the question on how to avoid data being loaded into memory: I'm honestly not familiar enough with this subject to give you an answer just yet, but supporting too-big-for-memory datasets is definitely a feature that would be very important to me. Cheers Peter Tom Augspurger schrieb am Mo., 4. Dez. 2017 um 17:00 Uhr: > I haven't looked at the implementation of `sklearn_xarray.dataarray.wrap` > yet, but a simple test > on `dask_ml.preprocessing.StandardScaler` failed with the (probably > expected) `TypeError: 'int' object is not iterable` > when dask-ml attempts an `X.mean(0)`. > > I'd be interested to hear what changes dask-ml would need to make to get > things working on dask-back xarray datasets, > without reading everything into memory at once. > > The code: > > > import sklearn_xarray.dataarray as da > from sklearn_xarray.data import load_dummy_dataarray > from dask_ml.preprocessing import StandardScaler > > X = load_dummy_dataarray() > Xt = da.wrap(StandardScaler()).fit_transform(X) > > > Tom > > On Mon, Dec 4, 2017 at 9:03 AM, Olivier Grisel > wrote: > >> Interesting project! >> >> BTW, do you know about dask-ml [1]? 
>> >> It might be interesting to think about generalizing the input validation >> of fit and predict / transform as a private method of the BaseEstimator >> class instead of directly calling into sklearn.utils.validation functions >> so has to make it easier for third party projects such as sklearn-xarray >> and dask-ml to subclass and override those methods to allow for specific >> input data-structure without converting everyting to a numpy array. >> >> [1] https://github.com/dask/dask-ml >> >> >> >> 2017-12-04 15:21 GMT+01:00 Peter Hausamann : >> >>> Hi all, >>> >>> I'd like to announce *sklearn-xarray*, a new package that provides a >>> scikit-learn interface for xarray users. For those not familiar with xarray >>> (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible >>> toolkit for analytics on multi-dimensional arrays". >>> >>> The package makes it possible to apply sklearn estimators to xarray >>> DataArrays and Datasets while keeping the labels (called coordinates in >>> xarray) intact whereever possible. >>> >>> You can install the package via pip: >>> >>> pip install sklearn-xarray >>> >>> To get started, you can: >>> >>> - read the documentation: https://phausamann.github.io/sklearn-xarray >>> and >>> - check out the repository: >>> https://github.com/phausamann/sklearn-xarray >>> >>> Note that the package is still in a very early development stage and >>> there will probably be some major API changes in upcoming releases. Most >>> notably, I'd like to replicate the complete sklearn module structure at >>> some point by decorating all available estimators with the necessary >>> wrappers. >>> >>> Feedback of any kind is appreciated. >>> >>> Peter >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Olivier >> http://twitter.com/ogrisel - http://github.com/ogrisel >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.hausamann at tum.de Mon Dec 4 11:36:32 2017 From: peter.hausamann at tum.de (Peter Hausamann) Date: Mon, 04 Dec 2017 16:36:32 +0000 Subject: [scikit-learn] Announcing sklearn-xarray In-Reply-To: References: Message-ID: PS: obviously forcing conversion to numpy is not what we would want, rather passing the underlying array of the DataArray. Peter Hausamann schrieb am Mo., 4. Dez. 2017 um 17:25 Uhr: > Thanks everyone for your feedback. > > The reason you're getting the error is because the first argument of > DataArray.mean() is the named dimension 'dim' and not 'axis'. So calling > X.mean(axis=0) would probably solve the problem... but it might be easier > (and more robust) to fix this on my end by always converting the data to a > numpy array before passing it to the wrapped estimator. > > Regarding the question on how to avoid data being loaded into memory: I'm > honestly not familiar enough with this subject to give you an answer just > yet, but supporting too-big-for-memory datasets is definitely a feature > that would be very important to me. > > Cheers > > Peter > > > Tom Augspurger schrieb am Mo., 4. Dez. 
2017 > um 17:00 Uhr: > >> I haven't looked at the implementation of `sklearn_xarray.dataarray.wrap` >> yet, but a simple test >> on `dask_ml.preprocessing.StandardScaler` failed with the (probably >> expected) `TypeError: 'int' object is not iterable` >> when dask-ml attempts an `X.mean(0)`. >> >> I'd be interested to hear what changes dask-ml would need to make to get >> things working on dask-back xarray datasets, >> without reading everything into memory at once. >> >> The code: >> >> >> import sklearn_xarray.dataarray as da >> from sklearn_xarray.data import load_dummy_dataarray >> from dask_ml.preprocessing import StandardScaler >> >> X = load_dummy_dataarray() >> Xt = da.wrap(StandardScaler()).fit_transform(X) >> >> >> Tom >> >> On Mon, Dec 4, 2017 at 9:03 AM, Olivier Grisel >> wrote: >> >>> Interesting project! >>> >>> BTW, do you know about dask-ml [1]? >>> >>> It might be interesting to think about generalizing the input validation >>> of fit and predict / transform as a private method of the BaseEstimator >>> class instead of directly calling into sklearn.utils.validation functions >>> so has to make it easier for third party projects such as sklearn-xarray >>> and dask-ml to subclass and override those methods to allow for specific >>> input data-structure without converting everyting to a numpy array. >>> >>> [1] https://github.com/dask/dask-ml >>> >>> >>> >>> 2017-12-04 15:21 GMT+01:00 Peter Hausamann : >>> >>>> Hi all, >>>> >>>> I'd like to announce *sklearn-xarray*, a new package that provides a >>>> scikit-learn interface for xarray users. For those not familiar with xarray >>>> (http://xarray.pydata.org), it is a "pandas-like and pandas-compatible >>>> toolkit for analytics on multi-dimensional arrays". >>>> >>>> The package makes it possible to apply sklearn estimators to xarray >>>> DataArrays and Datasets while keeping the labels (called coordinates in >>>> xarray) intact whereever possible. >>>> >>>> You can install the package via pip: >>>> >>>> pip install sklearn-xarray >>>> >>>> To get started, you can: >>>> >>>> - read the documentation: >>>> https://phausamann.github.io/sklearn-xarray and >>>> - check out the repository: >>>> https://github.com/phausamann/sklearn-xarray >>>> >>>> Note that the package is still in a very early development stage and >>>> there will probably be some major API changes in upcoming releases. Most >>>> notably, I'd like to replicate the complete sklearn module structure at >>>> some point by decorating all available estimators with the necessary >>>> wrappers. >>>> >>>> Feedback of any kind is appreciated. >>>> >>>> Peter >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Olivier >>> http://twitter.com/ogrisel - http://github.com/ogrisel >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... 
From albertthomas88 at gmail.com Mon Dec 4 12:16:29 2017 From: albertthomas88 at gmail.com (Albert Thomas) Date: Mon, 04 Dec 2017 17:16:29 +0000 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Maybe run 'make clean' before running pip install ... Albert On Mon 4 Dec 2017 at 16:11, Aniket Meshram wrote: > I updated all the packages before running install. > > On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel > wrote: > >> Maybe update your version of Cython? >> >> -- >> Olivier >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Regards, > > Aniket G. Meshram > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From l.lomasto at innovationengineering.eu Mon Dec 4 13:30:55 2017 From: l.lomasto at innovationengineering.eu (Luigi Lomasto) Date: Mon, 4 Dec 2017 19:30:55 +0100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: You can try to use Python 3 with pip3. Sent from my iPhone > On 4 Dec 2017, at 18:16, Albert Thomas wrote: > > Maybe run 'make clean' before running pip install ... > > Albert >> On Mon 4 Dec 2017 at 16:11, Aniket Meshram wrote: >> I updated all the packages before running install. >> >>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel wrote: >>> Maybe update your version of Cython? >>> >>> -- >>> Olivier >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Dec 4 14:15:52 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 4 Dec 2017 14:15:52 -0500 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Please stay on the mailing list. That's not the current version. Please try updating as Olivier suggested. On 12/04/2017 01:52 PM, Aniket Meshram wrote: > $ cython --version > Cython version 0.23.4 > > On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller > wrote: > > What version of Cython are you using? > > > > On 12/04/2017 09:20 AM, Aniket Meshram wrote: >> I updated all the packages before running install. >> >> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel >> > wrote: >> >> Maybe update your version of Cython? >> >> -- >> Olivier >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> -- >> Regards, >> >> Aniket G.
Meshram >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > > -- > Regards, > > Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Tue Dec 5 02:28:05 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Tue, 5 Dec 2017 12:58:05 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the latest. https://packages.ubuntu.com/search?keywords=cython But yes, you are right, I checked on official Cython and I'll install the latest using PyPI. Thought Ubuntu gives the latest, but that isn't true anymore. Thanks Andreas. I'll let you guys know, once I update and rerun pip install ... Thanks On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller wrote: > Please stay on the mailing list. > That's not the current version. Please try updating as Olivier suggested. > > > On 12/04/2017 01:52 PM, Aniket Meshram wrote: > > $ cython --version > Cython version 0.23.4 > > On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller wrote: > >> What version of Cython are you using? >> >> >> >> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >> >> I updated all the packages before running install. >> >> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel >> wrote: >> >>> Maybe update your version of Cython? >>> >>> -- >>> Olivier >>> ? >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> > > > -- > Regards, > > Aniket G. Meshram > > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Tue Dec 5 12:01:46 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Tue, 5 Dec 2017 22:31:46 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Yeah. That did it. After updating Cython to latest 0.27.3, the issue is resolved now. Thanks all. I guess this should also be updated on the site / github as well. What'd you say? Best, Aniket On Tue, Dec 5, 2017 at 12:58 PM, Aniket Meshram wrote: > I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the > latest. > > > https://packages.ubuntu.com/search?keywords=cython > > > But yes, you are right, I checked on official Cython and I'll install the > latest using PyPI. Thought Ubuntu gives the latest, but that isn't true > anymore. > Thanks Andreas. > > I'll let you guys know, once I update and rerun pip install ... > Thanks > > On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller wrote: > >> Please stay on the mailing list. >> That's not the current version. Please try updating as Olivier suggested. >> >> >> On 12/04/2017 01:52 PM, Aniket Meshram wrote: >> >> $ cython --version >> Cython version 0.23.4 >> >> On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller >> wrote: >> >>> What version of Cython are you using? 
>>> >>> >>> >>> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >>> >>> I updated all the packages before running install. >>> >>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel >> > wrote: >>> >>>> Maybe update your version of Cython? >>>> >>>> -- >>>> Olivier >>>> ? >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Regards, >>> >>> Aniket G. Meshram >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> >> >> > > > -- > Regards, > > Aniket G. Meshram > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Dec 5 19:12:49 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 6 Dec 2017 11:12:49 +1100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: A PR is welcome if you can improve documentation. Thanks On 6 December 2017 at 04:01, Aniket Meshram wrote: > Yeah. That did it. After updating Cython to latest 0.27.3, the issue is > resolved now. > Thanks all. I guess this should also be updated on the site / github as > well. What'd you say? > > Best, > Aniket > > On Tue, Dec 5, 2017 at 12:58 PM, Aniket Meshram < > aniket.g.meshram at gmail.com> wrote: > >> I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the >> latest. >> >> >> https://packages.ubuntu.com/search?keywords=cython >> >> >> But yes, you are right, I checked on official Cython and I'll install the >> latest using PyPI. Thought Ubuntu gives the latest, but that isn't true >> anymore. >> Thanks Andreas. >> >> I'll let you guys know, once I update and rerun pip install ... >> Thanks >> >> On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller >> wrote: >> >>> Please stay on the mailing list. >>> That's not the current version. Please try updating as Olivier suggested. >>> >>> >>> On 12/04/2017 01:52 PM, Aniket Meshram wrote: >>> >>> $ cython --version >>> Cython version 0.23.4 >>> >>> On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller >>> wrote: >>> >>>> What version of Cython are you using? >>>> >>>> >>>> >>>> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >>>> >>>> I updated all the packages before running install. >>>> >>>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel < >>>> olivier.grisel at ensta.org> wrote: >>>> >>>>> Maybe update your version of Cython? >>>>> >>>>> -- >>>>> Olivier >>>>> ? >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> >>>> Aniket G. Meshram >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> >>> >>> >>> -- >>> Regards, >>> >>> Aniket G. Meshram >>> >>> >>> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> > > > > -- > Regards, > > Aniket G. 
Meshram > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fabio.sigrist at hslu.ch Wed Dec 6 07:15:28 2017 From: fabio.sigrist at hslu.ch (Fabio Sigrist) Date: Wed, 6 Dec 2017 13:15:28 +0100 Subject: [scikit-learn] Add Grabit model to gradient boosting Message-ID: Dear all, I added the Tobit loss function to gradient boosting, see https://github.com/scikit-learn/scikit-learn/pull/9961. Recently, I also added a reference to a preprint of an article with documentation on the methodology (https://arxiv.org/abs/1711.08695). What are the next steps in order to decide whether this feature will be added to sklearn? Thanks a lot in advance. Best regards, Fabio Sigrist *Lucerne University of Applied Sciences and Arts* Institute of Financial Services Zug IFZ Grafenauweg 10, CH-6300 Zug *Fabio Sigrist, PhD *Lecturer T +41 41 757 67 61 fabio.sigrist at hslu.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From aniket.g.meshram at gmail.com Wed Dec 6 13:30:54 2017 From: aniket.g.meshram at gmail.com (Aniket Meshram) Date: Thu, 7 Dec 2017 00:00:54 +0530 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: Alright, I'll make a pull request. But let me tell you guys, I'm totally new to GitHub; this is my first contribution, and until a few days back I didn't even know what a pull request was. What I mean is, even though I make a request, it'll take time for me to understand this whole process of changing something and having it reflected in the master branch. Meanwhile, I'm doing my homework on this; any suggestions would really be appreciated. Thanks, Aniket On Wed, Dec 6, 2017 at 5:42 AM, Joel Nothman wrote: > A PR is welcome if you can improve documentation. Thanks > > On 6 December 2017 at 04:01, Aniket Meshram > wrote: > >> Yeah. That did it. After updating Cython to latest 0.27.3, the issue is >> resolved now. >> Thanks all. I guess this should also be updated on the site / github as >> well. What'd you say? >> >> Best, >> Aniket >> >> On Tue, Dec 5, 2017 at 12:58 PM, Aniket Meshram < >> aniket.g.meshram at gmail.com> wrote: >> >>> I'm using Ubuntu 16.04 LTS Xenial, which shows 0.23.4-0ubuntu5 as the >>> latest. >>> >>> >>> https://packages.ubuntu.com/search?keywords=cython >>> >>> >>> But yes, you are right, I checked on official Cython and I'll install >>> the latest using PyPI. Thought Ubuntu gives the latest, but that isn't true >>> anymore. >>> Thanks Andreas. >>> >>> I'll let you guys know, once I update and rerun pip install ... >>> Thanks >>> >>> On Tue, Dec 5, 2017 at 12:45 AM, Andreas Mueller >>> wrote: >>> >>>> Please stay on the mailing list. >>>> That's not the current version. Please try updating as Olivier >>>> suggested. >>>> >>>> >>>> On 12/04/2017 01:52 PM, Aniket Meshram wrote: >>>> >>>> $ cython --version >>>> Cython version 0.23.4 >>>> >>>> On Mon, Dec 4, 2017 at 10:28 PM, Andreas Mueller >>>> wrote: >>>> >>>>> What version of Cython are you using? >>>>> >>>>> >>>>> >>>>> On 12/04/2017 09:20 AM, Aniket Meshram wrote: >>>>> >>>>> I updated all the packages before running install. >>>>> >>>>> On Mon, Dec 4, 2017 at 6:07 PM, Olivier Grisel < >>>>> olivier.grisel at ensta.org> wrote: >>>>> >>>>>> Maybe update your version of Cython? >>>>>> >>>>>> -- >>>>>> Olivier
>>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Regards, >>>>> >>>>> Aniket G. Meshram >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> >>>> Aniket G. Meshram >>>> >>>> >>> >>> >>> -- >>> Regards, >>> >>> Aniket G. Meshram >>> >> >> >> >> -- >> Regards, >> >> Aniket G. Meshram >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards, Aniket G. Meshram -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Dec 6 15:02:11 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 7 Dec 2017 07:02:11 +1100 Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace' In-Reply-To: References: Message-ID: We're biased, but we reckon the skills to make a PR are (a) not insurmountable with a bit of homework; and (b) very worthwhile to have. So try to pick it up by yourself, but give us a shout if you're struggling. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Wed Dec 6 18:49:42 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 7 Dec 2017 00:49:42 +0100 Subject: [scikit-learn] MLPClassifier as a feature selector Message-ID: Greetings, I want to train an MLPClassifier with one hidden layer and use it as a feature selector for an MLPRegressor. Is it possible to get the values of the neurons from the last hidden layer of the MLPClassifier to pass them as input to the MLPRegressor? If it is not possible with scikit-learn, is anyone aware of any scikit-compatible NN library that offers this functionality? For example this one: http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html I wouldn't like to do this in TensorFlow because the MLP there is much slower than scikit-learn's implementation. Thomas -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Wed Dec 6 19:56:14 2017 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 7 Dec 2017 09:56:14 +0900 Subject: [scikit-learn] MLPClassifier as a feature selector In-Reply-To: References: Message-ID: I am also very interested in knowing if there is a sklearn cookbook solution for getting the weights of a one-hidden-layer MLPClassifier. J.B. 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > Greetings, > > I want to train an MLPClassifier with one hidden layer and use it as a > feature selector for an MLPRegressor.
> Is it possible to get the values of the neurons from the last hidden layer > of the MLPClassifier to pass them as input to the MLPRegressor? > > If it is not possible with scikit-learn, is anyone aware of any > scikit-compatible NN library that offers this functionality? For example > this one: > > http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html > > I wouldn't like to do this in TensorFlow because the MLP there is much > slower than scikit-learn's implementation. > > > Thomas > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manojkumarsivaraj334 at gmail.com Wed Dec 6 22:25:41 2017 From: manojkumarsivaraj334 at gmail.com (Manoj Kumar) Date: Wed, 6 Dec 2017 19:25:41 -0800 Subject: [scikit-learn] MLPClassifier as a feature selector In-Reply-To: References: Message-ID: Hi, The weights and intercepts are available in the coefs_ and intercepts_ attributes, respectively. See https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835 On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn < scikit-learn at python.org> wrote: > I am also very interested in knowing if there is a sklearn cookbook > solution for getting the weights of a one-hidden-layer MLPClassifier. > J.B. > > 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > >> Greetings, >> >> I want to train an MLPClassifier with one hidden layer and use it as a >> feature selector for an MLPRegressor. >> Is it possible to get the values of the neurons from the last hidden >> layer of the MLPClassifier to pass them as input to the MLPRegressor? >> >> If it is not possible with scikit-learn, is anyone aware of any >> scikit-compatible NN library that offers this functionality? For example >> this one: >> >> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html >> >> I wouldn't like to do this in TensorFlow because the MLP there is much >> slower than scikit-learn's implementation. >> >> >> Thomas >> >> >> -- >> >> ====================================================================== >> >> Dr Thomas Evangelidis >> >> Post-doctoral Researcher >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/2S049, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Manoj, http://github.com/MechCoder
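Putting Manoj's pointer together with the original question, here is a minimal sketch of recomputing the hidden-layer activations from coefs_ and intercepts_ and feeding them to an MLPRegressor. It assumes a single hidden layer and the default activation='relu'; the data is synthetic and purely illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Toy data; substitute your own features, class labels and regression targets.
X, y_class = make_classification(n_samples=200, n_features=10, random_state=0)
y_reg = X[:, 0] + 0.1 * np.random.RandomState(0).randn(200)

clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
clf.fit(X, y_class)

# Hidden-layer output: relu(X W + b) with the fitted weights and biases.
# Swap in np.tanh etc. if you train with a different activation.
hidden = np.maximum(0, np.dot(X, clf.coefs_[0]) + clf.intercepts_[0])

reg = MLPRegressor(hidden_layer_sizes=(20,), max_iter=1000, random_state=0)
reg.fit(hidden, y_reg)

For deeper networks, the same recurrence can be applied layer by layer over the coefs_ and intercepts_ lists.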
From aniket.g.meshram at gmail.com  Sat Dec  9 04:33:53 2017
From: aniket.g.meshram at gmail.com (Aniket Meshram)
Date: Sat, 9 Dec 2017 15:03:53 +0530
Subject: [scikit-learn] Error while running 'python setup.py build_ext --inplace'

hi All,

I've created a pull request for updating the README.rst.
https://github.com/scikit-learn/scikit-learn/pull/10276

Thanks,
Aniket

-- 
Regards,

Aniket G. Meshram
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dmitrii.ignatov at gmail.com  Sun Dec 10 06:33:59 2017
From: dmitrii.ignatov at gmail.com (Dmitry Ignatov)
Date: Sun, 10 Dec 2017 14:33:59 +0300
Subject: [scikit-learn] Grid search for multi-label task

Hi All,

I've tried GridSearchCV with RandomForestClassifier()

clf = GridSearchCV(RandomForestClassifier(), tuned_parameters, cv=5,
                   scoring='accuracy')

for a multi-label problem where the output is a list of lists of 20 zeros
or ones:

[[1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0],
 [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
 [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0],
 [0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1], ...

Even though it worked correctly in the usual way with

clf = RandomForestClassifier(max_depth=7, random_state=0)
clf.fit(Xtr, y)

with GridSearchCV I get the error below:

83 # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of
multiclass-multioutput and multilabel-indicator targets

Is it possible to perform GridSearchCV in scikit for the multilabel
setting (with an appropriate metric like averaged zero-one loss)? Any
hints?

Thank you and best regards,
Dmitry
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com  Sun Dec 10 15:09:19 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 11 Dec 2017 07:09:19 +1100
Subject: [scikit-learn] Grid search for multi-label task

For legacy reasons, multilabel targets need to be passed as an array (or a
sparse matrix if supported by the classifier). Lists of lists are not
supported, but may be in the near future.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dmitrii.ignatov at gmail.com  Sun Dec 10 15:46:38 2017
From: dmitrii.ignatov at gmail.com (Dmitry Ignatov)
Date: Sun, 10 Dec 2017 23:46:38 +0300
Subject: [scikit-learn] Grid search for multi-label task

Joel, thank you. It helps. One step forward.

-Dmitry

2017-12-10 23:09 GMT+03:00 Joel Nothman:
> For legacy reasons, multilabel targets need to be passed as an array (or a
> sparse matrix if supported by the classifier). Lists of lists are not
> supported, but may be in the near future.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
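A minimal sketch of the fix Joel describes, assuming X and the list of
20-element 0/1 lists (y_lists) from Dmitry's post are already defined; the
parameter grid is an illustrative assumption. Converting the lists to an
array makes the target a multilabel-indicator matrix, and 'accuracy' then
scores subset accuracy.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

Y = np.asarray(y_lists)  # shape (n_samples, 20): an indicator array, not a list of lists
tuned_parameters = {'max_depth': [5, 7, 9]}  # illustrative grid
clf = GridSearchCV(RandomForestClassifier(random_state=0),
                   tuned_parameters, cv=5, scoring='accuracy')
clf.fit(X, Y)
print(clf.best_params_)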
From t3kcit at gmail.com  Wed Dec 13 11:40:03 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 13 Dec 2017 11:40:03 -0500
Subject: [scikit-learn] SciPy 2018 tutorial
Message-ID: <3c3f6d7c-c845-5866-609a-42e21d62fdf5@gmail.com>

Hey folks.
Who is coming to SciPy 2018? They just sent out the CfP.
Does anyone want to co-teach a tutorial?
(If there are two other people that want to teach it, I'm also happy to
step back this year ;)

Andy

From joel.nothman at gmail.com  Wed Dec 13 19:38:42 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 14 Dec 2017 11:38:42 +1100
Subject: [scikit-learn] FYI: StratifiedKFold(..., shuffle=True) differs in 0.19

It has come to our attention in #10274 that we accidentally changed
shuffled StratifiedKFold behaviour in the 0.19.0 release from what had
come before. That is, for the same random state, you will get a different
cross-validation data partition. This change (merged in #7823) was not
documented in the 0.19 release notes. We will update the online docs to
mention it.

The change provided negligible benefit for users. The change shouldn't
have happened, but we likely won't revert it unless the community has a
strongly divergent opinion.

Cheers,

Joel and Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From iacopo at lighton.io  Fri Dec 15 10:32:58 2017
From: iacopo at lighton.io (Iacopo Poli)
Date: Fri, 15 Dec 2017 16:32:58 +0100
Subject: [scikit-learn] License for a package built on top of scikit-learn

Hello,

we've built a Python package for some ML-related application.
Scikit-learn is a requirement, and we have classes inheriting from sklearn
objects.

We have to decide the license, and we are choosing between Apache 2.0 and
BSD-3.

We would go with Apache 2.0, but we were wondering if we have to release
it under the same license as sklearn. It doesn't seem so from reading the
text of BSD-3, but asking before never hurts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com  Fri Dec 15 11:53:07 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 15 Dec 2017 11:53:07 -0500
Subject: [scikit-learn] License for a package built on top of scikit-learn
Message-ID: <6436ea2e-67d9-3971-7b5a-9aaca95ab75c@gmail.com>

Hi Iacopo.
Yes, you can do either (in my understanding).
If you just import sklearn, there's really nothing you need to worry about.
If you distribute sklearn, you should also distribute the BSD license file
with it and make it clear that that's the license that applies to that
part of the code.

Cheers,
Andy

On 12/15/2017 10:32 AM, Iacopo Poli wrote:
> We would go with Apache 2.0, but we were wondering if we have to
> release it under the same license as sklearn. It doesn't seem so from
> reading the text of BSD-3, but asking before never hurts.
> Thanks in advance,
> Iacopo Poli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Sat Dec 16 06:48:57 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Sat, 16 Dec 2017 12:48:57 +0100
Subject: [scikit-learn] SciPy 2018 tutorial
Message-ID: <20171216114857.5128271.71775.45130@gmail.com>

Hey Andy,

I'd be interested in coming to SciPy and co-teaching a tutorial.

Guillaume Lemaitre
INRIA Saclay Ile-de-France / Equipe PARIETAL
guillaume.lemaitre at inria.fr - https://glemaitre.github.io/

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From tevang3 at gmail.com  Mon Dec 18 09:19:13 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 18 Dec 2017 15:19:13 +0100
Subject: [scikit-learn] data augmentation following the underlying feature values distributions and correlations

Greetings,

I want to augment my training set but preserve at the same time the
correlations between feature values. More specifically, my features are
NMR resonances of the nuclei of a single amino acid. For example, for
Glutamic acid I have for each observation the following feature values:

[CA, HA, CB, HB, CG, HG]

where CA is the resonance of the alpha carbon, HA the resonance of the
alpha proton, and so forth. The complication here is that these feature
values are not independent. HA is covalently bonded to CA, CB to CA, and
so on. Therefore, if I sample a random CA value from the distribution of
experimental values of CA, I cannot pick ANY HA VALUE from the respective
experimental distribution, simply because CA and HA are correlated. The
same applies to CA and CB, CB and HB, CB and CG, and CG and HG. Is there
any algorithm that can generate [CA, HA, CB, HB, CG, HG] feature vectors
that comply with the atom distributions and their correlations?

I saw that Gaussian Mixture Models have a function to generate random
samples from the fitted Gaussian distribution
(sklearn.mixture.GaussianMixture.sample), but it is not clear if these
samples will retain the correlations between the features (nuclei in this
case). If there is no such algorithm in scikit-learn, could you please
point me to any other Python library which does that?

Thanks in advance.

Thomas

-- 
======================================================================
Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049, 62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
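On Thomas's question: with covariance_type='full', each fitted component
carries a full 6x6 covariance matrix, so GaussianMixture.sample does draw
vectors that reproduce the linear correlations between the nuclei within
each component. A minimal sketch, assuming X is the (n_observations, 6)
array of [CA, HA, CB, HB, CG, HG] values; n_components=3 is an arbitrary
assumption to be tuned, e.g. by BIC.

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

# sample() returns the generated vectors and the component each was drawn from
X_new, components = gmm.sample(n_samples=500)
print(gmm.bic(X))  # compare across n_components to pick the mixture size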
From l.lomasto at innovationengineering.eu  Tue Dec 19 03:36:42 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Tue, 19 Dec 2017 09:36:42 +0100
Subject: [scikit-learn] Feature selection with words.

Hi all.

I'm working on text classification of Wikipedia documents. I'm using a
word-count approach to extract features from my text, so after
lemmatization and stop-word removal I obtain a big vocabulary containing
all the words of the training documents. I now have 70000 features. I
think that for this kind of (word-based) problem it is not good to do
feature selection (with SVD or PCA). The current accuracy is 77%.

Do you think I need to do feature selection to improve the accuracy?

Thank you for your answers. Regards,

Luigi

From joel.nothman at gmail.com  Tue Dec 19 04:54:10 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 19 Dec 2017 20:54:10 +1100
Subject: [scikit-learn] Feature selection with words.

It depends what the set of classes is. The best way to find out is to try
it...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com  Tue Dec 19 07:44:54 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 19 Dec 2017 13:44:54 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Dear all,

Kudos to scikit-learn! Having said that, Pipeline is killing me by not
being able to transform anything other than X.

My current use case would need:
- Transformers being able to handle both X and y, e.g. clustering X and y
  concatenated;
- Pipeline being able to change other params, e.g. sample_weight.

Currently, I'm augmenting X through every step with the extra information,
which seems to work OK for my_pipe.fit_transform(X_train, y_train) but
breaks on my_pipe.transform(X_test) for the lack of the y parameter. OK, I
can inherit and modify a descendant of the Pipeline class to allow the y
parameter, which is not ideal but I guess it is an option. The gritty part
comes when having to adapt every regressor at the end of the ladder in
order to split the extra information from the raw data in X, and not being
able to generate more than one subproduct from each preprocessing step.

My current research involves clustering the data and using that
classification along with X in order to predict outliers, which generates
sample_weight info, and I would love to use that on the final regressor.
Currently there seems to be no option other than pasting that info on X.

All in all, I'm stuck with this API limitation and I would love to learn
some tricks from you if you could enlighten me.

Thanks in advance!
Manuel Castejón Limas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ichkoar at gmail.com  Tue Dec 19 08:15:12 2017
From: ichkoar at gmail.com (Christos Aridas)
Date: Tue, 19 Dec 2017 15:15:12 +0200
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Hey Manuel,

In imbalanced-learn we have an extra type of estimator, named samplers,
which are able to modify X and y at the same time, with the use of new API
methods, sample and fit_sample. Also, we have adopted a modified version
of scikit-learn's Pipeline class where we allow subsequent transformations
using samplers and transformers. Despite the fact that the package deals
with imbalanced datasets, the aforementioned objects may help your
pipeline.

Cheerz,
Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Tue Dec 19 08:18:32 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 19 Dec 2017 14:18:32 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

I think that you could use imbalanced-learn regarding the issue that you
have with the y. You should be able to wrap your clustering inside the
FunctionSampler
(https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we
are on the way to merge it).

-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From manuel.castejon at gmail.com  Tue Dec 19 08:33:42 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 19 Dec 2017 14:33:42 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Wow, that seems promising. I'll read the imbalanced-learn code with
interest.

Thanks for the info!
Manuel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com  Tue Dec 19 08:34:49 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 19 Dec 2017 14:34:49 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Eager to learn! Diving into the code right now!

Thanks for the tip!
Manuel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
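A minimal sketch of the sampler API Christos and Guillaume mention, with
RandomUnderSampler standing in for a custom sampler (the FunctionSampler
wrapper was still an open pull request at the time). It assumes
imbalanced-learn is installed and X, y are defined.

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('sampler', RandomUnderSampler(random_state=0)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)   # the sampler rewrites both X and y, but only during fit
pipe.predict(X)  # sampling steps are skipped automatically at predict time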
From ranjanagirish30 at gmail.com  Tue Dec 19 09:38:12 2017
From: ranjanagirish30 at gmail.com (Ranjana Girish)
Date: Tue, 19 Dec 2017 20:08:12 +0530
Subject: [scikit-learn] Text classification of large dataset

Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using:

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)
random.seed(20000)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")
dataset = pd.concat([trainset1, trainset2])
dataset = dataset.dropna()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))

I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified
sampling, and the training set was reduced to 2.3 million. I was still
unable to train on the 2.3 million documents: I got a memory error when I
used random forest (n_estimators=30), Naive Bayes and SVM.

I am stuck. Can anyone please tell me whether there is any memory leak in
my code, and how to use a system with 128 GB RAM effectively?

Thanks
Ranjana
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From johnmarktaylor at g.harvard.edu  Tue Dec 19 16:27:53 2017
From: johnmarktaylor at g.harvard.edu (Taylor, Johnmark)
Date: Tue, 19 Dec 2017 16:27:53 -0500
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Hello,

I am a researcher in fMRI and am using SVMs to analyze brain data. I am
doing decoding between two classes, each of which has 24 exemplars per
class. I am comparing two different methods of cross-validation for my
data: in one, I am training on 23 exemplars from each class and testing on
the remaining example from each class; in the other, I am training on 22
exemplars from each class and testing on the remaining two from each class
(in case it matters, the data is structured into different neuroimaging
"runs", with each "run" containing several "blocks"; the first
cross-validation method is leaving out one block at a time, the second is
leaving out one run at a time).

Now, I would have thought that these two CV methods would be very similar,
since the vast majority of the training data is the same; the only
difference is in adding two additional points. However, they are yielding
very different results: training on 23 per class yields 60% decoding
accuracy (averaged across several subjects, and statistically
significantly greater than chance); training on 22 per class yields chance
(50%) decoding. Leaving aside the particulars of fMRI in this case: is it
unusual for single points (amounting to less than 5% of the data) to have
such a big influence on SVM decoding? I am using a cost parameter of C=1.
I must say it is counterintuitive to me that just a couple of points out
of two dozen could make such a big difference.

Thank you very much, and cheers,

JohnMark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jakevdp at cs.washington.edu  Tue Dec 19 16:37:35 2017
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Tue, 19 Dec 2017 13:37:35 -0800
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Hi JohnMark,
SVMs, by design, are quite sensitive to the addition of single data points
- but only if those data points happen to lie near the margin. I wrote
about some of those types of details here:
https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

Hope that helps,
Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From l.lomasto at innovationengineering.eu  Tue Dec 19 17:07:57 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Tue, 19 Dec 2017 23:07:57 +0100
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?
Message-ID: <7B51C250-F836-4C5A-89EE-61C756093180@innovationengineering.eu>

Hi, you can try CV with a k-fold partition, so that you see all
training/test combinations (generally 90%/10% or 80%/20%). If you get very
different results across folds, you are probably overfitting.

Sent from my iPhone
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jeff1evesque at yahoo.com  Tue Dec 19 16:56:40 2017
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Tue, 19 Dec 2017 16:56:40 -0500
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Hi guys,
I'm currently developing a web interface and programmatic REST API for
sklearn. I currently have SVM and SVR available, with some parameters like
C and gamma exposed:

- https://github.com/jeff1evesque/machine-learning

I'm working a bit on improving the web interface at the moment. Since
you're working with SVMs, maybe you'd have time to try my project and give
me some feedback? I hope to expand the toolset to things like ensemble
learning and, as a long shot, neural networks. But this may take some
time.

Thank you,

Jeff Levesque
https://github.com/jeff1evesque
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
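A small experiment illustrating Jake's point that only points near the
margin matter: removing a non-support vector leaves the fitted SVM
essentially unchanged. The blob data is an illustrative assumption sized
like the thread's problem (48 points, 2 classes).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=48, centers=2, cluster_std=2.0, random_state=0)
svm = SVC(kernel='linear', C=1).fit(X, y)

# Drop one point that is *not* a support vector and refit
non_sv = np.setdiff1d(np.arange(len(X)), svm.support_)[0]
mask = np.arange(len(X)) != non_sv
svm2 = SVC(kernel='linear', C=1).fit(X[mask], y[mask])

# The separating hyperplane is unchanged up to solver tolerance;
# dropping a support vector instead can move it substantially.
print(np.abs(svm.coef_ - svm2.coef_).max())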
From gael.varoquaux at normalesup.org  Tue Dec 19 16:35:26 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 19 Dec 2017 22:35:26 +0100
Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?
Message-ID: <20171219213525.GE360768@phare.normalesup.org>

With as few data points, there is a huge uncertainty in the estimation of
the prediction accuracy with cross-validation. This isn't a problem of the
method; it is a basic limitation of the small amount of data. I've written
a paper on this problem in the specific context of neuroimaging:
https://www.sciencedirect.com/science/article/pii/S1053811917305311
(preprint: https://hal.inria.fr/hal-01545002/).

I expect that what you are seeing is sampling noise: the result has
confidence intervals larger than 10%.

Gaël

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux

From joel.nothman at gmail.com  Tue Dec 19 19:09:37 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 20 Dec 2017 11:09:37 +1100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

At a glance, and perhaps not knowing imbalanced-learn well enough, I have
some doubts that it will provide an immediate solution for all your needs.
At the end of the day, the Pipeline keeps its scope relatively tight, but
it should not be so hard to implement something for your own needs if your
case does not fit what Pipeline supports.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From manuel.castejon at gmail.com  Wed Dec 20 10:33:19 2017
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Wed, 20 Dec 2017 16:33:19 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

Thank you all for your interest!

In order to clarify the case, allow me to try to synthesize the spirit of
what I'd like to put into the pipeline using this sequence of steps:

#%%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

np.random.seed(seed=42)

"""
Data preparation
"""
URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
data = pd.read_csv(URL, usecols=['V1', 'V2'])
X, y = data[['V1']], data[['V2']]

(data_train, data_test,
 X_train, X_test,
 y_train, y_test) = train_test_split(data, X, y)

"""
Parameters setup
"""
dbscan__eps = 0.06

mclust__n_components = 3

paella__noise_label = -1
paella__max_it = 20
paella__regular_size = 400
paella__minimum_size = 100
paella__width_r = 0.99
paella__n_neighbors = 5
paella__power = 30
paella__random_state = None

#%%
"""
DBSCAN clustering to detect noise suspects (label == -1)
"""
dbscan_input = data_train
dbscan_clustering = DBSCAN(eps=dbscan__eps)
dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
            c=np.int64(dbscan_output == -1))

#%%
"""
GaussianMixture fitted with the filtered data_train in order to help
locate the ellipsoids, but predict is applied to the whole data_train set.
"""
mclust_input = data_train[dbscan_output != -1]
mclust_clustering = GaussianMixture(n_components=mclust__n_components)
mclust_clustering.fit(mclust_input)
mclust_output = mclust_clustering.predict(data_train)
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
            c=mclust_output)

#%%
"""
mclust and dbscan results are combined.
"""
clustering_output = mclust_output.copy()
clustering_output[dbscan_output == -1] = -1
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
            c=clustering_output)

#%%
"""
The good old Paella paper:
https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c

The Paella algorithm calculates sample_weight to be used by the final-step
regressor (yes, it is an outlier detection algorithm, but we are focusing
now on this interesting collateral result). I am currently aggressively
changing the code in order to make it fit somehow with the pipeline.
"""
from paella import Paella

paella_input = pd.concat([data, clustering_output], axis=1)
paella_run = Paella(noise_label=paella__noise_label,
                    max_it=paella__max_it,
                    regular_size=paella__regular_size,
                    minimum_size=paella__minimum_size,
                    width_r=paella__width_r,
                    n_neighbors=paella__n_neighbors,
                    power=paella__power,
                    random_state=paella__random_state)
paella_output = paella_run.fit_predict(paella_input, y_train)
# paella_output is a vector with sample_weight

#%%
"""
Here we fit a regressor using sample_weight=paella_output
"""
from sklearn.linear_model import LinearRegression

regressor_input = X_train
lm = LinearRegression()
lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
regressor_output = lm.predict(X_train)

#...

In this example we can see that:
- A particular step might need results produced not necessarily by the
  immediately previous step.
- The X parameter is not sequentially transformed; sometimes we might need
  to skip back to a previous step.
- y is sometimes the target and sometimes not. For the regressor it is
  indeed, but for the Paella algorithm the prediction is expressed as a
  vector representing sample_weights.

All in all, the conclusion is that the chain of processes is not as linear
as imposed by the current API. I guess that all these difficulties could
be solved by:
- passing a dictionary through the different steps containing the partial
  results that the following steps will need;
- as a Christmas gift :-), a reference to the pipeline itself inserted in
  that dictionary, which could provide access to the internal status of
  the previous steps should it be needed.

Another interesting case study with similar needs would be a regressor
using a previous clustering step in order to fit one model per cluster. In
such a case, the clustering results would be needed during the fitting.

Thanks for your interest!
Manolo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From l.lomasto at innovationengineering.eu  Wed Dec 20 11:42:50 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Wed, 20 Dec 2017 17:42:50 +0100
Subject: [scikit-learn] Parallel MLP version

Hi all,

I have a computational problem training my neural network, so can you tell
me whether there is any parallel version of the MLP library?

From drraph at gmail.com  Wed Dec 20 11:44:21 2017
From: drraph at gmail.com (Raphael C)
Date: Wed, 20 Dec 2017 16:44:21 +0000
Subject: [scikit-learn] Parallel MLP version

I believe tensorflow will do what you want.

Raphael
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
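Should scikit-learn's MLPClassifier not suffice, a minimal Keras sketch of
an equivalent MLP (Keras 2 API with the TensorFlow backend, which handles
the parallelism); the data and layer sizes are placeholder assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)  # placeholder binary target

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))  # one hidden layer
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=5, batch_size=128)  # uses a GPU automatically if available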
From Jeremiah.Johnson at unh.edu  Wed Dec 20 12:35:14 2017
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Wed, 20 Dec 2017 17:35:14 +0000
Subject: [scikit-learn] Parallel MLP version
Message-ID: <1A2511F7-FB15-4E50-9E4B-ADD0E8A3989C@unh.edu>

For neural network training, try one of tensorflow, pytorch, chainer, or
mxnet. They'll all parallelize the computations and can run them on Nvidia
GPUs with CUDA.

Best regards,
Jeremiah

Sent from my iPhone

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rth.yurchak at gmail.com  Wed Dec 20 13:32:35 2017
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Wed, 20 Dec 2017 19:32:35 +0100
Subject: [scikit-learn] Text classification of large dataset
Message-ID: <7386f24b-61fe-3ea9-00f2-c6e8a8941902@gmail.com>

Ranjana, have a look at this example:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

Since you have a lot of RAM, you may not need to make the whole
classification pipeline out-of-core. A start with your current code could
be to write a generator that loads and pre-processes the text in chunks,
then feeds it one document at a time to CountVectorizer.fit (it accepts an
iterable). To reduce the memory usage, filtering too-frequent tokens
(instead of the infrequent ones) could help too.

Make sure you L2-normalize your data before the classifier. You could use
SGDClassifier(loss='log') or LogisticRegression with a sag or saga solver.
The multi_class="multinomial" parameter might also be worth trying,
particularly since you have so many classes.

-- 
Roman
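A minimal sketch of the out-of-core route from the linked example:
HashingVectorizer is stateless (no fitting pass over 10 million documents)
and L2-normalizes by default, and SGDClassifier learns incrementally via
partial_fit. Here doc_chunks() and all_classes are assumed helpers: a
generator yielding (texts, labels) batches read from disk, and the array
of the ~7k distinct labels.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2 ** 20)  # stateless, fixed memory footprint
clf = SGDClassifier(loss='log')              # logistic regression, trained incrementally

for texts, labels in doc_chunks():           # assumed generator over batches
    X_batch = vec.transform(texts)           # sparse, L2-normalized by default
    clf.partial_fit(X_batch, labels, classes=all_classes)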
tell me whether there is any memory leak in my code and how to
> use a system with 128 GB RAM effectively?
>
> Thanks
> Ranjana
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

From joel.nothman at gmail.com Wed Dec 20 15:13:06 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 21 Dec 2017 07:13:06 +1100
Subject: [scikit-learn] Text classification of large dataset
In-Reply-To: References: <7386f24b-61fe-3ea9-00f2-c6e8a8941902@gmail.com>
Message-ID: 

To clarify:

You have 2.3M samples.
How many features? How many active features on average per sample?
In 7k classes: multiclass or multilabel?

Have you tried limiting the depth of the forest?

Have you tried embedding your feature space into a smaller vector
(pre-trained embeddings, hashing, LDA, PCA or random projection)?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com Fri Dec 22 06:09:55 2017
From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=)
Date: Fri, 22 Dec 2017 12:09:55 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?
In-Reply-To: References: Message-ID: 

I'm currently thinking of a computational graph which can then be wrapped
as a pipeline-like object ... I'll try to make a toy example solving my
problem.

On 20 Dec 2017 16:33, "Manuel Castejón Limas" wrote:

> Thank you all for your interest!
>
> In order to clarify the case allow me to try to synthesize the spirit of
> what I'd like to put into the pipeline using this sequence of steps:
>
> #%%
> import pandas as pd
> import numpy as np
> import matplotlib.pyplot as plt
>
> from sklearn.cluster import DBSCAN
> from sklearn.mixture import GaussianMixture
> from sklearn.model_selection import train_test_split
>
> np.random.seed(seed=42)
>
> """
> Data preparation
> """
>
> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
> data = pd.read_csv(URL, usecols=['V1','V2'])
> X, y = data[['V1']], data[['V2']]
>
> (data_train, data_test,
>  X_train, X_test,
>  y_train, y_test) = train_test_split(data, X, y)
>
> """
> Parameters setup
> """
>
> dbscan__eps = 0.06
>
> mclust__n_components = 3
>
> paella__noise_label = -1
> paella__max_it = 20,
> paella__regular_size = 400,
> paella__minimum_size = 100,
> paella__width_r = 0.99,
> paella__n_neighbors = 5,
> paella__power = 30,
> paella__random_state = None
>
> #%%
> """
> DBSCAN clustering to detect noise suspects (label == -1)
> """
>
> dbscan_input = data_train
>
> dbscan_clustering = DBSCAN(eps = dbscan__eps)
>
> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=np.int64(dbscan_output == -1))
>
> #%%
> """
> GaussianMixture fitted with filtered data_train in order to help locate
> the ellipsoids
> but predict is applied to the whole data_train set.
> """
>
> mclust_input = data_train[ dbscan_output != 1]
>
> mclust_clustering = GaussianMixture(n_components = mclust__n_components)
> mclust_clustering.fit(mclust_input)
>
> mclust_output = mclust_clustering.predict(data_train)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=mclust_output)
>
> #%%
> """
> mclust and dbscan results are combined.
> """ > > clustering_output = mclust_output.copy() > clustering_output[dbscan_output == -1] = -1 > > plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, > c=clustering_output) > > #%% > """ > Old-good Paella paper: https://link.springer. > com/article/10.1023/B:DAMI.0000031630.50685.7c > > The Paella algorithm calculates sample_weight to be used by the final step > regressor > (Yes, it is an outlier detection algorithm but we are focusing now on this > interesting collateral result). I am currently aggressively changing the > code in order to make it fit somehow with the pipeline > """ > > from paella import Paella > > paella_input = pd.concat([data, clustering_output], axis=1, inplace=False) > > paella_run = Paella(noise_label = paella__noise_label, > max_it = paella__max_it, > regular_size = paella__regular_size, > minimum_size = paella__minimum_size, > width_r = paella__width_r, > n_neighbors = paella__n_neighbors, > power = paella__power, > random_state = paella__random_state) > > paella_output = paella_run.fit_predict(paella_input, y_train) > # paella_output is a vector with sample_weight > > #%% > """ > Here we fit a regressor using sample_weight=paella_output > """ > from sklearn.linear_model import LinearRegression > > regressor_input=X_train > lm = LinearRegression() > lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output) > regressor_output = lm.predict(X_train) > > #... > > In this example we can see that: > - A particular step might need results produced not necessarily from the > immediately previous step. > - The X parameter is not sequentially transformed. Sometimes we might need > to skip to a previous step > - y sometimes is the target, sometimes is not. For the regressor it is > indeed, but for the paella algorithm the prediction is expressed as a > vector representing sample_weights. > > All in all the conclusion is that the chain of processes is not as linear > as imposed by the current API. I guess that all these difficulties could be > solved by: > - Passing a dictionary through the different steps containing the partial > results that the following steps will need. > - As a christmas gift :-) , a reference to the pipeline itself inserted > in that dictionary could provide access to the internal status of the > previous steps should it be needed. > > Another interesting study case with similar needs would be a regressor > using a previous clustering step in order to fit one model per cluster. In > such case, the clustering results would be needed during the fitting. > > > Thanks for your interest! > Manolo > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Sylvain.Takerkart at univ-amu.fr Fri Dec 22 06:20:55 2017 From: Sylvain.Takerkart at univ-amu.fr (Sylvain Takerkart) Date: Fri, 22 Dec 2017 12:20:55 +0100 Subject: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints? In-Reply-To: <20171219213525.GE360768@phare.normalesup.org> References: <20171219213525.GE360768@phare.normalesup.org> Message-ID: Hello, Yes, Gael's paper points out some fundamental issues! In your case, the practical question is to know what kind of cross validation scheme you used... If you originally used StratifiedKFold, try to re-run your experiments with StratifiedShuffleSplit and a large number of splits! Hopefully, increasing the number of splits should reduce the discrepancy you observe between the two mean accuracies... 
But as Gael says, the small sample size brings fundamental limitations to
what you can measure...

Sylvain

On Tue, Dec 19, 2017 at 10:35 PM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:

> With as few data points, there is a huge uncertainty in the estimation of
> the prediction accuracy with cross-validation. This isn't a problem of
> the method, it is a basic limitation of the small amount of data. I've
> written a paper on this problem in the specific context of neuroimaging:
> https://www.sciencedirect.com/science/article/pii/S1053811917305311
> (preprint: https://hal.inria.fr/hal-01545002/).
>
> I expect that what you are seeing is sampling noise: the result has
> confidence intervals larger than 10%.
>
> Gaël
>
>
> On Tue, Dec 19, 2017 at 04:27:53PM -0500, Taylor, Johnmark wrote:
> > Hello,
>
> > I am a researcher in fMRI and am using SVMs to analyze brain data. I am
> > doing decoding between two classes, each of which has 24 exemplars per
> > class. I am comparing two different methods of cross-validation for my
> > data: in one, I am training on 23 exemplars from each class, and testing
> > on the remaining example from each class, and in the other, I am
> > training on 22 exemplars from each class, and testing on the remaining
> > two from each class (in case it matters, the data is structured into
> > different neuroimaging "runs", with each "run" containing several
> > "blocks"; the first cross-validation method is leaving out one block at
> > a time, the second is leaving out one run at a time).
>
> > Now, I would've thought that these two CV methods would be very similar,
> > since the vast majority of the training data is the same; the only
> > difference is in adding two additional points. However, they are
> > yielding very different results: training on 23 per class is yielding
> > 60% decoding accuracy (averaged across several subjects, and
> > statistically significantly greater than chance), training on 22 per
> > class is yielding chance (50%) decoding. Leaving aside the particulars
> > of fMRI in this case: is it unusual for single points (amounting to
> > less than 5% of the data) to have such a big influence on SVM decoding?
> > I am using a cost parameter of C=1. I must say it is counterintuitive
> > to me that just a couple points out of two dozen could make such a big
> > difference.
>
> > Thank you very much, and cheers,
>
> > JohnMark
>
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> --
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-- 
Sylvain Takerkart
Institut des Neurosciences de la Timone (INT)
UMR 7289 CNRS-AMU
Marseille, France
tél: +33 (0)4 91 324 007
http://www.int.univ-amu.fr/_TAKERKART-Sylvain_?lang=en
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manuel.castejon at gmail.com Tue Dec 26 05:47:47 2017
From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=)
Date: Tue, 26 Dec 2017 11:47:47 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?
In-Reply-To: References: Message-ID: 

I'm elaborating on the graph idea. A dictionary to describe the graph, the
networkx package to support the graph and run it in topological order; and
some wrappers for scikit-learn models.

I'm currently thinking of putting some more effort into a contrib project.

It could be something inspired by this example.

Manolo

#-------------------------------------------------

graph_description = {
    'First':
        {'operation': First_Step,
         'input': {'X': X, 'y': y}},

    'Concatenate_Xy':
        {'operation': ConcatenateData_Step,
         'input': [('First', 'X'),
                   ('First', 'y')]},

    'Gaussian_Mixture':
        {'operation': Gaussian_Mixture_Step,
         'input': [('Concatenate_Xy', 'data')]},

    'Dbscan':
        {'operation': Dbscan_Step,
         'input': [('Concatenate_Xy', 'data')]},

    'CombineClustering':
        {'operation': CombineClustering_Step,
         'input': [('Dbscan', 'classification'),
                   ('Gaussian_Mixture', 'classification')]},

    'Paella':
        {'operation': Paella_Step,
         'input': [('First', 'X'),
                   ('First', 'y'),
                   ('Concatenate_Xy', 'data'),
                   ('CombineClustering', 'classification')]},

    'Regressor':
        {'operation': Regressor_Step,
         'input': [('First', 'X'),
                   ('First', 'y'),
                   ('Paella', 'sample_weight')]},

    'Last':
        {'operation': Last_Step,
         'input': [('Regressor', 'regressor')]},
}

#%%
def create_graph(description):
    cg = nx.DiGraph()
    cg.add_nodes_from(description)
    for current_name, info in description.items():
        current_node = cg.node[current_name]
        current_node['operation'] = info['operation'](graph=cg,
                                                      node_name=current_name)
        current_node['input'] = info['input']
        if current_name != 'First':
            for ascendant in set(name for name, attribute in info['input']):
                cg.add_edge(ascendant, current_name)
    return cg

#%%
cg = create_graph(graph_description)

node_pos = {'First'            : ( 0, 0),
            'Concatenate_Xy'   : ( 2, 4),
            'Gaussian_Mixture' : ( 6, 8),
            'Dbscan'           : ( 6, 6),
            'CombineClustering': ( 8, 7),
            'Paella'           : (10, 2),
            'Regressor'        : (12, 0),
            'Last'             : (16, 0)
            }

nx.draw(cg, pos=node_pos, with_labels=True)

#%%

print("=========================")
for name in nx.topological_sort(cg):
    print("Running: ", name)
    cg.node[name]['operation'].fit()

print("=========================")

########################

2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas :

> I'm currently thinking of a computational graph which can then be wrapped
> as a pipeline-like object ... I'll try to make a toy example solving my
> problem.
>
> On 20 Dec 2017 16:33, "Manuel Castejón Limas"
> wrote:
>
>> Thank you all for your interest!
>> >> In order to clarify the case allow me to try to synthesize the spirit of >> what I'd like to put into the pipeline using this sequence of steps: >> >> #%% >> import pandas as pd >> import numpy as np >> import matplotlib.pyplot as plt >> >> from sklearn.cluster import DBSCAN >> from sklearn.mixture import GaussianMixture >> from sklearn.model_selection import train_test_split >> >> np.random.seed(seed=42) >> >> """ >> Data preparation >> """ >> >> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/ >> sin_60_percent_noise.csv" >> data = pd.read_csv(URL, usecols=['V1','V2']) >> X, y = data[['V1']], data[['V2']] >> >> (data_train, data_test, >> X_train, X_test, >> y_train, y_test) = train_test_split(data, X, y) >> >> """ >> Parameters setup >> """ >> >> dbscan__eps = 0.06 >> >> mclust__n_components = 3 >> >> paella__noise_label = -1 >> paella__max_it = 20, >> paella__regular_size = 400, >> paella__minimum_size = 100, >> paella__width_r = 0.99, >> paella__n_neighbors = 5, >> paella__power = 30, >> paella__random_state = None >> >> #%% >> """ >> DBSCAN clustering to detect noise suspects (label == -1) >> """ >> >> dbscan_input = data_train >> >> dbscan_clustering = DBSCAN(eps = dbscan__eps) >> >> dbscan_output = dbscan_clustering.fit_predict(dbscan_input) >> >> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >> c=np.int64(dbscan_output == -1)) >> >> #%% >> """ >> GaussianMixture fitted with filtered data_train in order to help locate >> the ellipsoids >> but predict is applied to the whole data_train set. >> """ >> >> mclust_input = data_train[ dbscan_output != 1] >> >> mclust_clustering = GaussianMixture(n_components = mclust__n_components) >> mclust_clustering.fit(mclust_input) >> >> mclust_output = mclust_clustering.predict(data_train) >> >> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >> c=mclust_output) >> >> #%% >> """ >> mclust and dbscan results are combined. >> """ >> >> clustering_output = mclust_output.copy() >> clustering_output[dbscan_output == -1] = -1 >> >> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >> c=clustering_output) >> >> #%% >> """ >> Old-good Paella paper: https://link.springer.c >> om/article/10.1023/B:DAMI.0000031630.50685.7c >> >> The Paella algorithm calculates sample_weight to be used by the final >> step regressor >> (Yes, it is an outlier detection algorithm but we are focusing now on >> this interesting collateral result). I am currently aggressively changing >> the code in order to make it fit somehow with the pipeline >> """ >> >> from paella import Paella >> >> paella_input = pd.concat([data, clustering_output], axis=1, inplace=False) >> >> paella_run = Paella(noise_label = paella__noise_label, >> max_it = paella__max_it, >> regular_size = paella__regular_size, >> minimum_size = paella__minimum_size, >> width_r = paella__width_r, >> n_neighbors = paella__n_neighbors, >> power = paella__power, >> random_state = paella__random_state) >> >> paella_output = paella_run.fit_predict(paella_input, y_train) >> # paella_output is a vector with sample_weight >> >> #%% >> """ >> Here we fit a regressor using sample_weight=paella_output >> """ >> from sklearn.linear_model import LinearRegression >> >> regressor_input=X_train >> lm = LinearRegression() >> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output) >> regressor_output = lm.predict(X_train) >> >> #... 
>>
>> In this example we can see that:
>> - A particular step might need results produced not necessarily from the
>> immediately previous step.
>> - The X parameter is not sequentially transformed. Sometimes we might
>> need to skip to a previous step
>> - y sometimes is the target, sometimes is not. For the regressor it is
>> indeed, but for the paella algorithm the prediction is expressed as a
>> vector representing sample_weights.
>>
>> All in all the conclusion is that the chain of processes is not as linear
>> as imposed by the current API. I guess that all these difficulties could be
>> solved by:
>> - Passing a dictionary through the different steps containing the partial
>> results that the following steps will need.
>> - As a christmas gift :-) , a reference to the pipeline itself inserted
>> in that dictionary could provide access to the internal status of the
>> previous steps should it be needed.
>>
>> Another interesting study case with similar needs would be a regressor
>> using a previous clustering step in order to fit one model per cluster. In
>> such case, the clustering results would be needed during the fitting.
>>
>>
>> Thanks for your interest!
>> Manolo
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ranjanagirish30 at gmail.com Wed Dec 27 05:16:34 2017
From: ranjanagirish30 at gmail.com (Ranjana Girish)
Date: Wed, 27 Dec 2017 15:46:34 +0530
Subject: [scikit-learn] Text classification of large dataset
Message-ID: 

Hi all,

Thank you for your suggestions. But I am still getting a memory error
while doing feature selection:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
documenttermmatrix1 = fs.fit_transform(documenttermmatrix, y1)

documenttermmatrix is of shape (1594516, 232832); its type is a scipy
CSR matrix.

Am I doing anything wrong? Is there any better way of doing feature
selection?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tevang3 at gmail.com Fri Dec 29 06:09:00 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Fri, 29 Dec 2017 12:09:00 +0100
Subject: Re: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: References: Message-ID: 

Alright, with these attributes I can get the weights and biases, but what
about the values on the nodes of the last hidden layer? Do I have to work
them out myself or is there a straightforward way to get them?

On 7 December 2017 at 04:25, Manoj Kumar wrote:

> Hi,
>
> The weights and intercepts are available in the coefs_ and intercepts_
> attribute respectively.
>
> See https://github.com/scikit-learn/scikit-learn/blob/
> a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835
>
> On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> I am also very interested in knowing if there is a sklearn cookbook
>> solution for getting the weights of a one-hidde-layer MLPClassifier.
>> J.B.
>>
>> 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis :
>>
>>> Greetings,
>>>
>>> I want to train a MLPClassifier with one hidden layer and use it as a
>>> feature selector for an MLPRegressor.
>>> Is it possible to get the values of the neurons from the last hidden
>>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>>
>>> If it is not possible with scikit-learn, is anyone aware of any
>>> scikit-compatible NN library that offers this functionality?
For example >>> this one: >>> >>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html >>> >>> I wouldn't like to do this in Tensorflow because the MLP there is much >>> slower than scikit-learn's implementation. >>> >>> >>> Thomas >>> >>> >>> -- >>> >>> ====================================================================== >>> >>> Dr Thomas Evangelidis >>> >>> Post-doctoral Researcher >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/2S049, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Manoj, > http://github.com/MechCoder > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlopez at ende.cc Fri Dec 29 11:45:49 2017 From: jlopez at ende.cc (=?UTF-8?Q?Javier_L=C3=B3pez?=) Date: Fri, 29 Dec 2017 16:45:49 +0000 Subject: [scikit-learn] MLPClassifier as a feature selector In-Reply-To: References: Message-ID: Hi Thomas, it is possible to obtain the activation values of any hidden layer, but the procedure is not completely straight forward. If you look at the code of the `_predict` method of MLPs you can see the following: ```python def _predict(self, X): """Predict using the trained model Parameters ---------- X : {array-like, sparse matrix}, shape (n_samples, n_features) The input data. Returns ------- y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs) The decision function of the samples for each class in the model. """ X = check_array(X, accept_sparse=['csr', 'csc', 'coo']) # Make sure self.hidden_layer_sizes is a list hidden_layer_sizes = self.hidden_layer_sizes if not hasattr(hidden_layer_sizes, "__iter__"): hidden_layer_sizes = [hidden_layer_sizes] hidden_layer_sizes = list(hidden_layer_sizes) layer_units = [X.shape[1]] + hidden_layer_sizes + \ [self.n_outputs_] # Initialize layers activations = [X] for i in range(self.n_layers_ - 1): activations.append(np.empty((X.shape[0], layer_units[i + 1]))) # forward propagate self._forward_pass(activations) y_pred = activations[-1] return y_pred ``` the line `y_pred = activations[-1]` is responsible for extracting the values for the last layer, but the `activations` variable contains the values for all the neurons. You can make this function into your own external method (changing the `self` attribute by a proper parameter) and add an extra argument which specifies the layer(s) that you want. I have done this myself in order to make an AutoEncoderNetwork out of the MLP implementation. 
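In the meantime, the same idea can be sketched without touching the class
internals, using only the public `coefs_` and `intercepts_` attributes (a
minimal example; the helper name is made up, it assumes an already fitted
`mlp`, and only the tanh, logistic and relu activations are handled):

```python
import numpy as np

ACTIVATIONS = {
    'tanh': np.tanh,
    'logistic': lambda z: 1.0 / (1.0 + np.exp(-z)),
    'relu': lambda z: np.maximum(z, 0.0),
}

def hidden_layer_values(mlp, X, layer=-1):
    """Forward-propagate X and return the activations of one hidden layer.

    layer=-1 returns the last hidden layer, i.e. the representation
    that the output layer of the fitted estimator sees.
    """
    act = ACTIVATIONS[mlp.activation]
    A = np.asarray(X)
    hidden = []
    # coefs_[i] / intercepts_[i] map layer i to layer i + 1; the last
    # pair belongs to the output layer, so it is skipped here.
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        A = act(A.dot(W) + b)
        hidden.append(A)
    return hidden[layer]
```

Applied to a fitted MLPClassifier, `hidden_layer_values(clf, X)` yields a
feature matrix that can be passed straight to an MLPRegressor's `fit`,
which is what was asked at the start of this thread.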
This makes me wonder, would it be worth adding this to sklearn?
A very simple way would be to refactor the `_predict` method, with the
additional layer argument, to a new method `_predict_layer`, then we can
have the `_predict` method simply call `_predict_layer(..., layer=-1)`
and add a new method (perhaps a `transform`?) that allows to get
(raveled) values for an arbitrary subset of the layers.

I'd be happy to submit a PR if you guys think it would be interesting for
the project.

Javier

On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis wrote:

> Greetings,
>
> I want to train a MLPClassifier with one hidden layer and use it as a
> feature selector for an MLPRegressor.
> Is it possible to get the values of the neurons from the last hidden layer
> of the MLPClassifier to pass them as input to the MLPRegressor?
>
> If it is not possible with scikit-learn, is anyone aware of any
> scikit-compatible NN library that offers this functionality? For example
> this one:
>
> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>
> I wouldn't like to do this in Tensorflow because the MLP there is much
> slower than scikit-learn's implementation.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Fri Dec 29 12:14:15 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 29 Dec 2017 18:14:15 +0100
Subject: Re: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: References: Message-ID: <460c5520-3226-4aaf-bcbd-343d1e4a7e0e@normalesup.org>

I think that a transform method would be good. We would have to add a
parameter to the constructor to specify which layer is used for the
transform. It should default to "-1", in my opinion.

Cheers,

Gaël

"Sent from my phone. Please forgive typos and briefness."

On Dec 29, 2017, 17:48, at 17:48, "Javier López" wrote:
>Hi Thomas,
>
>it is possible to obtain the activation values of any hidden layer, but
>the
>procedure is not completely straight forward. If you look at the code
>of
>the `_predict` method of MLPs you can see the following:
>
>```python
>    def _predict(self, X):
>        """Predict using the trained model
>
>        Parameters
>        ----------
>        X : {array-like, sparse matrix}, shape (n_samples, n_features)
>            The input data.
>
>        Returns
>        -------
>        y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>            The decision function of the samples for each class in the
>model.
>        """
>        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>
>        # Make sure self.hidden_layer_sizes is a list
>        hidden_layer_sizes = self.hidden_layer_sizes
>        if not hasattr(hidden_layer_sizes, "__iter__"):
>            hidden_layer_sizes = [hidden_layer_sizes]
>        hidden_layer_sizes = list(hidden_layer_sizes)
>
>        layer_units = [X.shape[1]] + hidden_layer_sizes + \
>            [self.n_outputs_]
>
>        # Initialize layers
>        activations = [X]
>
>        for i in range(self.n_layers_ - 1):
>            activations.append(np.empty((X.shape[0],
>                                         layer_units[i + 1])))
>        # forward propagate
>        self._forward_pass(activations)
>        y_pred = activations[-1]
>
>        return y_pred
>```
>
>the line `y_pred = activations[-1]` is responsible for extracting the
>values for the last layer,
>but the `activations` variable contains the values for all the neurons.
>
>You can make this function into your own external method (changing the
>`self` attribute by
>a proper parameter) and add an extra argument which specifies the
>layer(s)
>that you want.
>I have done this myself in order to make an AutoEncoderNetwork out of
>the MLP implementation.
>
>This makes me wonder, would it be worth adding this to sklearn?
>A very simple way would be to refactor the `_predict` method, with the
>additional layer argument, to a new method `_predict_layer`, then we can
>have the `_predict` method simply call `_predict_layer(..., layer=-1)`
>and add a new method (perhaps a `transform`?) that allows to get
>(raveled) values for an arbitrary subset of the layers.
>
>I'd be happy to submit a PR if you guys think it would be interesting
>for
>the project.
>
>Javier
>
>
>
>On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis
>wrote:
>
>> Greetings,
>>
>> I want to train a MLPClassifier with one hidden layer and use it as a
>> feature selector for an MLPRegressor.
>> Is it possible to get the values of the neurons from the last hidden
>layer
>> of the MLPClassifier to pass them as input to the MLPRegressor?
>>
>> If it is not possible with scikit-learn, is anyone aware of any
>> scikit-compatible NN library that offers this functionality? For
>example
>> this one:
>>
>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>
>> I wouldn't like to do this in Tensorflow because the MLP there is
>much
>> slower than scikit-learn's implementation.
>>
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From info at orges-leka.de Fri Dec 29 13:19:23 2017
From: info at orges-leka.de (Orges Leka)
Date: Fri, 29 Dec 2017 19:19:23 +0100
Subject: Re: [scikit-learn] scikit-learn Digest, Vol 21, Issue 29
In-Reply-To: References: Message-ID: 

Hello,

You could use the following code:

import math
import numpy as np

X_weight = []
for x in X:
    for i in range(len(mlp.coefs_) - 1):
        x = np.array([math.tanh(v) for v in (x.dot(mlp.coefs_[i]) + mlp.intercepts_[i])])
    X_weight.append(x)

where it is assumed that mlp is your trained MLPClassifier and that you
trained with the tanh activation function. X is the matrix for which you
want to compute the features, and x iterates over the rows of this matrix.
X_weight is a list of vectors with the computed hidden-layer values.

Kind regards
Orges Leka

2017-12-29 17:46 GMT+01:00 :

> Send scikit-learn mailing list submissions to
>         scikit-learn at python.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-request at python.org
>
> You can reach the person managing the list at
>         scikit-learn-owner at python.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
>
>
> Today's Topics:
>
>    1. Re: MLPClassifier as a feature selector (Thomas Evangelidis)
>    2. Re: MLPClassifier as a feature selector (Javier López)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 29 Dec 2017 12:09:00 +0100
> From: Thomas Evangelidis
> To: Scikit-learn mailing list
> Subject: Re: [scikit-learn] MLPClassifier as a feature selector
> Message-ID:
>         GkbDe6dd9w at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Alright, with these attributes I can get the weights and biases, but what
> about the values on the nodes of the last hidden layer? Do I have to work
> them out myself or is there a straightforward way to get them?
> > On 7 December 2017 at 04:25, Manoj Kumar > wrote: > > > Hi, > > > > The weights and intercepts are available in the coefs_ and intercepts_ > > attribute respectively. > > > > See https://github.com/scikit-learn/scikit-learn/blob/ > > a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835 > > > > On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn < > > scikit-learn at python.org> wrote: > > > >> I am also very interested in knowing if there is a sklearn cookbook > >> solution for getting the weights of a one-hidde-layer MLPClassifier. > >> J.B. > >> > >> 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > >> > >>> Greetings, > >>> > >>> I want to train a MLPClassifier with one hidden layer and use it as a > >>> feature selector for an MLPRegressor. > >>> Is it possible to get the values of the neurons from the last hidden > >>> layer of the MLPClassifier to pass them as input to the MLPRegressor? > >>> > >>> If it is not possible with scikit-learn, is anyone aware of any > >>> scikit-compatible NN library that offers this functionality? For > example > >>> this one: > >>> > >>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html > >>> > >>> I wouldn't like to do this in Tensorflow because the MLP there is much > >>> slower than scikit-learn's implementation. > >>> > >>> > >>> Thomas > >>> > >>> > >>> -- > >>> > >>> ====================================================================== > >>> > >>> Dr Thomas Evangelidis > >>> > >>> Post-doctoral Researcher > >>> CEITEC - Central European Institute of Technology > >>> Masaryk University > >>> Kamenice 5/A35/2S049, > >>> 62500 Brno, Czech Republic > >>> > >>> email: tevang at pharm.uoa.gr > >>> > >>> tevang3 at gmail.com > >>> > >>> > >>> website: https://sites.google.com/site/thomasevangelidishomepage/ > >>> > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > > > > > > -- > > Manoj, > > http://github.com/MechCoder > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: attachments/20171229/40eaa98c/attachment-0001.html> > > ------------------------------ > > Message: 2 > Date: Fri, 29 Dec 2017 16:45:49 +0000 > From: Javier L?pez > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] MLPClassifier as a feature selector > Message-ID: > gmail.com> > Content-Type: text/plain; charset="utf-8" > > Hi Thomas, > > it is possible to obtain the activation values of any hidden layer, but the > procedure is not completely straight forward. 
If you look at the code of > the `_predict` method of MLPs you can see the following: > > ```python > def _predict(self, X): > """Predict using the trained model > > Parameters > ---------- > X : {array-like, sparse matrix}, shape (n_samples, n_features) > The input data. > > Returns > ------- > y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs) > The decision function of the samples for each class in the > model. > """ > X = check_array(X, accept_sparse=['csr', 'csc', 'coo']) > > # Make sure self.hidden_layer_sizes is a list > hidden_layer_sizes = self.hidden_layer_sizes > if not hasattr(hidden_layer_sizes, "__iter__"): > hidden_layer_sizes = [hidden_layer_sizes] > hidden_layer_sizes = list(hidden_layer_sizes) > > layer_units = [X.shape[1]] + hidden_layer_sizes + \ > [self.n_outputs_] > > # Initialize layers > activations = [X] > > for i in range(self.n_layers_ - 1): > activations.append(np.empty((X.shape[0], > layer_units[i + 1]))) > # forward propagate > self._forward_pass(activations) > y_pred = activations[-1] > > return y_pred > ``` > > the line `y_pred = activations[-1]` is responsible for extracting the > values for the last layer, > but the `activations` variable contains the values for all the neurons. > > You can make this function into your own external method (changing the > `self` attribute by > a proper parameter) and add an extra argument which specifies the layer(s) > that you want. > I have done this myself in order to make an AutoEncoderNetwork out of the > MLP > implementation. > > This makes me wonder, would it be worth adding this to sklearn? > A very simple way would be to refactor the `_predict` method, with the > additional layer > argument, to a new method `_predict_layer`, then we can have the `_predict` > method > simply call `_predict_layer(..., layer=-1)` and add a new method (perhaps a > `transform`?) > that allows to get (raveled) values for an arbitrary subset of the layers. > > I'd be happy to submit a PR if you guys think it would be interesting for > the project. > > Javier > > > > On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis > wrote: > > > Greetings, > > > > I want to train a MLPClassifier with one hidden layer and use it as a > > feature selector for an MLPRegressor. > > Is it possible to get the values of the neurons from the last hidden > layer > > of the MLPClassifier to pass them as input to the MLPRegressor? > > > > If it is not possible with scikit-learn, is anyone aware of any > > scikit-compatible NN library that offers this functionality? For example > > this one: > > > > http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html > > > > I wouldn't like to do this in Tensorflow because the MLP there is much > > slower than scikit-learn's implementation. > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: attachments/20171229/47c835c7/attachment.html> > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 21, Issue 29 > ******************************************** > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From tevang3 at gmail.com Sat Dec 30 03:55:03 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sat, 30 Dec 2017 09:55:03 +0100
Subject: Re: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: References: Message-ID: 

Javier, thank you for the detailed explanation. Indeed, it would be very
useful to add such a function in the official scikit-learn bundle instead
of keeping our own modified versions of the MLP. It would be good for
transferability of our code.

On 29. 12. 2017 at 17:47, "Javier López" wrote:

> Hi Thomas,
>
> it is possible to obtain the activation values of any hidden layer, but the
> procedure is not completely straight forward. If you look at the code of
> the `_predict` method of MLPs you can see the following:
>
> ```python
>     def _predict(self, X):
>         """Predict using the trained model
>
>         Parameters
>         ----------
>         X : {array-like, sparse matrix}, shape (n_samples, n_features)
>             The input data.
>
>         Returns
>         -------
>         y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>             The decision function of the samples for each class in the
> model.
>         """
>         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>
>         # Make sure self.hidden_layer_sizes is a list
>         hidden_layer_sizes = self.hidden_layer_sizes
>         if not hasattr(hidden_layer_sizes, "__iter__"):
>             hidden_layer_sizes = [hidden_layer_sizes]
>         hidden_layer_sizes = list(hidden_layer_sizes)
>
>         layer_units = [X.shape[1]] + hidden_layer_sizes + \
>             [self.n_outputs_]
>
>         # Initialize layers
>         activations = [X]
>
>         for i in range(self.n_layers_ - 1):
>             activations.append(np.empty((X.shape[0],
>                                          layer_units[i + 1])))
>         # forward propagate
>         self._forward_pass(activations)
>         y_pred = activations[-1]
>
>         return y_pred
> ```
>
> the line `y_pred = activations[-1]` is responsible for extracting the
> values for the last layer,
> but the `activations` variable contains the values for all the neurons.
>
> You can make this function into your own external method (changing the
> `self` attribute by
> a proper parameter) and add an extra argument which specifies the layer(s)
> that you want.
> I have done this myself in order to make an AutoEncoderNetwork out of the
> MLP
> implementation.
>
> This makes me wonder, would it be worth adding this to sklearn?
> A very simple way would be to refactor the `_predict` method, with the
> additional layer
> argument, to a new method `_predict_layer`, then we can have the
> `_predict` method
> simply call `_predict_layer(..., layer=-1)` and add a new method (perhaps
> a `transform`?)
> that allows to get (raveled) values for an arbitrary subset of the layers.
>
> I'd be happy to submit a PR if you guys think it would be interesting for
> the project.
>
> Javier
>
>
>
> On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis
> wrote:
>
>> Greetings,
>>
>> I want to train a MLPClassifier with one hidden layer and use it as a
>> feature selector for an MLPRegressor.
>> Is it possible to get the values of the neurons from the last hidden
>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>
>> If it is not possible with scikit-learn, is anyone aware of any
>> scikit-compatible NN library that offers this functionality? For example
>> this one:
>>
>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>
>> I wouldn't like to do this in Tensorflow because the MLP there is much
>> slower than scikit-learn's implementation.
>>
> > _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From frederic.bastien at gmail.com Sat Dec 30 09:34:54 2017
From: frederic.bastien at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBCYXN0aWVu?=)
Date: Sat, 30 Dec 2017 14:34:54 +0000
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?
In-Reply-To: References: Message-ID: 

This is starting to look like the dask project. Do you know it?

On Tue, 26 Dec 2017 05:49, Manuel Castejón Limas wrote:

> I'm elaborating on the graph idea. A dictionary to describe the graph, the
> networkx package to support the graph and run it in topological order; and
> some wrappers for scikit-learn models.
>
> I'm currently thinking of putting some more effort into a contrib project.
>
> It could be something inspired by this example.
>
> Manolo
>
> #-------------------------------------------------
>
> graph_description = {
>     'First':
>         {'operation': First_Step,
>          'input': {'X': X, 'y': y}},
>
>     'Concatenate_Xy':
>         {'operation': ConcatenateData_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y')]},
>
>     'Gaussian_Mixture':
>         {'operation': Gaussian_Mixture_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'Dbscan':
>         {'operation': Dbscan_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'CombineClustering':
>         {'operation': CombineClustering_Step,
>          'input': [('Dbscan', 'classification'),
>                    ('Gaussian_Mixture', 'classification')]},
>
>     'Paella':
>         {'operation': Paella_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Concatenate_Xy', 'data'),
>                    ('CombineClustering', 'classification')]},
>
>     'Regressor':
>         {'operation': Regressor_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Paella', 'sample_weight')]},
>
>     'Last':
>         {'operation': Last_Step,
>          'input': [('Regressor', 'regressor')]},
> }
>
> #%%
> def create_graph(description):
>     cg = nx.DiGraph()
>     cg.add_nodes_from(description)
>     for current_name, info in description.items():
>         current_node = cg.node[current_name]
>         current_node['operation'] = info['operation'](graph=cg,
>                                                       node_name=current_name)
>         current_node['input'] = info['input']
>         if current_name != 'First':
>             for ascendant in set(name for name, attribute in info['input']):
>                 cg.add_edge(ascendant, current_name)
>     return cg
>
> #%%
> cg = create_graph(graph_description)
>
> node_pos = {'First'            : ( 0, 0),
>             'Concatenate_Xy'   : ( 2, 4),
>             'Gaussian_Mixture' : ( 6, 8),
>             'Dbscan'           : ( 6, 6),
>             'CombineClustering': ( 8, 7),
>             'Paella'           : (10, 2),
>             'Regressor'        : (12, 0),
>             'Last'             : (16, 0)
>             }
>
> nx.draw(cg, pos=node_pos, with_labels=True)
>
> #%%
>
> print("=========================")
> for name in nx.topological_sort(cg):
>     print("Running: ", name)
>     cg.node[name]['operation'].fit()
>
> print("=========================")
>
> ########################
>
>
> 2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <
> manuel.castejon at gmail.com>:
>
>> I'm currently thinking of a computational graph which can then be wrapped
>> as a pipeline-like object ... I'll try to make a toy example solving my
>> problem.
>>
>> On 20 Dec 2017 16:33, "Manuel Castejón Limas"
>> wrote:
>>
>>> Thank you all for your interest!
>>> >>> In order to clarify the case allow me to try to synthesize the spirit >>> of what I'd like to put into the pipeline using this sequence of steps: >>> >>> #%% >>> import pandas as pd >>> import numpy as np >>> import matplotlib.pyplot as plt >>> >>> from sklearn.cluster import DBSCAN >>> from sklearn.mixture import GaussianMixture >>> from sklearn.model_selection import train_test_split >>> >>> np.random.seed(seed=42) >>> >>> """ >>> Data preparation >>> """ >>> >>> URL = " >>> https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv >>> " >>> data = pd.read_csv(URL, usecols=['V1','V2']) >>> X, y = data[['V1']], data[['V2']] >>> >>> (data_train, data_test, >>> X_train, X_test, >>> y_train, y_test) = train_test_split(data, X, y) >>> >>> """ >>> Parameters setup >>> """ >>> >>> dbscan__eps = 0.06 >>> >>> mclust__n_components = 3 >>> >>> paella__noise_label = -1 >>> paella__max_it = 20, >>> paella__regular_size = 400, >>> paella__minimum_size = 100, >>> paella__width_r = 0.99, >>> paella__n_neighbors = 5, >>> paella__power = 30, >>> paella__random_state = None >>> >>> #%% >>> """ >>> DBSCAN clustering to detect noise suspects (label == -1) >>> """ >>> >>> dbscan_input = data_train >>> >>> dbscan_clustering = DBSCAN(eps = dbscan__eps) >>> >>> dbscan_output = dbscan_clustering.fit_predict(dbscan_input) >>> >>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >>> c=np.int64(dbscan_output == -1)) >>> >>> #%% >>> """ >>> GaussianMixture fitted with filtered data_train in order to help locate >>> the ellipsoids >>> but predict is applied to the whole data_train set. >>> """ >>> >>> mclust_input = data_train[ dbscan_output != 1] >>> >>> mclust_clustering = GaussianMixture(n_components = mclust__n_components) >>> mclust_clustering.fit(mclust_input) >>> >>> mclust_output = mclust_clustering.predict(data_train) >>> >>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >>> c=mclust_output) >>> >>> #%% >>> """ >>> mclust and dbscan results are combined. >>> """ >>> >>> clustering_output = mclust_output.copy() >>> clustering_output[dbscan_output == -1] = -1 >>> >>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, >>> c=clustering_output) >>> >>> #%% >>> """ >>> Old-good Paella paper: >>> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c >>> >>> The Paella algorithm calculates sample_weight to be used by the final >>> step regressor >>> (Yes, it is an outlier detection algorithm but we are focusing now on >>> this interesting collateral result). 
I am currently aggressively changing
>>> the code in order to make it fit somehow with the pipeline
>>> """
>>>
>>> from paella import Paella
>>>
>>> paella_input = pd.concat([data, clustering_output], axis=1,
>>> inplace=False)
>>>
>>> paella_run = Paella(noise_label = paella__noise_label,
>>>                     max_it = paella__max_it,
>>>                     regular_size = paella__regular_size,
>>>                     minimum_size = paella__minimum_size,
>>>                     width_r = paella__width_r,
>>>                     n_neighbors = paella__n_neighbors,
>>>                     power = paella__power,
>>>                     random_state = paella__random_state)
>>>
>>> paella_output = paella_run.fit_predict(paella_input, y_train)
>>> # paella_output is a vector with sample_weight
>>>
>>> #%%
>>> """
>>> Here we fit a regressor using sample_weight=paella_output
>>> """
>>> from sklearn.linear_model import LinearRegression
>>>
>>> regressor_input=X_train
>>> lm = LinearRegression()
>>> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
>>> regressor_output = lm.predict(X_train)
>>>
>>> #...
>>>
>>> In this example we can see that:
>>> - A particular step might need results produced not necessarily from the
>>> immediately previous step.
>>> - The X parameter is not sequentially transformed. Sometimes we might
>>> need to skip to a previous step
>>> - y sometimes is the target, sometimes is not. For the regressor it is
>>> indeed, but for the paella algorithm the prediction is expressed as a
>>> vector representing sample_weights.
>>>
>>> All in all the conclusion is that the chain of processes is not as
>>> linear as imposed by the current API. I guess that all these difficulties
>>> could be solved by:
>>> - Passing a dictionary through the different steps containing the
>>> partial results that the following steps will need.
>>> - As a christmas gift :-) , a reference to the pipeline itself inserted
>>> in that dictionary could provide access to the internal status of the
>>> previous steps should it be needed.
>>>
>>> Another interesting study case with similar needs would be a regressor
>>> using a previous clustering step in order to fit one model per cluster. In
>>> such case, the clustering results would be needed during the fitting.
>>>
>>>
>>> Thanks for your interest!
>>> Manolo
>>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gauravdhingra.gxyd at gmail.com Sun Dec 31 05:48:31 2017
From: gauravdhingra.gxyd at gmail.com (Gaurav Dhingra)
Date: Sun, 31 Dec 2017 16:18:31 +0530
Subject: [scikit-learn] Topic for thesis work on scikit learn
In-Reply-To: <9641a578-194f-183c-fa2c-22cb45a7c76d@gmail.com>
References: <069bd230-f2e0-e5ef-4fdf-7d0c529c5d5f@gmail.com>
 <9641a578-194f-183c-fa2c-22cb45a7c76d@gmail.com>
Message-ID: <66ab3e6f-4102-3fd0-5828-7af593eab581@gmail.com>

Hi Andreas,

I think I'll get access to a local mentor from my college, so I think I
can rule that issue out. For the technical side, though, I would still
like to rely on feedback from the scikit-learn community, since my aim
wouldn't be to make something for my own use but rather something that is
useful for the scikit-learn community, so that it eventually gets merged
into master.
Do you have some topic in mind that could be useful for addition to scikit-learn? Even if you could direct me to appropriate links I would be happy to look into those. On Wednesday 01 November 2017 01:43 AM, Andreas Mueller wrote: > Hi Gaurav. > > Do you have a local mentor? I think having a mentor that can guide you > during a thesis is very important. > You could get some feedback from the community for a contribution, but > that can be slow, > and is entirely on volunteer basis, so there is no guarantee that > you'll get the necessary feedback in time > to finish your thesis. > > Mentoring a thesis - in particular without knowing you - is a serious > commitment, so I'm not sure someone > from inside the project will want to do this. I saw you already made a > contribution in https://github.com/scikit-learn/scikit-learn/pull/10005 > but that's a very different scope than doing what I expect would be > several month of work. > Though in this regard I've made a few more contributions, here is the link https://github.com/scikit-learn/scikit-learn/pulls/gxyd, though I know none of them is a big contribution. If you think I should work on a big enough PR, can you please suggest me some issue in that regard? Thanks > Best, > Andy > > On 10/31/2017 03:31 PM, Gaurav Dhingra wrote: >> Hi everyone, >> >> I am a final year (5th year) undergraduate Applied Mathematics >> student in India. I am thinking of doing my final year thesis by >> doing some work (coding part) on scikit learn, so I was thinking if >> anyone could tell me if there are available topics (not necessarily >> names of those topics) that I could work on being an undergraduate >> student? I would want to expand upon this in December when my exams >> will be over. But in the mean time would want to take a step in that >> direction by just knowing if there will be available topics that I >> could work on. >> >> It could be the case that available topics are not so easy for an >> undergraduate, still in that case I would like to do some research on >> the topics first. >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gaurav Dhingra (sent from Thunderbird email client) -------------- next part -------------- An HTML attachment was scrubbed... URL: From gauravdhingra.gxyd at gmail.com Sun Dec 31 05:50:31 2017 From: gauravdhingra.gxyd at gmail.com (Gaurav Dhingra) Date: Sun, 31 Dec 2017 16:20:31 +0530 Subject: [scikit-learn] Topic for thesis work on scikit learn In-Reply-To: <66ab3e6f-4102-3fd0-5828-7af593eab581@gmail.com> References: <069bd230-f2e0-e5ef-4fdf-7d0c529c5d5f@gmail.com> <9641a578-194f-183c-fa2c-22cb45a7c76d@gmail.com> <66ab3e6f-4102-3fd0-5828-7af593eab581@gmail.com> Message-ID: Sorry Andreas, I didn't intend to send last mail to you. I've sent a copy of last mail to scikit-learn mailing list. On Sunday 31 December 2017 04:18 PM, Gaurav Dhingra wrote: > > Hi Andreas, > > I think I'll get access to a local mentor from my college, so I think > I rule that issue out, though for technicalities still I would /like/ > to be more dependent on feedback from the scikit-learn community, > since my aim wouldn't be to make something for my own use but rather > something that would be more useful for the scikit-learn community, so > that it eventually gets merged into master. 
> I'm currently looking for a topic that I can take up. I tried looking
> into the scikit-learn wiki, but it doesn't list what I'm looking for
> (no topics are mentioned). Do you have some topic in mind that could
> be a useful addition to scikit-learn? Even if you could direct me to
> appropriate links I would be happy to look into those.
>
>
> On Wednesday 01 November 2017 01:43 AM, Andreas Mueller wrote:
>> Hi Gaurav.
>>
>> Do you have a local mentor? I think having a mentor that can guide
>> you during a thesis is very important.
>> You could get some feedback from the community for a contribution,
>> but that can be slow,
>> and is entirely on volunteer basis, so there is no guarantee that
>> you'll get the necessary feedback in time
>> to finish your thesis.
>>
>> Mentoring a thesis - in particular without knowing you - is a serious
>> commitment, so I'm not sure someone
>> from inside the project will want to do this. I saw you already made
>> a contribution in
>> https://github.com/scikit-learn/scikit-learn/pull/10005
>> but that's a very different scope than doing what I expect would be
>> several months of work.
>>

> Though in this regard I've made a few more contributions; here is the
> link: https://github.com/scikit-learn/scikit-learn/pulls/gxyd, though I
> know none of them is a big contribution. If you think I should work on
> a big enough PR, can you please suggest an issue in that regard?
>
> Thanks
>
>> Best,
>> Andy
>>
>> On 10/31/2017 03:31 PM, Gaurav Dhingra wrote:
>>> Hi everyone,
>>>
>>> I am a final year (5th year) undergraduate Applied Mathematics
>>> student in India. I am thinking of doing my final year thesis by
>>> doing some work (coding part) on scikit learn, so I was thinking if
>>> anyone could tell me if there are available topics (not necessarily
>>> names of those topics) that I could work on being an undergraduate
>>> student? I would want to expand upon this in December when my exams
>>> will be over. But in the mean time would want to take a step in that
>>> direction by just knowing if there will be available topics that I
>>> could work on.
>>>
>>> It could be the case that available topics are not so easy for an
>>> undergraduate, still in that case I would like to do some research
>>> on the topics first.
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gaurav Dhingra
(sent from Thunderbird email client)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: