From olivier.grisel at ensta.org Wed May 1 06:58:29 2019
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Wed, 1 May 2019 12:58:29 +0200
Subject: [scikit-learn] Release Candidate for Scikit-learn 0.21
In-Reply-To: References: Message-ID:

\o/

From t3kcit at gmail.com Wed May 1 22:13:02 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 1 May 2019 22:13:02 -0400
Subject: [scikit-learn] Release Candidate for Scikit-learn 0.21
In-Reply-To: References: Message-ID:

Thank you for all the amazing work, y'all!

On 4/30/19 10:09 PM, Joel Nothman wrote:
> PyPI now has source and binary releases for Scikit-learn 0.21rc2.
>
> * Documentation at https://scikit-learn.org/0.21
> * Release Notes at https://scikit-learn.org/0.21/whats_new
> * Download source or wheels at https://pypi.org/project/scikit-learn/0.21rc2/
>
> Please try out the software and help us edit the release notes before
> a final release.
>
> Highlights include:
> * neighbors.NeighborhoodComponentsAnalysis for supervised metric
> learning, which learns a weighted Euclidean distance for k-nearest
> neighbors. https://scikit-learn.org/0.21/modules/neighbors.html#nca
> * ensemble.HistGradientBoostingClassifier
> and ensemble.HistGradientBoostingRegressor: experimental
> implementations of efficient binned gradient boosting machines.
> https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
> * impute.IterativeImputer: a non-trivial approach to missing value
> imputation.
> https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
> * cluster.OPTICS: a new density-based clustering algorithm.
> https://scikit-learn.org/0.21/modules/clustering.html#optics
> * better printing of estimators as strings, with an option to hide
> default parameters for compactness:
> https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
> * for estimator and library developers: a way to tag your estimator so
> that it can be treated appropriately with check_estimator.
> https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags
>
> There are many other enhancements and fixes listed in the release
> notes (https://scikit-learn.org/0.21/whats_new).
>
> Please note that Scikit-learn has new dependencies:
> * joblib >= 0.11, which used to be vendored within Scikit-learn
> * OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 is set
> when the code is compiled (and cythonized)
>
> Happy Learning!
>
> From the Scikit-learn core dev team.

From gael.varoquaux at normalesup.org Thu May 2 03:28:25 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 2 May 2019 09:28:25 +0200
Subject: [scikit-learn] Release Candidate for Scikit-learn 0.21
In-Reply-To: References: Message-ID: <20190502072825.fzheoqwesuppvs4f@phare.normalesup.org>

Thank you all and congratulations indeed.

Because this release comes so soon after the latest one from the 0.20
series, we might have expected it to be a light one. But no! Plenty of
exciting features!

Gaël

On Wed, May 01, 2019 at 10:13:02PM -0400, Andreas Mueller wrote:
> Thank you for all the amazing work, y'all!
> On 4/30/19 10:09 PM, Joel Nothman wrote:
> PyPI now has source and binary releases for Scikit-learn 0.21rc2.
> * Documentation at https://scikit-learn.org/0.21
> * Release Notes at https://scikit-learn.org/0.21/whats_new
> * Download source or wheels at https://pypi.org/project/scikit-learn/0.21rc2/
> Please try out the software and help us edit the release notes before a
> final release.
> Highlights include:
> * neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
> which learns a weighted Euclidean distance for k-nearest neighbors.
> https://scikit-learn.org/0.21/modules/neighbors.html#nca
> * ensemble.HistGradientBoostingClassifier
> and ensemble.HistGradientBoostingRegressor: experimental implementations of
> efficient binned gradient boosting machines.
> https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
> * impute.IterativeImputer: a non-trivial approach to missing value
> imputation. https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
> * cluster.OPTICS: a new density-based clustering algorithm.
> https://scikit-learn.org/0.21/modules/clustering.html#optics
> * better printing of estimators as strings, with an option to hide default
> parameters for compactness:
> https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
> * for estimator and library developers: a way to tag your estimator so that
> it can be treated appropriately with check_estimator.
> https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags
> There are many other enhancements and fixes listed in the release notes
> (https://scikit-learn.org/0.21/whats_new).
> Please note that Scikit-learn has new dependencies:
> * joblib >= 0.11, which used to be vendored within Scikit-learn
> * OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 is set when
> the code is compiled (and cythonized)
> Happy Learning!
> From the Scikit-learn core dev team.

--
Gael Varoquaux
Senior Researcher, INRIA
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux

From krallinger.martin at gmail.com Thu May 2 13:03:34 2019
From: krallinger.martin at gmail.com (Martin Krallinger)
Date: Thu, 2 May 2019 19:03:34 +0200
Subject: [scikit-learn] MEDDOCAN Shared task for Named Entity Recognition and Classification with Scikit-Learn
In-Reply-To: References: Message-ID:

*IberLEF/SEPLN: CFP MEDDOCAN track & task prize: named entity recognition
and sensitive personal information identification*

*CFP MEDDOCAN track*
*First Medical Document Anonymization*
*http://temu.bsc.es/meddocan*

*SEAD - Plan TL Sponsoring Track Awards*
Sub-track prizes: €1,000, €500 and €200 (first, second and third team)

*Task description*

Scikit-Learn has been successfully used for Named Entity Recognition and
Classification tasks in the past, showing that it is especially competitive
for finding mentions of entities in running text.

Clinical records with protected health information (PHI) cannot be directly
shared as is, due to privacy constraints, making it particularly cumbersome
to carry out NLP research in the medical domain. A necessary precondition
for accessing clinical records outside of hospitals is their
de-identification, i.e., the exhaustive removal (or replacement) of all
mentioned PHI phrases.
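As a concrete illustration of the kind of scikit-learn pipeline referred to
above, here is a minimal per-token PHI classifier sketch. The feature
scheme, label names and toy sentence are illustrative assumptions only, not
part of the task data:

# Minimal sketch of a per-token PHI classifier with scikit-learn.
# Features, labels and data below are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple surface features for the i-th token of a sentence."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Toy training data: one tokenized sentence with per-token PHI tags.
sentence = ["Paciente", "Juan", "Garcia", ",", "45", "años"]
labels = ["O", "NAME", "NAME", "O", "AGE", "O"]

X = [token_features(sentence, i) for i in range(len(sentence))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(solver="liblinear"))
clf.fit(X, labels)
print(clf.predict([token_features(sentence, 1)]))  # e.g. ['NAME']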
The practical relevance of anonymization or de-identification of clinical
texts motivated the proposal of two shared tasks, the 2006 and 2014
de-identification tracks, organized under the umbrella of the i2b2
(*i2b2.org*) community evaluation effort. The i2b2 effort has deeply
influenced the clinical NLP community worldwide, but it was focused on
documents in English and on characteristics of US healthcare data providers.

As part of the IberLEF 2019 (*https://sites.google.com/view/iberlef-2019*)
initiative, we announce *the first community challenge task specifically
devoted to the anonymization of medical documents in Spanish*, called the
MEDDOCAN (Medical Document Anonymization) track.

In order to carry out these tasks we have prepared a synthetic corpus of
1,000 clinical case studies. This corpus was selected manually by a
practicing physician and augmented with PHI information from discharge
summaries and medical genetics clinical records.

The MEDDOCAN task will be structured into *two sub-tracks*:

- NER offset and entity type classification
- Sensitive span detection

*Publications*

Teams will be invited to send a workshop proceedings systems description
paper, similarly to previous *IberEval* events. We plan to *invite selected
works* for full publication in a *Q1 journal special issue devoted to
MEDDOCAN*. Invitations to the special issue will consider multiple aspects
such as performance, novelty of the system, availability of the underlying
system (software/web service) as well as the workshop presentation.

*Important Dates*

- March 18, 2019: Sample set and evaluation script released.
- March 20, 2019: Training set released.
- April 4, 2019: Development set released.
- April 29, 2019: Test set released (includes background set).
- May 17, 2019: End of evaluation period (system submissions).
- May 20, 2019: Results posted and test set with GS annotations released.
- May 31, 2019: Working notes paper submission.
- June 14, 2019: Notification of acceptance (peer reviews).
- June 28, 2019: Camera-ready paper submission.
- September 24, 2019: IberLEF 2019 Workshop, Bilbao, Spain.

*Task organizers*

- Aitor Gonzalez-Agirre, Barcelona Supercomputing Center.
- Ander Intxaurrondo, Barcelona Supercomputing Center.
- Jose Antonio Lopez-Martin, Hospital 12 de Octubre.
- Montserrat Marimon, Barcelona Supercomputing Center.
- Felipe Soares, Barcelona Supercomputing Center.
- Marta Villegas, Barcelona Supercomputing Center.
- Martin Krallinger, Barcelona Supercomputing Center.

*Scientific committee*

- Hercules Dalianis, DSV/Stockholm University, Sweden
- Christoph Dieterich, Klaus-Tschira-Institute for Computational Cardiology, University Hospital Heidelberg, Germany
- Jelena Jacimovic, University of Belgrade, Serbia
- Bradley Malin, Vanderbilt University Medical Center, USA
- Øystein Nytrø, Norwegian University of Science and Technology, Norway
- Patrick Ruch, SIB Text Mining, HES-SO & Swiss Institute of Bioinformatics, Switzerland
- Angus Roberts, King's College London, UK
- Arturo Romero Gutiérrez, Ministerio de Sanidad, Servicios Sociales e Igualdad, Spain
- Ozlem Uzuner, George Mason University, USA
- Alfonso Valencia, Barcelona Supercomputing Center, Spain

============================
Martin Krallinger, Dr.
--------------------------------------------------------------------
Head of Biological Text Mining Unit
Structural Biology and BioComputing Programme
Spanish National Cancer Research Centre (CNIO)
--------------------------------------------------------------------
Oficina Técnica General (OTG) del Plan TL en el área de Biomedicina de la
Secretaría de Estado de Telecomunicaciones y para la Sociedad de la
Información
============================

From pahome.chen at mirlab.org Fri May 3 04:03:05 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Fri, 3 May 2019 16:03:05 +0800
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
Message-ID:

I see that some algorithms, e.g. MiniBatchKMeans and Birch, can cluster
incrementally when the dataset is too huge.

But is there any way to evaluate incrementally?

I found the silhouette coefficient and the Calinski-Harabasz index, because
I don't know the ground truth labels. But they can't be evaluated
incrementally.

From g.lemaitre58 at gmail.com Fri May 3 04:12:09 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Fri, 3 May 2019 10:12:09 +0200
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

You can always predict incrementally by predicting on batches of samples.

On Fri, 3 May 2019 at 10:05, lampahome wrote:
> I see that some algorithms, e.g. MiniBatchKMeans and Birch, can cluster
> incrementally when the dataset is too huge.
> But is there any way to evaluate incrementally?
> I found the silhouette coefficient and the Calinski-Harabasz index,
> because I don't know the ground truth labels. But they can't be evaluated
> incrementally.

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

From g.lemaitre58 at gmail.com Fri May 3 04:14:28 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Fri, 3 May 2019 10:14:28 +0200
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

Oh sorry, I see now that you were asking about evaluating, not predicting.

On Fri, 3 May 2019 at 10:12, Guillaume Lemaître wrote:
> You can always predict incrementally by predicting on batches of samples.
> On Fri, 3 May 2019 at 10:05, lampahome wrote:
>> I see that some algorithms, e.g. MiniBatchKMeans and Birch, can cluster
>> incrementally when the dataset is too huge.
>> But is there any way to evaluate incrementally?
>> I found the silhouette coefficient and the Calinski-Harabasz index,
>> because I don't know the ground truth labels. But they can't be evaluated
>> incrementally.

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
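A sketch of the pattern under discussion: batch-wise fitting with
MiniBatchKMeans, followed by a subsampled (not truly incremental)
silhouette evaluation. The synthetic data and batch sizes are illustrative
assumptions:

# Fit MiniBatchKMeans batch by batch, then evaluate the silhouette
# coefficient on a random subsample of the data.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3, random_state=0)

for _ in range(20):  # stream of batches that never fit in memory at once
    batch = rng.randn(100, 2) + rng.choice([-5, 0, 5], size=(100, 1))
    km.partial_fit(batch)

# Evaluation still needs data in memory; subsample to keep it cheap.
sample = rng.randn(500, 2) + rng.choice([-5, 0, 5], size=(500, 1))
print(silhouette_score(sample, km.predict(sample), sample_size=300,
                       random_state=0))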
From ugoren at gmail.com Fri May 3 07:27:21 2019
From: ugoren at gmail.com (Uri Goren)
Date: Fri, 3 May 2019 14:27:21 +0300
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

I usually use clustering to save costs on labelling. I like to apply
hierarchical clustering, and then label a small sample and fine-tune the
clustering algorithm.

That way, you can evaluate the effectiveness in terms of cluster purity
(how many clusters contain mixed labels).

See an example with sklearn here:
https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU

On Fri, May 3, 2019, 11:03 AM lampahome wrote:
> I see that some algorithms, e.g. MiniBatchKMeans and Birch, can cluster
> incrementally when the dataset is too huge.
> But is there any way to evaluate incrementally?
> I found the silhouette coefficient and the Calinski-Harabasz index,
> because I don't know the ground truth labels. But they can't be evaluated
> incrementally.

From prudhvirajnitjsr at gmail.com Sat May 11 16:28:36 2019
From: prudhvirajnitjsr at gmail.com (prudhviraj nitjsr)
Date: Sun, 12 May 2019 01:58:36 +0530
Subject: [scikit-learn] Proposing Encoder class to encode Ordinal values of an attribute
Message-ID:

Hi All,

Recently, when I was solving an ML problem, I came across an attribute
which has ordinal values. E.g.:

Student ID | Subjects
========================================
1          | ['Math']
2          | ['Math','Python']
3          | ['C']
4          | ['Python','Statistics']
========================================

Here, the attribute Subjects is a list of the subjects the student is
interested in.

We have sklearn.preprocessing.OneHotEncoder, which encodes a single
categorical variable by creating multiple columns. Similarly, I want to
propose a different encoder that encodes this type of list and creates new
columns, one column for each subject. The allowed values are 1/0,
specifying whether the student is interested in that subject or not.

I'm new to open source contribution. Can someone tell me if there is an
existing feature that handles this type of data, or if I can start working
on this feature?

Any response would be appreciated.

Thanks
Prudvi RajKumar

From prudhvirajnitjsr at gmail.com Mon May 13 14:58:34 2019
From: prudhvirajnitjsr at gmail.com (prudhviraj nitjsr)
Date: Tue, 14 May 2019 00:28:34 +0530
Subject: [scikit-learn] Fwd: Proposing Encoder class to encode Ordinal attributes
In-Reply-To: References: Message-ID:

Hi,

Can someone please respond? Any response would be appreciated.

Thanks

---------- Forwarded message ---------
From: prudhviraj nitjsr
Date: Sun, May 12, 2019 at 1:38 AM
Subject: Proposing Encoder class to encode Ordinal attributes
To:

Hi All,

Recently, when I was solving an ML problem, I came across an attribute
which has ordinal values. E.g.:

Student ID | Subjects
========================================
1          | ['Math']
2          | ['Math','Python']
3          | ['C']
4          | ['Python','Statistics']
========================================

Here, the attribute Subjects is a list of the subjects the student is
interested in.

We have sklearn.preprocessing.OneHotEncoder, which encodes a single
categorical variable by creating multiple columns.
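For reference, sklearn.preprocessing.MultiLabelBinarizer already provides
exactly this one-column-per-subject encoding. A minimal sketch with the toy
data above:

# The Subjects column is a multi-label attribute; MultiLabelBinarizer
# produces one 0/1 column per distinct subject.
from sklearn.preprocessing import MultiLabelBinarizer

subjects = [['Math'], ['Math', 'Python'], ['C'], ['Python', 'Statistics']]
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(subjects)
print(mlb.classes_)   # ['C' 'Math' 'Python' 'Statistics']
print(encoded)
# [[0 1 0 0]
#  [0 1 1 0]
#  [1 0 0 0]
#  [0 0 1 1]]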
Similarly, I want to propose a different encoder that encodes this type of
list and creates new columns, one column for each subject. The allowed
values are 1/0, specifying whether the student is interested in that
subject or not.

I'm new to open source contribution. Can someone tell me if there is an
existing feature that handles this type of data, or if I can start working
on this feature?

Any response would be appreciated.

Thanks
Prudvi RajKumar

From joel.nothman at gmail.com Mon May 13 16:30:28 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 14 May 2019 06:30:28 +1000
Subject: [scikit-learn] Fwd: Proposing Encoder class to encode Ordinal attributes
In-Reply-To: References: Message-ID:

There has been an issue and a pull request for something similar in
DictVectorizer. https://github.com/scikit-learn/scikit-learn/pull/8750 got
close to merging and I'm not really sure why it was closed rather than
completed.

From nicholdav at gmail.com Mon May 13 21:35:17 2019
From: nicholdav at gmail.com (David Nicholson)
Date: Mon, 13 May 2019 21:35:17 -0400
Subject: [scikit-learn] Fwd: Proposing Encoder class to encode Ordinal attributes
In-Reply-To: References: Message-ID:

There is also this in scikit-learn-contrib, Categorical Encoding:
https://joss.theoj.org/papers/d57818316816a19a80112892c3d12ed7
https://github.com/scikit-learn-contrib/categorical-encoding

David Nicholson, Ph.D.
https://nicholdav.info/
https://github.com/NickleDave
Prinz lab, Emory University, Atlanta, GA, USA

On Mon, May 13, 2019 at 4:32 PM Joel Nothman wrote:
> There has been an issue and a pull request for something similar in
> DictVectorizer. https://github.com/scikit-learn/scikit-learn/pull/8750
> got close to merging and I'm not really sure why it was closed rather than
> completed.

From pahome.chen at mirlab.org Mon May 13 22:10:22 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 14 May 2019 10:10:22 +0800
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

On Fri, May 3, 2019 at 7:29 PM, Uri Goren wrote:
> I usually use clustering to save costs on labelling. I like to apply
> hierarchical clustering, and then label a small sample and fine-tune the
> clustering algorithm.
> That way, you can evaluate the effectiveness in terms of cluster purity
> (how many clusters contain mixed labels).
> See an example with sklearn here:
> https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU

But if my dataset is too large to load into memory, will it work?

From ugoren at gmail.com Tue May 14 03:06:33 2019
From: ugoren at gmail.com (Uri Goren)
Date: Tue, 14 May 2019 10:06:33 +0300
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

Sounds like you need to use Spark; this project looks promising:
https://github.com/xiaocai00/SparkPinkMST

On Tue, May 14, 2019 at 5:12 AM lampahome wrote:
> On Fri, May 3, 2019 at 7:29 PM, Uri Goren wrote:
>> I usually use clustering to save costs on labelling. I like to apply
>> hierarchical clustering, and then label a small sample and fine-tune the
>> clustering algorithm.
>> That way, you can evaluate the effectiveness in terms of cluster purity
>> (how many clusters contain mixed labels).
>> See an example with sklearn here:
>> https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU
> But if my dataset is too large to load into memory, will it work?

From tom.augspurger88 at gmail.com Tue May 14 09:18:24 2019
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Tue, 14 May 2019 08:18:24 -0500
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

If anyone is interested in implementing these, dask-ml would welcome
additional metrics that work well with Dask arrays:
https://github.com/dask/dask-ml/issues/213.

On Tue, May 14, 2019 at 2:09 AM Uri Goren wrote:
> Sounds like you need to use Spark; this project looks promising:
> https://github.com/xiaocai00/SparkPinkMST
> On Tue, May 14, 2019 at 5:12 AM lampahome wrote:
>> On Fri, May 3, 2019 at 7:29 PM, Uri Goren wrote:
>>> I usually use clustering to save costs on labelling. I like to apply
>>> hierarchical clustering, and then label a small sample and fine-tune
>>> the clustering algorithm.
>>> That way, you can evaluate the effectiveness in terms of cluster purity
>>> (how many clusters contain mixed labels).
>>> See an example with sklearn here:
>>> https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU
>> But if my dataset is too large to load into memory, will it work?

From joel.nothman at gmail.com Wed May 15 00:14:17 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 15 May 2019 14:14:17 +1000
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

Evaluating on large datasets is easy if the sufficient statistics are just
the contingency matrix.

On Tue., 14 May 2019, 11:19 pm Tom Augspurger, wrote:
> If anyone is interested in implementing these, dask-ml would welcome
> additional metrics that work well with Dask arrays:
> https://github.com/dask/dask-ml/issues/213.
> On Tue, May 14, 2019 at 2:09 AM Uri Goren wrote:
>> Sounds like you need to use Spark; this project looks promising:
>> https://github.com/xiaocai00/SparkPinkMST
>> On Tue, May 14, 2019 at 5:12 AM lampahome wrote:
>>> On Fri, May 3, 2019 at 7:29 PM, Uri Goren wrote:
>>>> I usually use clustering to save costs on labelling. I like to apply
>>>> hierarchical clustering, and then label a small sample and fine-tune
>>>> the clustering algorithm.
>>>> That way, you can evaluate the effectiveness in terms of cluster
>>>> purity (how many clusters contain mixed labels).
>>>> See an example with sklearn here:
>>>> https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU
>>> But if my dataset is too large to load into memory, will it work?
From drh at aiwerkstatt.com Wed May 15 15:18:13 2019
From: drh at aiwerkstatt.com (drh at aiwerkstatt.com)
Date: Wed, 15 May 2019 13:18:13 -0600
Subject: [scikit-learn] Example of a scikit-learn compatible classifier with C++ implementation of the algorithms
Message-ID: <20190515131813.Horde.Li6N562F-XfoEURE43WiHLh@just35.justhost.com>

I use a Python-based ecosystem (scikit-learn, ...) for prototyping, and I
have a C++-based production system. A scikit-learn compatible interface
allows me to take advantage of scikit-learn's ecosystem. Implementing the
algorithms in C++ allows me to develop and test my algorithms already
during prototyping.

I started with scikit-learn's project template to roll my own decision tree
and forest classifier, and implemented the algorithms in a C++ library,
using Cython to create the Python bindings. Starting out with a Python
implementation, I experimented a little bit with implementing the
algorithms in Cython. But I found that, if you are proficient in Python and
C++, implementing the algorithm directly in C++ was much faster than
writing it in Cython.

I made this project available to everybody, because I think it could serve
as an example or template for anybody who would like to roll their own
scikit-learn compatible classifier with a C++-based implementation of the
algorithms, to be re-used in a production system. At least version 1.0.0
should be useful; after that it might become too complex to be used as an
example.

Check it out:
ReadTheDocs: https://koho.readthedocs.io
GitHub: https://github.com/AIWerkstatt/koho

I tried to be consistent with scikit-learn's decision tree and ensemble
modules, and the basic concepts, including the stack, the samples LUT with
in-place partitioning, and incremental histogram updates, used for the
implementation of the classifiers are based on:
G. Louppe, Understanding Random Forests, PhD Thesis, 2014.

Thanks a lot, Gilles, for that comprehensive work on random forests!

From pahome.chen at mirlab.org Wed May 15 21:45:43 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 16 May 2019 09:45:43 +0800
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

On Wed, May 15, 2019 at 12:16 PM, Joel Nothman wrote:
> Evaluating on large datasets is easy if the sufficient statistics are
> just the contingency matrix.

Sorry, I don't understand. Can you explain in detail?
Do you mean we could take a subset of samples for evaluation if the subset
is a contingency (normal distribution) matrix?

From joel.nothman at gmail.com Thu May 16 03:06:37 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 16 May 2019 17:06:37 +1000
Subject: [scikit-learn] Can I evaluate clustering efficiency incrementally?
In-Reply-To: References: Message-ID:

The contingency matrix
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html)
counts how many times each pair of (true cluster, predicted cluster)
occurs. It provides sufficient statistics for every "supervised" (i.e.
ground-truth-based) clustering evaluation metric in Scikit-learn. In an
incremental setting, you can simply add to the contingency matrix with
each new predicted batch (a sketch of this bookkeeping appears after the
release announcement below).

In https://github.com/scikit-learn/scikit-learn/issues/8103 I proposed
that we provide an API for calculating clustering metrics from the
sufficient statistics alone, but it has not come to fruition.

On Thu, 16 May 2019 at 11:47, lampahome wrote:
> On Wed, May 15, 2019 at 12:16 PM, Joel Nothman wrote:
>> Evaluating on large datasets is easy if the sufficient statistics are
>> just the contingency matrix.
> Sorry, I don't understand. Can you explain in detail?
> Do you mean we could take a subset of samples for evaluation if the
> subset is a contingency (normal distribution) matrix?

From joel.nothman at gmail.com Thu May 16 04:03:23 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 16 May 2019 18:03:23 +1000
Subject: [scikit-learn] ANN: scikit-learn 0.21 released
Message-ID:

Thanks to the work of many, many contributors, we have released
Scikit-learn 0.21. It is available from GitHub, PyPI and Conda-forge, but
is not yet available on the Anaconda defaults channel.

* Documentation at https://scikit-learn.org/0.21
* Release Notes at https://scikit-learn.org/0.21/whats_new
* Download source or wheels at https://pypi.org/project/scikit-learn/
* Install from conda-forge with `conda install -c conda-forge scikit-learn`

Highlights include:
* neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
which learns a weighted Euclidean distance for k-nearest neighbors.
https://scikit-learn.org/0.21/modules/neighbors.html#nca
* ensemble.HistGradientBoostingClassifier and
ensemble.HistGradientBoostingRegressor: experimental implementations of
efficient binned gradient boosting machines.
https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
* impute.IterativeImputer: an experimental API for a non-trivial approach
to missing value imputation.
https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
* cluster.OPTICS: a new density-based clustering algorithm.
https://scikit-learn.org/0.21/modules/clustering.html#optics
* better printing of estimators as strings, with an option to hide default
parameters for compactness:
https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
* for estimator and library developers: a way to tag your estimator so
that it can be treated appropriately with check_estimator.
https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags

There are many other enhancements and fixes listed in the release notes
(https://scikit-learn.org/0.21/whats_new).

Please note that Scikit-learn has new dependencies. It requires:
* joblib >= 0.11, which used to be vendored within Scikit-learn
* OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 is set when
the code is compiled (and cythonized)
* Python >= 3.5. Installing Scikit-learn from Python 2 will continue to
provide version 0.20.

Thanks again to everyone who contributed and to our sponsors, who helped
us to develop such a great set of features and fixes since version 0.20 in
under 8 months.

Happy Learning!

From the Scikit-learn team.
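Picking up Joel's point about sufficient statistics, here is a sketch of
the incremental bookkeeping: accumulate the contingency matrix over batches
and derive, for example, the adjusted Rand index from those counts alone.
The helper functions are illustrative, not a scikit-learn API, and the
streamed labels are synthetic:

# Accumulate a contingency matrix over batches of (true, predicted)
# labels, then compute the adjusted Rand index from the counts alone.
# Assumes labels are small non-negative integers.
import numpy as np
from scipy.special import comb

def update_contingency(C, y_true, y_pred):
    # Add one count per (true, predicted) pair in this batch.
    np.add.at(C, (y_true, y_pred), 1)

def ari_from_contingency(C):
    sum_comb = comb(C, 2).sum()          # pairs agreeing in both labelings
    a = comb(C.sum(axis=1), 2).sum()     # pairs within true clusters
    b = comb(C.sum(axis=0), 2).sum()     # pairs within predicted clusters
    n_pairs = comb(C.sum(), 2)
    expected = a * b / n_pairs
    max_index = (a + b) / 2
    return (sum_comb - expected) / (max_index - expected)

C = np.zeros((3, 3), dtype=np.int64)
rng = np.random.RandomState(0)
for _ in range(10):                      # stream of labelled batches
    y_true = rng.randint(0, 3, size=1000)
    y_pred = (y_true + (rng.rand(1000) < 0.1)) % 3   # noisy copy of y_true
    update_contingency(C, y_true, y_pred)
print(ari_from_contingency(C))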
From bertrand.thirion at inria.fr Thu May 16 04:21:09 2019
From: bertrand.thirion at inria.fr (bertrand.thirion)
Date: Thu, 16 May 2019 10:21:09 +0200
Subject: [scikit-learn] ANN: scikit-learn 0.21 released
In-Reply-To: Message-ID: <9cac0f$bdkt92@mail2-relais-roc.national.inria.fr>

Congratulations!
Bertrand

Sent from my Samsung Galaxy smartphone.

-------- Original message --------
From: Joel Nothman
Date: 16/05/2019 10:03 (GMT+01:00)
To: Scikit-learn user and developer mailing list
Subject: [scikit-learn] ANN: scikit-learn 0.21 released

Thanks to the work of many, many contributors, we have released
Scikit-learn 0.21. It is available from GitHub, PyPI and Conda-forge, but
is not yet available on the Anaconda defaults channel.
* Documentation at https://scikit-learn.org/0.21
* Release Notes at https://scikit-learn.org/0.21/whats_new
* Download source or wheels at https://pypi.org/project/scikit-learn/
* Install from conda-forge with `conda install -c conda-forge scikit-learn`
Highlights include:
* neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
which learns a weighted Euclidean distance for k-nearest neighbors.
https://scikit-learn.org/0.21/modules/neighbors.html#nca
* ensemble.HistGradientBoostingClassifier and
ensemble.HistGradientBoostingRegressor: experimental implementations of
efficient binned gradient boosting machines.
https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
* impute.IterativeImputer: an experimental API for a non-trivial approach
to missing value imputation.
https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
* cluster.OPTICS: a new density-based clustering algorithm.
https://scikit-learn.org/0.21/modules/clustering.html#optics
* better printing of estimators as strings, with an option to hide default
parameters for compactness:
https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
* for estimator and library developers: a way to tag your estimator so
that it can be treated appropriately with check_estimator.
https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags
There are many other enhancements and fixes listed in the release notes
(https://scikit-learn.org/0.21/whats_new).
Please note that Scikit-learn has new dependencies. It requires:
* joblib >= 0.11, which used to be vendored within Scikit-learn
* OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 is set when
the code is compiled (and cythonized)
* Python >= 3.5. Installing Scikit-learn from Python 2 will continue to
provide version 0.20.
Thanks again to everyone who contributed and to our sponsors, who helped
us to develop such a great set of features and fixes since version 0.20 in
under 8 months.
Happy Learning!
From the Scikit-learn team.
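A usage note on the experimental estimators highlighted in the
announcement above: in 0.21 they must be enabled explicitly before import.
A minimal sketch on toy data:

# The 0.21 histogram-based gradient boosting estimators are experimental
# and must be enabled explicitly before they can be imported.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(max_iter=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))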
From gael.varoquaux at normalesup.org Thu May 16 04:35:00 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 16 May 2019 10:35:00 +0200
Subject: [scikit-learn] ANN: scikit-learn 0.21 released
In-Reply-To: <9cac0f$bdkt92@mail2-relais-roc.national.inria.fr>
References: <9cac0f$bdkt92@mail2-relais-roc.national.inria.fr>
Message-ID: <20190516083500.t373fnb2vijtgwe2@phare.normalesup.org>

Indeed! Great improvements. And it's a pleasure to see that the releases
are more frequent: a huge value to the community.

Gaël

On Thu, May 16, 2019 at 10:21:09AM +0200, bertrand.thirion wrote:
> Congratulations!
> Bertrand
> Sent from my Samsung Galaxy smartphone.
> -------- Original message --------
> From: Joel Nothman
> Date: 16/05/2019 10:03 (GMT+01:00)
> To: Scikit-learn user and developer mailing list
> Subject: [scikit-learn] ANN: scikit-learn 0.21 released
> Thanks to the work of many, many contributors, we have released
> Scikit-learn 0.21. It is available from GitHub, PyPI and Conda-forge, but
> is not yet available on the Anaconda defaults channel.
> * Documentation at https://scikit-learn.org/0.21
> * Release Notes at https://scikit-learn.org/0.21/whats_new
> * Download source or wheels at https://pypi.org/project/scikit-learn/
> * Install from conda-forge with `conda install -c conda-forge scikit-learn`
> Highlights include:
> * neighbors.NeighborhoodComponentsAnalysis for supervised metric
> learning, which learns a weighted Euclidean distance for k-nearest
> neighbors. https://scikit-learn.org/0.21/modules/neighbors.html#nca
> * ensemble.HistGradientBoostingClassifier and
> ensemble.HistGradientBoostingRegressor: experimental implementations of
> efficient binned gradient boosting machines.
> https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
> * impute.IterativeImputer: an experimental API for a non-trivial approach
> to missing value imputation.
> https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
> * cluster.OPTICS: a new density-based clustering algorithm.
> https://scikit-learn.org/0.21/modules/clustering.html#optics
> * better printing of estimators as strings, with an option to hide
> default parameters for compactness:
> https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
> * for estimator and library developers: a way to tag your estimator so
> that it can be treated appropriately with check_estimator.
> https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags
> There are many other enhancements and fixes listed in the release notes
> (https://scikit-learn.org/0.21/whats_new).
> Please note that Scikit-learn has new dependencies. It requires:
> * joblib >= 0.11, which used to be vendored within Scikit-learn
> * OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 is set when
> the code is compiled (and cythonized)
> * Python >= 3.5. Installing Scikit-learn from Python 2 will continue to
> provide version 0.20.
> Thanks again to everyone who contributed and to our sponsors, who helped
> us to develop such a great set of features and fixes since version 0.20
> in under 8 months.
> Happy Learning!
> From the Scikit-learn team.
--
Gael Varoquaux
Senior Researcher, INRIA
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux

From maxhalford25 at gmail.com Thu May 16 12:22:35 2019
From: maxhalford25 at gmail.com (Max Halford)
Date: Thu, 16 May 2019 18:22:35 +0200
Subject: [scikit-learn] Introducing creme for online learning
Message-ID:

Hello everyone,

I sometimes see emails where people ask about training models
incrementally. Some friends and I have started a Python library for doing
so-called online learning, named creme: https://github.com/creme-ml/creme.
The code is idiomatic and the API resembles that of sklearn. Online
learning is treated as a first-class citizen, which makes creme more
practical and efficient than sklearn if online learning is your goal. Each
estimator has a fit_one(x, y) method which allows it to train on one
observation at a time.

I just presented it at PyData Amsterdam, where people seemed enthusiastic
about it. The video is not out yet, but here are the slides:
https://maxhalford.github.io/slides/creme-pydata.

Best regards. And congrats on version 0.21!

--
Max Halford
+336 28 25 13 38

From prudhvirajnitjsr at gmail.com Wed May 22 09:56:28 2019
From: prudhvirajnitjsr at gmail.com (prudhviraj nitjsr)
Date: Wed, 22 May 2019 19:26:28 +0530
Subject: [scikit-learn] Regularization in Tree Models
Message-ID:

Hi All,

I've noticed that there is no regularization term for decision tree
estimators in scikit-learn. Are there any plans to introduce one?

Thanks
Prudvi RajKumar M

From t3kcit at gmail.com Wed May 22 11:02:31 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 22 May 2019 11:02:31 -0400
Subject: [scikit-learn] Regularization in Tree Models
In-Reply-To: References: Message-ID: <41317a97-e36e-9158-68a1-db19b0dc5747@gmail.com>

Hi Prudvi.

What exactly do you mean by that? There is regularization in the new
HistGradientBoosting, and we're working on post-pruning for decision
trees. I'm not sure what L2 regularization for decision tree classifiers
or decision tree regressors would mean. Do you have a reference?

Best,
Andy

On 5/22/19 9:56 AM, prudhviraj nitjsr wrote:
> Hi All,
> I've noticed that there is no regularization term for decision tree
> estimators in scikit-learn. Are there any plans to introduce one?
> Thanks
> Prudvi RajKumar M

From ahowe42 at gmail.com Thu May 23 10:39:06 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Thu, 23 May 2019 15:39:06 +0100
Subject: [scikit-learn] Version 0.21! and plot_tree!
Message-ID:

I want to say thank you to all the sklearn developers. The breadth and
quality of this software is truly breathtaking.

Specifically, I want to say thank you very much for the plot_tree
function! I have wasted a lot of effort in the past, on multiple OSes,
getting everything to work so I could view the tree.export_graphviz
results. Having this new function to plot trees natively in matplotlib is
extremely useful.

Thanks again!
Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile
ResearchGate Profile
Open Researcher and Contributor ID (ORCID)
Github Profile
Personal Website
I live to learn, so I can learn to live.
- me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

From t3kcit at gmail.com Thu May 23 11:22:21 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 23 May 2019 11:22:21 -0400
Subject: [scikit-learn] Version 0.21! and plot_tree!
In-Reply-To: References: Message-ID: <428bcef1-fb27-fd98-7d7a-13d33e7fb8c6@gmail.com>

Hey Andrew.

Thanks for saying thanks! I share your frustration with export_graphviz,
in particular for teaching. I feel like plot_tree is not ideal yet,
though. In particular, the layout is not as compact as the graphviz one.
If you have any feedback or suggestions, I'd be very happy to hear them!

Cheers,
Andy

On 5/23/19 10:39 AM, Andrew Howe wrote:
> I want to say thank you to all the sklearn developers. The breadth and
> quality of this software is truly breathtaking.
> Specifically, I want to say thank you very much for the plot_tree
> function! I have wasted a lot of effort in the past, on multiple OSes,
> getting everything to work so I could view the tree.export_graphviz
> results. Having this new function to plot trees natively in matplotlib
> is extremely useful.
> Thanks again!
> Andrew

From anael.beaugnon at ssi.gouv.fr Thu May 23 11:49:34 2019
From: anael.beaugnon at ssi.gouv.fr (Beaugnon Anael)
Date: Thu, 23 May 2019 17:49:34 +0200
Subject: [scikit-learn] decision_path method for tree-based models
Message-ID: <9e074d7f-3ead-2df8-8b8c-3f2554d95d4c@ssi.gouv.fr>

Hi everyone,

The decision_path method is currently available only for
DecisionTreeClassifier, DecisionTreeRegressor, and RandomForest, but not
for IsolationForest and GradientBoostingClassifier. In these cases the
implementation is quite easy (it is exactly the same as for RandomForest),
and I think it would be very handy to have a public method.

What do you think of this proposal? If you are OK with it, I would be
happy to propose a pull request.

Thanks,

--
Anaël Beaugnon
ANSSI - Intrusion Detection Research Laboratory

The personal data collected and processed during this exchange aims solely
at completing a business relationship and is limited to the necessary
duration of that relationship. If you wish to use your rights of
consultation, rectification and deletion of your data, please contact:
contact.rgpd at sgdsn.gouv.fr. If you have received this message in error,
we thank you for informing the sender and destroying the message.
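For context, a minimal sketch of the decision_path API as it exists today
on the estimators that already expose it (toy data; IsolationForest and
GradientBoostingClassifier are the ones still missing it):

# decision_path today: available on decision trees and random forests,
# but not (yet) on IsolationForest or GradientBoostingClassifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Sparse indicator of the nodes each sample traverses, plus the offsets
# delimiting each tree's nodes in the concatenated indicator matrix.
indicator, n_nodes_ptr = forest.decision_path(X[:2])
print(indicator.shape)   # (2, total number of nodes across all trees)
print(n_nodes_ptr[:3])   # node offsets of the first trees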
From niourf at gmail.com Thu May 23 12:17:58 2019
From: niourf at gmail.com (Nicolas Hug)
Date: Thu, 23 May 2019 12:17:58 -0400
Subject: [scikit-learn] decision_path method for tree-based models
In-Reply-To: <9e074d7f-3ead-2df8-8b8c-3f2554d95d4c@ssi.gouv.fr>
References: <9e074d7f-3ead-2df8-8b8c-3f2554d95d4c@ssi.gouv.fr>
Message-ID: <1a2dceee-e9c7-81bb-d1a8-4f1a18d759ac@gmail.com>

Hi Anaël, yes, feel free to submit a PR.

On 5/23/19 11:49 AM, Beaugnon Anael wrote:
> Hi everyone,
> The decision_path method is currently available only for
> DecisionTreeClassifier, DecisionTreeRegressor, and RandomForest, but not
> for IsolationForest and GradientBoostingClassifier. In these cases the
> implementation is quite easy (it is exactly the same as for
> RandomForest), and I think it would be very handy to have a public
> method.
> What do you think of this proposal? If you are OK with it, I would be
> happy to propose a pull request.
> Thanks,
> --
> Anaël Beaugnon
> ANSSI - Intrusion Detection Research Laboratory
From anael.beaugnon at gmail.com Thu May 23 16:53:16 2019
From: anael.beaugnon at gmail.com (Anaël Beaugnon)
Date: Thu, 23 May 2019 22:53:16 +0200
Subject: [scikit-learn] decision_path method for tree-based models
In-Reply-To: <1a2dceee-e9c7-81bb-d1a8-4f1a18d759ac@gmail.com>
References: <9e074d7f-3ead-2df8-8b8c-3f2554d95d4c@ssi.gouv.fr> <1a2dceee-e9c7-81bb-d1a8-4f1a18d759ac@gmail.com>
Message-ID:

Hi Nicolas,

Thanks for your quick answer. I have just submitted a PR
(https://github.com/scikit-learn/scikit-learn/pull/13935).

On Thu, May 23, 2019 at 6:21 PM, Nicolas Hug wrote:
> Hi Anaël, yes, feel free to submit a PR.
> On 5/23/19 11:49 AM, Beaugnon Anael wrote:
>> Hi everyone,
>> The decision_path method is currently available only for
>> DecisionTreeClassifier, DecisionTreeRegressor, and RandomForest, but
>> not for IsolationForest and GradientBoostingClassifier. In these cases
>> the implementation is quite easy (it is exactly the same as for
>> RandomForest), and I think it would be very handy to have a public
>> method.
>> What do you think of this proposal? If you are OK with it, I would be
>> happy to propose a pull request.
>> Thanks,

From olivier.grisel at ensta.org Fri May 24 03:38:16 2019
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 24 May 2019 09:38:16 +0200
Subject: [scikit-learn] ANN: scikit-learn 0.21.2 released
Message-ID:

A quick bugfix release to fix a critical regression in which the
computation of the euclidean distances silently returned incorrect values.

This release also includes other bugfixes listed in the changelog:

https://scikit-learn.org/0.21/whats_new.html#version-0-21-2

The PyPI.org wheels and conda-forge packages are online. The packages for
the default Anaconda channel should follow soon.

Thanks to all the contributors!

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

From mraetz at eonerc.rwth-aachen.de Fri May 24 03:57:27 2019
From: mraetz at eonerc.rwth-aachen.de (Rätz, Martin)
Date: Fri, 24 May 2019 07:57:27 +0000
Subject: [scikit-learn] [Copyright] Scikit-learn graphic
Message-ID:

Dear scikit-learn team,

On the scikit-learn webpage (link to the graphic) you will find a graphic
which I would like to use in a publication in an international journal. I
slightly modified the graphic, as you can see in the appendix. Of course, I
refer to scikit-learn in the caption. The entry in the bibliography is as
follows:

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A.
Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay,
"Scikit-learn: Machine learning in Python," Journal of Machine Learning
Research, vol. 12, no. Oct, pp. 2825-2830, 2011.

I would like to kindly ask for permission to publish the graphic.

Yours sincerely,
Martin Rätz

_______________________________________
Martin Rätz, M.Sc.
Research Associate
T +49 241 80-49794
F +49 241 80-49769
mraetz at eonerc.rwth-aachen.de

RWTH Aachen University
E.ON Energy Research Center
Institute for Energy Efficient Buildings and Indoor Climate
E.ON Energieforschungszentrum
Lehrstuhl für Gebäude- und Raumklimatechnik
Mathieustraße 10
52074 Aachen, Germany
www.eonerc.rwth-aachen.de/ebc
From olivier.grisel at ensta.org Fri May 24 06:11:21 2019
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 24 May 2019 12:11:21 +0200
Subject: [scikit-learn] [Copyright] Scikit-learn graphic
In-Reply-To: References: Message-ID:

I think it's OK to do as you said.

--
Olivier

From t3kcit at gmail.com Fri May 24 14:16:41 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 24 May 2019 14:16:41 -0400
Subject: [scikit-learn] Google code reviews
Message-ID:

Hi All.

What do you think of https://www.pullrequest.com/googleserve/? It's
sponsored code reviews. Could be interesting, right?

Best,
Andy

From randalljellis at gmail.com Fri May 24 17:21:50 2019
From: randalljellis at gmail.com (Randy Ellis)
Date: Fri, 24 May 2019 17:21:50 -0400
Subject: [scikit-learn] Highly cited paper - causal random forests
Message-ID:

Would this be difficult for a moderate user to implement in sklearn by
modifying the existing code base?

Estimation and Inference of Heterogeneous Treatment Effects using Random
Forests
342 citations in less than a year (Google Scholar):
https://amstat.tandfonline.com/doi/full/10.1080/01621459.2017.1319839

"In this article, we develop a nonparametric *causal forest* for
estimating heterogeneous treatment effects that extends Breiman's widely
used random forest algorithm. In the potential outcomes framework with
unconfoundedness, we show that causal forests are pointwise consistent for
the true treatment effect and have an asymptotically Gaussian and centered
sampling distribution. We also discuss a practical method for constructing
asymptotic confidence intervals for the true treatment effect that are
centered at the causal forest estimates. Our theoretical results rely on a
generic Gaussian theory for a large family of random forest algorithms. To
our knowledge, this is the first set of results that allows any type of
random forest, including classification and regression forests, to be used
for provably valid statistical inference. In experiments, we find causal
forests to be substantially more powerful than classical methods based on
nearest-neighbor matching, especially in the presence of irrelevant
covariates."

--
*Randall J. Ellis*
PhD Student, Hurd lab, Mount Sinai School of Medicine
Special Volunteer, Michaelides lab, NIDA IRP
Phone: +1-954-260-9891

From gael.varoquaux at normalesup.org Sat May 25 06:21:01 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 25 May 2019 12:21:01 +0200
Subject: [scikit-learn] Highly cited paper - causal random forests
In-Reply-To: References: Message-ID: <20190525102101.ttssoo2vt5vr3uxo@phare.normalesup.org>

Causal forests are very nice work. However, they deal with causal
inference rather than prediction, hence I am not really sure how we could
implement them in the API of scikit-learn. Do you have a suggestion?

Cheers,
Gaël

On Fri, May 24, 2019 at 05:21:50PM -0400, Randy Ellis wrote:
> Would this be difficult for a moderate user to implement in sklearn by
> modifying the existing code base?
> Estimation and Inference of Heterogeneous Treatment Effects using Random
> Forests
> 342 citations in less than a year (Google Scholar):
> https://amstat.tandfonline.com/doi/full/10.1080/01621459.2017.1319839
> "In this article, we develop a nonparametric causal forest for estimating
> heterogeneous treatment effects that extends Breiman's widely used random
> forest algorithm.
> In the potential outcomes framework with unconfoundedness, we show that
> causal forests are pointwise consistent for the true treatment effect and
> have an asymptotically Gaussian and centered sampling distribution. We
> also discuss a practical method for constructing asymptotic confidence
> intervals for the true treatment effect that are centered at the causal
> forest estimates. Our theoretical results rely on a generic Gaussian
> theory for a large family of random forest algorithms. To our knowledge,
> this is the first set of results that allows any type of random forest,
> including classification and regression forests, to be used for provably
> valid statistical inference. In experiments, we find causal forests to be
> substantially more powerful than classical methods based on
> nearest-neighbor matching, especially in the presence of irrelevant
> covariates."

--
Gael Varoquaux
Senior Researcher, INRIA
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux

From joel.nothman at gmail.com Sat May 25 08:06:50 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sat, 25 May 2019 22:06:50 +1000
Subject: [scikit-learn] ANN: Scikit-learn 0.21.2 released
Message-ID:

We've released 0.21.2, primarily to fix an issue with euclidean_distances
(and pairwise_distances). It should be available on PyPI and Conda-Forge.

Full list of changes at https://scikit-learn.org/0.21/whats_new/v0.21.html

Thanks to all who helped fix these issues so quickly after 0.21.1.
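Returning to the causal-forest question above: a deliberately simple
baseline that fits the existing scikit-learn API is the two-model
approach. This is not Wager and Athey's causal forest; it only illustrates
how heterogeneous treatment effect estimation can sit on top of stock
estimators. Everything below (data, true effect, model choice) is an
illustrative assumption:

# "Two-model" baseline for heterogeneous treatment effects: fit one
# forest on treated samples, one on controls, and subtract predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(2000, 3)
t = rng.randint(0, 2, size=2000)        # randomized treatment assignment
tau = 2.0 * (X[:, 0] > 0)               # true effect, depends on X
y = X @ [1.0, -1.0, 0.5] + t * tau + rng.randn(2000)

model_t = RandomForestRegressor(n_estimators=100, random_state=0)
model_c = RandomForestRegressor(n_estimators=100, random_state=0)
model_t.fit(X[t == 1], y[t == 1])
model_c.fit(X[t == 0], y[t == 0])

tau_hat = model_t.predict(X) - model_c.predict(X)   # per-sample effect
print(tau_hat[X[:, 0] > 0].mean(), tau_hat[X[:, 0] <= 0].mean())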
From regis_cardos at hotmail.com Wed May 29 08:16:09 2019
From: regis_cardos at hotmail.com (Régis Cardoso)
Date: Wed, 29 May 2019 12:16:09 +0000
Subject: [scikit-learn] Problems with installation Scikit Learn
Message-ID: 

Dear,

I have just subscribed to the scikit-learn mailing list.

My name is Régis. I am studying word2vec and artificial neural networks
using scikit-learn, and I am trying to install scikit-learn on a Raspberry
Pi 3 without success. I have tried all the commands below, and none of the
installations worked.

1st try - pip install -U scikit-learn
2nd try - sudo install scikit-learn
3rd try - sudo apt-get install gfortran libopenblas-dev liblapack-dev
          sudo pip install scikit-learn
4th try - sudo pip3 install scikit-learn
5th try - pip install scikit-learn

I would like to know if there is another way to install scikit-learn on a
Raspberry Pi 3. I have Python 3.6 installed on the board, along with Numpy,
Scipy and Joblib.

Regards,
Cardoso, Regis

From g.lemaitre58 at gmail.com Wed May 29 08:58:49 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Wed, 29 May 2019 14:58:49 +0200
Subject: [scikit-learn] Problems with installation Scikit Learn
In-Reply-To: References: Message-ID: 

Could you install all packages from the system? If you have a Debian
distribution these packages should be available. Somehow, I would expect
apt-get install python-sklearn to work (it should install the
dependencies).

On Wed, 29 May 2019 at 14:34, Régis Cardoso wrote:
> Dear,
>
> I have just subscribed to the scikit-learn mailing list.
>
> My name is Régis. I am studying word2vec and artificial neural networks
> using scikit-learn, and I am trying to install scikit-learn on a Raspberry
> Pi 3 without success. I have tried all the commands below, and none of the
> installations worked.
>
> 1st try - pip install -U scikit-learn
> 2nd try - sudo install scikit-learn
> 3rd try - sudo apt-get install gfortran libopenblas-dev liblapack-dev
>           sudo pip install scikit-learn
> 4th try - sudo pip3 install scikit-learn
> 5th try - pip install scikit-learn
>
> I would like to know if there is another way to install scikit-learn on a
> Raspberry Pi 3. I have Python 3.6 installed on the board, along with
> Numpy, Scipy and Joblib.
>
> Regards,
> Cardoso, Regis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

From krallinger.martin at gmail.com Wed May 29 09:42:44 2019
From: krallinger.martin at gmail.com (Martin Krallinger)
Date: Wed, 29 May 2019 15:42:44 +0200
Subject: [scikit-learn] scikit-learn for automatic text indexing/classification shared task: MESINESP/BioASQ
In-Reply-To: References: Message-ID: 

*** Call for Participation: Medical Semantic Indexing in Spanish ***

Medical Semantic Indexing in Spanish
BioASQ MESINESP Task
http://temu.bsc.es/mesinesp/

Task description

Scikit-learn has been successfully used for a variety of text
classification tasks on documents in a range of different languages.
As part of the BioASQ challenges on biomedical semantic indexing and
question answering (http://bioasq.org/), we organize the first task on
semantic indexing of Spanish medical texts. The task will address the
automatic classification/indexing of abstracts from the IBECS and LILACS
databases, written in Spanish, with structured medical vocabularies (DeCS
terms). The main aim is to promote the development of semantic indexing
tools of practical relevance for non-English content, determining the
current state of the art, identifying challenges, and comparing the
strategies and results to those published for English data.

In order to measure classification performance, an on-line evaluation
system will be maintained. As the true annotations of the articles are not
available beforehand, the evaluation procedure will run continuously by
providing online results. The participating systems will be assessed based
on two measures, one hierarchical and one flat: the Lowest Common Ancestor
F-measure (LCA-F) and the label-based micro F-measure, respectively.

Deadlines for submission: The task will run in Autumn 2019 (detailed
schedule TBA). Participants, after downloading the released test sets,
will have to submit results within a limited time window. The task will
run for two consecutive periods (batches) of 5 weeks each. The first batch
will start in October 2019.

For further details, please refer to http://temu.bsc.es/mesinesp/ and
http://participants-area.bioasq.org/general_information/Taskaspanish/

Best regards,
Martin Krallinger

From jesse.livezey at gmail.com Wed May 29 13:34:49 2019
From: jesse.livezey at gmail.com (Jesse Livezey)
Date: Wed, 29 May 2019 10:34:49 -0700
Subject: [scikit-learn] Difference in normalization between Lasso and LogisticRegression + L1
Message-ID: 

Hi everyone,

I noticed recently that in the Lasso implementation (and docs), the MSE
term is normalized by the number of samples
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

but for LogisticRegression + L1, the logloss does not seem to be normalized
by the number of samples. One consequence is that the strength of the
regularization depends on the number of samples explicitly. For instance,
in Lasso, if you tile a dataset N times, you will learn the same coef, but
in LogisticRegression, you will learn a different coef.

Is this the intended behavior of LogisticRegression? I was surprised by
this. Either way, it would be helpful to document this more clearly in the
Logistic Regression docs (I can make a PR.)
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Jesse

From michael.eickenberg at gmail.com Wed May 29 13:42:04 2019
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Wed, 29 May 2019 10:42:04 -0700
Subject: [scikit-learn] Difference in normalization between Lasso and LogisticRegression + L1
In-Reply-To: References: Message-ID: 

Hi Jesse,

I think there was an effort to compare normalization methods on the data
attachment term between Lasso and Ridge regression back in 2012/13, but
this might not have been finished or extended to Logistic Regression.

If it is not documented well, it could definitely benefit from a
documentation update.
As for changing it to a more consistent state, that would require adding a keyword argument pertaining to this functionality and, after discussion, possibly changing the default value after some deprecation cycles (though this seems like a dangerous one to change at all imho). Michael On Wed, May 29, 2019 at 10:38 AM Jesse Livezey wrote: > Hi everyone, > > I noticed recently that in the Lasso implementation (and docs), the MSE > term is normalized by the number of samples > > https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html > > but for LogisticRegression + L1, the logloss does not seem to be > normalized by the number of samples. One consequence is that the strength > of the regularization depends on the number of samples explicitly. For > instance, in Lasso, if you tile a dataset N times, you will learn the same > coef, but in LogisticRegression, you will learn a different coef. > > Is this the intended behavior of LogisticRegression? I was surprised by > this. Either way, it would be helpful to document this more clearly in the > Logistic Regression docs (I can make a PR.) > > https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html > > Jesse > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed May 29 13:48:42 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 29 May 2019 13:48:42 -0400 Subject: [scikit-learn] Difference in normalization between Lasso and LogisticRegression + L1 In-Reply-To: References: Message-ID: That is not very ideal indeed. I think we just went with what liblinear did, and when saga was introduced kept that behavior. It should probably be scaled as in Lasso, I would imagine? On 5/29/19 1:42 PM, Michael Eickenberg wrote: > Hi Jesse, > > I think there was an effort to compare normalization methods on the > data attachment term between Lasso and Ridge regression back in > 2012/13, but this might have not been finished or extended to Logistic > Regression. > > If it is not documented well, it could definitely benefit from a > documentation update. > > As for changing it to a more consistent state, that would require > adding a keyword argument pertaining to this functionality and, after > discussion, possibly changing the default value after some deprecation > cycles (though this seems like a dangerous one to change at all imho). > > Michael > > > On Wed, May 29, 2019 at 10:38 AM Jesse Livezey > > wrote: > > Hi everyone, > > I noticed recently that in the Lasso implementation (and docs), > the MSE term is normalized by the number of samples > https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html > > but for LogisticRegression + L1, the logloss does not seem to be > normalized by the number of samples. One consequence is that the > strength of the regularization depends on the number of samples > explicitly. For instance, in Lasso, if you tile a dataset N times, > you will learn the same coef, but in LogisticRegression, you will > learn a different coef. > > Is this the intended behavior of LogisticRegression? I was > surprised by this. Either way, it would be helpful to document > this more clearly in the Logistic Regression docs (I can make a PR.) 
> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>
> Jesse
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From stuart at stuartreynolds.net Wed May 29 17:29:39 2019
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Wed, 29 May 2019 14:29:39 -0700
Subject: [scikit-learn] Difference in normalization between Lasso and LogisticRegression + L1
In-Reply-To: References: Message-ID: 

I looked into this a while ago. There were differences in which algorithms
regularize the intercept and which do not (I believe liblinear does, lbfgs
does not). All of the algorithms disagreed with logistic regression in
scipy.

- Stuart

On Wed, May 29, 2019 at 10:50 AM Andreas Mueller wrote:
> That is not very ideal indeed.
> I think we just went with what liblinear did, and when saga was
> introduced kept that behavior.
> It should probably be scaled as in Lasso, I would imagine?
>
> On 5/29/19 1:42 PM, Michael Eickenberg wrote:
> > Hi Jesse,
> > I think there was an effort to compare normalization methods on the data
> > attachment term between Lasso and Ridge regression back in 2012/13, but
> > this might not have been finished or extended to Logistic Regression.
> > If it is not documented well, it could definitely benefit from a
> > documentation update.
> > As for changing it to a more consistent state, that would require adding
> > a keyword argument pertaining to this functionality and, after
> > discussion, possibly changing the default value after some deprecation
> > cycles (though this seems like a dangerous one to change at all imho).
> > Michael
> >
> > On Wed, May 29, 2019 at 10:38 AM Jesse Livezey wrote:
> >> Hi everyone,
> >> I noticed recently that in the Lasso implementation (and docs), the MSE
> >> term is normalized by the number of samples
> >> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
> >> but for LogisticRegression + L1, the logloss does not seem to be
> >> normalized by the number of samples. One consequence is that the
> >> strength of the regularization depends on the number of samples
> >> explicitly. For instance, in Lasso, if you tile a dataset N times, you
> >> will learn the same coef, but in LogisticRegression, you will learn a
> >> different coef.
> >> Is this the intended behavior of LogisticRegression? I was surprised by
> >> this. Either way, it would be helpful to document this more clearly in
> >> the Logistic Regression docs (I can make a PR.)
> >> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> >> Jesse
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
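To make the normalization difference discussed in this thread concrete,
here is a minimal sketch on invented synthetic data (behavior as of the
0.21-era releases, liblinear solver): duplicating every sample leaves the
Lasso coefficients essentially unchanged, because its squared-error term is
averaged over samples, while the L1-penalized LogisticRegression
coefficients move, because its log-loss is summed:

import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y_reg = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.randn(50)
y_clf = (y_reg > 0).astype(int)

# Lasso objective: (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1
lasso = Lasso(alpha=0.1)
c1 = lasso.fit(X, y_reg).coef_.copy()
c2 = lasso.fit(np.tile(X, (10, 1)), np.tile(y_reg, 10)).coef_
print(np.allclose(c1, c2, atol=1e-4))  # True (up to solver tolerance)

# L1 LogisticRegression objective: ||w||_1 + C * sum_i logloss_i
# (no 1 / n_samples factor on the loss)
logreg = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
w1 = logreg.fit(X, y_clf).coef_.copy()
w2 = logreg.fit(np.tile(X, (10, 1)), np.tile(y_clf, 10)).coef_
print(np.allclose(w1, w2, atol=1e-4))  # False: tiling acts like C -> 10 * C

In other words, alpha already carries the per-sample scaling while C
multiplies the summed loss, which is exactly the asymmetry Jesse describes.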
From pahome.chen at mirlab.org Thu May 30 04:42:20 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 30 May 2019 16:42:20 +0800
Subject: [scikit-learn] MemoryError when evaluating clustering with GridSearchCV
Message-ID: 

I read a large dataset into memory and it costs about 2 GB of RAM (I have
4 GB of RAM). sys.getsizeof(train_X) reports 63963248.

I evaluate clustering with GridSearchCV as below:

import numpy as np
from sklearn import cluster, metrics
from sklearn.model_selection import GridSearchCV

def grid_search_clu(X):
    def cv_scorer(estimator, X):
        estimator.fit(X)
        cluster_labels = (estimator.labels_ if hasattr(estimator, 'labels_')
                          else estimator.predict(X))
        num_labels = len(set(cluster_labels))
        num_samples = len(X)
        if num_labels == 1 or num_labels == num_samples:
            return -1
        return -metrics.davies_bouldin_score(X, cluster_labels)

    m = cluster.Birch(n_clusters=None, compute_labels=True)
    m_param = {'branching_factor': range(10, 60, 10),
               'threshold': np.arange(0.1, 0.6, 0.1).round(decimals=3)}
    clf = GridSearchCV(m, m_param, cv=[(slice(None), slice(None))],
                       scoring=cv_scorer, verbose=1, n_jobs=1,
                       return_train_score=False).fit(X)
    return clf

And I get a MemoryError. How should I solve this? Should I adjust the
parameters' ranges?

thx

From regis_cardos at hotmail.com Thu May 30 07:08:25 2019
From: regis_cardos at hotmail.com (Régis Cardoso)
Date: Thu, 30 May 2019 11:08:25 +0000
Subject: Re: [scikit-learn] scikit-learn Digest, Vol 38, Issue 18
In-Reply-To: References: Message-ID: 

Dear,

Which dependencies are you talking about? I have installed Numpy, Scipy
and Joblib; these are the necessary packages, right?

I also followed the guide below. It is a very nice article about setting
up the scientific Python stack on a Raspberry Pi, but it isn't working:
when I try $ pytest sklearn, it fails because it cannot find sklearn. I
don't know what to do now; I need sklearn for my work and have run out of
ideas to resolve this problem.

https://geoffboeing.com/2016/03/scientific-python-raspberry-pi/

Cardoso, Regis
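On the Raspberry Pi installation question above, two routes that are
commonly suggested for Raspbian are the Debian package, or pip pointed at
the piwheels repository of pre-built ARM wheels. These commands are a
sketch to try, not a verified recipe; package names and behavior depend on
the exact Raspbian release:

# Route 1: the distribution package (pulls in compatible dependencies)
sudo apt-get install python3-sklearn

# Route 2: pre-built ARM wheels from piwheels (avoids compiling from source)
pip3 install --index-url https://www.piwheels.org/simple scikit-learn

# Either way, verify the installation afterwards
python3 -c "import sklearn; print(sklearn.__version__)"

If the import check prints a version number, scikit-learn is installed and
importable for that interpreter.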
From sepand.haghighi at yahoo.com Thu May 30 11:15:57 2019
From: sepand.haghighi at yahoo.com (Sepand Haghighi)
Date: Thu, 30 May 2019 15:15:57 +0000 (UTC)
Subject: [scikit-learn] PyCM 2.2 released: A general benchmark-based comparison of classification models
References: <1134551450.7155028.1559229357976.ref@mail.yahoo.com>
Message-ID: <1134551450.7155028.1559229357976@mail.yahoo.com>

Hi folks

Recently we have released a new version of PyCM, a library for confusion
matrix statistical analysis. I thought you might find it interesting.

http://www.pycm.ir
https://github.com/sepandhaghighi/pycm

Changelog:
- Negative likelihood ratio interpretation (NLRI) added
- Cramer's benchmark (SOA5) added
- Matthews correlation coefficient interpretation (MCCI) added #204
- Matthews's benchmark (SOA6) added #204
- F1 macro added
- F1 micro added
- Accuracy macro added #205
- Compare class score calculation modified
- Parameters recommendation for multi-class dataset modified
- Parameters recommendation for imbalance dataset modified
- README.md modified
- Document modified
- Logo updated

Best Regards,
Sepand Haghighi

From omkar.kumbhar at innoplexus.com Fri May 31 06:14:57 2019
From: omkar.kumbhar at innoplexus.com (Omkar Kumbhar)
Date: Fri, 31 May 2019 15:44:57 +0530
Subject: [scikit-learn] Mahalanobis distance metric in OPTICS
Message-ID: 

Hello,

I was having issues while fitting OPTICS using the Mahalanobis metric. I
tried many things and had a hard time fitting it to my data distribution.
I have replicated the issue in the ipython notebook below. You could also
take a look at the html version of the notebook to look at the stack
traces. Can you guide me on how to resolve this bug?

PFA,
ipython notebook to replicate the issue
html of ipynb to look at stack traces.

Thanks & Regards,
Omkar Kumbhar
Associate Data Scientist
Innoplexus Consulting Services Pvt. Ltd.
www.innoplexus.com
Mob: +91-9579464473
Landline: +91-20-66527300
From adrin.jalali at gmail.com Fri May 31 12:54:05 2019
From: adrin.jalali at gmail.com (Adrin)
Date: Fri, 31 May 2019 18:54:05 +0200
Subject: [scikit-learn] Mahalanobis distance metric in OPTICS
In-Reply-To: References: Message-ID: 

Mahalanobis is always tricky: the covariance matrix is between the
features, not the samples. This works:

import numpy as np
from sklearn.cluster import OPTICS

# test_array is the data from the attached notebook
OPTICS(metric='mahalanobis',
       metric_params={'VI': np.linalg.inv(np.cov(test_array.T))}).fit(test_array)

Not sure why it wouldn't work when you pass V, which it suggests as an
alternative.

On Fri, May 31, 2019 at 12:16 PM Omkar Kumbhar wrote:
> Hello,
>
> I was having issues while fitting OPTICS using the Mahalanobis metric. I
> tried many things and had a hard time fitting it to my data distribution.
> I have replicated the issue in the ipython notebook below. You could also
> take a look at the html version of the notebook to look at the stack
> traces. Can you guide me on how to resolve this bug?
>
> PFA,
> ipython notebook to replicate the issue
> html of ipynb to look at stack traces.
>
> Thanks & Regards,
> Omkar Kumbhar
> Associate Data Scientist
> Innoplexus Consulting Services Pvt. Ltd.
> www.innoplexus.com
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From tmrsg11 at gmail.com Fri May 31 20:54:32 2019
From: tmrsg11 at gmail.com (C W)
Date: Fri, 31 May 2019 20:54:32 -0400
Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?
Message-ID: 

Hello everyone,

I'm new to scikit-learn. I see that many tutorials in scikit-learn follow
a workflow along the lines of:
1) transform the data
2) split the data: train, test
3) instantiate the sklearn object and fit
4) predict and tune parameters

But linear regression is fit by least squares, so I don't think a
train/test split is necessary. So, I guess I can just use the entire
dataset?

Thanks in advance!
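For readers of the last question: the four-step workflow described in the
message maps onto scikit-learn as in the minimal sketch below (invented
synthetic data; all names hypothetical). The least-squares fit itself does
not require a split, but the held-out test set is what gives an honest
estimate of out-of-sample error, which the training fit alone cannot:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented data for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.randn(100)

# 2) split first, so the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) fit the transform on the training data only, to avoid leakage
scaler = StandardScaler().fit(X_train)

# 3) instantiate the estimator and fit
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# 4) predict and evaluate on the held-out data
print(mean_squared_error(y_test, model.predict(scaler.transform(X_test))))

Note the split happens before the scaler is fit; fitting the transform on
the full dataset would leak test-set statistics into training.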