From qinhanmin2005 at sina.com  Tue Apr  2 10:36:03 2019
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Tue, 02 Apr 2019 22:36:03 +0800
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
Message-ID: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>

See https://github.com/scikit-learn/scikit-learn/issues/13448
We've introduced several plotting functions (e.g., plot_tree and plot_partial_dependence) and will introduce more (e.g., plot_decision_boundary) in the future. Consequently, we need to decide where to put these functions. Currently, there're 3 proposals:
(1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
(2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
(3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note that we won't support from sklearn.XXX import plot_YYY)
Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list to invite opinions.
Thanks
Hanmin Qin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190402/914ffcde/attachment.html>

From martin.watzenboeck at gmail.com  Tue Apr  2 14:57:51 2019
From: martin.watzenboeck at gmail.com (Martin Watzenboeck)
Date: Tue, 2 Apr 2019 20:57:51 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative correlation
 with observed values on random data
Message-ID: <CABvijHC-ZAOdWLf5zH7F=jRdB4v3K7BqyfdSRZCW5gTkbdFd2Q@mail.gmail.com>

Hello,

I tried to apply LASSO regression in combination with LeaveOneOut CV on my
data, and observed a significant negative correlation between predicted and
observed response values. I tried to replicate the problem using random
data (please see code below).

Anyone have an idea what I am doing wrong? I would very much like to use
LASSO regression on my data. Thanks a lot!

Cheers,
Martin

#Lasso example
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut
from scipy.stats import pearsonr
import numpy as np

n_samples = 500
n_features = 30

#create random features
rng = np.random.RandomState(seed=42)
X = rng.randn(n_samples * n_features).reshape(n_samples, n_features)

#Create Ys
Y = rng.randn(n_samples)

#instantiate regressor and cv object
cv = LeaveOneOut()
reg = Lasso(random_state = 42)


#create arrays to save predicted (and observed) Y values
pred = np.array([])
obs = np.array([])


#run cross validation
for train, test in cv.split(X, Y):

    #fit regressor
    reg.fit(X[train], Y[train])

    #append predicted and observed values to the arrays
    pred = np.r_[pred, reg.predict(X[test])]
    obs = np.r_[obs, Y[test]]

#test correlation
pearsonr(pred, obs)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190402/b27cbb9e/attachment.html>

From alexandre.gramfort at inria.fr  Tue Apr  2 15:33:02 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Tue, 2 Apr 2019 21:33:02 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative
 correlation with observed values on random data
In-Reply-To: <CABvijHC-ZAOdWLf5zH7F=jRdB4v3K7BqyfdSRZCW5gTkbdFd2Q@mail.gmail.com>
References: <CABvijHC-ZAOdWLf5zH7F=jRdB4v3K7BqyfdSRZCW5gTkbdFd2Q@mail.gmail.com>
Message-ID: <CADeotZquX+iKCbJyLeD9KTTWeYAcCUNxrpKyH62viauL_btahg@mail.gmail.com>

in your example with random data Lasso leads to coef_ of zeros so you get
as prediction : np.mean(Y[train])

you'll see the same phenomenon if you do:

pred = np.r_[pred, np.mean(Y[train])]

Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190402/a9d246ca/attachment.html>

From jbbrown at kuhp.kyoto-u.ac.jp  Tue Apr  2 22:44:22 2019
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Wed, 3 Apr 2019 11:44:22 +0900
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID: <CAJe_vxAzdcDqbdUCoWoC1k2xPKWdvRixA3gfpt3PkixxH5tOuQ@mail.gmail.com>

As a user, I feel that (2) "sklearn.plot.XXX.plot_YYY" best allows for
future expansion of sub-namespaces in a tractable way that is also easy to
understand during code review.
For example, sklearn.plot.tree.plot_forest() or sklearn.plot.lasso.plot_* .

Just my opinion.
J.B.


2019?4?2?(?) 23:40 Hanmin Qin <qinhanmin2005 at sina.com>:

> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/bf3255a2/attachment.html>

From pahome.chen at mirlab.org  Wed Apr  3 05:07:08 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 3 Apr 2019 17:07:08 +0800
Subject: [scikit-learn] Can cluster help me to cluster data with length of
 continuous series?
Message-ID: <CAB3eZfs83vELFjDSwuRO5N0xDZnKb=6CMq3G-bk038st-fx1yQ@mail.gmail.com>

I have data which contain access duration of each items.

EX: t0~t4 is the access time duration. 1 means the item was accessed in the
time duration, 0 means not.
ID,t0,t1,t2,t3,t4
0,1,0,0,1
1,1,0,0,1
2,0,0,1,1
3,0,1,1,1

What I want to cluster is the length of continuous duration
Ex:
ID=3 > 2 > 1 = 0

Can any distance metric to help clustering based on the length of
continuous duration?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/eb629272/attachment.html>

From ahowe42 at gmail.com  Wed Apr  3 05:52:18 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Wed, 3 Apr 2019 10:52:18 +0100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID: <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>

My preference would be for (1). I don't think the sub-namespace in (2) is
necessary, and don't like (3), as I would prefer the plotting functions to
be all in the same namespace sklearn.plot.

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
ResearchGate Profile <http://www.researchgate.net/profile/John_Howe12/>
Open Researcher and Contributor ID (ORCID)
<http://orcid.org/0000-0002-3553-1990>
Github Profile <http://github.com/ahowe42>
Personal Website <http://www.andrewhowe.com>
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>


On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <qinhanmin2005 at sina.com> wrote:

> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/1ab7a024/attachment-0001.html>

From trev.stephens at gmail.com  Wed Apr  3 06:06:07 2019
From: trev.stephens at gmail.com (Trevor Stephens)
Date: Wed, 3 Apr 2019 21:06:07 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
Message-ID: <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>

I think #1 if any of these... Plotting functions should hopefully be as
general as possible, so tagging with a specific type of estimator will, in
some scikit-learn utopia, be unnecessary.

If a general plotter is built, where does it live in other
estimator-specific namespace options? Feels awkward to put it under every
estimator's namespace.

Then again, there might be a #4 where there is no plot module and plotting
classes live under groups of utilities like introspection, cross-validation
or something?...

On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com> wrote:

> My preference would be for (1). I don't think the sub-namespace in (2) is
> necessary, and don't like (3), as I would prefer the plotting functions to
> be all in the same namespace sklearn.plot.
>
> Andrew
>
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
> ResearchGate Profile <http://www.researchgate.net/profile/John_Howe12/>
> Open Researcher and Contributor ID (ORCID)
> <http://orcid.org/0000-0002-3553-1990>
> Github Profile <http://github.com/ahowe42>
> Personal Website <http://www.andrewhowe.com>
> I live to learn, so I can learn to live. - me
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>
>
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <qinhanmin2005 at sina.com> wrote:
>
>> See https://github.com/scikit-learn/scikit-learn/issues/13448
>>
>> We've introduced several plotting functions (e.g., plot_tree and
>> plot_partial_dependence) and will introduce more (e.g.,
>> plot_decision_boundary) in the future. Consequently, we need to decide
>> where to put these functions. Currently, there're 3 proposals:
>>
>> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>
>> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>>
>> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
>> that we won't support from sklearn.XXX import plot_YYY)
>>
>> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
>> to invite opinions.
>>
>> Thanks
>>
>> Hanmin Qin
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/e1b0cdb6/attachment.html>

From christian.braune79 at gmail.com  Wed Apr  3 06:18:13 2019
From: christian.braune79 at gmail.com (Christian Braune)
Date: Wed, 3 Apr 2019 12:18:13 +0200
Subject: [scikit-learn] Can cluster help me to cluster data with length
 of continuous series?
In-Reply-To: <CAB3eZfs83vELFjDSwuRO5N0xDZnKb=6CMq3G-bk038st-fx1yQ@mail.gmail.com>
References: <CAB3eZfs83vELFjDSwuRO5N0xDZnKb=6CMq3G-bk038st-fx1yQ@mail.gmail.com>
Message-ID: <CABfx9=d_bXsLFHxREvEv68SPxY2=Nt1AoTN+f9MM8iFZCxxfug@mail.gmail.com>

Hi,

that does not really sound like a clustering but more like a preprocessing
problem to me. For each item you want to calculate the length of the
longest subsequence of "1"s. That could be done by a simple function and
would create a new (one-dimensional) property for each of your items.
You could then apply any clustering algorithm to this feature (i.e. you'd
be clustering a one-dimensional dataset)...

Regards,
  Christian

lampahome <pahome.chen at mirlab.org> schrieb am Mi., 3. Apr. 2019 um
11:08 Uhr:

> I have data which contain access duration of each items.
>
> EX: t0~t4 is the access time duration. 1 means the item was accessed in
> the time duration, 0 means not.
> ID,t0,t1,t2,t3,t4
> 0,1,0,0,1
> 1,1,0,0,1
> 2,0,0,1,1
> 3,0,1,1,1
>
> What I want to cluster is the length of continuous duration
> Ex:
> ID=3 > 2 > 1 = 0
>
> Can any distance metric to help clustering based on the length of
> continuous duration?
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/baac3589/attachment.html>

From hollas at informatik.htw-dresden.de  Wed Apr  3 06:28:22 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Wed, 3 Apr 2019 12:28:22 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
Message-ID: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>

I use

sum((cross_val_predict(model, X, y) - y)**2) / len(y)??? ??? (*)

to evaluate the performance of a model. This conforms with Murphy: 
Machine Learning, section 6.5.3, and Hastie et al: The Elements of 
Statistical Learning,? eq. 7.48. However, according to the documentation 
of cross_val_predict, "it is not appropriate to pass these predictions 
into an evaluation metric". While it is obvious that cross_val_predict 
is different from cross_val_score, I don't see what should be wrong with 
(*).

Also, the explanation that "|cross_val_predict| 
<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict>simply 
returns the labels (or probabilities)" is unclear, if not wrong. As I 
understand it, this function returns estimates and no labels or 
probabilities.

Regards, Boris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/777aae79/attachment.html>

From martin.watzenboeck at gmail.com  Wed Apr  3 07:17:13 2019
From: martin.watzenboeck at gmail.com (Martin Watzenboeck)
Date: Wed, 3 Apr 2019 13:17:13 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative
 correlation with observed values on random data
In-Reply-To: <CADeotZquX+iKCbJyLeD9KTTWeYAcCUNxrpKyH62viauL_btahg@mail.gmail.com>
References: <CABvijHC-ZAOdWLf5zH7F=jRdB4v3K7BqyfdSRZCW5gTkbdFd2Q@mail.gmail.com>
 <CADeotZquX+iKCbJyLeD9KTTWeYAcCUNxrpKyH62viauL_btahg@mail.gmail.com>
Message-ID: <CABvijHDxbd-URaPPAxSk2SyqF1eZm2+ez-t1_93TPu=ztRAxNQ@mail.gmail.com>

Hi Alex,

Thanks a lot for the answer! That does indeed explain this phenomenon.
Also, I know see that with my data I can get meaningful LASSO predictions
by tuning the alpha parameter.

Cheers,
Martin

Am Di., 2. Apr. 2019 um 21:33 Uhr schrieb Alexandre Gramfort <
alexandre.gramfort at inria.fr>:

> in your example with random data Lasso leads to coef_ of zeros so you get
> as prediction : np.mean(Y[train])
>
> you'll see the same phenomenon if you do:
>
> pred = np.r_[pred, np.mean(Y[train])]
>
> Alex
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/7e1a25c3/attachment-0001.html>

From rth.yurchak at pm.me  Wed Apr  3 07:35:23 2019
From: rth.yurchak at pm.me (Roman Yurchak)
Date: Wed, 03 Apr 2019 11:35:23 +0000
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
Message-ID: <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>

+1 for options 1 and +0.5 for 3. Do we anticipate that many plotting 
functions will be added? If it's just a dozen or less, putting them all 
into a single namespace sklearn.plot might be easier.

This also would avoid discussion about where to put some generic 
plotting functions (e.g. 
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).

Roman

On 03/04/2019 12:06, Trevor Stephens wrote:
> I think #1 if any of these... Plotting functions should hopefully be as 
> general as possible, so tagging with a specific type of estimator will, 
> in some scikit-learn utopia, be unnecessary.
> 
> If a general plotter is built, where does it live in other 
> estimator-specific namespace options? Feels awkward to put it under 
> every estimator's namespace.
> 
> Then again, there might be a #4 where there is no plot module and 
> plotting classes live under groups of utilities like introspection, 
> cross-validation or something?...
> 
> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com 
> <mailto:ahowe42 at gmail.com>> wrote:
> 
>     My preference would be for (1). I don't think the sub-namespace in
>     (2) is necessary, and don't like (3), as I would prefer the plotting
>     functions to be all in the same namespace sklearn.plot.
> 
>     Andrew
> 
>     <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>     J. Andrew Howe, PhD
>     LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
>     ResearchGate Profile <http://www.researchgate.net/profile/John_Howe12/>
>     Open Researcher and Contributor ID (ORCID)
>     <http://orcid.org/0000-0002-3553-1990>
>     Github Profile <http://github.com/ahowe42>
>     Personal Website <http://www.andrewhowe.com>
>     I live to learn, so I can learn to live. - me
>     <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> 
> 
>     On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <qinhanmin2005 at sina.com
>     <mailto:qinhanmin2005 at sina.com>> wrote:
> 
>         See https://github.com/scikit-learn/scikit-learn/issues/13448
> 
>         We've introduced several plotting functions (e.g., plot_tree and
>         plot_partial_dependence) and will introduce more (e.g.,
>         plot_decision_boundary) in the future. Consequently, we need to
>         decide where to put these functions. Currently, there're 3
>         proposals:
> 
>         (1)?sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> 
>         (2)?sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> 
>         (3)?sklearn.XXX.plot.plot_YYY (e.g.,
>         sklearn.tree.plot.plot_tree, note that we won't support from
>         sklearn.XXX import plot_YYY)
> 
>         Joel Nothman,?Gael Varoquaux and I decided to post it on the
>         mailing list to invite opinions.
> 
>         Thanks
> 
>         Hanmin Qin
>         _______________________________________________
>         scikit-learn mailing list
>         scikit-learn at python.org <mailto:scikit-learn at python.org>
>         https://mail.python.org/mailman/listinfo/scikit-learn
> 
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org <mailto:scikit-learn at python.org>
>     https://mail.python.org/mailman/listinfo/scikit-learn
> 


From joel.nothman at gmail.com  Wed Apr  3 07:59:18 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 3 Apr 2019 22:59:18 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
Message-ID: <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>

The equations in Murphy and Hastie very clearly assume a metric
decomposable over samples (a loss function). Several popular metrics
are not.

For a metric like MSE it will be almost identical assuming the test
sets have almost the same size. For something like Recall
(sensitivity) it will be almost identical assuming similar test set
sizes *and* stratification. For something like precision whose
denominator is determined by the biases of the learnt classifier on
the test dataset, you can't say the same. For something like ROC AUC
score, relying on some decision function that may not be equivalently
calibrated across splits, evaluating in this way is almost
meaningless.

On Wed, 3 Apr 2019 at 22:01, Boris Hollas
<hollas at informatik.htw-dresden.de> wrote:
>
> I use
>
> sum((cross_val_predict(model, X, y) - y)**2) / len(y)        (*)
>
> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning,  eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>
> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities.
>
> Regards, Boris
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From t3kcit at gmail.com  Wed Apr  3 08:54:51 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 3 Apr 2019 08:54:51 -0400
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
 <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
Message-ID: <dbf08008-2ef7-c7ff-7768-ac1c61a5c027@gmail.com>


On 4/3/19 7:59 AM, Joel Nothman wrote:
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size. For something like Recall
> (sensitivity) it will be almost identical assuming similar test set
> sizes *and* stratification. For something like precision whose
> denominator is determined by the biases of the learnt classifier on
> the test dataset, you can't say the same. For something like ROC AUC
> score, relying on some decision function that may not be equivalently
> calibrated across splits, evaluating in this way is almost
> meaningless.

In theory. Not sure how it holds up in practice.

I didn't get the point about precision.

But yes, we should add to the docs that in particular for losses that 
don't decompose this is a weird thing to do.

If the loss decomposes, the result might be different b/c of different 
test set sizes, but I'm not sure if they are "worse" in some way?


From t3kcit at gmail.com  Wed Apr  3 09:09:19 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 3 Apr 2019 09:09:19 -0400
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
Message-ID: <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>

I think what was not clear from the question is that there is actually 
quite different kinds of plotting functions, and many of these are tied 
to existing code.

Right now we have some that are specific to trees (plot_tree) and to 
gradient boosting (plot_partial_dependence).

I think we want more general functions, and plot_partial_dependence has 
been extended to general estimators.

However, the plotting functions might be generic wrt the estimator, but 
relate to a specific function, say plotting results of GridSearchCV.
Then one might argue that having the plotting function close to 
GridSearchCV might make sense.
Similarly for plotting partial dependence plots and feature importances, 
it might be a bit strange to have the plotting functions not next to the 
functions that compute these.
Another question would be is whether the plotting functions also "do the 
work" in some cases:
Do we want plot_partial_dependence also to compute the partial 
dependence? (I would argue yes but either way the result is a bit strange).
In that case you have somewhat of the same functionality in two 
different modules, unless you also put the "compute partial dependence" 
function in the plotting module as well,
which is a bit strange.

Maybe we could inform this discussion by listing candidate plotting 
functions, and also considering whether they "do the work" and where the 
"work" function is.

Other examples are plotting the confusion matrix, which probably should 
also compute the confusion matrix (it's fast and so that would be 
convenient), and so it would "duplicate" functionality from the metrics 
module.

Plotting learning curves and validation curves should probably not do 
the work as it's pretty involved, and so someone would need to import 
the learning and validation curves from model selection, and then the 
plotting functions from a plotting module.

Calibrations curves and P/R curves and roc curves are also pretty fast 
to compute (and passing around the arguments is somewhat error prone) so 
I would say the plotting functions for these should do the work as well.

Anyway, you can see that many plotting functions are actually associated 
with functions in existing modules and the interactions are a bit unclear.

The only plotting functions I haven't mentioned so far that I thought 
about in the past are "2d scatter" and "plot decision function". These 
would be kind of generic, but mostly used in the examples.
Though having a discrete 2d scatter function would be pretty nice 
(plt.scatter doesn't allow legends and makes it hard to use qualitative 
color maps).


I think I would vote for option (1), "sklearn.plot.plot_zzz" but the 
case is not really that clear.

Cheers,

Andy

On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> functions will be added? If it's just a dozen or less, putting them all
> into a single namespace sklearn.plot might be easier.
>
> This also would avoid discussion about where to put some generic
> plotting functions (e.g.
> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
>
> Roman
>
> On 03/04/2019 12:06, Trevor Stephens wrote:
>> I think #1 if any of these... Plotting functions should hopefully be as
>> general as possible, so tagging with a specific type of estimator will,
>> in some scikit-learn utopia, be unnecessary.
>>
>> If a general plotter is built, where does it live in other
>> estimator-specific namespace options? Feels awkward to put it under
>> every estimator's namespace.
>>
>> Then again, there might be a #4 where there is no plot module and
>> plotting classes live under groups of utilities like introspection,
>> cross-validation or something?...
>>
>> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com
>> <mailto:ahowe42 at gmail.com>> wrote:
>>
>>      My preference would be for (1). I don't think the sub-namespace in
>>      (2) is necessary, and don't like (3), as I would prefer the plotting
>>      functions to be all in the same namespace sklearn.plot.
>>
>>      Andrew
>>
>>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>      J. Andrew Howe, PhD
>>      LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
>>      ResearchGate Profile <http://www.researchgate.net/profile/John_Howe12/>
>>      Open Researcher and Contributor ID (ORCID)
>>      <http://orcid.org/0000-0002-3553-1990>
>>      Github Profile <http://github.com/ahowe42>
>>      Personal Website <http://www.andrewhowe.com>
>>      I live to learn, so I can learn to live. - me
>>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>
>>
>>      On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <qinhanmin2005 at sina.com
>>      <mailto:qinhanmin2005 at sina.com>> wrote:
>>
>>          See https://github.com/scikit-learn/scikit-learn/issues/13448
>>
>>          We've introduced several plotting functions (e.g., plot_tree and
>>          plot_partial_dependence) and will introduce more (e.g.,
>>          plot_decision_boundary) in the future. Consequently, we need to
>>          decide where to put these functions. Currently, there're 3
>>          proposals:
>>
>>          (1)?sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>
>>          (2)?sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>>
>>          (3)?sklearn.XXX.plot.plot_YYY (e.g.,
>>          sklearn.tree.plot.plot_tree, note that we won't support from
>>          sklearn.XXX import plot_YYY)
>>
>>          Joel Nothman,?Gael Varoquaux and I decided to post it on the
>>          mailing list to invite opinions.
>>
>>          Thanks
>>
>>          Hanmin Qin
>>          _______________________________________________
>>          scikit-learn mailing list
>>          scikit-learn at python.org <mailto:scikit-learn at python.org>
>>          https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>      _______________________________________________
>>      scikit-learn mailing list
>>      scikit-learn at python.org <mailto:scikit-learn at python.org>
>>      https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From gael.varoquaux at normalesup.org  Wed Apr  3 09:28:52 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 3 Apr 2019 15:28:52 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <dbf08008-2ef7-c7ff-7768-ac1c61a5c027@gmail.com>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
 <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
 <dbf08008-2ef7-c7ff-7768-ac1c61a5c027@gmail.com>
Message-ID: <20190403132852.2jdszy2rfp3kivk4@phare.normalesup.org>

On Wed, Apr 03, 2019 at 08:54:51AM -0400, Andreas Mueller wrote:
> If the loss decomposes, the result might be different b/c of different test
> set sizes, but I'm not sure if they are "worse" in some way?

Mathematically, a cross-validation estimates a double expectation: one
expectation on the model (ie the train data), and another on the test
data (see for instance eq 3 in
https://europepmc.org/articles/pmc5441396, sorry for the self citation,
this is seldom discussed in the literature).

The correct way to compute this double expectation is by averaging first
inside the fold and second across the folds. Other ways of computing
errors estimate other quantities, that are harder to study mathematically
and not comparable to objects studied in the literature.

Another problem with cross_val_predict is that some people use metrics
like correlation (which is a terrible metric and does not decompose
across folds). It will then pick up things like correlations across
folds.

All these problems are made worse when data are not iid, and hence folds
risk not being iid.

G

From joel.nothman at gmail.com  Wed Apr  3 10:06:13 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 01:06:13 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
Message-ID: <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>

With option 1, sklearn.plot is likely to import large chunks of the
library (particularly, but not exclusively, if the plotting function
"does the work" as Andy suggests). This is under the assumption that
one plot function will want to import trees, another GPs, etc. Unless
we move to lazy imports, that would be against the current convention
that importing sklearn is fairly minimal.

I do like Andy's idea of framing this discussion more clearly around
likely candidates.

On Thu, 4 Apr 2019 at 00:10, Andreas Mueller <t3kcit at gmail.com> wrote:
>
> I think what was not clear from the question is that there is actually
> quite different kinds of plotting functions, and many of these are tied
> to existing code.
>
> Right now we have some that are specific to trees (plot_tree) and to
> gradient boosting (plot_partial_dependence).
>
> I think we want more general functions, and plot_partial_dependence has
> been extended to general estimators.
>
> However, the plotting functions might be generic wrt the estimator, but
> relate to a specific function, say plotting results of GridSearchCV.
> Then one might argue that having the plotting function close to
> GridSearchCV might make sense.
> Similarly for plotting partial dependence plots and feature importances,
> it might be a bit strange to have the plotting functions not next to the
> functions that compute these.
> Another question would be is whether the plotting functions also "do the
> work" in some cases:
> Do we want plot_partial_dependence also to compute the partial
> dependence? (I would argue yes but either way the result is a bit strange).
> In that case you have somewhat of the same functionality in two
> different modules, unless you also put the "compute partial dependence"
> function in the plotting module as well,
> which is a bit strange.
>
> Maybe we could inform this discussion by listing candidate plotting
> functions, and also considering whether they "do the work" and where the
> "work" function is.
>
> Other examples are plotting the confusion matrix, which probably should
> also compute the confusion matrix (it's fast and so that would be
> convenient), and so it would "duplicate" functionality from the metrics
> module.
>
> Plotting learning curves and validation curves should probably not do
> the work as it's pretty involved, and so someone would need to import
> the learning and validation curves from model selection, and then the
> plotting functions from a plotting module.
>
> Calibrations curves and P/R curves and roc curves are also pretty fast
> to compute (and passing around the arguments is somewhat error prone) so
> I would say the plotting functions for these should do the work as well.
>
> Anyway, you can see that many plotting functions are actually associated
> with functions in existing modules and the interactions are a bit unclear.
>
> The only plotting functions I haven't mentioned so far that I thought
> about in the past are "2d scatter" and "plot decision function". These
> would be kind of generic, but mostly used in the examples.
> Though having a discrete 2d scatter function would be pretty nice
> (plt.scatter doesn't allow legends and makes it hard to use qualitative
> color maps).
>
>
> I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
> case is not really that clear.
>
> Cheers,
>
> Andy
>
> On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> > functions will be added? If it's just a dozen or less, putting them all
> > into a single namespace sklearn.plot might be easier.
> >
> > This also would avoid discussion about where to put some generic
> > plotting functions (e.g.
> > https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
> >
> > Roman
> >
> > On 03/04/2019 12:06, Trevor Stephens wrote:
> >> I think #1 if any of these... Plotting functions should hopefully be as
> >> general as possible, so tagging with a specific type of estimator will,
> >> in some scikit-learn utopia, be unnecessary.
> >>
> >> If a general plotter is built, where does it live in other
> >> estimator-specific namespace options? Feels awkward to put it under
> >> every estimator's namespace.
> >>
> >> Then again, there might be a #4 where there is no plot module and
> >> plotting classes live under groups of utilities like introspection,
> >> cross-validation or something?...
> >>
> >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com
> >> <mailto:ahowe42 at gmail.com>> wrote:
> >>
> >>      My preference would be for (1). I don't think the sub-namespace in
> >>      (2) is necessary, and don't like (3), as I would prefer the plotting
> >>      functions to be all in the same namespace sklearn.plot.
> >>
> >>      Andrew
> >>
> >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >>      J. Andrew Howe, PhD
> >>      LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
> >>      ResearchGate Profile <http://www.researchgate.net/profile/John_Howe12/>
> >>      Open Researcher and Contributor ID (ORCID)
> >>      <http://orcid.org/0000-0002-3553-1990>
> >>      Github Profile <http://github.com/ahowe42>
> >>      Personal Website <http://www.andrewhowe.com>
> >>      I live to learn, so I can learn to live. - me
> >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >>
> >>
> >>      On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <qinhanmin2005 at sina.com
> >>      <mailto:qinhanmin2005 at sina.com>> wrote:
> >>
> >>          See https://github.com/scikit-learn/scikit-learn/issues/13448
> >>
> >>          We've introduced several plotting functions (e.g., plot_tree and
> >>          plot_partial_dependence) and will introduce more (e.g.,
> >>          plot_decision_boundary) in the future. Consequently, we need to
> >>          decide where to put these functions. Currently, there're 3
> >>          proposals:
> >>
> >>          (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> >>
> >>          (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> >>
> >>          (3) sklearn.XXX.plot.plot_YYY (e.g.,
> >>          sklearn.tree.plot.plot_tree, note that we won't support from
> >>          sklearn.XXX import plot_YYY)
> >>
> >>          Joel Nothman, Gael Varoquaux and I decided to post it on the
> >>          mailing list to invite opinions.
> >>
> >>          Thanks
> >>
> >>          Hanmin Qin
> >>          _______________________________________________
> >>          scikit-learn mailing list
> >>          scikit-learn at python.org <mailto:scikit-learn at python.org>
> >>          https://mail.python.org/mailman/listinfo/scikit-learn
> >>
> >>      _______________________________________________
> >>      scikit-learn mailing list
> >>      scikit-learn at python.org <mailto:scikit-learn at python.org>
> >>      https://mail.python.org/mailman/listinfo/scikit-learn
> >>
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From hollas at informatik.htw-dresden.de  Wed Apr  3 12:50:24 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Wed, 3 Apr 2019 18:50:24 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
 <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
Message-ID: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>

Am 03.04.19 um 13:59 schrieb Joel Nothman:
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size.
What will be almost identical to what? I suppose you mean that (*) is 
consistent with the scores of the models in the fold (ie, the result of 
cross_val_score) if the loss function is (x-y)?.
> For something like Recall
> (sensitivity) it will be almost identical assuming similar test set
> sizes*and*  stratification. For something like precision whose
> denominator is determined by the biases of the learnt classifier on
> the test dataset, you can't say the same.
I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then 
(*) gives the accuracy.
>   For something like ROC AUC
> score, relying on some decision function that may not be equivalently
> calibrated across splits, evaluating in this way is almost
> meaningless.

In any case, I still don't see what may be wrong with (*). Otherwise, 
the warning in the documentation about the use of cross_val_predict 
should be removed or revised.

On the other hand, an example in the documentation uses 
cross_val_scores.mean(). This is debatable since this computes a mean of 
means.

>
> On Wed, 3 Apr 2019 at 22:01, Boris Hollas
> <hollas at informatik.htw-dresden.de>  wrote:
>> I use
>>
>> sum((cross_val_predict(model, X, y) - y)**2) / len(y)        (*)
>>
>> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning,  eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>>
>> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities.
>>
>> Regards, Boris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/b72385f5/attachment.html>

From rr.rosas at gmail.com  Wed Apr  3 14:38:37 2019
From: rr.rosas at gmail.com (Rodrigo Rosenfeld Rosas)
Date: Wed, 3 Apr 2019 15:38:37 -0300
Subject: [scikit-learn] How to answer questions from big documents?
Message-ID: <CAGmv+wLJt=GdUG=fR6cZA-KXeBG0wMaKDZrr3p_N=7i-dsT_7g@mail.gmail.com>

Hi everyone, this is my first post here :)

About two weeks ago, due to the low demand in my project, I have been
assigned a completely unusual request: to automatically extract answers
from documents based on machine learning. I've never read anything about
ML, AI or NLP before, so I've been basically doing just that for the past
two weeks.

When it comes to ML, most book recommendations and tutorials I've found so
far use the Python language and tools, so I took the first week to learn
about Python, NumPy, Scikit, Panda, Matplotlib and so on. Then, this week I
started reading about NLP itself, after spending a few days reading about
generic ML algorithms.

So far, I've basically read about Bag of Words, using TF-IDF (or simply
terms count) to convert the words to numeric representations and a few
methods such as the gaussian and multinomial naive bayes methods to train
and predict values. The methods also mention the importance of using the
usual pre-processing methods such as lemmatization and alikes. However,
basically all examples assume that a given text can be classified in one of
the categorized topics, like the sentiment analysis use case. I'm afraid
this doesn't represent my use case, so I'd like to describe it here so that
you could help me identifying which methods I should be looking for.

We have a system with thousands of transactions/deals inputted manually by
an specialized team. Each deal has a set of documents (a dozen per deal
typically) and some documents could have hundreds of pages. The inputing
team has to extract about a thousand fields from those documents for any
particular deal. So, in our database we have all their data and we
typically also know the document specific snippets associated to each field
value.

So, my task is to, given a new document and deal, and based on the previous
answers, fill in as many fields as I could by automatically finding the
corresponding snippets in the new documents. I'm not sure how I should
approach this problem.

For example, I could consider each sentence of the document as a separate
document to be analyzed and compared to the snippets I already have for the
matching data. However, I can't be sure whether some of those sentences
would actually answer the question. For example, maybe there are 6
occurrences in the documents that would answer a particular question/field,
but maybe the inputters only identified 2 or 3 of them.

Also, for any given sentence, it could tell that the answer for a given
field is A or B, or it could be that there's absolutely no association
between the sentence and the field/question, as it would be the case for
most sentences. I know that Scikit provides the predict_proba method, so
that I could try to only consider the sentence as relevant if the
probabilities of answering the question would be above 80%, for example,
but based on a few quick tests I've made with a few sentences and words, I
suspect this won't work very well. Also, it could be quite slow to treat
each sentence of a 500-hundreds of pages documents as a separate document
to be analyzed, so I'm not sure if there are better methods to handle this
use case.

Some of the fields are free-text ones, like company and firm names, for
example, and I suspect those would be the hardest to answer, so I'm trying
to start with the multiple-choice ones, with a finite set of classification.

How would you advise me to look at this problem? Are there any algorithms
you'd recommend me to study for solving this particular problem?

Here are some sample data so that you could get a better understanding of
the problem:

One of the fields is called "Deal Structure" and it could have the
following values: "Asset Purchase", "Stock or Equity Purchase" or "Public
Target Merger" (there are a few others, but this gives you an idea).

So, here are some sentences highlighted for Public Target Merger deals
(those documents come from Edgar Filings public database which are freely
available for US deals):

deal 1 / doc 1: "AGREEMENT AND PLAN OF MERGER, dated as of March 14, 2018
(this ?Agreement?), by and among HarborOne Bancorp, Inc., a Massachusetts
corporation (?Buyer?), Massachusetts Acquisitions, LLC, a Maryland limited
liability company of which Buyer is the sole member (?Merger LLC?), and
Coastway Bancorp, Inc., a Maryland corporation (the ?Company?)."

"WHEREAS, Buyer, Merger LLC, and the Company intend to effect a merger (the
?Merger?) of Merger LLC with and into the Company in accordance with this
Agreement and the Maryland General Corporation Law (the ?MGCL?) and the
Maryland Limited Liability Company Act, as amended (the ?MLLCA?), with the
Company to be the surviving entity in the Merger. The Merger will be
followed immediately by a merger of the Company with and into Buyer (the
?Upstream Merger?), with the Buyer to be the surviving entity in the
Upstream Merger. It is intended that the Merger be mutually interdependent
with and a condition precedent to the Upstream Merger and that the Upstream
Merger shall, through the binding commitment evidenced by this Agreement,
be effected immediately following the Effective Time (as defined below)
without further approval, authorization or direction from or by any of the
parties hereto; and"

deal 2 / doc 1:

"WHEREAS, it is also proposed that, as soon as practicable following the
consummation of the Offer, the Parties wish to effect the acquisition of
the Company by Parent through the merger of Purchaser with and into the
Company, with the Company being the surviving entity (the ?Merger?);"

Now, for Asset Purchase deals:

deal 3 / doc 1:

"Subject to the terms and conditions of this Agreement, Sellers are willing
to sell to Buyer, and Buyer is willing to purchase from Sellers, all of
their assets relating to the Businesses as set forth herein."

deal 4 / doc 1:

"WHEREAS, Seller wishes to sell and assign to Buyer, and Buyer wishes to
purchase and assume from Seller, the rights and obligations of Seller to
the Purchased Assets (as defined herein), subject to the terms and
conditions set forth herein."

Please forgive me for any imprecise/incorrect terms or understanding on
this topic as this is all very new to me. Any help is very appreciated.

I've also asked this question in StackOverflow, so if you'd prefer to
answer there instead, here is the link:

https://stackoverflow.com/questions/55499866/how-to-answer-questions-from-big-documents

Would this field be called data mining? Feature extraction? Question
answering? I'm not sure how to properly search about this subject so any
hints are very welcome :)

Thanks in advance,
Rodrigo.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/6643500f/attachment-0001.html>

From joel.nothman at gmail.com  Wed Apr  3 17:46:57 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 08:46:57 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
 <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
 <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Message-ID: <CAAkaFLUYWHrpkg5OYr-ZejnG93Wzo=rasy3-RDxVGFdFkZ4S8g@mail.gmail.com>

Pull requests improving the documentation are always welcome. At a minimum,
users need to know that these compute different things.

Accuracy is not precision. Precision is the number of true positives
divided by the number of true positives plus false positives. It therefore
cannot be decomposed as a sample-wise measure without knowing the rate of
positive predictions. This rate is dependent on the training data and
algorithm.

I'm not a statistician and cannot speak to issues of computing a mean of
means, but if what we are trying to estimate is the performance on a sample
of size approximately n_t of a model trained on a sample of size
approximately N - n_t, then I wouldn't have thought taking a mean over such
measures (with whatever score function) to be unreasonable.

On Thu., 4 Apr. 2019, 3:51 am Boris Hollas, <
hollas at informatik.htw-dresden.de> wrote:

> Am 03.04.19 um 13:59 schrieb Joel Nothman:
>
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size.
>
> What will be almost identical to what? I suppose you mean that (*) is
> consistent with the scores of the models in the fold (ie, the result of
> cross_val_score) if the loss function is (x-y)?.
>
> For something like Recall
> (sensitivity) it will be almost identical assuming similar test set
> sizes **and** stratification. For something like precision whose
> denominator is determined by the biases of the learnt classifier on
> the test dataset, you can't say the same.
>
> I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then (*)
> gives the accuracy.
>
>  For something like ROC AUC
> score, relying on some decision function that may not be equivalently
> calibrated across splits, evaluating in this way is almost
> meaningless.
>
> In any case, I still don't see what may be wrong with (*). Otherwise, the
> warning in the documentation about the use of cross_val_predict should be
> removed or revised.
>
> On the other hand, an example in the documentation uses
> cross_val_scores.mean(). This is debatable since this computes a mean of
> means.
>
>
>
> On Wed, 3 Apr 2019 at 22:01, Boris Hollas<hollas at informatik.htw-dresden.de> <hollas at informatik.htw-dresden.de> wrote:
>
> I use
>
> sum((cross_val_predict(model, X, y) - y)**2) / len(y)        (*)
>
> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning,  eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>
> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities.
>
> Regards, Boris
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/18943e3d/attachment.html>

From ericmajinglong at gmail.com  Wed Apr  3 18:59:02 2019
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 4 Apr 2019 00:59:02 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
 <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
Message-ID: <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>

This is not a strongly-held suggestion - but what about adopting
YellowBrick as the plotting API for sklearn? Not sure how exactly the
interaction would work - could be PRs to their library, or ask them to
integrate into sklearn, or do a lock-step dance with versions but maintain
separate teams? (I know it raises more questions than answers, but wanted
to put it out there.)

On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman <joel.nothman at gmail.com> wrote:

> With option 1, sklearn.plot is likely to import large chunks of the
> library (particularly, but not exclusively, if the plotting function
> "does the work" as Andy suggests). This is under the assumption that
> one plot function will want to import trees, another GPs, etc. Unless
> we move to lazy imports, that would be against the current convention
> that importing sklearn is fairly minimal.
>
> I do like Andy's idea of framing this discussion more clearly around
> likely candidates.
>
> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller <t3kcit at gmail.com> wrote:
> >
> > I think what was not clear from the question is that there is actually
> > quite different kinds of plotting functions, and many of these are tied
> > to existing code.
> >
> > Right now we have some that are specific to trees (plot_tree) and to
> > gradient boosting (plot_partial_dependence).
> >
> > I think we want more general functions, and plot_partial_dependence has
> > been extended to general estimators.
> >
> > However, the plotting functions might be generic wrt the estimator, but
> > relate to a specific function, say plotting results of GridSearchCV.
> > Then one might argue that having the plotting function close to
> > GridSearchCV might make sense.
> > Similarly for plotting partial dependence plots and feature importances,
> > it might be a bit strange to have the plotting functions not next to the
> > functions that compute these.
> > Another question would be is whether the plotting functions also "do the
> > work" in some cases:
> > Do we want plot_partial_dependence also to compute the partial
> > dependence? (I would argue yes but either way the result is a bit
> strange).
> > In that case you have somewhat of the same functionality in two
> > different modules, unless you also put the "compute partial dependence"
> > function in the plotting module as well,
> > which is a bit strange.
> >
> > Maybe we could inform this discussion by listing candidate plotting
> > functions, and also considering whether they "do the work" and where the
> > "work" function is.
> >
> > Other examples are plotting the confusion matrix, which probably should
> > also compute the confusion matrix (it's fast and so that would be
> > convenient), and so it would "duplicate" functionality from the metrics
> > module.
> >
> > Plotting learning curves and validation curves should probably not do
> > the work as it's pretty involved, and so someone would need to import
> > the learning and validation curves from model selection, and then the
> > plotting functions from a plotting module.
> >
> > Calibrations curves and P/R curves and roc curves are also pretty fast
> > to compute (and passing around the arguments is somewhat error prone) so
> > I would say the plotting functions for these should do the work as well.
> >
> > Anyway, you can see that many plotting functions are actually associated
> > with functions in existing modules and the interactions are a bit
> unclear.
> >
> > The only plotting functions I haven't mentioned so far that I thought
> > about in the past are "2d scatter" and "plot decision function". These
> > would be kind of generic, but mostly used in the examples.
> > Though having a discrete 2d scatter function would be pretty nice
> > (plt.scatter doesn't allow legends and makes it hard to use qualitative
> > color maps).
> >
> >
> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
> > case is not really that clear.
> >
> > Cheers,
> >
> > Andy
> >
> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> > > functions will be added? If it's just a dozen or less, putting them all
> > > into a single namespace sklearn.plot might be easier.
> > >
> > > This also would avoid discussion about where to put some generic
> > > plotting functions (e.g.
> > >
> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479
> ).
> > >
> > > Roman
> > >
> > > On 03/04/2019 12:06, Trevor Stephens wrote:
> > >> I think #1 if any of these... Plotting functions should hopefully be
> as
> > >> general as possible, so tagging with a specific type of estimator
> will,
> > >> in some scikit-learn utopia, be unnecessary.
> > >>
> > >> If a general plotter is built, where does it live in other
> > >> estimator-specific namespace options? Feels awkward to put it under
> > >> every estimator's namespace.
> > >>
> > >> Then again, there might be a #4 where there is no plot module and
> > >> plotting classes live under groups of utilities like introspection,
> > >> cross-validation or something?...
> > >>
> > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com
> > >> <mailto:ahowe42 at gmail.com>> wrote:
> > >>
> > >>      My preference would be for (1). I don't think the sub-namespace
> in
> > >>      (2) is necessary, and don't like (3), as I would prefer the
> plotting
> > >>      functions to be all in the same namespace sklearn.plot.
> > >>
> > >>      Andrew
> > >>
> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> > >>      J. Andrew Howe, PhD
> > >>      LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
> > >>      ResearchGate Profile <
> http://www.researchgate.net/profile/John_Howe12/>
> > >>      Open Researcher and Contributor ID (ORCID)
> > >>      <http://orcid.org/0000-0002-3553-1990>
> > >>      Github Profile <http://github.com/ahowe42>
> > >>      Personal Website <http://www.andrewhowe.com>
> > >>      I live to learn, so I can learn to live. - me
> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> > >>
> > >>
> > >>      On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <
> qinhanmin2005 at sina.com
> > >>      <mailto:qinhanmin2005 at sina.com>> wrote:
> > >>
> > >>          See
> https://github.com/scikit-learn/scikit-learn/issues/13448
> > >>
> > >>          We've introduced several plotting functions (e.g., plot_tree
> and
> > >>          plot_partial_dependence) and will introduce more (e.g.,
> > >>          plot_decision_boundary) in the future. Consequently, we need
> to
> > >>          decide where to put these functions. Currently, there're 3
> > >>          proposals:
> > >>
> > >>          (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> > >>
> > >>          (2) sklearn.plot.XXX.plot_YYY (e.g.,
> sklearn.plot.tree.plot_tree)
> > >>
> > >>          (3) sklearn.XXX.plot.plot_YYY (e.g.,
> > >>          sklearn.tree.plot.plot_tree, note that we won't support from
> > >>          sklearn.XXX import plot_YYY)
> > >>
> > >>          Joel Nothman, Gael Varoquaux and I decided to post it on the
> > >>          mailing list to invite opinions.
> > >>
> > >>          Thanks
> > >>
> > >>          Hanmin Qin
> > >>          _______________________________________________
> > >>          scikit-learn mailing list
> > >>          scikit-learn at python.org <mailto:scikit-learn at python.org>
> > >>          https://mail.python.org/mailman/listinfo/scikit-learn
> > >>
> > >>      _______________________________________________
> > >>      scikit-learn mailing list
> > >>      scikit-learn at python.org <mailto:scikit-learn at python.org>
> > >>      https://mail.python.org/mailman/listinfo/scikit-learn
> > >>
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/635cc0a6/attachment-0001.html>

From joel.nothman at gmail.com  Wed Apr  3 19:50:51 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 10:50:51 +1100
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
Message-ID: <CAAkaFLVZ2idFj=jKyrngqqyjFDu5xhuBTNb8i0T_zTw-f9GVcQ@mail.gmail.com>

The core developers of Scikit-learn have recently voted to welcome
Thomas Fan and Nicolas Hug to the team, in recognition of their
efforts and trustworthiness as contributors. Both happen to be working
with Andy Mueller at Columbia University at the moment.
Congratulations and thanks to them both!

From qinhanmin2005 at sina.com  Wed Apr  3 21:05:55 2019
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Thu, 04 Apr 2019 09:05:55 +0800
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
Message-ID: <20190404010555.772254140094@webmail.sinamail.sina.com.cn>

Congratulations and welcome to the team!
Hanmin Qin
----- Original Message -----
From: Joel Nothman <joel.nothman at gmail.com>
To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
Date: 2019-04-04 07:52


The core developers of Scikit-learn have recently voted to welcome
Thomas Fan and Nicolas Hug to the team, in recognition of their
efforts and trustworthiness as contributors. Both happen to be working
with Andy Mueller at Columbia University at the moment.
Congratulations and thanks to them both!
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/87c61cee/attachment.html>

From t3kcit at gmail.com  Wed Apr  3 23:11:36 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 3 Apr 2019 23:11:36 -0400
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
In-Reply-To: <20190404010555.772254140094@webmail.sinamail.sina.com.cn>
References: <20190404010555.772254140094@webmail.sinamail.sina.com.cn>
Message-ID: <CADxzQoq4uzXqsP0bOurwA2=n6-qD45SDr+Yy5EqHb8k+Nx0uPA@mail.gmail.com>

Congratulations guys! Great work! Looking forward to much more! Proud to
have you on the team!

Now we in NYC can approve our own pull requests ;)


Sent from phone. Please excuse spelling and brevity.

On Wed, Apr 3, 2019, 21:08 Hanmin Qin <qinhanmin2005 at sina.com> wrote:

> Congratulations and welcome to the team!
>
> Hanmin Qin
>
> ----- Original Message -----
> From: Joel Nothman <joel.nothman at gmail.com>
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
> Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
> Date: 2019-04-04 07:52
>
>
> The core developers of Scikit-learn have recently voted to welcome
> Thomas Fan and Nicolas Hug to the team, in recognition of their
> efforts and trustworthiness as contributors. Both happen to be working
> with Andy Mueller at Columbia University at the moment.
> Congratulations and thanks to them both!
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190403/5995f7be/attachment.html>

From hollas at informatik.htw-dresden.de  Thu Apr  4 03:39:14 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Thu, 4 Apr 2019 09:39:14 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <CAAkaFLUYWHrpkg5OYr-ZejnG93Wzo=rasy3-RDxVGFdFkZ4S8g@mail.gmail.com>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
 <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
 <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
 <CAAkaFLUYWHrpkg5OYr-ZejnG93Wzo=rasy3-RDxVGFdFkZ4S8g@mail.gmail.com>
Message-ID: <ff1f1351-c332-6a01-e893-2b54ebabb52a@informatik.htw-dresden.de>

Am 03.04.19 um 23:46 schrieb Joel Nothman:
> Pull requests improving the documentation are always welcome. At a 
> minimum, users need to know that these compute different things.
>
> Accuracy is not precision. Precision is the number of true positives 
> divided by the number of true positives plus false positives. It 
> therefore cannot be decomposed as a sample-wise measure without 
> knowing the rate of positive predictions. This rate is dependent on 
> the training data and algorithm.

In my last post, I referred to your remark that "for precision ... you 
can't say the same". Since precision can't be computed with formula (*), 
even with a different loss function, I pointed out that (*) can be used 
to compute the accuracy if the loss function is an indicator function.

It is still not clear to me what your point is with your remark that 
"for precision ... you can't say the same". I assume that you want to 
tell that it is not wise to compute TP, FP, FN and then precision and 
recall using cross_val_predict. If this is what you mean, I'd like you 
to explain why.

> I'm not a statistician and cannot speak to issues of computing a mean 
> of means, but if what we are trying to estimate is the performance on 
> a sample of size approximately n_t of a model trained on a sample of 
> size approximately N - n_t, then I wouldn't have thought taking a mean 
> over such measures (with whatever score function) to be unreasonable.
>
In general, a mean of means is not the mean of the original data. The 
pooled mean is the correct metric in this case. However, the pooled mean 
equals the mean of means if all folds are exactly the same size.

> On Thu., 4 Apr. 2019, 3:51 am Boris Hollas, 
> <hollas at informatik.htw-dresden.de 
> <mailto:hollas at informatik.htw-dresden.de>> wrote:
>
>     Am 03.04.19 um 13:59 schrieb Joel Nothman:
>>     The equations in Murphy and Hastie very clearly assume a metric
>>     decomposable over samples (a loss function). Several popular metrics
>>     are not.
>>
>>     For a metric like MSE it will be almost identical assuming the test
>>     sets have almost the same size.
>     What will be almost identical to what? I suppose you mean that (*)
>     is consistent with the scores of the models in the fold (ie, the
>     result of cross_val_score) if the loss function is (x-y)?.
>>     For something like Recall
>>     (sensitivity) it will be almost identical assuming similar test set
>>     sizes**and**  stratification. For something like precision whose
>>     denominator is determined by the biases of the learnt classifier on
>>     the test dataset, you can't say the same.
>     I can't follow here. If the loss function is L(x,y) = 1_{x = y},
>     then (*) gives the accuracy.
>>       For something like ROC AUC
>>     score, relying on some decision function that may not be equivalently
>>     calibrated across splits, evaluating in this way is almost
>>     meaningless.
>
>     In any case, I still don't see what may be wrong with (*).
>     Otherwise, the warning in the documentation about the use of
>     cross_val_predict should be removed or revised.
>
>     On the other hand, an example in the documentation uses
>     cross_val_scores.mean(). This is debatable since this computes a
>     mean of means.
>
>>     On Wed, 3 Apr 2019 at 22:01, Boris Hollas
>>     <hollas at informatik.htw-dresden.de>  <mailto:hollas at informatik.htw-dresden.de>  wrote:
>>>     I use
>>>
>>>     sum((cross_val_predict(model, X, y) - y)**2) / len(y)        (*)
>>>
>>>     to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning,  eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>>>
>>>     Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities.
>>>
>>>     Regards, Boris
>
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org <mailto:scikit-learn at python.org>
>     https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/9b092b55/attachment-0001.html>

From joel.nothman at gmail.com  Thu Apr  4 04:03:16 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 19:03:16 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <ff1f1351-c332-6a01-e893-2b54ebabb52a@informatik.htw-dresden.de>
References: <a456bc01-5204-504e-9810-2c228d21ec6a@informatik.htw-dresden.de>
 <CAAkaFLVhNHTPk9zsXbmJmyghA0gSuWLDEuZyaRy2e=Ub1zC-qw@mail.gmail.com>
 <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
 <CAAkaFLUYWHrpkg5OYr-ZejnG93Wzo=rasy3-RDxVGFdFkZ4S8g@mail.gmail.com>
 <ff1f1351-c332-6a01-e893-2b54ebabb52a@informatik.htw-dresden.de>
Message-ID: <CAAkaFLU9XqWoHYkgAuit6wyFADRPARBCxJHLDvYg4wigdPUXyw@mail.gmail.com>

> I assume that you want to tell that it is not wise to compute TP, FP, FN
and then precision and recall using cross_val_predict. If this is what you
mean, I'd like you to explain why.

Because if there is high variance as a function of training set rather than
test sample I'd like to know.

> The pooled mean is the correct metric in this case.

I don't think we are in agreement on that.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/79a8def4/attachment.html>

From alexandre.gramfort at inria.fr  Thu Apr  4 05:40:48 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Thu, 4 Apr 2019 11:40:48 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
 <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
 <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>
Message-ID: <CADeotZr1x6fE9Q7FziKEiv7ujgvyDKtSgqHD9ij_NE1CVW=2Vw@mail.gmail.com>

I also think that YellowBrick folks did a great job and that we should not
reinvent the wheel or at least have clear idea of how we differ in scope
with respect to YellowBrick

my 2c

Alex


On Thu, Apr 4, 2019 at 1:02 AM Eric Ma <ericmajinglong at gmail.com> wrote:

> This is not a strongly-held suggestion - but what about adopting
> YellowBrick as the plotting API for sklearn? Not sure how exactly the
> interaction would work - could be PRs to their library, or ask them to
> integrate into sklearn, or do a lock-step dance with versions but maintain
> separate teams? (I know it raises more questions than answers, but wanted
> to put it out there.)
>
> On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> With option 1, sklearn.plot is likely to import large chunks of the
>> library (particularly, but not exclusively, if the plotting function
>> "does the work" as Andy suggests). This is under the assumption that
>> one plot function will want to import trees, another GPs, etc. Unless
>> we move to lazy imports, that would be against the current convention
>> that importing sklearn is fairly minimal.
>>
>> I do like Andy's idea of framing this discussion more clearly around
>> likely candidates.
>>
>> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller <t3kcit at gmail.com> wrote:
>> >
>> > I think what was not clear from the question is that there is actually
>> > quite different kinds of plotting functions, and many of these are tied
>> > to existing code.
>> >
>> > Right now we have some that are specific to trees (plot_tree) and to
>> > gradient boosting (plot_partial_dependence).
>> >
>> > I think we want more general functions, and plot_partial_dependence has
>> > been extended to general estimators.
>> >
>> > However, the plotting functions might be generic wrt the estimator, but
>> > relate to a specific function, say plotting results of GridSearchCV.
>> > Then one might argue that having the plotting function close to
>> > GridSearchCV might make sense.
>> > Similarly for plotting partial dependence plots and feature importances,
>> > it might be a bit strange to have the plotting functions not next to the
>> > functions that compute these.
>> > Another question would be is whether the plotting functions also "do the
>> > work" in some cases:
>> > Do we want plot_partial_dependence also to compute the partial
>> > dependence? (I would argue yes but either way the result is a bit
>> strange).
>> > In that case you have somewhat of the same functionality in two
>> > different modules, unless you also put the "compute partial dependence"
>> > function in the plotting module as well,
>> > which is a bit strange.
>> >
>> > Maybe we could inform this discussion by listing candidate plotting
>> > functions, and also considering whether they "do the work" and where the
>> > "work" function is.
>> >
>> > Other examples are plotting the confusion matrix, which probably should
>> > also compute the confusion matrix (it's fast and so that would be
>> > convenient), and so it would "duplicate" functionality from the metrics
>> > module.
>> >
>> > Plotting learning curves and validation curves should probably not do
>> > the work as it's pretty involved, and so someone would need to import
>> > the learning and validation curves from model selection, and then the
>> > plotting functions from a plotting module.
>> >
>> > Calibrations curves and P/R curves and roc curves are also pretty fast
>> > to compute (and passing around the arguments is somewhat error prone) so
>> > I would say the plotting functions for these should do the work as well.
>> >
>> > Anyway, you can see that many plotting functions are actually associated
>> > with functions in existing modules and the interactions are a bit
>> unclear.
>> >
>> > The only plotting functions I haven't mentioned so far that I thought
>> > about in the past are "2d scatter" and "plot decision function". These
>> > would be kind of generic, but mostly used in the examples.
>> > Though having a discrete 2d scatter function would be pretty nice
>> > (plt.scatter doesn't allow legends and makes it hard to use qualitative
>> > color maps).
>> >
>> >
>> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
>> > case is not really that clear.
>> >
>> > Cheers,
>> >
>> > Andy
>> >
>> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
>> > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
>> > > functions will be added? If it's just a dozen or less, putting them
>> all
>> > > into a single namespace sklearn.plot might be easier.
>> > >
>> > > This also would avoid discussion about where to put some generic
>> > > plotting functions (e.g.
>> > >
>> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479
>> ).
>> > >
>> > > Roman
>> > >
>> > > On 03/04/2019 12:06, Trevor Stephens wrote:
>> > >> I think #1 if any of these... Plotting functions should hopefully be
>> as
>> > >> general as possible, so tagging with a specific type of estimator
>> will,
>> > >> in some scikit-learn utopia, be unnecessary.
>> > >>
>> > >> If a general plotter is built, where does it live in other
>> > >> estimator-specific namespace options? Feels awkward to put it under
>> > >> every estimator's namespace.
>> > >>
>> > >> Then again, there might be a #4 where there is no plot module and
>> > >> plotting classes live under groups of utilities like introspection,
>> > >> cross-validation or something?...
>> > >>
>> > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com
>> > >> <mailto:ahowe42 at gmail.com>> wrote:
>> > >>
>> > >>      My preference would be for (1). I don't think the sub-namespace
>> in
>> > >>      (2) is necessary, and don't like (3), as I would prefer the
>> plotting
>> > >>      functions to be all in the same namespace sklearn.plot.
>> > >>
>> > >>      Andrew
>> > >>
>> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>> > >>      J. Andrew Howe, PhD
>> > >>      LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
>> > >>      ResearchGate Profile <
>> http://www.researchgate.net/profile/John_Howe12/>
>> > >>      Open Researcher and Contributor ID (ORCID)
>> > >>      <http://orcid.org/0000-0002-3553-1990>
>> > >>      Github Profile <http://github.com/ahowe42>
>> > >>      Personal Website <http://www.andrewhowe.com>
>> > >>      I live to learn, so I can learn to live. - me
>> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>> > >>
>> > >>
>> > >>      On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <
>> qinhanmin2005 at sina.com
>> > >>      <mailto:qinhanmin2005 at sina.com>> wrote:
>> > >>
>> > >>          See
>> https://github.com/scikit-learn/scikit-learn/issues/13448
>> > >>
>> > >>          We've introduced several plotting functions (e.g.,
>> plot_tree and
>> > >>          plot_partial_dependence) and will introduce more (e.g.,
>> > >>          plot_decision_boundary) in the future. Consequently, we
>> need to
>> > >>          decide where to put these functions. Currently, there're 3
>> > >>          proposals:
>> > >>
>> > >>          (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>> > >>
>> > >>          (2) sklearn.plot.XXX.plot_YYY (e.g.,
>> sklearn.plot.tree.plot_tree)
>> > >>
>> > >>          (3) sklearn.XXX.plot.plot_YYY (e.g.,
>> > >>          sklearn.tree.plot.plot_tree, note that we won't support from
>> > >>          sklearn.XXX import plot_YYY)
>> > >>
>> > >>          Joel Nothman, Gael Varoquaux and I decided to post it on the
>> > >>          mailing list to invite opinions.
>> > >>
>> > >>          Thanks
>> > >>
>> > >>          Hanmin Qin
>> > >>          _______________________________________________
>> > >>          scikit-learn mailing list
>> > >>          scikit-learn at python.org <mailto:scikit-learn at python.org>
>> > >>          https://mail.python.org/mailman/listinfo/scikit-learn
>> > >>
>> > >>      _______________________________________________
>> > >>      scikit-learn mailing list
>> > >>      scikit-learn at python.org <mailto:scikit-learn at python.org>
>> > >>      https://mail.python.org/mailman/listinfo/scikit-learn
>> > >>
>> > >
>> > > _______________________________________________
>> > > scikit-learn mailing list
>> > > scikit-learn at python.org
>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/1a5536ed/attachment-0001.html>

From t3kcit at gmail.com  Thu Apr  4 10:24:40 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 4 Apr 2019 10:24:40 -0400
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CADeotZr1x6fE9Q7FziKEiv7ujgvyDKtSgqHD9ij_NE1CVW=2Vw@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
 <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
 <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>
 <CADeotZr1x6fE9Q7FziKEiv7ujgvyDKtSgqHD9ij_NE1CVW=2Vw@mail.gmail.com>
Message-ID: <CADxzQooNw--VM9NdLDTrS3CvQsiv=f-2y6_EMA6TUHSKPryZ1w@mail.gmail.com>

I would argue that sklearn users would benefit in having solutions in
scikit-learn. The yellowbrick api is quite different from the approaches we
discussed. If we can reuse their implementations I think we should do so
and credit where we can.
Having plotting in sklearn is also likely to attract more contributors and
we have more eyes for doing reviews.

Sent from phone. Please excuse spelling and brevity.

On Thu, Apr 4, 2019, 05:43 Alexandre Gramfort <alexandre.gramfort at inria.fr>
wrote:

> I also think that YellowBrick folks did a great job and that we should not
> reinvent the wheel or at least have clear idea of how we differ in scope
> with respect to YellowBrick
>
> my 2c
>
> Alex
>
>
> On Thu, Apr 4, 2019 at 1:02 AM Eric Ma <ericmajinglong at gmail.com> wrote:
>
>> This is not a strongly-held suggestion - but what about adopting
>> YellowBrick as the plotting API for sklearn? Not sure how exactly the
>> interaction would work - could be PRs to their library, or ask them to
>> integrate into sklearn, or do a lock-step dance with versions but maintain
>> separate teams? (I know it raises more questions than answers, but wanted
>> to put it out there.)
>>
>> On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman <joel.nothman at gmail.com>
>> wrote:
>>
>>> With option 1, sklearn.plot is likely to import large chunks of the
>>> library (particularly, but not exclusively, if the plotting function
>>> "does the work" as Andy suggests). This is under the assumption that
>>> one plot function will want to import trees, another GPs, etc. Unless
>>> we move to lazy imports, that would be against the current convention
>>> that importing sklearn is fairly minimal.
>>>
>>> I do like Andy's idea of framing this discussion more clearly around
>>> likely candidates.
>>>
>>> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller <t3kcit at gmail.com> wrote:
>>> >
>>> > I think what was not clear from the question is that there is actually
>>> > quite different kinds of plotting functions, and many of these are tied
>>> > to existing code.
>>> >
>>> > Right now we have some that are specific to trees (plot_tree) and to
>>> > gradient boosting (plot_partial_dependence).
>>> >
>>> > I think we want more general functions, and plot_partial_dependence has
>>> > been extended to general estimators.
>>> >
>>> > However, the plotting functions might be generic wrt the estimator, but
>>> > relate to a specific function, say plotting results of GridSearchCV.
>>> > Then one might argue that having the plotting function close to
>>> > GridSearchCV might make sense.
>>> > Similarly for plotting partial dependence plots and feature
>>> importances,
>>> > it might be a bit strange to have the plotting functions not next to
>>> the
>>> > functions that compute these.
>>> > Another question would be is whether the plotting functions also "do
>>> the
>>> > work" in some cases:
>>> > Do we want plot_partial_dependence also to compute the partial
>>> > dependence? (I would argue yes but either way the result is a bit
>>> strange).
>>> > In that case you have somewhat of the same functionality in two
>>> > different modules, unless you also put the "compute partial dependence"
>>> > function in the plotting module as well,
>>> > which is a bit strange.
>>> >
>>> > Maybe we could inform this discussion by listing candidate plotting
>>> > functions, and also considering whether they "do the work" and where
>>> the
>>> > "work" function is.
>>> >
>>> > Other examples are plotting the confusion matrix, which probably should
>>> > also compute the confusion matrix (it's fast and so that would be
>>> > convenient), and so it would "duplicate" functionality from the metrics
>>> > module.
>>> >
>>> > Plotting learning curves and validation curves should probably not do
>>> > the work as it's pretty involved, and so someone would need to import
>>> > the learning and validation curves from model selection, and then the
>>> > plotting functions from a plotting module.
>>> >
>>> > Calibrations curves and P/R curves and roc curves are also pretty fast
>>> > to compute (and passing around the arguments is somewhat error prone)
>>> so
>>> > I would say the plotting functions for these should do the work as
>>> well.
>>> >
>>> > Anyway, you can see that many plotting functions are actually
>>> associated
>>> > with functions in existing modules and the interactions are a bit
>>> unclear.
>>> >
>>> > The only plotting functions I haven't mentioned so far that I thought
>>> > about in the past are "2d scatter" and "plot decision function". These
>>> > would be kind of generic, but mostly used in the examples.
>>> > Though having a discrete 2d scatter function would be pretty nice
>>> > (plt.scatter doesn't allow legends and makes it hard to use qualitative
>>> > color maps).
>>> >
>>> >
>>> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
>>> > case is not really that clear.
>>> >
>>> > Cheers,
>>> >
>>> > Andy
>>> >
>>> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
>>> > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
>>> > > functions will be added? If it's just a dozen or less, putting them
>>> all
>>> > > into a single namespace sklearn.plot might be easier.
>>> > >
>>> > > This also would avoid discussion about where to put some generic
>>> > > plotting functions (e.g.
>>> > >
>>> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479
>>> ).
>>> > >
>>> > > Roman
>>> > >
>>> > > On 03/04/2019 12:06, Trevor Stephens wrote:
>>> > >> I think #1 if any of these... Plotting functions should hopefully
>>> be as
>>> > >> general as possible, so tagging with a specific type of estimator
>>> will,
>>> > >> in some scikit-learn utopia, be unnecessary.
>>> > >>
>>> > >> If a general plotter is built, where does it live in other
>>> > >> estimator-specific namespace options? Feels awkward to put it under
>>> > >> every estimator's namespace.
>>> > >>
>>> > >> Then again, there might be a #4 where there is no plot module and
>>> > >> plotting classes live under groups of utilities like introspection,
>>> > >> cross-validation or something?...
>>> > >>
>>> > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com
>>> > >> <mailto:ahowe42 at gmail.com>> wrote:
>>> > >>
>>> > >>      My preference would be for (1). I don't think the
>>> sub-namespace in
>>> > >>      (2) is necessary, and don't like (3), as I would prefer the
>>> plotting
>>> > >>      functions to be all in the same namespace sklearn.plot.
>>> > >>
>>> > >>      Andrew
>>> > >>
>>> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>> > >>      J. Andrew Howe, PhD
>>> > >>      LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
>>> > >>      ResearchGate Profile <
>>> http://www.researchgate.net/profile/John_Howe12/>
>>> > >>      Open Researcher and Contributor ID (ORCID)
>>> > >>      <http://orcid.org/0000-0002-3553-1990>
>>> > >>      Github Profile <http://github.com/ahowe42>
>>> > >>      Personal Website <http://www.andrewhowe.com>
>>> > >>      I live to learn, so I can learn to live. - me
>>> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>> > >>
>>> > >>
>>> > >>      On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <
>>> qinhanmin2005 at sina.com
>>> > >>      <mailto:qinhanmin2005 at sina.com>> wrote:
>>> > >>
>>> > >>          See
>>> https://github.com/scikit-learn/scikit-learn/issues/13448
>>> > >>
>>> > >>          We've introduced several plotting functions (e.g.,
>>> plot_tree and
>>> > >>          plot_partial_dependence) and will introduce more (e.g.,
>>> > >>          plot_decision_boundary) in the future. Consequently, we
>>> need to
>>> > >>          decide where to put these functions. Currently, there're 3
>>> > >>          proposals:
>>> > >>
>>> > >>          (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>> > >>
>>> > >>          (2) sklearn.plot.XXX.plot_YYY (e.g.,
>>> sklearn.plot.tree.plot_tree)
>>> > >>
>>> > >>          (3) sklearn.XXX.plot.plot_YYY (e.g.,
>>> > >>          sklearn.tree.plot.plot_tree, note that we won't support
>>> from
>>> > >>          sklearn.XXX import plot_YYY)
>>> > >>
>>> > >>          Joel Nothman, Gael Varoquaux and I decided to post it on
>>> the
>>> > >>          mailing list to invite opinions.
>>> > >>
>>> > >>          Thanks
>>> > >>
>>> > >>          Hanmin Qin
>>> > >>          _______________________________________________
>>> > >>          scikit-learn mailing list
>>> > >>          scikit-learn at python.org <mailto:scikit-learn at python.org>
>>> > >>          https://mail.python.org/mailman/listinfo/scikit-learn
>>> > >>
>>> > >>      _______________________________________________
>>> > >>      scikit-learn mailing list
>>> > >>      scikit-learn at python.org <mailto:scikit-learn at python.org>
>>> > >>      https://mail.python.org/mailman/listinfo/scikit-learn
>>> > >>
>>> > >
>>> > > _______________________________________________
>>> > > scikit-learn mailing list
>>> > > scikit-learn at python.org
>>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>>> > _______________________________________________
>>> > scikit-learn mailing list
>>> > scikit-learn at python.org
>>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190404/ab848022/attachment-0001.html>

From joel.nothman at gmail.com  Thu Apr  4 17:12:09 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 5 Apr 2019 08:12:09 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CADxzQooNw--VM9NdLDTrS3CvQsiv=f-2y6_EMA6TUHSKPryZ1w@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
 <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
 <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>
 <CADeotZr1x6fE9Q7FziKEiv7ujgvyDKtSgqHD9ij_NE1CVW=2Vw@mail.gmail.com>
 <CADxzQooNw--VM9NdLDTrS3CvQsiv=f-2y6_EMA6TUHSKPryZ1w@mail.gmail.com>
Message-ID: <CAAkaFLXhERonKd5jQP9OozYQLb6rDWhdsU-8PVkMQz4jXzCQLg@mail.gmail.com>

Well it would certainly be a low-cost effort improvement if we demonstrated
yellowbrick in our examples.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190405/89676020/attachment.html>

From heitor.boschirolli at gmail.com  Sat Apr  6 13:07:38 2019
From: heitor.boschirolli at gmail.com (Heitor Boschirolli)
Date: Sat, 6 Apr 2019 14:07:38 -0300
Subject: [scikit-learn] Starting to contribute
Message-ID: <CABpcPbxPV+d7dJwudnUm7xrDB=JZm8ZF3K0eVTmDgjP3RPCN8A@mail.gmail.com>

Hello!

First of all, I'm apologize if this email is not for such questions, but I
never contributed to open source code before and I'm not sure how to
proceed, could someone help me with that?

Should I just pick an issue, solve it following the guidelines described in
the website and open a PR?
If I have any trouble, can I post it on the mailing list?

Att, Heitor Boschirolli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190406/590d682c/attachment.html>

From ahowe42 at gmail.com  Sun Apr  7 05:08:24 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Sun, 7 Apr 2019 10:08:24 +0100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CADxzQooNw--VM9NdLDTrS3CvQsiv=f-2y6_EMA6TUHSKPryZ1w@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
 <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
 <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>
 <CADeotZr1x6fE9Q7FziKEiv7ujgvyDKtSgqHD9ij_NE1CVW=2Vw@mail.gmail.com>
 <CADxzQooNw--VM9NdLDTrS3CvQsiv=f-2y6_EMA6TUHSKPryZ1w@mail.gmail.com>
Message-ID: <CANnYi3RAfCdWrRoMpVbuG2HKrjo_s=QFScakqKsW1gaBZCo5iA@mail.gmail.com>

I'm with Andreas on this. As a user, I would prefer to see this as part of
sklearn with the usual sklearn api. If we want static matplotlib-style
images, reusing (with credit) some of the yellowbrick implementations is a
good idea.

Would we consider plotly-based visualizations? I've been doing my own
plotting in plotly for the last month, and can't imagine going back to
static matplotlib plots...

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
ResearchGate Profile <http://www.researchgate.net/profile/John_Howe12/>
Open Researcher and Contributor ID (ORCID)
<http://orcid.org/0000-0002-3553-1990>
Github Profile <http://github.com/ahowe42>
Personal Website <http://www.andrewhowe.com>
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>


On Thu, Apr 4, 2019 at 3:26 PM Andreas Mueller <t3kcit at gmail.com> wrote:

> I would argue that sklearn users would benefit in having solutions in
> scikit-learn. The yellowbrick api is quite different from the approaches we
> discussed. If we can reuse their implementations I think we should do so
> and credit where we can.
> Having plotting in sklearn is also likely to attract more contributors and
> we have more eyes for doing reviews.
>
> Sent from phone. Please excuse spelling and brevity.
>
> On Thu, Apr 4, 2019, 05:43 Alexandre Gramfort <alexandre.gramfort at inria.fr>
> wrote:
>
>> I also think that YellowBrick folks did a great job and that we should
>> not reinvent the wheel or at least have clear idea of how we differ in
>> scope with respect to YellowBrick
>>
>> my 2c
>>
>> Alex
>>
>>
>> On Thu, Apr 4, 2019 at 1:02 AM Eric Ma <ericmajinglong at gmail.com> wrote:
>>
>>> This is not a strongly-held suggestion - but what about adopting
>>> YellowBrick as the plotting API for sklearn? Not sure how exactly the
>>> interaction would work - could be PRs to their library, or ask them to
>>> integrate into sklearn, or do a lock-step dance with versions but maintain
>>> separate teams? (I know it raises more questions than answers, but wanted
>>> to put it out there.)
>>>
>>> On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman <joel.nothman at gmail.com>
>>> wrote:
>>>
>>>> With option 1, sklearn.plot is likely to import large chunks of the
>>>> library (particularly, but not exclusively, if the plotting function
>>>> "does the work" as Andy suggests). This is under the assumption that
>>>> one plot function will want to import trees, another GPs, etc. Unless
>>>> we move to lazy imports, that would be against the current convention
>>>> that importing sklearn is fairly minimal.
>>>>
>>>> I do like Andy's idea of framing this discussion more clearly around
>>>> likely candidates.
>>>>
>>>> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller <t3kcit at gmail.com> wrote:
>>>> >
>>>> > I think what was not clear from the question is that there is actually
>>>> > quite different kinds of plotting functions, and many of these are
>>>> tied
>>>> > to existing code.
>>>> >
>>>> > Right now we have some that are specific to trees (plot_tree) and to
>>>> > gradient boosting (plot_partial_dependence).
>>>> >
>>>> > I think we want more general functions, and plot_partial_dependence
>>>> has
>>>> > been extended to general estimators.
>>>> >
>>>> > However, the plotting functions might be generic wrt the estimator,
>>>> but
>>>> > relate to a specific function, say plotting results of GridSearchCV.
>>>> > Then one might argue that having the plotting function close to
>>>> > GridSearchCV might make sense.
>>>> > Similarly for plotting partial dependence plots and feature
>>>> importances,
>>>> > it might be a bit strange to have the plotting functions not next to
>>>> the
>>>> > functions that compute these.
>>>> > Another question would be is whether the plotting functions also "do
>>>> the
>>>> > work" in some cases:
>>>> > Do we want plot_partial_dependence also to compute the partial
>>>> > dependence? (I would argue yes but either way the result is a bit
>>>> strange).
>>>> > In that case you have somewhat of the same functionality in two
>>>> > different modules, unless you also put the "compute partial
>>>> dependence"
>>>> > function in the plotting module as well,
>>>> > which is a bit strange.
>>>> >
>>>> > Maybe we could inform this discussion by listing candidate plotting
>>>> > functions, and also considering whether they "do the work" and where
>>>> the
>>>> > "work" function is.
>>>> >
>>>> > Other examples are plotting the confusion matrix, which probably
>>>> should
>>>> > also compute the confusion matrix (it's fast and so that would be
>>>> > convenient), and so it would "duplicate" functionality from the
>>>> metrics
>>>> > module.
>>>> >
>>>> > Plotting learning curves and validation curves should probably not do
>>>> > the work as it's pretty involved, and so someone would need to import
>>>> > the learning and validation curves from model selection, and then the
>>>> > plotting functions from a plotting module.
>>>> >
>>>> > Calibrations curves and P/R curves and roc curves are also pretty fast
>>>> > to compute (and passing around the arguments is somewhat error prone)
>>>> so
>>>> > I would say the plotting functions for these should do the work as
>>>> well.
>>>> >
>>>> > Anyway, you can see that many plotting functions are actually
>>>> associated
>>>> > with functions in existing modules and the interactions are a bit
>>>> unclear.
>>>> >
>>>> > The only plotting functions I haven't mentioned so far that I thought
>>>> > about in the past are "2d scatter" and "plot decision function". These
>>>> > would be kind of generic, but mostly used in the examples.
>>>> > Though having a discrete 2d scatter function would be pretty nice
>>>> > (plt.scatter doesn't allow legends and makes it hard to use
>>>> qualitative
>>>> > color maps).
>>>> >
>>>> >
>>>> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
>>>> > case is not really that clear.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Andy
>>>> >
>>>> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
>>>> > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
>>>> > > functions will be added? If it's just a dozen or less, putting them
>>>> all
>>>> > > into a single namespace sklearn.plot might be easier.
>>>> > >
>>>> > > This also would avoid discussion about where to put some generic
>>>> > > plotting functions (e.g.
>>>> > >
>>>> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479
>>>> ).
>>>> > >
>>>> > > Roman
>>>> > >
>>>> > > On 03/04/2019 12:06, Trevor Stephens wrote:
>>>> > >> I think #1 if any of these... Plotting functions should hopefully
>>>> be as
>>>> > >> general as possible, so tagging with a specific type of estimator
>>>> will,
>>>> > >> in some scikit-learn utopia, be unnecessary.
>>>> > >>
>>>> > >> If a general plotter is built, where does it live in other
>>>> > >> estimator-specific namespace options? Feels awkward to put it under
>>>> > >> every estimator's namespace.
>>>> > >>
>>>> > >> Then again, there might be a #4 where there is no plot module and
>>>> > >> plotting classes live under groups of utilities like introspection,
>>>> > >> cross-validation or something?...
>>>> > >>
>>>> > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahowe42 at gmail.com
>>>> > >> <mailto:ahowe42 at gmail.com>> wrote:
>>>> > >>
>>>> > >>      My preference would be for (1). I don't think the
>>>> sub-namespace in
>>>> > >>      (2) is necessary, and don't like (3), as I would prefer the
>>>> plotting
>>>> > >>      functions to be all in the same namespace sklearn.plot.
>>>> > >>
>>>> > >>      Andrew
>>>> > >>
>>>> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>>> > >>      J. Andrew Howe, PhD
>>>> > >>      LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
>>>> > >>      ResearchGate Profile <
>>>> http://www.researchgate.net/profile/John_Howe12/>
>>>> > >>      Open Researcher and Contributor ID (ORCID)
>>>> > >>      <http://orcid.org/0000-0002-3553-1990>
>>>> > >>      Github Profile <http://github.com/ahowe42>
>>>> > >>      Personal Website <http://www.andrewhowe.com>
>>>> > >>      I live to learn, so I can learn to live. - me
>>>> > >>      <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>>> > >>
>>>> > >>
>>>> > >>      On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <
>>>> qinhanmin2005 at sina.com
>>>> > >>      <mailto:qinhanmin2005 at sina.com>> wrote:
>>>> > >>
>>>> > >>          See
>>>> https://github.com/scikit-learn/scikit-learn/issues/13448
>>>> > >>
>>>> > >>          We've introduced several plotting functions (e.g.,
>>>> plot_tree and
>>>> > >>          plot_partial_dependence) and will introduce more (e.g.,
>>>> > >>          plot_decision_boundary) in the future. Consequently, we
>>>> need to
>>>> > >>          decide where to put these functions. Currently, there're 3
>>>> > >>          proposals:
>>>> > >>
>>>> > >>          (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>>> > >>
>>>> > >>          (2) sklearn.plot.XXX.plot_YYY (e.g.,
>>>> sklearn.plot.tree.plot_tree)
>>>> > >>
>>>> > >>          (3) sklearn.XXX.plot.plot_YYY (e.g.,
>>>> > >>          sklearn.tree.plot.plot_tree, note that we won't support
>>>> from
>>>> > >>          sklearn.XXX import plot_YYY)
>>>> > >>
>>>> > >>          Joel Nothman, Gael Varoquaux and I decided to post it on
>>>> the
>>>> > >>          mailing list to invite opinions.
>>>> > >>
>>>> > >>          Thanks
>>>> > >>
>>>> > >>          Hanmin Qin
>>>> > >>          _______________________________________________
>>>> > >>          scikit-learn mailing list
>>>> > >>          scikit-learn at python.org <mailto:scikit-learn at python.org>
>>>> > >>          https://mail.python.org/mailman/listinfo/scikit-learn
>>>> > >>
>>>> > >>      _______________________________________________
>>>> > >>      scikit-learn mailing list
>>>> > >>      scikit-learn at python.org <mailto:scikit-learn at python.org>
>>>> > >>      https://mail.python.org/mailman/listinfo/scikit-learn
>>>> > >>
>>>> > >
>>>> > > _______________________________________________
>>>> > > scikit-learn mailing list
>>>> > > scikit-learn at python.org
>>>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>>>> > _______________________________________________
>>>> > scikit-learn mailing list
>>>> > scikit-learn at python.org
>>>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190407/5a9ee1a9/attachment-0001.html>

From rth.yurchak at pm.me  Sun Apr  7 05:23:56 2019
From: rth.yurchak at pm.me (Roman Yurchak)
Date: Sun, 07 Apr 2019 09:23:56 +0000
Subject: [scikit-learn] Starting to contribute
In-Reply-To: <CABpcPbxPV+d7dJwudnUm7xrDB=JZm8ZF3K0eVTmDgjP3RPCN8A@mail.gmail.com>
References: <CABpcPbxPV+d7dJwudnUm7xrDB=JZm8ZF3K0eVTmDgjP3RPCN8A@mail.gmail.com>
Message-ID: <6cT00pTFFXyDphB5zuehmAGa1J-m9yvO8UcgcbA9wFY8KzGojBgoA5pfPQZakaVTF88utkZlF9v-qCyfIHledAKfzFtXXvqvTBNkT975it8=@pm.me>

Hello Heitor,

yes, you can chose an issue, comment there that you plan to work on it 
(to avoid redundant work by other contributors) and if no one objects 
make a PR. If you have any questions you can ask them by commenting on 
that issue (for specific questions) or on the scikit-learn Gitter 
https://gitter.im/scikit-learn/scikit-learn (for general questions about 
how to contribute).

See https://scikit-learn.org/stable/developers/contributing.html for 
more information.

Roman

On 06/04/2019 19:07, Heitor Boschirolli wrote:
> Hello!
> 
> First of all, I'm apologize if this email is not for such questions, but 
> I never contributed to open source code before and I'm not sure how to 
> proceed, could someone help me with that?
> 
> Should I just pick an issue, solve it following the guidelines described 
> in the website and open a PR?
> If I have any trouble, can I post it on the mailing list?
> 
> Att, Heitor Boschirolli


From emmanuelle.gouillart at nsup.org  Sun Apr  7 11:41:48 2019
From: emmanuelle.gouillart at nsup.org (Emmanuelle Gouillart)
Date: Sun, 7 Apr 2019 17:41:48 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
 functions?
In-Reply-To: <CANnYi3RAfCdWrRoMpVbuG2HKrjo_s=QFScakqKsW1gaBZCo5iA@mail.gmail.com>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
 <CANnYi3SO-6j8GpkO7cUZabB1CfzegwUJTF0c90B6UVBvgCCLkQ@mail.gmail.com>
 <CAE0Q3zWkV3quVioOMQ06kQceGvgmEvw-zJ2EH=VC7MEzPyuENA@mail.gmail.com>
 <UV3VC8B5V87wU1W2eaCcKKLQbR6L5hvH9mneGPeV3Nn_noj_GCLlioQ1Mb7ODYfu_vsr9-BBDwNFhl8yLE-1xrQLePrpThu_Sc-lv7Fadts=@pm.me>
 <e5f0d69f-7ff4-5f78-b12a-dfbd9288b502@gmail.com>
 <CAAkaFLXMnz4-UiYO1JwhgJePLuqeM1W0aM1GUkWDdOVLBPdgUg@mail.gmail.com>
 <CAK-i=xhJAV-VkEYrb_bCP68RmvnCNNboNqEmPkF3+bhigb+=Uw@mail.gmail.com>
 <CADeotZr1x6fE9Q7FziKEiv7ujgvyDKtSgqHD9ij_NE1CVW=2Vw@mail.gmail.com>
 <CADxzQooNw--VM9NdLDTrS3CvQsiv=f-2y6_EMA6TUHSKPryZ1w@mail.gmail.com>
 <CANnYi3RAfCdWrRoMpVbuG2HKrjo_s=QFScakqKsW1gaBZCo5iA@mail.gmail.com>
Message-ID: <20190407154148.utfbtkrakftz3rbr@phare.normalesup.org>

Hi,
I suppose you won't want to rewrite all examples if you choose
plotly-based viz, so this help page about converting matplotlib figures
or code to plotly might help https://plot.ly/matplotlib/getting-started/
I hope it works, the doc page looks a bit old.

Cheers
Emma


On Sun, Apr 07, 2019 at 10:08:24AM +0100, Andrew Howe wrote:
> I'm with Andreas on this. As a user, I would prefer to see this as part of
> sklearn with the usual sklearn api. If we want static matplotlib-style images,
> reusing (with credit) some of the yellowbrick implementations is a good idea.

> Would we consider plotly-based visualizations? I've been doing my own plotting
> in plotly for the last month, and can't imagine going back to static matplotlib
> plots...

> Andrew

> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile
> ResearchGate Profile
> Open Researcher and Contributor ID (ORCID)
> Github Profile
> Personal Website
> I live to learn, so I can learn to live. - me
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>


> On Thu, Apr 4, 2019 at 3:26 PM Andreas Mueller <t3kcit at gmail.com> wrote:

>     I would argue that sklearn users would benefit in having solutions in
>     scikit-learn. The yellowbrick api is quite different from the approaches we
>     discussed. If we can reuse their implementations I think we should do so
>     and credit where we can.?
>     Having plotting in sklearn is also likely to attract more contributors and
>     we have more eyes for doing reviews.

>     Sent from phone. Please excuse spelling and brevity.

>     On Thu, Apr 4, 2019, 05:43 Alexandre Gramfort <alexandre.gramfort at inria.fr>
>     wrote:

>         I also think that YellowBrick folks did a great job and that we should
>         not reinvent the wheel or at least have clear idea of how we differ in
>         scope with respect to YellowBrick

>         my 2c

>         Alex


>         On Thu, Apr 4, 2019 at 1:02 AM Eric Ma <ericmajinglong at gmail.com>
>         wrote:

>             This is not a strongly-held suggestion - but what about adopting
>             YellowBrick as the plotting API for sklearn? Not sure how exactly
>             the interaction would work - could be PRs to their library, or ask
>             them to integrate into sklearn, or do a lock-step dance with
>             versions but maintain separate teams? (I know it raises more
>             questions than answers, but wanted to put it out?there.)

>             On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman <joel.nothman at gmail.com
>             > wrote:

>                 With option 1, sklearn.plot is likely to import large chunks of
>                 the
>                 library (particularly, but not exclusively, if the plotting
>                 function
>                 "does the work" as Andy suggests). This is under the assumption
>                 that
>                 one plot function will want to import trees, another GPs, etc.
>                 Unless
>                 we move to lazy imports, that would be against the current
>                 convention
>                 that importing sklearn is fairly minimal.

>                 I do like Andy's idea of framing this discussion more clearly
>                 around
>                 likely candidates.

>                 On Thu, 4 Apr 2019 at 00:10, Andreas Mueller <t3kcit at gmail.com>
>                 wrote:

>                 > I think what was not clear from the question is that there is
>                 actually
>                 > quite different kinds of plotting functions, and many of
>                 these are tied
>                 > to existing code.

>                 > Right now we have some that are specific to trees (plot_tree)
>                 and to
>                 > gradient boosting (plot_partial_dependence).

>                 > I think we want more general functions, and
>                 plot_partial_dependence has
>                 > been extended to general estimators.

>                 > However, the plotting functions might be generic wrt the
>                 estimator, but
>                 > relate to a specific function, say plotting results of
>                 GridSearchCV.
>                 > Then one might argue that having the plotting function close
>                 to
>                 > GridSearchCV might make sense.
>                 > Similarly for plotting partial dependence plots and feature
>                 importances,
>                 > it might be a bit strange to have the plotting functions not
>                 next to the
>                 > functions that compute these.
>                 > Another question would be is whether the plotting functions
>                 also "do the
>                 > work" in some cases:
>                 > Do we want plot_partial_dependence also to compute the
>                 partial
>                 > dependence? (I would argue yes but either way the result is a
>                 bit strange).
>                 > In that case you have somewhat of the same functionality in
>                 two
>                 > different modules, unless you also put the "compute partial
>                 dependence"
>                 > function in the plotting module as well,
>                 > which is a bit strange.

>                 > Maybe we could inform this discussion by listing candidate
>                 plotting
>                 > functions, and also considering whether they "do the work"
>                 and where the
>                 > "work" function is.

>                 > Other examples are plotting the confusion matrix, which
>                 probably should
>                 > also compute the confusion matrix (it's fast and so that
>                 would be
>                 > convenient), and so it would "duplicate" functionality from
>                 the metrics
>                 > module.

>                 > Plotting learning curves and validation curves should
>                 probably not do
>                 > the work as it's pretty involved, and so someone would need
>                 to import
>                 > the learning and validation curves from model selection, and
>                 then the
>                 > plotting functions from a plotting module.

>                 > Calibrations curves and P/R curves and roc curves are also
>                 pretty fast
>                 > to compute (and passing around the arguments is somewhat
>                 error prone) so
>                 > I would say the plotting functions for these should do the
>                 work as well.

>                 > Anyway, you can see that many plotting functions are actually
>                 associated
>                 > with functions in existing modules and the interactions are a
>                 bit unclear.

>                 > The only plotting functions I haven't mentioned so far that I
>                 thought
>                 > about in the past are "2d scatter" and "plot decision
>                 function". These
>                 > would be kind of generic, but mostly used in the examples.
>                 > Though having a discrete 2d scatter function would be pretty
>                 nice
>                 > (plt.scatter doesn't allow legends and makes it hard to use
>                 qualitative
>                 > color maps).


>                 > I think I would vote for option (1), "sklearn.plot.plot_zzz"
>                 but the
>                 > case is not really that clear.

>                 > Cheers,

>                 > Andy

>                 > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
>                 > > +1 for options 1 and +0.5 for 3. Do we anticipate that many
>                 plotting
>                 > > functions will be added? If it's just a dozen or less,
>                 putting them all
>                 > > into a single namespace sklearn.plot might be easier.

>                 > > This also would avoid discussion about where to put some
>                 generic
>                 > > plotting functions (e.g.
>                 > > https://github.com/scikit-learn/scikit-learn/issues/13448#
>                 issuecomment-478341479).

>                 > > Roman

>                 > > On 03/04/2019 12:06, Trevor Stephens wrote:
>                 > >> I think #1 if any of these... Plotting functions should
>                 hopefully be as
>                 > >> general as possible, so tagging with a specific type of
>                 estimator will,
>                 > >> in some scikit-learn utopia, be unnecessary.

>                 > >> If a general plotter is built, where does it live in other
>                 > >> estimator-specific namespace options? Feels awkward to put
>                 it under
>                 > >> every estimator's namespace.

>                 > >> Then again, there might be a #4 where there is no plot
>                 module and
>                 > >> plotting classes live under groups of utilities like
>                 introspection,
>                 > >> cross-validation or something?...

>                 > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <
>                 ahowe42 at gmail.com
>                 > >> <mailto:ahowe42 at gmail.com>> wrote:

>                 > >>? ? ? My preference would be for (1). I don't think the
>                 sub-namespace in
>                 > >>? ? ? (2) is necessary, and don't like (3), as I would
>                 prefer the plotting
>                 > >>? ? ? functions to be all in the same namespace
>                 sklearn.plot.

>                 > >>? ? ? Andrew

>                 > >>? ? ? <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>                 > >>? ? ? J. Andrew Howe, PhD
>                 > >>? ? ? LinkedIn Profile <http://www.linkedin.com/in/ahowe42>
>                 > >>? ? ? ResearchGate Profile <http://www.researchgate.net/
>                 profile/John_Howe12/>
>                 > >>? ? ? Open Researcher and Contributor ID (ORCID)
>                 > >>? ? ? <http://orcid.org/0000-0002-3553-1990>
>                 > >>? ? ? Github Profile <http://github.com/ahowe42>
>                 > >>? ? ? Personal Website <http://www.andrewhowe.com>
>                 > >>? ? ? I live to learn, so I can learn to live. - me
>                 > >>? ? ? <~~~~~~~~~~~~~~~~~~~~~~~~~~~>


>                 > >>? ? ? On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <
>                 qinhanmin2005 at sina.com
>                 > >>? ? ? <mailto:qinhanmin2005 at sina.com>> wrote:

>                 > >>? ? ? ? ? See https://github.com/scikit-learn/scikit-learn/
>                 issues/13448

>                 > >>? ? ? ? ? We've introduced several plotting functions
>                 (e.g., plot_tree and
>                 > >>? ? ? ? ? plot_partial_dependence) and will introduce more
>                 (e.g.,
>                 > >>? ? ? ? ? plot_decision_boundary) in the future.
>                 Consequently, we need to
>                 > >>? ? ? ? ? decide where to put these functions. Currently,
>                 there're 3
>                 > >>? ? ? ? ? proposals:

>                 > >>? ? ? ? ? (1) sklearn.plot.plot_YYY (e.g.,
>                 sklearn.plot.plot_tree)

>                 > >>? ? ? ? ? (2) sklearn.plot.XXX.plot_YYY (e.g.,
>                 sklearn.plot.tree.plot_tree)

>                 > >>? ? ? ? ? (3) sklearn.XXX.plot.plot_YYY (e.g.,
>                 > >>? ? ? ? ? sklearn.tree.plot.plot_tree, note that we won't
>                 support from
>                 > >>? ? ? ? ? sklearn.XXX import plot_YYY)

>                 > >>? ? ? ? ? Joel Nothman, Gael Varoquaux and I decided to
>                 post it on the
>                 > >>? ? ? ? ? mailing list to invite opinions.

>                 > >>? ? ? ? ? Thanks

>                 > >>? ? ? ? ? Hanmin Qin
>                 > >>? ? ? ? ? _______________________________________________
>                 > >>? ? ? ? ? scikit-learn mailing list
>                 > >>? ? ? ? ? scikit-learn at python.org <mailto:
>                 scikit-learn at python.org>
>                 > >>? ? ? ? ? https://mail.python.org/mailman/listinfo/
>                 scikit-learn

>                 > >>? ? ? _______________________________________________
>                 > >>? ? ? scikit-learn mailing list
>                 > >>? ? ? scikit-learn at python.org <mailto:
>                 scikit-learn at python.org>
>                 > >>? ? ? https://mail.python.org/mailman/listinfo/scikit-learn


>                 > > _______________________________________________
>                 > > scikit-learn mailing list
>                 > > scikit-learn at python.org
>                 > > https://mail.python.org/mailman/listinfo/scikit-learn
>                 > _______________________________________________
>                 > scikit-learn mailing list
>                 > scikit-learn at python.org
>                 > https://mail.python.org/mailman/listinfo/scikit-learn
>                 _______________________________________________
>                 scikit-learn mailing list
>                 scikit-learn at python.org
>                 https://mail.python.org/mailman/listinfo/scikit-learn

>             _______________________________________________
>             scikit-learn mailing list
>             scikit-learn at python.org
>             https://mail.python.org/mailman/listinfo/scikit-learn

>         _______________________________________________
>         scikit-learn mailing list
>         scikit-learn at python.org
>         https://mail.python.org/mailman/listinfo/scikit-learn

>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn


> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From solegalli1 at gmail.com  Wed Apr 10 13:23:03 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 10 Apr 2019 18:23:03 +0100
Subject: [scikit-learn] Feature engineering functionality - new package
In-Reply-To: <CANDT+DFqs8LApvBCW5TRcsXC32jBAWMmoGf6fG2-yPF7WF-pkA@mail.gmail.com>
References: <CANDT+DFqs8LApvBCW5TRcsXC32jBAWMmoGf6fG2-yPF7WF-pkA@mail.gmail.com>
Message-ID: <CANDT+DFM6arejRVu5HF8J0e4AtWaJNE7iE-nrTDkQei3BTsy0w@mail.gmail.com>

>
> Dear Scikit-Learn team,
>
> Feature engineering is a big task ahead of building machine learning
> models. It involves imputation of missing values, encoding of categorical
> variables, discretisation, variable transformation etc.
>
> Sklearn includes some functionality for feature engineering, which is
> useful, but it has a few limitations:
>
> 1) it does not allow for feature specification - it will do the same
> process on all variables, for example SimpleImputer. Typically, we want
> to impute different columns with different values.
> 2) It does not capture information from the training set, this is it does
> not learn, therefore, it is not able to perpetuate the values learnt from
> the train set, to unseen data.
>
> The 2 limitations above apply to all the feature transformers in sklearn,
> I believe.
>
> Therefore, if these transformers are used as part of a pipeline, we could
> end up doing different transformations to train and test, depending on the
> characteristics of the datasets. For business purposes, this is not a
> desired option.
>
> I think that building transformers that learn from the train set would be
> of much use for the community.
>
> To this end, I built a python package called feature engine
> <https://pypi.org/project/feature-engine/> which expands the sklearn-api
> with additional feature engineering techniques, and the functionality that
> allows the transformer to learn from data and store the parameters learnt.
>
> The techniques included have been used worldwide, both in business and in
> data competitions, and reported in kdd reports and other articles. I also
> cover them in an udemy course
> <https://www.udemy.com/feature-engineering-for-machine-learning> which
> has enrolled several thousand students.
>
> The package capitalises on the use of pandas to capture the features, but
> I am confident that the columns names could be captured and the df
> transformed to a numpy array to comply with sklearn requirements.
>
> I wondered whether it would be of interest to include the functionality of
> this package within sklearn?
> If you would consider extending the sklearn api to include these
> transformers, I would be happy to help.
>
> Alternatively, would you consider to add the package to your website?
> where you mention the libaries that extend sklearn functionality?
>
> All feedback is welcome.
>
> Many thanks and I look forward to hearing from you
>
> Thank you so much fur such an awesome contribution through the sklearn api
>
> Kind regards
>
> Sole
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190410/e8209d78/attachment.html>

From liam at chatdesk.com  Wed Apr 10 13:25:56 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 13:25:56 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration
 with Google Cloud ML
Message-ID: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>

Hi all,

I was hoping to get some guidance re: changing the result of the predict
method of the OneVsRestClassifier to return a dense array rather than a
sparse array, given that Google Cloud ML only accepts dense numpy arrays as
a result of a given models predict method. Right now my model architecture
looks like:

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf',
OneVsRestClassifier(XGBClassifier()))])

Which returns a sparse array with the predict method. I saw the Stack
Overflow post here:
https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba

which recommends overwriting the predict method with the predict_proba
method, however I found that I can't serialize the model after doing so. I
also have a stack overflow post here:
https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
which
details the specific pickling error.

Is this a known issue? Is there an accepted way to convert this into a
dense array?

Thanks,
Liam Geron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190410/3456adf4/attachment.html>

From goix.nicolas at gmail.com  Wed Apr 10 13:42:09 2019
From: goix.nicolas at gmail.com (Nicolas Goix)
Date: Wed, 10 Apr 2019 18:42:09 +0100
Subject: [scikit-learn] Feature engineering functionality - new package
In-Reply-To: <CANDT+DFM6arejRVu5HF8J0e4AtWaJNE7iE-nrTDkQei3BTsy0w@mail.gmail.com>
References: <CANDT+DFqs8LApvBCW5TRcsXC32jBAWMmoGf6fG2-yPF7WF-pkA@mail.gmail.com>
 <CANDT+DFM6arejRVu5HF8J0e4AtWaJNE7iE-nrTDkQei3BTsy0w@mail.gmail.com>
Message-ID: <CAPV6P2wKna8PPRsu35D1uGVrwanBOhJUQOE_u1fYPpEPFZrhEA@mail.gmail.com>

Hi Sole,

I'm not sure the 2 limitations you mentioned are correct.
1) in your example, using the ColumnTransformer you can impute different
values for different columns.
2) the sklearn transformers do learn on the training set and are able to
perpetuate the values learnt from the train set to unseen data.

Nicolas

On Wed, Apr 10, 2019, 18:25 Sole Galli <solegalli1 at gmail.com> wrote:

> Dear Scikit-Learn team,
>>
>> Feature engineering is a big task ahead of building machine learning
>> models. It involves imputation of missing values, encoding of categorical
>> variables, discretisation, variable transformation etc.
>>
>> Sklearn includes some functionality for feature engineering, which is
>> useful, but it has a few limitations:
>>
>> 1) it does not allow for feature specification - it will do the same
>> process on all variables, for example SimpleImputer. Typically, we want
>> to impute different columns with different values.
>> 2) It does not capture information from the training set, this is it does
>> not learn, therefore, it is not able to perpetuate the values learnt from
>> the train set, to unseen data.
>>
>> The 2 limitations above apply to all the feature transformers in sklearn,
>> I believe.
>>
>> Therefore, if these transformers are used as part of a pipeline, we could
>> end up doing different transformations to train and test, depending on the
>> characteristics of the datasets. For business purposes, this is not a
>> desired option.
>>
>> I think that building transformers that learn from the train set would be
>> of much use for the community.
>>
>> To this end, I built a python package called feature engine
>> <https://pypi.org/project/feature-engine/> which expands the sklearn-api
>> with additional feature engineering techniques, and the functionality that
>> allows the transformer to learn from data and store the parameters learnt.
>>
>> The techniques included have been used worldwide, both in business and in
>> data competitions, and reported in kdd reports and other articles. I also
>> cover them in an udemy course
>> <https://www.udemy.com/feature-engineering-for-machine-learning> which
>> has enrolled several thousand students.
>>
>> The package capitalises on the use of pandas to capture the features, but
>> I am confident that the columns names could be captured and the df
>> transformed to a numpy array to comply with sklearn requirements.
>>
>> I wondered whether it would be of interest to include the functionality
>> of this package within sklearn?
>> If you would consider extending the sklearn api to include these
>> transformers, I would be happy to help.
>>
>> Alternatively, would you consider to add the package to your website?
>> where you mention the libaries that extend sklearn functionality?
>>
>> All feedback is welcome.
>>
>> Many thanks and I look forward to hearing from you
>>
>> Thank you so much fur such an awesome contribution through the sklearn api
>>
>> Kind regards
>>
>> Sole
>>
>> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190410/f58a699f/attachment-0001.html>

From mail at sebastianraschka.com  Wed Apr 10 13:35:07 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 10 Apr 2019 12:35:07 -0500
Subject: [scikit-learn] Predict Method of OneVsRestClassifier
 Integration with Google Cloud ML
In-Reply-To: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
References: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
Message-ID: <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>

Hi Liam,

not sure what your exact error message is, but it may also be that the XGBClassifier only accepts dense arrays? I think the TfidfVectorizer returns sparse arrays. You could probably fix your issues by inserting a "DenseTransformer" into your pipelone (a simple class that just transforms an array from a sparse to a dense format). I've implemented sth like that that you can import or copy&paste it from here:

https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py

The usage would then basically be

model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense', DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])

Best,
Sebastian


> On Apr 10, 2019, at 12:25 PM, Liam Geron <liam at chatdesk.com> wrote:
> 
> Hi all,
> 
> I was hoping to get some guidance re: changing the result of the predict method of the OneVsRestClassifier to return a dense array rather than a sparse array, given that Google Cloud ML only accepts dense numpy arrays as a result of a given models predict method. Right now my model architecture looks like:
> 
> model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
> 
> Which returns a sparse array with the predict method. I saw the Stack Overflow post here: https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
> 
> which recommends overwriting the predict method with the predict_proba method, however I found that I can't serialize the model after doing so. I also have a stack overflow post here: https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a which details the specific pickling error.
> 
> Is this a known issue? Is there an accepted way to convert this into a dense array?
> 
> Thanks,
> Liam Geron
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From liam at chatdesk.com  Wed Apr 10 14:10:35 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 14:10:35 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier
 Integration with Google Cloud ML
In-Reply-To: <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>
References: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
 <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>
Message-ID: <CAJn_aE6_jr3vohAT2gYfSP94xRZDu=-oeagsaDWxQEyt3Ya1zQ@mail.gmail.com>

Hi Sebastian,

Thanks for the advice! The model actually works on it's own in python fine
luckily, so I don't think that that is the issue exactly. I have tried
rolling my own estimator to wrap the pipeline to have it call the
predict_proba method to return a dense array, however I then came across
the problem that I would have to have that custom estimator defined on the
Cloud ML end, which I'm unsure how to do.

Thanks,
Liam

On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka <mail at sebastianraschka.com>
wrote:

> Hi Liam,
>
> not sure what your exact error message is, but it may also be that the
> XGBClassifier only accepts dense arrays? I think the TfidfVectorizer
> returns sparse arrays. You could probably fix your issues by inserting a
> "DenseTransformer" into your pipelone (a simple class that just transforms
> an array from a sparse to a dense format). I've implemented sth like that
> that you can import or copy&paste it from here:
>
>
> https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py
>
> The usage would then basically be
>
> model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense',
> DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
>
> Best,
> Sebastian
>
>
>
>
> > On Apr 10, 2019, at 12:25 PM, Liam Geron <liam at chatdesk.com> wrote:
> >
> > Hi all,
> >
> > I was hoping to get some guidance re: changing the result of the predict
> method of the OneVsRestClassifier to return a dense array rather than a
> sparse array, given that Google Cloud ML only accepts dense numpy arrays as
> a result of a given models predict method. Right now my model architecture
> looks like:
> >
> > model = Pipeline([('tfidf', TfidfVectorizer()), ('clf',
> OneVsRestClassifier(XGBClassifier()))])
> >
> > Which returns a sparse array with the predict method. I saw the Stack
> Overflow post here:
> https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
> >
> > which recommends overwriting the predict method with the predict_proba
> method, however I found that I can't serialize the model after doing so. I
> also have a stack overflow post here:
> https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
> which details the specific pickling error.
> >
> > Is this a known issue? Is there an accepted way to convert this into a
> dense array?
> >
> > Thanks,
> > Liam Geron
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190410/841a4c11/attachment.html>

From solegalli1 at gmail.com  Wed Apr 10 14:13:46 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 10 Apr 2019 19:13:46 +0100
Subject: [scikit-learn] Feature engineering functionality - new package
In-Reply-To: <CAPV6P2wKna8PPRsu35D1uGVrwanBOhJUQOE_u1fYPpEPFZrhEA@mail.gmail.com>
References: <CANDT+DFqs8LApvBCW5TRcsXC32jBAWMmoGf6fG2-yPF7WF-pkA@mail.gmail.com>
 <CANDT+DFM6arejRVu5HF8J0e4AtWaJNE7iE-nrTDkQei3BTsy0w@mail.gmail.com>
 <CAPV6P2wKna8PPRsu35D1uGVrwanBOhJUQOE_u1fYPpEPFZrhEA@mail.gmail.com>
Message-ID: <CANDT+DGRP366WUXVUBTvkP74q=234KceAYstyPOXjOxZBVc++g@mail.gmail.com>

Hi Nicolas,

You are right, I am just checking this in the source code.

Sorry for the confusion and thanks for the quick response

Cheers

Sole

On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nicolas at gmail.com> wrote:

> Hi Sole,
>
> I'm not sure the 2 limitations you mentioned are correct.
> 1) in your example, using the ColumnTransformer you can impute different
> values for different columns.
> 2) the sklearn transformers do learn on the training set and are able to
> perpetuate the values learnt from the train set to unseen data.
>
> Nicolas
>
> On Wed, Apr 10, 2019, 18:25 Sole Galli <solegalli1 at gmail.com> wrote:
>
>> Dear Scikit-Learn team,
>>>
>>> Feature engineering is a big task ahead of building machine learning
>>> models. It involves imputation of missing values, encoding of categorical
>>> variables, discretisation, variable transformation etc.
>>>
>>> Sklearn includes some functionality for feature engineering, which is
>>> useful, but it has a few limitations:
>>>
>>> 1) it does not allow for feature specification - it will do the same
>>> process on all variables, for example SimpleImputer. Typically, we want
>>> to impute different columns with different values.
>>> 2) It does not capture information from the training set, this is it
>>> does not learn, therefore, it is not able to perpetuate the values learnt
>>> from the train set, to unseen data.
>>>
>>> The 2 limitations above apply to all the feature transformers in
>>> sklearn, I believe.
>>>
>>> Therefore, if these transformers are used as part of a pipeline, we
>>> could end up doing different transformations to train and test, depending
>>> on the characteristics of the datasets. For business purposes, this is not
>>> a desired option.
>>>
>>> I think that building transformers that learn from the train set would
>>> be of much use for the community.
>>>
>>> To this end, I built a python package called feature engine
>>> <https://pypi.org/project/feature-engine/> which expands the
>>> sklearn-api with additional feature engineering techniques, and the
>>> functionality that allows the transformer to learn from data and store the
>>> parameters learnt.
>>>
>>> The techniques included have been used worldwide, both in business and
>>> in data competitions, and reported in kdd reports and other articles. I
>>> also cover them in an udemy course
>>> <https://www.udemy.com/feature-engineering-for-machine-learning> which
>>> has enrolled several thousand students.
>>>
>>> The package capitalises on the use of pandas to capture the features,
>>> but I am confident that the columns names could be captured and the df
>>> transformed to a numpy array to comply with sklearn requirements.
>>>
>>> I wondered whether it would be of interest to include the functionality
>>> of this package within sklearn?
>>> If you would consider extending the sklearn api to include these
>>> transformers, I would be happy to help.
>>>
>>> Alternatively, would you consider to add the package to your website?
>>> where you mention the libaries that extend sklearn functionality?
>>>
>>> All feedback is welcome.
>>>
>>> Many thanks and I look forward to hearing from you
>>>
>>> Thank you so much fur such an awesome contribution through the sklearn
>>> api
>>>
>>> Kind regards
>>>
>>> Sole
>>>
>>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190410/2e5c23e5/attachment-0001.html>

From mail at sebastianraschka.com  Wed Apr 10 14:34:16 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 10 Apr 2019 13:34:16 -0500
Subject: [scikit-learn] Predict Method of OneVsRestClassifier
 Integration with Google Cloud ML
In-Reply-To: <CAJn_aE6_jr3vohAT2gYfSP94xRZDu=-oeagsaDWxQEyt3Ya1zQ@mail.gmail.com>
References: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
 <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>
 <CAJn_aE6_jr3vohAT2gYfSP94xRZDu=-oeagsaDWxQEyt3Ya1zQ@mail.gmail.com>
Message-ID: <9B6ADC52-08EB-40A7-BC4D-346F978A43FE@sebastianraschka.com>

Hm, weird that their platform seems to be so picky about it. Have you tried to just make the output of the pipeline dense? I.e., 

(model.predict(X)).toarray()

Best,
Sebastian

> On Apr 10, 2019, at 1:10 PM, Liam Geron <liam at chatdesk.com> wrote:
> 
> Hi Sebastian,
> 
> Thanks for the advice! The model actually works on it's own in python fine luckily, so I don't think that that is the issue exactly. I have tried rolling my own estimator to wrap the pipeline to have it call the predict_proba method to return a dense array, however I then came across the problem that I would have to have that custom estimator defined on the Cloud ML end, which I'm unsure how to do.
> 
> Thanks,
> Liam
> 
> On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> Hi Liam,
> 
> not sure what your exact error message is, but it may also be that the XGBClassifier only accepts dense arrays? I think the TfidfVectorizer returns sparse arrays. You could probably fix your issues by inserting a "DenseTransformer" into your pipelone (a simple class that just transforms an array from a sparse to a dense format). I've implemented sth like that that you can import or copy&paste it from here:
> 
> https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py
> 
> The usage would then basically be
> 
> model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense', DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
> 
> Best,
> Sebastian
> 
> 
> 
> 
> > On Apr 10, 2019, at 12:25 PM, Liam Geron <liam at chatdesk.com> wrote:
> > 
> > Hi all,
> > 
> > I was hoping to get some guidance re: changing the result of the predict method of the OneVsRestClassifier to return a dense array rather than a sparse array, given that Google Cloud ML only accepts dense numpy arrays as a result of a given models predict method. Right now my model architecture looks like:
> > 
> > model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
> > 
> > Which returns a sparse array with the predict method. I saw the Stack Overflow post here: https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
> > 
> > which recommends overwriting the predict method with the predict_proba method, however I found that I can't serialize the model after doing so. I also have a stack overflow post here: https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a which details the specific pickling error.
> > 
> > Is this a known issue? Is there an accepted way to convert this into a dense array?
> > 
> > Thanks,
> > Liam Geron
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From liam at chatdesk.com  Wed Apr 10 15:26:55 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 15:26:55 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier
 Integration with Google Cloud ML
In-Reply-To: <9B6ADC52-08EB-40A7-BC4D-346F978A43FE@sebastianraschka.com>
References: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
 <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>
 <CAJn_aE6_jr3vohAT2gYfSP94xRZDu=-oeagsaDWxQEyt3Ya1zQ@mail.gmail.com>
 <9B6ADC52-08EB-40A7-BC4D-346F978A43FE@sebastianraschka.com>
Message-ID: <CAJn_aE5k4EFujs4sb_0NRkOE-vQp+WEAsG7AKKz91jjA69XVWg@mail.gmail.com>

Unfortunately I don't believe that you get that level of freedom, it's an
API call that automatically calls the model's predict method so I don't
think that I get to specify something like model.predict(X).toarray(). I
could be wrong however, I don't pretend to be an expert on Cloud ML by any
stretch.

Thanks,
Liam

On Wed, Apr 10, 2019 at 3:23 PM Sebastian Raschka <mail at sebastianraschka.com>
wrote:

> Hm, weird that their platform seems to be so picky about it. Have you
> tried to just make the output of the pipeline dense? I.e.,
>
> (model.predict(X)).toarray()
>
> Best,
> Sebastian
>
> > On Apr 10, 2019, at 1:10 PM, Liam Geron <liam at chatdesk.com> wrote:
> >
> > Hi Sebastian,
> >
> > Thanks for the advice! The model actually works on it's own in python
> fine luckily, so I don't think that that is the issue exactly. I have tried
> rolling my own estimator to wrap the pipeline to have it call the
> predict_proba method to return a dense array, however I then came across
> the problem that I would have to have that custom estimator defined on the
> Cloud ML end, which I'm unsure how to do.
> >
> > Thanks,
> > Liam
> >
> > On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > Hi Liam,
> >
> > not sure what your exact error message is, but it may also be that the
> XGBClassifier only accepts dense arrays? I think the TfidfVectorizer
> returns sparse arrays. You could probably fix your issues by inserting a
> "DenseTransformer" into your pipelone (a simple class that just transforms
> an array from a sparse to a dense format). I've implemented sth like that
> that you can import or copy&paste it from here:
> >
> >
> https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py
> >
> > The usage would then basically be
> >
> > model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense',
> DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
> >
> > Best,
> > Sebastian
> >
> >
> >
> >
> > > On Apr 10, 2019, at 12:25 PM, Liam Geron <liam at chatdesk.com> wrote:
> > >
> > > Hi all,
> > >
> > > I was hoping to get some guidance re: changing the result of the
> predict method of the OneVsRestClassifier to return a dense array rather
> than a sparse array, given that Google Cloud ML only accepts dense numpy
> arrays as a result of a given models predict method. Right now my model
> architecture looks like:
> > >
> > > model = Pipeline([('tfidf', TfidfVectorizer()), ('clf',
> OneVsRestClassifier(XGBClassifier()))])
> > >
> > > Which returns a sparse array with the predict method. I saw the Stack
> Overflow post here:
> https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
> > >
> > > which recommends overwriting the predict method with the predict_proba
> method, however I found that I can't serialize the model after doing so. I
> also have a stack overflow post here:
> https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
> which details the specific pickling error.
> > >
> > > Is this a known issue? Is there an accepted way to convert this into a
> dense array?
> > >
> > > Thanks,
> > > Liam Geron
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190410/3680e0ab/attachment.html>

From joel.nothman at gmail.com  Wed Apr 10 23:01:28 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 11 Apr 2019 13:01:28 +1000
Subject: [scikit-learn] Predict Method of OneVsRestClassifier
 Integration with Google Cloud ML
In-Reply-To: <CAJn_aE5k4EFujs4sb_0NRkOE-vQp+WEAsG7AKKz91jjA69XVWg@mail.gmail.com>
References: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
 <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>
 <CAJn_aE6_jr3vohAT2gYfSP94xRZDu=-oeagsaDWxQEyt3Ya1zQ@mail.gmail.com>
 <9B6ADC52-08EB-40A7-BC4D-346F978A43FE@sebastianraschka.com>
 <CAJn_aE5k4EFujs4sb_0NRkOE-vQp+WEAsG7AKKz91jjA69XVWg@mail.gmail.com>
Message-ID: <CAAkaFLUO=RVLVaVEkR0koCfci_-97=BNAwjmd-Rp65xL_Oy57Q@mail.gmail.com>

I think it's a bit weird if we're returning sparse output from
OneVsRestClassifier.predict if it wasn't fit on sparse Y.

Actually, I would be in favour of deprecating multilabel support in
OneVsRestClassifier, since it is performing "binary relevance method" for
multilabel, not actually OvR. MultiOutputClassifier duplicates this
functionality (more or less), outputs a dense array (indeed it doesn't
support sparse Y and perhaps it should) and lives closer to functional
alternatives to binary relevance, such as ClassifierChain.

On Thu, 11 Apr 2019 at 05:32, Liam Geron <liam at chatdesk.com> wrote:

> Unfortunately I don't believe that you get that level of freedom, it's an
> API call that automatically calls the model's predict method so I don't
> think that I get to specify something like model.predict(X).toarray(). I
> could be wrong however, I don't pretend to be an expert on Cloud ML by any
> stretch.
>
> Thanks,
> Liam
>
> On Wed, Apr 10, 2019 at 3:23 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
>
>> Hm, weird that their platform seems to be so picky about it. Have you
>> tried to just make the output of the pipeline dense? I.e.,
>>
>> (model.predict(X)).toarray()
>>
>> Best,
>> Sebastian
>>
>> > On Apr 10, 2019, at 1:10 PM, Liam Geron <liam at chatdesk.com> wrote:
>> >
>> > Hi Sebastian,
>> >
>> > Thanks for the advice! The model actually works on it's own in python
>> fine luckily, so I don't think that that is the issue exactly. I have tried
>> rolling my own estimator to wrap the pipeline to have it call the
>> predict_proba method to return a dense array, however I then came across
>> the problem that I would have to have that custom estimator defined on the
>> Cloud ML end, which I'm unsure how to do.
>> >
>> > Thanks,
>> > Liam
>> >
>> > On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka <
>> mail at sebastianraschka.com> wrote:
>> > Hi Liam,
>> >
>> > not sure what your exact error message is, but it may also be that the
>> XGBClassifier only accepts dense arrays? I think the TfidfVectorizer
>> returns sparse arrays. You could probably fix your issues by inserting a
>> "DenseTransformer" into your pipelone (a simple class that just transforms
>> an array from a sparse to a dense format). I've implemented sth like that
>> that you can import or copy&paste it from here:
>> >
>> >
>> https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py
>> >
>> > The usage would then basically be
>> >
>> > model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense',
>> DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> >
>> >
>> > > On Apr 10, 2019, at 12:25 PM, Liam Geron <liam at chatdesk.com> wrote:
>> > >
>> > > Hi all,
>> > >
>> > > I was hoping to get some guidance re: changing the result of the
>> predict method of the OneVsRestClassifier to return a dense array rather
>> than a sparse array, given that Google Cloud ML only accepts dense numpy
>> arrays as a result of a given models predict method. Right now my model
>> architecture looks like:
>> > >
>> > > model = Pipeline([('tfidf', TfidfVectorizer()), ('clf',
>> OneVsRestClassifier(XGBClassifier()))])
>> > >
>> > > Which returns a sparse array with the predict method. I saw the Stack
>> Overflow post here:
>> https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
>> > >
>> > > which recommends overwriting the predict method with the
>> predict_proba method, however I found that I can't serialize the model
>> after doing so. I also have a stack overflow post here:
>> https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
>> which details the specific pickling error.
>> > >
>> > > Is this a known issue? Is there an accepted way to convert this into
>> a dense array?
>> > >
>> > > Thanks,
>> > > Liam Geron
>> > > _______________________________________________
>> > > scikit-learn mailing list
>> > > scikit-learn at python.org
>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190411/31904711/attachment-0001.html>

From liam at chatdesk.com  Thu Apr 11 13:30:56 2019
From: liam at chatdesk.com (Liam Geron)
Date: Thu, 11 Apr 2019 13:30:56 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier
 Integration with Google Cloud ML
In-Reply-To: <CAAkaFLUO=RVLVaVEkR0koCfci_-97=BNAwjmd-Rp65xL_Oy57Q@mail.gmail.com>
References: <CAJn_aE576XK2MmTXAcgvpCxwpdc5gVVc0SbYSfFLhu+hQNKv8A@mail.gmail.com>
 <FDC8C2EC-5E6E-471B-8771-BA82201BF782@sebastianraschka.com>
 <CAJn_aE6_jr3vohAT2gYfSP94xRZDu=-oeagsaDWxQEyt3Ya1zQ@mail.gmail.com>
 <9B6ADC52-08EB-40A7-BC4D-346F978A43FE@sebastianraschka.com>
 <CAJn_aE5k4EFujs4sb_0NRkOE-vQp+WEAsG7AKKz91jjA69XVWg@mail.gmail.com>
 <CAAkaFLUO=RVLVaVEkR0koCfci_-97=BNAwjmd-Rp65xL_Oy57Q@mail.gmail.com>
Message-ID: <CAJn_aE5y1x-5EQHyObCeQ4sZj=q=FezHPSYBrWXYanBkZkBNGQ@mail.gmail.com>

That's a great tip actually, I was unaware about the MultiOutputClassifier
option. I'll give it a try!

Thanks,
Liam


On Wed, Apr 10, 2019 at 11:03 PM Joel Nothman <joel.nothman at gmail.com>
wrote:

> I think it's a bit weird if we're returning sparse output from
> OneVsRestClassifier.predict if it wasn't fit on sparse Y.
>
> Actually, I would be in favour of deprecating multilabel support in
> OneVsRestClassifier, since it is performing "binary relevance method" for
> multilabel, not actually OvR. MultiOutputClassifier duplicates this
> functionality (more or less), outputs a dense array (indeed it doesn't
> support sparse Y and perhaps it should) and lives closer to functional
> alternatives to binary relevance, such as ClassifierChain.
>
> On Thu, 11 Apr 2019 at 05:32, Liam Geron <liam at chatdesk.com> wrote:
>
>> Unfortunately I don't believe that you get that level of freedom, it's an
>> API call that automatically calls the model's predict method so I don't
>> think that I get to specify something like model.predict(X).toarray(). I
>> could be wrong however, I don't pretend to be an expert on Cloud ML by any
>> stretch.
>>
>> Thanks,
>> Liam
>>
>> On Wed, Apr 10, 2019 at 3:23 PM Sebastian Raschka <
>> mail at sebastianraschka.com> wrote:
>>
>>> Hm, weird that their platform seems to be so picky about it. Have you
>>> tried to just make the output of the pipeline dense? I.e.,
>>>
>>> (model.predict(X)).toarray()
>>>
>>> Best,
>>> Sebastian
>>>
>>> > On Apr 10, 2019, at 1:10 PM, Liam Geron <liam at chatdesk.com> wrote:
>>> >
>>> > Hi Sebastian,
>>> >
>>> > Thanks for the advice! The model actually works on it's own in python
>>> fine luckily, so I don't think that that is the issue exactly. I have tried
>>> rolling my own estimator to wrap the pipeline to have it call the
>>> predict_proba method to return a dense array, however I then came across
>>> the problem that I would have to have that custom estimator defined on the
>>> Cloud ML end, which I'm unsure how to do.
>>> >
>>> > Thanks,
>>> > Liam
>>> >
>>> > On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka <
>>> mail at sebastianraschka.com> wrote:
>>> > Hi Liam,
>>> >
>>> > not sure what your exact error message is, but it may also be that the
>>> XGBClassifier only accepts dense arrays? I think the TfidfVectorizer
>>> returns sparse arrays. You could probably fix your issues by inserting a
>>> "DenseTransformer" into your pipelone (a simple class that just transforms
>>> an array from a sparse to a dense format). I've implemented sth like that
>>> that you can import or copy&paste it from here:
>>> >
>>> >
>>> https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py
>>> >
>>> > The usage would then basically be
>>> >
>>> > model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense',
>>> DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
>>> >
>>> > Best,
>>> > Sebastian
>>> >
>>> >
>>> >
>>> >
>>> > > On Apr 10, 2019, at 12:25 PM, Liam Geron <liam at chatdesk.com> wrote:
>>> > >
>>> > > Hi all,
>>> > >
>>> > > I was hoping to get some guidance re: changing the result of the
>>> predict method of the OneVsRestClassifier to return a dense array rather
>>> than a sparse array, given that Google Cloud ML only accepts dense numpy
>>> arrays as a result of a given models predict method. Right now my model
>>> architecture looks like:
>>> > >
>>> > > model = Pipeline([('tfidf', TfidfVectorizer()), ('clf',
>>> OneVsRestClassifier(XGBClassifier()))])
>>> > >
>>> > > Which returns a sparse array with the predict method. I saw the
>>> Stack Overflow post here:
>>> https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
>>> > >
>>> > > which recommends overwriting the predict method with the
>>> predict_proba method, however I found that I can't serialize the model
>>> after doing so. I also have a stack overflow post here:
>>> https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
>>> which details the specific pickling error.
>>> > >
>>> > > Is this a known issue? Is there an accepted way to convert this into
>>> a dense array?
>>> > >
>>> > > Thanks,
>>> > > Liam Geron
>>> > > _______________________________________________
>>> > > scikit-learn mailing list
>>> > > scikit-learn at python.org
>>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>>> >
>>> > _______________________________________________
>>> > scikit-learn mailing list
>>> > scikit-learn at python.org
>>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>> > _______________________________________________
>>> > scikit-learn mailing list
>>> > scikit-learn at python.org
>>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190411/8a48e1a3/attachment.html>

From t3kcit at gmail.com  Mon Apr 15 10:55:11 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 15 Apr 2019 10:55:11 -0400
Subject: [scikit-learn] Feature engineering functionality - new package
In-Reply-To: <CANDT+DGRP366WUXVUBTvkP74q=234KceAYstyPOXjOxZBVc++g@mail.gmail.com>
References: <CANDT+DFqs8LApvBCW5TRcsXC32jBAWMmoGf6fG2-yPF7WF-pkA@mail.gmail.com>
 <CANDT+DFM6arejRVu5HF8J0e4AtWaJNE7iE-nrTDkQei3BTsy0w@mail.gmail.com>
 <CAPV6P2wKna8PPRsu35D1uGVrwanBOhJUQOE_u1fYPpEPFZrhEA@mail.gmail.com>
 <CANDT+DGRP366WUXVUBTvkP74q=234KceAYstyPOXjOxZBVc++g@mail.gmail.com>
Message-ID: <6d06420d-a0f7-a374-ee90-c73af5219e35@gmail.com>

1) was indeed a design decision. Your design is certainly an alternative 
design, that might be more convenient in some situations,
but requires adding this feature to all transformers, which basically 
just adds a bunch of boilerplate code everywhere.
So you could argue our design decision was more driven by ease of 
maintenance than ease of use.

There might be some transformers in your package that we could add to 
scikit-learn in some form, but several are already available,
SimpleImputer implements MedianMeanImputer, CategoricalVariableImputer 
and FrequentCategoryImputer
We don't currently have RandomSampleImputer and EndTailImputer, I think. 
AddNaNBinaryImputer is "MissingIndicator" in sklearn.

OneHotCategoricalEncoder and OrdinalEncoder exist, 
CountFrequencyCategoricalEncoder and MeanCategoriclaEncoder are in the 
works,
though there are some arguments about the details. These are also in the 
categorical-encoding package:
http://contrib.scikit-learn.org/categorical-encoding/

RareLabelCategoricalEncoder is something I definitely want in 
OneHotEncoder, not sure if there's a PR yet.

Do you have examples of WoERatioCategoricalEncoder or Windsorizer or any 
of the discretizers actually working well in practice?
I have not seen them used much, they seemed to be popular in Weka, though.

BoxCoxTransformer is implemented in PowerTransformer, and 
LogTransformer, ReciprocalTransformer and ExponentialTransformer can be
implemented as FunctionTransformer(np.log), FunctionTransformer(lambda 
x: 1/x) and FunctionTransformer(lambda x: x ** exp) I believe.

It might be interesting to add your package to scikit-learn-contrib:
https://github.com/scikit-learn-contrib

We are struggling a bit with how to best organize that, though.

Cheers,
Andy


On 4/10/19 2:13 PM, Sole Galli wrote:
> Hi Nicolas,
>
> You are right, I am just checking this in the source code.
>
> Sorry?for the confusion and thanks for the quick?response
>
> Cheers
>
> Sole
>
> On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nicolas at gmail.com 
> <mailto:goix.nicolas at gmail.com>> wrote:
>
>     Hi Sole,
>
>     I'm not sure the 2 limitations you mentioned are correct.
>     1) in your example, using the ColumnTransformer you can impute
>     different values for different columns.
>     2) the sklearn transformers do learn on the training set and are
>     able to perpetuate the values learnt from the train set to unseen
>     data.
>
>     Nicolas
>
>     On Wed, Apr 10, 2019, 18:25 Sole Galli <solegalli1 at gmail.com
>     <mailto:solegalli1 at gmail.com>> wrote:
>
>             Dear Scikit-Learn team,
>
>             Feature engineering is a big task ahead of building
>             machine learning models. It involves imputation of missing
>             values, encoding of categorical variables, discretisation,
>             variable transformation etc.
>
>             Sklearn includes some functionality for feature
>             engineering, which is useful, but it has a few limitations:
>
>             1) it does not allow for feature specification - it will
>             do the same process on all variables, for example
>             SimpleImputer. Typically, we want to impute different
>             columns with different values.
>             2) It does not capture information from the training set,
>             this is it does not learn, therefore, it is not able to
>             perpetuate the values learnt from the train set, to unseen
>             data.
>
>             The 2 limitations above apply to all the feature
>             transformers in sklearn, I believe.
>
>             Therefore, if these transformers are used as part of a
>             pipeline, we could end up doing different transformations
>             to train and test, depending on the characteristics of the
>             datasets. For business purposes, this is not a desired option.
>
>             I think that building transformers that learn from the
>             train set would be of much use for the community.
>
>             To this end, I built a python package called feature
>             engine <https://pypi.org/project/feature-engine/>?which
>             expands the sklearn-api with additional feature
>             engineering techniques, and the functionality that allows
>             the transformer to learn from data and store the
>             parameters learnt.
>
>             The techniques included have been used worldwide, both in
>             business and in data competitions, and reported in kdd
>             reports and other articles. I also cover them in an udemy
>             course
>             <https://www.udemy.com/feature-engineering-for-machine-learning>
>             which has enrolled several thousand students.
>
>             The package capitalises on the use of pandas to capture
>             the features, but I am confident that the columns names
>             could be captured and the df transformed to a numpy array
>             to comply with sklearn requirements.
>
>             I wondered whether it would be of interest to include the
>             functionality of this package within sklearn?
>             If you would consider extending the sklearn api to include
>             these transformers, I would be happy to help.
>
>             Alternatively, would you consider to add the package to
>             your website? where you mention the libaries that extend
>             sklearn functionality?
>
>             All feedback is welcome.
>
>             Many thanks and I look forward to hearing from you
>
>             Thank you so much fur such an awesome contribution through
>             the sklearn api
>
>             Kind regards
>
>             Sole
>
>         _______________________________________________
>         scikit-learn mailing list
>         scikit-learn at python.org <mailto:scikit-learn at python.org>
>         https://mail.python.org/mailman/listinfo/scikit-learn
>
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org <mailto:scikit-learn at python.org>
>     https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190415/460659db/attachment.html>

From ian at ianozsvald.com  Wed Apr 17 10:59:42 2019
From: ian at ianozsvald.com (Ian Ozsvald)
Date: Wed, 17 Apr 2019 15:59:42 +0100
Subject: [scikit-learn] PyDataLondon 2019 (July 12-14) Call for Proposals
 closing this Friday
Message-ID: <CAPvwANAObVdEinto7Bg7aU+uHLVnz-jsOmE=488Ew-z_JLiq-A@mail.gmail.com>

On July 12-14 we host the sixth PyDataLondon conference in central
London. As last year we'll be hosted close to Tower Bridge at the
Tower Hotel with 700 attendees over 3 days:
https://pydata.org/london2019/

Our Call for Proposals has been open for several weeks, it closes this
Friday. If anyone here would like to spread the good word about
scikit-learn (and any scikit/scipy/Python data science related topics)
we'd love to see a proposal. We also offer first-time speaker
mentoring, it is a bit late for this now so I'll offer to answer any
questions anyone has personally - just email me directly.

The Call for Proposals closes this Friday, please submit your talk
here: https://pydata.org/london2019/cfp

If you've not been to PyDataLondon before - here's last year's
schedule and my write-up of all of the events that we covered. Gael
Varoquaux and others spoke for us, we'd love to see scikit-learn well
represented again:
https://pydata.org/london2018/schedule/
https://ianozsvald.com/2018/04/30/pydatalondon-2018-and-creating-correct-and-capable-classifiers/

Regards, Ian (PyDataLondon co-founder)

-- 
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
ian at IanOzsvald.com

https://IanOzsvald.com
https://MorConsulting.com
https://twitter.com/IanOzsvald

From vaggi.federico at gmail.com  Fri Apr 19 12:52:51 2019
From: vaggi.federico at gmail.com (federico vaggi)
Date: Fri, 19 Apr 2019 09:52:51 -0700
Subject: [scikit-learn] Categorical Encoding of high cardinality variables
Message-ID: <CAGvd0=hWVvnfKNewTQLC1bzLr6QHXp0xuz173Vc=MutQcdsSmQ@mail.gmail.com>

Hi everyone,

I wanted to use the scikit-learn transformer API to clean up some messy
data as input to a neural network.  One of the steps involves converting
categorical variables (of very high cardinality) into integers for use in
an embedding layer.

Unfortunately, I cannot quite use LabelEncoder to do solve this.  When
dealing with categorical variables with very high cardinality, I found it
useful in practice to have a threshold value for the frequency under which
a variable ends up with the 'unk' or 'rare' label.  This same label would
also end up applied at test time to entries that were not observed in the
train set.

This is relatively straightforward to add to the existing label encoder
code, but it breaks the contract slightly: if we encode some variables with
a 'rare' label, then the transform operation is no longer a bijection.

Is this feature too niche for the main sklearn?  I saw there was a package (
https://feature-engine.readthedocs.io/en/latest/RareLabelCategoricalEncoder.html)
that implemented a similar feature discussed in the mailing list.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190419/864826f0/attachment.html>

From mlcnworkshop at gmail.com  Tue Apr 23 04:05:43 2019
From: mlcnworkshop at gmail.com (MLCN Workshop)
Date: Tue, 23 Apr 2019 10:05:43 +0200
Subject: [scikit-learn] The 2nd International Workshop on Machine Learning
 in Clinical Neuroimaging (MLCN 2019): ENTERING THE ERA OF BIG DATA VIA
 TRANSFER LEARNING AND DATA HARMONIZATION
Message-ID: <CAN9Uh4diOEu4GDr2dyw8Q1gAcvRJkgMDHAFjRuX2_w2r-v_LSg@mail.gmail.com>

Dear Colleagues,We are delighted to invite you to join us for the MLCN 2019
<https://mlcnws.com/> workshop as a satellite event of the MICCAI 2019
conference, Shenzhen, China. Call for Papers

Recent advances in neuroimaging and machine learning provide an exceptional
opportunity for investigators and physicians to discover complex
relationships between brain, behaviors, and mental and neurological
disorders. The MLCN 2019 workshop (https://mlcnws.com), as a satellite
event of MICCAI 2019 (https://www.miccai2019.org), aims to bring together
researchers in both theory and application from various fields in domains
such as *e.g.* machine learning, neuroimaging, predictive clinical
neuroscience, *etc.* Topics of interests include, but are not limited to:

   - Transfer learning in clinical neuroimaging
   - Model stability in transfer learning
   - Data prerequisites for successful transfer learning
   - Domain adaptation in neuroimaging
   - Data harmonization across sites
   - Data pooling ? practical issues
   - Cross-domain learning in neuroimaging
   - Interpretability for transfer learning
   - Unsupervised methods for domain adaptation
   - Multi-site data analysis, from preprocessing to modeling
   - Big data in clinical neuroimaging
   - Scalable machine learning methods
   - Benefits, problems, and solutions of working with very large datasets

SUBMISSION PROCESS:

The workshop seeks high quality, original, and unpublished work on
algorithms, theory, and applications of machine learning in clinical
neuroimaging related to big data, transfer learning, and data
harmonization. Papers should be submitted electronically in Springer
Lecture Notes in Computer Science (LCNS) style (
https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines)
with up to 8-pages and using the CMT system at
https://cmt3.research.microsoft.com/MLCN2019. The MLCN workshop uses a
double-blind review process in the evaluation phase, thus authors must
ensure anonymous submissions. Accepted papers will be published in a joint
proceeding with the MICCAI 2019 conference.
IMPORTANT DATES:

   -

   Paper submission deadline: July 1, 2019 (23:59 PST)
   -

   Notification of Acceptance: August 5, 2019
   -

   Camera-ready Submission: August 12, 2019
   -

   Workshop Date: October 13, 2019


Best regards,

MLCN 2019 Organizing Committee <https://mlcnws.com/organization/>,

Email: mlcnworkshop at gmail.com

Website: https://mlcnws.com/

twitter: @MLCNworkshop <https://twitter.com/MLCNworkshop>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190423/07843c29/attachment.html>

From solegalli1 at gmail.com  Tue Apr 23 20:00:15 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 24 Apr 2019 01:00:15 +0100
Subject: [scikit-learn] Categorical Encoding of high cardinality
 variables
In-Reply-To: <CAGvd0=hWVvnfKNewTQLC1bzLr6QHXp0xuz173Vc=MutQcdsSmQ@mail.gmail.com>
References: <CAGvd0=hWVvnfKNewTQLC1bzLr6QHXp0xuz173Vc=MutQcdsSmQ@mail.gmail.com>
Message-ID: <CANDT+DEwBNa+pkrPerxrQ35r5CDPTC+Kev_Y5G3dA2v9zzPppg@mail.gmail.com>

Hello everyone,

I am Sole, I started the conversation on feature engine
<https://feature-engine.readthedocs.io>, a package I created for feature
engineering.

Regarding the grouping of *rare /  infrequent* categories into an umbrella
term like "Rare", "Other", etc, which Federico raised recently, I would
like to provide some literature at the end of this email, that quotes the
use of this procedure. These are a series of articles by the best solutions
to the 2009 KDD annual competition, which were compiled into one "book
<http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf>", and I am sure
you are aware of it already. I would also like to highlight that this is
extremely common practice in the industry, not only to avoid overfitting,
but also to handle unseen categories when models are deployed. It would be
great to see this functionality added to both the OrdinalEncoder and the
OneHotEncoder, with triggers on the representation of the label in the
dataset (eg. percentage)

Pointing to the main quotes from these articles
<http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf>:

Page 4 of the summary and introductory article:
"For categorical variables, grouping of under-represented categories proved
to be useful  to avoid overfitting. The winners of the fast and the slow
track used similar strategies consisting in retaining the most populated
categories and coarsely grouping the others in  an unsupervised way"

Page 23:
"Most of the learning algorithms we were planning to use do not handle
categorical variables,  so we needed to recode them. This was done in a
standard way, by generating indicator vari-  ables for the different values
a categorical attribute could take. The only slightly non-standard
decision was to limit ourselves to encoding only the 10 most common values
of each categorical  attribute, rather than all the values, in order to
avoid an explosion in the number of features from  variables with a huge
vocabulary"

Page 36:
"We consolidate the extremely low populated entries  (having fewer than 200
examples) with their neighbors to smooth out the outliers. Similarly, we
group some categorical variables which have a large number of entries (  >
1000 distinct values)  into 100 categories."

See bulletpoints in Page 47

I hope you find these useful.

Let me know if / how I can help.

Regards

Sole


On Fri, 19 Apr 2019 at 17:54, federico vaggi <vaggi.federico at gmail.com>
wrote:

> Hi everyone,
>
> I wanted to use the scikit-learn transformer API to clean up some messy
> data as input to a neural network.  One of the steps involves converting
> categorical variables (of very high cardinality) into integers for use in
> an embedding layer.
>
> Unfortunately, I cannot quite use LabelEncoder to do solve this.  When
> dealing with categorical variables with very high cardinality, I found it
> useful in practice to have a threshold value for the frequency under which
> a variable ends up with the 'unk' or 'rare' label.  This same label would
> also end up applied at test time to entries that were not observed in the
> train set.
>
> This is relatively straightforward to add to the existing label encoder
> code, but it breaks the contract slightly: if we encode some variables with
> a 'rare' label, then the transform operation is no longer a bijection.
>
> Is this feature too niche for the main sklearn?  I saw there was a package
> (
> https://feature-engine.readthedocs.io/en/latest/RareLabelCategoricalEncoder.html)
> that implemented a similar feature discussed in the mailing list.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190424/7234fdd4/attachment.html>

From solegalli1 at gmail.com  Tue Apr 23 21:36:16 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 24 Apr 2019 02:36:16 +0100
Subject: [scikit-learn] Feature engineering functionality - new package
In-Reply-To: <6d06420d-a0f7-a374-ee90-c73af5219e35@gmail.com>
References: <CANDT+DFqs8LApvBCW5TRcsXC32jBAWMmoGf6fG2-yPF7WF-pkA@mail.gmail.com>
 <CANDT+DFM6arejRVu5HF8J0e4AtWaJNE7iE-nrTDkQei3BTsy0w@mail.gmail.com>
 <CAPV6P2wKna8PPRsu35D1uGVrwanBOhJUQOE_u1fYPpEPFZrhEA@mail.gmail.com>
 <CANDT+DGRP366WUXVUBTvkP74q=234KceAYstyPOXjOxZBVc++g@mail.gmail.com>
 <6d06420d-a0f7-a374-ee90-c73af5219e35@gmail.com>
Message-ID: <CANDT+DE1MbomQguQ4ABrQ-JtmoFi_tBvtVmJo68sEw6ADGkYKw@mail.gmail.com>

Hi Andreas and team,

Thank you very much for your reply. This was very helpful. Happy to hear
that functionality similar to CountFrequencyCategoricalEncoder,
MeanCategoriclaEncoder and RareLabelCategoricalEncoder are in the agenda.
The last functionality, grouping of rare labels, would be useful for both
the OneHotEncoder and OrdinalEncoder, as per a previous thread.

-------------------------
Re: your questions:

Examples of various discretisers can be found in the winner solutions of
the KDD 2009 annual competition article
<http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf>s: See for example:

   - Bulletpoints in page 26, which include use of decision trees to create
   bins.
   - Summary of employed methods in page 14: "Discretization  was the
   second most used preprocessing. Its usefulness for this particular dataset
   is justified by the non-normality of the distribution of the variables and
   the existence of extreme  values. The simple bining used by the winners of
   the slow track proved to be efficient. "
   - A peculiar binning described in 2.2 in page 36
   - I also use discretisers at work, inspired on the KDD articles, see for
   example my blog at the peer-to-peer company
   <https://blog.zopa.com/2017/07/20/tips-honing-logistic-regression-models/>,
   which I would argue attest to successful implementation:p
   - Equal width and equal frequency discretisers are discussed in this
   master thesis
   <https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf>
   .

Windsorisation, or top coding: we these use all the time in the industry,
usually capping at arbitrary values. Windsorisation using mean and std or
quantiles is a way of automating the capping. In theory it would boost
performance of linear models. Have tried that myself in a couple of toy
datasets from Kaggle. I don't have a good article to point you to at the
moment. There are a few that discuss topcoding, and also the effect of
outiers on NN, but not too sure how widely accepted they are.

On WoE, I understand is common practice in finance. Haven't used it at
work. Have used it in toy datasets, behaves more or less the same than
target mean encoding. Although the purpose of WoE goes beyond than
improving performance, it is also a way of "standarising" the variables and
making them understandable. See for example this summary.
<http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview>

I know that sklearn likes to include algorithms widely accepted, ideally
from multi-quoted articles. So for winsorisation and WoE I am not quite
answering your questions I guess. I will keep an eye in case something new
comes up.

------------------
Re: sharing feature-engine in sklearn contrib.

I would really appreciate if you could do that. I am planning to expand the
package with other feature engineering techniques, which I think will be
useful for the community. In particular, until ColumnTransformer becomes
widely adopted and the other transformers developed. Would be great if it
could be shared in the contrib page
<https://github.com/scikit-learn-contrib> and also int the related projects
<https://scikit-learn.org/stable/related_projects.html> page.

----------------
Re: the categorical encoding package

I am aware that it exists. Haven't tried it myself. When we presented it to
the company, the main criticism was that most of the encoders distort the
variables so much that they lose all possible human interpretation of them.
So, the business prefers not to use these types of encoding. Which, I think
I kind of agree.

Thanks again for your time. Let me know if / how I can help and if you
would be happy to include feature engine in the contrib page.

Have a good rest of week

Sole


On Mon, 15 Apr 2019 at 15:56, Andreas Mueller <t3kcit at gmail.com> wrote:

> 1) was indeed a design decision. Your design is certainly an alternative
> design, that might be more convenient in some situations,
> but requires adding this feature to all transformers, which basically just
> adds a bunch of boilerplate code everywhere.
> So you could argue our design decision was more driven by ease of
> maintenance than ease of use.
>
> There might be some transformers in your package that we could add to
> scikit-learn in some form, but several are already available,
> SimpleImputer implements MedianMeanImputer, CategoricalVariableImputer and
> FrequentCategoryImputer
> We don't currently have RandomSampleImputer and EndTailImputer, I think.
> AddNaNBinaryImputer is "MissingIndicator" in sklearn.
>
> OneHotCategoricalEncoder and OrdinalEncoder exist,
> CountFrequencyCategoricalEncoder and MeanCategoriclaEncoder are in the
> works,
> though there are some arguments about the details. These are also in the
> categorical-encoding package:
> http://contrib.scikit-learn.org/categorical-encoding/
>
> RareLabelCategoricalEncoder is something I definitely want in
> OneHotEncoder, not sure if there's a PR yet.
>
> Do you have examples of WoERatioCategoricalEncoder or Windsorizer or any
> of the discretizers actually working well in practice?
> I have not seen them used much, they seemed to be popular in Weka, though.
>
> BoxCoxTransformer is implemented in PowerTransformer, and LogTransformer,
> ReciprocalTransformer and ExponentialTransformer can be
> implemented as FunctionTransformer(np.log), FunctionTransformer(lambda x:
> 1/x) and FunctionTransformer(lambda x: x ** exp) I believe.
>
> It might be interesting to add your package to scikit-learn-contrib:
> https://github.com/scikit-learn-contrib
>
> We are struggling a bit with how to best organize that, though.
>
> Cheers,
> Andy
>
>
> On 4/10/19 2:13 PM, Sole Galli wrote:
>
> Hi Nicolas,
>
> You are right, I am just checking this in the source code.
>
> Sorry for the confusion and thanks for the quick response
>
> Cheers
>
> Sole
>
> On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nicolas at gmail.com> wrote:
>
>> Hi Sole,
>>
>> I'm not sure the 2 limitations you mentioned are correct.
>> 1) in your example, using the ColumnTransformer you can impute different
>> values for different columns.
>> 2) the sklearn transformers do learn on the training set and are able to
>> perpetuate the values learnt from the train set to unseen data.
>>
>> Nicolas
>>
>> On Wed, Apr 10, 2019, 18:25 Sole Galli <solegalli1 at gmail.com> wrote:
>>
>>> Dear Scikit-Learn team,
>>>>
>>>> Feature engineering is a big task ahead of building machine learning
>>>> models. It involves imputation of missing values, encoding of categorical
>>>> variables, discretisation, variable transformation etc.
>>>>
>>>> Sklearn includes some functionality for feature engineering, which is
>>>> useful, but it has a few limitations:
>>>>
>>>> 1) it does not allow for feature specification - it will do the same
>>>> process on all variables, for example SimpleImputer. Typically, we
>>>> want to impute different columns with different values.
>>>> 2) It does not capture information from the training set, this is it
>>>> does not learn, therefore, it is not able to perpetuate the values learnt
>>>> from the train set, to unseen data.
>>>>
>>>> The 2 limitations above apply to all the feature transformers in
>>>> sklearn, I believe.
>>>>
>>>> Therefore, if these transformers are used as part of a pipeline, we
>>>> could end up doing different transformations to train and test, depending
>>>> on the characteristics of the datasets. For business purposes, this is not
>>>> a desired option.
>>>>
>>>> I think that building transformers that learn from the train set would
>>>> be of much use for the community.
>>>>
>>>> To this end, I built a python package called feature engine
>>>> <https://pypi.org/project/feature-engine/> which expands the
>>>> sklearn-api with additional feature engineering techniques, and the
>>>> functionality that allows the transformer to learn from data and store the
>>>> parameters learnt.
>>>>
>>>> The techniques included have been used worldwide, both in business and
>>>> in data competitions, and reported in kdd reports and other articles. I
>>>> also cover them in an udemy course
>>>> <https://www.udemy.com/feature-engineering-for-machine-learning> which
>>>> has enrolled several thousand students.
>>>>
>>>> The package capitalises on the use of pandas to capture the features,
>>>> but I am confident that the columns names could be captured and the df
>>>> transformed to a numpy array to comply with sklearn requirements.
>>>>
>>>> I wondered whether it would be of interest to include the functionality
>>>> of this package within sklearn?
>>>> If you would consider extending the sklearn api to include these
>>>> transformers, I would be happy to help.
>>>>
>>>> Alternatively, would you consider to add the package to your website?
>>>> where you mention the libaries that extend sklearn functionality?
>>>>
>>>> All feedback is welcome.
>>>>
>>>> Many thanks and I look forward to hearing from you
>>>>
>>>> Thank you so much fur such an awesome contribution through the sklearn
>>>> api
>>>>
>>>> Kind regards
>>>>
>>>> Sole
>>>>
>>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190424/56ea6ab6/attachment-0001.html>

From nelle.varoquaux at gmail.com  Fri Apr 26 17:39:06 2019
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Fri, 26 Apr 2019 14:39:06 -0700
Subject: [scikit-learn] 2019 John Hunter Excellence in Plotting Contest
 Reminder
Message-ID: <CAE-UAvRUnQbkqOw_soKzn6_adiLjFnOOFgqtBEDrT1pS+XKA7w@mail.gmail.com>

Hi everybody,


My apologies to those of you getting this on multiple lists.


In memory of John Hunter, we are pleased to be announce the SciPy John
Hunter Excellence in Plotting Competition for 2019. This open competition
aims to highlight the importance of data visualization to scientific
progress and showcase the capabilities of open source software.

Participants are invited to submit scientific plots to be judged by a
panel. The winning entries will be announced and displayed at the
conference.

John Hunter?s family and NumFocus are graciously sponsoring cash prizes for
the winners in the following amounts:


   -

   1st prize: $1000
   -

   2nd prize: $750
   -

   3rd prize: $500


   -

   Entries must be submitted by June, 8th to the form at
   https://goo.gl/forms/cFTB3FUBrMPfQ7Vz1
   -

   Winners will be announced at Scipy 2019 in Austin, TX.
   -

   Participants do not need to attend the Scipy conference.
   -

   Entries may take the definition of ?visualization? rather broadly.
   Entries may be, for example, a traditional printed plot, an interactive
   visualization for the web, or an animation.
   -

   Source code for the plot must be provided, in the form of Python code
   and/or a Jupyter notebook, along with a rendering of the plot in a widely
   used format.  This may be, for example, PDF for print, standalone HTML and
   Javascript for an interactive plot, or MPEG-4 for a video. If the original
   data can not be shared for reasons of size or licensing, "fake" data may be
   substituted, along with an image of the plot using real data.
   -

   Each entry must include a 300-500 word abstract describing the plot and
   its importance for a general scientific audience.
   -

   Entries will be judged on their clarity, innovation and aesthetics, but
   most importantly for their effectiveness in communicating a real-world
   problem. Entrants are encouraged to submit plots that were used during the
   course of research or work, rather than merely being hypothetical.
   -

   SciPy reserves the right to display any and all entries, whether
   prize-winning or not, at the conference, use in any materials or on its
   website, with attribution to the original author(s).


SciPy John Hunter Excellence in Plotting Competition Co-Chairs

Hannah Aizenman

Thomas Caswell

Madicken Munk

Nelle Varoquaux
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190426/969f4a41/attachment.html>

From pahome.chen at mirlab.org  Tue Apr 30 04:48:09 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 30 Apr 2019 16:48:09 +0800
Subject: [scikit-learn] Any other clustering algo cluster incrementally?
Message-ID: <CAB3eZft-_hQYmXTtn457PpzRVWYX6=U2HzkvVyA=DKMd_RTSEw@mail.gmail.com>

I read this :  https://scikit-learn.org/0.15/modules/scaling_strategies.html

There's only one clustering algo cluster incrementally, that is minibatch
kmeans.

Is there any clustering algo can reach this? On github is okay. thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190430/a93ec84a/attachment.html>

From gael.varoquaux at normalesup.org  Tue Apr 30 12:38:24 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 30 Apr 2019 18:38:24 +0200
Subject: [scikit-learn] Any other clustering algo cluster incrementally?
In-Reply-To: <CAB3eZft-_hQYmXTtn457PpzRVWYX6=U2HzkvVyA=DKMd_RTSEw@mail.gmail.com>
References: <CAB3eZft-_hQYmXTtn457PpzRVWYX6=U2HzkvVyA=DKMd_RTSEw@mail.gmail.com>
Message-ID: <20190430163824.nkn6adhv6gz5ahqa@phare.normalesup.org>

On Tue, Apr 30, 2019 at 04:48:09PM +0800, lampahome wrote:
> I read this :? https://scikit-learn.org/0.15/modules/scaling_strategies.html

> There's only one clustering algo cluster incrementally, that is minibatch
> kmeans.

The documentation that you are pointing to refers to version 0.15. If you
look at the current page on scaling, you will see that there is another
clustering algorithm that works incrementally:
https://scikit-learn.org/stable/modules/computing.html#strategies-to-scale-computationally-bigger-data

Best,

Ga?l

From joel.nothman at gmail.com  Tue Apr 30 17:23:06 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 1 May 2019 07:23:06 +1000
Subject: [scikit-learn] Any other clustering algo cluster incrementally?
In-Reply-To: <20190430163824.nkn6adhv6gz5ahqa@phare.normalesup.org>
References: <CAB3eZft-_hQYmXTtn457PpzRVWYX6=U2HzkvVyA=DKMd_RTSEw@mail.gmail.com>
 <20190430163824.nkn6adhv6gz5ahqa@phare.normalesup.org>
Message-ID: <CAAkaFLWWf-Ar6vXba4azNPM6Nx8M1S9XG8sCZGqe5K6YbmHc6Q@mail.gmail.com>

I think it would be possible to implement an incremental extension to
dbscan. But it's been years since I looked at what is involved and it might
require storing the training data, unlike those out of core methods.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190501/84a7326c/attachment.html>

From joel.nothman at gmail.com  Tue Apr 30 22:09:55 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 1 May 2019 12:09:55 +1000
Subject: [scikit-learn] Release Candidate for Scikit-learn 0.21
Message-ID: <CAAkaFLUVdoRngX6VAsG7w_i-kCY_1cdZpihcrSq=2TE9XuSqAg@mail.gmail.com>

PyPI now has source and binary releases for Scikit-learn 0.21rc2.

* Documentation at https://scikit-learn.org/0.21
* Release Notes at https://scikit-learn.org/0.21/whats_new
* Download source or wheels at
https://pypi.org/project/scikit-learn/0.21rc2/

Please try out the software and help us edit the release notes before a
final release.

Highlights include:
* neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
which learns a weighted euclidean distance for k-nearest neighbors.
https://scikit-learn.org/0.21/modules/neighbors.html#nca
* ensemble.HistGradientBoostingClassifier
and ensemble.HistGradientBoostingRegressor: experimental implementations of
efficient binned gradient boosting machines.
https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
* impute.IterativeImputer: a non-trivial approach to missing value
imputation.
https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
* cluster.OPTICS: a new density-based clustering algorithm.
https://scikit-learn.org/0.21/modules/clustering.html#optics
* better printing of estimators as strings, with an option to hide default
parameters for compactness:
https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
* for estimator and library developers: a way to tag your estimator so that
it can be treated appropriately with check_estimator.
https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags

There are many other enhancements and fixes listed in the release notes (
https://scikit-learn.org/0.21/whats_new).

Please note that Scikit-learn has new dependencies:
* joblib >= 0.11, which used to be vendored within Scikit-learn
* OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 when the code
is compiled (and cythonized)

Happy Learning!

>From the Scikit-learn core dev team.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20190501/0ba98f94/attachment.html>