From qinhanmin2005 at sina.com Tue Apr 2 10:36:03 2019
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Tue, 02 Apr 2019 22:36:03 +0800
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
Message-ID: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
See https://github.com/scikit-learn/scikit-learn/issues/13448
We've introduced several plotting functions (e.g., plot_tree and plot_partial_dependence) and will introduce more (e.g., plot_decision_boundary) in the future. Consequently, we need to decide where to put these functions. Currently, there're 3 proposals:
(1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
(2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
(3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note that we won't support from sklearn.XXX import plot_YYY)
Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list to invite opinions.
Thanks
Hanmin Qin
From martin.watzenboeck at gmail.com Tue Apr 2 14:57:51 2019
From: martin.watzenboeck at gmail.com (Martin Watzenboeck)
Date: Tue, 2 Apr 2019 20:57:51 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative correlation
with observed values on random data
Message-ID:
Hello,
I tried to apply LASSO regression in combination with LeaveOneOut CV on my
data, and observed a significant negative correlation between predicted and
observed response values. I tried to replicate the problem using random
data (please see code below).
Anyone have an idea what I am doing wrong? I would very much like to use
LASSO regression on my data. Thanks a lot!
Cheers,
Martin
#Lasso example
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut
from scipy.stats import pearsonr
import numpy as np
n_samples = 500
n_features = 30
#create random features
rng = np.random.RandomState(seed=42)
X = rng.randn(n_samples * n_features).reshape(n_samples, n_features)
#Create Ys
Y = rng.randn(n_samples)
#instantiate regressor and cv object
cv = LeaveOneOut()
reg = Lasso(random_state = 42)
#create arrays to save predicted (and observed) Y values
pred = np.array([])
obs = np.array([])
#run cross validation
for train, test in cv.split(X, Y):
    # fit regressor
    reg.fit(X[train], Y[train])
    # append predicted and observed values to the arrays
    pred = np.r_[pred, reg.predict(X[test])]
    obs = np.r_[obs, Y[test]]
# test correlation
pearsonr(pred, obs)
From alexandre.gramfort at inria.fr Tue Apr 2 15:33:02 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Tue, 2 Apr 2019 21:33:02 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative
correlation with observed values on random data
In-Reply-To:
References:
Message-ID:
in your example with random data Lasso leads to coef_ of zeros so you get
as prediction : np.mean(Y[train])
you'll see the same phenomenon if you do:
pred = np.r_[pred, np.mean(Y[train])]
Alex
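Alex's point can be checked without fitting anything: under leave-one-out, predicting the training mean makes each prediction a decreasing affine function of the held-out value, so the correlation is exactly -1. A minimal sketch (sample size chosen to match Martin's example):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(42)
Y = rng.randn(500)
n = len(Y)

# Leave-one-out "prediction": the mean of the other n-1 samples.
# pred_i = (sum(Y) - Y_i) / (n - 1) decreases in Y_i, so pred and Y
# are perfectly anticorrelated.
pred = np.array([np.mean(np.delete(Y, i)) for i in range(n)])

r, _ = pearsonr(pred, Y)
print(r)  # close to -1.0
```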
From jbbrown at kuhp.kyoto-u.ac.jp Tue Apr 2 22:44:22 2019
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Wed, 3 Apr 2019 11:44:22 +0900
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
As a user, I feel that (2) "sklearn.plot.XXX.plot_YYY" best allows for
future expansion of sub-namespaces in a tractable way that is also easy to
understand during code review.
For example, sklearn.plot.tree.plot_forest() or sklearn.plot.lasso.plot_* .
Just my opinion.
J.B.
On Tue, Apr 2, 2019 at 23:40, Hanmin Qin wrote:
> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From pahome.chen at mirlab.org Wed Apr 3 05:07:08 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 3 Apr 2019 17:07:08 +0800
Subject: [scikit-learn] Can cluster help me to cluster data with length of
continuous series?
Message-ID:
I have data which contains the access history of each item.
EX: t0~t4 are the access time slots. 1 means the item was accessed in that
time slot, 0 means it was not.
ID,t0,t1,t2,t3,t4
0,1,0,0,1
1,1,0,0,1
2,0,0,1,1
3,0,1,1,1
What I want to cluster on is the length of the continuous access duration.
Ex:
ID=3 > 2 > 1 = 0
Is there any distance metric that can help with clustering based on the
length of continuous duration?
From ahowe42 at gmail.com Wed Apr 3 05:52:18 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Wed, 3 Apr 2019 10:52:18 +0100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
My preference would be for (1). I don't think the sub-namespace in (2) is
necessary, and don't like (3), as I would prefer the plotting functions to
be all in the same namespace sklearn.plot.
Andrew
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile
ResearchGate Profile
Open Researcher and Contributor ID (ORCID)
Github Profile
Personal Website
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From trev.stephens at gmail.com Wed Apr 3 06:06:07 2019
From: trev.stephens at gmail.com (Trevor Stephens)
Date: Wed, 3 Apr 2019 21:06:07 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
I think #1 if any of these... Plotting functions should hopefully be as
general as possible, so tagging with a specific type of estimator will, in
some scikit-learn utopia, be unnecessary.
If a general plotter is built, where does it live in other
estimator-specific namespace options? Feels awkward to put it under every
estimator's namespace.
Then again, there might be a #4 where there is no plot module and plotting
classes live under groups of utilities like introspection, cross-validation
or something?...
On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> My preference would be for (1). I don't think the sub-namespace in (2) is
> necessary, and don't like (3), as I would prefer the plotting functions to
> be all in the same namespace sklearn.plot.
>
> Andrew
>
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile
> ResearchGate Profile
> Open Researcher and Contributor ID (ORCID)
>
> Github Profile
> Personal Website
> I live to learn, so I can learn to live. - me
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>
>
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
>
>> See https://github.com/scikit-learn/scikit-learn/issues/13448
>>
>> We've introduced several plotting functions (e.g., plot_tree and
>> plot_partial_dependence) and will introduce more (e.g.,
>> plot_decision_boundary) in the future. Consequently, we need to decide
>> where to put these functions. Currently, there're 3 proposals:
>>
>> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>
>> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>>
>> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
>> that we won't support from sklearn.XXX import plot_YYY)
>>
>> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
>> to invite opinions.
>>
>> Thanks
>>
>> Hanmin Qin
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From christian.braune79 at gmail.com Wed Apr 3 06:18:13 2019
From: christian.braune79 at gmail.com (Christian Braune)
Date: Wed, 3 Apr 2019 12:18:13 +0200
Subject: [scikit-learn] Can cluster help me to cluster data with length
of continuous series?
In-Reply-To:
References:
Message-ID:
Hi,
that does not really sound like a clustering but more like a preprocessing
problem to me. For each item you want to calculate the length of the
longest subsequence of "1"s. That could be done by a simple function and
would create a new (one-dimensional) property for each of your items.
You could then apply any clustering algorithm to this feature (i.e. you'd
be clustering a one-dimensional dataset)...
Regards,
Christian
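Christian's suggested preprocessing step, the length of the longest run of consecutive 1s per item, can be sketched as follows. The rows use the four time columns exactly as posted, and `longest_run` is just an illustrative helper name:

```python
import numpy as np

def longest_run(row):
    """Length of the longest run of consecutive 1s in a 0/1 sequence."""
    best = cur = 0
    for v in row:
        cur = cur + 1 if v == 1 else 0
        best = max(best, cur)
    return best

# rows for IDs 0..3, as in the original post
X = np.array([[1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])

lengths = np.array([longest_run(r) for r in X])
print(lengths)  # [1 1 2 3], i.e. ID=3 > 2 > 1 = 0 as desired
```

Any clustering algorithm (e.g. KMeans) could then be run on `lengths.reshape(-1, 1)`.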
On Wed, Apr 3, 2019 at 11:08, lampahome wrote:
> I have data which contain access duration of each items.
>
> EX: t0~t4 is the access time duration. 1 means the item was accessed in
> the time duration, 0 means not.
> ID,t0,t1,t2,t3,t4
> 0,1,0,0,1
> 1,1,0,0,1
> 2,0,0,1,1
> 3,0,1,1,1
>
> What I want to cluster is the length of continuous duration
> Ex:
> ID=3 > 2 > 1 = 0
>
> Can any distance metric to help clustering based on the length of
> continuous duration?
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From hollas at informatik.htw-dresden.de Wed Apr 3 06:28:22 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Wed, 3 Apr 2019 12:28:22 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
Message-ID:
I use
sum((cross_val_predict(model, X, y) - y)**2) / len(y)        (*)
to evaluate the performance of a model. This conforms with Murphy:
Machine Learning, section 6.5.3, and Hastie et al: The Elements of
Statistical Learning, eq. 7.48. However, according to the documentation
of cross_val_predict, "it is not appropriate to pass these predictions
into an evaluation metric". While it is obvious that cross_val_predict
is different from cross_val_score, I don't see what should be wrong with
(*).
Also, the explanation that "cross_val_predict simply returns the labels (or
probabilities)" is unclear, if not wrong. As I understand it, this function
returns estimates, not labels or probabilities.
Regards, Boris
From martin.watzenboeck at gmail.com Wed Apr 3 07:17:13 2019
From: martin.watzenboeck at gmail.com (Martin Watzenboeck)
Date: Wed, 3 Apr 2019 13:17:13 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative
correlation with observed values on random data
In-Reply-To:
References:
Message-ID:
Hi Alex,
Thanks a lot for the answer! That does indeed explain this phenomenon.
Also, I now see that with my data I can get meaningful LASSO predictions
by tuning the alpha parameter.
Cheers,
Martin
On Tue, Apr 2, 2019 at 21:33, Alexandre Gramfort
<alexandre.gramfort at inria.fr> wrote:
> in your example with random data Lasso leads to coef_ of zeros so you get
> as prediction : np.mean(Y[train])
>
> you'll see the same phenomenon if you do:
>
> pred = np.r_[pred, np.mean(Y[train])]
>
> Alex
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From rth.yurchak at pm.me Wed Apr 3 07:35:23 2019
From: rth.yurchak at pm.me (Roman Yurchak)
Date: Wed, 03 Apr 2019 11:35:23 +0000
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
+1 for option 1 and +0.5 for option 3. Do we anticipate that many plotting
functions will be added? If it's just a dozen or less, putting them all
into a single namespace sklearn.plot might be easier.
This also would avoid discussion about where to put some generic
plotting functions (e.g.
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
Roman
On 03/04/2019 12:06, Trevor Stephens wrote:
> I think #1 if any of these... Plotting functions should hopefully be as
> general as possible, so tagging with a specific type of estimator will,
> in some scikit-learn utopia, be unnecessary.
>
> If a general plotter is built, where does it live in other
> estimator-specific namespace options? Feels awkward to put it under
> every estimator's namespace.
>
> Then again, there might be a #4 where there is no plot module and
> plotting classes live under groups of utilities like introspection,
> cross-validation or something?...
>
> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
>
> My preference would be for (1). I don't think the sub-namespace in
> (2) is necessary, and don't like (3), as I would prefer the plotting
> functions to be all in the same namespace sklearn.plot.
>
> Andrew
>
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile
> ResearchGate Profile
> Open Researcher and Contributor ID (ORCID)
>
> Github Profile
> Personal Website
> I live to learn, so I can learn to live. - me
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>
>
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
>
> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to
> decide where to put these functions. Currently, there're 3
> proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g.,
> sklearn.tree.plot.plot_tree, note that we won't support from
> sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the
> mailing list to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From joel.nothman at gmail.com Wed Apr 3 07:59:18 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 3 Apr 2019 22:59:18 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
Message-ID:
The equations in Murphy and Hastie very clearly assume a metric
decomposable over samples (a loss function). Several popular metrics
are not.
For a metric like MSE it will be almost identical assuming the test
sets have almost the same size. For something like Recall
(sensitivity) it will be almost identical assuming similar test set
sizes *and* stratification. For something like precision whose
denominator is determined by the biases of the learnt classifier on
the test dataset, you can't say the same. For something like ROC AUC
score, relying on some decision function that may not be equivalently
calibrated across splits, evaluating in this way is almost
meaningless.
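To make the contrast concrete, here is a minimal sketch (dataset and classifier are arbitrary choices, not from the thread) comparing the mean of per-fold precisions with precision computed on pooled cross_val_predict output. For precision these are genuinely different quantities, because each fold's denominator is the number of positive predictions in that fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# an imbalanced toy problem
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Mean of per-fold precisions: each fold has its own denominator.
per_fold = cross_val_score(clf, X, y, cv=cv, scoring="precision").mean()

# Precision on pooled predictions: one global denominator, so folds where
# the classifier predicts positive more often carry more weight.
pooled = precision_score(y, cross_val_predict(clf, X, y, cv=cv))

print(per_fold, pooled)  # generally not equal
```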
On Wed, 3 Apr 2019 at 22:01, Boris Hollas
wrote:
>
> I use
>
> sum((cross_val_predict(model, X, y) - y)**2) / len(y) (*)
>
> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>
> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities.
>
> Regards, Boris
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed Apr 3 08:54:51 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 3 Apr 2019 08:54:51 -0400
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
Message-ID:
On 4/3/19 7:59 AM, Joel Nothman wrote:
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size. For something like Recall
> (sensitivity) it will be almost identical assuming similar test set
> sizes *and* stratification. For something like precision whose
> denominator is determined by the biases of the learnt classifier on
> the test dataset, you can't say the same. For something like ROC AUC
> score, relying on some decision function that may not be equivalently
> calibrated across splits, evaluating in this way is almost
> meaningless.
In theory. Not sure how it holds up in practice.
I didn't get the point about precision.
But yes, we should add to the docs that in particular for losses that
don't decompose this is a weird thing to do.
If the loss decomposes, the result might be different b/c of different
test set sizes, but I'm not sure if they are "worse" in some way?
From t3kcit at gmail.com Wed Apr 3 09:09:19 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 3 Apr 2019 09:09:19 -0400
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
I think what was not clear from the question is that there are actually
quite different kinds of plotting functions, and many of these are tied
to existing code.
Right now we have some that are specific to trees (plot_tree) and to
gradient boosting (plot_partial_dependence).
I think we want more general functions, and plot_partial_dependence has
been extended to general estimators.
However, the plotting functions might be generic wrt the estimator, but
relate to a specific function, say plotting results of GridSearchCV.
Then one might argue that having the plotting function close to
GridSearchCV might make sense.
Similarly for plotting partial dependence plots and feature importances,
it might be a bit strange to have the plotting functions not next to the
functions that compute these.
Another question is whether the plotting functions also "do the
work" in some cases:
Do we want plot_partial_dependence also to compute the partial
dependence? (I would argue yes but either way the result is a bit strange).
In that case you have somewhat of the same functionality in two
different modules, unless you also put the "compute partial dependence"
function in the plotting module as well,
which is a bit strange.
Maybe we could inform this discussion by listing candidate plotting
functions, and also considering whether they "do the work" and where the
"work" function is.
Other examples are plotting the confusion matrix, which probably should
also compute the confusion matrix (it's fast and so that would be
convenient), and so it would "duplicate" functionality from the metrics
module.
Plotting learning curves and validation curves should probably not do
the work as it's pretty involved, and so someone would need to import
the learning and validation curves from model selection, and then the
plotting functions from a plotting module.
Calibration curves, P/R curves and ROC curves are also pretty fast
to compute (and passing around the arguments is somewhat error prone) so
I would say the plotting functions for these should do the work as well.
Anyway, you can see that many plotting functions are actually associated
with functions in existing modules and the interactions are a bit unclear.
The only plotting functions I haven't mentioned so far that I thought
about in the past are "2d scatter" and "plot decision function". These
would be kind of generic, but mostly used in the examples.
Though having a discrete 2d scatter function would be pretty nice
(plt.scatter doesn't allow legends and makes it hard to use qualitative
color maps).
I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
case is not really that clear.
Cheers,
Andy
On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> functions will be added? If it's just a dozen or less, putting them all
> into a single namespace sklearn.plot might be easier.
>
> This also would avoid discussion about where to put some generic
> plotting functions (e.g.
> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
>
> Roman
>
> On 03/04/2019 12:06, Trevor Stephens wrote:
>> I think #1 if any of these... Plotting functions should hopefully be as
>> general as possible, so tagging with a specific type of estimator will,
>> in some scikit-learn utopia, be unnecessary.
>>
>> If a general plotter is built, where does it live in other
>> estimator-specific namespace options? Feels awkward to put it under
>> every estimator's namespace.
>>
>> Then again, there might be a #4 where there is no plot module and
>> plotting classes live under groups of utilities like introspection,
>> cross-validation or something?...
>>
>> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
>>
>> My preference would be for (1). I don't think the sub-namespace in
>> (2) is necessary, and don't like (3), as I would prefer the plotting
>> functions to be all in the same namespace sklearn.plot.
>>
>> Andrew
>>
>> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>> J. Andrew Howe, PhD
>> LinkedIn Profile
>> ResearchGate Profile
>> Open Researcher and Contributor ID (ORCID)
>>
>> Github Profile
>> Personal Website
>> I live to learn, so I can learn to live. - me
>> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>
>>
>> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
>>
>> See https://github.com/scikit-learn/scikit-learn/issues/13448
>>
>> We've introduced several plotting functions (e.g., plot_tree and
>> plot_partial_dependence) and will introduce more (e.g.,
>> plot_decision_boundary) in the future. Consequently, we need to
>> decide where to put these functions. Currently, there're 3
>> proposals:
>>
>> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>
>> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>>
>> (3) sklearn.XXX.plot.plot_YYY (e.g.,
>> sklearn.tree.plot.plot_tree, note that we won't support from
>> sklearn.XXX import plot_YYY)
>>
>> Joel Nothman, Gael Varoquaux and I decided to post it on the
>> mailing list to invite opinions.
>>
>> Thanks
>>
>> Hanmin Qin
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Wed Apr 3 09:28:52 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 3 Apr 2019 15:28:52 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
Message-ID: <20190403132852.2jdszy2rfp3kivk4@phare.normalesup.org>
On Wed, Apr 03, 2019 at 08:54:51AM -0400, Andreas Mueller wrote:
> If the loss decomposes, the result might be different b/c of different test
> set sizes, but I'm not sure if they are "worse" in some way?
Mathematically, a cross-validation estimates a double expectation: one
expectation on the model (ie the train data), and another on the test
data (see for instance eq 3 in
https://europepmc.org/articles/pmc5441396, sorry for the self citation,
this is seldom discussed in the literature).
The correct way to compute this double expectation is by averaging first
inside the fold and second across the folds. Other ways of computing
errors estimate other quantities, that are harder to study mathematically
and not comparable to objects studied in the literature.
Another problem with cross_val_predict is that some people use metrics
like correlation (which is a terrible metric and does not decompose
across folds). It will then pick up things like correlations across
folds.
All these problems are made worse when data are not iid, and hence folds
risk not being iid.
G
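The difference between the two orders of averaging only shows up when folds have unequal sizes; a minimal sketch, with random per-sample losses standing in for a decomposable metric like squared error:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
errors = rng.rand(103)  # per-sample losses; 103 samples so fold sizes differ
cv = KFold(n_splits=5)

# Average inside each fold first, then across folds (the double expectation).
fold_means = [errors[test].mean() for _, test in cv.split(errors)]
within_then_across = np.mean(fold_means)

# Pooling all samples at once weights larger folds more heavily.
pooled = errors.mean()

print(within_then_across, pooled)  # close, but not identical
```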
From joel.nothman at gmail.com Wed Apr 3 10:06:13 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 01:06:13 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
With option 1, sklearn.plot is likely to import large chunks of the
library (particularly, but not exclusively, if the plotting function
"does the work" as Andy suggests). This is under the assumption that
one plot function will want to import trees, another GPs, etc. Unless
we move to lazy imports, that would be against the current convention
that importing sklearn is fairly minimal.
I do like Andy's idea of framing this discussion more clearly around
likely candidates.
On Thu, 4 Apr 2019 at 00:10, Andreas Mueller wrote:
>
> I think what was not clear from the question is that there is actually
> quite different kinds of plotting functions, and many of these are tied
> to existing code.
>
> Right now we have some that are specific to trees (plot_tree) and to
> gradient boosting (plot_partial_dependence).
>
> I think we want more general functions, and plot_partial_dependence has
> been extended to general estimators.
>
> However, the plotting functions might be generic wrt the estimator, but
> relate to a specific function, say plotting results of GridSearchCV.
> Then one might argue that having the plotting function close to
> GridSearchCV might make sense.
> Similarly for plotting partial dependence plots and feature importances,
> it might be a bit strange to have the plotting functions not next to the
> functions that compute these.
> Another question would be is whether the plotting functions also "do the
> work" in some cases:
> Do we want plot_partial_dependence also to compute the partial
> dependence? (I would argue yes but either way the result is a bit strange).
> In that case you have somewhat of the same functionality in two
> different modules, unless you also put the "compute partial dependence"
> function in the plotting module as well,
> which is a bit strange.
>
> Maybe we could inform this discussion by listing candidate plotting
> functions, and also considering whether they "do the work" and where the
> "work" function is.
>
> Other examples are plotting the confusion matrix, which probably should
> also compute the confusion matrix (it's fast and so that would be
> convenient), and so it would "duplicate" functionality from the metrics
> module.
>
> Plotting learning curves and validation curves should probably not do
> the work as it's pretty involved, and so someone would need to import
> the learning and validation curves from model selection, and then the
> plotting functions from a plotting module.
>
> Calibrations curves and P/R curves and roc curves are also pretty fast
> to compute (and passing around the arguments is somewhat error prone) so
> I would say the plotting functions for these should do the work as well.
>
> Anyway, you can see that many plotting functions are actually associated
> with functions in existing modules and the interactions are a bit unclear.
>
> The only plotting functions I haven't mentioned so far that I thought
> about in the past are "2d scatter" and "plot decision function". These
> would be kind of generic, but mostly used in the examples.
> Though having a discrete 2d scatter function would be pretty nice
> (plt.scatter doesn't allow legends and makes it hard to use qualitative
> color maps).
>
>
> I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
> case is not really that clear.
>
> Cheers,
>
> Andy
>
> On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> > functions will be added? If it's just a dozen or less, putting them all
> > into a single namespace sklearn.plot might be easier.
> >
> > This also would avoid discussion about where to put some generic
> > plotting functions (e.g.
> > https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
> >
> > Roman
> >
> > On 03/04/2019 12:06, Trevor Stephens wrote:
> >> I think #1 if any of these... Plotting functions should hopefully be as
> >> general as possible, so tagging with a specific type of estimator will,
> >> in some scikit-learn utopia, be unnecessary.
> >>
> >> If a general plotter is built, where does it live in other
> >> estimator-specific namespace options? Feels awkward to put it under
> >> every estimator's namespace.
> >>
> >> Then again, there might be a #4 where there is no plot module and
> >> plotting classes live under groups of utilities like introspection,
> >> cross-validation or something?...
> >>
> >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> >>
> >> My preference would be for (1). I don't think the sub-namespace in
> >> (2) is necessary, and don't like (3), as I would prefer the plotting
> >> functions to be all in the same namespace sklearn.plot.
> >>
> >> Andrew
> >>
> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >> J. Andrew Howe, PhD
> >> LinkedIn Profile
> >> ResearchGate Profile
> >> Open Researcher and Contributor ID (ORCID)
> >>
> >> Github Profile
> >> Personal Website
> >> I live to learn, so I can learn to live. - me
> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >>
> >>
> >> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
> >>
> >> See https://github.com/scikit-learn/scikit-learn/issues/13448
> >>
> >> We've introduced several plotting functions (e.g., plot_tree and
> >> plot_partial_dependence) and will introduce more (e.g.,
> >> plot_decision_boundary) in the future. Consequently, we need to
> >> decide where to put these functions. Currently, there're 3
> >> proposals:
> >>
> >> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> >>
> >> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> >>
> >> (3) sklearn.XXX.plot.plot_YYY (e.g.,
> >> sklearn.tree.plot.plot_tree, note that we won't support from
> >> sklearn.XXX import plot_YYY)
> >>
> >> Joel Nothman, Gael Varoquaux and I decided to post it on the
> >> mailing list to invite opinions.
> >>
> >> Thanks
> >>
> >> Hanmin Qin
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> >>
From hollas at informatik.htw-dresden.de Wed Apr 3 12:50:24 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Wed, 3 Apr 2019 18:50:24 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
Message-ID: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Am 03.04.19 um 13:59 schrieb Joel Nothman:
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size.
What will be almost identical to what? I suppose you mean that (*) is
consistent with the scores of the models in the folds (i.e., the result of
cross_val_score) if the loss function is (x-y)^2.
> For something like Recall
> (sensitivity) it will be almost identical assuming similar test set
> sizes *and* stratification. For something like precision whose
> denominator is determined by the biases of the learnt classifier on
> the test dataset, you can't say the same.
I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then
(*) gives the accuracy.
> For something like ROC AUC
> score, relying on some decision function that may not be equivalently
> calibrated across splits, evaluating in this way is almost
> meaningless.
In any case, I still don't see what may be wrong with (*). Otherwise,
the warning in the documentation about the use of cross_val_predict
should be removed or revised.
On the other hand, an example in the documentation uses
cross_val_scores.mean(). This is debatable since this computes a mean of
means.
>
> On Wed, 3 Apr 2019 at 22:01, Boris Hollas
> wrote:
>> I use
>>
>> sum((cross_val_predict(model, X, y) - y)**2) / len(y) (*)
>>
>> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>>
>> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates, not labels or probabilities.
>>
>> Regards, Boris
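For readers following along, Boris's formula (*) and the mean of per-fold scores can be compared directly. A minimal sketch on synthetic data (the model and data here are illustrative assumptions, not from the thread):

```python
# Compare formula (*) -- pooled squared error over out-of-fold predictions --
# with the mean of per-fold MSEs, which is what cross_val_score averages.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(103, 5)                        # 103 samples -> unequal fold sizes
y = X @ rng.randn(5) + 0.1 * rng.randn(103)  # linear signal plus noise

model = Ridge()
cv = KFold(n_splits=5)

# Formula (*): one pooled error over all out-of-fold predictions
pred = cross_val_predict(model, X, y, cv=cv)
pooled_mse = np.sum((pred - y) ** 2) / len(y)

# Mean of per-fold MSEs (a mean of means)
fold_mse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_mean_squared_error")
mean_of_means = fold_mse.mean()

print(pooled_mse, mean_of_means)  # close for MSE with near-equal fold sizes
```

With a decomposable metric like MSE and nearly equal fold sizes the two numbers are close but, in general, not identical.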
From rr.rosas at gmail.com Wed Apr 3 14:38:37 2019
From: rr.rosas at gmail.com (Rodrigo Rosenfeld Rosas)
Date: Wed, 3 Apr 2019 15:38:37 -0300
Subject: [scikit-learn] How to answer questions from big documents?
Message-ID:
Hi everyone, this is my first post here :)
About two weeks ago, due to low demand in my project, I was assigned a
completely unusual request: to automatically extract answers from documents
using machine learning. I've never read anything about ML, AI or NLP
before, so I've basically been doing just that for the past two weeks.
When it comes to ML, most book recommendations and tutorials I've found so
far use the Python language and tools, so I took the first week to learn
about Python, NumPy, scikit-learn, pandas, Matplotlib and so on. Then, this week I
started reading about NLP itself, after spending a few days reading about
generic ML algorithms.
So far, I've basically read about Bag of Words, using TF-IDF (or simply
terms count) to convert the words to numeric representations and a few
methods such as the gaussian and multinomial naive bayes methods to train
and predict values. The methods also mention the importance of using the
usual pre-processing methods such as lemmatization and alikes. However,
basically all examples assume that a given text can be classified in one of
the categorized topics, like the sentiment analysis use case. I'm afraid
this doesn't represent my use case, so I'd like to describe it here so that
you can help me identify which methods I should be looking for.
We have a system with thousands of transactions/deals entered manually by a
specialized team. Each deal has a set of documents (typically a dozen per
deal), and some documents can have hundreds of pages. The inputting team
has to extract about a thousand fields from those documents for any
particular deal. So, in our database we have all their data and we
typically also know the document specific snippets associated to each field
value.
So, my task is to, given a new document and deal, and based on the previous
answers, fill in as many fields as I could by automatically finding the
corresponding snippets in the new documents. I'm not sure how I should
approach this problem.
For example, I could consider each sentence of the document as a separate
document to be analyzed and compared to the snippets I already have for the
matching data. However, I can't be sure whether some of those sentences
would actually answer the question. For example, maybe there are 6
occurrences in the documents that would answer a particular question/field,
but maybe the inputters only identified 2 or 3 of them.
Also, for any given sentence, it could tell that the answer for a given
field is A or B, or it could be that there's absolutely no association
between the sentence and the field/question, as it would be the case for
most sentences. I know that Scikit provides the predict_proba method, so
that I could try to only consider the sentence as relevant if the
probabilities of answering the question would be above 80%, for example,
but based on a few quick tests I've made with a few sentences and words, I
suspect this won't work very well. Also, it could be quite slow to treat
each sentence of a document hundreds of pages long as a separate document
to be analyzed, so I'm not sure if there are better methods to handle this
use case.
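As a rough, hypothetical sketch of the sentence-level approach Rodrigo describes (TF-IDF features, a probabilistic classifier, and a confidence threshold): all names and training snippets below are toy placeholders, not the actual deal data.

```python
# Classify individual sentences against known field values, and only accept
# a prediction when the classifier's probability clears a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training snippets (stand-ins for the snippets already tagged by the team)
train_snippets = [
    "plan of merger by and among buyer and the company",
    "merger of purchaser with and into the company",
    "sellers are willing to sell to buyer all of their assets",
    "seller wishes to sell and assign to buyer the purchased assets",
]
labels = ["merger", "merger", "asset_purchase", "asset_purchase"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_snippets, labels)

new_sentences = [
    "buyer wishes to purchase the assets from seller",
    "the parties met for lunch",  # likely irrelevant to any field
]
probas = clf.predict_proba(new_sentences)
for sent, p in zip(new_sentences, probas):
    label = clf.classes_[p.argmax()]
    # accept only confident matches, e.g. above the 80% mentioned in the post
    accepted = label if p.max() > 0.8 else "no confident match"
    print(sent, "->", accepted)
```

With so little training data the probabilities will rarely clear the threshold; the point is only the shape of the pipeline, not the numbers.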
Some of the fields are free-text ones, like company and firm names, for
example, and I suspect those would be the hardest to answer, so I'm trying
to start with the multiple-choice ones, which have a finite set of possible values.
How would you advise me to look at this problem? Are there any algorithms
you'd recommend me to study for solving this particular problem?
Here are some sample data so that you could get a better understanding of
the problem:
One of the fields is called "Deal Structure" and it could have the
following values: "Asset Purchase", "Stock or Equity Purchase" or "Public
Target Merger" (there are a few others, but this gives you an idea).
So, here are some sentences highlighted for Public Target Merger deals
(those documents come from Edgar Filings public database which are freely
available for US deals):
deal 1 / doc 1: "AGREEMENT AND PLAN OF MERGER, dated as of March 14, 2018
(this "Agreement"), by and among HarborOne Bancorp, Inc., a Massachusetts
corporation ("Buyer"), Massachusetts Acquisitions, LLC, a Maryland limited
liability company of which Buyer is the sole member ("Merger LLC"), and
Coastway Bancorp, Inc., a Maryland corporation (the "Company")."
"WHEREAS, Buyer, Merger LLC, and the Company intend to effect a merger (the
"Merger") of Merger LLC with and into the Company in accordance with this
Agreement and the Maryland General Corporation Law (the "MGCL") and the
Maryland Limited Liability Company Act, as amended (the "MLLCA"), with the
Company to be the surviving entity in the Merger. The Merger will be
followed immediately by a merger of the Company with and into Buyer (the
"Upstream Merger"), with the Buyer to be the surviving entity in the
Upstream Merger. It is intended that the Merger be mutually interdependent
with and a condition precedent to the Upstream Merger and that the Upstream
Merger shall, through the binding commitment evidenced by this Agreement,
be effected immediately following the Effective Time (as defined below)
without further approval, authorization or direction from or by any of the
parties hereto; and"
deal 2 / doc 1:
"WHEREAS, it is also proposed that, as soon as practicable following the
consummation of the Offer, the Parties wish to effect the acquisition of
the Company by Parent through the merger of Purchaser with and into the
Company, with the Company being the surviving entity (the "Merger");"
Now, for Asset Purchase deals:
deal 3 / doc 1:
"Subject to the terms and conditions of this Agreement, Sellers are willing
to sell to Buyer, and Buyer is willing to purchase from Sellers, all of
their assets relating to the Businesses as set forth herein."
deal 4 / doc 1:
"WHEREAS, Seller wishes to sell and assign to Buyer, and Buyer wishes to
purchase and assume from Seller, the rights and obligations of Seller to
the Purchased Assets (as defined herein), subject to the terms and
conditions set forth herein."
Please forgive me for any imprecise/incorrect terms or understanding on
this topic as this is all very new to me. Any help is very appreciated.
I've also asked this question in StackOverflow, so if you'd prefer to
answer there instead, here is the link:
https://stackoverflow.com/questions/55499866/how-to-answer-questions-from-big-documents
Would this field be called data mining? Feature extraction? Question
answering? I'm not sure how to properly search about this subject so any
hints are very welcome :)
Thanks in advance,
Rodrigo.
From joel.nothman at gmail.com Wed Apr 3 17:46:57 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 08:46:57 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
References:
<1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Message-ID:
Pull requests improving the documentation are always welcome. At a minimum,
users need to know that these compute different things.
Accuracy is not precision. Precision is the number of true positives
divided by the number of true positives plus false positives. It therefore
cannot be decomposed as a sample-wise measure without knowing the rate of
positive predictions. This rate is dependent on the training data and
algorithm.
I'm not a statistician and cannot speak to issues of computing a mean of
means, but if what we are trying to estimate is the performance on a sample
of size approximately n_t of a model trained on a sample of size
approximately N - n_t, then I wouldn't have thought taking a mean over such
measures (with whatever score function) to be unreasonable.
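Joel's point can be made concrete: precision from pooled cross-validated predictions is a different quantity from the mean of per-fold precisions, because each fold's denominator (the number of predicted positives) depends on that fold's training data. A hedged illustration on synthetic data (the dataset and classifier are assumptions for the sketch):

```python
# Precision via pooled out-of-fold predictions vs. mean of per-fold precisions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_val_score)

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# Pooled: compute precision once over all out-of-fold predictions
pooled = precision_score(y, cross_val_predict(clf, X, y, cv=cv))

# Per-fold: precision within each fold, then averaged
per_fold = cross_val_score(clf, X, y, cv=cv, scoring="precision")

print(pooled, per_fold.mean())  # related, but not the same quantity
```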
From ericmajinglong at gmail.com Wed Apr 3 18:59:02 2019
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 4 Apr 2019 00:59:02 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
This is not a strongly-held suggestion - but what about adopting
YellowBrick as the plotting API for sklearn? Not sure how exactly the
interaction would work - could be PRs to their library, or ask them to
integrate into sklearn, or do a lock-step dance with versions but maintain
separate teams? (I know it raises more questions than answers, but wanted
to put it out there.)
On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman wrote:
> With option 1, sklearn.plot is likely to import large chunks of the
> library (particularly, but not exclusively, if the plotting function
> "does the work" as Andy suggests). This is under the assumption that
> one plot function will want to import trees, another GPs, etc. Unless
> we move to lazy imports, that would be against the current convention
> that importing sklearn is fairly minimal.
>
> I do like Andy's idea of framing this discussion more clearly around
> likely candidates.
>
> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller wrote:
> >
> > I think what was not clear from the question is that there is actually
> > quite different kinds of plotting functions, and many of these are tied
> > to existing code.
> >
> > Right now we have some that are specific to trees (plot_tree) and to
> > gradient boosting (plot_partial_dependence).
> >
> > I think we want more general functions, and plot_partial_dependence has
> > been extended to general estimators.
> >
> > However, the plotting functions might be generic wrt the estimator, but
> > relate to a specific function, say plotting results of GridSearchCV.
> > Then one might argue that having the plotting function close to
> > GridSearchCV might make sense.
> > Similarly for plotting partial dependence plots and feature importances,
> > it might be a bit strange to have the plotting functions not next to the
> > functions that compute these.
> > Another question is whether the plotting functions also "do the
> > work" in some cases:
> > Do we want plot_partial_dependence also to compute the partial
> > dependence? (I would argue yes but either way the result is a bit
> strange).
> > In that case you have somewhat of the same functionality in two
> > different modules, unless you also put the "compute partial dependence"
> > function in the plotting module as well,
> > which is a bit strange.
> >
> > Maybe we could inform this discussion by listing candidate plotting
> > functions, and also considering whether they "do the work" and where the
> > "work" function is.
> >
> > Other examples are plotting the confusion matrix, which probably should
> > also compute the confusion matrix (it's fast and so that would be
> > convenient), and so it would "duplicate" functionality from the metrics
> > module.
> >
> > Plotting learning curves and validation curves should probably not do
> > the work as it's pretty involved, and so someone would need to import
> > the learning and validation curves from model selection, and then the
> > plotting functions from a plotting module.
> >
> > Calibrations curves and P/R curves and roc curves are also pretty fast
> > to compute (and passing around the arguments is somewhat error prone) so
> > I would say the plotting functions for these should do the work as well.
> >
> > Anyway, you can see that many plotting functions are actually associated
> > with functions in existing modules and the interactions are a bit
> unclear.
> >
From joel.nothman at gmail.com Wed Apr 3 19:50:51 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 10:50:51 +1100
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
Message-ID:
The core developers of Scikit-learn have recently voted to welcome
Thomas Fan and Nicolas Hug to the team, in recognition of their
efforts and trustworthiness as contributors. Both happen to be working
with Andy Mueller at Columbia University at the moment.
Congratulations and thanks to them both!
From qinhanmin2005 at sina.com Wed Apr 3 21:05:55 2019
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Thu, 04 Apr 2019 09:05:55 +0800
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
Message-ID: <20190404010555.772254140094@webmail.sinamail.sina.com.cn>
Congratulations and welcome to the team!
Hanmin Qin
From t3kcit at gmail.com Wed Apr 3 23:11:36 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 3 Apr 2019 23:11:36 -0400
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
In-Reply-To: <20190404010555.772254140094@webmail.sinamail.sina.com.cn>
References: <20190404010555.772254140094@webmail.sinamail.sina.com.cn>
Message-ID:
Congratulations guys! Great work! Looking forward to much more! Proud to
have you on the team!
Now we in NYC can approve our own pull requests ;)
Sent from phone. Please excuse spelling and brevity.
From hollas at informatik.htw-dresden.de Thu Apr 4 03:39:14 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Thu, 4 Apr 2019 09:39:14 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
<1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Message-ID:
Am 03.04.19 um 23:46 schrieb Joel Nothman:
> Pull requests improving the documentation are always welcome. At a
> minimum, users need to know that these compute different things.
>
> Accuracy is not precision. Precision is the number of true positives
> divided by the number of true positives plus false positives. It
> therefore cannot be decomposed as a sample-wise measure without
> knowing the rate of positive predictions. This rate is dependent on
> the training data and algorithm.
In my last post, I referred to your remark that "for precision ... you
can't say the same". Since precision can't be computed with formula (*),
even with a different loss function, I pointed out that (*) can be used
to compute the accuracy if the loss function is an indicator function.
It is still not clear to me what your point is with your remark that
"for precision ... you can't say the same". I assume that you want to
tell that it is not wise to compute TP, FP, FN and then precision and
recall using cross_val_predict. If this is what you mean, I'd like you
to explain why.
> I'm not a statistician and cannot speak to issues of computing a mean
> of means, but if what we are trying to estimate is the performance on
> a sample of size approximately n_t of a model trained on a sample of
> size approximately N - n_t, then I wouldn't have thought taking a mean
> over such measures (with whatever score function) to be unreasonable.
>
In general, a mean of means is not the mean of the original data. The
pooled mean is the correct metric in this case. However, the pooled mean
equals the mean of means if all folds are exactly the same size.
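A tiny numeric illustration of that last point (the values are made up): with unequal fold sizes, the mean of per-fold means differs from the pooled mean.

```python
# Two "folds" of unequal size: mean of means weights each fold equally,
# while the pooled mean weights each sample equally.
import numpy as np

fold_a = np.array([1.0, 1.0, 1.0, 1.0])  # 4 samples, mean 1.0
fold_b = np.array([3.0, 3.0])            # 2 samples, mean 3.0

mean_of_means = (fold_a.mean() + fold_b.mean()) / 2       # 2.0
pooled_mean = np.concatenate([fold_a, fold_b]).mean()     # 10/6 ~ 1.667

print(mean_of_means, pooled_mean)
```

With equal fold sizes the two coincide, which is why the discrepancy is often negligible in practice.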
From joel.nothman at gmail.com Thu Apr 4 04:03:16 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 19:03:16 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
<1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Message-ID:
> I assume that you want to tell that it is not wise to compute TP, FP, FN
and then precision and recall using cross_val_predict. If this is what you
mean, I'd like you to explain why.
Because if there is high variance as a function of training set rather than
test sample I'd like to know.
> The pooled mean is the correct metric in this case.
I don't think we are in agreement on that.
From alexandre.gramfort at inria.fr Thu Apr 4 05:40:48 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Thu, 4 Apr 2019 11:40:48 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
I also think that the YellowBrick folks did a great job and that we should
not reinvent the wheel, or at least we should have a clear idea of how we
differ in scope from YellowBrick.
my 2c
Alex
>> > In that case you have somewhat of the same functionality in two
>> > different modules, unless you also put the "compute partial dependence"
>> > function in the plotting module as well,
>> > which is a bit strange.
>> >
>> > Maybe we could inform this discussion by listing candidate plotting
>> > functions, and also considering whether they "do the work" and where the
>> > "work" function is.
>> >
>> > Other examples are plotting the confusion matrix, which probably should
>> > also compute the confusion matrix (it's fast and so that would be
>> > convenient), and so it would "duplicate" functionality from the metrics
>> > module.
>> >
>> > Plotting learning curves and validation curves should probably not do
>> > the work as it's pretty involved, and so someone would need to import
>> > the learning and validation curves from model selection, and then the
>> > plotting functions from a plotting module.
>> >
>> > Calibrations curves and P/R curves and roc curves are also pretty fast
>> > to compute (and passing around the arguments is somewhat error prone) so
>> > I would say the plotting functions for these should do the work as well.
>> >
>> > Anyway, you can see that many plotting functions are actually associated
>> > with functions in existing modules and the interactions are a bit
>> unclear.
>> >
>> > The only plotting functions I haven't mentioned so far that I thought
>> > about in the past are "2d scatter" and "plot decision function". These
>> > would be kind of generic, but mostly used in the examples.
>> > Though having a discrete 2d scatter function would be pretty nice
>> > (plt.scatter doesn't allow legends and makes it hard to use qualitative
>> > color maps).
>> >
>> >
>> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
>> > case is not really that clear.
>> >
>> > Cheers,
>> >
>> > Andy
>> >
>> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
>> > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
>> > > functions will be added? If it's just a dozen or less, putting them
>> all
>> > > into a single namespace sklearn.plot might be easier.
>> > >
>> > > This also would avoid discussion about where to put some generic
>> > > plotting functions (e.g.
>> > >
>> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479
>> ).
>> > >
>> > > Roman
>> > >
>> > > On 03/04/2019 12:06, Trevor Stephens wrote:
>> > >> I think #1 if any of these... Plotting functions should hopefully be
>> as
>> > >> general as possible, so tagging with a specific type of estimator
>> will,
>> > >> in some scikit-learn utopia, be unnecessary.
>> > >>
>> > >> If a general plotter is built, where does it live in other
>> > >> estimator-specific namespace options? Feels awkward to put it under
>> > >> every estimator's namespace.
>> > >>
>> > >> Then again, there might be a #4 where there is no plot module and
>> > >> plotting classes live under groups of utilities like introspection,
>> > >> cross-validation or something?...
>> > >>
>> > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
>> > >>
>> > >> My preference would be for (1). I don't think the sub-namespace
>> in
>> > >> (2) is necessary, and don't like (3), as I would prefer the
>> plotting
>> > >> functions to be all in the same namespace sklearn.plot.
>> > >>
>> > >> Andrew
>> > >>
>> > >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>> > >> J. Andrew Howe, PhD
>> > >> LinkedIn Profile
>> > >> ResearchGate Profile <
>> http://www.researchgate.net/profile/John_Howe12/>
>> > >> Open Researcher and Contributor ID (ORCID)
>> > >>
>> > >> Github Profile
>> > >> Personal Website
>> > >> I live to learn, so I can learn to live. - me
>> > >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>> > >>
>> > >>
>> > >> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
>> > >>
>> > >> See
>> https://github.com/scikit-learn/scikit-learn/issues/13448
>> > >>
>> > >> We've introduced several plotting functions (e.g.,
>> plot_tree and
>> > >> plot_partial_dependence) and will introduce more (e.g.,
>> > >> plot_decision_boundary) in the future. Consequently, we
>> need to
>> > >> decide where to put these functions. Currently, there're 3
>> > >> proposals:
>> > >>
>> > >> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>> > >>
>> > >> (2) sklearn.plot.XXX.plot_YYY (e.g.,
>> sklearn.plot.tree.plot_tree)
>> > >>
>> > >> (3) sklearn.XXX.plot.plot_YYY (e.g.,
>> > >> sklearn.tree.plot.plot_tree, note that we won't support from
>> > >> sklearn.XXX import plot_YYY)
>> > >>
>> > >> Joel Nothman, Gael Varoquaux and I decided to post it on the
>> > >> mailing list to invite opinions.
>> > >>
>> > >> Thanks
>> > >>
>> > >> Hanmin Qin
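Andy's suggestion above that some plotting functions could also "do the work" (computing the confusion matrix before plotting it, rather than taking a precomputed one) might look roughly like this. This is an editor's sketch: the helper name plot_confusion_matrix is illustrative only, not an existing scikit-learn API at the time of this thread.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for the example
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(estimator, X, y, ax=None):
    """Compute the confusion matrix from a fitted estimator, then plot it."""
    cm = confusion_matrix(y, estimator.predict(X))
    if ax is None:
        _, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    for (i, j), v in np.ndenumerate(cm):  # annotate each cell with its count
        ax.text(j, i, str(v), ha="center", va="center")
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    return ax, cm

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
ax, cm = plot_confusion_matrix(clf, X, y)
```

Because the helper both computes and plots, the same computation also lives in sklearn.metrics.confusion_matrix, which is the duplication Andy flags.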
From t3kcit at gmail.com Thu Apr 4 10:24:40 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 4 Apr 2019 10:24:40 -0400
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
I would argue that sklearn users would benefit from having these solutions in
scikit-learn. The yellowbrick API is quite different from the approaches we
discussed. If we can reuse their implementations, I think we should do so
and give credit where we can.
Having plotting in sklearn is also likely to attract more contributors, and
gives us more eyes for doing reviews.
Sent from phone. Please excuse spelling and brevity.
On Thu, Apr 4, 2019, 05:43 Alexandre Gramfort
wrote:
> I also think that YellowBrick folks did a great job and that we should not
> reinvent the wheel or at least have clear idea of how we differ in scope
> with respect to YellowBrick
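Earlier in the thread Andy notes that plt.scatter makes per-class legends and qualitative color maps awkward. A minimal sketch of the "discrete 2d scatter" helper he mentions (illustrative only, not a proposed API) draws one artist per class so ax.legend() works:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for the example
import matplotlib.pyplot as plt
import numpy as np

def discrete_scatter(x1, x2, y, ax=None):
    """Scatter plot with one artist per class so a legend can be attached."""
    if ax is None:
        _, ax = plt.subplots()
    for label in np.unique(y):
        mask = y == label  # select the points belonging to this class
        ax.scatter(x1[mask], x2[mask], label=str(label))
    ax.legend(title="class")
    return ax

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = rng.randint(0, 3, size=60)
ax = discrete_scatter(X[:, 0], X[:, 1], y)
```

One artist per class also means each class can pick up a distinct entry from a qualitative colormap, which a single plt.scatter call with a color array does not give you.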
From joel.nothman at gmail.com Thu Apr 4 17:12:09 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 5 Apr 2019 08:12:09 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
Well, it would certainly be a low-effort improvement to demonstrate
yellowbrick in our examples.
From heitor.boschirolli at gmail.com Sat Apr 6 13:07:38 2019
From: heitor.boschirolli at gmail.com (Heitor Boschirolli)
Date: Sat, 6 Apr 2019 14:07:38 -0300
Subject: [scikit-learn] Starting to contribute
Message-ID:
Hello!
First of all, I apologize if this mailing list is not the right place for
such questions; I have never contributed to open source before and I'm not
sure how to proceed. Could someone help me with that?
Should I just pick an issue, solve it following the guidelines described on
the website, and open a PR?
If I run into trouble, can I post about it on the mailing list?
Att, Heitor Boschirolli
From ahowe42 at gmail.com Sun Apr 7 05:08:24 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Sun, 7 Apr 2019 10:08:24 +0100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:
I'm with Andreas on this. As a user, I would prefer to see this as part of
sklearn with the usual sklearn api. If we want static matplotlib-style
images, reusing (with credit) some of the yellowbrick implementations is a
good idea.
Would we consider plotly-based visualizations? I've been doing my own
plotting in plotly for the last month, and can't imagine going back to
static matplotlib plots...
Andrew
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile
ResearchGate Profile
Open Researcher and Contributor ID (ORCID)
Github Profile
Personal Website
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
On Thu, Apr 4, 2019 at 3:26 PM Andreas Mueller wrote:
> I would argue that sklearn users would benefit in having solutions in
> scikit-learn. The yellowbrick api is quite different from the approaches we
> discussed. If we can reuse their implementations I think we should do so
> and credit where we can.
From rth.yurchak at pm.me Sun Apr 7 05:23:56 2019
From: rth.yurchak at pm.me (Roman Yurchak)
Date: Sun, 07 Apr 2019 09:23:56 +0000
Subject: [scikit-learn] Starting to contribute
In-Reply-To:
References:
Message-ID: <6cT00pTFFXyDphB5zuehmAGa1J-m9yvO8UcgcbA9wFY8KzGojBgoA5pfPQZakaVTF88utkZlF9v-qCyfIHledAKfzFtXXvqvTBNkT975it8=@pm.me>
Hello Heitor,
yes, you can choose an issue, comment there that you plan to work on it
(to avoid redundant work by other contributors), and, if no one objects,
open a PR. If you have any questions, you can ask them by commenting on
that issue (for specific questions) or on the scikit-learn Gitter
https://gitter.im/scikit-learn/scikit-learn (for general questions about
how to contribute).
See https://scikit-learn.org/stable/developers/contributing.html for
more information.
Roman
On 06/04/2019 19:07, Heitor Boschirolli wrote:
> Hello!
>
> First of all, I'm apologize if this email is not for such questions, but
> I never contributed to open source code before and I'm not sure how to
> proceed, could someone help me with that?
>
> Should I just pick an issue, solve it following the guidelines described
> in the website and open a PR?
> If I have any trouble, can I post it on the mailing list?
>
> Att, Heitor Boschirolli
From emmanuelle.gouillart at nsup.org Sun Apr 7 11:41:48 2019
From: emmanuelle.gouillart at nsup.org (Emmanuelle Gouillart)
Date: Sun, 7 Apr 2019 17:41:48 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting
functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>