From qinhanmin2005 at sina.com Tue Apr 2 10:36:03 2019 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Tue, 02 Apr 2019 22:36:03 +0800 Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions? Message-ID: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn> See https://github.com/scikit-learn/scikit-learn/issues/13448 We've introduced several plotting functions (e.g., plot_tree and plot_partial_dependence) and will introduce more (e.g., plot_decision_boundary) in the future. Consequently, we need to decide where to put these functions. Currently, there're 3 proposals: (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree) (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree) (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note that we won't support from sklearn.XXX import plot_YYY) Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list to invite opinions. Thanks Hanmin Qin -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin.watzenboeck at gmail.com Tue Apr 2 14:57:51 2019 From: martin.watzenboeck at gmail.com (Martin Watzenboeck) Date: Tue, 2 Apr 2019 20:57:51 +0200 Subject: [scikit-learn] LASSO: Predicted values show negative correlation with observed values on random data Message-ID: Hello, I tried to apply LASSO regression in combination with LeaveOneOut CV on my data, and observed a significant negative correlation between predicted and observed response values. I tried to replicate the problem using random data (please see code below). Anyone have an idea what I am doing wrong? I would very much like to use LASSO regression on my data. Thanks a lot! Cheers, Martin #Lasso example from sklearn.linear_model import Lasso from sklearn.model_selection import LeaveOneOut from scipy.stats import pearsonr import numpy as np n_samples = 500 n_features = 30 #create random features rng = np.random.RandomState(seed=42) X = rng.randn(n_samples * n_features).reshape(n_samples, n_features) #Create Ys Y = rng.randn(n_samples) #instantiate regressor and cv object cv = LeaveOneOut() reg = Lasso(random_state = 42) #create arrays to save predicted (and observed) Y values pred = np.array([]) obs = np.array([]) #run cross validation for train, test in cv.split(X, Y): #fit regressor reg.fit(X[train], Y[train]) #append predicted and observed values to the arrays pred = np.r_[pred, reg.predict(X[test])] obs = np.r_[obs, Y[test]] #test correlation pearsonr(pred, obs) -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Tue Apr 2 15:33:02 2019 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Tue, 2 Apr 2019 21:33:02 +0200 Subject: [scikit-learn] LASSO: Predicted values show negative correlation with observed values on random data In-Reply-To: References: Message-ID: in your example with random data Lasso leads to coef_ of zeros so you get as prediction : np.mean(Y[train]) you'll see the same phenomenon if you do: pred = np.r_[pred, np.mean(Y[train])] Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Tue Apr 2 22:44:22 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Wed, 3 Apr 2019 11:44:22 +0900 Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions? 
In-Reply-To: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

As a user, I feel that (2) "sklearn.plot.XXX.plot_YYY" best allows for future
expansion of sub-namespaces in a tractable way that is also easy to understand
during code review. For example, sklearn.plot.tree.plot_forest() or
sklearn.plot.lasso.plot_*. Just my opinion.

J.B.

On Tue, 2 Apr 2019 at 23:40, Hanmin Qin wrote:

> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From pahome.chen at mirlab.org Wed Apr 3 05:07:08 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 3 Apr 2019 17:07:08 +0800
Subject: [scikit-learn] Can cluster help me to cluster data with length of continuous series?
Message-ID:

I have data which contains the access duration of each item.

EX: t0~t3 are the access time slots. 1 means the item was accessed in that
time slot, 0 means it was not.
ID,t0,t1,t2,t3
0,1,0,0,1
1,1,0,0,1
2,0,0,1,1
3,0,1,1,1

What I want to cluster on is the length of the continuous access duration.
Ex:
ID=3 > 2 > 1 = 0

Is there any distance metric that can help with clustering based on the
length of the continuous duration?

From ahowe42 at gmail.com Wed Apr 3 05:52:18 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Wed, 3 Apr 2019 10:52:18 +0100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

My preference would be for (1). I don't think the sub-namespace in (2) is
necessary, and don't like (3), as I would prefer the plotting functions to
be all in the same namespace sklearn.plot.

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile
ResearchGate Profile
Open Researcher and Contributor ID (ORCID)
Github Profile
Personal Website
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:

> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From trev.stephens at gmail.com Wed Apr 3 06:06:07 2019
From: trev.stephens at gmail.com (Trevor Stephens)
Date: Wed, 3 Apr 2019 21:06:07 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

I think #1 if any of these... Plotting functions should hopefully be as
general as possible, so tagging with a specific type of estimator will, in
some scikit-learn utopia, be unnecessary.

If a general plotter is built, where does it live in other
estimator-specific namespace options? Feels awkward to put it under every
estimator's namespace.

Then again, there might be a #4 where there is no plot module and plotting
classes live under groups of utilities like introspection,
cross-validation or something?...

On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:

> My preference would be for (1). I don't think the sub-namespace in (2) is
> necessary, and don't like (3), as I would prefer the plotting functions to
> be all in the same namespace sklearn.plot.
>
> Andrew
>
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile
> ResearchGate Profile
> Open Researcher and Contributor ID (ORCID)
> Github Profile
> Personal Website
> I live to learn, so I can learn to live. - me
> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
>
>> See https://github.com/scikit-learn/scikit-learn/issues/13448
>>
>> We've introduced several plotting functions (e.g., plot_tree and
>> plot_partial_dependence) and will introduce more (e.g.,
>> plot_decision_boundary) in the future. Consequently, we need to decide
>> where to put these functions. Currently, there're 3 proposals:
>>
>> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>
>> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>>
>> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
>> that we won't support from sklearn.XXX import plot_YYY)
>>
>> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
>> to invite opinions.
>>
>> Thanks
>>
>> Hanmin Qin
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From christian.braune79 at gmail.com Wed Apr 3 06:18:13 2019
From: christian.braune79 at gmail.com (Christian Braune)
Date: Wed, 3 Apr 2019 12:18:13 +0200
Subject: [scikit-learn] Can cluster help me to cluster data with length of continuous series?
In-Reply-To:
References:
Message-ID:

Hi,

that does not really sound like a clustering but more like a preprocessing
problem to me. For each item you want to calculate the length of the
longest subsequence of "1"s. That could be done by a simple function and
would create a new (one-dimensional) property for each of your items. You
could then apply any clustering algorithm to this feature (i.e. you'd be
clustering a one-dimensional dataset)...

Regards,
Christian

On Wed, 3 Apr 2019 at 11:08, lampahome wrote:

> I have data which contains the access duration of each item.
>
> EX: t0~t3 are the access time slots. 1 means the item was accessed in
> that time slot, 0 means it was not.
> ID,t0,t1,t2,t3
> 0,1,0,0,1
> 1,1,0,0,1
> 2,0,0,1,1
> 3,0,1,1,1
>
> What I want to cluster on is the length of the continuous access duration.
> Ex:
> ID=3 > 2 > 1 = 0
>
> Is there any distance metric that can help with clustering based on the
> length of the continuous duration?
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From hollas at informatik.htw-dresden.de Wed Apr 3 06:28:22 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Wed, 3 Apr 2019 12:28:22 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
Message-ID:

I use

sum((cross_val_predict(model, X, y) - y)**2) / len(y)    (*)

to evaluate the performance of a model. This conforms with Murphy: Machine
Learning, section 6.5.3, and Hastie et al: The Elements of Statistical
Learning, eq. 7.48. However, according to the documentation of
cross_val_predict, "it is not appropriate to pass these predictions into
an evaluation metric". While it is obvious that cross_val_predict is
different from cross_val_score, I don't see what should be wrong with (*).

Also, the explanation that "cross_val_predict simply returns the labels
(or probabilities)" is unclear, if not wrong. As I understand it, this
function returns estimates and no labels or probabilities.

Regards, Boris

From martin.watzenboeck at gmail.com Wed Apr 3 07:17:13 2019
From: martin.watzenboeck at gmail.com (Martin Watzenboeck)
Date: Wed, 3 Apr 2019 13:17:13 +0200
Subject: [scikit-learn] LASSO: Predicted values show negative correlation with observed values on random data
In-Reply-To:
References:
Message-ID:

Hi Alex,

Thanks a lot for the answer! That does indeed explain this phenomenon.
Also, I now see that with my data I can get meaningful LASSO predictions
by tuning the alpha parameter.

Cheers,
Martin

On Tue, 2 Apr 2019 at 21:33, Alexandre Gramfort
<alexandre.gramfort at inria.fr> wrote:

> in your example with random data Lasso leads to coef_ of zeros so you get
> as prediction : np.mean(Y[train])
>
> you'll see the same phenomenon if you do:
>
> pred = np.r_[pred, np.mean(Y[train])]
>
> Alex
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From rth.yurchak at pm.me Wed Apr 3 07:35:23 2019
From: rth.yurchak at pm.me (Roman Yurchak)
Date: Wed, 03 Apr 2019 11:35:23 +0000
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To:
References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

+1 for option 1 and +0.5 for 3. Do we anticipate that many plotting
functions will be added? If it's just a dozen or less, putting them all
into a single namespace sklearn.plot might be easier.

This also would avoid discussion about where to put some generic
plotting functions (e.g.
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).

Roman

On 03/04/2019 12:06, Trevor Stephens wrote:
> I think #1 if any of these... Plotting functions should hopefully be as
> general as possible, so tagging with a specific type of estimator will,
> in some scikit-learn utopia, be unnecessary.
>
> If a general plotter is built, where does it live in other
> estimator-specific namespace options? Feels awkward to put it under
> every estimator's namespace.
>
> Then again, there might be a #4 where there is no plot module and
> plotting classes live under groups of utilities like introspection,
> cross-validation or something?...
>
> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
>
>     My preference would be for (1). I don't think the sub-namespace in
>     (2) is necessary, and don't like (3), as I would prefer the plotting
>     functions to be all in the same namespace sklearn.plot.
>
>     Andrew
>
>     <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>     J. Andrew Howe, PhD
>     LinkedIn Profile
>     ResearchGate Profile
>     Open Researcher and Contributor ID (ORCID)
>     Github Profile
>     Personal Website
>     I live to learn, so I can learn to live. - me
>     <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>
>     On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
>
>         See https://github.com/scikit-learn/scikit-learn/issues/13448
>
>         We've introduced several plotting functions (e.g., plot_tree and
>         plot_partial_dependence) and will introduce more (e.g.,
>         plot_decision_boundary) in the future. Consequently, we need to
>         decide where to put these functions. Currently, there're 3
>         proposals:
>
>         (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
>         (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
>         (3) sklearn.XXX.plot.plot_YYY (e.g.,
>         sklearn.tree.plot.plot_tree, note that we won't support from
>         sklearn.XXX import plot_YYY)
>
>         Joel Nothman, Gael Varoquaux and I decided to post it on the
>         mailing list to invite opinions.
>
>         Thanks
>
>         Hanmin Qin
>         _______________________________________________
>         scikit-learn mailing list
>         scikit-learn at python.org
>         https://mail.python.org/mailman/listinfo/scikit-learn
>
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn

From joel.nothman at gmail.com Wed Apr 3 07:59:18 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 3 Apr 2019 22:59:18 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
Message-ID:

The equations in Murphy and Hastie very clearly assume a metric
decomposable over samples (a loss function). Several popular metrics
are not.

For a metric like MSE it will be almost identical assuming the test
sets have almost the same size. For something like Recall
(sensitivity) it will be almost identical assuming similar test set
sizes *and* stratification. For something like precision whose
denominator is determined by the biases of the learnt classifier on
the test dataset, you can't say the same.
For something like ROC AUC score, relying on some decision function that may not be equivalently calibrated across splits, evaluating in this way is almost meaningless. On Wed, 3 Apr 2019 at 22:01, Boris Hollas wrote: > > I use > > sum((cross_val_predict(model, X, y) - y)**2) / len(y) (*) > > to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*). > > Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities. > > Regards, Boris > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Wed Apr 3 08:54:51 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Apr 2019 08:54:51 -0400 Subject: [scikit-learn] Why is cross_val_predict discouraged? In-Reply-To: References: Message-ID: On 4/3/19 7:59 AM, Joel Nothman wrote: > The equations in Murphy and Hastie very clearly assume a metric > decomposable over samples (a loss function). Several popular metrics > are not. > > For a metric like MSE it will be almost identical assuming the test > sets have almost the same size. For something like Recall > (sensitivity) it will be almost identical assuming similar test set > sizes *and* stratification. For something like precision whose > denominator is determined by the biases of the learnt classifier on > the test dataset, you can't say the same. For something like ROC AUC > score, relying on some decision function that may not be equivalently > calibrated across splits, evaluating in this way is almost > meaningless. In theory. Not sure how it holds up in practice. I didn't get the point about precision. But yes, we should add to the docs that in particular for losses that don't decompose this is a weird thing to do. If the loss decomposes, the result might be different b/c of different test set sizes, but I'm not sure if they are "worse" in some way? From t3kcit at gmail.com Wed Apr 3 09:09:19 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Apr 2019 09:09:19 -0400 Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions? In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn> Message-ID: I think what was not clear from the question is that there is actually quite different kinds of plotting functions, and many of these are tied to existing code. Right now we have some that are specific to trees (plot_tree) and to gradient boosting (plot_partial_dependence). I think we want more general functions, and plot_partial_dependence has been extended to general estimators. However, the plotting functions might be generic wrt the estimator, but relate to a specific function, say plotting results of GridSearchCV. Then one might argue that having the plotting function close to GridSearchCV might make sense. Similarly for plotting partial dependence plots and feature importances, it might be a bit strange to have the plotting functions not next to the functions that compute these. 
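For concreteness, here is how the three proposals would read at the call
site (just a sketch of the proposed namespaces; none of these import paths
exists today):

    # proposal (1): one flat plotting namespace
    from sklearn.plot import plot_tree
    # proposal (2): per-module sub-namespaces under sklearn.plot
    from sklearn.plot.tree import plot_tree
    # proposal (3): a plot module inside each package
    # (note: "from sklearn.tree import plot_tree" would *not* be supported)
    from sklearn.tree.plot import plot_tree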
Another question is whether the plotting functions also "do the work" in
some cases:
Do we want plot_partial_dependence also to compute the partial
dependence? (I would argue yes but either way the result is a bit strange).
In that case you have somewhat of the same functionality in two
different modules, unless you also put the "compute partial dependence"
function in the plotting module as well,
which is a bit strange.

Maybe we could inform this discussion by listing candidate plotting
functions, and also considering whether they "do the work" and where the
"work" function is.

Other examples are plotting the confusion matrix, which probably should
also compute the confusion matrix (it's fast and so that would be
convenient), and so it would "duplicate" functionality from the metrics
module.

Plotting learning curves and validation curves should probably not do
the work as it's pretty involved, and so someone would need to import
the learning and validation curves from model selection, and then the
plotting functions from a plotting module.

Calibration curves and P/R curves and ROC curves are also pretty fast
to compute (and passing around the arguments is somewhat error prone) so
I would say the plotting functions for these should do the work as well.

Anyway, you can see that many plotting functions are actually associated
with functions in existing modules and the interactions are a bit unclear.

The only plotting functions I haven't mentioned so far that I thought
about in the past are "2d scatter" and "plot decision function". These
would be kind of generic, but mostly used in the examples.
Though having a discrete 2d scatter function would be pretty nice
(plt.scatter doesn't allow legends and makes it hard to use qualitative
color maps).

I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
case is not really that clear.

Cheers,

Andy

On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> +1 for option 1 and +0.5 for 3. Do we anticipate that many plotting
> functions will be added? If it's just a dozen or less, putting them all
> into a single namespace sklearn.plot might be easier.
>
> This also would avoid discussion about where to put some generic
> plotting functions (e.g.
> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
>
> Roman
>
> On 03/04/2019 12:06, Trevor Stephens wrote:
>> I think #1 if any of these... Plotting functions should hopefully be as
>> general as possible, so tagging with a specific type of estimator will,
>> in some scikit-learn utopia, be unnecessary.
>>
>> If a general plotter is built, where does it live in other
>> estimator-specific namespace options? Feels awkward to put it under
>> every estimator's namespace.
>>
>> Then again, there might be a #4 where there is no plot module and
>> plotting classes live under groups of utilities like introspection,
>> cross-validation or something?...
>>
>> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
>>
>>     My preference would be for (1). I don't think the sub-namespace in
>>     (2) is necessary, and don't like (3), as I would prefer the plotting
>>     functions to be all in the same namespace sklearn.plot.
>>
>>     Andrew
>>
>>     <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
>>     J. Andrew Howe, PhD
>>     LinkedIn Profile
>>     ResearchGate Profile
>>     Open Researcher and Contributor ID (ORCID)
>>     Github Profile
>>     Personal Website
>>     I live to learn, so I can learn to live.
- me >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> >> >> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin > > wrote: >> >> See https://github.com/scikit-learn/scikit-learn/issues/13448 >> >> We've introduced several plotting functions (e.g., plot_tree and >> plot_partial_dependence) and will introduce more (e.g., >> plot_decision_boundary) in the future. Consequently, we need to >> decide where to put these functions. Currently, there're 3 >> proposals: >> >> (1)?sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree) >> >> (2)?sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree) >> >> (3)?sklearn.XXX.plot.plot_YYY (e.g., >> sklearn.tree.plot.plot_tree, note that we won't support from >> sklearn.XXX import plot_YYY) >> >> Joel Nothman,?Gael Varoquaux and I decided to post it on the >> mailing list to invite opinions. >> >> Thanks >> >> Hanmin Qin >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Wed Apr 3 09:28:52 2019 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 3 Apr 2019 15:28:52 +0200 Subject: [scikit-learn] Why is cross_val_predict discouraged? In-Reply-To: References: Message-ID: <20190403132852.2jdszy2rfp3kivk4@phare.normalesup.org> On Wed, Apr 03, 2019 at 08:54:51AM -0400, Andreas Mueller wrote: > If the loss decomposes, the result might be different b/c of different test > set sizes, but I'm not sure if they are "worse" in some way? Mathematically, a cross-validation estimates a double expectation: one expectation on the model (ie the train data), and another on the test data (see for instance eq 3 in https://europepmc.org/articles/pmc5441396, sorry for the self citation, this is seldom discussed in the literature). The correct way to compute this double expectation is by averaging first inside the fold and second across the folds. Other ways of computing errors estimate other quantities, that are harder to study mathematically and not comparable to objects studied in the literature. Another problem with cross_val_predict is that some people use metrics like correlation (which is a terrible metric and does not decompose across folds). It will then pick up things like correlations across folds. All these problems are made worse when data are not iid, and hence folds risk not being iid. G From joel.nothman at gmail.com Wed Apr 3 10:06:13 2019 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 4 Apr 2019 01:06:13 +1100 Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions? In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn> Message-ID: With option 1, sklearn.plot is likely to import large chunks of the library (particularly, but not exclusively, if the plotting function "does the work" as Andy suggests). This is under the assumption that one plot function will want to import trees, another GPs, etc. Unless we move to lazy imports, that would be against the current convention that importing sklearn is fairly minimal. I do like Andy's idea of framing this discussion more clearly around likely candidates. 
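For reference, Python 3.7's module-level __getattr__ (PEP 562) would make
such lazy imports possible if we ever wanted them. A rough sketch, with
hypothetical target modules:

    # sklearn/plot/__init__.py -- hypothetical lazy-import shim (PEP 562)
    import importlib

    _LAZY = {
        'plot_tree': 'sklearn.tree',
        'plot_partial_dependence': 'sklearn.ensemble.partial_dependence',
    }

    def __getattr__(name):
        # import the heavy submodule only when its plot function is requested
        if name in _LAZY:
            return getattr(importlib.import_module(_LAZY[name]), name)
        raise AttributeError("module 'sklearn.plot' has no attribute %r" % name)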
On Thu, 4 Apr 2019 at 00:10, Andreas Mueller wrote: > > I think what was not clear from the question is that there is actually > quite different kinds of plotting functions, and many of these are tied > to existing code. > > Right now we have some that are specific to trees (plot_tree) and to > gradient boosting (plot_partial_dependence). > > I think we want more general functions, and plot_partial_dependence has > been extended to general estimators. > > However, the plotting functions might be generic wrt the estimator, but > relate to a specific function, say plotting results of GridSearchCV. > Then one might argue that having the plotting function close to > GridSearchCV might make sense. > Similarly for plotting partial dependence plots and feature importances, > it might be a bit strange to have the plotting functions not next to the > functions that compute these. > Another question would be is whether the plotting functions also "do the > work" in some cases: > Do we want plot_partial_dependence also to compute the partial > dependence? (I would argue yes but either way the result is a bit strange). > In that case you have somewhat of the same functionality in two > different modules, unless you also put the "compute partial dependence" > function in the plotting module as well, > which is a bit strange. > > Maybe we could inform this discussion by listing candidate plotting > functions, and also considering whether they "do the work" and where the > "work" function is. > > Other examples are plotting the confusion matrix, which probably should > also compute the confusion matrix (it's fast and so that would be > convenient), and so it would "duplicate" functionality from the metrics > module. > > Plotting learning curves and validation curves should probably not do > the work as it's pretty involved, and so someone would need to import > the learning and validation curves from model selection, and then the > plotting functions from a plotting module. > > Calibrations curves and P/R curves and roc curves are also pretty fast > to compute (and passing around the arguments is somewhat error prone) so > I would say the plotting functions for these should do the work as well. > > Anyway, you can see that many plotting functions are actually associated > with functions in existing modules and the interactions are a bit unclear. > > The only plotting functions I haven't mentioned so far that I thought > about in the past are "2d scatter" and "plot decision function". These > would be kind of generic, but mostly used in the examples. > Though having a discrete 2d scatter function would be pretty nice > (plt.scatter doesn't allow legends and makes it hard to use qualitative > color maps). > > > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the > case is not really that clear. > > Cheers, > > Andy > > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote: > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting > > functions will be added? If it's just a dozen or less, putting them all > > into a single namespace sklearn.plot might be easier. > > > > This also would avoid discussion about where to put some generic > > plotting functions (e.g. > > https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479). > > > > Roman > > > > On 03/04/2019 12:06, Trevor Stephens wrote: > >> I think #1 if any of these... 
> >> Plotting functions should hopefully be as general as possible, so
> >> tagging with a specific type of estimator will, in some scikit-learn
> >> utopia, be unnecessary.
> >>
> >> If a general plotter is built, where does it live in other
> >> estimator-specific namespace options? Feels awkward to put it under
> >> every estimator's namespace.
> >>
> >> Then again, there might be a #4 where there is no plot module and
> >> plotting classes live under groups of utilities like introspection,
> >> cross-validation or something?...
> >>
> >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> >>
> >> My preference would be for (1). I don't think the sub-namespace in
> >> (2) is necessary, and don't like (3), as I would prefer the plotting
> >> functions to be all in the same namespace sklearn.plot.
> >>
> >> Andrew
> >>
> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >> J. Andrew Howe, PhD
> >> LinkedIn Profile
> >> ResearchGate Profile
> >> Open Researcher and Contributor ID (ORCID)
> >> Github Profile
> >> Personal Website
> >> I live to learn, so I can learn to live. - me
> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >>
> >> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
> >>
> >> See https://github.com/scikit-learn/scikit-learn/issues/13448
> >>
> >> We've introduced several plotting functions (e.g., plot_tree and
> >> plot_partial_dependence) and will introduce more (e.g.,
> >> plot_decision_boundary) in the future. Consequently, we need to
> >> decide where to put these functions. Currently, there're 3
> >> proposals:
> >>
> >> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> >>
> >> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> >>
> >> (3) sklearn.XXX.plot.plot_YYY (e.g.,
> >> sklearn.tree.plot.plot_tree, note that we won't support from
> >> sklearn.XXX import plot_YYY)
> >>
> >> Joel Nothman, Gael Varoquaux and I decided to post it on the
> >> mailing list to invite opinions.
> >>
> >> Thanks
> >>
> >> Hanmin Qin
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> >>
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From hollas at informatik.htw-dresden.de Wed Apr 3 12:50:24 2019
From: hollas at informatik.htw-dresden.de (Boris Hollas)
Date: Wed, 3 Apr 2019 18:50:24 +0200
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To:
References:
Message-ID: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>

On 03.04.19 13:59, Joel Nothman wrote:
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size.

What will be almost identical to what? I suppose you mean that (*) is
consistent with the scores of the models in the fold (i.e., the result of
cross_val_score) if the loss function is (x-y)².
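For concreteness, here is a quick check of this on synthetic data (a
sketch; the two numbers agree exactly only because all five folds have
equal size):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_predict, cross_val_score

    X, y = make_regression(n_samples=100, n_features=5, noise=10.0,
                           random_state=0)
    model = Ridge()

    # (*): pooled squared error over all cross-validated predictions
    pooled_mse = np.mean((cross_val_predict(model, X, y, cv=5) - y) ** 2)

    # mean of the per-fold MSEs, as cross_val_score computes them
    fold_mse = -cross_val_score(model, X, y, cv=5,
                                scoring='neg_mean_squared_error').mean()

    print(pooled_mse, fold_mse)  # identical up to floating point

So for a loss that decomposes over samples and equal-sized folds, the two
ways of averaging coincide.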
> For something like Recall > (sensitivity) it will be almost identical assuming similar test set > sizes*and* stratification. For something like precision whose > denominator is determined by the biases of the learnt classifier on > the test dataset, you can't say the same. I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then (*) gives the accuracy. > For something like ROC AUC > score, relying on some decision function that may not be equivalently > calibrated across splits, evaluating in this way is almost > meaningless. In any case, I still don't see what may be wrong with (*). Otherwise, the warning in the documentation about the use of cross_val_predict should be removed or revised. On the other hand, an example in the documentation uses cross_val_scores.mean(). This is debatable since this computes a mean of means. > > On Wed, 3 Apr 2019 at 22:01, Boris Hollas > wrote: >> I use >> >> sum((cross_val_predict(model, X, y) - y)**2) / len(y) (*) >> >> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*). >> >> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities. >> >> Regards, Boris -------------- next part -------------- An HTML attachment was scrubbed... URL: From rr.rosas at gmail.com Wed Apr 3 14:38:37 2019 From: rr.rosas at gmail.com (Rodrigo Rosenfeld Rosas) Date: Wed, 3 Apr 2019 15:38:37 -0300 Subject: [scikit-learn] How to answer questions from big documents? Message-ID: Hi everyone, this is my first post here :) About two weeks ago, due to the low demand in my project, I have been assigned a completely unusual request: to automatically extract answers from documents based on machine learning. I've never read anything about ML, AI or NLP before, so I've been basically doing just that for the past two weeks. When it comes to ML, most book recommendations and tutorials I've found so far use the Python language and tools, so I took the first week to learn about Python, NumPy, Scikit, Panda, Matplotlib and so on. Then, this week I started reading about NLP itself, after spending a few days reading about generic ML algorithms. So far, I've basically read about Bag of Words, using TF-IDF (or simply terms count) to convert the words to numeric representations and a few methods such as the gaussian and multinomial naive bayes methods to train and predict values. The methods also mention the importance of using the usual pre-processing methods such as lemmatization and alikes. However, basically all examples assume that a given text can be classified in one of the categorized topics, like the sentiment analysis use case. I'm afraid this doesn't represent my use case, so I'd like to describe it here so that you could help me identifying which methods I should be looking for. We have a system with thousands of transactions/deals inputted manually by an specialized team. Each deal has a set of documents (a dozen per deal typically) and some documents could have hundreds of pages. 
The inputting team has to extract about a thousand fields from those
documents for any particular deal. So, in our database we have all their
data and we typically also know the document-specific snippets associated
with each field value.

So, my task is to, given a new document and deal, and based on the
previous answers, fill in as many fields as I can by automatically finding
the corresponding snippets in the new documents.

I'm not sure how I should approach this problem. For example, I could
consider each sentence of the document as a separate document to be
analyzed and compared to the snippets I already have for the matching
data. However, I can't be sure whether some of those sentences would
actually answer the question. For example, maybe there are 6 occurrences
in the documents that would answer a particular question/field, but maybe
the inputters only identified 2 or 3 of them. Also, for any given
sentence, it could tell us that the answer for a given field is A or B, or
it could be that there's absolutely no association between the sentence
and the field/question, as would be the case for most sentences.

I know that Scikit provides the predict_proba method, so I could try to
only consider a sentence as relevant if the probability of it answering
the question is above 80%, for example, but based on a few quick tests
I've made with a few sentences and words, I suspect this won't work very
well. Also, it could be quite slow to treat each sentence of a document
with hundreds of pages as a separate document to be analyzed, so I'm not
sure if there are better methods to handle this use case.

Some of the fields are free-text ones, like company and firm names, for
example, and I suspect those would be the hardest to answer, so I'm trying
to start with the multiple-choice ones, which have a finite set of
classifications.

How would you advise me to look at this problem? Are there any algorithms
you'd recommend me to study for solving this particular problem?

Here are some sample data so that you could get a better understanding of
the problem: One of the fields is called "Deal Structure" and it could
have the following values: "Asset Purchase", "Stock or Equity Purchase" or
"Public Target Merger" (there are a few others, but this gives you an
idea). So, here are some sentences highlighted for Public Target Merger
deals (those documents come from the Edgar Filings public database, which
is freely available for US deals):

deal 1 / doc 1:

"AGREEMENT AND PLAN OF MERGER, dated as of March 14, 2018 (this
“Agreement”), by and among HarborOne Bancorp, Inc., a Massachusetts
corporation (“Buyer”), Massachusetts Acquisitions, LLC, a Maryland limited
liability company of which Buyer is the sole member (“Merger LLC”), and
Coastway Bancorp, Inc., a Maryland corporation (the “Company”)."

"WHEREAS, Buyer, Merger LLC, and the Company intend to effect a merger
(the “Merger”) of Merger LLC with and into the Company in accordance with
this Agreement and the Maryland General Corporation Law (the “MGCL”) and
the Maryland Limited Liability Company Act, as amended (the “MLLCA”), with
the Company to be the surviving entity in the Merger. The Merger will be
followed immediately by a merger of the Company with and into Buyer (the
“Upstream Merger”), with the Buyer to be the surviving entity in the
Upstream Merger. It is intended that the Merger be mutually interdependent
with and a condition precedent to the Upstream Merger and that the
Upstream Merger shall, through the binding commitment evidenced by this
Agreement, be effected immediately following the Effective Time (as
defined below) without further approval, authorization or direction from
or by any of the parties hereto; and"

deal 2 / doc 1:

"WHEREAS, it is also proposed that, as soon as practicable following the
consummation of the Offer, the Parties wish to effect the acquisition of
the Company by Parent through the merger of Purchaser with and into the
Company, with the Company being the surviving entity (the “Merger”);"

Now, for Asset Purchase deals:

deal 3 / doc 1:

"Subject to the terms and conditions of this Agreement, Sellers are
willing to sell to Buyer, and Buyer is willing to purchase from Sellers,
all of their assets relating to the Businesses as set forth herein."

deal 4 / doc 1:

"WHEREAS, Seller wishes to sell and assign to Buyer, and Buyer wishes to
purchase and assume from Seller, the rights and obligations of Seller to
the Purchased Assets (as defined herein), subject to the terms and
conditions set forth herein."

Please forgive me for any imprecise/incorrect terms or understanding on
this topic, as this is all very new to me. Any help is very appreciated.
I've also asked this question on StackOverflow, so if you'd prefer to
answer there instead, here is the link:
https://stackoverflow.com/questions/55499866/how-to-answer-questions-from-big-documents

Would this field be called data mining? Feature extraction? Question
answering? I'm not sure how to properly search for this subject, so any
hints are very welcome :)

Thanks in advance,

Rodrigo.

From joel.nothman at gmail.com Wed Apr 3 17:46:57 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 08:46:57 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
References: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Message-ID:

Pull requests improving the documentation are always welcome. At a
minimum, users need to know that these compute different things.

Accuracy is not precision. Precision is the number of true positives
divided by the number of true positives plus false positives. It therefore
cannot be decomposed as a sample-wise measure without knowing the rate of
positive predictions. This rate is dependent on the training data and
algorithm.

I'm not a statistician and cannot speak to issues of computing a mean of
means, but if what we are trying to estimate is the performance on a
sample of size approximately n_t of a model trained on a sample of size
approximately N - n_t, then I wouldn't have thought taking a mean over
such measures (with whatever score function) to be unreasonable.

On Thu., 4 Apr. 2019, 3:51 am Boris Hollas,
<hollas at informatik.htw-dresden.de> wrote:

> On 03.04.19 13:59, Joel Nothman wrote:
>
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size.
>
> What will be almost identical to what? I suppose you mean that (*) is
> consistent with the scores of the models in the fold (i.e., the result of
> cross_val_score) if the loss function is (x-y)².
> > For something like Recall > (sensitivity) it will be almost identical assuming similar test set > sizes **and** stratification. For something like precision whose > denominator is determined by the biases of the learnt classifier on > the test dataset, you can't say the same. > > I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then (*) > gives the accuracy. > > For something like ROC AUC > score, relying on some decision function that may not be equivalently > calibrated across splits, evaluating in this way is almost > meaningless. > > In any case, I still don't see what may be wrong with (*). Otherwise, the > warning in the documentation about the use of cross_val_predict should be > removed or revised. > > On the other hand, an example in the documentation uses > cross_val_scores.mean(). This is debatable since this computes a mean of > means. > > > > On Wed, 3 Apr 2019 at 22:01, Boris Hollas wrote: > > I use > > sum((cross_val_predict(model, X, y) - y)**2) / len(y) (*) > > to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*). > > Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities. > > Regards, Boris > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ericmajinglong at gmail.com Wed Apr 3 18:59:02 2019 From: ericmajinglong at gmail.com (Eric Ma) Date: Thu, 4 Apr 2019 00:59:02 +0200 Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions? In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn> Message-ID: This is not a strongly-held suggestion - but what about adopting YellowBrick as the plotting API for sklearn? Not sure how exactly the interaction would work - could be PRs to their library, or ask them to integrate into sklearn, or do a lock-step dance with versions but maintain separate teams? (I know it raises more questions than answers, but wanted to put it out there.) On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman wrote: > With option 1, sklearn.plot is likely to import large chunks of the > library (particularly, but not exclusively, if the plotting function > "does the work" as Andy suggests). This is under the assumption that > one plot function will want to import trees, another GPs, etc. Unless > we move to lazy imports, that would be against the current convention > that importing sklearn is fairly minimal. > > I do like Andy's idea of framing this discussion more clearly around > likely candidates. > > On Thu, 4 Apr 2019 at 00:10, Andreas Mueller wrote: > > > > I think what was not clear from the question is that there is actually > > quite different kinds of plotting functions, and many of these are tied > > to existing code. > > > > Right now we have some that are specific to trees (plot_tree) and to > > gradient boosting (plot_partial_dependence). 
> > > > I think we want more general functions, and plot_partial_dependence has > > been extended to general estimators. > > > > However, the plotting functions might be generic wrt the estimator, but > > relate to a specific function, say plotting results of GridSearchCV. > > Then one might argue that having the plotting function close to > > GridSearchCV might make sense. > > Similarly for plotting partial dependence plots and feature importances, > > it might be a bit strange to have the plotting functions not next to the > > functions that compute these. > > Another question would be is whether the plotting functions also "do the > > work" in some cases: > > Do we want plot_partial_dependence also to compute the partial > > dependence? (I would argue yes but either way the result is a bit > strange). > > In that case you have somewhat of the same functionality in two > > different modules, unless you also put the "compute partial dependence" > > function in the plotting module as well, > > which is a bit strange. > > > > Maybe we could inform this discussion by listing candidate plotting > > functions, and also considering whether they "do the work" and where the > > "work" function is. > > > > Other examples are plotting the confusion matrix, which probably should > > also compute the confusion matrix (it's fast and so that would be > > convenient), and so it would "duplicate" functionality from the metrics > > module. > > > > Plotting learning curves and validation curves should probably not do > > the work as it's pretty involved, and so someone would need to import > > the learning and validation curves from model selection, and then the > > plotting functions from a plotting module. > > > > Calibrations curves and P/R curves and roc curves are also pretty fast > > to compute (and passing around the arguments is somewhat error prone) so > > I would say the plotting functions for these should do the work as well. > > > > Anyway, you can see that many plotting functions are actually associated > > with functions in existing modules and the interactions are a bit > unclear. > > > > The only plotting functions I haven't mentioned so far that I thought > > about in the past are "2d scatter" and "plot decision function". These > > would be kind of generic, but mostly used in the examples. > > Though having a discrete 2d scatter function would be pretty nice > > (plt.scatter doesn't allow legends and makes it hard to use qualitative > > color maps). > > > > > > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the > > case is not really that clear. > > > > Cheers, > > > > Andy > > > > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote: > > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting > > > functions will be added? If it's just a dozen or less, putting them all > > > into a single namespace sklearn.plot might be easier. > > > > > > This also would avoid discussion about where to put some generic > > > plotting functions (e.g. > > > > https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479 > ). > > > > > > Roman > > > > > > On 03/04/2019 12:06, Trevor Stephens wrote: > > >> I think #1 if any of these... Plotting functions should hopefully be > as > > >> general as possible, so tagging with a specific type of estimator > will, > > >> in some scikit-learn utopia, be unnecessary. > > >> > > >> If a general plotter is built, where does it live in other > > >> estimator-specific namespace options? 
Feels awkward to put it under > > >> every estimator's namespace. > > >> > > >> Then again, there might be a #4 where there is no plot module and > > >> plotting classes live under groups of utilities like introspection, > > >> cross-validation or something?... > > >> > > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe > >> > wrote: > > >> > > >> My preference would be for (1). I don't think the sub-namespace > in > > >> (2) is necessary, and don't like (3), as I would prefer the > plotting > > >> functions to be all in the same namespace sklearn.plot. > > >> > > >> Andrew > > >> > > >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > >> J. Andrew Howe, PhD > > >> LinkedIn Profile > > >> ResearchGate Profile < > http://www.researchgate.net/profile/John_Howe12/> > > >> Open Researcher and Contributor ID (ORCID) > > >> > > >> Github Profile > > >> Personal Website > > >> I live to learn, so I can learn to live. - me > > >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > >> > > >> > > >> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin < > qinhanmin2005 at sina.com > > >> > wrote: > > >> > > >> See > https://github.com/scikit-learn/scikit-learn/issues/13448 > > >> > > >> We've introduced several plotting functions (e.g., plot_tree > and > > >> plot_partial_dependence) and will introduce more (e.g., > > >> plot_decision_boundary) in the future. Consequently, we need > to > > >> decide where to put these functions. Currently, there're 3 > > >> proposals: > > >> > > >> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree) > > >> > > >> (2) sklearn.plot.XXX.plot_YYY (e.g., > sklearn.plot.tree.plot_tree) > > >> > > >> (3) sklearn.XXX.plot.plot_YYY (e.g., > > >> sklearn.tree.plot.plot_tree, note that we won't support from > > >> sklearn.XXX import plot_YYY) > > >> > > >> Joel Nothman, Gael Varoquaux and I decided to post it on the > > >> mailing list to invite opinions. > > >> > > >> Thanks > > >> > > >> Hanmin Qin > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Apr 3 19:50:51 2019 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 4 Apr 2019 10:50:51 +1100 Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug Message-ID: The core developers of Scikit-learn have recently voted to welcome Thomas Fan and Nicolas Hug to the team, in recognition of their efforts and trustworthiness as contributors. Both happen to be working with Andy Mueller at Columbia University at the moment. Congratulations and thanks to them both! 
From qinhanmin2005 at sina.com Wed Apr 3 21:05:55 2019 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Thu, 04 Apr 2019 09:05:55 +0800 Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug Message-ID: <20190404010555.772254140094@webmail.sinamail.sina.com.cn> Congratulations and welcome to the team! Hanmin Qin ----- Original Message ----- From: Joel Nothman To: Scikit-learn user and developer mailing list Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug Date: 2019-04-04 07:52 The core developers of Scikit-learn have recently voted to welcome Thomas Fan and Nicolas Hug to the team, in recognition of their efforts and trustworthiness as contributors. Both happen to be working with Andy Mueller at Columbia University at the moment. Congratulations and thanks to them both! _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Apr 3 23:11:36 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Apr 2019 23:11:36 -0400 Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug In-Reply-To: <20190404010555.772254140094@webmail.sinamail.sina.com.cn> References: <20190404010555.772254140094@webmail.sinamail.sina.com.cn> Message-ID: Congratulations guys! Great work! Looking forward to much more! Proud to have you on the team! Now we in NYC can approve our own pull requests ;) Sent from phone. Please excuse spelling and brevity. On Wed, Apr 3, 2019, 21:08 Hanmin Qin wrote: > Congratulations and welcome to the team! > > Hanmin Qin > > ----- Original Message ----- > From: Joel Nothman > To: Scikit-learn user and developer mailing list > Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug > Date: 2019-04-04 07:52 > > > The core developers of Scikit-learn have recently voted to welcome > Thomas Fan and Nicolas Hug to the team, in recognition of their > efforts and trustworthiness as contributors. Both happen to be working > with Andy Mueller at Columbia University at the moment. > Congratulations and thanks to them both! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hollas at informatik.htw-dresden.de Thu Apr 4 03:39:14 2019 From: hollas at informatik.htw-dresden.de (Boris Hollas) Date: Thu, 4 Apr 2019 09:39:14 +0200 Subject: [scikit-learn] Why is cross_val_predict discouraged? In-Reply-To: References: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de> Message-ID: Am 03.04.19 um 23:46 schrieb Joel Nothman: > Pull requests improving the documentation are always welcome. At a > minimum, users need to know that these compute different things. > > Accuracy is not precision. Precision is the number of true positives > divided by the number of true positives plus false positives. It > therefore cannot be decomposed as a sample-wise measure without > knowing the rate of positive predictions. This rate is dependent on > the training data and algorithm. In my last post, I referred to your remark that "for precision ... you can't say the same". 
Since precision can't be computed with formula (*), even with a different loss function, I pointed out that (*) can be used to compute the accuracy if the loss function is an indicator function. It is still not clear to me what your point is with your remark that "for precision ... you can't say the same". I assume you mean that it is not wise to compute TP, FP, FN and then precision and recall using cross_val_predict. If this is what you mean, I'd like you to explain why.

> I'm not a statistician and cannot speak to issues of computing a mean
> of means, but if what we are trying to estimate is the performance on
> a sample of size approximately n_t of a model trained on a sample of
> size approximately N - n_t, then I wouldn't have thought taking a mean
> over such measures (with whatever score function) to be unreasonable.

In general, a mean of means is not the mean of the original data. The pooled mean is the correct metric in this case. However, the pooled mean equals the mean of means if all folds are exactly the same size.

> On Thu., 4 Apr. 2019, 3:51 am Boris Hollas wrote:
>
> On 03.04.19 at 13:59, Joel Nothman wrote:
>> The equations in Murphy and Hastie very clearly assume a metric
>> decomposable over samples (a loss function). Several popular metrics
>> are not.
>>
>> For a metric like MSE it will be almost identical assuming the test
>> sets have almost the same size.
> What will be almost identical to what? I suppose you mean that (*)
> is consistent with the scores of the models in the folds (i.e., the
> result of cross_val_score) if the loss function is (x-y)².
>> For something like Recall
>> (sensitivity) it will be almost identical assuming similar test set
>> sizes **and** stratification. For something like precision whose
>> denominator is determined by the biases of the learnt classifier on
>> the test dataset, you can't say the same.
> I can't follow here. If the loss function is L(x,y) = 1_{x = y},
> then (*) gives the accuracy.
>> For something like ROC AUC
>> score, relying on some decision function that may not be equivalently
>> calibrated across splits, evaluating in this way is almost
>> meaningless.
>
> In any case, I still don't see what may be wrong with (*).
> Otherwise, the warning in the documentation about the use of
> cross_val_predict should be removed or revised.
>
> On the other hand, an example in the documentation uses
> cross_val_scores.mean(). This is debatable since this computes a
> mean of means.
>
>> On Wed, 3 Apr 2019 at 22:01, Boris Hollas wrote:
>>> I use
>>>
>>> sum((cross_val_predict(model, X, y) - y)**2) / len(y)    (*)
>>>
>>> to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of cross_val_predict, "it is not appropriate to pass these predictions into an evaluation metric". While it is obvious that cross_val_predict is different from cross_val_score, I don't see what should be wrong with (*).
>>>
>>> Also, the explanation that "cross_val_predict simply returns the labels (or probabilities)" is unclear, if not wrong. As I understand it, this function returns estimates and no labels or probabilities.
>>>
>>> Regards, Boris
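To make the fold-size point concrete, here is a minimal sketch (synthetic data, so all folds are exactly equal in size) in which the pooled estimate (*) and the mean of the per-fold scores agree:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()
cv = KFold(n_splits=5)  # 100 samples -> five folds of exactly 20

# (*): pool all held-out predictions, then average the loss
pooled_mse = np.mean((cross_val_predict(model, X, y, cv=cv) - y) ** 2)

# mean of the per-fold MSEs (a mean of means)
mean_fold_mse = -cross_val_score(model, X, y, cv=cv,
                                 scoring='neg_mean_squared_error').mean()

print(pooled_mse, mean_fold_mse)  # equal, because every fold has the same size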
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Thu Apr  4 04:03:16 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Apr 2019 19:03:16 +1100
Subject: [scikit-learn] Why is cross_val_predict discouraged?
In-Reply-To: References: <1d887c05-bfdd-2559-c7a7-6e63a156eacc@informatik.htw-dresden.de>
Message-ID:

> I assume you mean that it is not wise to compute TP, FP, FN and then precision and recall using cross_val_predict. If this is what you mean, I'd like you to explain why.

Because if there is high variance as a function of training set rather than test sample I'd like to know.

> The pooled mean is the correct metric in this case.

I don't think we are in agreement on that.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From alexandre.gramfort at inria.fr  Thu Apr  4 05:40:48 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Thu, 4 Apr 2019 11:40:48 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

I also think that YellowBrick folks did a great job and that we should not reinvent the wheel, or at least have a clear idea of how we differ in scope with respect to YellowBrick

my 2c

Alex

On Thu, Apr 4, 2019 at 1:02 AM Eric Ma wrote:

> This is not a strongly-held suggestion - but what about adopting
> YellowBrick as the plotting API for sklearn? Not sure how exactly the
> interaction would work - could be PRs to their library, or ask them to
> integrate into sklearn, or do a lock-step dance with versions but maintain
> separate teams? (I know it raises more questions than answers, but wanted
> to put it out there.)
>
> On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman wrote:
>
>> With option 1, sklearn.plot is likely to import large chunks of the
>> library (particularly, but not exclusively, if the plotting function
>> "does the work" as Andy suggests). This is under the assumption that
>> one plot function will want to import trees, another GPs, etc. Unless
>> we move to lazy imports, that would be against the current convention
>> that importing sklearn is fairly minimal.
>>
>> I do like Andy's idea of framing this discussion more clearly around
>> likely candidates.
>>
>> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller wrote:
>> >
>> > I think what was not clear from the question is that there are actually
>> > quite different kinds of plotting functions, and many of these are tied
>> > to existing code.
>> >
>> > Right now we have some that are specific to trees (plot_tree) and to
>> > gradient boosting (plot_partial_dependence).
>> >
>> > I think we want more general functions, and plot_partial_dependence has
>> > been extended to general estimators.
>> >
>> > However, the plotting functions might be generic wrt the estimator, but
>> > relate to a specific function, say plotting results of GridSearchCV.
>> > Then one might argue that having the plotting function close to
>> > GridSearchCV might make sense.
>> > Similarly for plotting partial dependence plots and feature importances,
>> > it might be a bit strange to have the plotting functions not next to the
>> > functions that compute these.
>> > Another question is whether the plotting functions also "do the
>> > work" in some cases:
>> > Do we want plot_partial_dependence also to compute the partial
>> > dependence? (I would argue yes but either way the result is a bit strange).
>> > In that case you have somewhat of the same functionality in two
>> > different modules, unless you also put the "compute partial dependence"
>> > function in the plotting module as well,
>> > which is a bit strange.
>> >
>> > Maybe we could inform this discussion by listing candidate plotting
>> > functions, and also considering whether they "do the work" and where the
>> > "work" function is.
>> >
>> > Other examples are plotting the confusion matrix, which probably should
>> > also compute the confusion matrix (it's fast and so that would be
>> > convenient), and so it would "duplicate" functionality from the metrics
>> > module.
>> >
>> > Plotting learning curves and validation curves should probably not do
>> > the work as it's pretty involved, and so someone would need to import
>> > the learning and validation curves from model selection, and then the
>> > plotting functions from a plotting module.
>> >
>> > Calibration curves and P/R curves and roc curves are also pretty fast
>> > to compute (and passing around the arguments is somewhat error prone) so
>> > I would say the plotting functions for these should do the work as well.
>> >
>> > Anyway, you can see that many plotting functions are actually associated
>> > with functions in existing modules and the interactions are a bit unclear.
>> >
>> > The only plotting functions I haven't mentioned so far that I thought
>> > about in the past are "2d scatter" and "plot decision function". These
>> > would be kind of generic, but mostly used in the examples.
>> > Though having a discrete 2d scatter function would be pretty nice
>> > (plt.scatter doesn't allow legends and makes it hard to use qualitative
>> > color maps).
>> >
>> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
>> > case is not really that clear.
>> >
>> > Cheers,
>> >
>> > Andy
>> >
>> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
>> > > +1 for option 1 and +0.5 for 3. Do we anticipate that many plotting
>> > > functions will be added? If it's just a dozen or less, putting them all
>> > > into a single namespace sklearn.plot might be easier.
>> > >
>> > > This also would avoid discussion about where to put some generic
>> > > plotting functions (e.g.
>> > > https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
>> > >
>> > > Roman
>> > >
>> > > On 03/04/2019 12:06, Trevor Stephens wrote:
>> > >> I think #1 if any of these... Plotting functions should hopefully be as
>> > >> general as possible, so tagging with a specific type of estimator will,
>> > >> in some scikit-learn utopia, be unnecessary.
>> > >>
>> > >> If a general plotter is built, where does it live in other
>> > >> estimator-specific namespace options? Feels awkward to put it under
>> > >> every estimator's namespace.
>> > >>
>> > >> Then again, there might be a #4 where there is no plot module and
>> > >> plotting classes live under groups of utilities like introspection,
>> > >> cross-validation or something?...
>> > >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
>> > >> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From t3kcit at gmail.com  Thu Apr  4 10:24:40 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 4 Apr 2019 10:24:40 -0400
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

I would argue that sklearn users would benefit in having solutions in scikit-learn. The yellowbrick api is quite different from the approaches we discussed. If we can reuse their implementations I think we should do so and credit where we can.
Having plotting in sklearn is also likely to attract more contributors and we have more eyes for doing reviews.

Sent from phone.
Please excuse spelling and brevity.

On Thu, Apr 4, 2019, 05:43 Alexandre Gramfort wrote:
> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Thu Apr  4 17:12:09 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 5 Apr 2019 08:12:09 +1100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To: References: Message-ID:

Well it would certainly be a low-cost effort improvement if we demonstrated yellowbrick in our examples.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From heitor.boschirolli at gmail.com  Sat Apr  6 13:07:38 2019
From: heitor.boschirolli at gmail.com (Heitor Boschirolli)
Date: Sat, 6 Apr 2019 14:07:38 -0300
Subject: [scikit-learn] Starting to contribute
Message-ID:

Hello!

First of all, I apologize if this email is not the place for such questions, but I have never contributed to open source code before and I'm not sure how to proceed; could someone help me with that?

Should I just pick an issue, solve it following the guidelines described on the website and open a PR?
If I have any trouble, can I post it on the mailing list?

Att,
Heitor Boschirolli
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ahowe42 at gmail.com  Sun Apr  7 05:08:24 2019
From: ahowe42 at gmail.com (Andrew Howe)
Date: Sun, 7 Apr 2019 10:08:24 +0100
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID:

I'm with Andreas on this. As a user, I would prefer to see this as part of sklearn with the usual sklearn api.
If we want static matplotlib-style images, reusing (with credit) some of the yellowbrick implementations is a good idea. Would we consider plotly-based visualizations? I've been doing my own plotting in plotly for the last month, and can't imagine going back to static matplotlib plots...

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
LinkedIn Profile
ResearchGate Profile
Open Researcher and Contributor ID (ORCID)
Github Profile
Personal Website
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

On Thu, Apr 4, 2019 at 3:26 PM Andreas Mueller wrote:

> I would argue that sklearn users would benefit in having solutions in
> scikit-learn. The yellowbrick api is quite different from the approaches we
> discussed. If we can reuse their implementations I think we should do so
> and credit where we can.
> Having plotting in sklearn is also likely to attract more contributors and
> we have more eyes for doing reviews.
>
> Sent from phone. Please excuse spelling and brevity.
> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rth.yurchak at pm.me  Sun Apr  7 05:23:56 2019
From: rth.yurchak at pm.me (Roman Yurchak)
Date: Sun, 07 Apr 2019 09:23:56 +0000
Subject: [scikit-learn] Starting to contribute
In-Reply-To: References: Message-ID: <6cT00pTFFXyDphB5zuehmAGa1J-m9yvO8UcgcbA9wFY8KzGojBgoA5pfPQZakaVTF88utkZlF9v-qCyfIHledAKfzFtXXvqvTBNkT975it8=@pm.me>

Hello Heitor,

yes, you can choose an issue, comment there that you plan to work on it (to avoid redundant work by other contributors) and, if no one objects, make a PR. If you have any questions you can ask them by commenting on that issue (for specific questions) or on the scikit-learn Gitter https://gitter.im/scikit-learn/scikit-learn (for general questions about how to contribute).

See https://scikit-learn.org/stable/developers/contributing.html for more information.

Roman

On 06/04/2019 19:07, Heitor Boschirolli wrote:
> Hello!
>
> First of all, I apologize if this email is not the place for such questions, but
> I have never contributed to open source code before and I'm not sure how to
> proceed; could someone help me with that?
>
> Should I just pick an issue, solve it following the guidelines described
> on the website and open a PR?
> If I have any trouble, can I post it on the mailing list?
>
> Att, Heitor Boschirolli

From emmanuelle.gouillart at nsup.org  Sun Apr  7 11:41:48 2019
From: emmanuelle.gouillart at nsup.org (Emmanuelle Gouillart)
Date: Sun, 7 Apr 2019 17:41:48 +0200
Subject: [scikit-learn] API Discussion: Where shall we put the plotting functions?
In-Reply-To: References: <20190402143603.E0B2A5D000A0@webmail.sinamail.sina.com.cn>
Message-ID: <20190407154148.utfbtkrakftz3rbr@phare.normalesup.org>

Hi,

I suppose you won't want to rewrite all the examples if you choose plotly-based viz, so this help page about converting matplotlib figures or code to plotly might help:
https://plot.ly/matplotlib/getting-started/
I hope it works, the doc page looks a bit old.
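For a single figure, the conversion itself is roughly a one-liner. An untested sketch (assuming mpl_to_plotly is still available in current plotly versions):

import matplotlib.pyplot as plt
import plotly.tools

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 1, 3])  # any matplotlib figure
plotly_fig = plotly.tools.mpl_to_plotly(fig)  # convert it to a plotly figure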
Cheers
Emma

On Sun, Apr 07, 2019 at 10:08:24AM +0100, Andrew Howe wrote:
> I'm with Andreas on this. As a user, I would prefer to see this as part of
> sklearn with the usual sklearn api.
> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From solegalli1 at gmail.com  Wed Apr 10 13:23:03 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 10 Apr 2019 18:23:03 +0100
Subject: [scikit-learn] Feature engineering functionality - new package
Message-ID:

> Dear Scikit-Learn team,
>
> Feature engineering is a big task ahead of building machine learning
> models. It involves imputation of missing values, encoding of categorical
> variables, discretisation, variable transformation etc.
>
> Sklearn includes some functionality for feature engineering, which is
> useful, but it has a few limitations:
>
> 1) it does not allow for feature specification - it will do the same
> process on all variables, for example SimpleImputer. Typically, we want
> to impute different columns with different values.
>
> 2) It does not capture information from the training set, that is, it does
> not learn, and is therefore not able to perpetuate the values learnt from
> the train set to unseen data.
>
> The 2 limitations above apply to all the feature transformers in sklearn,
> I believe.
>
> Therefore, if these transformers are used as part of a pipeline, we could
> end up doing different transformations to train and test, depending on the
> characteristics of the datasets. For business purposes, this is not a
> desired option.
>
> I think that building transformers that learn from the train set would be
> of much use for the community.
> To this end, I built a python package called feature engine
> which expands the sklearn api with additional feature engineering
> techniques, and the functionality that allows the transformer to learn
> from data and store the parameters learnt.
>
> The techniques included have been used worldwide, both in business and in
> data competitions, and reported in kdd reports and other articles. I also
> cover them in a udemy course which has enrolled several thousand students.
>
> The package capitalises on the use of pandas to capture the features, but
> I am confident that the column names could be captured and the df
> transformed to a numpy array to comply with sklearn requirements.
>
> I wondered whether it would be of interest to include the functionality of
> this package within sklearn?
> If you would consider extending the sklearn api to include these
> transformers, I would be happy to help.
>
> Alternatively, would you consider adding the package to your website,
> where you mention the libraries that extend sklearn functionality?
>
> All feedback is welcome.
>
> Many thanks and I look forward to hearing from you
>
> Thank you so much for such an awesome contribution through the sklearn api
>
> Kind regards
>
> Sole
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From liam at chatdesk.com  Wed Apr 10 13:25:56 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 13:25:56 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML
Message-ID:

Hi all,

I was hoping to get some guidance re: changing the result of the predict method of the OneVsRestClassifier to return a dense array rather than a sparse array, given that Google Cloud ML only accepts dense numpy arrays as the result of a given model's predict method. Right now my model architecture looks like:

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', OneVsRestClassifier(XGBClassifier()))])

This returns a sparse array from the predict method. I saw the Stack Overflow post here: https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba

which recommends overwriting the predict method with the predict_proba method, however I found that I can't serialize the model after doing so. I also have a stack overflow post here: https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a which details the specific pickling error.

Is this a known issue? Is there an accepted way to convert this into a dense array?

Thanks,
Liam Geron
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From goix.nicolas at gmail.com  Wed Apr 10 13:42:09 2019
From: goix.nicolas at gmail.com (Nicolas Goix)
Date: Wed, 10 Apr 2019 18:42:09 +0100
Subject: [scikit-learn] Feature engineering functionality - new package
In-Reply-To: References: Message-ID:

Hi Sole,

I'm not sure the 2 limitations you mentioned are correct.
1) in your example, using the ColumnTransformer you can impute different
values for different columns.
2) the sklearn transformers do learn on the training set and are able to
perpetuate the values learnt from the train set to unseen data.
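For example, both points in a quick sketch with made-up data:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# different imputation values for different columns
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

ct = ColumnTransformer([
    ('impute_mean', SimpleImputer(strategy='mean'), [0]),
    ('impute_const', SimpleImputer(strategy='constant', fill_value=-1.0), [1]),
])
ct.fit(X_train)              # the statistics are learnt on the train set only
print(ct.transform(X_test))  # [[ 2. -1.]]: train mean for col 0, constant for col 1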
Nicolas

On Wed, Apr 10, 2019, 18:25 Sole Galli wrote:
> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Wed Apr 10 13:35:07 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 10 Apr 2019 12:35:07 -0500
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML
In-Reply-To: References: Message-ID:

Hi Liam,

not sure what your exact error message is, but it may also be that the XGBClassifier only accepts dense arrays? I think the TfidfVectorizer returns sparse arrays. You could probably fix your issues by inserting a "DenseTransformer" into your pipeline (a simple class that just transforms an array from a sparse to a dense format).
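Something along these lines would do it (a rough sketch):

from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Turn a sparse matrix into a dense numpy array inside a Pipeline."""

    def fit(self, X, y=None):
        return self  # stateless; nothing to learn

    def transform(self, X):
        return X.toarray() if hasattr(X, 'toarray') else X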
I've implemented something like that, which you can import or copy & paste from here:

https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py

The usage would then basically be

model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense', DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])

Best,
Sebastian

> On Apr 10, 2019, at 12:25 PM, Liam Geron wrote:
> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From liam at chatdesk.com  Wed Apr 10 14:10:35 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 14:10:35 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML
In-Reply-To: References: Message-ID:

Hi Sebastian,

Thanks for the advice! The model actually works on its own in python fine luckily, so I don't think that that is the issue exactly. I have tried rolling my own estimator to wrap the pipeline to have it call the predict_proba method to return a dense array, however I then came across the problem that I would have to have that custom estimator defined on the Cloud ML end, which I'm unsure how to do.
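For reference, the wrapper I tried looks roughly like this (names made up):

class ProbaAsPredict:
    """Delegate to a pipeline, but have predict() return probabilities."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def fit(self, X, y):
        self.pipeline.fit(X, y)
        return self

    def predict(self, X):
        return self.pipeline.predict_proba(X)  # dense array of probabilities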
> Right now my model architecture looks like:
>
> model = Pipeline([('tfidf', TfidfVectorizer()),
>                   ('clf', OneVsRestClassifier(XGBClassifier()))])
>
> Which returns a sparse array with the predict method. I saw the Stack
> Overflow post here:
> https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
>
> which recommends overwriting the predict method with the predict_proba
> method, however I found that I can't serialize the model after doing so.
> I also have a Stack Overflow post here:
> https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
> which details the specific pickling error.
>
> Is this a known issue? Is there an accepted way to convert this into a
> dense array?
>
> Thanks,
> Liam Geron

From liam at chatdesk.com Wed Apr 10 14:10:35 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 14:10:35 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

Hi Sebastian,

Thanks for the advice! The model actually works on its own in Python fine,
luckily, so I don't think that that is the issue exactly. I have tried
rolling my own estimator to wrap the pipeline to have it call the
predict_proba method to return a dense array, however I then came across
the problem that I would have to have that custom estimator defined on the
Cloud ML end, which I'm unsure how to do.

Thanks,
Liam

On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka wrote:
> [...]

From solegalli1 at gmail.com Wed Apr 10 14:13:46 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 10 Apr 2019 19:13:46 +0100
Subject: [scikit-learn] Feature engineering functionality - new package

Hi Nicolas,

You are right, I am just checking this in the source code.

Sorry for the confusion and thanks for the quick response.

Cheers

Sole

On Wed, 10 Apr 2019 at 18:43, Nicolas Goix wrote:

> Hi Sole,
>
> I'm not sure the 2 limitations you mentioned are correct.
> 1) in your example, using the ColumnTransformer you can impute different
> values for different columns.
> 2) the sklearn transformers do learn on the training set and are able to
> perpetuate the values learnt from the train set to unseen data.
>
> Nicolas
>
> On Wed, Apr 10, 2019, 18:25 Sole Galli wrote:
>> [...]
From mail at sebastianraschka.com Wed Apr 10 14:34:16 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 10 Apr 2019 13:34:16 -0500
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML
Message-ID: <9B6ADC52-08EB-40A7-BC4D-346F978A43FE@sebastianraschka.com>

Hm, weird that their platform seems to be so picky about it. Have you
tried to just make the output of the pipeline dense? I.e.,

(model.predict(X)).toarray()

Best,
Sebastian

> On Apr 10, 2019, at 1:10 PM, Liam Geron wrote:
> [...]
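For reference, the wrapper-estimator approach Liam describes might look
roughly like the sketch below (hypothetical code; the class name is made
up for illustration):

from sklearn.base import BaseEstimator

class DensePredictWrapper(BaseEstimator):
    """Wraps a pipeline so that predict() returns a dense array."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def fit(self, X, y=None):
        self.pipeline.fit(X, y)
        return self

    def predict(self, X):
        y = self.pipeline.predict(X)
        # Densify only if the wrapped estimator returned a sparse matrix.
        return y.toarray() if hasattr(y, "toarray") else y

The pickling catch is that the class definition must be importable wherever
the model is unpickled, which is exactly the sticking point on the Cloud ML
side.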
From liam at chatdesk.com Wed Apr 10 15:26:55 2019
From: liam at chatdesk.com (Liam Geron)
Date: Wed, 10 Apr 2019 15:26:55 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

Unfortunately I don't believe that you get that level of freedom: it's an
API call that automatically calls the model's predict method, so I don't
think that I get to specify something like model.predict(X).toarray(). I
could be wrong, however; I don't pretend to be an expert on Cloud ML by
any stretch.

Thanks,
Liam

On Wed, Apr 10, 2019 at 3:23 PM Sebastian Raschka wrote:
> [...]
From joel.nothman at gmail.com Wed Apr 10 23:01:28 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 11 Apr 2019 13:01:28 +1000
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

I think it's a bit weird if we're returning sparse output from
OneVsRestClassifier.predict if it wasn't fit on sparse Y.

Actually, I would be in favour of deprecating multilabel support in
OneVsRestClassifier, since it is performing the "binary relevance method"
for multilabel, not actually OvR.
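(Binary relevance here just means fitting one independent copy of the base
classifier per label column of Y - roughly the following sketch, assuming
a dense 0/1 indicator matrix:)

import numpy as np
from sklearn.base import clone

def binary_relevance_fit(base_clf, X, Y):
    # One independent binary classifier per label column of Y.
    return [clone(base_clf).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(clfs, X):
    # Stack the per-label predictions into a dense indicator matrix.
    return np.column_stack([clf.predict(X) for clf in clfs])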
MultiOutputClassifier duplicates this functionality (more or less), outputs
a dense array (indeed it doesn't support sparse Y, and perhaps it should)
and lives closer to functional alternatives to binary relevance, such as
ClassifierChain.

On Thu, 11 Apr 2019 at 05:32, Liam Geron wrote:
> [...]
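In code, the suggested swap amounts to something like this sketch
(hypothetical and untested on Cloud ML; the FunctionTransformer step
stands in for the DenseTransformer idea from earlier in the thread):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    # Densify the tf-idf output before it reaches XGBClassifier.
    ('to_dense', FunctionTransformer(lambda X: X.toarray(),
                                     accept_sparse=True)),
    ('clf', MultiOutputClassifier(XGBClassifier())),
])
# model.predict(X) then returns a dense (n_samples, n_labels) array.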
From liam at chatdesk.com Thu Apr 11 13:30:56 2019
From: liam at chatdesk.com (Liam Geron)
Date: Thu, 11 Apr 2019 13:30:56 -0400
Subject: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

That's a great tip, actually; I was unaware of the MultiOutputClassifier
option. I'll give it a try!

Thanks,
Liam

On Wed, Apr 10, 2019 at 11:03 PM Joel Nothman wrote:
> [...]
From t3kcit at gmail.com Mon Apr 15 10:55:11 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 15 Apr 2019 10:55:11 -0400
Subject: [scikit-learn] Feature engineering functionality - new package
Message-ID: <6d06420d-a0f7-a374-ee90-c73af5219e35@gmail.com>

1) was indeed a design decision. Your design is certainly an alternative
design, which might be more convenient in some situations, but it requires
adding this feature to all transformers, which basically just adds a bunch
of boilerplate code everywhere. So you could argue our design decision was
more driven by ease of maintenance than ease of use.

There might be some transformers in your package that we could add to
scikit-learn in some form, but several are already available.
SimpleImputer implements MedianMeanImputer, CategoricalVariableImputer and
FrequentCategoryImputer. We don't currently have RandomSampleImputer and
EndTailImputer, I think. AddNaNBinaryImputer is "MissingIndicator" in
sklearn.

OneHotCategoricalEncoder and OrdinalEncoder exist;
CountFrequencyCategoricalEncoder and MeanCategoricalEncoder are in the
works, though there are some arguments about the details. These are also
in the categorical-encoding package:
http://contrib.scikit-learn.org/categorical-encoding/

RareLabelCategoricalEncoder is something I definitely want in
OneHotEncoder; not sure if there's a PR yet.

Do you have examples of WoERatioCategoricalEncoder or Windsorizer or any
of the discretizers actually working well in practice? I have not seen
them used much; they seemed to be popular in Weka, though.

BoxCoxTransformer is implemented in PowerTransformer, and LogTransformer,
ReciprocalTransformer and ExponentialTransformer can be implemented as
FunctionTransformer(np.log), FunctionTransformer(lambda x: 1/x) and
FunctionTransformer(lambda x: x ** exp), I believe.

It might be interesting to add your package to scikit-learn-contrib:
https://github.com/scikit-learn-contrib

We are struggling a bit with how to best organize that, though.

Cheers,
Andy

On 4/10/19 2:13 PM, Sole Galli wrote:
> [...]
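For concreteness, those FunctionTransformer equivalents would look
something like the sketch below (the variable names mirror the
feature-engine classes being emulated, and the exponent is assumed fixed
at 0.5 purely for illustration):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Stateless transformers: unlike PowerTransformer, nothing is learned
# during fit; the function is simply applied element-wise.
log_transformer = FunctionTransformer(np.log)
reciprocal_transformer = FunctionTransformer(lambda x: 1 / x)
exponential_transformer = FunctionTransformer(lambda x: x ** 0.5)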
From ian at ianozsvald.com Wed Apr 17 10:59:42 2019
From: ian at ianozsvald.com (Ian Ozsvald)
Date: Wed, 17 Apr 2019 15:59:42 +0100
Subject: [scikit-learn] PyDataLondon 2019 (July 12-14) Call for Proposals closing this Friday

On July 12-14 we host the sixth PyDataLondon conference in central London.
As last year, we'll be hosted close to Tower Bridge at the Tower Hotel,
with 700 attendees over 3 days: https://pydata.org/london2019/

Our Call for Proposals has been open for several weeks; it closes this
Friday. If anyone here would like to spread the good word about
scikit-learn (and any scikit/scipy/Python data science related topics),
we'd love to see a proposal. We also offer first-time speaker mentoring;
it is a bit late for this now, so I'll offer to answer any questions
anyone has personally - just email me directly. The Call for Proposals
closes this Friday; please submit your talk here:
https://pydata.org/london2019/cfp

If you've not been to PyDataLondon before, here's last year's schedule and
my write-up of all of the events that we covered.
Gael Varoquaux and others spoke for us; we'd love to see scikit-learn well
represented again:
https://pydata.org/london2018/schedule/
https://ianozsvald.com/2018/04/30/pydatalondon-2018-and-creating-correct-and-capable-classifiers/

Regards, Ian (PyDataLondon co-founder)

--
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
ian at IanOzsvald.com
https://IanOzsvald.com
https://MorConsulting.com
https://twitter.com/IanOzsvald

From vaggi.federico at gmail.com Fri Apr 19 12:52:51 2019
From: vaggi.federico at gmail.com (federico vaggi)
Date: Fri, 19 Apr 2019 09:52:51 -0700
Subject: [scikit-learn] Categorical Encoding of high cardinality variables

Hi everyone,

I wanted to use the scikit-learn transformer API to clean up some messy
data as input to a neural network. One of the steps involves converting
categorical variables (of very high cardinality) into integers for use in
an embedding layer.

Unfortunately, I cannot quite use LabelEncoder to solve this. When dealing
with categorical variables of very high cardinality, I found it useful in
practice to have a threshold value for the frequency, under which a
variable ends up with the 'unk' or 'rare' label. This same label would
also end up applied at test time to entries that were not observed in the
train set.

This is relatively straightforward to add to the existing label encoder
code, but it breaks the contract slightly: if we encode some variables
with a 'rare' label, then the transform operation is no longer a
bijection.

Is this feature too niche for the main sklearn? I saw there was a package
(https://feature-engine.readthedocs.io/en/latest/RareLabelCategoricalEncoder.html)
that implemented a similar feature discussed in the mailing list.

From mlcnworkshop at gmail.com Tue Apr 23 04:05:43 2019
From: mlcnworkshop at gmail.com (MLCN Workshop)
Date: Tue, 23 Apr 2019 10:05:43 +0200
Subject: [scikit-learn] The 2nd International Workshop on Machine Learning in Clinical Neuroimaging (MLCN 2019): ENTERING THE ERA OF BIG DATA VIA TRANSFER LEARNING AND DATA HARMONIZATION

Dear Colleagues,

We are delighted to invite you to join us for the MLCN 2019 workshop as a
satellite event of the MICCAI 2019 conference, Shenzhen, China.

Call for Papers

Recent advances in neuroimaging and machine learning provide an
exceptional opportunity for investigators and physicians to discover
complex relationships between brain, behaviors, and mental and
neurological disorders. The MLCN 2019 workshop (https://mlcnws.com), as a
satellite event of MICCAI 2019 (https://www.miccai2019.org), aims to bring
together researchers in both theory and application, from domains such as
machine learning, neuroimaging, predictive clinical neuroscience, etc.
Topics of interest include, but are not limited to:

- Transfer learning in clinical neuroimaging
- Model stability in transfer learning
- Data prerequisites for successful transfer learning
- Domain adaptation in neuroimaging
- Data harmonization across sites
- Data pooling: practical issues
- Cross-domain learning in neuroimaging
- Interpretability for transfer learning
- Unsupervised methods for domain adaptation
- Multi-site data analysis, from preprocessing to modeling
- Big data in clinical neuroimaging
- Scalable machine learning methods
- Benefits, problems, and solutions of working with very large datasets

SUBMISSION PROCESS: The workshop seeks high quality, original, and
unpublished work on algorithms, theory, and applications of machine
learning in clinical neuroimaging related to big data, transfer learning,
and data harmonization. Papers should be submitted electronically in
Springer Lecture Notes in Computer Science (LNCS) style
(https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines)
with up to 8 pages, using the CMT system at
https://cmt3.research.microsoft.com/MLCN2019. The MLCN workshop uses a
double-blind review process in the evaluation phase; thus, authors must
ensure their submissions are anonymous. Accepted papers will be published
in joint proceedings with the MICCAI 2019 conference.

IMPORTANT DATES:

- Paper submission deadline: July 1, 2019 (23:59 PST)
- Notification of Acceptance: August 5, 2019
- Camera-ready Submission: August 12, 2019
- Workshop Date: October 13, 2019

Best regards,
MLCN 2019 Organizing Committee
Email: mlcnworkshop at gmail.com
Website: https://mlcnws.com/
Twitter: @MLCNworkshop

From solegalli1 at gmail.com Tue Apr 23 20:00:15 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 24 Apr 2019 01:00:15 +0100
Subject: [scikit-learn] Categorical Encoding of high cardinality variables

Hello everyone,

I am Sole. I started the conversation on feature engine, a package I
created for feature engineering.

Regarding the grouping of rare / infrequent categories into an umbrella
term like "Rare" or "Other", which Federico raised recently, I would like
to provide some literature at the end of this email that quotes the use of
this procedure. These are a series of articles by the best solutions to
the 2009 KDD annual competition, which were compiled into one "book", and
I am sure you are aware of it already.

I would also like to highlight that this is extremely common practice in
industry, not only to avoid overfitting, but also to handle unseen
categories when models are deployed. It would be great to see this
functionality added to both the OrdinalEncoder and the OneHotEncoder, with
triggers on the representation of the label in the dataset (e.g., a
percentage).

Pointing to the main quotes from these articles:

Page 4 of the summary and introductory article: "For categorical
variables, grouping of under-represented categories proved to be useful to
avoid overfitting. The winners of the fast and the slow track used similar
strategies consisting in retaining the most populated categories and
coarsely grouping the others in an unsupervised way"

Page 23: "Most of the learning algorithms we were planning to use do not
handle categorical variables, so we needed to recode them. This was done
in a standard way, by generating indicator variables for the different
values a categorical attribute could take.
The only slightly non-standard decision was to limit ourselves to encoding
only the 10 most common values of each categorical attribute, rather than
all the values, in order to avoid an explosion in the number of features
from variables with a huge vocabulary"

Page 36: "We consolidate the extremely low populated entries (having fewer
than 200 examples) with their neighbors to smooth out the outliers.
Similarly, we group some categorical variables which have a large number
of entries (> 1000 distinct values) into 100 categories."

See the bulletpoints on page 47.

I hope you find these useful. Let me know if / how I can help.

Regards

Sole

On Fri, 19 Apr 2019 at 17:54, federico vaggi wrote:
> [...]

From solegalli1 at gmail.com Tue Apr 23 21:36:16 2019
From: solegalli1 at gmail.com (Sole Galli)
Date: Wed, 24 Apr 2019 02:36:16 +0100
Subject: [scikit-learn] Feature engineering functionality - new package

Hi Andreas and team,

Thank you very much for your reply. This was very helpful.

Happy to hear that functionality similar to
CountFrequencyCategoricalEncoder, MeanCategoricalEncoder and
RareLabelCategoricalEncoder is on the agenda. The last functionality,
grouping of rare labels, would be useful for both the OneHotEncoder and
OrdinalEncoder, as per a previous thread.

-------------------------

Re: your questions:

Examples of various discretisers can be found in the winner solutions of
the KDD 2009 annual competition articles. See for example:

- Bulletpoints on page 26, which include the use of decision trees to
create bins.
- Summary of employed methods on page 14: "Discretization was the second
most used preprocessing. Its usefulness for this particular dataset is
justified by the non-normality of the distribution of the variables and
the existence of extreme values. The simple binning used by the winners of
the slow track proved to be efficient.
" - A peculiar binning described in 2.2 in page 36 - I also use discretisers at work, inspired on the KDD articles, see for example my blog at the peer-to-peer company , which I would argue attest to successful implementation:p - Equal width and equal frequency discretisers are discussed in this master thesis . Windsorisation, or top coding: we these use all the time in the industry, usually capping at arbitrary values. Windsorisation using mean and std or quantiles is a way of automating the capping. In theory it would boost performance of linear models. Have tried that myself in a couple of toy datasets from Kaggle. I don't have a good article to point you to at the moment. There are a few that discuss topcoding, and also the effect of outiers on NN, but not too sure how widely accepted they are. On WoE, I understand is common practice in finance. Haven't used it at work. Have used it in toy datasets, behaves more or less the same than target mean encoding. Although the purpose of WoE goes beyond than improving performance, it is also a way of "standarising" the variables and making them understandable. See for example this summary. I know that sklearn likes to include algorithms widely accepted, ideally from multi-quoted articles. So for winsorisation and WoE I am not quite answering your questions I guess. I will keep an eye in case something new comes up. ------------------ Re: sharing feature-engine in sklearn contrib. I would really appreciate if you could do that. I am planning to expand the package with other feature engineering techniques, which I think will be useful for the community. In particular, until ColumnTransformer becomes widely adopted and the other transformers developed. Would be great if it could be shared in the contrib page and also int the related projects page. ---------------- Re: the categorical encoding package I am aware that it exists. Haven't tried it myself. When we presented it to the company, the main criticism was that most of the encoders distort the variables so much that they lose all possible human interpretation of them. So, the business prefers not to use these types of encoding. Which, I think I kind of agree. Thanks again for your time. Let me know if / how I can help and if you would be happy to include feature engine in the contrib page. Have a good rest of week Sole On Mon, 15 Apr 2019 at 15:56, Andreas Mueller wrote: > 1) was indeed a design decision. Your design is certainly an alternative > design, that might be more convenient in some situations, > but requires adding this feature to all transformers, which basically just > adds a bunch of boilerplate code everywhere. > So you could argue our design decision was more driven by ease of > maintenance than ease of use. > > There might be some transformers in your package that we could add to > scikit-learn in some form, but several are already available, > SimpleImputer implements MedianMeanImputer, CategoricalVariableImputer and > FrequentCategoryImputer > We don't currently have RandomSampleImputer and EndTailImputer, I think. > AddNaNBinaryImputer is "MissingIndicator" in sklearn. > > OneHotCategoricalEncoder and OrdinalEncoder exist, > CountFrequencyCategoricalEncoder and MeanCategoriclaEncoder are in the > works, > though there are some arguments about the details. These are also in the > categorical-encoding package: > http://contrib.scikit-learn.org/categorical-encoding/ > > RareLabelCategoricalEncoder is something I definitely want in > OneHotEncoder, not sure if there's a PR yet. 
From nelle.varoquaux at gmail.com Fri Apr 26 17:39:06 2019
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Fri, 26 Apr 2019 14:39:06 -0700
Subject: [scikit-learn] 2019 John Hunter Excellence in Plotting Contest Reminder

Hi everybody,

My apologies to those of you getting this on multiple lists.

In memory of John Hunter, we are pleased to announce the SciPy John Hunter
Excellence in Plotting Competition for 2019. This open competition aims to
highlight the importance of data visualization to scientific progress and
showcase the capabilities of open source software. Participants are
invited to submit scientific plots to be judged by a panel. The winning
entries will be announced and displayed at the conference.

John Hunter's family and NumFOCUS are graciously sponsoring cash prizes
for the winners in the following amounts:

- 1st prize: $1000
- 2nd prize: $750
- 3rd prize: $500

- Entries must be submitted by June 8th to the form at
https://goo.gl/forms/cFTB3FUBrMPfQ7Vz1
- Winners will be announced at SciPy 2019 in Austin, TX.
- Participants do not need to attend the SciPy conference.
- Entries may take the definition of "visualization" rather broadly.
Entries may be, for example, a traditional printed plot, an interactive
visualization for the web, or an animation.
- Source code for the plot must be provided, in the form of Python code
and/or a Jupyter notebook, along with a rendering of the plot in a widely
used format. This may be, for example, PDF for print, standalone HTML and
Javascript for an interactive plot, or MPEG-4 for a video. If the original
data cannot be shared for reasons of size or licensing, "fake" data may be
substituted, along with an image of the plot using real data.
- Each entry must include a 300-500 word abstract describing the plot and
its importance for a general scientific audience.
- Entries will be judged on their clarity, innovation and aesthetics, but
most importantly for their effectiveness in communicating a real-world
problem. Entrants are encouraged to submit plots that were used during the
course of research or work, rather than merely being hypothetical.
- SciPy reserves the right to display any and all entries, whether
prize-winning or not, at the conference, and to use them in any materials
or on its website, with attribution to the original author(s).

SciPy John Hunter Excellence in Plotting Competition Co-Chairs
Hannah Aizenman
Thomas Caswell
Madicken Munk
Nelle Varoquaux

From pahome.chen at mirlab.org Tue Apr 30 04:48:09 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 30 Apr 2019 16:48:09 +0800
Subject: [scikit-learn] Any other clustering algo cluster incrementally?

I read this:
https://scikit-learn.org/0.15/modules/scaling_strategies.html

There's only one clustering algorithm there that clusters incrementally,
which is minibatch k-means. Is there any other clustering algorithm that
can do this? One on GitHub is okay.

Thanks.

From gael.varoquaux at normalesup.org Tue Apr 30 12:38:24 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 30 Apr 2019 18:38:24 +0200
Subject: [scikit-learn] Any other clustering algo cluster incrementally?
Message-ID: <20190430163824.nkn6adhv6gz5ahqa@phare.normalesup.org>

On Tue, Apr 30, 2019 at 04:48:09PM +0800, lampahome wrote:
> I read this: https://scikit-learn.org/0.15/modules/scaling_strategies.html
> There's only one clustering algorithm there that clusters incrementally,
> which is minibatch k-means.

The documentation that you are pointing to refers to version 0.15. If you
look at the current page on scaling, you will see that there is another
clustering algorithm that works incrementally:
https://scikit-learn.org/stable/modules/computing.html#strategies-to-scale-computationally-bigger-data

Best,

Gaël

From joel.nothman at gmail.com Tue Apr 30 17:23:06 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 1 May 2019 07:23:06 +1000
Subject: [scikit-learn] Any other clustering algo cluster incrementally?

I think it would be possible to implement an incremental extension to
DBSCAN. But it's been years since I looked at what is involved, and it
might require storing the training data, unlike those out-of-core methods.

From joel.nothman at gmail.com Tue Apr 30 22:09:55 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 1 May 2019 12:09:55 +1000
Subject: [scikit-learn] Release Candidate for Scikit-learn 0.21

PyPI now has source and binary releases for Scikit-learn 0.21rc2.

* Documentation at https://scikit-learn.org/0.21
* Release Notes at https://scikit-learn.org/0.21/whats_new
* Download source or wheels at https://pypi.org/project/scikit-learn/0.21rc2/

Please try out the software and help us edit the release notes before a
final release.

Highlights include:

* neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
which learns a weighted euclidean distance for k-nearest neighbors.
https://scikit-learn.org/0.21/modules/neighbors.html#nca
* ensemble.HistGradientBoostingClassifier and
ensemble.HistGradientBoostingRegressor: experimental implementations of
efficient binned gradient boosting machines.
https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
* impute.IterativeImputer: a non-trivial approach to missing value
imputation.
https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
* cluster.OPTICS: a new density-based clustering algorithm.
https://scikit-learn.org/0.21/modules/clustering.html#optics
* better printing of estimators as strings, with an option to hide default
parameters for compactness:
https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
* for estimator and library developers: a way to tag your estimator so
that it can be treated appropriately with check_estimator.
https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags

There are many other enhancements and fixes listed in the release notes
(https://scikit-learn.org/0.21/whats_new).

Please note that Scikit-learn has new dependencies:

* joblib >= 0.11, which used to be vendored within Scikit-learn
* OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 is set when
the code is compiled (and cythonized)

Happy Learning!

From the Scikit-learn core dev team.
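For anyone wanting to try the release candidate, installation looks
something like this (pip needs the --pre flag, or an explicit pin, to pick
up a pre-release; the exact pin below is an assumption based on the
version string above):

# Install the release candidate from PyPI.
pip install --pre scikit-learn==0.21rc2

# Confirm the installed version and build environment.
python -c "import sklearn; sklearn.show_versions()"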