From peer.j.nowack at gmail.com Wed May 2 07:08:28 2018
From: peer.j.nowack at gmail.com (Peer Nowack)
Date: Wed, 2 May 2018 12:08:28 +0100
Subject: [scikit-learn] How does multiple target Ridge Regression work in
scikit learn?
Message-ID:
Hi all,
I am struggling to understand the following:
Scikit-learn offers a multiple output version for Ridge Regression, simply
by handing over a 2D array [n_samples, n_targets], but how is it
implemented?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
Is it correct to assume that the regression for each target is
independent? If so, how can I adapt this to use an individual alpha
regularization parameter for each regression? If I use GridSearchCV, would
I have to hand over a matrix of possible regularization parameters? How
would that work?
Thanks in advance - I have been searching for hours but could not find
anything on this topic.
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From bertrand.thirion at inria.fr Wed May 2 08:07:12 2018
From: bertrand.thirion at inria.fr (bthirion)
Date: Wed, 2 May 2018 14:07:12 +0200
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To:
References:
Message-ID: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
The alpha parameter is shared across all problems; if you want to use
different parameters, you probably want to perform separate fits.
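A minimal sketch of the separate-fits approach, with hypothetical per-target
alpha values and random data chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                 # (n_samples, n_features)
Y = rng.randn(100, 3)                 # (n_samples, n_targets)
alphas = [0.1, 1.0, 10.0]             # hypothetical: one alpha per target

# fit one independent Ridge model per target column
models = [Ridge(alpha=a).fit(X, Y[:, j]) for j, a in enumerate(alphas)]
predictions = np.column_stack([m.predict(X) for m in models])
print(predictions.shape)              # (100, 3)
```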
Best,
Bertrand
From peer.j.nowack at gmail.com Wed May 2 09:02:33 2018
From: peer.j.nowack at gmail.com (Peer Nowack)
Date: Wed, 2 May 2018 14:02:33 +0100
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID:
Thanks, Bertrand - very helpful. I needed to confirm this.
Peter
On 2 May 2018 at 13:07, bthirion wrote:
> The alpha parameter is shared across all problems; if you want to use
> different parameters, you probably want to perform separate fits.
> Best,
>
> Bertrand
From michael.eickenberg at gmail.com Wed May 2 14:32:31 2018
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Wed, 2 May 2018 11:32:31 -0700
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To:
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID:
By the linear nature of the problem the targets are always separately
treated (even if there was a matrix-variate normal prior indicating
covariance between target columns, you could do that adjustment before or
after fitting).
As for different alpha parameters, I think you can specify a different
alpha per target if you pass in an array of shape (n_targets,). Maybe this
is not implemented for all solvers, but it should be at least for some.
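If array-valued alpha is supported as Michael describes, a small sketch would
look like this (random data; shapes are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 4)                          # (n_samples, n_features)
Y = rng.randn(50, 3)                          # (n_samples, n_targets)

# one regularization strength per target column
ridge = Ridge(alpha=np.array([0.1, 1.0, 10.0]))
ridge.fit(X, Y)
print(ridge.coef_.shape)                      # (3, 4): one row per target
```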
If you grid search, the scikit-learn API requires the score to be a single
number, so it's non-trivial to optimize different alphas for different
targets (even though selecting the best alpha for each target will of
course make the summed error go down, too).
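One workaround that stays within the scikit-learn API is to run a separate
small grid search per target column, so each search still produces a scalar
score (the grid values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
Y = rng.randn(80, 3)

best_alphas = []
for j in range(Y.shape[1]):
    # each search sees a single 1-D target, so its score is one number
    gs = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    gs.fit(X, Y[:, j])
    best_alphas.append(gs.best_params_["alpha"])
print(best_alphas)
```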
Depending on what your use case is, it may be easier to just write your own:
If X = U S VT (SVD), then weights = VT.T.dot((S / (S ** 2 + alpha) * U).T.dot(Y)).
For more than one alpha:

# alphas.shape == (n_alphas, n_targets)
# Y.shape == (n_samples, n_targets)
# X.shape == (n_samples, n_features)
U, S, VT = np.linalg.svd(X, full_matrices=False)
# ridge shrinkage factors s / (s**2 + alpha), one per (singular value, target)
diags = S[np.newaxis, :, np.newaxis] / (S[np.newaxis, :, np.newaxis] ** 2
                                        + alphas[:, np.newaxis, :])
UTY = U.T.dot(Y)
weights = np.zeros([n_alphas, n_features, n_targets])
for i in range(alphas.shape[0]):
    weights[i] = VT.T.dot(diags[i] * UTY)
Then use those weights to predict.
Michael
From princejha616 at gmail.com Thu May 3 02:53:20 2018
From: princejha616 at gmail.com (prince jha)
Date: Thu, 3 May 2018 12:23:20 +0530
Subject: [scikit-learn] Project Contribution
Message-ID:
Hello everyone, I am willing to contribute to the scikit-learn open source
project, but since I have never contributed to an open-source project
before, I don't know where to start. I would be thankful if any of you
could help me get started contributing to this great project.
Thanks,
Prince
From ross at cgl.ucsf.edu Thu May 3 03:02:54 2018
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Thu, 3 May 2018 00:02:54 -0700
Subject: [scikit-learn] Project Contribution
In-Reply-To:
References:
Message-ID:
Quick followup from a bystander: have you used scikit-learn for
anything? How much of the code have you read? (me: no, 0)
Bill
From m.ali.jamaoui at gmail.com Thu May 3 03:25:33 2018
From: m.ali.jamaoui at gmail.com (Mohamed Ali Jamaoui)
Date: Thu, 3 May 2018 09:25:33 +0200
Subject: [scikit-learn] Project Contribution
In-Reply-To:
References:
Message-ID:
Hi,
There are many ways to contribute, not only code. You can get started by
reading the "Contributing" section of the "Developer's guide":
http://scikit-learn.org/dev/developers/contributing.html
For code contributions, you don't need to read the whole codebase to be
able to contribute; try to pave your way into it gradually. A good first
step would be to start with issues labeled "good first issue".
Welcome onboard :)
Regards,
Mohamed Ali JAMAOUI
From princejha616 at gmail.com Thu May 3 03:48:14 2018
From: princejha616 at gmail.com (prince jha)
Date: Thu, 3 May 2018 13:18:14 +0530
Subject: [scikit-learn] Project Contribution
Message-ID:
Hi Bill, I have actually used scikit-learn for solving problems available
on Kaggle, but I am not very proficient since I have not used it much.
Thanks
Prince
From wouterverduin at gmail.com Fri May 4 05:12:40 2018
From: wouterverduin at gmail.com (Wouter Verduin)
Date: Fri, 4 May 2018 11:12:40 +0200
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
Message-ID:
Dear developers of scikit-learn,
I am working on a scientific paper on a prediction model for complications
in major abdominal resections. I have been using scikit-learn to create
that model and got good results (score of 0.94). This makes us want to see
what the model built by scikit-learn actually looks like.
As of now we have 100 input variables, but logically these aren't all as
useful as the others, and we want to reduce this number to about 20 and
see what the effect on the score is.
*My question*: Is there a way to get the underlying formula for the model
out of scikit-learn instead of having it as a 'blackbox' in my svm
function?
At this moment I am predicting a dichotomous variable from 100 variables
(continuous, ordinal and binary).
My code:
import numpy as np
from numpy import *
import pandas as pd
from sklearn import tree, svm, linear_model, metrics, preprocessing
import datetime
from sklearn.model_selection import KFold, cross_val_score, ShuffleSplit, GridSearchCV
from time import gmtime, strftime

# open and prepare the database
file = "/home/wouter/scikit/DB_SCIKIT.csv"
DB = pd.read_csv(file, sep=";", header=0, decimal=',').as_matrix()
DBT = DB
print "Vorm van de DB: ", DB.shape  # shape of the DB

target = []
for i in range(len(DB[:, -1])):
    target.append(DB[i, -1])
DB = delete(DB, s_[-1], 1)  # remove the last (target) column
AantalOutcome = target.count(1)
print "Aantal outcome:", AantalOutcome  # number of positive outcomes
print "Aantal patienten:", len(target)  # number of patients

A = DB
b = target
print len(DBT)

svc = svm.SVC(kernel='linear', cache_size=500, probability=True)
indices = np.random.permutation(len(DBT))
rs = ShuffleSplit(n_splits=5, test_size=.15, random_state=None)
scores = cross_val_score(svc, A, b, cv=rs)
A = ("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print A

X_train = DBT[indices[:-302]]
y_train = []
for i in range(len(X_train[:, -1])):
    y_train.append(X_train[i, -1])
X_train = delete(X_train, s_[-1], 1)  # remove the last (target) column

X_test = DBT[indices[-302:]]
y_test = []
for i in range(len(X_test[:, -1])):
    y_test.append(X_test[i, -1])
X_test = delete(X_test, s_[-1], 1)  # remove the last (target) column

model = svc.fit(X_train, y_train)
print model
uitkomst = model.score(X_test, y_test)
print uitkomst
voorspel = model.predict(X_test)
print voorspel
And output:

Vorm van de DB:  (2011, 101)
Aantal outcome: 128
Aantal patienten: 2011
2011
Accuracy: 0.94 (+/- 0.01)
SVC(C=1.0, cache_size=500, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=True, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.927152317881
[0. 0. 0. ... 0. 0. 0.]   (all 302 test predictions are class 0)
Thanks in advance!
with kind regards,
Wouter Verduin
From mail at sebastianraschka.com Fri May 4 05:51:26 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 May 2018 05:51:26 -0400
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID: <5331A676-D6C6-4F01-8A4D-EDDE9318E08F@sebastianraschka.com>
Dear Wouter,
for the SVM, scikit-learn wraps LIBSVM and LIBLINEAR. I think the scikit-learn class SVC uses LIBSVM for every kernel. Since you are using the linear kernel, you could use the more efficient LinearSVC scikit-learn class to get similar results. Its linear model is in turn easier to handle in terms of your question:
> Is there a way to get the underlying formula for the model out of scikit instead of having it as a 'blackbox' in my svm function.
More specifically, LinearSVC uses the _fit_liblinear code available here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/base.py
And more info on the LIBLINEAR library it is using can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (they have links to technical reports and implementation details there)
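To make the "formula" concrete: for a linear SVM the fitted model is just a
weight vector plus an intercept, exposed as `coef_` and `intercept_`. A
sketch on synthetic data (names and shapes are illustrative, not your
dataset):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LinearSVC(random_state=0).fit(X, y)

# decision function: f(x) = w . x + b; predict class 1 where f(x) > 0
w, b = clf.coef_[0], clf.intercept_[0]
manual = X.dot(w) + b
print(np.allclose(manual, clf.decision_function(X)))  # True
```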
Best,
Sebastian
From david.mo.burns at gmail.com Fri May 4 12:47:20 2018
From: david.mo.burns at gmail.com (David Burns)
Date: Fri, 4 May 2018 12:47:20 -0400
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID:
Hi Wouter,
If you are looking to reduce the feature space for your model, I suggest
you look at the scikit-learn page on doing just that:
http://scikit-learn.org/stable/modules/feature_selection.html
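For example, univariate selection can cut roughly 100 features down to 20 in
a couple of lines (synthetic data standing in for the clinical matrix; k=20
is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in for a (n_samples, 100) feature matrix
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# keep the 20 features with the strongest univariate F-test scores
selector = SelectKBest(f_classif, k=20).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (500, 20)
```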
David
From niyaghif at oregonstate.edu Fri May 4 19:10:44 2018
From: niyaghif at oregonstate.edu (Niyaghi, Faraz)
Date: Fri, 4 May 2018 16:10:44 -0700
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
Message-ID:
Greetings,
This is Faraz Niyaghi from Oregon State University. I research variable
selection using random forests. To the best of my knowledge, there is a
difference between scikit-learn's and Breiman's definitions of feature
importance: Breiman uses out-of-bag (oob) cases to calculate feature
importance, but scikit-learn doesn't. I was wondering: 1) why are they
different? 2) can they result in very different rankings of features?
Here are the definitions I found on the web:
*Breiman:* "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
*scikit-learn:* "The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
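For concreteness, the quantity scikit-learn describes is exposed on a fitted
forest as the `feature_importances_` attribute (toy data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# impurity-based importances, normalized to sum to 1
print(rf.feature_importances_)
```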
Thank you for reading this email. Please let me know your thoughts.
Cheers,
Faraz.
Faraz Niyaghi
Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
From mail at sebastianraschka.com Fri May 4 19:58:03 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 May 2018 19:58:03 -0400
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID: <4B01B139-0D45-4F85-A287-E5B36BC3FE03@sebastianraschka.com>
Not sure how it compares in practice, but it's certainly more efficient to rank the features by impurity decrease rather than by OOB permutation performance, since you wouldn't need to:
a) compute the OOB performance (an extra inference pass)
b) permute a feature column, do another inference pass, and compare it to a)
c) repeat step b) for each feature column
Another reason is that Breiman's suggestion wouldn't work that well for certain RandomForestClassifier settings in scikit-learn, e.g., bootstrap=False (no OOB samples exist then).
If you like to compute the feature importance after Breiman's suggestion, I have implemented a simple wrapper function for scikit-learn estimators here:
http://rasbt.github.io/mlxtend/user_guide/evaluate/feature_importance_permutation/#example-1-feature-importance-for-classifiers
Note that it's not using OOB samples but an independent validation set though, because it's a general function that should not be restricted to random forests. If you have such an independent dataset, it should give more accurate results than using OOB samples.
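A rough sketch of that validation-set permutation scheme (the steps above,
using a held-out split instead of OOB samples; all data synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = rf.score(X_val, y_val)

rng = np.random.RandomState(0)
importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, j])  # break the link between feature j and y
    importances.append(baseline - rf.score(X_perm, y_val))
print(np.round(importances, 3))
```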
Best,
Sebastian
From Jeremiah.Johnson at unh.edu Fri May 4 20:08:45 2018
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Sat, 5 May 2018 00:08:45 +0000
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
Faraz, take a look at the discussion of this issue here: http://parrt.cs.usfca.edu/doc/rf-importance/index.html
Best,
Jeremiah
=========================================
Jeremiah W. Johnson, Ph.D
Asst. Professor of Data Science
Program Coordinator, B.S. in Analytics & Data Science
University of New Hampshire
Manchester, NH 03101
https://www.linkedin.com/in/jwjohnson314
From: scikit-learn > on behalf of "Niyaghi, Faraz" >
Reply-To: Scikit-learn mailing list >
Date: Friday, May 4, 2018 at 7:10 PM
To: "scikit-learn at python.org" >
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
Caution - External Email
________________________________
Greetings,
This is Faraz Niyaghi from Oregon State University. I research on variable selection using random forest. To the best of my knowledge, there is a difference between scikit-learn's and Breiman's definition of feature importance. Breiman uses out of bag (oob) cases to calculate feature importance but scikit-learn doesn't. I was wondering: 1) why are they different? 2) can they result in very different rankings of features?
Here are the definitions I found on the web:
Breiman: "In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
scikit-learn: " The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
Thank you for reading this email. Please let me know your thoughts.
Cheers,
Faraz.
Faraz Niyaghi
Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
From aqsdmcet at gmail.com Sat May 5 00:31:14 2018
From: aqsdmcet at gmail.com (aijaz qazi)
Date: Sat, 5 May 2018 10:01:14 +0530
Subject: [scikit-learn] Multi learn error.
Message-ID:
Dear developers of Scikit ,
I am working on web page categorization with http://scikit.ml/ .
*Question*: I am not able to execute the MLkNN code from
http://scikit.ml/api/classify.html. I have installed Python 3.6.
I found that my scipy version is not compatible with scikit.ml 0.0.5.
Which version of scipy would work with scikit.ml 0.0.5?
Kindly let me know.
*Regards,*
*Aijaz A.Qazi *
From rth.yurchak at gmail.com Sat May 5 02:28:22 2018
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sat, 5 May 2018 09:28:22 +0300
Subject: [scikit-learn] Multi learn error.
In-Reply-To:
References:
Message-ID: <49def996-56c7-ec5e-dc37-bf93968cfa2a@gmail.com>
Hi Aijaz,
On 05/05/18 07:31, aijaz qazi wrote:
> Dear developers of Scikit ,
Scikit is short for SciPy Toolkits (https://www.scipy.org/scikits.html);
there are a number of those. Scikit-learn started as one (and this is the
scikit-learn mailing list).
The package you are referring to is based on scikit-learn but is a separate
project (with a somewhat confusing home page URL). The right place to
ask for support would be its GitHub issue tracker or other project-specific
communication channels, if it has any.
--
Roman
From g.lemaitre58 at gmail.com Sat May 5 04:34:36 2018
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Sat, 5 May 2018 10:34:36 +0200
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
+1 on the post pointed out by Jeremiah.
On 5 May 2018 at 02:08, Johnson, Jeremiah wrote:
> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
> Best,
> Jeremiah
> =========================================
> Jeremiah W. Johnson, Ph.D
> Asst. Professor of Data Science
> Program Coordinator, B.S. in Analytics & Data Science
> University of New Hampshire
> Manchester, NH 03101
> https://www.linkedin.com/in/jwjohnson314
>
>
> From: scikit-learn python.org> on behalf of "Niyaghi, Faraz"
> Reply-To: Scikit-learn mailing list
> Date: Friday, May 4, 2018 at 7:10 PM
> To: "scikit-learn at python.org"
> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> Importance
>
> *Caution - External Email*
> ------------------------------
> Greetings,
>
> This is Faraz Niyaghi from Oregon State University. I research on variable
> selection using random forest. To the best of my knowledge, there is a
> difference between scikit-learn's and Breiman's definition of feature
> importance. Breiman uses out of bag (oob) cases to calculate feature
> importance but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
> Here are the definitions I found on the web:
>
> *Breiman:* "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>
>
> *scikit-learn: *" The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
> Link: http://scikit-learn.org/stable/modules/ensemble.html
>
>
> Thank you for reading this email. Please let me know your thoughts.
>
> Cheers,
> Faraz.
>
> Faraz Niyaghi
>
> Ph.D. Candidate, Department of Statistics
> Oregon State University
> Corvallis, OR
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
From g.louppe at gmail.com Sat May 5 05:21:17 2018
From: g.louppe at gmail.com (Gilles Louppe)
Date: Sat, 05 May 2018 09:21:17 +0000
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
Hi,
See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
point of view regarding the "issue" with feature importances. TLDR: Feature
importances as we have them in scikit-learn (i.e. MDI) are provably **not**
biased, provided trees are built totally at random (as in ExtraTrees with
max_features=1) and the depth is controlled via min_samples_split (to avoid
splitting on noise). On the other hand, it is not always clear what you
actually compute with MDA (permutation-based importances), since it is
conditioned on the model you use.
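As a minimal sketch of that setting (totally randomized trees via max_features=1, depth limited through min_samples_split; the synthetic dataset and all settings here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# First two features are informative, the rest are noise (shuffle=False
# keeps the informative features in the first columns).
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

# Totally randomized trees: one candidate feature per split, depth
# controlled via min_samples_split to avoid splitting on noise.
et = ExtraTreesClassifier(n_estimators=200, max_features=1,
                          min_samples_split=20, random_state=0).fit(X, y)
print(et.feature_importances_)  # MDI scores; they sum to 1
```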
Gilles
On Sat, 5 May 2018 at 10:36, Guillaume Lemaître wrote:
> +1 on the post pointed out by Jeremiah.
> On 5 May 2018 at 02:08, Johnson, Jeremiah
wrote:
>> Faraz, take a look at the discussion of this issue here:
http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>> Best,
>> Jeremiah
>> =========================================
>> Jeremiah W. Johnson, Ph.D
>> Asst. Professor of Data Science
>> Program Coordinator, B.S. in Analytics & Data Science
>> University of New Hampshire
>> Manchester, NH 03101
>> https://www.linkedin.com/in/jwjohnson314
>> From: scikit-learn on behalf of "Niyaghi, Faraz"
>> Reply-To: Scikit-learn mailing list
>> Date: Friday, May 4, 2018 at 7:10 PM
>> To: "scikit-learn at python.org"
>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
>> Caution - External Email
>> ________________________________
>> Greetings,
>> This is Faraz Niyaghi from Oregon State University. I research on
variable selection using random forest. To the best of my knowledge, there
is a difference between scikit-learn's and Breiman's definition of feature
importance. Breiman uses out of bag (oob) cases to calculate feature
importance but scikit-learn doesn't. I was wondering: 1) why are they
different? 2) can they result in very different rankings of features?
>> Here are the definitions I found on the web:
>> Breiman: "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>> scikit-learn: " The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
>> Link: http://scikit-learn.org/stable/modules/ensemble.html
>> Thank you for reading this email. Please let me know your thoughts.
>> Cheers,
>> Faraz.
>> Faraz Niyaghi
>> Ph.D. Candidate, Department of Statistics
>> Oregon State University
>> Corvallis, OR
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Sat May 5 09:16:50 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 5 May 2018 15:16:50 +0200
Subject: [scikit-learn] Announcing IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging
Message-ID: <20180505131650.ke323loujdoa2mxr@phare.normalesup.org>
Dear colleagues,
It is my pleasure to announce IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging.
https://paris-saclay-cds.github.io/autism_challenge/
This is a machine-learning challenge on brain-imaging data to achieve the
best prediction of autism spectrum disorder diagnostic status. We are
providing the largest cohort so far to learn such predictive biomarkers,
with more than 2000 individuals.
There is a total of 9000 euros of prizes to win for the best prediction.
The prediction quality will be measured on a large hidden test set to
ensure fairness.
We provide a simple starting kit to serve as a proof of feasibility. We
are excited to see what the community will come up with in terms of
predictive models and of score.
Best,
Gaël
--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
From jeff1evesque at yahoo.com Sat May 5 21:40:34 2018
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Sat, 5 May 2018 21:40:34 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
Message-ID: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Hi guys,
I want to perform some basic data analysis. Does anyone have good recommendations for where I can obtain free datasets? I was thinking of trying to do something related to neuroscience, but Kaggle doesn't have many datasets with this focus.
Thank you,
Jeff Levesque
https://github.com/jeff1evesque
From nicholdav at gmail.com Sat May 5 21:58:54 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Sat, 5 May 2018 21:58:54 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
Hi Jeff,
here's a couple of places to start, I'm sure other people can recommend
more:
https://crcns.org/
https://www.nature.com/sdata/policies/repositories (see under Neuroscience)
There's also the challenge that Gael just announced, predicting autism from
brain imaging data:
https://paris-saclay-cds.github.io/autism_challenge/
https://twitter.com/GaelVaroquaux/status/992752034242879488
--David
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab , Emory University,
Atlanta, GA, USA
On Sat, May 5, 2018 at 9:40 PM, Jeffrey Levesque via scikit-learn <
scikit-learn at python.org> wrote:
> Hi guys,
> I want to perform some basic data analysis. Anyone have good
> recommendations where I can obtain free datasets. I was thinking of trying
> to do something related to neuroscience. But, kaggle doesn't have many
> datasets for this focus.
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From michael.eickenberg at gmail.com Sat May 5 21:59:28 2018
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Sat, 5 May 2018 18:59:28 -0700
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
Hi Jeffrey,
check out these here for neuron data and fmri:
http://crcns.org/
And the ones here for fmri:
https://openfmri.org/
You can get started by installing one of the following packages and using
their dataset downloaders
http://nilearn.github.io/modules/reference.html#module-nilearn.datasets
https://martinos.org/mne/stable/manual/datasets_index.html
Also, there was this kaggle
https://www.kaggle.com/c/decoding-the-human-brain
And probably a bunch of others
Hope that helps!
Michael
On Sat, May 5, 2018 at 6:40 PM, Jeffrey Levesque via scikit-learn <
scikit-learn at python.org> wrote:
> Hi guys,
> I want to perform some basic data analysis. Anyone have good
> recommendations where I can obtain free datasets. I was thinking of trying
> to do something related to neuroscience. But, kaggle doesn't have many
> datasets for this focus.
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From nicholdav at gmail.com Sat May 5 22:04:56 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Sat, 5 May 2018 22:04:56 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To:
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
also (sorry for spamming the list!) should have said the Allen Institute
has a ton of data:
https://www.nwb.org/allen-cell-types-database/
and check out the cool dataset with this paper:
https://figshare.com/articles/Recordings_of_ten_thousand_neurons_in_visual_cortex_during_spontaneous_behaviors/6163622
https://github.com/MouseLand/stringer-pachitariu-et-al-2018a
explainer twitter thread:
https://twitter.com/marius10p/status/988069221941874688
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab , Emory University,
Atlanta, GA, USA
On Sat, May 5, 2018 at 9:58 PM, David Nicholson wrote:
> Hi Jeff,
>
> here's a couple of places to start, I'm sure other people can recommend
> more:
> https://crcns.org/
> https://www.nature.com/sdata/policies/repositories (see under
> Neuroscience)
>
> There's also the challenge that Gael just announced, predicting autism
> from brain imaging data:
> https://paris-saclay-cds.github.io/autism_challenge/
> https://twitter.com/GaelVaroquaux/status/992752034242879488
> --David
>
> David Nicholson, Ph.D.
> nickledave.github.io
> https://github.com/NickleDave
> Prinz lab , Emory
> University, Atlanta, GA, USA
>
> On Sat, May 5, 2018 at 9:40 PM, Jeffrey Levesque via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> Hi guys,
>> I want to perform some basic data analysis. Anyone have good
>> recommendations where I can obtain free datasets. I was thinking of trying
>> to do something related to neuroscience. But, kaggle doesn't have many
>> datasets for this focus.
>>
>> Thank you,
>>
>> Jeff Levesque
>> https://github.com/jeff1evesque
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
From joel.nothman at gmail.com Sat May 5 22:17:36 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 6 May 2018 12:17:36 +1000
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID:
The coef_ available from LinearSVC will be somewhat indicative of the
relative importance of each feature.
But you might want to look into our feature selection documentation:
http://scikit-learn.org/stable/modules/feature_selection.html
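For instance, a hedged sketch along the lines of that documentation page, combining LinearSVC's coef_ with SelectFromModel (the C value and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# An L1-penalized linear SVM drives uninformative coefficients to zero;
# SelectFromModel then keeps the features with non-zero weights.
svc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(svc, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```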
From matti.v.viljamaa at gmail.com Sun May 6 14:01:12 2018
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Sun, 6 May 2018 21:01:12 +0300
Subject: [scikit-learn] Does sklearn.decomposition.TruncatedSVD take
n_components in order? Or can I select which features I want?
Message-ID: <5aef42ea.1c69fb81.779bc.933b@mx.google.com>
Does sklearn.decomposition.TruncatedSVD take n_components in order? Or can I select which features I want?
Reason being that if one uses the "pick features with eigenvalues > 1" principle, then I'd need to tell the SVD algorithm somehow which components it should use.
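One possible workaround sketch (this assumes components come back ordered by decreasing singular value, which appears to be the case: fit more components than needed, then keep a subset afterwards; the dataset and threshold are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X, _ = load_digits(return_X_y=True)

# Fit more components than needed, then keep only those passing the
# "eigenvalue > 1"-style rule on the explained variance.
svd = TruncatedSVD(n_components=20, random_state=0).fit(X)
keep = svd.explained_variance_ > 1.0
X_t = svd.transform(X)[:, keep]
print(X_t.shape)
```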
BR, Matti
Sent from Windows 10 Mail
---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
From santoshmsubedi at gmail.com Tue May 8 03:26:06 2018
From: santoshmsubedi at gmail.com (Santosh Subedi)
Date: Tue, 8 May 2018 16:26:06 +0900
Subject: [scikit-learn] Help me Please!
Message-ID:
Hello,
I'm using Scikit-learn for Gaussian Process Regression (GPR). I'm facing a
problem/confusion regarding GaussianProcessRegressor class. If gp is a
GaussianProcessRegressor, the prediction is given as:
y_pred_test, sigma = gp.predict(x_test, return_std=True)
After printing y_pred_test and sigma, I see that y_pred_test contains a
prediction for every data source (3 data sources per test point). However,
the standard deviation (sigma) is a single value per test point. I want
sigma to be predicted per data source, like y_pred_test. I've asked my
question at StackOverflow at the following link:
https://stackoverflow.com/questions/50185399/insufficient-output-with-predictx-test-return-std-true-in-gaussianprocessre
Could you reply with an appropriate answer to this email or at the
StackOverflow, please?
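For reference, the shapes in question can be reproduced with a small sketch (the data and two-target setup are made up for illustration; note that newer scikit-learn releases may return one std per target rather than one per test point):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(30, 1))
Y_train = np.hstack([np.sin(X_train), np.cos(X_train)])  # two targets

gp = GaussianProcessRegressor().fit(X_train, Y_train)
X_test = np.linspace(0, 10, 5).reshape(-1, 1)

# The mean comes back per target; compare its shape with sigma's.
y_mean, sigma = gp.predict(X_test, return_std=True)
print(y_mean.shape, sigma.shape)
```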
Thank you for your time and consideration.
Kindly Regards,
santobedi
From matti.v.viljamaa at gmail.com Wed May 9 10:08:40 2018
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Wed, 9 May 2018 17:08:40 +0300
Subject: [scikit-learn] How to pick the maximum possible parameters for
algos such as sklearn.decomposition.TruncatedSVD?
Message-ID: <5af300ea.1c69fb81.cc315.65e7@mx.google.com>
How do I pick the maximum possible parameters for algorithms such as sklearn.decomposition.TruncatedSVD?
This algorithm can raise a memory error if memory runs out, but of course one would like to select the maximum possible n_components given the available system memory.
So how do I do that?
Sent from Windows 10 Mail
From carolduncanpc833 at yahoo.com Wed May 9 11:40:52 2018
From: carolduncanpc833 at yahoo.com (Carol Duncan)
Date: Wed, 9 May 2018 15:40:52 +0000 (UTC)
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID: <1570331285.1609254.1525880452333@mail.yahoo.com>
From: bthirion
To: scikit-learn at python.org
Sent: Wednesday, May 2, 2018 12:07 PM
Subject: Re: [scikit-learn] How does multiple target Ridge Regression work in scikit learn?
The alpha parameter is shared for all problems; if you want to use different parameters, you probably want to perform separate fits.
Best,
Bertrand
On 02/05/2018 13:08, Peer Nowack wrote:
Hi all,
I am struggling to understand the following: Scikit-learn offers a multiple output version for Ridge Regression, simply by handing over a 2D array [n_samples, n_targets], but how is it implemented?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
Is it correct to assume that each regression for each target is independent? Under these circumstances, how can I adapt this to use individual alpha regularization parameters for each regression? If I use GridSearchCV, I would have to hand over a matrix of possible regularization parameters, or how would that work?
Thanks in advance - I have been searching for hours but could not find anything on this topic.
Peter
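A sketch of per-target regularization (synthetic data; Ridge also appears to accept an array of one alpha per target, which should match fitting each target separately with its own alpha):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
Y = X @ rng.randn(4, 2) + 0.1 * rng.randn(100, 2)  # two targets

# Per-target penalties passed as an array of shape (n_targets,)...
ridge = Ridge(alpha=np.array([0.1, 10.0])).fit(X, Y)

# ...compared against fitting each target separately with its own alpha.
separate = np.column_stack(
    [Ridge(alpha=a).fit(X, Y[:, i]).predict(X)
     for i, a in enumerate([0.1, 10.0])]
)
print(np.allclose(ridge.predict(X), separate))
```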
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
From dylanf123 at gmail.com Thu May 10 03:08:07 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Thu, 10 May 2018 17:08:07 +1000
Subject: [scikit-learn] Unable to run make test-coverage
Message-ID:
Hi,
I am unable to run make test-coverage.
I get the error:
rm -rf coverage .coverage
pytest sklearn --showlocals -v --cov=sklearn --cov-report=html:coverage
usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov=sklearn
--cov-report=html:coverage
inifile: /Users/dylan/scikit-learn/setup.cfg
rootdir: /Users/dylan/scikit-learn
make: *** [test-coverage] Error 2
Regards,
Dylan
From joel.nothman at gmail.com Thu May 10 03:22:12 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 10 May 2018 17:22:12 +1000
Subject: [scikit-learn] Unable to run make test-coverage
In-Reply-To:
References:
Message-ID:
Do you have pytest-cov installed?
From dylanf123 at gmail.com Thu May 10 05:29:34 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Thu, 10 May 2018 19:29:34 +1000
Subject: [scikit-learn] Unable to run make test-coverage
In-Reply-To:
References:
Message-ID:
On Thu, May 10, 2018 at 5:22 PM, Joel Nothman
wrote:
> Do you have pytest-cov installed??
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
Thanks, I installed it and it works now.
From reismc at gmail.com Sat May 12 10:26:05 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Sat, 12 May 2018 11:26:05 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
Message-ID:
The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my computer
without any warning message!
I am using WinPython 3.6.5 64 bit.
The method works normally with the original data, but freezes when I use
the normalized data (between 0 and 1).
What should I do?
Att.,
Mauricio Reis
From awnystrom at gmail.com Sat May 12 18:20:32 2018
From: awnystrom at gmail.com (Andrew Nystrom)
Date: Sat, 12 May 2018 15:20:32 -0700
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
Message-ID:
If you're L2-norming your data, you're making it live on the surface of a
hypersphere. That surface will have a high density of points and may not
have areas of low density, in which case the entire surface could be
recognized as a single cluster if epsilon is high enough and min neighbors
is low enough. I'd suggest not using the L2 norm with DBSCAN.
On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my computer
> without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I use
> the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From rth.yurchak at gmail.com Sun May 13 04:34:42 2018
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sun, 13 May 2018 10:34:42 +0200
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
Message-ID: <801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Could you please check memory usage while running DBSCAN to make sure
freezing is due to running out of memory and not to something else?
Which parameters do you run DBSCAN with? Changing algorithm, leaf_size
parameters and ensuring n_jobs=1 could help.
Assuming eps is reasonable, I think it shouldn't be an issue to run
DBSCAN on L2 normalized data: using the default euclidean metric, this
should produce somewhat similar results to clustering not normalized
data with metric='cosine'.
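The suggested settings can be tried in a small sketch (the eps values and data are arbitrary; note that metric='cosine' requires the brute-force neighbor search):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)

# Tree-based neighbor search keeps memory bounded compared to the brute
# path on large data; n_jobs=1 avoids duplicating work per worker.
db = DBSCAN(eps=0.5, algorithm="ball_tree", leaf_size=30, n_jobs=1)
labels_l2 = db.fit_predict(normalize(X))  # L2-normalized, euclidean metric

# Roughly comparable alternative: cosine metric on the raw data.
labels_cos = DBSCAN(eps=0.1, metric="cosine",
                    algorithm="brute").fit_predict(X)
print(len(set(labels_l2)), len(set(labels_cos)))
```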
On 13/05/18 00:20, Andrew Nystrom wrote:
> If you're l2 norming your data, you're making it live on the surface of
> a hypersphere. That surface will have a high density of points and may
> not have areas of low density, in which case the entire surface could be
> recognized as a single cluster if epsilon is high enough and min
> neighbors is low enough. I'd suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From reismc at gmail.com Sun May 13 19:23:15 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Sun, 13 May 2018 20:23:15 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
I think the problem is due to the size of my database, which has 44,000
records. When I ran tests with reduced sizes (the first 10,000 and 20,000
records), the routine ran normally.
You asked me to check the memory while running the DBScan routine, but I do
not know how to do that (if I did, I would have done it already).
I think the routine is not ready to work with this much data. The problem is
that my computer freezes and I cannot analyze the case. I've tried to
figure out whether any changes help (like changing routine parameters), but all
alternatives with lots of data (about 40,000 records) produce the error.
I believe the package routines have no exception handling, to improve
performance. So I suggest providing a test version that shows a proper
message when an error occurs.
To summarize: 1) How do I check the computer's memory during execution of
the routine? 2) I suggest developing test versions of routines that may run
into memory errors.
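Regarding 1), one stdlib-only way to watch Python-level allocations is tracemalloc (note it may miss memory allocated inside compiled extensions, so the OS task manager is still worth watching; the workload below is just a stand-in for the clustering call):

```python
import tracemalloc

import numpy as np

tracemalloc.start()

data = np.random.rand(1000, 10)  # stand-in for loading the 44,000 records
work = data @ data.T             # stand-in for the expensive fit() call

# current = memory traced right now, peak = high-water mark since start()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()
```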
Att.,
Mauricio Reis
2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure
> freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size
> parameters and ensuring n_jobs=1 could help.
>
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN
> on L2 normalized data: using the default euclidean metric, this should
> produce somewhat similar results to clustering not normalized data with
> metric='cosine'.
>
> On 13/05/18 00:20, Andrew Nystrom wrote:
>
>> If you're l2 norming your data, you're making it live on the surface of a
>> hypersphere. That surface will have a high density of points and may not
>> have areas of low density, in which case the entire surface could be
>> recognized as a single cluster if epsilon is high enough and min neighbors
>> is low enough. I'd suggest not using l2 norm with DBSCAN.
>> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>>
>> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
>> computer without any warning message!
>>
>> I am using WinPython 3.6.5 64 bit.
>>
>> The method works normally with the original data, but freezes when I
>> use the normalized data (between 0 and 1).
>>
>> What should I do?
>>
>> Att.,
>> Mauricio Reis
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From chema at rinzewind.org Sun May 13 19:44:34 2018
From: chema at rinzewind.org (=?iso-8859-1?Q?Jos=E9_Mar=EDa?= Mateos)
Date: Sun, 13 May 2018 19:44:34 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <20180513234434.GA3210@equipaje>
On Sun, May 13, 2018 at 08:23:15PM -0300, Mauricio Reis wrote:
> To summarize: 1) How to check the memory of the computer during the
> execution of the routine? 2) I suggest developing test versions of routines
> that may have a memory error.
If you are on Linux, can you just run "top" while your script runs? That
will tell you how much memory is being used by each process. On Windows,
you can use the Task Manager to obtain similar results.
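For reference, a minimal sketch of checking peak memory from inside the Python process itself, using only the standard library (Unix-only; note that `ru_maxrss` is reported in KiB on Linux but in bytes on macOS):

```python
import resource

def peak_rss_mib():
    """Peak resident set size of this process, in MiB (assumes Linux KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# ... run the memory-hungry call here, e.g. DBSCAN(...).fit(X), then:
print("peak memory: %.1f MiB" % peak_rss_mib())
```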
Cheers,
--
José María (Chema) Mateos
https://rinzewind.org/blog-es || https://rinzewind.org/blog-en
From mail at sebastianraschka.com Sun May 13 20:16:16 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Sun, 13 May 2018 20:16:16 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <1EA93B26-5892-4D85-9FE7-51F32B06C8DF@sebastianraschka.com>
> So I suggest that there is a test version that shows a proper message when an error occurs.
I think the freezing that happens in your case is operating system specific, and it would require some weird workarounds to detect at which RAM usage a given combination of machine and operating system might freeze (e.g., I have never observed my system freezing when I run out of RAM, since it has a pretty swift SSD to swap to, but the sklearn process may then take a very long time to finish). Plus, scikit-learn would need to know and constantly check how much memory is being used and currently available (due to the use of other apps and the OS kernel), which wouldn't be feasible.
I am not sure if this helps (depending on where the memory-usage bottleneck is), but it might help to provide a sparse (CSR) array instead of a dense one to the .fit() method. Another thing to try would be to pre-compute the distances and give those to the .fit() method after initializing the DBSCAN object with metric='precomputed'.
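The sparse-input suggestion could look like the sketch below (toy data; zeroing the negative entries just manufactures sparsity for the example, and the conversion only saves memory if the feature matrix genuinely contains many zeros):

```python
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
X[X < 0] = 0                      # artificial sparsity for the example
X_sparse = sparse.csr_matrix(X)   # CSR stores only the nonzero entries

# DBSCAN accepts sparse input directly; neighbor search uses brute force.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_sparse)
```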
Best,
Sebastian
> On May 13, 2018, at 7:23 PM, Mauricio Reis wrote:
>
> I think the problem is due to the size of my database, which has 44,000 records. When I ran a database test with reduced sizes (10,000 and 20,000 first records), the routine ran normally.
>
> You ask me to check the memory while running the DBScan routine, but I do not know how to do that (if I did, I would have done that already).
>
> I think the routine is not ready to work with too much data. The problem is that my computer freezes and I can not analyze the case. I've tried to figure out if any changes work (like changing routine parameters), but all alternatives with lots of data (about 40,000 records) generate error.
>
> I believe that package routines have no exception handling to improve performance. So I suggest that there is a test version that shows a proper message when an error occurs.
>
> To summarize: 1) How to check the memory of the computer during the execution of the routine? 2) I suggest developing test versions of routines that may have a memory error.
>
> Att.,
> Mauricio Reis
>
> 2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size parameters and ensuring n_jobs=1 could help.
>
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN on L2 normalized data: using the default euclidean metric, this should produce somewhat similar results to clustering not normalized data with metric='cosine'.
>
> On 13/05/18 00:20, Andrew Nystrom wrote:
> If you're l2 norming your data, you're making it live on the surface of a hypersphere. That surface will have a high density of points and may not have areas of low density, in which case the entire surface could be recognized as a single cluster if epsilon is high enough and min neighbors is low enough. I'd suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Sun May 13 22:59:15 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 May 2018 12:59:15 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
This is quite a common issue with our implementation of DBSCAN, and
improvements to documentation would be very, very welcome.
The high memory cost comes from constructing the pairwise radius neighbors
for all points. If using a distance metric that cannot be indexed with a
KD-tree or Ball Tree, this results in n^2 floats being stored in memory
even before the radius neighbors are computed.
You have the following strategies available to you currently:
1. Calculate the radius neighborhoods using radius_neighbors_graph in
chunks, so as to avoid all pairs being calculated and stored at once. This
produces a sparse graph representation, which can be passed into dbscan
with metric='precomputed'. (I've just seen Sebastian suggested the same.)
2. Reduce the number of samples in your dataset and represent
(near-)duplicate points with sample_weight (i.e. two identical points would
be merged but would have a sample_weight of 2).
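Strategy 1 might be sketched as follows (toy data and an illustrative eps; radius_neighbors_graph returns a sparse graph holding only within-eps distances, which is then fed to DBSCAN with metric='precomputed'):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import radius_neighbors_graph

X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
eps = 0.5

# Sparse graph holding, for every point, only the distances to neighbours
# closer than eps; the dense n^2 pairwise matrix is never materialised.
D = radius_neighbors_graph(X, radius=eps, mode='distance')

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D)
```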
There is also a proposal to offer an alternative memory-efficient mode at
https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is welcome.
Cheers,
Joel
From joel.nothman at gmail.com Sun May 13 23:07:21 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 May 2018 13:07:21 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
Note that this has long been documented under "Memory consumption for large
sample sizes" at
http://scikit-learn.org/stable/modules/clustering.html#dbscan
On 14 May 2018 at 12:59, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius neighbors
> for all points. If using a distance metric that cannot be indexed with a
> KD-tree or Ball Tree, this results in n^2 floats being stored in memory
> even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once. This
> produces a sparse graph representation, which can be passed into dbscan
> with metric='precomputed'. (I've just seen Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points would
> be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode at
> https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
From dylanf123 at gmail.com Mon May 14 09:39:29 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Mon, 14 May 2018 23:39:29 +1000
Subject: [scikit-learn] New algorithm suggestion - AODE
Message-ID:
Hello,
I would like to suggest a new classification algorithm for scikit-learn,
Averaged one-dependence estimators (AODE).
AODE achieves highly accurate classification by averaging over all of a
small space of alternative naive-Bayes-like models that have weaker (and
hence less detrimental) independence assumptions than naive Bayes. The
resulting algorithm is computationally efficient while delivering highly
accurate classification on many learning tasks. For more information, see
the paper (https://link.springer.com/article/10.1007/s10994-005-4258-6),
which has over 200 citations.
There is an existing implementation in the WEKA machine learning suite (
http://weka.sourceforge.net/doc.stable/weka/classifiers/bayes/AODE.html).
I've made a pull request and I would like some feedback (
https://github.com/scikit-learn/scikit-learn/pull/11093).
Thank You,
Dylan
From t3kcit at gmail.com Wed May 16 13:27:40 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:27:40 -0400
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID: <3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
I don't think that's how most people use the trees, though.
Probably not even the ExtraTrees.
I really need to get around to reading your thesis :-/
Do you recommend using max_features=1 with ExtraTrees?
On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> Hi,
>
> See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> point of view regarding the "issue" with feature importances. TLDR: Feature
> importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> biased, provided trees are built totally at random (as in ExtraTrees with
> max_features=1) and the depth is controlled by min_samples_split (to avoid
> splitting on noise). On the other hand, it is not always clear what you
> actually compute with MDA (permutation based importances), since it is
> conditioned on the model you use.
>
> Gilles
> > On Sat, 5 May 2018 at 10:36, Guillaume Lemaître
> wrote:
>
>> +1 on the post pointed out by Jeremiah.
>> On 5 May 2018 at 02:08, Johnson, Jeremiah
> wrote:
>
>>> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
>>> Best,
>>> Jeremiah
>>> =========================================
>>> Jeremiah W. Johnson, Ph.D
>>> Asst. Professor of Data Science
>>> Program Coordinator, B.S. in Analytics & Data Science
>>> University of New Hampshire
>>> Manchester, NH 03101
>>> https://www.linkedin.com/in/jwjohnson314
>>> From: scikit-learn unh.edu at python.org> on behalf of "Niyaghi, Faraz"
>>> Reply-To: Scikit-learn mailing list
>>> Date: Friday, May 4, 2018 at 7:10 PM
>>> To: "scikit-learn at python.org"
>>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> Importance
>
>>> Caution - External Email
>>> ________________________________
>>> Greetings,
>>> This is Faraz Niyaghi from Oregon State University. I research on
> variable selection using random forest. To the best of my knowledge, there
> is a difference between scikit-learn's and Breiman's definition of feature
> importance. Breiman uses out of bag (oob) cases to calculate feature
> importance but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
>>> Here are the definitions I found on the web:
>>> Breiman: "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
>>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>>> scikit-learn: " The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
>>> Link: http://scikit-learn.org/stable/modules/ensemble.html
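Breiman's MDA definition quoted above can be approximated by hand; the sketch below (toy data) permutes each variable on a single held-out split rather than on the oob cases of each tree, so it is only an approximation of Breiman's per-tree oob procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)           # accuracy on untouched held-out data

rng = np.random.RandomState(0)
importances = []
for m in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, m] = rng.permutation(X_perm[:, m])  # destroy variable m's information
    importances.append(base - rf.score(X_perm, y_te))
```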
>>> Thank you for reading this email. Please let me know your thoughts.
>>> Cheers,
>>> Faraz.
>>> Faraz Niyaghi
>>> Ph.D. Candidate, Department of Statistics
>>> Oregon State University
>>> Corvallis, OR
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed May 16 13:37:36 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:37:36 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
You might also consider looking at hdbscan:
https://github.com/scikit-learn-contrib/hdbscan
On 05/13/2018 11:07 PM, Joel Nothman wrote:
> Note that this has long been documented under "Memory consumption for
> large sample sizes" at
> http://scikit-learn.org/stable/modules/clustering.html#dbscan
>
> On 14 May 2018 at 12:59, Joel Nothman wrote:
>
> This is quite a common issue with our implementation of DBSCAN,
> and improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot
> be indexed with a KD-tree or Ball Tree, this results in n^2 floats
> being stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph
> in chunks, so as to avoid all pairs being calculated and stored at
> once. This produces a sparse graph representation, which can be
> passed into dbscan with metric='precomputed'. (I've just seen
> Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical
> points would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient
> mode at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback
> is welcome.
>
> Cheers,
>
> Joel
>
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed May 16 13:44:17 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:44:17 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
Should we have "low memory"/batched version of k_neighbors_graph and
epsilon_neighbors_graph functions? I assume
those instantiate the dense matrix right now.
On 05/13/2018 10:59 PM, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot be
> indexed with a KD-tree or Ball Tree, this results in n^2 floats being
> stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once.
> This produces a sparse graph representation, which can be passed into
> dbscan with metric='precomputed'. (I've just seen Sebastian suggested
> the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points
> would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode
> at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Wed May 16 13:50:07 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 May 2018 19:50:07 +0200
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
Message-ID: <20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
On Wed, May 16, 2018 at 01:44:17PM -0400, Andreas Mueller wrote:
> Should we have "low memory"/batched version of k_neighbors_graph and
> epsilon_neighbors_graph functions? I assume
> those instantiate the dense matrix right now.
+1!
It shouldn't be too hard to do.
G
From g.louppe at gmail.com Wed May 16 14:08:59 2018
From: g.louppe at gmail.com (Gilles Louppe)
Date: Wed, 16 May 2018 20:08:59 +0200
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To: <3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
References:
<3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
Message-ID:
> Do you recommend using max_features=1 with ExtraTrees?
If what you want are feature importances that reflect, without 'bias', the
mutual information of each variable (alone or in combination with others)
with Y, then yes. Bonus points if you set min_impurity_decrease > 0, to
avoid splitting on noise and collecting that as part of the importance
scores.
The resulting forest will not be optimal with respect to
classification/regression performance though.
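A sketch of the setup Gilles describes (toy data; the exact min_impurity_decrease threshold here is an arbitrary illustrative value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

# max_features=1: the split variable is chosen fully at random (totally
# randomized trees); min_impurity_decrease > 0 discourages splits on noise.
forest = ExtraTreesClassifier(n_estimators=200, max_features=1,
                              min_impurity_decrease=1e-3,
                              random_state=0).fit(X, y)
importances = forest.feature_importances_  # MDI scores, normalized to sum to 1
```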
On Wed, 16 May 2018 at 19:29, Andreas Mueller wrote:
> I don't think that's how most people use the trees, though.
> Probably not even the ExtraTrees.
> I really need to get around to reading your thesis :-/
> Do you recommend using max_features=1 with ExtraTrees?
> On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> > Hi,
> >
> > See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> > point of view regarding the "issue" with feature importances. TLDR: Feature
> > importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> > biased, provided trees are built totally at random (as in ExtraTrees with
> > max_features=1) and the depth is controlled by min_samples_split (to avoid
> > splitting on noise). On the other hand, it is not always clear what you
> > actually compute with MDA (permutation based importances), since it is
> > conditioned on the model you use.
> >
> > Gilles
> > On Sat, 5 May 2018 at 10:36, Guillaume Lemaître
> > wrote:
> >
> >> +1 on the post pointed out by Jeremiah.
> >> On 5 May 2018 at 02:08, Johnson, Jeremiah
> > wrote:
> >
> >>> Faraz, take a look at the discussion of this issue here:
> > http://parrt.cs.usfca.edu/doc/rf-importance/index.html
> >
> >>> Best,
> >>> Jeremiah
> >>> =========================================
> >>> Jeremiah W. Johnson, Ph.D
> >>> Asst. Professor of Data Science
> >>> Program Coordinator, B.S. in Analytics & Data Science
> >>> University of New Hampshire
> >>> Manchester, NH 03101
> >>> https://www.linkedin.com/in/jwjohnson314
> >>> From: scikit-learn > unh.edu at python.org> on behalf of "Niyaghi, Faraz" <
niyaghif at oregonstate.edu>
> >>> Reply-To: Scikit-learn mailing list
> >>> Date: Friday, May 4, 2018 at 7:10 PM
> >>> To: "scikit-learn at python.org"
> >>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> > Importance
> >
> >>> Caution - External Email
> >>> ________________________________
> >>> Greetings,
> >>> This is Faraz Niyaghi from Oregon State University. I research on
> > variable selection using random forest. To the best of my knowledge, there
> > is a difference between scikit-learn's and Breiman's definition of feature
> > importance. Breiman uses out of bag (oob) cases to calculate feature
> > importance but scikit-learn doesn't. I was wondering: 1) why are they
> > different? 2) can they result in very different rankings of features?
> >
> >>> Here are the definitions I found on the web:
> >>> Breiman: "In every tree grown in the forest, put down the oob cases and
> > count the number of votes cast for the correct class. Now randomly permute
> > the values of variable m in the oob cases and put these cases down the
> > tree. Subtract the number of votes for the correct class in the
> > variable-m-permuted oob data from the number of votes for the correct class
> > in the untouched oob data. The average of this number over all trees in the
> > forest is the raw importance score for variable m."
> >>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> >>> scikit-learn: "The relative rank (i.e. depth) of a feature used as a
> > decision node in a tree can be used to assess the relative importance of
> > that feature with respect to the predictability of the target variable.
> > Features used at the top of the tree contribute to the final prediction
> > decision of a larger fraction of the input samples. The expected fraction
> > of the samples they contribute to can thus be used as an estimate of the
> > relative importance of the features."
> >>> Link: http://scikit-learn.org/stable/modules/ensemble.html
> >>> Thank you for reading this email. Please let me know your thoughts.
> >>> Cheers,
> >>> Faraz.
> >>> Faraz Niyaghi
> >>> Ph.D. Candidate, Department of Statistics
> >>> Oregon State University
> >>> Corvallis, OR
> >>> _______________________________________________
> >>> scikit-learn mailing list
> >>> scikit-learn at python.org
> >>> https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> >
> >> --
> >> Guillaume Lemaitre
> >> INRIA Saclay - Parietal team
> >> Center for Data Science Paris-Saclay
> >> https://glemaitre.github.io/
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Wed May 16 19:33:01 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 17 May 2018 09:33:01 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
Implemented in a previous version of #10280, but removed for now to
simplify reviews.
If others would like to review #10280, I'm happy to follow up with the
changes requested here, which have already been implemented by Aman Dalmia
and myself.
From reismc at gmail.com Thu May 17 10:37:14 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Thu, 17 May 2018 11:37:14 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
I am not used to the terms used here. So I understood that the package had
memory management, which was removed, but that you could make the code with
the memory management implementations available. Is that right?! :-)
The problem is that I do not know what I would do with the code, because I
only know how to work with the ready-made scikit-learn package. :-(
Att.,
Mauricio Reis
2018-05-16 20:33 GMT-03:00 Joel Nothman :
> Implemented in a previous version of #10280, but removed for now to
> simplify reviews.
> If others would like to review #10280, I'm happy to follow up with the
> changes requested here, which have already been implemented by Aman Dalmia
> and myself.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From joel.nothman at gmail.com Thu May 17 18:02:56 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 18 May 2018 08:02:56 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
There are two issues here:
1. We store all radius neighborhoods of all points in memory at once. This
is a problem if each point has a large radius neighborhood. DBSCAN only
requires that you store the radius neighbors of the point you are currently
examining. We could provide a memory-efficient mode that would do so.
2. Given that we store all neighborhoods at once, a brute force nearest
neighbors search will take O(n^2) memory, which can be reduced by chunking
the operation.
Both solutions have patches available already, but not reviewed.
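The chunked computation in issue 2 can be sketched as follows (toy data; the chunk size is arbitrary). Querying the fitted NearestNeighbors index one block of rows at a time and stacking the sparse results keeps peak memory proportional to chunk * n rather than n^2:

```python
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)
eps, chunk = 0.5, 500

nn = NearestNeighbors(radius=eps).fit(X)
blocks = []
for start in range(0, X.shape[0], chunk):
    # Each query returns a sparse (chunk, n) slice of the radius graph,
    # so only chunk x n distances are ever in flight at once.
    blocks.append(nn.radius_neighbors_graph(X[start:start + chunk],
                                            mode='distance'))
G = sparse.vstack(blocks)

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(G)
```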
On 18 May 2018 at 00:37, Mauricio Reis wrote:
> I'm not used to the terms used here. So I understood that the package had
> memory management, which was removed. But you could make the code available
> with memory management implementations. Is it?! :-)
> The problem is that I do not know what I would do with the code, because I
> only know how to work with the SciKitLearn package ready. :-(
>
> Att.,
> Mauricio Reis
>
> 2018-05-16 20:33 GMT-03:00 Joel Nothman :
>
>> Implemented in a previous version of #10280, but removed for now to
>> simplify reviews.
>> If others would like to review #10280, I'm happy to follow up with the
>> changes requested here, which have already been implemented by Aman Dalmia
>> and myself.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From valerio.maggio at gmail.com Fri May 18 07:10:59 2018
From: valerio.maggio at gmail.com (Valerio Maggio)
Date: Fri, 18 May 2018 13:10:59 +0200
Subject: [scikit-learn] CFP: EuroSciPy 2018 - 11th European Conference on
Python in Science
Message-ID:
*** Apologies if you receive multiple copies ***
Dear Colleagues,
We are delighted to invite you to join us for the *11th European Conference
on Python in Science*.
The EuroSciPy 2018 Conference will be
organised by Fondazione Bruno Kessler (FBK) and will take place from
August, 28th to September, 1st in *Trento, Italy*.
The EuroSciPy meeting is a cross-disciplinary gathering focused on the use
and development of the Python language in scientific research. This event
strives to bring together both users and developers of scientific tools, as
well as people from academic research and state-of-the-art industry.
The conference will be structured as follows:
- *Aug, 28-29 *: Tutorials and Hands-on
- *Aug, 30-31 *: Main Conference
- *Sep, 1 *: Sprint
----------------------------------------------------------------------------------------------------------------
TOPICS OF INTEREST:
Presentations of scientific tools and libraries using the Python language,
including but not limited to:
- Algorithms implemented or exposed in Python
- Astronomy
- Data Visualisation
- Deep Learning & AI
- Earth, Ocean and Geo Science
- General-purpose Python tools that can be of special interest to the
scientific community.
- Image Processing
- Materials Science
- Parallel computing
- Political and Social Sciences
- Project Jupyter
- Reports on the use of Python in scientific achievements or ongoing
projects.
- Robotics & IoT
- Scientific data flow and persistence
- Scientific visualization
- Simulation
- Statistics
- Vector and array manipulation
- Web applications and portals for science and engineering
- 3D Printing
-----------------------------------------------------------------------------------------------------------------
CALL FOR PROPOSALS:
EuroScipy will accept three different kinds of contributions:
- *Regular Talks*: standard talks for oral presentations, allocated in
time slots of `15` or `30` minutes, depending on your preference and
scheduling constraints. Each time slot includes a Q&A session at the end
of the talk (at least 5 minutes).
- *Hands-on Tutorials*: These are *beginner* or *advanced* training
sessions that dive into the subject in full detail. These sessions are 90
minutes long, and the audience is strongly encouraged to bring a
laptop to experiment. For a sneak peek of last year's tutorials, here are
the
- *Posters*: EuroScipy will host two poster sessions during the two days
of the Main Conference. Attendees and students are highly encouraged to
present their work and/or preliminary results as posters.
Proposals should be submitted using the EuroScipy submission system at
https://pretalx.com/euroscipy18. Submission deadline is *May, 31st 2018.*
----------------------------------------------------------------------------------------------------------------
REGISTRATION & FEES:
To register to EuroScipy 2018, please go to euroscipy2018.eventbrite.co.uk or
to http://www.euroscipy.org/2018
*Registration fees:*

*Tutorials Aug, 28th-29th 2018*
                               Student*   Academic/Individual   Industry
Early Bird (till July, 1st)    €50        €70                   €125
Regular (till Aug, 5th)        €100       €110                  €250
Late (till Aug, 22nd)          €135       €135                  €300
You register for one of the two tutorial tracks (introductory or advanced)
but you can switch between both tracks whenever you want as long as there
is enough space in the lecture rooms.
*Main Conference Aug, 30th-31st 2018*
                               Student*   Academic/Individual   Industry
Early Bird (till July, 1st)    €50        €70                   €125
Regular (till Aug, 5th)        €100       €110                  €250
Late (till Aug, 22nd)          €135       €135                  €300
* A proof of student status will be required at time of the registration.
Best regards,
EuroScipy 2018 Organising Committee,
Email: info at euroscipy.org | euroscipy at fbk.eu
Website: http://www.euroscipy.org/2018
twitter: @euroscipy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mcasl at unileon.es Fri May 18 08:32:21 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Fri, 18 May 2018 14:32:21 +0200
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a
wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:
Dear Joel,
I've changed the code of PipeGraph to replace the old wrappers with new
Mixin classes. The changes are reflected in this MixinClasses branch:
https://github.com/mcasl/PipeGraph/blob/feature/MixinClasses/pipegraph/adapters.py
My conclusion is that although both approaches are feasible and provide
similar functionality, Mixin classes offer a simpler solution. Following
the 'flat is better than nested' principle, the Mixin classes should be
favoured.
This approach also seems more in line with general sklearn development
practice, so I'll make the necessary changes to the docs and then the
master branch will be replaced with this new Mixin classes version.
Thanks for pointing out this issue!
Best
Manuel
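For illustration, here is a minimal, self-contained sketch of the delegation idea under discussion (this is hypothetical code, not the actual PipeGraph implementation; the `ToyEstimator`, `ParamDelegationMixin`, and `_strategy` names are made up for the example):

```python
class ToyEstimator:
    """Stand-in for a scikit-learn-style estimator with get_params/set_params."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        # Report the estimator's hyperparameters as a dict.
        return {"alpha": self.alpha}

    def set_params(self, **params):
        # Update hyperparameters in place and return self, sklearn-style.
        for name, value in params.items():
            setattr(self, name, value)
        return self


class ParamDelegationMixin:
    """Mixin that forwards get_params/set_params to the wrapped estimator."""

    def get_params(self, deep=True):
        return self._strategy.get_params(deep=deep)

    def set_params(self, **params):
        self._strategy.set_params(**params)
        return self


class Wrapper(ParamDelegationMixin):
    """Adapter that wraps an estimator and delegates parameter handling."""

    def __init__(self, estimator):
        self._strategy = estimator


wrapped = Wrapper(ToyEstimator())
wrapped.set_params(alpha=0.5)          # forwarded to the wrapped estimator
print(wrapped.get_params())            # {'alpha': 0.5}
```

The point of the Mixin variant is that the delegation logic lives in one flat class that any adapter can inherit, rather than being re-implemented in each wrapper.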
2018-04-16 14:21 GMT+02:00 Manuel CASTEJÓN LIMAS :
> Nope! Mostly because of lack of experience with mixins.
> I've done some reading and I think I can come up with a few mixins doing
> the trick by dynamically adding their methods to an already instantiated
> object. I'll play with that and I hope to show you something soon! Or at
> least I will have better grounds to make an educated decision.
> Best
> Manuel
>
>
>
>
> Manuel Castejón Limas
> *Escuela de Ingeniería Industrial e Informática*
> Universidad de León
> Campus de Vegazana s/n.
> 24071. León. Spain.
> *e-mail: *manuel.castejon at unileon.es
> *Tel.*: +34 987 291 946
>
> Digital Business Card: Click Here