From peer.j.nowack at gmail.com Wed May 2 07:08:28 2018
From: peer.j.nowack at gmail.com (Peer Nowack)
Date: Wed, 2 May 2018 12:08:28 +0100
Subject: [scikit-learn] How does multiple target Ridge Regression work in
scikit learn?
Message-ID:
Hi all,
I am struggling to understand the following:
Scikit-learn offers a multiple output version for Ridge Regression, simply
by handing over a 2D array [n_samples, n_targets], but how is it
implemented?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
Is it correct to assume that the regression for each target is
independent? If so, how can I adapt this to use an individual alpha
regularization parameter for each regression? If I use GridSearchCV, would
I have to hand over a matrix of possible regularization parameters? How
would that work?
Thanks in advance - I have been searching for hours but could not find
anything on this topic.
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From bertrand.thirion at inria.fr Wed May 2 08:07:12 2018
From: bertrand.thirion at inria.fr (bthirion)
Date: Wed, 2 May 2018 14:07:12 +0200
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To:
References:
Message-ID: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
The alpha parameter is shared across all problems; if you want to use
different parameters, you probably want to perform separate fits.
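A minimal sketch of the separate-fits approach, with hypothetical per-target
alpha values and random data chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                 # (n_samples, n_features)
Y = rng.randn(100, 3)                 # (n_samples, n_targets)
alphas = [0.1, 1.0, 10.0]             # hypothetical: one alpha per target

# fit one independent Ridge model per target column
models = [Ridge(alpha=a).fit(X, Y[:, j]) for j, a in enumerate(alphas)]
predictions = np.column_stack([m.predict(X) for m in models])
print(predictions.shape)              # (100, 3)
```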
Best,
Bertrand
From peer.j.nowack at gmail.com Wed May 2 09:02:33 2018
From: peer.j.nowack at gmail.com (Peer Nowack)
Date: Wed, 2 May 2018 14:02:33 +0100
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID:
Thanks, Bertrand - very helpful. I needed to confirm this.
Peter
On 2 May 2018 at 13:07, bthirion wrote:
> The alpha parameter is shared across all problems; if you want to use
> different parameters, you probably want to perform separate fits.
> Best,
>
> Bertrand
From michael.eickenberg at gmail.com Wed May 2 14:32:31 2018
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Wed, 2 May 2018 11:32:31 -0700
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To:
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID:
By the linear nature of the problem the targets are always separately
treated (even if there was a matrix-variate normal prior indicating
covariance between target columns, you could do that adjustment before or
after fitting).
As for different alpha parameters, I think you can specify a different
alpha per target if you pass in an array of shape (n_targets,). Maybe this
is not implemented for all solvers, but it should be at least for some.
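If array-valued alpha is supported as Michael describes, a small sketch would
look like this (random data; shapes are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 4)                          # (n_samples, n_features)
Y = rng.randn(50, 3)                          # (n_samples, n_targets)

# one regularization strength per target column
ridge = Ridge(alpha=np.array([0.1, 1.0, 10.0]))
ridge.fit(X, Y)
print(ridge.coef_.shape)                      # (3, 4): one row per target
```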
If you grid search, the scikit-learn API requires the score to be a single
number, so it's non-trivial to optimize different alphas for different
targets (even though selecting the best alpha for each target will of
course make the summed error go down, too).
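One workaround that stays within the scikit-learn API is to run a separate
small grid search per target column, so each search still produces a scalar
score (the grid values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
Y = rng.randn(80, 3)

best_alphas = []
for j in range(Y.shape[1]):
    # each search sees a single 1-D target, so its score is one number
    gs = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    gs.fit(X, Y[:, j])
    best_alphas.append(gs.best_params_["alpha"])
print(best_alphas)
```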
Depending on what your use case is, it may be easier to just write your own:
If X = U S VT (SVD), then weights = VT.T.dot((S / (S ** 2 + alpha) * U).T.dot(Y)).
For more than one alpha:

# alphas.shape == (n_alphas, n_targets)
# Y.shape == (n_samples, n_targets)
# X.shape == (n_samples, n_features)
U, S, VT = np.linalg.svd(X, full_matrices=False)
# ridge shrinkage factors s / (s**2 + alpha), one per (singular value, target)
diags = S[np.newaxis, :, np.newaxis] / (S[np.newaxis, :, np.newaxis] ** 2
                                        + alphas[:, np.newaxis, :])
UTY = U.T.dot(Y)
weights = np.zeros([n_alphas, n_features, n_targets])
for i in range(alphas.shape[0]):
    weights[i] = VT.T.dot(diags[i] * UTY)
Then use those weights to predict.
Michael
From princejha616 at gmail.com Thu May 3 02:53:20 2018
From: princejha616 at gmail.com (prince jha)
Date: Thu, 3 May 2018 12:23:20 +0530
Subject: [scikit-learn] Project Contribution
Message-ID:
Hello everyone, I am willing to contribute to the scikit-learn open source
project, but since I have never contributed to an open-source project
before, I don't know where to start. I would be thankful if any of you
could help me get started contributing to this great project.
Thanks,
Prince
From ross at cgl.ucsf.edu Thu May 3 03:02:54 2018
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Thu, 3 May 2018 00:02:54 -0700
Subject: [scikit-learn] Project Contribution
In-Reply-To:
References:
Message-ID:
Quick followup from a bystander: have you used scikit-learn for
anything? How much of the code have you read? (me: no, 0)
Bill
From m.ali.jamaoui at gmail.com Thu May 3 03:25:33 2018
From: m.ali.jamaoui at gmail.com (Mohamed Ali Jamaoui)
Date: Thu, 3 May 2018 09:25:33 +0200
Subject: [scikit-learn] Project Contribution
In-Reply-To:
References:
Message-ID:
Hi,
There are many ways to contribute, not only code. You can get started by
reading the "Contributing" section of the "Developer's guide":
http://scikit-learn.org/dev/developers/contributing.html
For code contributions, you don't need to read the whole codebase to be
able to contribute; try to pave your way into it gradually. A good first
step would be to start with issues labeled "good first issue".
Welcome onboard :)
Regards,
Mohamed Ali JAMAOUI
From princejha616 at gmail.com Thu May 3 03:48:14 2018
From: princejha616 at gmail.com (prince jha)
Date: Thu, 3 May 2018 13:18:14 +0530
Subject: [scikit-learn] Project Contribution
Message-ID:
Hi Bill, I have actually used scikit-learn for solving problems available
on Kaggle, but I am not very proficient since I have not used it much.
Thanks
Prince
From wouterverduin at gmail.com Fri May 4 05:12:40 2018
From: wouterverduin at gmail.com (Wouter Verduin)
Date: Fri, 4 May 2018 11:12:40 +0200
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
Message-ID:
Dear developers of scikit-learn,
I am working on a scientific paper on a prediction model for complications
in major abdominal resections. I have been using scikit-learn to create
that model and got good results (score of 0.94). This makes us want to see
what the model built by scikit-learn actually looks like.
As of now we have 100 input variables, but logically these aren't all as
useful as the others, and we want to reduce this number to about 20 and
see what the effect on the score is.
*My question*: Is there a way to get the underlying formula for the model
out of scikit-learn instead of having it as a 'blackbox' in my svm
function?
At this moment I am predicting a dichotomous variable from 100 variables
(continuous, ordinal and binary).
My code:
import numpy as np
from numpy import *
import pandas as pd
from sklearn import tree, svm, linear_model, metrics, preprocessing
import datetime
from sklearn.model_selection import KFold, cross_val_score, ShuffleSplit, GridSearchCV
from time import gmtime, strftime

# open and prepare the database
file = "/home/wouter/scikit/DB_SCIKIT.csv"
DB = pd.read_csv(file, sep=";", header=0, decimal=',').as_matrix()
DBT = DB
print "Vorm van de DB: ", DB.shape  # shape of the DB

target = []
for i in range(len(DB[:, -1])):
    target.append(DB[i, -1])
DB = delete(DB, s_[-1], 1)  # remove the last (target) column
AantalOutcome = target.count(1)
print "Aantal outcome:", AantalOutcome  # number of positive outcomes
print "Aantal patienten:", len(target)  # number of patients

A = DB
b = target
print len(DBT)

svc = svm.SVC(kernel='linear', cache_size=500, probability=True)
indices = np.random.permutation(len(DBT))
rs = ShuffleSplit(n_splits=5, test_size=.15, random_state=None)
scores = cross_val_score(svc, A, b, cv=rs)
A = ("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print A

X_train = DBT[indices[:-302]]
y_train = []
for i in range(len(X_train[:, -1])):
    y_train.append(X_train[i, -1])
X_train = delete(X_train, s_[-1], 1)  # remove the last (target) column

X_test = DBT[indices[-302:]]
y_test = []
for i in range(len(X_test[:, -1])):
    y_test.append(X_test[i, -1])
X_test = delete(X_test, s_[-1], 1)  # remove the last (target) column

model = svc.fit(X_train, y_train)
print model
uitkomst = model.score(X_test, y_test)
print uitkomst
voorspel = model.predict(X_test)
print voorspel
And output:

Vorm van de DB:  (2011, 101)
Aantal outcome: 128
Aantal patienten: 2011
2011
Accuracy: 0.94 (+/- 0.01)
SVC(C=1.0, cache_size=500, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=True, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.927152317881
[0. 0. 0. ... 0. 0. 0.]   (all 302 test predictions are class 0)
Thanks in advance!
with kind regards,
Wouter Verduin
From mail at sebastianraschka.com Fri May 4 05:51:26 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 May 2018 05:51:26 -0400
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID: <5331A676-D6C6-4F01-8A4D-EDDE9318E08F@sebastianraschka.com>
Dear Wouter,
for the SVM, scikit-learn wraps LIBSVM and LIBLINEAR. I think the scikit-learn class SVC uses LIBSVM for every kernel. Since you are using the linear kernel, you could use the more efficient LinearSVC scikit-learn class to get similar results. Its linear model is in turn easier to handle in terms of your question:
> Is there a way to get the underlying formula for the model out of scikit instead of having it as a 'blackbox' in my svm function.
More specifically, LinearSVC uses the _fit_liblinear code available here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/base.py
And more info on the LIBLINEAR library it is using can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (they have links to technical reports and implementation details there)
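To make the "formula" concrete: for a linear SVM the fitted model is just a
weight vector plus an intercept, exposed as `coef_` and `intercept_`. A
sketch on synthetic data (names and shapes are illustrative, not your
dataset):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LinearSVC(random_state=0).fit(X, y)

# decision function: f(x) = w . x + b; predict class 1 where f(x) > 0
w, b = clf.coef_[0], clf.intercept_[0]
manual = X.dot(w) + b
print(np.allclose(manual, clf.decision_function(X)))  # True
```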
Best,
Sebastian
From david.mo.burns at gmail.com Fri May 4 12:47:20 2018
From: david.mo.burns at gmail.com (David Burns)
Date: Fri, 4 May 2018 12:47:20 -0400
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID:
Hi Wouter,
If you are looking to reduce the feature space for your model, I suggest
you look at the scikit-learn page on doing just that:
http://scikit-learn.org/stable/modules/feature_selection.html
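For example, univariate selection can cut roughly 100 features down to 20 in
a couple of lines (synthetic data standing in for the clinical matrix; k=20
is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in for a (n_samples, 100) feature matrix
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# keep the 20 features with the strongest univariate F-test scores
selector = SelectKBest(f_classif, k=20).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (500, 20)
```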
David
From niyaghif at oregonstate.edu Fri May 4 19:10:44 2018
From: niyaghif at oregonstate.edu (Niyaghi, Faraz)
Date: Fri, 4 May 2018 16:10:44 -0700
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
Message-ID:
Greetings,
This is Faraz Niyaghi from Oregon State University. I research variable
selection using random forests. To the best of my knowledge, there is a
difference between scikit-learn's and Breiman's definitions of feature
importance: Breiman uses out-of-bag (oob) cases to calculate feature
importance, but scikit-learn doesn't. I was wondering: 1) why are they
different? 2) can they result in very different rankings of features?
Here are the definitions I found on the web:
*Breiman:* "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
*scikit-learn:* "The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
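For concreteness, the quantity scikit-learn describes is exposed on a fitted
forest as the `feature_importances_` attribute (toy data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# impurity-based importances, normalized to sum to 1
print(rf.feature_importances_)
```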
Thank you for reading this email. Please let me know your thoughts.
Cheers,
Faraz.
Faraz Niyaghi
Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
From mail at sebastianraschka.com Fri May 4 19:58:03 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 May 2018 19:58:03 -0400
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID: <4B01B139-0D45-4F85-A287-E5B36BC3FE03@sebastianraschka.com>
Not sure how it compares in practice, but it's certainly more efficient to rank the features by impurity decrease rather than by OOB permutation performance, since you wouldn't need to:
a) compute the OOB performance (an extra inference pass)
b) permute a feature column, do another inference pass, and compare it to a)
c) repeat step b) for each feature column
Another reason is that Breiman's suggestion wouldn't work that well for certain RandomForestClassifier settings in scikit-learn, e.g., bootstrap=False (no OOB samples exist then).
If you like to compute the feature importance after Breiman's suggestion, I have implemented a simple wrapper function for scikit-learn estimators here:
http://rasbt.github.io/mlxtend/user_guide/evaluate/feature_importance_permutation/#example-1-feature-importance-for-classifiers
Note that it's not using OOB samples but an independent validation set though, because it's a general function that should not be restricted to random forests. If you have such an independent dataset, it should give more accurate results than using OOB samples.
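A rough sketch of that validation-set permutation scheme (the steps above,
using a held-out split instead of OOB samples; all data synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = rf.score(X_val, y_val)

rng = np.random.RandomState(0)
importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, j])  # break the link between feature j and y
    importances.append(baseline - rf.score(X_perm, y_val))
print(np.round(importances, 3))
```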
Best,
Sebastian
From Jeremiah.Johnson at unh.edu Fri May 4 20:08:45 2018
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Sat, 5 May 2018 00:08:45 +0000
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
Faraz, take a look at the discussion of this issue here: http://parrt.cs.usfca.edu/doc/rf-importance/index.html
Best,
Jeremiah
=========================================
Jeremiah W. Johnson, Ph.D
Asst. Professor of Data Science
Program Coordinator, B.S. in Analytics & Data Science
University of New Hampshire
Manchester, NH 03101
https://www.linkedin.com/in/jwjohnson314
From: scikit-learn > on behalf of "Niyaghi, Faraz" >
Reply-To: Scikit-learn mailing list >
Date: Friday, May 4, 2018 at 7:10 PM
To: "scikit-learn at python.org" >
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
Caution - External Email
________________________________
Greetings,
This is Faraz Niyaghi from Oregon State University. I research on variable selection using random forest. To the best of my knowledge, there is a difference between scikit-learn's and Breiman's definition of feature importance. Breiman uses out of bag (oob) cases to calculate feature importance but scikit-learn doesn't. I was wondering: 1) why are they different? 2) can they result in very different rankings of features?
Here are the definitions I found on the web:
Breiman: "In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
scikit-learn: " The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
Thank you for reading this email. Please let me know your thoughts.
Cheers,
Faraz.
Faraz Niyaghi
Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
From aqsdmcet at gmail.com Sat May 5 00:31:14 2018
From: aqsdmcet at gmail.com (aijaz qazi)
Date: Sat, 5 May 2018 10:01:14 +0530
Subject: [scikit-learn] Multi learn error.
Message-ID:
Dear developers of Scikit ,
I am working on web page categorization with http://scikit.ml/ .
*Question*: I am not able to execute the MLkNN code from
http://scikit.ml/api/classify.html. I have installed Python 3.6.
I found that my scipy version is not compatible with scikit.ml 0.0.5.
Which version of scipy would work with scikit.ml 0.0.5?
Kindly let me know.
*Regards,*
*Aijaz A.Qazi *
From rth.yurchak at gmail.com Sat May 5 02:28:22 2018
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sat, 5 May 2018 09:28:22 +0300
Subject: [scikit-learn] Multi learn error.
In-Reply-To:
References:
Message-ID: <49def996-56c7-ec5e-dc37-bf93968cfa2a@gmail.com>
Hi Aijaz,
On 05/05/18 07:31, aijaz qazi wrote:
> Dear developers of Scikit ,
Scikit is short for SciPy Toolkits (https://www.scipy.org/scikits.html);
there are a number of those. Scikit-learn started as one (and this is the
scikit-learn mailing list).
The package you are referring to is based on scikit-learn but is a separate
project (with a somewhat confusing home page URL). The right place to
ask for support would be its GitHub issue tracker or other project-specific
communication channels, if it has any.
--
Roman
From g.lemaitre58 at gmail.com Sat May 5 04:34:36 2018
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Sat, 5 May 2018 10:34:36 +0200
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
+1 on the post pointed out by Jeremiah.
On 5 May 2018 at 02:08, Johnson, Jeremiah wrote:
> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
> Best,
> Jeremiah
> =========================================
> Jeremiah W. Johnson, Ph.D
> Asst. Professor of Data Science
> Program Coordinator, B.S. in Analytics & Data Science
> University of New Hampshire
> Manchester, NH 03101
> https://www.linkedin.com/in/jwjohnson314
>
>
> From: scikit-learn python.org> on behalf of "Niyaghi, Faraz"
> Reply-To: Scikit-learn mailing list
> Date: Friday, May 4, 2018 at 7:10 PM
> To: "scikit-learn at python.org"
> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> Importance
>
> *Caution - External Email*
> ------------------------------
> Greetings,
>
> This is Faraz Niyaghi from Oregon State University. I research on variable
> selection using random forest. To the best of my knowledge, there is a
> difference between scikit-learn's and Breiman's definition of feature
> importance. Breiman uses out of bag (oob) cases to calculate feature
> importance but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
> Here are the definitions I found on the web:
>
> *Breiman:* "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>
>
> *scikit-learn: *" The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
> Link: http://scikit-learn.org/stable/modules/ensemble.html
>
>
> Thank you for reading this email. Please let me know your thoughts.
>
> Cheers,
> Faraz.
>
> Faraz Niyaghi
>
> Ph.D. Candidate, Department of Statistics
> Oregon State University
> Corvallis, OR
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
From g.louppe at gmail.com Sat May 5 05:21:17 2018
From: g.louppe at gmail.com (Gilles Louppe)
Date: Sat, 05 May 2018 09:21:17 +0000
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
Hi,
See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
point of view regarding the "issue" with feature importances. TLDR: Feature
importances as we have them in scikit-learn (i.e. MDI) are provably **not**
biased, provided trees are built totally at random (as in ExtraTrees with
max_features=1) and the depth is controlled via min_samples_split (to avoid
splitting on noise). On the other hand, it is not always clear what you
actually compute with MDA (permutation-based importances), since it is
conditioned on the model you use.
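As a minimal sketch of that setting (totally randomized trees via max_features=1, depth limited through min_samples_split; the synthetic dataset and all settings here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# First two features are informative, the rest are noise (shuffle=False
# keeps the informative features in the first columns).
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

# Totally randomized trees: one candidate feature per split, depth
# controlled via min_samples_split to avoid splitting on noise.
et = ExtraTreesClassifier(n_estimators=200, max_features=1,
                          min_samples_split=20, random_state=0).fit(X, y)
print(et.feature_importances_)  # MDI scores; they sum to 1
```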
Gilles
On Sat, 5 May 2018 at 10:36, Guillaume Lemaître wrote:
> +1 on the post pointed out by Jeremiah.
> On 5 May 2018 at 02:08, Johnson, Jeremiah
wrote:
>> Faraz, take a look at the discussion of this issue here:
http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>> Best,
>> Jeremiah
>> =========================================
>> Jeremiah W. Johnson, Ph.D
>> Asst. Professor of Data Science
>> Program Coordinator, B.S. in Analytics & Data Science
>> University of New Hampshire
>> Manchester, NH 03101
>> https://www.linkedin.com/in/jwjohnson314
>> From: scikit-learn on behalf of "Niyaghi, Faraz"
>> Reply-To: Scikit-learn mailing list
>> Date: Friday, May 4, 2018 at 7:10 PM
>> To: "scikit-learn at python.org"
>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
>> Caution - External Email
>> ________________________________
>> Greetings,
>> This is Faraz Niyaghi from Oregon State University. I research on
variable selection using random forest. To the best of my knowledge, there
is a difference between scikit-learn's and Breiman's definition of feature
importance. Breiman uses out of bag (oob) cases to calculate feature
importance but scikit-learn doesn't. I was wondering: 1) why are they
different? 2) can they result in very different rankings of features?
>> Here are the definitions I found on the web:
>> Breiman: "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>> scikit-learn: " The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
>> Link: http://scikit-learn.org/stable/modules/ensemble.html
>> Thank you for reading this email. Please let me know your thoughts.
>> Cheers,
>> Faraz.
>> Faraz Niyaghi
>> Ph.D. Candidate, Department of Statistics
>> Oregon State University
>> Corvallis, OR
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Sat May 5 09:16:50 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 5 May 2018 15:16:50 +0200
Subject: [scikit-learn] Announcing IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging
Message-ID: <20180505131650.ke323loujdoa2mxr@phare.normalesup.org>
Dear colleagues,
It is my pleasure to announce IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging.
https://paris-saclay-cds.github.io/autism_challenge/
This is a machine-learning challenge on brain-imaging data to achieve the
best prediction of autism spectrum disorder diagnostic status. We are
providing the largest cohort so far to learn such predictive biomarkers,
with more than 2000 individuals.
There is a total of 9000 euros of prizes to win for the best prediction.
The prediction quality will be measured on a large hidden test set to
ensure fairness.
We provide a simple starting kit to serve as a proof of feasibility. We
are excited to see what the community will come up with in terms of
predictive models and of score.
Best,
Gaël
--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
From jeff1evesque at yahoo.com Sat May 5 21:40:34 2018
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Sat, 5 May 2018 21:40:34 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
Message-ID: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Hi guys,
I want to perform some basic data analysis. Does anyone have good recommendations for where I can obtain free datasets? I was thinking of trying to do something related to neuroscience, but Kaggle doesn't have many datasets with this focus.
Thank you,
Jeff Levesque
https://github.com/jeff1evesque
From nicholdav at gmail.com Sat May 5 21:58:54 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Sat, 5 May 2018 21:58:54 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
Hi Jeff,
here's a couple of places to start, I'm sure other people can recommend
more:
https://crcns.org/
https://www.nature.com/sdata/policies/repositories (see under Neuroscience)
There's also the challenge that Gael just announced, predicting autism from
brain imaging data:
https://paris-saclay-cds.github.io/autism_challenge/
https://twitter.com/GaelVaroquaux/status/992752034242879488
--David
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab , Emory University,
Atlanta, GA, USA
On Sat, May 5, 2018 at 9:40 PM, Jeffrey Levesque via scikit-learn <
scikit-learn at python.org> wrote:
> Hi guys,
> I want to perform some basic data analysis. Anyone have good
> recommendations where I can obtain free datasets. I was thinking of trying
> to do something related to neuroscience. But, kaggle doesn't have many
> datasets for this focus.
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From michael.eickenberg at gmail.com Sat May 5 21:59:28 2018
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Sat, 5 May 2018 18:59:28 -0700
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
Hi Jeffrey,
check out these here for neuron data and fmri:
http://crcns.org/
And the ones here for fmri:
https://openfmri.org/
You can get started by installing one of the following packages and using
their dataset downloaders
http://nilearn.github.io/modules/reference.html#module-nilearn.datasets
https://martinos.org/mne/stable/manual/datasets_index.html
Also, there was this kaggle
https://www.kaggle.com/c/decoding-the-human-brain
And probably a bunch of others
Hope that helps!
Michael
On Sat, May 5, 2018 at 6:40 PM, Jeffrey Levesque via scikit-learn <
scikit-learn at python.org> wrote:
> Hi guys,
> I want to perform some basic data analysis. Anyone have good
> recommendations where I can obtain free datasets. I was thinking of trying
> to do something related to neuroscience. But, kaggle doesn't have many
> datasets for this focus.
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From nicholdav at gmail.com Sat May 5 22:04:56 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Sat, 5 May 2018 22:04:56 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To:
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
also (sorry for spamming the list!) should have said the Allen Institute
has a ton of data:
https://www.nwb.org/allen-cell-types-database/
and check out the cool dataset with this paper:
https://figshare.com/articles/Recordings_of_ten_thousand_neurons_in_visual_cortex_during_spontaneous_behaviors/6163622
https://github.com/MouseLand/stringer-pachitariu-et-al-2018a
explainer twitter thread:
https://twitter.com/marius10p/status/988069221941874688
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab , Emory University,
Atlanta, GA, USA
On Sat, May 5, 2018 at 9:58 PM, David Nicholson wrote:
> Hi Jeff,
>
> here's a couple of places to start, I'm sure other people can recommend
> more:
> https://crcns.org/
> https://www.nature.com/sdata/policies/repositories (see under
> Neuroscience)
>
> There's also the challenge that Gael just announced, predicting autism
> from brain imaging data:
> https://paris-saclay-cds.github.io/autism_challenge/
> https://twitter.com/GaelVaroquaux/status/992752034242879488
> --David
>
> David Nicholson, Ph.D.
> nickledave.github.io
> https://github.com/NickleDave
> Prinz lab , Emory
> University, Atlanta, GA, USA
>
> On Sat, May 5, 2018 at 9:40 PM, Jeffrey Levesque via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> Hi guys,
>> I want to perform some basic data analysis. Anyone have good
>> recommendations where I can obtain free datasets. I was thinking of trying
>> to do something related to neuroscience. But, kaggle doesn't have many
>> datasets for this focus.
>>
>> Thank you,
>>
>> Jeff Levesque
>> https://github.com/jeff1evesque
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
From joel.nothman at gmail.com Sat May 5 22:17:36 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 6 May 2018 12:17:36 +1000
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID:
The coef_ available from LinearSVC will be somewhat indicative of the
relative importance of each feature.
But you might want to look into our feature selection documentation:
http://scikit-learn.org/stable/modules/feature_selection.html
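For instance, a hedged sketch along the lines of that documentation page, combining LinearSVC's coef_ with SelectFromModel (the C value and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# An L1-penalized linear SVM drives uninformative coefficients to zero;
# SelectFromModel then keeps the features with non-zero weights.
svc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(svc, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```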
From matti.v.viljamaa at gmail.com Sun May 6 14:01:12 2018
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Sun, 6 May 2018 21:01:12 +0300
Subject: [scikit-learn] Does sklearn.decomposition.TruncatedSVD take
n_components in order? Or can I select which features I want?
Message-ID: <5aef42ea.1c69fb81.779bc.933b@mx.google.com>
Does sklearn.decomposition.TruncatedSVD take n_components in order? Or can I select which features I want?
Reason being that if one uses the "pick features with eigenvalues > 1" principle, then I'd need to tell the SVD algorithm somehow which components it should use.
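One possible workaround sketch (this assumes components come back ordered by decreasing singular value, which appears to be the case: fit more components than needed, then keep a subset afterwards; the dataset and threshold are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X, _ = load_digits(return_X_y=True)

# Fit more components than needed, then keep only those passing the
# "eigenvalue > 1"-style rule on the explained variance.
svd = TruncatedSVD(n_components=20, random_state=0).fit(X)
keep = svd.explained_variance_ > 1.0
X_t = svd.transform(X)[:, keep]
print(X_t.shape)
```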
BR, Matti
Sent from Windows 10 Mail
---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
From santoshmsubedi at gmail.com Tue May 8 03:26:06 2018
From: santoshmsubedi at gmail.com (Santosh Subedi)
Date: Tue, 8 May 2018 16:26:06 +0900
Subject: [scikit-learn] Help me Please!
Message-ID:
Hello,
I'm using Scikit-learn for Gaussian Process Regression (GPR). I'm facing a
problem/confusion regarding GaussianProcessRegressor class. If gp is a
GaussianProcessRegressor, the prediction is given as:
y_pred_test, sigma = gp.predict(x_test, return_std=True)
After printing y_pred_test and sigma, I see that y_pred_test contains a
prediction for every data source (3 data sources per test point). However,
the standard deviation (sigma) is a single value per test point. I want
sigma to be predicted per data source, like y_pred_test. I've asked my
question at StackOverflow at the following link:
https://stackoverflow.com/questions/50185399/insufficient-output-with-predictx-test-return-std-true-in-gaussianprocessre
Could you reply with an appropriate answer to this email or at the
StackOverflow, please?
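For reference, the shapes in question can be reproduced with a small sketch (the data and two-target setup are made up for illustration; note that newer scikit-learn releases may return one std per target rather than one per test point):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(30, 1))
Y_train = np.hstack([np.sin(X_train), np.cos(X_train)])  # two targets

gp = GaussianProcessRegressor().fit(X_train, Y_train)
X_test = np.linspace(0, 10, 5).reshape(-1, 1)

# The mean comes back per target; compare its shape with sigma's.
y_mean, sigma = gp.predict(X_test, return_std=True)
print(y_mean.shape, sigma.shape)
```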
Thank you for your time and consideration.
Kindly Regards,
santobedi
From matti.v.viljamaa at gmail.com Wed May 9 10:08:40 2018
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Wed, 9 May 2018 17:08:40 +0300
Subject: [scikit-learn] How to pick the maximum possible parameters for
algos such as sklearn.decomposition.TruncatedSVD?
Message-ID: <5af300ea.1c69fb81.cc315.65e7@mx.google.com>
How do I pick the maximum possible parameters for algorithms such as sklearn.decomposition.TruncatedSVD?
This algorithm can raise a memory error if memory runs out, but of course one would like to select the maximum possible n_components given the available system memory.
So how do I do that?
Sent from Windows 10 Mail
From carolduncanpc833 at yahoo.com Wed May 9 11:40:52 2018
From: carolduncanpc833 at yahoo.com (Carol Duncan)
Date: Wed, 9 May 2018 15:40:52 +0000 (UTC)
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID: <1570331285.1609254.1525880452333@mail.yahoo.com>
From: bthirion
To: scikit-learn at python.org
Sent: Wednesday, May 2, 2018 12:07 PM
Subject: Re: [scikit-learn] How does multiple target Ridge Regression work in scikit learn?
The alpha parameter is shared for all problems; if you want to use different parameters, you probably want to perform separate fits.
Best,
Bertrand
On 02/05/2018 13:08, Peer Nowack wrote:
Hi all,
I am struggling to understand the following: Scikit-learn offers a multiple output version for Ridge Regression, simply by handing over a 2D array [n_samples, n_targets], but how is it implemented?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
Is it correct to assume that each regression for each target is independent? Under these circumstances, how can I adapt this to use individual alpha regularization parameters for each regression? If I use GridSearchCV, I would have to hand over a matrix of possible regularization parameters, or how would that work?
Thanks in advance - I have been searching for hours but could not find anything on this topic.
Peter
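A sketch of per-target regularization (synthetic data; Ridge also appears to accept an array of one alpha per target, which should match fitting each target separately with its own alpha):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
Y = X @ rng.randn(4, 2) + 0.1 * rng.randn(100, 2)  # two targets

# Per-target penalties passed as an array of shape (n_targets,)...
ridge = Ridge(alpha=np.array([0.1, 10.0])).fit(X, Y)

# ...compared against fitting each target separately with its own alpha.
separate = np.column_stack(
    [Ridge(alpha=a).fit(X, Y[:, i]).predict(X)
     for i, a in enumerate([0.1, 10.0])]
)
print(np.allclose(ridge.predict(X), separate))
```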
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
From dylanf123 at gmail.com Thu May 10 03:08:07 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Thu, 10 May 2018 17:08:07 +1000
Subject: [scikit-learn] Unable to run make test-coverage
Message-ID:
Hi,
I am unable to run make test-coverage.
I get the error:
rm -rf coverage .coverage
pytest sklearn --showlocals -v --cov=sklearn --cov-report=html:coverage
usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov=sklearn
--cov-report=html:coverage
inifile: /Users/dylan/scikit-learn/setup.cfg
rootdir: /Users/dylan/scikit-learn
make: *** [test-coverage] Error 2
Regards,
Dylan
From joel.nothman at gmail.com Thu May 10 03:22:12 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 10 May 2018 17:22:12 +1000
Subject: [scikit-learn] Unable to run make test-coverage
In-Reply-To:
References:
Message-ID:
Do you have pytest-cov installed?
From dylanf123 at gmail.com Thu May 10 05:29:34 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Thu, 10 May 2018 19:29:34 +1000
Subject: [scikit-learn] Unable to run make test-coverage
In-Reply-To:
References:
Message-ID:
On Thu, May 10, 2018 at 5:22 PM, Joel Nothman
wrote:
> Do you have pytest-cov installed??
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
Thanks, I installed it and it works now.
From reismc at gmail.com Sat May 12 10:26:05 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Sat, 12 May 2018 11:26:05 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
Message-ID:
The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my computer
without any warning message!
I am using WinPython 3.6.5 64 bit.
The method works normally with the original data, but freezes when I use
the normalized data (between 0 and 1).
What should I do?
Att.,
Mauricio Reis
From awnystrom at gmail.com Sat May 12 18:20:32 2018
From: awnystrom at gmail.com (Andrew Nystrom)
Date: Sat, 12 May 2018 15:20:32 -0700
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
Message-ID:
If you're L2-norming your data, you're making it live on the surface of a
hypersphere. That surface will have a high density of points and may not
have areas of low density, in which case the entire surface could be
recognized as a single cluster if epsilon is high enough and min neighbors
is low enough. I'd suggest not using the L2 norm with DBSCAN.
On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my computer
> without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I use
> the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From rth.yurchak at gmail.com Sun May 13 04:34:42 2018
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sun, 13 May 2018 10:34:42 +0200
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
Message-ID: <801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Could you please check memory usage while running DBSCAN to make sure
freezing is due to running out of memory and not to something else?
Which parameters do you run DBSCAN with? Changing algorithm, leaf_size
parameters and ensuring n_jobs=1 could help.
Assuming eps is reasonable, I think it shouldn't be an issue to run
DBSCAN on L2 normalized data: using the default euclidean metric, this
should produce somewhat similar results to clustering not normalized
data with metric='cosine'.
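The suggested settings can be tried in a small sketch (the eps values and data are arbitrary; note that metric='cosine' requires the brute-force neighbor search):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)

# Tree-based neighbor search keeps memory bounded compared to the brute
# path on large data; n_jobs=1 avoids duplicating work per worker.
db = DBSCAN(eps=0.5, algorithm="ball_tree", leaf_size=30, n_jobs=1)
labels_l2 = db.fit_predict(normalize(X))  # L2-normalized, euclidean metric

# Roughly comparable alternative: cosine metric on the raw data.
labels_cos = DBSCAN(eps=0.1, metric="cosine",
                    algorithm="brute").fit_predict(X)
print(len(set(labels_l2)), len(set(labels_cos)))
```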
On 13/05/18 00:20, Andrew Nystrom wrote:
> If you're l2 norming your data, you're making it live on the surface of
> a hypersphere. That surface will have a high density of points and may
> not have areas of low density, in which case the entire surface could be
> recognized as a single cluster if epsilon is high enough and min
> neighbors is low enough. I'd suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From reismc at gmail.com Sun May 13 19:23:15 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Sun, 13 May 2018 20:23:15 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
I think the problem is due to the size of my database, which has 44,000
records. When I ran tests with reduced sizes (the first 10,000 and 20,000
records), the routine ran normally.
You asked me to check the memory while running the DBScan routine, but I do
not know how to do that (if I did, I would have done it already).
I think the routine is not ready to work with this much data. The problem is
that my computer freezes and I cannot analyze the case. I've tried to
figure out whether any changes help (like changing routine parameters), but all
alternatives with lots of data (about 40,000 records) produce the error.
I believe the package routines have no exception handling, to improve
performance. So I suggest providing a test version that shows a proper
message when an error occurs.
To summarize: 1) How do I check the computer's memory during execution of
the routine? 2) I suggest developing test versions of routines that may run
into memory errors.
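Regarding 1), one stdlib-only way to watch Python-level allocations is tracemalloc (note it may miss memory allocated inside compiled extensions, so the OS task manager is still worth watching; the workload below is just a stand-in for the clustering call):

```python
import tracemalloc

import numpy as np

tracemalloc.start()

data = np.random.rand(1000, 10)  # stand-in for loading the 44,000 records
work = data @ data.T             # stand-in for the expensive fit() call

# current = memory traced right now, peak = high-water mark since start()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()
```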
Att.,
Mauricio Reis
2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure
> freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size
> parameters and ensuring n_jobs=1 could help.
>
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN
> on L2 normalized data: using the default euclidean metric, this should
> produce somewhat similar results to clustering not normalized data with
> metric='cosine'.
>
> On 13/05/18 00:20, Andrew Nystrom wrote:
>
>> If you're l2 norming your data, you're making it live on the surface of a
>> hypersphere. That surface will have a high density of points and may not
>> have areas of low density, in which case the entire surface could be
>> recognized as a single cluster if epsilon is high enough and min neighbors
>> is low enough. I'd suggest not using l2 norm with DBSCAN.
>> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>>
>> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
>> computer without any warning message!
>>
>> I am using WinPython 3.6.5 64 bit.
>>
>> The method works normally with the original data, but freezes when I
>> use the normalized data (between 0 and 1).
>>
>> What should I do?
>>
>> Att.,
>> Mauricio Reis
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From chema at rinzewind.org Sun May 13 19:44:34 2018
From: chema at rinzewind.org (=?iso-8859-1?Q?Jos=E9_Mar=EDa?= Mateos)
Date: Sun, 13 May 2018 19:44:34 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <20180513234434.GA3210@equipaje>
On Sun, May 13, 2018 at 08:23:15PM -0300, Mauricio Reis wrote:
> To summarize: 1) How to check the memory of the computer during the
> execution of the routine? 2) I suggest developing test versions of routines
> that may have a memory error.
If you are on Linux, can you just run "top" while your script runs? That
will tell you how much memory is being used by each process. On Windows,
you can use the Task Manager to obtain similar results.
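For reference, a minimal sketch of checking peak memory from inside the Python process itself, using only the standard library (Unix-only; note that `ru_maxrss` is reported in KiB on Linux but in bytes on macOS):

```python
import resource

def peak_rss_mib():
    """Peak resident set size of this process, in MiB (assumes Linux KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# ... run the memory-hungry call here, e.g. DBSCAN(...).fit(X), then:
print("peak memory: %.1f MiB" % peak_rss_mib())
```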
Cheers,
--
José María (Chema) Mateos
https://rinzewind.org/blog-es || https://rinzewind.org/blog-en
From mail at sebastianraschka.com Sun May 13 20:16:16 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Sun, 13 May 2018 20:16:16 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <1EA93B26-5892-4D85-9FE7-51F32B06C8DF@sebastianraschka.com>
> So I suggest that there is a test version that shows a proper message when an error occurs.
I think the freezing that happens in your case is operating system specific, and it would require some weird workarounds to detect at which RAM usage a given combination of machine and operating system might freeze (e.g., I have never observed my system freezing when I run out of RAM, since it has a pretty swift SSD to swap to, but the sklearn process may then take a very long time to finish). Plus, scikit-learn would need to know and constantly check how much memory is being used and currently available (due to the use of other apps and the OS kernel), which wouldn't be feasible.
I am not sure if this helps (depending on where the memory-usage bottleneck is), but it might help to provide a sparse (CSR) array instead of a dense one to the .fit() method. Another thing to try would be to pre-compute the distances and give those to the .fit() method after initializing the DBSCAN object with metric='precomputed'.
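The sparse-input suggestion could look like the sketch below (toy data; zeroing the negative entries just manufactures sparsity for the example, and the conversion only saves memory if the feature matrix genuinely contains many zeros):

```python
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
X[X < 0] = 0                      # artificial sparsity for the example
X_sparse = sparse.csr_matrix(X)   # CSR stores only the nonzero entries

# DBSCAN accepts sparse input directly; neighbor search uses brute force.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_sparse)
```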
Best,
Sebastian
> On May 13, 2018, at 7:23 PM, Mauricio Reis wrote:
>
> I think the problem is due to the size of my database, which has 44,000 records. When I ran a database test with reduced sizes (10,000 and 20,000 first records), the routine ran normally.
>
> You ask me to check the memory while running the DBScan routine, but I do not know how to do that (if I did, I would have done that already).
>
> I think the routine is not ready to work with too much data. The problem is that my computer freezes and I can not analyze the case. I've tried to figure out if any changes work (like changing routine parameters), but all alternatives with lots of data (about 40,000 records) generate error.
>
> I believe that package routines have no exception handling to improve performance. So I suggest that there is a test version that shows a proper message when an error occurs.
>
> To summarize: 1) How to check the memory of the computer during the execution of the routine? 2) I suggest developing test versions of routines that may have a memory error.
>
> Att.,
> Mauricio Reis
>
> 2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size parameters and ensuring n_jobs=1 could help.
>
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN on L2 normalized data: using the default euclidean metric, this should produce somewhat similar results to clustering not normalized data with metric='cosine'.
>
> On 13/05/18 00:20, Andrew Nystrom wrote:
> If you're l2 norming your data, you're making it live on the surface of a hypersphere. That surface will have a high density of points and may not have areas of low density, in which case the entire surface could be recognized as a single cluster if epsilon is high enough and min neighbors is low enough. I'd suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Sun May 13 22:59:15 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 May 2018 12:59:15 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
This is quite a common issue with our implementation of DBSCAN, and
improvements to documentation would be very, very welcome.
The high memory cost comes from constructing the pairwise radius neighbors
for all points. If using a distance metric that cannot be indexed with a
KD-tree or Ball Tree, this results in n^2 floats being stored in memory
even before the radius neighbors are computed.
You have the following strategies available to you currently:
1. Calculate the radius neighborhoods using radius_neighbors_graph in
chunks, so as to avoid all pairs being calculated and stored at once. This
produces a sparse graph representation, which can be passed into dbscan
with metric='precomputed'. (I've just seen Sebastian suggested the same.)
2. Reduce the number of samples in your dataset and represent
(near-)duplicate points with sample_weight (i.e. two identical points would
be merged but would have a sample_weight of 2).
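Strategy 1 might be sketched as follows (toy data and an illustrative eps; radius_neighbors_graph returns a sparse graph holding only within-eps distances, which is then fed to DBSCAN with metric='precomputed'):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import radius_neighbors_graph

X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
eps = 0.5

# Sparse graph holding, for every point, only the distances to neighbours
# closer than eps; the dense n^2 pairwise matrix is never materialised.
D = radius_neighbors_graph(X, radius=eps, mode='distance')

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D)
```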
There is also a proposal to offer an alternative memory-efficient mode at
https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is welcome.
Cheers,
Joel
From joel.nothman at gmail.com Sun May 13 23:07:21 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 May 2018 13:07:21 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
Note that this has long been documented under "Memory consumption for large
sample sizes" at
http://scikit-learn.org/stable/modules/clustering.html#dbscan
On 14 May 2018 at 12:59, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius neighbors
> for all points. If using a distance metric that cannot be indexed with a
> KD-tree or Ball Tree, this results in n^2 floats being stored in memory
> even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once. This
> produces a sparse graph representation, which can be passed into dbscan
> with metric='precomputed'. (I've just seen Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points would
> be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode at
> https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
From dylanf123 at gmail.com Mon May 14 09:39:29 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Mon, 14 May 2018 23:39:29 +1000
Subject: [scikit-learn] New algorithm suggestion - AODE
Message-ID:
Hello,
I would like to suggest a new classification algorithm for scikit-learn,
Averaged one-dependence estimators (AODE).
AODE achieves highly accurate classification by averaging over all of a
small space of alternative naive-Bayes-like models that have weaker (and
hence less detrimental) independence assumptions than naive Bayes. The
resulting algorithm is computationally efficient while delivering highly
accurate classification on many learning tasks. For more information, see
the paper (https://link.springer.com/article/10.1007/s10994-005-4258-6),
which has over 200 citations.
There is an existing implementation in the WEKA machine learning suite (
http://weka.sourceforge.net/doc.stable/weka/classifiers/bayes/AODE.html).
I've made a pull request and I would like some feedback (
https://github.com/scikit-learn/scikit-learn/pull/11093).
Thank You,
Dylan
From t3kcit at gmail.com Wed May 16 13:27:40 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:27:40 -0400
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID: <3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
I don't think that's how most people use the trees, though.
Probably not even the ExtraTrees.
I really need to get around to reading your thesis :-/
Do you recommend using max_features=1 with ExtraTrees?
On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> Hi,
>
> See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> point of view regarding the "issue" with feature importances. TLDR: Feature
> importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> biased, provided trees are built totally at random (as in ExtraTrees with
> max_features=1) and the depth is controlled by min_samples_split (to avoid
> splitting on noise). On the other hand, it is not always clear what you
> actually compute with MDA (permutation based importances), since it is
> conditioned on the model you use.
>
> Gilles
> > On Sat, 5 May 2018 at 10:36, Guillaume Lemaître
> wrote:
>
>> +1 on the post pointed out by Jeremiah.
>> On 5 May 2018 at 02:08, Johnson, Jeremiah
> wrote:
>
>>> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
>>> Best,
>>> Jeremiah
>>> =========================================
>>> Jeremiah W. Johnson, Ph.D
>>> Asst. Professor of Data Science
>>> Program Coordinator, B.S. in Analytics & Data Science
>>> University of New Hampshire
>>> Manchester, NH 03101
>>> https://www.linkedin.com/in/jwjohnson314
>>> From: scikit-learn unh.edu at python.org> on behalf of "Niyaghi, Faraz"
>>> Reply-To: Scikit-learn mailing list
>>> Date: Friday, May 4, 2018 at 7:10 PM
>>> To: "scikit-learn at python.org"
>>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> Importance
>
>>> Caution - External Email
>>> ________________________________
>>> Greetings,
>>> This is Faraz Niyaghi from Oregon State University. I research on
> variable selection using random forest. To the best of my knowledge, there
> is a difference between scikit-learn's and Breiman's definition of feature
> importance. Breiman uses out of bag (oob) cases to calculate feature
> importance but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
>>> Here are the definitions I found on the web:
>>> Breiman: "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
>>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>>> scikit-learn: " The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
>>> Link: http://scikit-learn.org/stable/modules/ensemble.html
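Breiman's MDA definition quoted above can be approximated by hand; the sketch below (toy data) permutes each variable on a single held-out split rather than on the oob cases of each tree, so it is only an approximation of Breiman's per-tree oob procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)           # accuracy on untouched held-out data

rng = np.random.RandomState(0)
importances = []
for m in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, m] = rng.permutation(X_perm[:, m])  # destroy variable m's information
    importances.append(base - rf.score(X_perm, y_te))
```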
>>> Thank you for reading this email. Please let me know your thoughts.
>>> Cheers,
>>> Faraz.
>>> Faraz Niyaghi
>>> Ph.D. Candidate, Department of Statistics
>>> Oregon State University
>>> Corvallis, OR
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed May 16 13:37:36 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:37:36 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
You might also consider looking at hdbscan:
https://github.com/scikit-learn-contrib/hdbscan
On 05/13/2018 11:07 PM, Joel Nothman wrote:
> Note that this has long been documented under "Memory consumption for
> large sample sizes" at
> http://scikit-learn.org/stable/modules/clustering.html#dbscan
>
> On 14 May 2018 at 12:59, Joel Nothman wrote:
>
> This is quite a common issue with our implementation of DBSCAN,
> and improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot
> be indexed with a KD-tree or Ball Tree, this results in n^2 floats
> being stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph
> in chunks, so as to avoid all pairs being calculated and stored at
> once. This produces a sparse graph representation, which can be
> passed into dbscan with metric='precomputed'. (I've just seen
> Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical
> points would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient
> mode at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback
> is welcome.
>
> Cheers,
>
> Joel
>
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed May 16 13:44:17 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:44:17 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
Should we have "low memory"/batched version of k_neighbors_graph and
epsilon_neighbors_graph functions? I assume
those instantiate the dense matrix right now.
On 05/13/2018 10:59 PM, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot be
> indexed with a KD-tree or Ball Tree, this results in n^2 floats being
> stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once.
> This produces a sparse graph representation, which can be passed into
> dbscan with metric='precomputed'. (I've just seen Sebastian suggested
> the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points
> would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode
> at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Wed May 16 13:50:07 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 May 2018 19:50:07 +0200
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
Message-ID: <20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
On Wed, May 16, 2018 at 01:44:17PM -0400, Andreas Mueller wrote:
> Should we have "low memory"/batched version of k_neighbors_graph and
> epsilon_neighbors_graph functions? I assume
> those instantiate the dense matrix right now.
+1!
It shouldn't be too hard to do.
G
From g.louppe at gmail.com Wed May 16 14:08:59 2018
From: g.louppe at gmail.com (Gilles Louppe)
Date: Wed, 16 May 2018 20:08:59 +0200
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To: <3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
References:
<3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
Message-ID:
> Do you recommend using max_features=1 with ExtraTrees?
If what you want are feature importances that reflect, without 'bias', the
mutual information of each variable (alone or in combination with others)
with Y, then yes. Bonus points if you set min_impurity_decrease > 0, to
avoid splitting on noise and collecting that as part of the importance
scores.
The resulting forest will not be optimal with respect to
classification/regression performance though.
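A sketch of the setup Gilles describes (toy data; the exact min_impurity_decrease threshold here is an arbitrary illustrative value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

# max_features=1: the split variable is chosen fully at random (totally
# randomized trees); min_impurity_decrease > 0 discourages splits on noise.
forest = ExtraTreesClassifier(n_estimators=200, max_features=1,
                              min_impurity_decrease=1e-3,
                              random_state=0).fit(X, y)
importances = forest.feature_importances_  # MDI scores, normalized to sum to 1
```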
On Wed, 16 May 2018 at 19:29, Andreas Mueller wrote:
> I don't think that's how most people use the trees, though.
> Probably not even the ExtraTrees.
> I really need to get around to reading your thesis :-/
> Do you recommend using max_features=1 with ExtraTrees?
> On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> > Hi,
> >
> > See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> > point of view regarding the "issue" with feature importances. TLDR: Feature
> > importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> > biased, provided trees are built totally at random (as in ExtraTrees with
> > max_features=1) and the depth is controlled by min_samples_split (to avoid
> > splitting on noise). On the other hand, it is not always clear what you
> > actually compute with MDA (permutation based importances), since it is
> > conditioned on the model you use.
> >
> > Gilles
> > On Sat, 5 May 2018 at 10:36, Guillaume Lemaître
> > wrote:
> >
> >> +1 on the post pointed out by Jeremiah.
> >> On 5 May 2018 at 02:08, Johnson, Jeremiah
> > wrote:
> >
> >>> Faraz, take a look at the discussion of this issue here:
> > http://parrt.cs.usfca.edu/doc/rf-importance/index.html
> >
> >>> Best,
> >>> Jeremiah
> >>> =========================================
> >>> Jeremiah W. Johnson, Ph.D
> >>> Asst. Professor of Data Science
> >>> Program Coordinator, B.S. in Analytics & Data Science
> >>> University of New Hampshire
> >>> Manchester, NH 03101
> >>> https://www.linkedin.com/in/jwjohnson314
> >>> From: scikit-learn > unh.edu at python.org> on behalf of "Niyaghi, Faraz" <
niyaghif at oregonstate.edu>
> >>> Reply-To: Scikit-learn mailing list
> >>> Date: Friday, May 4, 2018 at 7:10 PM
> >>> To: "scikit-learn at python.org"
> >>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> > Importance
> >
> >>> Caution - External Email
> >>> ________________________________
> >>> Greetings,
> >>> This is Faraz Niyaghi from Oregon State University. I research on
> > variable selection using random forest. To the best of my knowledge, there
> > is a difference between scikit-learn's and Breiman's definition of feature
> > importance. Breiman uses out of bag (oob) cases to calculate feature
> > importance but scikit-learn doesn't. I was wondering: 1) why are they
> > different? 2) can they result in very different rankings of features?
> >
> >>> Here are the definitions I found on the web:
> >>> Breiman: "In every tree grown in the forest, put down the oob cases and
> > count the number of votes cast for the correct class. Now randomly permute
> > the values of variable m in the oob cases and put these cases down the
> > tree. Subtract the number of votes for the correct class in the
> > variable-m-permuted oob data from the number of votes for the correct class
> > in the untouched oob data. The average of this number over all trees in the
> > forest is the raw importance score for variable m."
> >>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> >>> scikit-learn: "The relative rank (i.e. depth) of a feature used as a
> > decision node in a tree can be used to assess the relative importance of
> > that feature with respect to the predictability of the target variable.
> > Features used at the top of the tree contribute to the final prediction
> > decision of a larger fraction of the input samples. The expected fraction
> > of the samples they contribute to can thus be used as an estimate of the
> > relative importance of the features."
> >>> Link: http://scikit-learn.org/stable/modules/ensemble.html
> >>> Thank you for reading this email. Please let me know your thoughts.
> >>> Cheers,
> >>> Faraz.
> >>> Faraz Niyaghi
> >>> Ph.D. Candidate, Department of Statistics
> >>> Oregon State University
> >>> Corvallis, OR
> >>> _______________________________________________
> >>> scikit-learn mailing list
> >>> scikit-learn at python.org
> >>> https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> >
> >> --
> >> Guillaume Lemaitre
> >> INRIA Saclay - Parietal team
> >> Center for Data Science Paris-Saclay
> >> https://glemaitre.github.io/
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Wed May 16 19:33:01 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 17 May 2018 09:33:01 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
Implemented in a previous version of #10280, but removed for now to
simplify reviews.
If others would like to review #10280, I'm happy to follow up with the
changes requested here, which have already been implemented by Aman Dalmia
and myself.
From reismc at gmail.com Thu May 17 10:37:14 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Thu, 17 May 2018 11:37:14 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
I am not used to the terms used here. So I understood that the package had
memory management, which was removed, but that you could make the code with
the memory management implementations available. Is that right?! :-)
The problem is that I do not know what I would do with the code, because I
only know how to work with the ready-made scikit-learn package. :-(
Att.,
Mauricio Reis
2018-05-16 20:33 GMT-03:00 Joel Nothman :
> Implemented in a previous version of #10280, but removed for now to
> simplify reviews.
> If others would like to review #10280, I'm happy to follow up with the
> changes requested here, which have already been implemented by Aman Dalmia
> and myself.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From joel.nothman at gmail.com Thu May 17 18:02:56 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 18 May 2018 08:02:56 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
There are two issues here:
1. We store all radius neighborhoods of all points in memory at once. This
is a problem if each point has a large radius neighborhood. DBSCAN only
requires that you store the radius neighbors of the point you are currently
examining. We could provide a memory-efficient mode that would do so.
2. Given that we store all neighborhoods at once, a brute force nearest
neighbors search will take O(n^2) memory, which can be reduced by chunking
the operation.
Both solutions have patches available already, but not reviewed.
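The chunked computation in issue 2 can be sketched as follows (toy data; the chunk size is arbitrary). Querying the fitted NearestNeighbors index one block of rows at a time and stacking the sparse results keeps peak memory proportional to chunk * n rather than n^2:

```python
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)
eps, chunk = 0.5, 500

nn = NearestNeighbors(radius=eps).fit(X)
blocks = []
for start in range(0, X.shape[0], chunk):
    # Each query returns a sparse (chunk, n) slice of the radius graph,
    # so only chunk x n distances are ever in flight at once.
    blocks.append(nn.radius_neighbors_graph(X[start:start + chunk],
                                            mode='distance'))
G = sparse.vstack(blocks)

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(G)
```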
On 18 May 2018 at 00:37, Mauricio Reis wrote:
> I'm not used to the terms used here. So I understood that the package had
> memory management, which was removed. But you could make the code available
> with memory management implementations. Is it?! :-)
> The problem is that I do not know what I would do with the code, because I
> only know how to work with the SciKitLearn package ready. :-(
>
> Att.,
> Mauricio Reis
>
> 2018-05-16 20:33 GMT-03:00 Joel Nothman :
>
>> Implemented in a previous version of #10280, but removed for now to
>> simplify reviews.
>> If others would like to review #10280, I'm happy to follow up with the
>> changes requested here, which have already been implemented by Aman Dalmia
>> and myself.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From valerio.maggio at gmail.com Fri May 18 07:10:59 2018
From: valerio.maggio at gmail.com (Valerio Maggio)
Date: Fri, 18 May 2018 13:10:59 +0200
Subject: [scikit-learn] CFP: EuroSciPy 2018 - 11th European Conference on
Python in Science
Message-ID:
*** Apologies if you receive multiple copies ***
Dear Colleagues,
We are delighted to invite you to join us for the *11th European Conference
on Python in Science*.
The EuroSciPy 2018 Conference will be
organised by Fondazione Bruno Kessler (FBK) and will take place from
August, 28th to September, 1st in *Trento, Italy*.
The EuroSciPy meeting is a cross-disciplinary gathering focused on the use
and development of the Python language in scientific research. This event
strives to bring together both users and developers of scientific tools, as
well as people from academic research and state-of-the-art industry.
The conference will be structured as follows:
- *Aug, 28-29 *: Tutorials and Hands-on
- *Aug, 30-31 *: Main Conference
- *Sep, 1 *: Sprint
----------------------------------------------------------------------------------------------------------------
TOPICS OF INTEREST:
Presentations of scientific tools and libraries using the Python language,
including but not limited to:
- Algorithms implemented or exposed in Python
- Astronomy
- Data Visualisation
- Deep Learning & AI
- Earth, Ocean and Geo Science
- General-purpose Python tools that can be of special interest to the
scientific community.
- Image Processing
- Materials Science
- Parallel computing
- Political and Social Sciences
- Project Jupyter
- Reports on the use of Python in scientific achievements or ongoing
projects.
- Robotics & IoT
- Scientific data flow and persistence
- Scientific visualization
- Simulation
- Statistics
- Vector and array manipulation
- Web applications and portals for science and engineering
- 3D Printing
-----------------------------------------------------------------------------------------------------------------
CALL FOR PROPOSALS:
EuroScipy will accept three different kinds of contributions:
- *Regular Talks*: standard talks for oral presentations, allocated in
time slots of `15` or `30` minutes, depending on your preference and
scheduling constraints. Each time slot includes a Q&A session at the end
of the talk (at least 5 minutes).
- *Hands-on Tutorials*: These are *beginner* or *advanced* training
sessions that dive into the subject in full detail. These sessions are 90
minutes long, and the audience is strongly encouraged to bring a
laptop to experiment. For a sneak peek of last year's tutorials, here are
the
- *Posters*: EuroScipy will host two poster sessions during the two days
of the Main Conference. Attendees and students are highly encouraged to
present their work and/or preliminary results as posters.
Proposals should be submitted using the EuroScipy submission system at
https://pretalx.com/euroscipy18. Submission deadline is *May, 31st 2018.*
----------------------------------------------------------------------------------------------------------------
REGISTRATION & FEES:
To register to EuroScipy 2018, please go to euroscipy2018.eventbrite.co.uk or
to http://www.euroscipy.org/2018
*Registration fees:*

*Tutorials Aug, 28th-29th 2018*
                               Student*   Academic/Individual   Industry
Early Bird (till July, 1st)    €50        €70                   €125
Regular (till Aug, 5th)        €100       €110                  €250
Late (till Aug, 22nd)          €135       €135                  €300
You register for one of the two tutorial tracks (introductory or advanced)
but you can switch between both tracks whenever you want as long as there
is enough space in the lecture rooms.
*Main Conference Aug, 30th-31st 2018*
                               Student*   Academic/Individual   Industry
Early Bird (till July, 1st)    €50        €70                   €125
Regular (till Aug, 5th)        €100       €110                  €250
Late (till Aug, 22nd)          €135       €135                  €300
* A proof of student status will be required at time of the registration.
Best regards,
EuroScipy 2018 Organising Committee,
Email: info at euroscipy.org | euroscipy at fbk.eu
Website: http://www.euroscipy.org/2018
twitter: @euroscipy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mcasl at unileon.es Fri May 18 08:32:21 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Fri, 18 May 2018 14:32:21 +0200
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a
wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:
Dear Joel,
I've changed the code of PipeGraph to replace the old wrappers with new
Mixin classes. The changes are reflected in this MixinClasses branch:
https://github.com/mcasl/PipeGraph/blob/feature/MixinClasses/pipegraph/adapters.py
My conclusion is that although both approaches are feasible and provide
similar functionality, Mixin classes offer a simpler solution. Following
the 'flat is better than nested' principle, the Mixin classes should be
favoured.
This approach also seems more in line with general sklearn development
practice, so I'll make the necessary changes to the docs and then the
master branch will be replaced with this new Mixin classes version.
Thanks for pointing out this issue!
Best
Manuel
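For illustration, here is a minimal, self-contained sketch of the delegation idea under discussion (this is hypothetical code, not the actual PipeGraph implementation; the `ToyEstimator`, `ParamDelegationMixin`, and `_strategy` names are made up for the example):

```python
class ToyEstimator:
    """Stand-in for a scikit-learn-style estimator with get_params/set_params."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        # Report the estimator's hyperparameters as a dict.
        return {"alpha": self.alpha}

    def set_params(self, **params):
        # Update hyperparameters in place and return self, sklearn-style.
        for name, value in params.items():
            setattr(self, name, value)
        return self


class ParamDelegationMixin:
    """Mixin that forwards get_params/set_params to the wrapped estimator."""

    def get_params(self, deep=True):
        return self._strategy.get_params(deep=deep)

    def set_params(self, **params):
        self._strategy.set_params(**params)
        return self


class Wrapper(ParamDelegationMixin):
    """Adapter that wraps an estimator and delegates parameter handling."""

    def __init__(self, estimator):
        self._strategy = estimator


wrapped = Wrapper(ToyEstimator())
wrapped.set_params(alpha=0.5)          # forwarded to the wrapped estimator
print(wrapped.get_params())            # {'alpha': 0.5}
```

The point of the Mixin variant is that the delegation logic lives in one flat class that any adapter can inherit, rather than being re-implemented in each wrapper.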
2018-04-16 14:21 GMT+02:00 Manuel CASTEJÓN LIMAS :
> Nope! Mostly because of lack of experience with mixins.
> I've done some reading and I think I can come up with a few mixins doing
> the trick by dynamically adding their methods to an already instantiated
> object. I'll play with that and I hope to show you something soon! Or at
> least I will have better grounds to make an educated decision.
> Best
> Manuel
>
>
>
>
> Manuel Castejón Limas
> *Escuela de Ingeniería Industrial e Informática*
> Universidad de León
> Campus de Vegazana s/n.
> 24071. León. Spain.
> *e-mail: *manuel.castejon at unileon.es
> *Tel.*: +34 987 291 946
>
> Digital Business Card: Click Here