From peer.j.nowack at gmail.com Wed May 2 07:08:28 2018
From: peer.j.nowack at gmail.com (Peer Nowack)
Date: Wed, 2 May 2018 12:08:28 +0100
Subject: [scikit-learn] How does multiple target Ridge Regression work in
scikit learn?
Message-ID:
Hi all,
I am struggling to understand the following:
Scikit-learn offers a multiple output version for Ridge Regression, simply
by handing over a 2D array [n_samples, n_targets], but how is it
implemented?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
Is it correct to assume that the regression for each target is
independent? If so, how can I adapt this to use an individual alpha
regularization parameter for each regression? If I use GridSearchCV, would
I have to hand over a matrix of possible regularization parameters? How
would that work?
Thanks in advance - I have been searching for hours but could not find
anything on this topic.
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From bertrand.thirion at inria.fr Wed May 2 08:07:12 2018
From: bertrand.thirion at inria.fr (bthirion)
Date: Wed, 2 May 2018 14:07:12 +0200
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To:
References:
Message-ID: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
The alpha parameter is shared across all problems; if you want to use
different parameters, you probably want to perform separate fits.
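A minimal sketch of the separate-fits approach, with hypothetical per-target
alpha values and random data chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                 # (n_samples, n_features)
Y = rng.randn(100, 3)                 # (n_samples, n_targets)
alphas = [0.1, 1.0, 10.0]             # hypothetical: one alpha per target

# fit one independent Ridge model per target column
models = [Ridge(alpha=a).fit(X, Y[:, j]) for j, a in enumerate(alphas)]
predictions = np.column_stack([m.predict(X) for m in models])
print(predictions.shape)              # (100, 3)
```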
Best,
Bertrand
From peer.j.nowack at gmail.com Wed May 2 09:02:33 2018
From: peer.j.nowack at gmail.com (Peer Nowack)
Date: Wed, 2 May 2018 14:02:33 +0100
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID:
Thanks, Bertrand - very helpful. I needed to confirm this.
Peter
On 2 May 2018 at 13:07, bthirion wrote:
> The alpha parameter is shared across all problems; if you want to use
> different parameters, you probably want to perform separate fits.
> Best,
>
> Bertrand
From michael.eickenberg at gmail.com Wed May 2 14:32:31 2018
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Wed, 2 May 2018 11:32:31 -0700
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To:
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID:
By the linear nature of the problem the targets are always separately
treated (even if there was a matrix-variate normal prior indicating
covariance between target columns, you could do that adjustment before or
after fitting).
As for different alpha parameters, I think you can specify a different
alpha per target if you pass in an array of shape (n_targets,). Maybe this
is not implemented for all solvers, but it should be at least for some.
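If array-valued alpha is supported as Michael describes, a small sketch would
look like this (random data; shapes are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 4)                          # (n_samples, n_features)
Y = rng.randn(50, 3)                          # (n_samples, n_targets)

# one regularization strength per target column
ridge = Ridge(alpha=np.array([0.1, 1.0, 10.0]))
ridge.fit(X, Y)
print(ridge.coef_.shape)                      # (3, 4): one row per target
```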
If you grid search, the scikit-learn API requires the score to be a single
number, so it's non-trivial to optimize different alphas for different
targets (even though selecting the best alpha for each target will of
course make the summed error go down, too).
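One workaround that stays within the scikit-learn API is to run a separate
small grid search per target column, so each search still produces a scalar
score (the grid values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
Y = rng.randn(80, 3)

best_alphas = []
for j in range(Y.shape[1]):
    # each search sees a single 1-D target, so its score is one number
    gs = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    gs.fit(X, Y[:, j])
    best_alphas.append(gs.best_params_["alpha"])
print(best_alphas)
```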
Depending on what your use case is, it may be easier to just write your own:
If X = U S VT (SVD), then weights = VT.T.dot((S / (S ** 2 + alpha) * U).T.dot(Y)).
For more than one alpha:

# alphas.shape == (n_alphas, n_targets)
# Y.shape == (n_samples, n_targets)
# X.shape == (n_samples, n_features)
U, S, VT = np.linalg.svd(X, full_matrices=False)
# ridge shrinkage factors s / (s**2 + alpha), one per (singular value, target)
diags = S[np.newaxis, :, np.newaxis] / (S[np.newaxis, :, np.newaxis] ** 2
                                        + alphas[:, np.newaxis, :])
UTY = U.T.dot(Y)
weights = np.zeros([n_alphas, n_features, n_targets])
for i in range(alphas.shape[0]):
    weights[i] = VT.T.dot(diags[i] * UTY)
Then use those weights to predict.
Michael
From princejha616 at gmail.com Thu May 3 02:53:20 2018
From: princejha616 at gmail.com (prince jha)
Date: Thu, 3 May 2018 12:23:20 +0530
Subject: [scikit-learn] Project Contribution
Message-ID:
Hello everyone, I am willing to contribute to the scikit-learn open source
project, but since I have never contributed to an open-source project
before, I don't know where to start. I would be thankful if any of you
could help me get started contributing to this great project.
Thanks,
Prince
From ross at cgl.ucsf.edu Thu May 3 03:02:54 2018
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Thu, 3 May 2018 00:02:54 -0700
Subject: [scikit-learn] Project Contribution
In-Reply-To:
References:
Message-ID:
Quick followup from a bystander: have you used scikit-learn for
anything? How much of the code have you read? (me: no, 0)
Bill
From m.ali.jamaoui at gmail.com Thu May 3 03:25:33 2018
From: m.ali.jamaoui at gmail.com (Mohamed Ali Jamaoui)
Date: Thu, 3 May 2018 09:25:33 +0200
Subject: [scikit-learn] Project Contribution
In-Reply-To:
References:
Message-ID:
Hi,
There are many ways to contribute, not only code. You can get started by
reading the "Contributing" section of the "Developer's guide":
http://scikit-learn.org/dev/developers/contributing.html
For code contributions, you don't need to read the whole codebase to be
able to contribute; try to pave your way into it gradually. A good first
step would be to start with issues labeled "good first issue".
Welcome onboard :)
Regards,
Mohamed Ali JAMAOUI
From princejha616 at gmail.com Thu May 3 03:48:14 2018
From: princejha616 at gmail.com (prince jha)
Date: Thu, 3 May 2018 13:18:14 +0530
Subject: [scikit-learn] Project Contribution
Message-ID:
Hi Bill, I have actually used scikit-learn for solving problems available
on Kaggle, but I am not very proficient since I have not used it much.
Thanks
Prince
From wouterverduin at gmail.com Fri May 4 05:12:40 2018
From: wouterverduin at gmail.com (Wouter Verduin)
Date: Fri, 4 May 2018 11:12:40 +0200
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
Message-ID:
Dear developers of scikit-learn,
I am working on a scientific paper on a prediction model for complications
in major abdominal resections. I have been using scikit-learn to create
that model and got good results (score of 0.94). This makes us want to see
what the model built by scikit-learn actually looks like.
As of now we have 100 input variables, but logically these aren't all as
useful as the others, and we want to reduce this number to about 20 and
see what the effect on the score is.
*My question*: Is there a way to get the underlying formula for the model
out of scikit-learn instead of having it as a 'blackbox' in my svm
function?
At this moment I am predicting a dichotomous variable from 100 variables
(continuous, ordinal and binary).
My code:
import numpy as np
from numpy import *
import pandas as pd
from sklearn import tree, svm, linear_model, metrics, preprocessing
import datetime
from sklearn.model_selection import KFold, cross_val_score, ShuffleSplit, GridSearchCV
from time import gmtime, strftime

# open and prepare the database
file = "/home/wouter/scikit/DB_SCIKIT.csv"
DB = pd.read_csv(file, sep=";", header=0, decimal=',').as_matrix()
DBT = DB
print "Vorm van de DB: ", DB.shape  # shape of the DB

target = []
for i in range(len(DB[:, -1])):
    target.append(DB[i, -1])
DB = delete(DB, s_[-1], 1)  # remove the last (target) column
AantalOutcome = target.count(1)
print "Aantal outcome:", AantalOutcome  # number of positive outcomes
print "Aantal patienten:", len(target)  # number of patients

A = DB
b = target
print len(DBT)

svc = svm.SVC(kernel='linear', cache_size=500, probability=True)
indices = np.random.permutation(len(DBT))
rs = ShuffleSplit(n_splits=5, test_size=.15, random_state=None)
scores = cross_val_score(svc, A, b, cv=rs)
A = ("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print A

X_train = DBT[indices[:-302]]
y_train = []
for i in range(len(X_train[:, -1])):
    y_train.append(X_train[i, -1])
X_train = delete(X_train, s_[-1], 1)  # remove the last (target) column

X_test = DBT[indices[-302:]]
y_test = []
for i in range(len(X_test[:, -1])):
    y_test.append(X_test[i, -1])
X_test = delete(X_test, s_[-1], 1)  # remove the last (target) column

model = svc.fit(X_train, y_train)
print model
uitkomst = model.score(X_test, y_test)
print uitkomst
voorspel = model.predict(X_test)
print voorspel
And output:

Vorm van de DB:  (2011, 101)
Aantal outcome: 128
Aantal patienten: 2011
2011
Accuracy: 0.94 (+/- 0.01)
SVC(C=1.0, cache_size=500, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=True, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.927152317881
[0. 0. 0. ... 0. 0. 0.]   (all 302 test predictions are class 0)
Thanks in advance!
with kind regards,
Wouter Verduin
From mail at sebastianraschka.com Fri May 4 05:51:26 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 May 2018 05:51:26 -0400
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID: <5331A676-D6C6-4F01-8A4D-EDDE9318E08F@sebastianraschka.com>
Dear Wouter,
for the SVM, scikit-learn wraps LIBSVM and LIBLINEAR. I think the scikit-learn class SVC uses LIBSVM for every kernel. Since you are using the linear kernel, you could use the more efficient LinearSVC scikit-learn class to get similar results. Its linear model is in turn easier to handle in terms of your question:
> Is there a way to get the underlying formula for the model out of scikit instead of having it as a 'blackbox' in my svm function.
More specifically, LinearSVC uses the _fit_liblinear code available here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/base.py
And more info on the LIBLINEAR library it is using can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (they have links to technical reports and implementation details there)
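To make the "formula" concrete: for a linear SVM the fitted model is just a
weight vector plus an intercept, exposed as `coef_` and `intercept_`. A
sketch on synthetic data (names and shapes are illustrative, not your
dataset):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LinearSVC(random_state=0).fit(X, y)

# decision function: f(x) = w . x + b; predict class 1 where f(x) > 0
w, b = clf.coef_[0], clf.intercept_[0]
manual = X.dot(w) + b
print(np.allclose(manual, clf.decision_function(X)))  # True
```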
Best,
Sebastian
From david.mo.burns at gmail.com Fri May 4 12:47:20 2018
From: david.mo.burns at gmail.com (David Burns)
Date: Fri, 4 May 2018 12:47:20 -0400
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID:
Hi Wouter,
If you are looking to reduce the feature space for your model, I suggest
you look at the scikit-learn page on doing just that:
http://scikit-learn.org/stable/modules/feature_selection.html
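For example, univariate selection can cut roughly 100 features down to 20 in
a couple of lines (synthetic data standing in for the clinical matrix; k=20
is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in for a (n_samples, 100) feature matrix
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# keep the 20 features with the strongest univariate F-test scores
selector = SelectKBest(f_classif, k=20).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (500, 20)
```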
David
From niyaghif at oregonstate.edu Fri May 4 19:10:44 2018
From: niyaghif at oregonstate.edu (Niyaghi, Faraz)
Date: Fri, 4 May 2018 16:10:44 -0700
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
Message-ID:
Greetings,
This is Faraz Niyaghi from Oregon State University. I research variable
selection using random forests. To the best of my knowledge, there is a
difference between scikit-learn's and Breiman's definitions of feature
importance: Breiman uses out-of-bag (oob) cases to calculate feature
importance, but scikit-learn doesn't. I was wondering: 1) why are they
different? 2) can they result in very different rankings of features?
Here are the definitions I found on the web:
*Breiman:* "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
*scikit-learn:* "The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
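For concreteness, the quantity scikit-learn describes is exposed on a fitted
forest as the `feature_importances_` attribute (toy data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# impurity-based importances, normalized to sum to 1
print(rf.feature_importances_)
```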
Thank you for reading this email. Please let me know your thoughts.
Cheers,
Faraz.
Faraz Niyaghi
Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
From mail at sebastianraschka.com Fri May 4 19:58:03 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 May 2018 19:58:03 -0400
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID: <4B01B139-0D45-4F85-A287-E5B36BC3FE03@sebastianraschka.com>
Not sure how it compares in practice, but it's certainly more efficient to rank the features by impurity decrease rather than by OOB permutation performance, since you wouldn't need to:
a) compute the OOB performance (an extra inference pass)
b) permute a feature column, do another inference pass, and compare it to a)
c) repeat step b) for each feature column
Another reason is that Breiman's suggestion wouldn't work that well for certain RandomForestClassifier settings in scikit-learn, e.g., bootstrap=False (no OOB samples exist then).
If you like to compute the feature importance after Breiman's suggestion, I have implemented a simple wrapper function for scikit-learn estimators here:
http://rasbt.github.io/mlxtend/user_guide/evaluate/feature_importance_permutation/#example-1-feature-importance-for-classifiers
Note that it's not using OOB samples but an independent validation set though, because it's a general function that should not be restricted to random forests. If you have such an independent dataset, it should give more accurate results than using OOB samples.
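A rough sketch of that validation-set permutation scheme (the steps above,
using a held-out split instead of OOB samples; all data synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = rf.score(X_val, y_val)

rng = np.random.RandomState(0)
importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, j])  # break the link between feature j and y
    importances.append(baseline - rf.score(X_perm, y_val))
print(np.round(importances, 3))
```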
Best,
Sebastian
From Jeremiah.Johnson at unh.edu Fri May 4 20:08:45 2018
From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah)
Date: Sat, 5 May 2018 00:08:45 +0000
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
Faraz, take a look at the discussion of this issue here: http://parrt.cs.usfca.edu/doc/rf-importance/index.html
Best,
Jeremiah
=========================================
Jeremiah W. Johnson, Ph.D
Asst. Professor of Data Science
Program Coordinator, B.S. in Analytics & Data Science
University of New Hampshire
Manchester, NH 03101
https://www.linkedin.com/in/jwjohnson314
From: scikit-learn > on behalf of "Niyaghi, Faraz" >
Reply-To: Scikit-learn mailing list >
Date: Friday, May 4, 2018 at 7:10 PM
To: "scikit-learn at python.org" >
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
Caution - External Email
________________________________
Greetings,
This is Faraz Niyaghi from Oregon State University. I research on variable selection using random forest. To the best of my knowledge, there is a difference between scikit-learn's and Breiman's definition of feature importance. Breiman uses out of bag (oob) cases to calculate feature importance but scikit-learn doesn't. I was wondering: 1) why are they different? 2) can they result in very different rankings of features?
Here are the definitions I found on the web:
Breiman: "In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
scikit-learn: " The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
Thank you for reading this email. Please let me know your thoughts.
Cheers,
Faraz.
Faraz Niyaghi
Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
From aqsdmcet at gmail.com Sat May 5 00:31:14 2018
From: aqsdmcet at gmail.com (aijaz qazi)
Date: Sat, 5 May 2018 10:01:14 +0530
Subject: [scikit-learn] Multi learn error.
Message-ID:
Dear developers of Scikit ,
I am working on web page categorization with http://scikit.ml/ .
*Question*: I am not able to execute the MLkNN code from
http://scikit.ml/api/classify.html. I have installed Python 3.6.
I found that my scipy version is not compatible with scikit.ml 0.0.5.
Which version of scipy would work with scikit.ml 0.0.5?
Kindly let me know.
*Regards,*
*Aijaz A.Qazi *
From rth.yurchak at gmail.com Sat May 5 02:28:22 2018
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sat, 5 May 2018 09:28:22 +0300
Subject: [scikit-learn] Multi learn error.
In-Reply-To:
References:
Message-ID: <49def996-56c7-ec5e-dc37-bf93968cfa2a@gmail.com>
Hi Aijaz,
On 05/05/18 07:31, aijaz qazi wrote:
> Dear developers of Scikit ,
Scikit is short for SciPy Toolkits (https://www.scipy.org/scikits.html);
there are a number of those. Scikit-learn started as one (and this is the
scikit-learn mailing list).
The package you are referring to is based on scikit-learn but is a separate
project (with a somewhat confusing home page URL). The right place to
ask for support would be its GitHub issue tracker or other project-specific
communication channels, if it has any.
--
Roman
From g.lemaitre58 at gmail.com Sat May 5 04:34:36 2018
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Sat, 5 May 2018 10:34:36 +0200
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
+1 on the post pointed out by Jeremiah.
On 5 May 2018 at 02:08, Johnson, Jeremiah wrote:
> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
> Best,
> Jeremiah
> =========================================
> Jeremiah W. Johnson, Ph.D
> Asst. Professor of Data Science
> Program Coordinator, B.S. in Analytics & Data Science
> University of New Hampshire
> Manchester, NH 03101
> https://www.linkedin.com/in/jwjohnson314
>
>
> From: scikit-learn python.org> on behalf of "Niyaghi, Faraz"
> Reply-To: Scikit-learn mailing list
> Date: Friday, May 4, 2018 at 7:10 PM
> To: "scikit-learn at python.org"
> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> Importance
>
> *Caution - External Email*
> ------------------------------
> Greetings,
>
> This is Faraz Niyaghi from Oregon State University. I research on variable
> selection using random forest. To the best of my knowledge, there is a
> difference between scikit-learn's and Breiman's definition of feature
> importance. Breiman uses out of bag (oob) cases to calculate feature
> importance but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
> Here are the definitions I found on the web:
>
> *Breiman:* "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>
>
> *scikit-learn: *" The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
> Link: http://scikit-learn.org/stable/modules/ensemble.html
>
>
> Thank you for reading this email. Please let me know your thoughts.
>
> Cheers,
> Faraz.
>
> Faraz Niyaghi
>
> Ph.D. Candidate, Department of Statistics
> Oregon State University
> Corvallis, OR
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
From g.louppe at gmail.com Sat May 5 05:21:17 2018
From: g.louppe at gmail.com (Gilles Louppe)
Date: Sat, 05 May 2018 09:21:17 +0000
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID:
Hi,
See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
point of view regarding the "issue" with feature importances. TLDR: Feature
importances as we have them in scikit-learn (i.e. MDI) are provably **not**
biased, provided trees are built totally at random (as in ExtraTrees with
max_features=1) and the depth is controlled via min_samples_split (to avoid
splitting on noise). On the other hand, it is not always clear what you
actually compute with MDA (permutation-based importances), since it is
conditioned on the model you use.
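As a minimal sketch of that setting (totally randomized trees via max_features=1, depth limited through min_samples_split; the synthetic dataset and all settings here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# First two features are informative, the rest are noise (shuffle=False
# keeps the informative features in the first columns).
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

# Totally randomized trees: one candidate feature per split, depth
# controlled via min_samples_split to avoid splitting on noise.
et = ExtraTreesClassifier(n_estimators=200, max_features=1,
                          min_samples_split=20, random_state=0).fit(X, y)
print(et.feature_importances_)  # MDI scores; they sum to 1
```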
Gilles
On Sat, 5 May 2018 at 10:36, Guillaume Lemaître wrote:
> +1 on the post pointed out by Jeremiah.
> On 5 May 2018 at 02:08, Johnson, Jeremiah
wrote:
>> Faraz, take a look at the discussion of this issue here:
http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>> Best,
>> Jeremiah
>> =========================================
>> Jeremiah W. Johnson, Ph.D
>> Asst. Professor of Data Science
>> Program Coordinator, B.S. in Analytics & Data Science
>> University of New Hampshire
>> Manchester, NH 03101
>> https://www.linkedin.com/in/jwjohnson314
>> From: scikit-learn on behalf of "Niyaghi, Faraz"
>> Reply-To: Scikit-learn mailing list
>> Date: Friday, May 4, 2018 at 7:10 PM
>> To: "scikit-learn at python.org"
>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
>> Caution - External Email
>> ________________________________
>> Greetings,
>> This is Faraz Niyaghi from Oregon State University. I research on
variable selection using random forest. To the best of my knowledge, there
is a difference between scikit-learn's and Breiman's definition of feature
importance. Breiman uses out of bag (oob) cases to calculate feature
importance but scikit-learn doesn't. I was wondering: 1) why are they
different? 2) can they result in very different rankings of features?
>> Here are the definitions I found on the web:
>> Breiman: "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>> scikit-learn: " The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
>> Link: http://scikit-learn.org/stable/modules/ensemble.html
>> Thank you for reading this email. Please let me know your thoughts.
>> Cheers,
>> Faraz.
>> Faraz Niyaghi
>> Ph.D. Candidate, Department of Statistics
>> Oregon State University
>> Corvallis, OR
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Sat May 5 09:16:50 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 5 May 2018 15:16:50 +0200
Subject: [scikit-learn] Announcing IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging
Message-ID: <20180505131650.ke323loujdoa2mxr@phare.normalesup.org>
Dear colleagues,
It is my pleasure to announce IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging.
https://paris-saclay-cds.github.io/autism_challenge/
This is a machine-learning challenge on brain-imaging data to achieve the
best prediction of autism spectrum disorder diagnostic status. We are
providing the largest cohort so far to learn such predictive biomarkers,
with more than 2000 individuals.
There is a total of 9000 euros of prizes to win for the best prediction.
The prediction quality will be measured on a large hidden test set to
ensure fairness.
We provide a simple starting kit to serve as a proof of feasibility. We
are excited to see what the community will come up with in terms of
predictive models and of score.
Best,
Gaël
--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
From jeff1evesque at yahoo.com Sat May 5 21:40:34 2018
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Sat, 5 May 2018 21:40:34 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
Message-ID: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Hi guys,
I want to perform some basic data analysis. Does anyone have good recommendations for where I can obtain free datasets? I was thinking of trying to do something related to neuroscience, but Kaggle doesn't have many datasets with this focus.
Thank you,
Jeff Levesque
https://github.com/jeff1evesque
From nicholdav at gmail.com Sat May 5 21:58:54 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Sat, 5 May 2018 21:58:54 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
Hi Jeff,
here's a couple of places to start, I'm sure other people can recommend
more:
https://crcns.org/
https://www.nature.com/sdata/policies/repositories (see under Neuroscience)
There's also the challenge that Gael just announced, predicting autism from
brain imaging data:
https://paris-saclay-cds.github.io/autism_challenge/
https://twitter.com/GaelVaroquaux/status/992752034242879488
--David
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab , Emory University,
Atlanta, GA, USA
On Sat, May 5, 2018 at 9:40 PM, Jeffrey Levesque via scikit-learn <
scikit-learn at python.org> wrote:
> Hi guys,
> I want to perform some basic data analysis. Anyone have good
> recommendations where I can obtain free datasets. I was thinking of trying
> to do something related to neuroscience. But, kaggle doesn't have many
> datasets for this focus.
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From michael.eickenberg at gmail.com Sat May 5 21:59:28 2018
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Sat, 5 May 2018 18:59:28 -0700
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
Hi Jeffrey,
check out these here for neuron data and fmri:
http://crcns.org/
And the ones here for fmri:
https://openfmri.org/
You can get started by installing one of the following packages and using
their dataset downloaders
http://nilearn.github.io/modules/reference.html#module-nilearn.datasets
https://martinos.org/mne/stable/manual/datasets_index.html
Also, there was this kaggle
https://www.kaggle.com/c/decoding-the-human-brain
And probably a bunch of others
Hope that helps!
Michael
On Sat, May 5, 2018 at 6:40 PM, Jeffrey Levesque via scikit-learn <
scikit-learn at python.org> wrote:
> Hi guys,
> I want to perform some basic data analysis. Anyone have good
> recommendations where I can obtain free datasets. I was thinking of trying
> to do something related to neuroscience. But, kaggle doesn't have many
> datasets for this focus.
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From nicholdav at gmail.com Sat May 5 22:04:56 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Sat, 5 May 2018 22:04:56 -0400
Subject: [scikit-learn] Jeff Levesque: neuroscience related datasets
In-Reply-To:
References: <4E8AFAD8-5F36-407C-9D46-9EA65AEE3EAA@yahoo.com>
Message-ID:
also (sorry for spamming the list!) should have said the Allen Institute
has a ton of data:
https://www.nwb.org/allen-cell-types-database/
and check out the cool dataset with this paper:
https://figshare.com/articles/Recordings_of_ten_thousand_neurons_in_visual_cortex_during_spontaneous_behaviors/6163622
https://github.com/MouseLand/stringer-pachitariu-et-al-2018a
explainer twitter thread:
https://twitter.com/marius10p/status/988069221941874688
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab , Emory University,
Atlanta, GA, USA
On Sat, May 5, 2018 at 9:58 PM, David Nicholson wrote:
> Hi Jeff,
>
> here's a couple of places to start, I'm sure other people can recommend
> more:
> https://crcns.org/
> https://www.nature.com/sdata/policies/repositories (see under
> Neuroscience)
>
> There's also the challenge that Gael just announced, predicting autism
> from brain imaging data:
> https://paris-saclay-cds.github.io/autism_challenge/
> https://twitter.com/GaelVaroquaux/status/992752034242879488
> --David
>
> David Nicholson, Ph.D.
> nickledave.github.io
> https://github.com/NickleDave
> Prinz lab , Emory
> University, Atlanta, GA, USA
>
> On Sat, May 5, 2018 at 9:40 PM, Jeffrey Levesque via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> Hi guys,
>> I want to perform some basic data analysis. Anyone have good
>> recommendations where I can obtain free datasets. I was thinking of trying
>> to do something related to neuroscience. But, kaggle doesn't have many
>> datasets for this focus.
>>
>> Thank you,
>>
>> Jeff Levesque
>> https://github.com/jeff1evesque
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
From joel.nothman at gmail.com Sat May 5 22:17:36 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 6 May 2018 12:17:36 +1000
Subject: [scikit-learn] Retracting model from the 'blackbox' SVM
In-Reply-To:
References:
Message-ID:
The coef_ available from LinearSVC will be somewhat indicative of the
relative importance of each feature.
But you might want to look into our feature selection documentation:
http://scikit-learn.org/stable/modules/feature_selection.html
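For instance, a hedged sketch along the lines of that documentation page, combining LinearSVC's coef_ with SelectFromModel (the C value and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# An L1-penalized linear SVM drives uninformative coefficients to zero;
# SelectFromModel then keeps the features with non-zero weights.
svc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(svc, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```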
From matti.v.viljamaa at gmail.com Sun May 6 14:01:12 2018
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Sun, 6 May 2018 21:01:12 +0300
Subject: [scikit-learn] Does sklearn.decomposition.TruncatedSVD take
n_components in order? Or can I select which features I want?
Message-ID: <5aef42ea.1c69fb81.779bc.933b@mx.google.com>
Does sklearn.decomposition.TruncatedSVD take n_components in order? Or can I select which features I want?
Reason being that if one uses the "pick features with eigenvalues > 1" principle, then I'd need to tell the SVD algorithm somehow which components it should use.
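One possible workaround sketch (this assumes components come back ordered by decreasing singular value, which appears to be the case: fit more components than needed, then keep a subset afterwards; the dataset and threshold are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X, _ = load_digits(return_X_y=True)

# Fit more components than needed, then keep only those passing the
# "eigenvalue > 1"-style rule on the explained variance.
svd = TruncatedSVD(n_components=20, random_state=0).fit(X)
keep = svd.explained_variance_ > 1.0
X_t = svd.transform(X)[:, keep]
print(X_t.shape)
```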
BR, Matti
Sent from Windows 10 Mail
---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
From santoshmsubedi at gmail.com Tue May 8 03:26:06 2018
From: santoshmsubedi at gmail.com (Santosh Subedi)
Date: Tue, 8 May 2018 16:26:06 +0900
Subject: [scikit-learn] Help me Please!
Message-ID:
Hello,
I'm using Scikit-learn for Gaussian Process Regression (GPR). I'm facing a
problem/confusion regarding GaussianProcessRegressor class. If gp is a
GaussianProcessRegressor, the prediction is given as:
y_pred_test, sigma = gp.predict(x_test, return_std=True)
After printing y_pred_test and sigma, I see that y_pred_test contains a
prediction for every data source (3 data sources per test point). However,
the standard deviation (sigma) is a single value per test point. I want
sigma to be predicted per data source, like y_pred_test. I've asked my
question at StackOverflow at the following link:
https://stackoverflow.com/questions/50185399/insufficient-output-with-predictx-test-return-std-true-in-gaussianprocessre
Could you reply with an appropriate answer to this email or at the
StackOverflow, please?
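For reference, the shapes in question can be reproduced with a small sketch (the data and two-target setup are made up for illustration; note that newer scikit-learn releases may return one std per target rather than one per test point):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(30, 1))
Y_train = np.hstack([np.sin(X_train), np.cos(X_train)])  # two targets

gp = GaussianProcessRegressor().fit(X_train, Y_train)
X_test = np.linspace(0, 10, 5).reshape(-1, 1)

# The mean comes back per target; compare its shape with sigma's.
y_mean, sigma = gp.predict(X_test, return_std=True)
print(y_mean.shape, sigma.shape)
```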
Thank you for your time and consideration.
Kindly Regards,
santobedi
From matti.v.viljamaa at gmail.com Wed May 9 10:08:40 2018
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Wed, 9 May 2018 17:08:40 +0300
Subject: [scikit-learn] How to pick the maximum possible parameters for
algos such as sklearn.decomposition.TruncatedSVD?
Message-ID: <5af300ea.1c69fb81.cc315.65e7@mx.google.com>
How do I pick the maximum possible parameters for algorithms such as sklearn.decomposition.TruncatedSVD?
This algorithm can raise a memory error if memory runs out, but of course one would like to select the maximum possible n_components given the available system memory.
So how do I do that?
Sent from Windows 10 Mail
From carolduncanpc833 at yahoo.com Wed May 9 11:40:52 2018
From: carolduncanpc833 at yahoo.com (Carol Duncan)
Date: Wed, 9 May 2018 15:40:52 +0000 (UTC)
Subject: [scikit-learn] How does multiple target Ridge Regression work
in scikit learn?
In-Reply-To: <81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
References:
<81a921c6-31af-153a-a9e9-9db664cb63dc@inria.fr>
Message-ID: <1570331285.1609254.1525880452333@mail.yahoo.com>
From: bthirion
To: scikit-learn at python.org
Sent: Wednesday, May 2, 2018 12:07 PM
Subject: Re: [scikit-learn] How does multiple target Ridge Regression work in scikit learn?
The alpha parameter is shared for all problems; if you want to use different parameters, you probably want to perform separate fits.
Best,
Bertrand
On 02/05/2018 13:08, Peer Nowack wrote:
Hi all,
I am struggling to understand the following: Scikit-learn offers a multiple output version for Ridge Regression, simply by handing over a 2D array [n_samples, n_targets], but how is it implemented?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
Is it correct to assume that each regression for each target is independent? Under these circumstances, how can I adapt this to use individual alpha regularization parameters for each regression? If I use GridSearchCV, I would have to hand over a matrix of possible regularization parameters, or how would that work?
Thanks in advance - I have been searching for hours but could not find anything on this topic.
Peter
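A sketch of per-target regularization (synthetic data; Ridge also appears to accept an array of one alpha per target, which should match fitting each target separately with its own alpha):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
Y = X @ rng.randn(4, 2) + 0.1 * rng.randn(100, 2)  # two targets

# Per-target penalties passed as an array of shape (n_targets,)...
ridge = Ridge(alpha=np.array([0.1, 10.0])).fit(X, Y)

# ...compared against fitting each target separately with its own alpha.
separate = np.column_stack(
    [Ridge(alpha=a).fit(X, Y[:, i]).predict(X)
     for i, a in enumerate([0.1, 10.0])]
)
print(np.allclose(ridge.predict(X), separate))
```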
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
From dylanf123 at gmail.com Thu May 10 03:08:07 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Thu, 10 May 2018 17:08:07 +1000
Subject: [scikit-learn] Unable to run make test-coverage
Message-ID:
Hi,
I am unable to run make test-coverage.
I get the error:
rm -rf coverage .coverage
pytest sklearn --showlocals -v --cov=sklearn --cov-report=html:coverage
usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov=sklearn
--cov-report=html:coverage
inifile: /Users/dylan/scikit-learn/setup.cfg
rootdir: /Users/dylan/scikit-learn
make: *** [test-coverage] Error 2
Regards,
Dylan
From joel.nothman at gmail.com Thu May 10 03:22:12 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 10 May 2018 17:22:12 +1000
Subject: [scikit-learn] Unable to run make test-coverage
In-Reply-To:
References:
Message-ID:
Do you have pytest-cov installed?
From dylanf123 at gmail.com Thu May 10 05:29:34 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Thu, 10 May 2018 19:29:34 +1000
Subject: [scikit-learn] Unable to run make test-coverage
In-Reply-To:
References:
Message-ID:
On Thu, May 10, 2018 at 5:22 PM, Joel Nothman
wrote:
> Do you have pytest-cov installed??
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
Thanks, I installed it and it works now.
From reismc at gmail.com Sat May 12 10:26:05 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Sat, 12 May 2018 11:26:05 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
Message-ID:
The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my computer
without any warning message!
I am using WinPython 3.6.5 64 bit.
The method works normally with the original data, but freezes when I use
the normalized data (between 0 and 1).
What should I do?
Att.,
Mauricio Reis
From awnystrom at gmail.com Sat May 12 18:20:32 2018
From: awnystrom at gmail.com (Andrew Nystrom)
Date: Sat, 12 May 2018 15:20:32 -0700
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
Message-ID:
If you're L2-norming your data, you're making it live on the surface of a
hypersphere. That surface will have a high density of points and may not
have areas of low density, in which case the entire surface could be
recognized as a single cluster if epsilon is high enough and min neighbors
is low enough. I'd suggest not using the L2 norm with DBSCAN.
On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my computer
> without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I use
> the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From rth.yurchak at gmail.com Sun May 13 04:34:42 2018
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sun, 13 May 2018 10:34:42 +0200
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
Message-ID: <801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Could you please check memory usage while running DBSCAN to make sure
freezing is due to running out of memory and not to something else?
Which parameters do you run DBSCAN with? Changing algorithm, leaf_size
parameters and ensuring n_jobs=1 could help.
Assuming eps is reasonable, I think it shouldn't be an issue to run
DBSCAN on L2 normalized data: using the default euclidean metric, this
should produce somewhat similar results to clustering not normalized
data with metric='cosine'.
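The suggested settings can be tried in a small sketch (the eps values and data are arbitrary; note that metric='cosine' requires the brute-force neighbor search):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)

# Tree-based neighbor search keeps memory bounded compared to the brute
# path on large data; n_jobs=1 avoids duplicating work per worker.
db = DBSCAN(eps=0.5, algorithm="ball_tree", leaf_size=30, n_jobs=1)
labels_l2 = db.fit_predict(normalize(X))  # L2-normalized, euclidean metric

# Roughly comparable alternative: cosine metric on the raw data.
labels_cos = DBSCAN(eps=0.1, metric="cosine",
                    algorithm="brute").fit_predict(X)
print(len(set(labels_l2)), len(set(labels_cos)))
```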
On 13/05/18 00:20, Andrew Nystrom wrote:
> If you're l2 norming your data, you're making it live on the surface of
> a hypersphere. That surface will have a high density of points and may
> not have areas of low density, in which case the entire surface could be
> recognized as a single cluster if epsilon is high enough and min
> neighbors is low enough. I'd suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
From reismc at gmail.com Sun May 13 19:23:15 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Sun, 13 May 2018 20:23:15 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
I think the problem is due to the size of my database, which has 44,000
records. When I ran tests with reduced sizes (the first 10,000 and 20,000
records), the routine ran normally.
You asked me to check the memory while running the DBScan routine, but I do
not know how to do that (if I did, I would have done it already).
I think the routine is not ready to work with this much data. The problem is
that my computer freezes and I cannot analyze the case. I've tried to
figure out whether any changes help (like changing routine parameters), but all
alternatives with lots of data (about 40,000 records) produce the error.
I believe the package routines have no exception handling, to improve
performance. So I suggest providing a test version that shows a proper
message when an error occurs.
To summarize: 1) How do I check the computer's memory during execution of
the routine? 2) I suggest developing test versions of routines that may run
into memory errors.
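Regarding 1), one stdlib-only way to watch Python-level allocations is tracemalloc (note it may miss memory allocated inside compiled extensions, so the OS task manager is still worth watching; the workload below is just a stand-in for the clustering call):

```python
import tracemalloc

import numpy as np

tracemalloc.start()

data = np.random.rand(1000, 10)  # stand-in for loading the 44,000 records
work = data @ data.T             # stand-in for the expensive fit() call

# current = memory traced right now, peak = high-water mark since start()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()
```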
Att.,
Mauricio Reis
2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure
> freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size
> parameters and ensuring n_jobs=1 could help.
>
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN
> on L2 normalized data: using the default euclidean metric, this should
> produce somewhat similar results to clustering not normalized data with
> metric='cosine'.
>
> On 13/05/18 00:20, Andrew Nystrom wrote:
>
>> If you're l2 norming your data, you're making it live on the surface of a
>> hypersphere. That surface will have a high density of points and may not
>> have areas of low density, in which case the entire surface could be
>> recognized as a single cluster if epsilon is high enough and min neighbors
>> is low enough. I'd suggest not using l2 norm with DBSCAN.
>> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>>
>> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
>> computer without any warning message!
>>
>> I am using WinPython 3.6.5 64 bit.
>>
>> The method works normally with the original data, but freezes when I
>> use the normalized data (between 0 and 1).
>>
>> What should I do?
>>
>> Att.,
>> Mauricio Reis
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From chema at rinzewind.org Sun May 13 19:44:34 2018
From: chema at rinzewind.org (=?iso-8859-1?Q?Jos=E9_Mar=EDa?= Mateos)
Date: Sun, 13 May 2018 19:44:34 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <20180513234434.GA3210@equipaje>
On Sun, May 13, 2018 at 08:23:15PM -0300, Mauricio Reis wrote:
> To summarize: 1) How to check the memory of the computer during the
> execution of the routine? 2) I suggest developing test versions of routines
> that may have a memory error.
If you are on Linux, can you just run "top" while your script runs? That
will tell you how much memory is being used by each process. On Windows,
you can use the Task Manager to obtain similar results.
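For reference, a minimal sketch of checking peak memory from inside the Python process itself, using only the standard library (Unix-only; note that `ru_maxrss` is reported in KiB on Linux but in bytes on macOS):

```python
import resource

def peak_rss_mib():
    """Peak resident set size of this process, in MiB (assumes Linux KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# ... run the memory-hungry call here, e.g. DBSCAN(...).fit(X), then:
print("peak memory: %.1f MiB" % peak_rss_mib())
```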
Cheers,
--
José María (Chema) Mateos
https://rinzewind.org/blog-es || https://rinzewind.org/blog-en
From mail at sebastianraschka.com Sun May 13 20:16:16 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Sun, 13 May 2018 20:16:16 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <1EA93B26-5892-4D85-9FE7-51F32B06C8DF@sebastianraschka.com>
> So I suggest that there is a test version that shows a proper message when an error occurs.
I think the freezing that happens in your case is operating system specific, and it would require some weird workarounds to detect at which RAM usage a given combination of machine and operating system might freeze (e.g., I have never observed my system freezing when I run out of RAM, since it has a pretty swift SSD to swap to, but the sklearn process may then take a very long time to finish). Plus, scikit-learn would need to know and constantly check how much memory is being used and currently available (due to the use of other apps and the OS kernel), which wouldn't be feasible.
I am not sure if this helps (depending on where the memory-usage bottleneck is), but it might help to provide a sparse (CSR) array instead of a dense one to the .fit() method. Another thing to try would be to pre-compute the distances and give those to the .fit() method after initializing the DBSCAN object with metric='precomputed'.
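The sparse-input suggestion could look like the sketch below (toy data; zeroing the negative entries just manufactures sparsity for the example, and the conversion only saves memory if the feature matrix genuinely contains many zeros):

```python
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
X[X < 0] = 0                      # artificial sparsity for the example
X_sparse = sparse.csr_matrix(X)   # CSR stores only the nonzero entries

# DBSCAN accepts sparse input directly; neighbor search uses brute force.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_sparse)
```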
Best,
Sebastian
> On May 13, 2018, at 7:23 PM, Mauricio Reis wrote:
>
> I think the problem is due to the size of my database, which has 44,000 records. When I ran a database test with reduced sizes (10,000 and 20,000 first records), the routine ran normally.
>
> You ask me to check the memory while running the DBScan routine, but I do not know how to do that (if I did, I would have done that already).
>
> I think the routine is not ready to work with too much data. The problem is that my computer freezes and I can not analyze the case. I've tried to figure out if any changes work (like changing routine parameters), but all alternatives with lots of data (about 40,000 records) generate error.
>
> I believe that package routines have no exception handling to improve performance. So I suggest that there is a test version that shows a proper message when an error occurs.
>
> To summarize: 1) How to check the memory of the computer during the execution of the routine? 2) I suggest developing test versions of routines that may have a memory error.
>
> Att.,
> Mauricio Reis
>
> 2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size parameters and ensuring n_jobs=1 could help.
>
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN on L2 normalized data: using the default euclidean metric, this should produce somewhat similar results to clustering not normalized data with metric='cosine'.
>
> On 13/05/18 00:20, Andrew Nystrom wrote:
> If you're l2 norming your data, you're making it live on the surface of a hypersphere. That surface will have a high density of points and may not have areas of low density, in which case the entire surface could be recognized as a single cluster if epsilon is high enough and min neighbors is low enough. I'd suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:
>
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
>
> I am using WinPython 3.6.5 64 bit.
>
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
>
> What should I do?
>
> Att.,
> Mauricio Reis
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Sun May 13 22:59:15 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 May 2018 12:59:15 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
This is quite a common issue with our implementation of DBSCAN, and
improvements to documentation would be very, very welcome.
The high memory cost comes from constructing the pairwise radius neighbors
for all points. If using a distance metric that cannot be indexed with a
KD-tree or Ball Tree, this results in n^2 floats being stored in memory
even before the radius neighbors are computed.
You have the following strategies available to you currently:
1. Calculate the radius neighborhoods using radius_neighbors_graph in
chunks, so as to avoid all pairs being calculated and stored at once. This
produces a sparse graph representation, which can be passed into dbscan
with metric='precomputed'. (I've just seen Sebastian suggested the same.)
2. Reduce the number of samples in your dataset and represent
(near-)duplicate points with sample_weight (i.e. two identical points would
be merged but would have a sample_weight of 2).
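Strategy 1 might be sketched as follows (toy data and an illustrative eps; radius_neighbors_graph returns a sparse graph holding only within-eps distances, which is then fed to DBSCAN with metric='precomputed'):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import radius_neighbors_graph

X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
eps = 0.5

# Sparse graph holding, for every point, only the distances to neighbours
# closer than eps; the dense n^2 pairwise matrix is never materialised.
D = radius_neighbors_graph(X, radius=eps, mode='distance')

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D)
```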
There is also a proposal to offer an alternative memory-efficient mode at
https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is welcome.
Cheers,
Joel
From joel.nothman at gmail.com Sun May 13 23:07:21 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 14 May 2018 13:07:21 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
Note that this has long been documented under "Memory consumption for large
sample sizes" at
http://scikit-learn.org/stable/modules/clustering.html#dbscan
On 14 May 2018 at 12:59, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius neighbors
> for all points. If using a distance metric that cannot be indexed with a
> KD-tree or Ball Tree, this results in n^2 floats being stored in memory
> even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once. This
> produces a sparse graph representation, which can be passed into dbscan
> with metric='precomputed'. (I've just seen Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points would
> be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode at
> https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
From dylanf123 at gmail.com Mon May 14 09:39:29 2018
From: dylanf123 at gmail.com (Dylan Fernando)
Date: Mon, 14 May 2018 23:39:29 +1000
Subject: [scikit-learn] New algorithm suggestion - AODE
Message-ID:
Hello,
I would like to suggest a new classification algorithm for scikit-learn,
Averaged one-dependence estimators (AODE).
AODE achieves highly accurate classification by averaging over all of a
small space of alternative naive-Bayes-like models that have weaker (and
hence less detrimental) independence assumptions than naive Bayes. The
resulting algorithm is computationally efficient while delivering highly
accurate classification on many learning tasks. For more information, see
the paper (https://link.springer.com/article/10.1007/s10994-005-4258-6),
which has over 200 citations.
There is an existing implementation in the WEKA machine learning suite (
http://weka.sourceforge.net/doc.stable/weka/classifiers/bayes/AODE.html).
I've made a pull request and I would like some feedback (
https://github.com/scikit-learn/scikit-learn/pull/11093).
Thank You,
Dylan
From t3kcit at gmail.com Wed May 16 13:27:40 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:27:40 -0400
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To:
References:
Message-ID: <3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
I don't think that's how most people use the trees, though.
Probably not even the ExtraTrees.
I really need to get around to reading your thesis :-/
Do you recommend using max_features=1 with ExtraTrees?
On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> Hi,
>
> See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> point of view regarding the "issue" with feature importances. TLDR: Feature
> importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> biased, provided trees are built totally at random (as in ExtraTrees with
> max_features=1) and the depth is controlled by min_samples_split (to avoid
> splitting on noise). On the other hand, it is not always clear what you
> actually compute with MDA (permutation based importances), since it is
> conditioned on the model you use.
>
> Gilles
> > On Sat, 5 May 2018 at 10:36, Guillaume Lemaître
> wrote:
>
>> +1 on the post pointed out by Jeremiah.
>> On 5 May 2018 at 02:08, Johnson, Jeremiah
> wrote:
>
>>> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
>>> Best,
>>> Jeremiah
>>> =========================================
>>> Jeremiah W. Johnson, Ph.D
>>> Asst. Professor of Data Science
>>> Program Coordinator, B.S. in Analytics & Data Science
>>> University of New Hampshire
>>> Manchester, NH 03101
>>> https://www.linkedin.com/in/jwjohnson314
>>> From: scikit-learn unh.edu at python.org> on behalf of "Niyaghi, Faraz"
>>> Reply-To: Scikit-learn mailing list
>>> Date: Friday, May 4, 2018 at 7:10 PM
>>> To: "scikit-learn at python.org"
>>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> Importance
>
>>> Caution - External Email
>>> ________________________________
>>> Greetings,
>>> This is Faraz Niyaghi from Oregon State University. I research on
> variable selection using random forest. To the best of my knowledge, there
> is a difference between scikit-learn's and Breiman's definition of feature
> importance. Breiman uses out of bag (oob) cases to calculate feature
> importance but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
>>> Here are the definitions I found on the web:
>>> Breiman: "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
>>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>>> scikit-learn: " The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
>>> Link: http://scikit-learn.org/stable/modules/ensemble.html
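Breiman's MDA definition quoted above can be approximated by hand; the sketch below (toy data) permutes each variable on a single held-out split rather than on the oob cases of each tree, so it is only an approximation of Breiman's per-tree oob procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)           # accuracy on untouched held-out data

rng = np.random.RandomState(0)
importances = []
for m in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, m] = rng.permutation(X_perm[:, m])  # destroy variable m's information
    importances.append(base - rf.score(X_perm, y_te))
```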
>>> Thank you for reading this email. Please let me know your thoughts.
>>> Cheers,
>>> Faraz.
>>> Faraz Niyaghi
>>> Ph.D. Candidate, Department of Statistics
>>> Oregon State University
>>> Corvallis, OR
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed May 16 13:37:36 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:37:36 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID:
You might also consider looking at hdbscan:
https://github.com/scikit-learn-contrib/hdbscan
On 05/13/2018 11:07 PM, Joel Nothman wrote:
> Note that this has long been documented under "Memory consumption for
> large sample sizes" at
> http://scikit-learn.org/stable/modules/clustering.html#dbscan
>
> On 14 May 2018 at 12:59, Joel Nothman wrote:
>
> This is quite a common issue with our implementation of DBSCAN,
> and improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot
> be indexed with a KD-tree or Ball Tree, this results in n^2 floats
> being stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph
> in chunks, so as to avoid all pairs being calculated and stored at
> once. This produces a sparse graph representation, which can be
> passed into dbscan with metric='precomputed'. (I've just seen
> Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical
> points would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient
> mode at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback
> is welcome.
>
> Cheers,
>
> Joel
>
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From t3kcit at gmail.com Wed May 16 13:44:17 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 16 May 2018 13:44:17 -0400
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
Message-ID: <2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
Should we have "low memory"/batched version of k_neighbors_graph and
epsilon_neighbors_graph functions? I assume
those instantiate the dense matrix right now.
On 05/13/2018 10:59 PM, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot be
> indexed with a KD-tree or Ball Tree, this results in n^2 floats being
> stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once.
> This produces a sparse graph representation, which can be passed into
> dbscan with metric='precomputed'. (I've just seen Sebastian suggested
> the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points
> would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode
> at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From gael.varoquaux at normalesup.org Wed May 16 13:50:07 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 May 2018 19:50:07 +0200
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
Message-ID: <20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
On Wed, May 16, 2018 at 01:44:17PM -0400, Andreas Mueller wrote:
> Should we have "low memory"/batched version of k_neighbors_graph and
> epsilon_neighbors_graph functions? I assume
> those instantiate the dense matrix right now.
+1!
It shouldn't be too hard to do.
G
From g.louppe at gmail.com Wed May 16 14:08:59 2018
From: g.louppe at gmail.com (Gilles Louppe)
Date: Wed, 16 May 2018 20:08:59 +0200
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
Importance
In-Reply-To: <3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
References:
<3aeca8e9-2a59-61e4-a88b-4b9e4c26cdae@gmail.com>
Message-ID:
> Do you recommend using max_features=1 with ExtraTrees?
If what you want are feature importances that reflect, without 'bias', the
mutual information of each variable (alone or in combination with others)
with Y, then yes. Bonus points if you set min_impurity_decrease > 0, to
avoid splitting on noise and collecting that as part of the importance
scores.
The resulting forest will not be optimal with respect to
classification/regression performance though.
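A sketch of the setup Gilles describes (toy data; the exact min_impurity_decrease threshold here is an arbitrary illustrative value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

# max_features=1: the split variable is chosen fully at random (totally
# randomized trees); min_impurity_decrease > 0 discourages splits on noise.
forest = ExtraTreesClassifier(n_estimators=200, max_features=1,
                              min_impurity_decrease=1e-3,
                              random_state=0).fit(X, y)
importances = forest.feature_importances_  # MDI scores, normalized to sum to 1
```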
On Wed, 16 May 2018 at 19:29, Andreas Mueller wrote:
> I don't think that's how most people use the trees, though.
> Probably not even the ExtraTrees.
> I really need to get around to reading your thesis :-/
> Do you recommend using max_features=1 with ExtraTrees?
> On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> > Hi,
> >
> > See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> > point of view regarding the "issue" with feature importances. TLDR: Feature
> > importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> > biased, provided trees are built totally at random (as in ExtraTrees with
> > max_features=1) and the depth is controlled by min_samples_split (to avoid
> > splitting on noise). On the other hand, it is not always clear what you
> > actually compute with MDA (permutation based importances), since it is
> > conditioned on the model you use.
> >
> > Gilles
> > On Sat, 5 May 2018 at 10:36, Guillaume Lemaître
> > wrote:
> >
> >> +1 on the post pointed out by Jeremiah.
> >> On 5 May 2018 at 02:08, Johnson, Jeremiah
> > wrote:
> >
> >>> Faraz, take a look at the discussion of this issue here:
> > http://parrt.cs.usfca.edu/doc/rf-importance/index.html
> >
> >>> Best,
> >>> Jeremiah
> >>> =========================================
> >>> Jeremiah W. Johnson, Ph.D
> >>> Asst. Professor of Data Science
> >>> Program Coordinator, B.S. in Analytics & Data Science
> >>> University of New Hampshire
> >>> Manchester, NH 03101
> >>> https://www.linkedin.com/in/jwjohnson314
> >>> From: scikit-learn > unh.edu at python.org> on behalf of "Niyaghi, Faraz" <
niyaghif at oregonstate.edu>
> >>> Reply-To: Scikit-learn mailing list
> >>> Date: Friday, May 4, 2018 at 7:10 PM
> >>> To: "scikit-learn at python.org"
> >>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature
> > Importance
> >
> >>> Caution - External Email
> >>> ________________________________
> >>> Greetings,
> >>> This is Faraz Niyaghi from Oregon State University. I research on
> > variable selection using random forest. To the best of my knowledge, there
> > is a difference between scikit-learn's and Breiman's definition of feature
> > importance. Breiman uses out of bag (oob) cases to calculate feature
> > importance but scikit-learn doesn't. I was wondering: 1) why are they
> > different? 2) can they result in very different rankings of features?
> >
> >>> Here are the definitions I found on the web:
> >>> Breiman: "In every tree grown in the forest, put down the oob cases and
> > count the number of votes cast for the correct class. Now randomly permute
> > the values of variable m in the oob cases and put these cases down the
> > tree. Subtract the number of votes for the correct class in the
> > variable-m-permuted oob data from the number of votes for the correct class
> > in the untouched oob data. The average of this number over all trees in the
> > forest is the raw importance score for variable m."
> >>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> >>> scikit-learn: "The relative rank (i.e. depth) of a feature used as a
> > decision node in a tree can be used to assess the relative importance of
> > that feature with respect to the predictability of the target variable.
> > Features used at the top of the tree contribute to the final prediction
> > decision of a larger fraction of the input samples. The expected fraction
> > of the samples they contribute to can thus be used as an estimate of the
> > relative importance of the features."
> >>> Link: http://scikit-learn.org/stable/modules/ensemble.html
> >>> Thank you for reading this email. Please let me know your thoughts.
> >>> Cheers,
> >>> Faraz.
> >>> Faraz Niyaghi
> >>> Ph.D. Candidate, Department of Statistics
> >>> Oregon State University
> >>> Corvallis, OR
> >>> _______________________________________________
> >>> scikit-learn mailing list
> >>> scikit-learn at python.org
> >>> https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> >
> >> --
> >> Guillaume Lemaitre
> >> INRIA Saclay - Parietal team
> >> Center for Data Science Paris-Saclay
> >> https://glemaitre.github.io/
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From joel.nothman at gmail.com Wed May 16 19:33:01 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 17 May 2018 09:33:01 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To: <20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
Implemented in a previous version of #10280, but removed for now to
simplify reviews.
If others would like to review #10280, I'm happy to follow up with the
changes requested here, which have already been implemented by Aman Dalmia
and myself.
From reismc at gmail.com Thu May 17 10:37:14 2018
From: reismc at gmail.com (Mauricio Reis)
Date: Thu, 17 May 2018 11:37:14 -0300
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
I am not used to the terms used here. So I understood that the package had
memory management, which was removed, but that you could make the code with
the memory management implementations available. Is that right?! :-)
The problem is that I do not know what I would do with the code, because I
only know how to work with the ready-made scikit-learn package. :-(
Att.,
Mauricio Reis
2018-05-16 20:33 GMT-03:00 Joel Nothman :
> Implemented in a previous version of #10280, but removed for now to
> simplify reviews.
> If others would like to review #10280, I'm happy to follow up with the
> changes requested here, which have already been implemented by Aman Dalmia
> and myself.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From joel.nothman at gmail.com Thu May 17 18:02:56 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 18 May 2018 08:02:56 +1000
Subject: [scikit-learn] DBScan freezes my computer !!!
In-Reply-To:
References:
<801b4e40-40b4-9972-072e-73fa052e4b6d@gmail.com>
<2a3b28be-e22f-d698-a910-c10202fcb08a@gmail.com>
<20180516175007.d4cv4yemvwdf52ku@phare.normalesup.org>
Message-ID:
There are two issues here:
1. We store all radius neighborhoods of all points in memory at once. This
is a problem if each point has a large radius neighborhood. DBSCAN only
requires that you store the radius neighbors of the point you are currently
examining. We could provide a memory-efficient mode that would do so.
2. Given that we store all neighborhoods at once, a brute force nearest
neighbors search will take O(n^2) memory, which can be reduced by chunking
the operation.
Both solutions have patches available already, but not reviewed.
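The chunked computation in issue 2 can be sketched as follows (toy data; the chunk size is arbitrary). Querying the fitted NearestNeighbors index one block of rows at a time and stacking the sparse results keeps peak memory proportional to chunk * n rather than n^2:

```python
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)
eps, chunk = 0.5, 500

nn = NearestNeighbors(radius=eps).fit(X)
blocks = []
for start in range(0, X.shape[0], chunk):
    # Each query returns a sparse (chunk, n) slice of the radius graph,
    # so only chunk x n distances are ever in flight at once.
    blocks.append(nn.radius_neighbors_graph(X[start:start + chunk],
                                            mode='distance'))
G = sparse.vstack(blocks)

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(G)
```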
On 18 May 2018 at 00:37, Mauricio Reis wrote:
> I'm not used to the terms used here. So I understood that the package had
> memory management, which was removed. But you could make the code available
> with memory management implementations. Is it?! :-)
> The problem is that I do not know what I would do with the code, because I
> only know how to work with the SciKitLearn package ready. :-(
>
> Att.,
> Mauricio Reis
>
> 2018-05-16 20:33 GMT-03:00 Joel Nothman :
>
>> Implemented in a previous version of #10280, but removed for now to
>> simplify reviews.
>> If others would like to review #10280, I'm happy to follow up with the
>> changes requested here, which have already been implemented by Aman Dalmia
>> and myself.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
From valerio.maggio at gmail.com Fri May 18 07:10:59 2018
From: valerio.maggio at gmail.com (Valerio Maggio)
Date: Fri, 18 May 2018 13:10:59 +0200
Subject: [scikit-learn] CFP: EuroSciPy 2018 - 11th European Conference on
Python in Science
Message-ID:
*** Apologies if you receive multiple copies ***
Dear Colleagues,
We are delighted to invite you to join us for the *11th European Conference
on Python in Science*.
The EuroSciPy 2018 Conference will be
organised by Fondazione Bruno Kessler (FBK) and will take place from
August, 28th to September, 1st in *Trento, Italy*.
The EuroSciPy meeting is a cross-disciplinary gathering focused on the use
and development of the Python language in scientific research. This event
strives to bring together both users and developers of scientific tools, as
well as people from academic research and state-of-the-art industry.
The conference will be structured as follows:
- *Aug, 28-29 *: Tutorials and Hands-on
- *Aug, 30-31 *: Main Conference
- *Sep, 1 *: Sprint
----------------------------------------------------------------------------------------------------------------
TOPICS OF INTEREST:
Presentations of scientific tools and libraries using the Python language,
including but not limited to:
- Algorithms implemented or exposed in Python
- Astronomy
- Data Visualisation
- Deep Learning & AI
- Earth, Ocean and Geo Science
- General-purpose Python tools that can be of special interest to the
scientific community.
- Image Processing
- Materials Science
- Parallel computing
- Political and Social Sciences
- Project Jupyter
- Reports on the use of Python in scientific achievements or ongoing
projects.
- Robotics & IoT
- Scientific data flow and persistence
- Scientific visualization
- Simulation
- Statistics
- Vector and array manipulation
- Web applications and portals for science and engineering
- 3D Printing
-----------------------------------------------------------------------------------------------------------------
CALL FOR PROPOSALS:
EuroScipy will accept three different kinds of contributions:
- *Regular Talks*: standard talks for oral presentations, allocated in
time slots of `15` or `30` minutes, depending on your preference and
scheduling constraints. Each time slot includes a Q&A session at the end
of the talk (at least 5 minutes).
- *Hands-on Tutorials*: These are *beginner* or *advanced* training
sessions that dive into the subject in full detail. These sessions are 90
minutes long, and the audience is strongly encouraged to bring a
laptop to experiment. For a sneak peek of last year's tutorials, here are
the
- *Posters*: EuroScipy will host two poster sessions during the two days
of the Main Conference. Attendees and students are highly encouraged to
present their work and/or preliminary results as posters.
Proposals should be submitted using the EuroScipy submission system at
https://pretalx.com/euroscipy18. Submission deadline is *May, 31st 2018.*
----------------------------------------------------------------------------------------------------------------
REGISTRATION & FEES:
To register to EuroScipy 2018, please go to euroscipy2018.eventbrite.co.uk or
to http://www.euroscipy.org/2018
*Registration fees:*

*Tutorials Aug, 28th-29th 2018*
                               Student*   Academic/Individual   Industry
Early Bird (till July, 1st)    €50        €70                   €125
Regular (till Aug, 5th)        €100       €110                  €250
Late (till Aug, 22nd)          €135       €135                  €300
You register for one of the two tutorial tracks (introductory or advanced)
but you can switch between both tracks whenever you want as long as there
is enough space in the lecture rooms.
*Main Conference Aug, 30th-31st 2018*
                               Student*   Academic/Individual   Industry
Early Bird (till July, 1st)    €50        €70                   €125
Regular (till Aug, 5th)        €100       €110                  €250
Late (till Aug, 22nd)          €135       €135                  €300
* A proof of student status will be required at time of the registration.
Best regards,
EuroScipy 2018 Organising Committee,
Email: info at euroscipy.org | euroscipy at fbk.eu
Website: http://www.euroscipy.org/2018
twitter: @euroscipy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mcasl at unileon.es Fri May 18 08:32:21 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Fri, 18 May 2018 14:32:21 +0200
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a
wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:
Dear Joel,
I've changed the code of PipeGraph to replace the old wrappers with new
Mixin classes. The changes are reflected in this MixinClasses branch:
https://github.com/mcasl/PipeGraph/blob/feature/MixinClasses/pipegraph/adapters.py
My conclusion is that although both approaches are feasible and provide
similar functionality, Mixin classes offer a simpler solution. Following
the 'flat is better than nested' principle, the Mixin classes should be
favoured.
This approach also seems more in line with general sklearn development
practice, so I'll make the necessary changes to the docs and then the
master branch will be replaced with this new Mixin classes version.
Thanks for pointing out this issue!
Best
Manuel
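For illustration, here is a minimal, self-contained sketch of the delegation idea under discussion (this is hypothetical code, not the actual PipeGraph implementation; the `ToyEstimator`, `ParamDelegationMixin`, and `_strategy` names are made up for the example):

```python
class ToyEstimator:
    """Stand-in for a scikit-learn-style estimator with get_params/set_params."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        # Report the estimator's hyperparameters as a dict.
        return {"alpha": self.alpha}

    def set_params(self, **params):
        # Update hyperparameters in place and return self, sklearn-style.
        for name, value in params.items():
            setattr(self, name, value)
        return self


class ParamDelegationMixin:
    """Mixin that forwards get_params/set_params to the wrapped estimator."""

    def get_params(self, deep=True):
        return self._strategy.get_params(deep=deep)

    def set_params(self, **params):
        self._strategy.set_params(**params)
        return self


class Wrapper(ParamDelegationMixin):
    """Adapter that wraps an estimator and delegates parameter handling."""

    def __init__(self, estimator):
        self._strategy = estimator


wrapped = Wrapper(ToyEstimator())
wrapped.set_params(alpha=0.5)          # forwarded to the wrapped estimator
print(wrapped.get_params())            # {'alpha': 0.5}
```

The point of the Mixin variant is that the delegation logic lives in one flat class that any adapter can inherit, rather than being re-implemented in each wrapper.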
2018-04-16 14:21 GMT+02:00 Manuel CASTEJÓN LIMAS :
> Nope! Mostly because of lack of experience with mixins.
> I've done some reading and I think I can come up with a few mixins doing
> the trick by dynamically adding their methods to an already instantiated
> object. I'll play with that and I hope to show you something soon! Or at
> least I will have better grounds to make an educated decision.
> Best
> Manuel
>
>
>
>
> Manuel Castejón Limas
> *Escuela de Ingeniería Industrial e Informática*
> Universidad de León
> Campus de Vegazana s/n.
> 24071. León. Spain.
> *e-mail: *manuel.castejon at unileon.es
> *Tel.*: +34 987 291 946
>
> Digital Business Card: Click Here