From robert.kern at gmail.com Sun Jul 1 22:02:02 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 1 Jul 2018 19:02:02 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> Message-ID: On 6/19/18 15:19, Robert Kern wrote: > On 6/19/18 08:12, Andreas Mueller wrote: >> I don't think I have the bandwidth but I agree :-/ >> Not sure if any of the other core devs do. I can try to read it next week but >> that's probably too late? > > We're not on a deadline. If you're interested in reading the NEP and providing > feedback/consent, I'm happy to hold off on formally accepting the NEP until then. I just made a deadline. :-) I formally proposed acceptance of the NEP. In 7 days, if no one objects, it will be formally marked as Accepted. https://mail.python.org/pipermail/numpy-discussion/2018-July/078380.html -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From roy.pamphile at gmail.com Tue Jul 3 04:41:30 2018 From: roy.pamphile at gmail.com (Pamphile Roy) Date: Tue, 3 Jul 2018 10:41:30 +0200 Subject: [scikit-learn] Update or downgrade PCA Message-ID: Hi everyone, I have some code that allows to upgrade (or downgrade) a PCA with a new sample. The update part is handy when you are doing live observations for instance and you want a quick way to update your PCA without having to recompute the whole thing from scratch. Are you interested in this? (For me or someone else to integrate it.) Code is open-source (from my Batman project) and can be found here: https://gitlab.com/cerfacs/batman/blob/develop/batman/pod/pod.py Functions of interest are _upgrade and downgrade. Although, the code should be cleaned up, it works well and it got some unit tests. Of course the math is backed-up by some literature: [1] M. Brand: Fast low-rank modifications of the thin singular value decomposition. 2006. DOI:10.1016/j.laa.2005.07.021 [2] T. Braconnier: Towards an adaptive POD/SVD surrogate model for aeronautic design. Computers & Fluids. 2011. DOI:10.1016/j.compfluid.2010.09.002 Cheers, Pamphile @tupui -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Tue Jul 3 04:49:34 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Tue, 3 Jul 2018 10:49:34 +0200 Subject: [scikit-learn] Update or downgrade PCA In-Reply-To: References: Message-ID: Hi, how does it compare with: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA ? Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Tue Jul 3 04:51:47 2018 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 3 Jul 2018 10:51:47 +0200 Subject: [scikit-learn] Update or downgrade PCA In-Reply-To: References: Message-ID: <3df37e82-c9df-6ae8-e254-209e5f45edae@gmail.com> Hi Pamphile, On 03/07/18 10:41, Pamphile Roy wrote: > I have some code that allows to upgrade (or downgrade)?a PCA with a new > sample. > The update part is handy when you are doing live observations for > instance and you want a quick way to update your PCA without having to > recompute the whole thing from scratch. > [..] > [1] M. Brand: Fast low-rank modifications of the thin singular value decomposition. 
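(A minimal sketch, on synthetic data with an arbitrary number of components, of the
streaming-update use case described above using scikit-learn's built-in IncrementalPCA.
The Brand-style thin-SVD update in the Batman code is a different algorithm, so this is
only a point of reference, not a drop-in equivalent.)

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X_initial = rng.random_sample((100, 20))   # snapshots observed so far
X_new = rng.random_sample((10, 20))        # freshly acquired snapshots

ipca = IncrementalPCA(n_components=2)
ipca.fit(X_initial)                  # build the initial decomposition
ipca.partial_fit(X_new)              # fold the new observations into it

print(ipca.components_.shape)              # (2, 20)
print(ipca.explained_variance_ratio_)      # updated variance profile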
Do you know how this would compare with sklearn.decomposition.IncrementalPCA ? -- Roman From roy.pamphile at gmail.com Tue Jul 3 05:06:31 2018 From: roy.pamphile at gmail.com (Pamphile Roy) Date: Tue, 3 Jul 2018 11:06:31 +0200 Subject: [scikit-learn] Update or downgrade PCA Message-ID: I have no idea about the comparison with sklearn.decomposition.IncrementalPCA. Was not aware of this but from the code it seems to be a different approach. I will try to come with some numbers. Pamphile -------------- next part -------------- An HTML attachment was scrubbed... URL: From amirouche.boubekki at gmail.com Tue Jul 3 07:46:43 2018 From: amirouche.boubekki at gmail.com (Amirouche Boubekki) Date: Tue, 3 Jul 2018 13:46:43 +0200 Subject: [scikit-learn] Supervised prediction of multiple scores for a document In-Reply-To: References: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> Message-ID: I made a rendering of the result online https://sensimark.com/ Le dim. 3 juin 2018 ? 23:22, Sebastian Raschka a ?crit : > sorry, I had a copy & paste error, I meant "LogisticRegression(..., > multi_class='multinomial')" and not "LogisticRegression(..., > multi_class='ovr')" > > > On Jun 3, 2018, at 5:19 PM, Sebastian Raschka > wrote: > > > > Hi, > > > >> I quickly read about multinomal regression, is it something do you > recommend I use? Maybe you think about something else? > > > > Multinomial regression (or Softmax Regression) should give you results > somewhat similar to a linear SVC (or logistic regression with OvO or OvR). > The theoretical difference is that Softmax regression assumes that the > classes are mutually exclusive, which is probably not the case in your > setting since e.g., an article could be both "Art" and "Science" to some > extend or so. Here a quick summary of softmax regression if useful: > https://sebastianraschka.com/faq/docs/softmax_regression.html. In > scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr'). > > > > Howeever, spontaneously, I would say that Latent Dirichlet Allocation > could be a better choice in your case. I.e., fit the model on the corpus > for a specified number of topics (e.g., 10, but depends on your dataset, I > would experiment a bit here), look at the top words in each topic and then > assign a topic label to each topic. Then, for a given article, you can > assign e.g., the top X labeled topics. > > > > Best, > > Sebastian > > > > > > > > > >> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki < > amirouche.boubekki at gmail.com> wrote: > >> > >> H?llo, > >> > >> I started a natural language processing project a few weeks ago called > wikimark (the code is all in wikimark.py) > >> > >> Given a text it wants to return a dictionary scoring the input against > vital articles categories, e.g.: > >> > >> out = wikimark("""Peter Hintjens wrote about the relation between > technology and culture. Without using a scientifical tone of > state-of-the-art review of the anthroposcene antropology, he gives a fair > amount of food for thought. According to Hintjens, technology is doomed to > become cheap. As matter of fact, intelligence tools will become more and > more accessible which will trigger a revolution to rebalance forces in > society.""") > >> > >> for category, score in out: > >> print('{} ~ {}'.format(category, score)) > >> > >> The above program would output something like that: > >> > >> Art ~ 0.1 > >> Science ~ 0.5 > >> Society ~ 0.4 > >> > >> Except not everything went as planned. 
Mind the fact that in the above > example the total is equal to 1, but I could not achieve that at all. > >> > >> I am using gensim to compute vectors of paragraphs (doc2vev) and then > submit those vectors to svm.SVR in a one-vs-all strategy ie. a document is > scored 1 if it's in that subcategory and zero otherwise. At prediction > time, it goes though the same doc2vec pipeline. The computer will score > each paragraph against the SVR models of wikipedia vital article > subcategories and get a value between 0 and 1 for each paragraph. I compute > the sum and group by subcategory and then I have a score per category for > the input document > >> > >> It somewhat works. I made a web ui online you can find it at > https://sensimark.com where you can test it. You can directly access the > >> full api e.g. > https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1 > >> > >> The output JSON document is a list of category dictionary where the > prediction key is associated with the average of the "prediction" of the > subcategories. If you replace &all=1 by &top=5 you might get something else > as top categories e.g. > https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10 > >> > >> or > >> > >> > https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5 > >> > >> I wrote "prediction" with double quotes because the value you see, is > the result of some formula. Since, the predictions I get are rather small > between 0 and 0.015 I apply the following formula: > >> value = math.exp(prediction) > >> magic = ((value * 100) - 110) * 100 > >> > >> In order to have values to spread between -200 and 200. Maybe this is > the symptom that my model doesn't work at all. > >> > >> Still, the top 10 results are almost always near each other (try with > BBC articles on https://sensimark.com . It is only when a regression > model is disqualified with a score of 0 that the results are simple to > understand. Sadly, I don't have an example at hand to support that claim. > You have to believe me. > >> > >> I just figured looking at the machine learning map that my problem > might be classification problem, except I don't really want to know what is > the class of new documents, I want to how what are the different subjects > that are dealt in the document based on a hiearchical corpus; > >> I don't want to guess a hiearchy! I want to now how the document > content spread over the different categories or subcategories. > >> > >> I quickly read about multinomal regression, is it something do you > recommend I use? Maybe you think about something else? > >> > >> Also, it seems I should benchmark / evaluate my model against LDA. > >> > >> I am rather noob in terms of datascience and my math skills are not so > fresh. I more likely looking for ideas on what algorithm, fine tuning and > some practice of datascience I must follow that doesn't involve writing my > own algorithm. > >> > >> Thanks in advance! > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
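(To make the multinomial suggestion above concrete: with multi_class='multinomial',
predict_proba returns per-category scores that sum to 1, which is the property asked
about. A minimal sketch; the toy documents, labels and vectorizer below are placeholders,
not the sensimark pipeline.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap intelligence tools reshape society",
        "a painting exhibition opened downtown",
        "new results on protein folding were published"]
labels = ["Society", "Art", "Science"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")
clf.fit(X, labels)

new_doc = vec.transform(["technology is doomed to become cheap"])
proba = clf.predict_proba(new_doc)
print(dict(zip(clf.classes_, proba[0])))   # the three scores sum to 1

If mutually exclusive categories are too strong an assumption, the LatentDirichletAllocation
route mentioned above gives per-topic proportions that also sum to 1 for each document.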
URL: From roy.pamphile at gmail.com Tue Jul 3 08:39:46 2018 From: roy.pamphile at gmail.com (Pamphile Roy) Date: Tue, 3 Jul 2018 14:39:46 +0200 Subject: [scikit-learn] Update or downgrade PCA In-Reply-To: References: Message-ID: So yes there is a difference between the two depending on the size of the matrix. Following is an output from ipython: *With a matrix of shape (1000 * 500)* (batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09) Type 'copyright', 'credits' or 'license' for more information IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: %timeit pod._update(snapshot2.T) 491 ms ? 22.4 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [2]: %timeit ipca.partial_fit(snapshot2) 163 ms ? 1.6 ms per loop (mean ? std. dev. of 7 runs, 10 loops each) *With a matrix of shape (1000 * 2000)* (batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09) Type 'copyright', 'credits' or 'license' for more information IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: %timeit pod._update(snapshot2.T) 4.84 s ? 220 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [2]: %timeit ipca.partial_fit(snapshot2) 5.85 s ? 77.6 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [3]: Do you really want to exit ([y]/n)? *With a matrix of shape (1000 * 20 000)* (batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09) Type 'copyright', 'credits' or 'license' for more information IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: %timeit pod._update(snapshot2.T) 3.39 s ? 65.8 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [2]: %timeit ipca.partial_fit(snapshot2) 33.1 s ? 17.7 s per loop (mean ? std. dev. of 7 runs, 1 loop each) Conclusion is that, the method seems faster to add one sample if the number of feature is superior to the number of samples. But if you want to add a bunch of sample, I found that sklearn seems a bit faster (38.75 s vs 34.51s to add 10 samples of shape 1000 * 20 000). It is to be noted that in this last case, adding a single or 10 samples is taking the same time ~30s. So depending on how much sample are to be added, this can help. Cheers, Pamphile P.S. Following is the code I used (requires batman available though conda-forge): import time import numpy as np from batman.pod import Pod from sklearn.decomposition import IncrementalPCA n_samples, n_features = 1000, 20000 snapshots = np.random.random_sample((n_samples, n_features)) snapshot2 = np.random.random_sample((1, n_features)) pod = Pod([np.zeros(n_features), np.ones(n_features)], None, np.inf, 1, 999) pod._decompose(snapshots.T) ipca = IncrementalPCA(999) ipca.fit(snapshots) np.allclose(ipca.singular_values_, pod.S) pod._update(snapshot2.T) ipca.partial_fit(snapshot2) np.allclose(ipca.singular_values_[:999], pod.S[:999]) snapshot3 = np.random.random_sample((10, n_features)) itime = time.time() [pod._update(snap.T[:, None]) for snap in snapshot3] print(time.time() - itime) itime = time.time() ipca.partial_fit(snapshot3) print(time.time() - itime) np.allclose(ipca.singular_values_[:999], pod.S[:999]) 2018-07-03 11:06 GMT+02:00 Pamphile Roy : > I have no idea about the comparison with sklearn.decomposition.Inc > rementalPCA. > Was not aware of this but from the code it seems to be a different > approach. 
> I will try to come with some numbers. > > Pamphile > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremie.du-boisberranger at inria.fr Tue Jul 3 09:23:35 2018 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Tue, 3 Jul 2018 15:23:35 +0200 Subject: [scikit-learn] Next sprint in Paris (july 16th and 17th) Message-ID: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> Hi everyone, On july 16th and 17th, there will be a scikit-learn sprint in Paris, in parallel with the one in Austin. There will be an official announce soon with the location and other informations. This is just an informal mail to ask if you have suggestions on topics/issues that you think we should look at during the sprint. Remember that it is a 2 days sprint, so we need things that can be handled in 2 days. Whether you intend to come or not, any suggestion is welcomed ! Best regards, Jeremie du Boisberranger From sdsr.sdsr at gmail.com Wed Jul 4 05:08:55 2018 From: sdsr.sdsr at gmail.com (=?UTF-8?Q?Sergio_Sol=C3=B3rzano?=) Date: Wed, 4 Jul 2018 11:08:55 +0200 Subject: [scikit-learn] Next sprint in Paris (july 16th and 17th) In-Reply-To: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> References: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> Message-ID: Hi everyone, Regarding the Python Sprint in Paris, I would like to know if it is possible to attend if one wants to contribute but has never done it before. In other words, is it "reserved" for experienced contributors/developers of sckikit-learn or newcomers can join as well? Best, Sergio On Tue, Jul 3, 2018 at 3:25 PM Jeremie du Boisberranger wrote: > > Hi everyone, > > On july 16th and 17th, there will be a scikit-learn sprint in Paris, in > parallel with the one in Austin. > > There will be an official announce soon with the location and other > informations. > > This is just an informal mail to ask if you have suggestions on > topics/issues that you think we should look at during the sprint. > Remember that it is a 2 days sprint, so we need things that can be > handled in 2 days. > > Whether you intend to come or not, any suggestion is welcomed ! > > Best regards, > > Jeremie du Boisberranger > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jeremie.du-boisberranger at inria.fr Wed Jul 4 08:31:23 2018 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Wed, 4 Jul 2018 14:31:23 +0200 Subject: [scikit-learn] Next sprint in Paris (july 16th and 17th) In-Reply-To: References: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> Message-ID: <3554e67e-adb7-f7ae-c89d-3795c6d279a7@inria.fr> Hi Sergio, I'm sorry but this sprint is quite short and thus will be for experienced contributors (at least experienced with the scikit-learn contributing work flow). We'll probably organize less restrictive sprints in the future. Best regards, Jeremie On 04/07/2018 11:08, Sergio Sol?rzano wrote: > Hi everyone, > > Regarding the Python Sprint in Paris, > I would like to know if it is possible to attend if one wants to > contribute but has never done it before. In other words, > is it "reserved" for experienced contributors/developers of > sckikit-learn or newcomers can join as well? 
> > Best,
> Sergio
>
>
> On Tue, Jul 3, 2018 at 3:25 PM Jeremie du Boisberranger
> wrote:
>> Hi everyone,
>>
>> On july 16th and 17th, there will be a scikit-learn sprint in Paris, in
>> parallel with the one in Austin.
>>
>> There will be an official announce soon with the location and other
>> informations.
>>
>> This is just an informal mail to ask if you have suggestions on
>> topics/issues that you think we should look at during the sprint.
>> Remember that it is a 2 days sprint, so we need things that can be
>> handled in 2 days.
>>
>> Whether you intend to come or not, any suggestion is welcomed !
>>
>> Best regards,
>>
>> Jeremie du Boisberranger
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From marco.fronzi at gmail.com  Tue Jul 10 00:25:51 2018
From: marco.fronzi at gmail.com (Marco Fronzi)
Date: Tue, 10 Jul 2018 14:25:51 +1000
Subject: [scikit-learn] compiling issue
Message-ID:

Hi,

My name is Marco and I am trying to install scikit-learn on my mac (OS X
10.11.6). I have already installed python3, numpy (1.8.2) and scipy, however
when I run pip3 install scikit-learn I get several errors which are listed below.
Failed building wheel for scikit-learn

and also:

Command "/usr/local/opt/python/bin/python3.7 -u -c "import setuptools,
tokenize;__file__='/private/tmp/pip-install-4z67z8of/scikit-learn/setup.py';f=getattr(tokenize,
'open', open)(__file__);code=f.read().replace('\r\n',
'\n');f.close();exec(compile(code, __file__, 'exec'))" install --record
/private/tmp/pip-record-pbcsv1zz/install-record.txt
--single-version-externally-managed --compile" failed with error code 1 in
/private/tmp/pip-install-4z67z8of/scikit-learn/

I would appreciate any suggestion/hint to solve this issue and install the
package.

Thank you,

Marco

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Tue Jul 10 00:54:24 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 10 Jul 2018 14:54:24 +1000
Subject: [scikit-learn] compiling issue
In-Reply-To:
References:
Message-ID:

Homebrew has pushed a lot of users onto Python 3.7 arguably prematurely:
several packages weren't ready to support it.

A compatibility release, Scikit-learn 0.19.2, is basically ready to be
released, but it may take another couple of days. See
https://github.com/scikit-learn/scikit-learn/issues/11320

As noted there, you can also downgrade Python to 3.6 with:

brew info python3
brew switch python 3.6.5

On 10 July 2018 at 14:25, Marco Fronzi wrote:

> Hi,
>
> My name is Marco and I am trying to install scikit-learn on my mac (OS X
> 10.11.6). I have already installed python3, numpy (1.8.2) and scipy, however
> when I run pip3 install scikit-learn I get several errors which are listed below.
>
> Failed building wheel for scikit-learn
>
> and also:
>
> Command "/usr/local/opt/python/bin/python3.7 -u -c "import setuptools,
> tokenize;__file__='/private/tmp/pip-install-4z67z8of/scikit-
> learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n',
> '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record
> /private/tmp/pip-record-pbcsv1zz/install-record.txt
> --single-version-externally-managed --compile" failed with error code 1
> in /private/tmp/pip-install-4z67z8of/scikit-learn/
>
> I would appreciate any suggestion/hint to solve this issue and install the
> package.
>
>
> Thank you,
>
> Marco
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From morin070 at umn.edu  Thu Jul 12 13:34:07 2018
From: morin070 at umn.edu (August Morin)
Date: Thu, 12 Jul 2018 13:34:07 -0400
Subject: [scikit-learn] Finding Formula of Gaussian Process Classification
Message-ID:

Hi all,

I've been handed down some code that is based on the Classifier Comparison
done by Gaël Varoquaux and Andreas Müller. The dataset is best classified
by the Gaussian Process, from which I would like to be able to find a
formula that I can run other datasets through for an image filtering
project. Is there a way to export the formula directly from sklearn? Any
ideas are much appreciated.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com  Sun Jul 15 19:51:28 2018
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 16 Jul 2018 01:51:28 +0200
Subject: [scikit-learn] sample_weights in RandomForestRegressor
Message-ID:

Hello,

I am kind of confused about the use of the sample_weights parameter in the
fit() function of RandomForestRegressor.
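(For reference, the weights are passed per training sample at fit time and the learned
importances can be read afterwards from feature_importances_. A minimal sketch; the
random data and made-up weights below stand in for the real descriptors and
similarity-derived weights.)

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(709, 50)     # descriptors of the 709 training molecules
y_train = rng.rand(709)         # binding affinities
weights = rng.rand(709)         # e.g. max similarity to the blind-test set

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train, sample_weight=weights)   # weights enter the fit here

top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print(top10)    # indices of the most influential features under this weighting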
Here is my problem: I am trying to predict the binding affinity of small molecules to a protein. I have a training set of 709 molecules and a blind test set of 180 molecules. I want to find those features that are more important for the correct prediction of the binding affinity of those 180 molecules of my blind test set. My rationale is that if I give more emphasis to the similar molecules in the training set, then I will get higher importances for those features that have higher predictive ability for this specific blind test set of 180 molecules. To this end, I weighted the 709 training set molecules by their maximum similarity to the 180 molecules, selected only those features with high importance and trained a new RF with all 709 molecules. I got some results but I am not satisfied. Is this the right way to use sample_weights in RF. I would appreciate any advice or suggested work flow. -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Mon Jul 16 10:54:55 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Mon, 16 Jul 2018 23:54:55 +0900 Subject: [scikit-learn] sample_weights in RandomForestRegressor In-Reply-To: References: Message-ID: Dear Thomas, Your strategy for model development is built on the assumption that the SAR (structure-activity relationship) is a continuous manifold constructed for your compound descriptors. However, SARs for many proteins in drug discovery or chemical biology are not continuous (consider kinase inhibitors). Therefore, you must make an assessment of the training data SAR to check for the prevalence of activity cliffs. There are at least two ways you can go about this: (1) Simply compute all pairwise similarities by your choice of descriptor+metric, then identify where there are pairs (e.g., MACCS-Tanimoto > 0.7) with large activity differences (e.g., K_i or IC50 difference of more than 10/50/100-fold; again, the biology of your problem determines the right values). (2) Perform many repetitions of train-test splitting on the 709 reference molecules, look at the distribution of your evaluation metric, and see if there is a limit in your ability to predict. If you are hitting a wall in terms of predictability (metric performance), it's a likely sign there is an activity cliff, and no amount of machine learning is going to be able to overcome this. Further, trace the predictability of individual compounds to identify those which consistently are predicted wrong. If you combine this with analysis (1), you can know exactly which of your chemistries are unmodelable. If you find that there are no activity cliffs in your dataset, then your application of the assumption that chemical similarity implies biological endpoint similarity will hold, and your experimental design is validated because of the presence of a continuous manifold. However, if you do have activity cliffs, then as awesome as sklearn is, it still cannot make the computational chemistry any better. Hope this helps you contextualize your work. Don't hesitate to contact me if I can be of consultation. Sincerely, J.B. 
Brown Kyoto University Graduate School of Medicine 2018-07-16 8:51 GMT+09:00 Thomas Evangelidis : > ?? > Hello, > > I am kind of confused about the use of sample_weights parameter in the > fit() function of RandomForestRegressor. Here is my problem: > > I am trying to predict the binding affinity of small molecules to a > protein. I have a training set of 709 molecules and a blind test set of 180 > molecules. I want to find those features that are more important for the > correct prediction of the binding affinity of those 180 molecules of my > blind test set. My rationale is that if I give more emphasis to the > similar molecules in the training set, then I will get higher importances > for those features that have higher predictive ability for this specific > blind test set of 180 molecules. To this end, I weighted the 709 training > set molecules by their maximum similarity to the 180 molecules, selected > only those features with high importance and trained a new RF with all 709 > molecules. I got some results but I am not satisfied. Is this the right way > to use sample_weights in RF. I would appreciate any advice or suggested > work flow. > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishekb2209 at gmail.com Mon Jul 16 23:19:21 2018 From: abhishekb2209 at gmail.com (Abhishek Babuji) Date: Mon, 16 Jul 2018 23:19:21 -0400 Subject: [scikit-learn] Would love to contribute to this library that I fell in love with. I have a question! FIRST TIMER Message-ID: TO WHOM IT MAY CONCERN, I have just learned Python to a level that I can say I'm comfortable with it. I have also picked up and learned Git and GitHub, and so now I'm ready to make my contribution to this library. I'm really enthusiastic but since this is my first time, I'd like to know a few things! *Must I know the underlying implementation of something to contribute code to fix it?* Explanation: Let's say, someone, tags some issue as 'first timers' and 'easy', and you want to take a look at it, see and contribute code/fix the code. Should I know the implementation of what the fixed code is supposed to do? or will this be explained when the issue is brought up? I have gone over issues in your GitHub. but I don't think I've seen enough examples. I don't seem to find this in the contributor guide. If someone could help me understand the level of depth that I must know scikit-learn to be able to contribute, I would then begin working towards it! Because I have used it a lot in my Machine Learning projects, so I'm not sure where I stand. Example: "The shovel doesn't work! Fix it! It is supposed to be able to dig through mud" My dilemma: I found an immovable rock in the mud that the shovel is not being able to dig through.. so I'm stuck. Guess I shouldn't have volunteered to help. Just on a side note, to all scikit-learn's contributors, you're doing God's work. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From seralouk at hotmail.com Fri Jul 20 05:35:18 2018 From: seralouk at hotmail.com (serafim loukas) Date: Fri, 20 Jul 2018 09:35:18 +0000 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem Message-ID: Dear Scikit-learn community, I have a 3 class classification problem and I would like to plot the average ROC across Folds. There is an example in scikit-learn website but only for binary classification problems (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html). I want to do the same but in the case of 3 classes. I have tried to use `clf = OneVsRestClassifier(LinearDiscriminantAnalysis())`but I am having a hard time to make it work. Any help would be appreciated, Makis -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Jul 20 10:44:00 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 20 Jul 2018 10:44:00 -0400 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem In-Reply-To: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> References: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> Message-ID: <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> Please stay on the mailing list. There is no single roc curve for a 3 class problem. So what do you want to plot? On 07/20/2018 10:40 AM, serafim loukas wrote: > Hello Andy, > > > Thank you for your response. > > What I want to do is to plot the average(mean) ROC across Folds for a > 3-class case. > I have managed to do so for the binary case and I am trying to make it > work for the mutlti-class case but with no luck. > > There is an example in the documentation > (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html) > for the binary case. > I want to do the same ( plot the mean ROC and the confidence interval > for my 3-class problem). > > Here is also my SO question about this: > https://stackoverflow.com/questions/51442818/average-roc-curve-across-folds-for-multi-class-classification-case-in-sklearn?with > some code included. > > > > Best, > Makis > > > >> On 20 Jul 2018, at 16:34, Andreas Mueller > > wrote: >> >> Hi Makis. >> What do you mean by a roc curve for multi-class? >> You can have one curve per class using OVR or one curve per pair of >> classes. >> That doesn't need the OneVsRestClassifier, it's more a matter of >> evaluation. >> >> Cheers, >> Andy >> >> On 07/20/2018 05:35 AM, serafim loukas wrote: >>> Dear Scikit-learn community, >>> >>> >>> I have a 3 class classification problem and I would like to plot the >>> average ROC across Folds. >>> There is an example in scikit-learn website but only for binary >>> classification problems >>> (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html). >>> >>> I want to do the same but in the case of 3 classes. I have tried to >>> use `clf = OneVsRestClassifier(LinearDiscriminantAnalysis())`but I >>> am having a hard time to make it work. >>> >>> >>> Any help would be?appreciated, >>> Makis >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Sat Jul 21 10:02:02 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) 
Date: Sat, 21 Jul 2018 23:02:02 +0900 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem In-Reply-To: <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> References: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> Message-ID: Hello Makis, 2018-07-20 23:44 GMT+09:00 Andreas Mueller : > There is no single roc curve for a 3 class problem. So what do you want to > plot? > > On 07/20/2018 10:40 AM, serafim loukas wrote: > > What I want to do is to plot the average(mean) ROC across Folds for a > 3-class case. > > The prototypical ROC curve uses True Positive Rate and False Positive Rate for its axes, so it is for 2-class problems, and not for 3+-class problems, as Andy mentioned. Perhaps you are wanting the mean and confidence intervals of the n-class Cohen Kappa metric as estimated by either many folds of cross validation, or you want to evaluate your classifier by repeated subsampling experiments and Kappa value distribution/histogram? Hope this helps, J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at hotmail.com Sat Jul 21 10:20:39 2018 From: seralouk at hotmail.com (serafim loukas) Date: Sat, 21 Jul 2018 14:20:39 +0000 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem In-Reply-To: References: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> Message-ID: Hello J.B, I could simply create some ROC curves as shown in the scikit-learn documentation by selecting only 2 classes and then repeating by selecting other pair of classes (in total I have 3 classes so this would result in 3 different ROC figures). An alternative would be I would like to plot the mean and confidence intervals of the 3-class Cohen Kappa metric as estimated by KFolds (k=5) cross-validation. Any tips about this ? Cheers, Makis On 21 Jul 2018, at 16:02, Brown J.B. via scikit-learn > wrote: Hello Makis, 2018-07-20 23:44 GMT+09:00 Andreas Mueller >: There is no single roc curve for a 3 class problem. So what do you want to plot? On 07/20/2018 10:40 AM, serafim loukas wrote: What I want to do is to plot the average(mean) ROC across Folds for a 3-class case. The prototypical ROC curve uses True Positive Rate and False Positive Rate for its axes, so it is for 2-class problems, and not for 3+-class problems, as Andy mentioned. Perhaps you are wanting the mean and confidence intervals of the n-class Cohen Kappa metric as estimated by either many folds of cross validation, or you want to evaluate your classifier by repeated subsampling experiments and Kappa value distribution/histogram? Hope this helps, J.B. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
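(One way to get the fold-wise kappa values asked about above, sketched on a stand-in
dataset: iris replaces the real 3-class data, LinearDiscriminantAnalysis is the classifier
mentioned earlier in the thread, and the splitter choice is illustrative.)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

kappa = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                        scoring=make_scorer(cohen_kappa_score), cv=cv)

print(kappa.mean(), kappa.std())           # mean kappa and spread across folds
print(np.percentile(kappa, [2.5, 97.5]))   # a rough interval over the 5 folds

With only 5 folds the interval is coarse; repeating the cross-validation with different
random_state values gives a smoother distribution to summarize.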
URL: From benoit.presles at u-bourgogne.fr Tue Jul 24 07:07:22 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 13:07:22 +0200 Subject: [scikit-learn] RFE with logistic regression Message-ID: Dear scikit-learn users, I am using the recursive feature elimination (RFE) tool from sklearn to rank my features: from sklearn.linear_model import LogisticRegression classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) from sklearn.feature_selection import RFE rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) rfe.fit(X, y) ranking = rfe.ranking_ print(ranking) 1. The first problem I have is when I execute the above code multiple times, I don't get the same results. 2. When I change the solver to 'sag' or 'saga' (classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it seems that I get the same results at each run but the ranking is not the same between these two solvers. 3. With C=1, it seems I have the same results at each run for the solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't get the same results between the different solvers. Thanks for your help, Best regards, Ben From stuart at stuartreynolds.net Tue Jul 24 12:16:57 2018 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 24 Jul 2018 09:16:57 -0700 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: liblinear regularizes the intercept (which is a questionable thing to do and a poor choice of default in sklearn). The other solvers do not. On Tue, Jul 24, 2018 at 4:07 AM, Beno?t Presles wrote: > Dear scikit-learn users, > > I am using the recursive feature elimination (RFE) tool from sklearn to rank > my features: > > from sklearn.linear_model import LogisticRegression > classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) > from sklearn.feature_selection import RFE > rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) > rfe.fit(X, y) > ranking = rfe.ranking_ > print(ranking) > > 1. The first problem I have is when I execute the above code multiple times, > I don't get the same results. > > 2. When I change the solver to 'sag' or 'saga' (classifier_RFE = > LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it > seems that I get the same results at each run but the ranking is not the > same between these two solvers. > > 3. With C=1, it seems I have the same results at each run for the > solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't > get the same results between the different solvers. > > > Thanks for your help, > Best regards, > Ben > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at sebastianraschka.com Tue Jul 24 12:40:34 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Tue, 24 Jul 2018 11:40:34 -0500 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: Agreed. But then the setting is c=1e9 in this context (where C is the inverse regularization strength), so the regularization effect should be very small. Probably shouldn't matter much for convex optimization, but I would still try to a) set the random_state to some fixed value b) make sure that .n_iter_ < .max_iter to see if that results in more consistency. 
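(A minimal sketch of those two checks on synthetic data; the solver, C and data sizes
here are arbitrary illustrations, not a recommendation.)

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = LogisticRegression(C=1e9, max_iter=10000, solver="saga",
                         random_state=0)            # (a) fix the seed
rfe = RFE(estimator=clf, n_features_to_select=1, step=1).fit(X, y)

clf.fit(X, y)                 # RFE fits clones internally; fit clf itself to inspect it
print(clf.n_iter_, "<=", clf.max_iter)              # (b) confirm convergence
print(rfe.ranking_)                                 # should now be repeatable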
Best, Sebastian > On Jul 24, 2018, at 11:16 AM, Stuart Reynolds wrote: > > liblinear regularizes the intercept (which is a questionable thing to > do and a poor choice of default in sklearn). > The other solvers do not. > > On Tue, Jul 24, 2018 at 4:07 AM, Beno?t Presles > wrote: >> Dear scikit-learn users, >> >> I am using the recursive feature elimination (RFE) tool from sklearn to rank >> my features: >> >> from sklearn.linear_model import LogisticRegression >> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) >> from sklearn.feature_selection import RFE >> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) >> rfe.fit(X, y) >> ranking = rfe.ranking_ >> print(ranking) >> >> 1. The first problem I have is when I execute the above code multiple times, >> I don't get the same results. >> >> 2. When I change the solver to 'sag' or 'saga' (classifier_RFE = >> LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it >> seems that I get the same results at each run but the ranking is not the >> same between these two solvers. >> >> 3. With C=1, it seems I have the same results at each run for the >> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't >> get the same results between the different solvers. >> >> >> Thanks for your help, >> Best regards, >> Ben >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From benoit.presles at u-bourgogne.fr Tue Jul 24 14:07:02 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 20:07:02 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: I did the same tests as before adding fit_intercept=False and: 1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time. 2. When I change the solver to 'sag' (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, solver='sag')), it seems that I get the same ranking at each run. This is not the case with the 'saga' solver. The ranking is not the same between the solvers. 3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers. How can I get reproducible and consistent results? Thanks for your help, Best regards, Ben Le 24/07/2018 ? 18:16, Stuart Reynolds a ?crit?: > liblinear regularizes the intercept (which is a questionable thing to > do and a poor choice of default in sklearn). > The other solvers do not. > > On Tue, Jul 24, 2018 at 4:07 AM, Beno?t Presles > wrote: >> Dear scikit-learn users, >> >> I am using the recursive feature elimination (RFE) tool from sklearn to rank >> my features: >> >> from sklearn.linear_model import LogisticRegression >> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) >> from sklearn.feature_selection import RFE >> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) >> rfe.fit(X, y) >> ranking = rfe.ranking_ >> print(ranking) >> >> 1. The first problem I have is when I execute the above code multiple times, >> I don't get the same results. >> >> 2. 
When I change the solver to 'sag' or 'saga' (classifier_RFE = >> LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it >> seems that I get the same results at each run but the ranking is not the >> same between these two solvers. >> >> 3. With C=1, it seems I have the same results at each run for the >> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't >> get the same results between the different solvers. >> >> >> Thanks for your help, >> Best regards, >> Ben >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Tue Jul 24 14:33:24 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 24 Jul 2018 14:33:24 -0400 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> On 07/24/2018 02:07 PM, Beno?t Presles wrote: > I did the same tests as before adding fit_intercept=False and: > > 1. I have got the same problem as before, i.e. when I execute the RFE > multiple times I don't get the same ranking each time. > > 2. When I change the solver to 'sag' > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, > fit_intercept=False, solver='sag')), it seems that I get the same > ranking at each run. This is not the case with the 'saga' solver. > The ranking is not the same between the solvers. > > 3. With C=1, it seems that I have the same results at each run for all > solvers (liblinear, sag and saga), however the ranking is not the same > between the solvers. > > > How can I get reproducible and consistent results? > Did you scale your data? If not, saga and sag will basically fail. From benoit.presles at u-bourgogne.fr Tue Jul 24 14:43:27 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 20:43:27 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> Message-ID: I did the same tests as before adding random_state=0 and: 1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time. 2. When I change the solver to 'sag' or 'saga' (LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, random_state=0, solver='sag')), it seems that I get the same results at each run but the ranking is not the same between these two solvers. 3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers. Thanks for your help, Ben PS1: I checked and n_iter_ seems to be always lower than max_iter. PS2: my data is scaled, I am using "StandardScaler". Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > > > On 07/24/2018 02:07 PM, Beno?t Presles wrote: >> I did the same tests as before adding fit_intercept=False and: >> >> 1. I have got the same problem as before, i.e. when I execute the RFE >> multiple times I don't get the same ranking each time. >> >> 2. 
When I change the solver to 'sag' or 'saga'
(LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False,
random_state=0, solver='sag')), it seems that I get the same results at each
run but the ranking is not the same between these two solvers.

3. With C=1, it seems that I have the same results at each run for all
solvers (liblinear, sag and saga), however the ranking is not the same
between the solvers.

Thanks for your help,
Ben


PS1: I checked and n_iter_ seems to be always lower than max_iter.
PS2: my data is scaled, I am using "StandardScaler".



Le 24/07/2018 à 20:33, Andreas Mueller a écrit :
>
>
> On 07/24/2018 02:07 PM, Benoît Presles wrote:
>> I did the same tests as before adding fit_intercept=False and:
>>
>> 1. I have got the same problem as before, i.e. when I execute the RFE
>> multiple times I don't get the same ranking each time.
>>
>> 2. When I change the solver to 'sag'
>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000,
>> fit_intercept=False, solver='sag')), it seems that I get the same
>> ranking at each run. This is not the case with the 'saga' solver.
>> The ranking is not the same between the solvers.
>>
>> 3. With C=1, it seems that I have the same results at each run for
>> all solvers (liblinear, sag and saga), however the ranking is not the
>> same between the solvers.
>>
>>
>> How can I get reproducible and consistent results?
>>
> Did you scale your data? If not, saga and sag will basically fail.
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From mail at sebastianraschka.com  Tue Jul 24 14:26:26 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Tue, 24 Jul 2018 13:26:26 -0500
Subject: [scikit-learn] RFE with logistic regression
In-Reply-To:
References:
Message-ID: <29EE1F14-4D1A-435D-93A1-5FC890F99447@sebastianraschka.com>

In addition to checking .n_iter_ and fixing the random seed as I suggested,
maybe also try normalizing the features (e.g., z-scores via the
StandardScaler) to see if that stabilizes the training.

Sent from my iPhone

> On Jul 24, 2018, at 1:07 PM, Benoît Presles wrote:
>
> I did the same tests as before adding fit_intercept=False and:
>
> 1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time.
>
> 2. When I change the solver to 'sag' (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, solver='sag')), it seems that I get the same ranking at each run. This is not the case with the 'saga' solver.
> The ranking is not the same between the solvers.
>
> 3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers.
>
>
> How can I get reproducible and consistent results?
>
>
> Thanks for your help,
> Best regards,
> Ben
>
>
>
>> Le 24/07/2018 à 18:16, Stuart Reynolds a écrit :
>> liblinear regularizes the intercept (which is a questionable thing to
>> do and a poor choice of default in sklearn).
>> The other solvers do not.
>>
>> On Tue, Jul 24, 2018 at 4:07 AM, Benoît Presles
>> wrote:
>>> Dear scikit-learn users,
>>>
>>> I am using the recursive feature elimination (RFE) tool from sklearn to rank
>>> my features:
>>>
>>> from sklearn.linear_model import LogisticRegression
>>> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000)
>>> from sklearn.feature_selection import RFE
>>> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1)
>>> rfe.fit(X, y)
>>> ranking = rfe.ranking_
>>> print(ranking)
>>>
>>> 1. The first problem I have is when I execute the above code multiple times,
>>> I don't get the same results.
>>>
>>> 2. When I change the solver to 'sag' or 'saga' (classifier_RFE =
>>> LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it
>>> seems that I get the same results at each run but the ranking is not the
>>> same between these two solvers.
>>>
>>> 3. With C=1, it seems I have the same results at each run for the
>>> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't
>>> get the same results between the different solvers.
>>> >>> >>> Thanks for your help, >>> Best regards, >>> Ben >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Tue Jul 24 15:34:31 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 24 Jul 2018 21:34:31 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> Message-ID: <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: > 3. With C=1, it seems that I have the same results at each run for all > solvers (liblinear, sag and saga), however the ranking is not the same > between the solvers. Your problem is probably ill-conditioned, hence the specific weights on the features are not stable. There isn't a good answer to ordering features, they are degenerate. In general, I would avoid RFE, it is a hack, and can easily lead to these problems. Ga?l > Thanks for your help, > Ben > PS1: I checked and n_iter_ seems to be always lower than max_iter. > PS2: my data is scaled, I am using "StandardScaler". > Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > > On 07/24/2018 02:07 PM, Beno?t Presles wrote: > > > I did the same tests as before adding fit_intercept=False and: > > > 1. I have got the same problem as before, i.e. when I execute the > > > RFE multiple times I don't get the same ranking each time. > > > 2. When I change the solver to 'sag' > > > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, > > > fit_intercept=False, solver='sag')), it seems that I get the same > > > ranking at each run. This is not the case with the 'saga' solver. > > > The ranking is not the same between the solvers. > > > 3. With C=1, it seems that I have the same results at each run for > > > all solvers (liblinear, sag and saga), however the ranking is not > > > the same between the solvers. > > > How can I get reproducible and consistent results? > > Did you scale your data? If not, saga and sag will basically fail. 
> > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From benoit.presles at u-bourgogne.fr Tue Jul 24 17:33:30 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 23:33:30 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> Message-ID: <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> So you think that I cannot get reproducible and consistent results with this method ? If you would avoid RFE, which method do you suggest to find the best features ? Ben Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: > On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: >> 3. With C=1, it seems that I have the same results at each run for all >> solvers (liblinear, sag and saga), however the ranking is not the same >> between the solvers. > Your problem is probably ill-conditioned, hence the specific weights on > the features are not stable. There isn't a good answer to ordering > features, they are degenerate. > > In general, I would avoid RFE, it is a hack, and can easily lead to these > problems. > > Ga?l > >> Thanks for your help, >> Ben > >> PS1: I checked and n_iter_ seems to be always lower than max_iter. >> PS2: my data is scaled, I am using "StandardScaler". > > >> Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > >>> On 07/24/2018 02:07 PM, Beno?t Presles wrote: >>>> I did the same tests as before adding fit_intercept=False and: >>>> 1. I have got the same problem as before, i.e. when I execute the >>>> RFE multiple times I don't get the same ranking each time. >>>> 2. When I change the solver to 'sag' >>>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, >>>> fit_intercept=False, solver='sag')), it seems that I get the same >>>> ranking at each run. This is not the case with the 'saga' solver. >>>> The ranking is not the same between the solvers. >>>> 3. With C=1, it seems that I have the same results at each run for >>>> all solvers (liblinear, sag and saga), however the ranking is not >>>> the same between the solvers. > >>>> How can I get reproducible and consistent results? >>> Did you scale your data? If not, saga and sag will basically fail. 
>>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn From bertrand.thirion at inria.fr Tue Jul 24 17:44:58 2018 From: bertrand.thirion at inria.fr (bthirion) Date: Tue, 24 Jul 2018 23:44:58 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> Message-ID: <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> Univariate screening is somewhat hackish too, but much more stable -- and cheap. Best, Bertrand On 24/07/2018 23:33, Beno?t Presles wrote: > So you think that I cannot get reproducible and consistent results > with this method ? > If you would avoid RFE, which method do you suggest to find the best > features ? > > Ben > > > Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: >> On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: >>> 3. With C=1, it seems that I have the same results at each run for all >>> solvers (liblinear, sag and saga), however the ranking is not the same >>> between the solvers. >> Your problem is probably ill-conditioned, hence the specific weights on >> the features are not stable. There isn't a good answer to ordering >> features, they are degenerate. >> >> In general, I would avoid RFE, it is a hack, and can easily lead to >> these >> problems. >> >> Ga?l >> >>> Thanks for your help, >>> Ben >> >>> PS1: I checked and n_iter_ seems to be always lower than max_iter. >>> PS2: my data is scaled, I am using "StandardScaler". >> >> >>> Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: >> >>>> On 07/24/2018 02:07 PM, Beno?t Presles wrote: >>>>> I did the same tests as before adding fit_intercept=False and: >>>>> 1. I have got the same problem as before, i.e. when I execute the >>>>> RFE multiple times I don't get the same ranking each time. >>>>> 2. When I change the solver to 'sag' >>>>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, >>>>> fit_intercept=False, solver='sag')), it seems that I get the same >>>>> ranking at each run. This is not the case with the 'saga' solver. >>>>> The ranking is not the same between the solvers. >>>>> 3. With C=1, it seems that I have the same results at each run for >>>>> all solvers (liblinear, sag and saga), however the ranking is not >>>>> the same between the solvers. >> >>>>> How can I get reproducible and consistent results? >>>> Did you scale your data? If not, saga and sag will basically fail. 
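As a concrete illustration of the univariate screening suggested above, a
minimal sketch (synthetic data, and k=3 is an arbitrary choice here): each
feature is scored on its own with an ANOVA F-test and the k best are kept,
which does not depend on a fitted estimator's weights.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Hypothetical stand-in for the real data.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    screen = SelectKBest(score_func=f_classif, k=3)
    X_reduced = screen.fit_transform(X, y)

    print(screen.scores_)                    # one F-score per feature
    print(screen.get_support(indices=True))  # indices of the retained features
    print(X_reduced.shape)                   # (200, 3)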
>>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From prat2 at umbc.edu Tue Jul 24 20:33:31 2018 From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu) Date: Tue, 24 Jul 2018 20:33:31 -0400 Subject: [scikit-learn] Help with Pull Request( Checks failing) Message-ID: Hi everyone, I submitted my first PR few hours back and I see that two tests failed. Would really appreciate if anyone can help me with how to fix these/ what I am doing wrong. Thank you ! -------------- next part -------------- An HTML attachment was scrubbed... URL: From prat2 at umbc.edu Tue Jul 24 20:34:08 2018 From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu) Date: Tue, 24 Jul 2018 20:34:08 -0400 Subject: [scikit-learn] Help with Pull Request( Checks failing) In-Reply-To: References: Message-ID: This is the link to the PR - https://github.com/scikit-learn/scikit-learn/pull/11670 On Tue, Jul 24, 2018 at 8:33 PM, Prathusha Jonnagaddla Subramanyam Naidu < prat2 at umbc.edu> wrote: > Hi everyone, > I submitted my first PR few hours back and I see that two tests > failed. Would really appreciate if anyone can help me with how to fix > these/ what I am doing wrong. > > Thank you ! > -- Regards, Prathusha JS Naidu Graduate Student Department of CSEE UMBC -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Tue Jul 24 21:06:07 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Tue, 24 Jul 2018 20:06:07 -0500 Subject: [scikit-learn] Help with Pull Request( Checks failing) In-Reply-To: References: Message-ID: I am not a core dev, but I think I can see what's wrong there (mostly Flake8 issues). Let me comment about that over there. > On Jul 24, 2018, at 7:34 PM, Prathusha Jonnagaddla Subramanyam Naidu wrote: > > This is the link to the PR - https://github.com/scikit-learn/scikit-learn/pull/11670 > > On Tue, Jul 24, 2018 at 8:33 PM, Prathusha Jonnagaddla Subramanyam Naidu wrote: > Hi everyone, > I submitted my first PR few hours back and I see that two tests failed. Would really appreciate if anyone can help me with how to fix these/ what I am doing wrong. > > Thank you ! > > > > -- > Regards, > Prathusha JS Naidu > Graduate Student > Department of CSEE > UMBC > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Wed Jul 25 00:29:22 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 25 Jul 2018 14:29:22 +1000 Subject: [scikit-learn] Would love to contribute to this library that I fell in love with. I have a question! FIRST TIMER In-Reply-To: References: Message-ID: Hi Abishek, In case you can't tell from the response, this is not a straightforward question to answer. I hope you have looked at our contributor guidelines: http://scikit-learn.org/dev/developers/contributing.html. We encourage contributors to start with changes that focus on things like documentation, or that involve simple changes to the code. 
In any case, we can try to help you navigate the code or the process of fixing a specific issue. Some issues require a deeper understanding of the implementation than others, and contributors should advance to those over time. We look forward to your contributions. Joel On 17 July 2018 at 13:19, Abhishek Babuji wrote: > TO WHOM IT MAY CONCERN, > > I have just learned Python to a level that I can say I'm comfortable with > it. I have also picked up and learned Git and GitHub, and so now I'm ready > to make my contribution to this library. > > I'm really enthusiastic but since this is my first time, I'd like to know > a few things! > > *Must I know the underlying implementation of something to contribute code > to fix it?* > > Explanation: Let's say, someone, tags some issue as 'first timers' and > 'easy', and you want to take a look at it, see and contribute code/fix the > code. > > Should I know the implementation of what the fixed code is supposed to do? > or will this be explained when the issue is brought up? I have gone over > issues in your GitHub. but I don't think I've seen enough examples. I don't > seem to find this in the contributor guide. > > If someone could help me understand the level of depth that I must know > scikit-learn to be able to contribute, I would then begin working towards > it! Because I have used it a lot in my Machine Learning projects, so I'm > not sure where I stand. > > Example: "The shovel doesn't work! Fix it! It is supposed to be able to > dig through mud" > My dilemma: I found an immovable rock in the mud that the shovel is not > being able to dig through.. so I'm stuck. Guess I shouldn't have > volunteered to help. > > Just on a side note, to all scikit-learn's contributors, you're doing > God's work. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Wed Jul 25 06:36:55 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Wed, 25 Jul 2018 12:36:55 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> Message-ID: <1c2d3fb1-6c5a-8070-2412-1648406dd047@u-bourgogne.fr> Do you think the problems I have can come from correlated features? Indeed, in my dataset I have some highly correlated features. Do you think this could explain why I don't get reproducible and consistent results? Thanks for your help, Ben Le 24/07/2018 ? 23:44, bthirion a ?crit?: > Univariate screening is somewhat hackish too, but much more stable -- > and cheap. > Best, > > Bertrand > > On 24/07/2018 23:33, Beno?t Presles wrote: >> So you think that I cannot get reproducible and consistent results >> with this method ? >> If you would avoid RFE, which method do you suggest to find the best >> features ? >> >> Ben >> >> >> Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: >>> On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: >>>> 3. With C=1, it seems that I have the same results at each run for all >>>> solvers (liblinear, sag and saga), however the ranking is not the same >>>> between the solvers. 
>>> Your problem is probably ill-conditioned, hence the specific weights on >>> the features are not stable. There isn't a good answer to ordering >>> features, they are degenerate. >>> >>> In general, I would avoid RFE, it is a hack, and can easily lead to >>> these >>> problems. >>> >>> Ga?l >>> >>>> Thanks for your help, >>>> Ben >>> >>>> PS1: I checked and n_iter_ seems to be always lower than max_iter. >>>> PS2: my data is scaled, I am using "StandardScaler". >>> >>> >>>> Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: >>> >>>>> On 07/24/2018 02:07 PM, Beno?t Presles wrote: >>>>>> I did the same tests as before adding fit_intercept=False and: >>>>>> 1. I have got the same problem as before, i.e. when I execute the >>>>>> RFE multiple times I don't get the same ranking each time. >>>>>> 2. When I change the solver to 'sag' >>>>>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, >>>>>> fit_intercept=False, solver='sag')), it seems that I get the same >>>>>> ranking at each run. This is not the case with the 'saga' solver. >>>>>> The ranking is not the same between the solvers. >>>>>> 3. With C=1, it seems that I have the same results at each run for >>>>>> all solvers (liblinear, sag and saga), however the ranking is not >>>>>> the same between the solvers. >>> >>>>>> How can I get reproducible and consistent results? >>>>> Did you scale your data? If not, saga and sag will basically fail. >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Wed Jul 25 07:50:04 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 25 Jul 2018 13:50:04 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <1c2d3fb1-6c5a-8070-2412-1648406dd047@u-bourgogne.fr> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> <1c2d3fb1-6c5a-8070-2412-1648406dd047@u-bourgogne.fr> Message-ID: <20180725115004.aaqhqbi65mbifr2r@phare.normalesup.org> On Wed, Jul 25, 2018 at 12:36:55PM +0200, Beno?t Presles wrote: > Do you think the problems I have can come from correlated features? Indeed, > in my dataset I have some highly correlated features. Yes, in general selecting features conditionally on others is very hard when features are highly correlated. > Do you think this could explain why I don't get reproducible and consistent > results? Yes. > Thanks for your help, > Ben > Le 24/07/2018 ? 23:44, bthirion a ?crit?: > > Univariate screening is somewhat hackish too, but much more stable -- > > and cheap. > > Best, > > Bertrand > > On 24/07/2018 23:33, Beno?t Presles wrote: > > > So you think that I cannot get reproducible and consistent results > > > with this method ? 
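A cheap way to check whether a design matrix is in that highly correlated,
ill-conditioned regime, sketched here on synthetic data (the real,
standardized X would go in its place):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler

    # Hypothetical stand-in for the real design matrix.
    X, _ = make_classification(n_samples=200, n_features=10, n_informative=3,
                               n_redundant=6, random_state=0)
    X = StandardScaler().fit_transform(X)

    corr = np.corrcoef(X, rowvar=False)
    off_diag = np.abs(corr - np.diag(np.diag(corr)))
    print(off_diag.max())     # close to 1 -> some features are nearly collinear
    print(np.linalg.cond(X))  # very large -> ill-conditioned, weights unstable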
> > > If you would avoid RFE, which method do you suggest to find the best > > > features ? > > > Ben > > > Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: > > > > On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: > > > > > 3. With C=1, it seems that I have the same results at each run for all > > > > > solvers (liblinear, sag and saga), however the ranking is not the same > > > > > between the solvers. > > > > Your problem is probably ill-conditioned, hence the specific weights on > > > > the features are not stable. There isn't a good answer to ordering > > > > features, they are degenerate. > > > > In general, I would avoid RFE, it is a hack, and can easily lead > > > > to these > > > > problems. > > > > Ga?l > > > > > Thanks for your help, > > > > > Ben > > > > > PS1: I checked and n_iter_ seems to be always lower than max_iter. > > > > > PS2: my data is scaled, I am using "StandardScaler". > > > > > Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > > > > > > On 07/24/2018 02:07 PM, Beno?t Presles wrote: > > > > > > > I did the same tests as before adding fit_intercept=False and: > > > > > > > 1. I have got the same problem as before, i.e. when I execute the > > > > > > > RFE multiple times I don't get the same ranking each time. > > > > > > > 2. When I change the solver to 'sag' > > > > > > > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, > > > > > > > fit_intercept=False, solver='sag')), it seems that I get the same > > > > > > > ranking at each run. This is not the case with the 'saga' solver. > > > > > > > The ranking is not the same between the solvers. > > > > > > > 3. With C=1, it seems that I have the same results at each run for > > > > > > > all solvers (liblinear, sag and saga), however the ranking is not > > > > > > > the same between the solvers. > > > > > > > How can I get reproducible and consistent results? > > > > > > Did you scale your data? If not, saga and sag will basically fail. > > > > > > _______________________________________________ > > > > > > scikit-learn mailing list > > > > > > scikit-learn at python.org > > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > > > > scikit-learn mailing list > > > > > scikit-learn at python.org > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From drraph at gmail.com Thu Jul 26 01:05:21 2018 From: drraph at gmail.com (Raphael C) Date: Thu, 26 Jul 2018 06:05:21 +0100 Subject: [scikit-learn] What is the FeatureAgglomeration algorithm? Message-ID: Hi, I am trying to work out what, in precise mathematical terms, [FeatureAgglomeration][1] does and would love some help. 
Here is some example code:

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    for S in ['ward', 'average', 'complete']:
        FA = FeatureAgglomeration(linkage=S)
        print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]])))

This outputs:

    [[  6.33333333 -50.        ]
     [  2.           0.        ]]
    [[  6.33333333 -50.        ]
     [  2.           0.        ]]
    [[  6.33333333 -50.        ]
     [  2.           0.        ]]

Is it possible to say mathematically how these values have been computed?

Also, what exactly does linkage do and why doesn't it seem to make any
difference which option you choose?

Raphael

  [1]: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.FeatureAgglomeration.html

PS I also asked at
https://stackoverflow.com/questions/51526616/what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Thu Jul 26 01:19:45 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 26 Jul 2018 07:19:45 +0200
Subject: [scikit-learn] What is the FeatureAgglomeration algorithm?
In-Reply-To: 
References: 
Message-ID: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org>

FeatureAgglomeration uses the Ward, complete linkage, or average linkage,
algorithms, depending on the choice of "linkage". These are well
documented in the literature, or on wikipedia.

Gaël

On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote:
> Hi,

> I am trying to work out what, in precise mathematical terms,
> [FeatureAgglomeration][1] does and would love some help. Here is some example
> code:

>     import numpy as np
>     from sklearn.cluster import FeatureAgglomeration
>     for S in ['ward', 'average', 'complete']:
>         FA = FeatureAgglomeration(linkage=S)
>         print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]])))

> This outputs:

>     [[  6.33333333 -50.        ]
>      [  2.           0.        ]]
>     [[  6.33333333 -50.        ]
>      [  2.           0.        ]]
>     [[  6.33333333 -50.        ]
>      [  2.           0.        ]]

> Is it possible to say mathematically how these values have been computed?

> Also, what exactly does linkage do and why doesn't it seem to make any
> difference which option you choose?

> Raphael

>   [1]: http://scikit-learn.org/stable/modules/generated/
> sklearn.cluster.FeatureAgglomeration.html

> PS I also asked at
> https://stackoverflow.com/questions/51526616/
> what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make

> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux

From drraph at gmail.com Thu Jul 26 01:25:44 2018
From: drraph at gmail.com (Raphael C)
Date: Thu, 26 Jul 2018 06:25:44 +0100
Subject: [scikit-learn] What is the FeatureAgglomeration algorithm?
In-Reply-To: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org>
References: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org>
Message-ID: 

Is it expected that all three linkages options should give the same result
in my toy example?

Raphael

On Thu, 26 Jul 2018 at 06:20 Gael Varoquaux
wrote:

> FeatureAgglomeration uses the Ward, complete linkage, or average linkage,
> algorithms, depending on the choice of "linkage".
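On the question of how the transformed values are computed, a short sketch
based on the estimator's documented defaults (worth double-checking against
the reference for your version): the features, i.e. the columns, are
clustered hierarchically, and transform then pools each cluster of features
with pooling_func, which defaults to np.mean. With the default n_clusters=2
the first feature ends up alone in one cluster and the last three in the
other, so the two output columns are just the per-cluster means:

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration

    X = np.array([[-50., 6., 6., 7.],
                  [  0., 1., 2., 3.]])

    # Defaults: n_clusters=2, linkage='ward', pooling_func=np.mean.
    fa = FeatureAgglomeration(n_clusters=2)
    Xt = fa.fit_transform(X)

    print(fa.labels_)             # cluster of each original feature, e.g. [1 0 0 0]
    print(Xt)                     # [[  6.33333333 -50.], [  2.  0.]]
    print(X[:, 1:].mean(axis=1))  # [ 6.33333333  2.] -> the pooled (mean) cluster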
These are well > documented in the literature, or on wikipedia. > > Ga?l > > On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote: > > Hi, > > > I am trying to work out what, in precise mathematical terms, > > [FeatureAgglomeration][1] does and would love some help. Here is some > example > > code: > > > > import numpy as np > > from sklearn.cluster import FeatureAgglomeration > > for S in ['ward', 'average', 'complete']: > > FA = FeatureAgglomeration(linkage=S) > > print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]]))) > > > This outputs: > > > > > > [[ 6.33333333 -50. ] > > [ 2. 0. ]] > > [[ 6.33333333 -50. ] > > [ 2. 0. ]] > > [[ 6.33333333 -50. ] > > [ 2. 0. ]] > > > Is it possible to say mathematically how these values have been computed? > > > Also, what exactly does linkage do and why doesn't it seem to make any > > difference which option you choose? > > > Raphael > > > > [1]: http://scikit-learn.org/stable/modules/generated/ > > sklearn.cluster.FeatureAgglomeration.html > > > PS I also asked at > > https://stackoverflow.com/questions/51526616/ > > > what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Thu Jul 26 01:38:54 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 26 Jul 2018 07:38:54 +0200 Subject: [scikit-learn] What is the FeatureAgglomeration algorithm? In-Reply-To: References: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org> Message-ID: <05e6d50a-fcf9-436a-b3aa-35d9f97eb195@normalesup.org> No. ?Sent from my phone. Please forgive typos and briefness.? On Jul 26, 2018, 07:28, at 07:28, Raphael C wrote: >Is it expected that all three linkages options should give the same >result >in my toy example? > >Raphael > >On Thu, 26 Jul 2018 at 06:20 Gael Varoquaux > >wrote: > >> FeatureAgglomeration uses the Ward, complete linkage, or average >linkage, >> algorithms, depending on the choice of "linkage". These are well >> documented in the literature, or on wikipedia. >> >> Ga?l >> >> On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote: >> > Hi, >> >> > I am trying to work out what, in precise mathematical terms, >> > [FeatureAgglomeration][1] does and would love some help. Here is >some >> example >> > code: >> >> >> > import numpy as np >> > from sklearn.cluster import FeatureAgglomeration >> > for S in ['ward', 'average', 'complete']: >> > FA = FeatureAgglomeration(linkage=S) >> > print(FA.fit_transform(np.array([[-50,6,6,7,], >[0,1,2,3]]))) >> >> > This outputs: >> >> > >> >> > [[ 6.33333333 -50. ] >> > [ 2. 0. ]] >> > [[ 6.33333333 -50. ] >> > [ 2. 0. ]] >> > [[ 6.33333333 -50. ] >> > [ 2. 0. ]] >> >> > Is it possible to say mathematically how these values have been >computed? >> >> > Also, what exactly does linkage do and why doesn't it seem to make >any >> > difference which option you choose? 
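In the two-sample toy example the geometry is degenerate, one far-away
feature against three nearly identical ones, so every linkage choice ends up
cutting the tree into the same two clusters. On a less trivial matrix the
linkage generally does change the grouping; a small sketch with arbitrary
random data (the exact labels will vary):

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration

    rng = np.random.RandomState(0)
    X = rng.rand(30, 8)  # 30 samples, 8 features, no special structure

    for linkage in ['ward', 'average', 'complete']:
        fa = FeatureAgglomeration(n_clusters=3, linkage=linkage)
        fa.fit(X)
        # labels_ gives the cluster each original feature is merged into;
        # on data like this the three linkages typically disagree.
        print(linkage, fa.labels_)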
>> >> > Raphael >> >> >> > [1]: http://scikit-learn.org/stable/modules/generated/ >> > sklearn.cluster.FeatureAgglomeration.html >> >> > PS I also asked at >> > https://stackoverflow.com/questions/51526616/ >> > >> >what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make >> >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Gael Varoquaux >> Senior Researcher, INRIA Parietal >> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> Phone: ++ 33-1-69-08-79-68 >> http://gael-varoquaux.info >http://twitter.com/GaelVaroquaux >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From trenton.bricken at duke.edu Thu Jul 26 15:01:41 2018 From: trenton.bricken at duke.edu (Trenton Bricken) Date: Thu, 26 Jul 2018 19:01:41 +0000 Subject: [scikit-learn] f_classif function confusion Message-ID: I am very confused by the f_classif function using feature.selection.f_classif() The function under the User Guide says that it can be used for classification tasks and under the documentation it claims to use ANOVA. However, ANOVA takes a categorical input and continuous output. Why can we provide it with a continuous input and categorical output here? Also looking at the source code, there are warnings for not using ANOVA if your feature is not normally distributed. I think these should be more visible to warn the unaware before they start using this method for feature analysis. Thank you in advance for any help and explanations that can be provided about why f_classif can be used with categorical outputs. Trenton -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajkiranvsgo at gmail.com Sat Jul 28 23:49:05 2018 From: rajkiranvsgo at gmail.com (Rajkiran Veldur) Date: Sun, 29 Jul 2018 09:19:05 +0530 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions Message-ID: Hello Team, I have been following scikit-learn closely these days as I have been working on different machine learning algorithms. Thank you for making everything so simple. Your documents could be followed even by novice. Now, when I was working with spectral clustering, I found your example of *Segmenting the picture of Lena in regions *intuitive and wanted to try it. However, scipy has removed the scipy.misc.lena() module from their library, due to licensing issues. So, I request you to please update the code with any other image. Regards, Rajkiran Veldur -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakevdp at cs.washington.edu Sun Jul 29 00:51:59 2018 From: jakevdp at cs.washington.edu (Jacob Vanderplas) Date: Sat, 28 Jul 2018 21:51:59 -0700 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions In-Reply-To: References: Message-ID: Hi Rajkiran, It sounds like you found an example from an old version of the scikit-learn documentation. 
After scipy removed that image, the example you're referring to was updated to this one: http://scikit-learn.org/stable/auto_examples/cluster/plot_face_segmentation.html Best, Jake Jake VanderPlas Senior Data Science Fellow Director of Open Software University of Washington eScience Institute On Sat, Jul 28, 2018 at 8:49 PM, Rajkiran Veldur wrote: > Hello Team, > > I have been following scikit-learn closely these days as I have been > working on different machine learning algorithms. Thank you for making > everything so simple. Your documents could be followed even by novice. > > Now, when I was working with spectral clustering, I found your example of *Segmenting > the picture of Lena in regions *intuitive and wanted to try it. > > However, scipy has removed the scipy.misc.lena() module from their > library, due to licensing issues. > > So, I request you to please update the code with any other image. > > Regards, > Rajkiran Veldur > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sun Jul 29 01:40:54 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 29 Jul 2018 07:40:54 +0200 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions In-Reply-To: References: Message-ID: <8168f979-d374-4d1e-a84f-d18aeb3197dd@normalesup.org> You are looking at an old version of the documentation. In the up to date documentation, the picture has been replaced: http://scikit-learn.org/stable/auto_examples/cluster/plot_face_segmentation.html ?Sent from my phone. Please forgive typos and briefness.? On Jul 29, 2018, 05:51, at 05:51, Rajkiran Veldur wrote: > Hello Team, > >I have been following scikit-learn closely these days as I have been >working on different machine learning algorithms. Thank you for making >everything so simple. Your documents could be followed even by novice. > >Now, when I was working with spectral clustering, I found your example >of *Segmenting >the picture of Lena in regions *intuitive and wanted to try it. > >However, scipy has removed the scipy.misc.lena() module from their >library, >due to licensing issues. > >So, I request you to please update the code with any other image. > >Regards, >Rajkiran Veldur > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajkiranvsgo at gmail.com Sun Jul 29 06:00:57 2018 From: rajkiranvsgo at gmail.com (Rajkiran Veldur) Date: Sun, 29 Jul 2018 15:30:57 +0530 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions In-Reply-To: References: Message-ID: Hi Jacob, Thanks for the update. That was real quick and helpful. Regards, Rajkiran Veldur On Sun, Jul 29, 2018 at 10:21 AM, Jacob Vanderplas < jakevdp at cs.washington.edu> wrote: > Hi Rajkiran, > It sounds like you found an example from an old version of the > scikit-learn documentation. 
>
> After scipy removed that image, the example you're referring to was
> updated to this one: http://scikit-learn.org/stable/auto_examples/cluster/
> plot_face_segmentation.html
>
> Best,
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Sat, Jul 28, 2018 at 8:49 PM, Rajkiran Veldur
> wrote:
>
>> Hello Team,
>>
>> I have been following scikit-learn closely these days as I have been
>> working on different machine learning algorithms. Thank you for making
>> everything so simple. Your documents could be followed even by novice.
>>
>> Now, when I was working with spectral clustering, I found your example
>> of *Segmenting the picture of Lena in regions *intuitive and wanted to
>> try it.
>>
>> However, scipy has removed the scipy.misc.lena() module from their
>> library, due to licensing issues.
>>
>> So, I request you to please update the code with any other image.
>>
>> Regards,
>> Rajkiran Veldur
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From prat2 at umbc.edu Mon Jul 30 16:12:29 2018
From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu)
Date: Mon, 30 Jul 2018 16:12:29 -0400
Subject: [scikit-learn] Dependency issues
Message-ID: 

Hi everyone,
I updated the version of SCIKIT_IMAGE used in circle build to 0.14.0.
This is the error that I got

    UnsatisfiableError: The following specifications were found to be in conflict:
      - numpy=1.8.2
      - scikit-image=0.14.0 -> scipy[version='>=0.17']
    Use "conda info " to see the dependencies for each package.

So I updated scipy version to 1.1.0 and I get this error now

    UnsatisfiableError: The following specifications were found to be in conflict:
      - pandas=0.13.1
      - scipy=1.1.0
    Use "conda info " to see the dependencies for each package.

Should I update the versions of everything ? Or am I doing something wrong ?

-- 
Regards,
Prathusha JS Naidu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com Tue Jul 31 01:00:14 2018
From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=)
Date: Tue, 31 Jul 2018 12:00:14 +0700
Subject: [scikit-learn] Dependency issues
In-Reply-To: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From shantanubhattacharya at yahoo.com Tue Jul 31 19:49:15 2018
From: shantanubhattacharya at yahoo.com (Shantanu Bhattacharya)
Date: Tue, 31 Jul 2018 23:49:15 +0000 (UTC)
Subject: [scikit-learn] Query about an algorithm
References: <246651338.109874.1533080955512.ref@mail.yahoo.com>
Message-ID: <246651338.109874.1533080955512@mail.yahoo.com>

Hello,

I am new to this mailing list. I would like to understand the algorithms
provided. Is second order gradient descent with hessian error matrix
supported by this library? I went through the documentation, but did not
find it. Are you able to confirm or direct me to some place that might
have it?

Look forward to your thoughts

Kind regards
Shantanu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: