From robert.kern at gmail.com Sun Jul 1 22:02:02 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 1 Jul 2018 19:02:02 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> Message-ID: On 6/19/18 15:19, Robert Kern wrote: > On 6/19/18 08:12, Andreas Mueller wrote: >> I don't think I have the bandwidth but I agree :-/ >> Not sure if any of the other core devs do. I can try to read it next week but >> that's probably too late? > > We're not on a deadline. If you're interested in reading the NEP and providing > feedback/consent, I'm happy to hold off on formally accepting the NEP until then. I just made a deadline. :-) I formally proposed acceptance of the NEP. In 7 days, if no one objects, it will be formally marked as Accepted. https://mail.python.org/pipermail/numpy-discussion/2018-July/078380.html -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From roy.pamphile at gmail.com Tue Jul 3 04:41:30 2018 From: roy.pamphile at gmail.com (Pamphile Roy) Date: Tue, 3 Jul 2018 10:41:30 +0200 Subject: [scikit-learn] Update or downgrade PCA Message-ID: Hi everyone, I have some code that allows to upgrade (or downgrade) a PCA with a new sample. The update part is handy when you are doing live observations for instance and you want a quick way to update your PCA without having to recompute the whole thing from scratch. Are you interested in this? (For me or someone else to integrate it.) Code is open-source (from my Batman project) and can be found here: https://gitlab.com/cerfacs/batman/blob/develop/batman/pod/pod.py Functions of interest are _upgrade and downgrade. Although, the code should be cleaned up, it works well and it got some unit tests. Of course the math is backed-up by some literature: [1] M. Brand: Fast low-rank modifications of the thin singular value decomposition. 2006. DOI:10.1016/j.laa.2005.07.021 [2] T. Braconnier: Towards an adaptive POD/SVD surrogate model for aeronautic design. Computers & Fluids. 2011. DOI:10.1016/j.compfluid.2010.09.002 Cheers, Pamphile @tupui -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Tue Jul 3 04:49:34 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Tue, 3 Jul 2018 10:49:34 +0200 Subject: [scikit-learn] Update or downgrade PCA In-Reply-To: References: Message-ID: Hi, how does it compare with: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA ? Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Tue Jul 3 04:51:47 2018 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 3 Jul 2018 10:51:47 +0200 Subject: [scikit-learn] Update or downgrade PCA In-Reply-To: References: Message-ID: <3df37e82-c9df-6ae8-e254-209e5f45edae@gmail.com> Hi Pamphile, On 03/07/18 10:41, Pamphile Roy wrote: > I have some code that allows to upgrade (or downgrade)?a PCA with a new > sample. > The update part is handy when you are doing live observations for > instance and you want a quick way to update your PCA without having to > recompute the whole thing from scratch. > [..] > [1] M. Brand: Fast low-rank modifications of the thin singular value decomposition. 
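(A minimal sketch, on synthetic data with an arbitrary number of components, of the
streaming-update use case described above using scikit-learn's built-in IncrementalPCA.
The Brand-style thin-SVD update in the Batman code is a different algorithm, so this is
only a point of reference, not a drop-in equivalent.)

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X_initial = rng.random_sample((100, 20))   # snapshots observed so far
X_new = rng.random_sample((10, 20))        # freshly acquired snapshots

ipca = IncrementalPCA(n_components=2)
ipca.fit(X_initial)                  # build the initial decomposition
ipca.partial_fit(X_new)              # fold the new observations into it

print(ipca.components_.shape)              # (2, 20)
print(ipca.explained_variance_ratio_)      # updated variance profile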
Do you know how this would compare with sklearn.decomposition.IncrementalPCA ? -- Roman From roy.pamphile at gmail.com Tue Jul 3 05:06:31 2018 From: roy.pamphile at gmail.com (Pamphile Roy) Date: Tue, 3 Jul 2018 11:06:31 +0200 Subject: [scikit-learn] Update or downgrade PCA Message-ID: I have no idea about the comparison with sklearn.decomposition.IncrementalPCA. Was not aware of this but from the code it seems to be a different approach. I will try to come with some numbers. Pamphile -------------- next part -------------- An HTML attachment was scrubbed... URL: From amirouche.boubekki at gmail.com Tue Jul 3 07:46:43 2018 From: amirouche.boubekki at gmail.com (Amirouche Boubekki) Date: Tue, 3 Jul 2018 13:46:43 +0200 Subject: [scikit-learn] Supervised prediction of multiple scores for a document In-Reply-To: References: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> Message-ID: I made a rendering of the result online https://sensimark.com/ Le dim. 3 juin 2018 ? 23:22, Sebastian Raschka a ?crit : > sorry, I had a copy & paste error, I meant "LogisticRegression(..., > multi_class='multinomial')" and not "LogisticRegression(..., > multi_class='ovr')" > > > On Jun 3, 2018, at 5:19 PM, Sebastian Raschka > wrote: > > > > Hi, > > > >> I quickly read about multinomal regression, is it something do you > recommend I use? Maybe you think about something else? > > > > Multinomial regression (or Softmax Regression) should give you results > somewhat similar to a linear SVC (or logistic regression with OvO or OvR). > The theoretical difference is that Softmax regression assumes that the > classes are mutually exclusive, which is probably not the case in your > setting since e.g., an article could be both "Art" and "Science" to some > extend or so. Here a quick summary of softmax regression if useful: > https://sebastianraschka.com/faq/docs/softmax_regression.html. In > scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr'). > > > > Howeever, spontaneously, I would say that Latent Dirichlet Allocation > could be a better choice in your case. I.e., fit the model on the corpus > for a specified number of topics (e.g., 10, but depends on your dataset, I > would experiment a bit here), look at the top words in each topic and then > assign a topic label to each topic. Then, for a given article, you can > assign e.g., the top X labeled topics. > > > > Best, > > Sebastian > > > > > > > > > >> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki < > amirouche.boubekki at gmail.com> wrote: > >> > >> H?llo, > >> > >> I started a natural language processing project a few weeks ago called > wikimark (the code is all in wikimark.py) > >> > >> Given a text it wants to return a dictionary scoring the input against > vital articles categories, e.g.: > >> > >> out = wikimark("""Peter Hintjens wrote about the relation between > technology and culture. Without using a scientifical tone of > state-of-the-art review of the anthroposcene antropology, he gives a fair > amount of food for thought. According to Hintjens, technology is doomed to > become cheap. As matter of fact, intelligence tools will become more and > more accessible which will trigger a revolution to rebalance forces in > society.""") > >> > >> for category, score in out: > >> print('{} ~ {}'.format(category, score)) > >> > >> The above program would output something like that: > >> > >> Art ~ 0.1 > >> Science ~ 0.5 > >> Society ~ 0.4 > >> > >> Except not everything went as planned. 
Mind the fact that in the above > example the total is equal to 1, but I could not achieve that at all. > >> > >> I am using gensim to compute vectors of paragraphs (doc2vev) and then > submit those vectors to svm.SVR in a one-vs-all strategy ie. a document is > scored 1 if it's in that subcategory and zero otherwise. At prediction > time, it goes though the same doc2vec pipeline. The computer will score > each paragraph against the SVR models of wikipedia vital article > subcategories and get a value between 0 and 1 for each paragraph. I compute > the sum and group by subcategory and then I have a score per category for > the input document > >> > >> It somewhat works. I made a web ui online you can find it at > https://sensimark.com where you can test it. You can directly access the > >> full api e.g. > https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1 > >> > >> The output JSON document is a list of category dictionary where the > prediction key is associated with the average of the "prediction" of the > subcategories. If you replace &all=1 by &top=5 you might get something else > as top categories e.g. > https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10 > >> > >> or > >> > >> > https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5 > >> > >> I wrote "prediction" with double quotes because the value you see, is > the result of some formula. Since, the predictions I get are rather small > between 0 and 0.015 I apply the following formula: > >> value = math.exp(prediction) > >> magic = ((value * 100) - 110) * 100 > >> > >> In order to have values to spread between -200 and 200. Maybe this is > the symptom that my model doesn't work at all. > >> > >> Still, the top 10 results are almost always near each other (try with > BBC articles on https://sensimark.com . It is only when a regression > model is disqualified with a score of 0 that the results are simple to > understand. Sadly, I don't have an example at hand to support that claim. > You have to believe me. > >> > >> I just figured looking at the machine learning map that my problem > might be classification problem, except I don't really want to know what is > the class of new documents, I want to how what are the different subjects > that are dealt in the document based on a hiearchical corpus; > >> I don't want to guess a hiearchy! I want to now how the document > content spread over the different categories or subcategories. > >> > >> I quickly read about multinomal regression, is it something do you > recommend I use? Maybe you think about something else? > >> > >> Also, it seems I should benchmark / evaluate my model against LDA. > >> > >> I am rather noob in terms of datascience and my math skills are not so > fresh. I more likely looking for ideas on what algorithm, fine tuning and > some practice of datascience I must follow that doesn't involve writing my > own algorithm. > >> > >> Thanks in advance! > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
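(To make the multinomial suggestion above concrete: with multi_class='multinomial',
predict_proba returns per-category scores that sum to 1, which is the property asked
about. A minimal sketch; the toy documents, labels and vectorizer below are placeholders,
not the sensimark pipeline.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap intelligence tools reshape society",
        "a painting exhibition opened downtown",
        "new results on protein folding were published"]
labels = ["Society", "Art", "Science"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")
clf.fit(X, labels)

new_doc = vec.transform(["technology is doomed to become cheap"])
proba = clf.predict_proba(new_doc)
print(dict(zip(clf.classes_, proba[0])))   # the three scores sum to 1

If mutually exclusive categories are too strong an assumption, the LatentDirichletAllocation
route mentioned above gives per-topic proportions that also sum to 1 for each document.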
URL: From roy.pamphile at gmail.com Tue Jul 3 08:39:46 2018 From: roy.pamphile at gmail.com (Pamphile Roy) Date: Tue, 3 Jul 2018 14:39:46 +0200 Subject: [scikit-learn] Update or downgrade PCA In-Reply-To: References: Message-ID: So yes there is a difference between the two depending on the size of the matrix. Following is an output from ipython: *With a matrix of shape (1000 * 500)* (batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09) Type 'copyright', 'credits' or 'license' for more information IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: %timeit pod._update(snapshot2.T) 491 ms ? 22.4 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [2]: %timeit ipca.partial_fit(snapshot2) 163 ms ? 1.6 ms per loop (mean ? std. dev. of 7 runs, 10 loops each) *With a matrix of shape (1000 * 2000)* (batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09) Type 'copyright', 'credits' or 'license' for more information IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: %timeit pod._update(snapshot2.T) 4.84 s ? 220 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [2]: %timeit ipca.partial_fit(snapshot2) 5.85 s ? 77.6 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [3]: Do you really want to exit ([y]/n)? *With a matrix of shape (1000 * 20 000)* (batman3) tupui at Batman:Desktop $ ipython -i sk_pod.py Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09) Type 'copyright', 'credits' or 'license' for more information IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: %timeit pod._update(snapshot2.T) 3.39 s ? 65.8 ms per loop (mean ? std. dev. of 7 runs, 1 loop each) In [2]: %timeit ipca.partial_fit(snapshot2) 33.1 s ? 17.7 s per loop (mean ? std. dev. of 7 runs, 1 loop each) Conclusion is that, the method seems faster to add one sample if the number of feature is superior to the number of samples. But if you want to add a bunch of sample, I found that sklearn seems a bit faster (38.75 s vs 34.51s to add 10 samples of shape 1000 * 20 000). It is to be noted that in this last case, adding a single or 10 samples is taking the same time ~30s. So depending on how much sample are to be added, this can help. Cheers, Pamphile P.S. Following is the code I used (requires batman available though conda-forge): import time import numpy as np from batman.pod import Pod from sklearn.decomposition import IncrementalPCA n_samples, n_features = 1000, 20000 snapshots = np.random.random_sample((n_samples, n_features)) snapshot2 = np.random.random_sample((1, n_features)) pod = Pod([np.zeros(n_features), np.ones(n_features)], None, np.inf, 1, 999) pod._decompose(snapshots.T) ipca = IncrementalPCA(999) ipca.fit(snapshots) np.allclose(ipca.singular_values_, pod.S) pod._update(snapshot2.T) ipca.partial_fit(snapshot2) np.allclose(ipca.singular_values_[:999], pod.S[:999]) snapshot3 = np.random.random_sample((10, n_features)) itime = time.time() [pod._update(snap.T[:, None]) for snap in snapshot3] print(time.time() - itime) itime = time.time() ipca.partial_fit(snapshot3) print(time.time() - itime) np.allclose(ipca.singular_values_[:999], pod.S[:999]) 2018-07-03 11:06 GMT+02:00 Pamphile Roy : > I have no idea about the comparison with sklearn.decomposition.Inc > rementalPCA. > Was not aware of this but from the code it seems to be a different > approach. 
> I will try to come with some numbers. > > Pamphile > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremie.du-boisberranger at inria.fr Tue Jul 3 09:23:35 2018 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Tue, 3 Jul 2018 15:23:35 +0200 Subject: [scikit-learn] Next sprint in Paris (july 16th and 17th) Message-ID: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> Hi everyone, On july 16th and 17th, there will be a scikit-learn sprint in Paris, in parallel with the one in Austin. There will be an official announce soon with the location and other informations. This is just an informal mail to ask if you have suggestions on topics/issues that you think we should look at during the sprint. Remember that it is a 2 days sprint, so we need things that can be handled in 2 days. Whether you intend to come or not, any suggestion is welcomed ! Best regards, Jeremie du Boisberranger From sdsr.sdsr at gmail.com Wed Jul 4 05:08:55 2018 From: sdsr.sdsr at gmail.com (=?UTF-8?Q?Sergio_Sol=C3=B3rzano?=) Date: Wed, 4 Jul 2018 11:08:55 +0200 Subject: [scikit-learn] Next sprint in Paris (july 16th and 17th) In-Reply-To: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> References: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> Message-ID: Hi everyone, Regarding the Python Sprint in Paris, I would like to know if it is possible to attend if one wants to contribute but has never done it before. In other words, is it "reserved" for experienced contributors/developers of sckikit-learn or newcomers can join as well? Best, Sergio On Tue, Jul 3, 2018 at 3:25 PM Jeremie du Boisberranger wrote: > > Hi everyone, > > On july 16th and 17th, there will be a scikit-learn sprint in Paris, in > parallel with the one in Austin. > > There will be an official announce soon with the location and other > informations. > > This is just an informal mail to ask if you have suggestions on > topics/issues that you think we should look at during the sprint. > Remember that it is a 2 days sprint, so we need things that can be > handled in 2 days. > > Whether you intend to come or not, any suggestion is welcomed ! > > Best regards, > > Jeremie du Boisberranger > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jeremie.du-boisberranger at inria.fr Wed Jul 4 08:31:23 2018 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Wed, 4 Jul 2018 14:31:23 +0200 Subject: [scikit-learn] Next sprint in Paris (july 16th and 17th) In-Reply-To: References: <81047951-6691-a303-6638-d55350c74cc5@inria.fr> Message-ID: <3554e67e-adb7-f7ae-c89d-3795c6d279a7@inria.fr> Hi Sergio, I'm sorry but this sprint is quite short and thus will be for experienced contributors (at least experienced with the scikit-learn contributing work flow). We'll probably organize less restrictive sprints in the future. Best regards, Jeremie On 04/07/2018 11:08, Sergio Sol?rzano wrote: > Hi everyone, > > Regarding the Python Sprint in Paris, > I would like to know if it is possible to attend if one wants to > contribute but has never done it before. In other words, > is it "reserved" for experienced contributors/developers of > sckikit-learn or newcomers can join as well? 
> > Best,
> Sergio
>
>
> On Tue, Jul 3, 2018 at 3:25 PM Jeremie du Boisberranger
> wrote:
>> Hi everyone,
>>
>> On july 16th and 17th, there will be a scikit-learn sprint in Paris, in
>> parallel with the one in Austin.
>>
>> There will be an official announce soon with the location and other
>> informations.
>>
>> This is just an informal mail to ask if you have suggestions on
>> topics/issues that you think we should look at during the sprint.
>> Remember that it is a 2 days sprint, so we need things that can be
>> handled in 2 days.
>>
>> Whether you intend to come or not, any suggestion is welcomed !
>>
>> Best regards,
>>
>> Jeremie du Boisberranger
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From marco.fronzi at gmail.com  Tue Jul 10 00:25:51 2018
From: marco.fronzi at gmail.com (Marco Fronzi)
Date: Tue, 10 Jul 2018 14:25:51 +1000
Subject: [scikit-learn] compiling issue
Message-ID:

Hi,

My name is Marco and I am trying to install scikit-learn on my mac (OS X
10.11.6). I have already installed python3, numpy (1.8.2) and scipy, however
when I run pip3 install scikit-learn I get several errors which are listed below.
Failed building wheel for scikit-learn

and also:

Command "/usr/local/opt/python/bin/python3.7 -u -c "import setuptools,
tokenize;__file__='/private/tmp/pip-install-4z67z8of/scikit-learn/setup.py';f=getattr(tokenize,
'open', open)(__file__);code=f.read().replace('\r\n',
'\n');f.close();exec(compile(code, __file__, 'exec'))" install --record
/private/tmp/pip-record-pbcsv1zz/install-record.txt
--single-version-externally-managed --compile" failed with error code 1 in
/private/tmp/pip-install-4z67z8of/scikit-learn/

I would appreciate any suggestion/hint to solve this issue and install the
package.

Thank you,

Marco

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Tue Jul 10 00:54:24 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 10 Jul 2018 14:54:24 +1000
Subject: [scikit-learn] compiling issue
In-Reply-To:
References:
Message-ID:

Homebrew has pushed a lot of users onto Python 3.7 arguably prematurely:
several packages weren't ready to support it.

A compatibility release, Scikit-learn 0.19.2, is basically ready to be
released, but it may take another couple of days. See
https://github.com/scikit-learn/scikit-learn/issues/11320

As noted there, you can also downgrade Python to 3.6 with:

brew info python3
brew switch python 3.6.5

On 10 July 2018 at 14:25, Marco Fronzi wrote:

> Hi,
>
> My name is Marco and I am trying to install scikit-learn on my mac (OS X
> 10.11.6). I have already installed python3, numpy (1.8.2) and scipy, however
> when I run pip3 install scikit-learn I get several errors which are listed below.
>
> Failed building wheel for scikit-learn
>
> and also:
>
> Command "/usr/local/opt/python/bin/python3.7 -u -c "import setuptools,
> tokenize;__file__='/private/tmp/pip-install-4z67z8of/scikit-
> learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n',
> '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record
> /private/tmp/pip-record-pbcsv1zz/install-record.txt
> --single-version-externally-managed --compile" failed with error code 1
> in /private/tmp/pip-install-4z67z8of/scikit-learn/
>
> I would appreciate any suggestion/hint to solve this issue and install the
> package.
>
>
> Thank you,
>
> Marco
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From morin070 at umn.edu  Thu Jul 12 13:34:07 2018
From: morin070 at umn.edu (August Morin)
Date: Thu, 12 Jul 2018 13:34:07 -0400
Subject: [scikit-learn] Finding Formula of Gaussian Process Classification
Message-ID:

Hi all,

I've been handed down some code that is based on the Classifier Comparison
done by Gaël Varoquaux and Andreas Müller. The dataset is best classified
by the Gaussian Process, from which I would like to be able to find a
formula that I can run other datasets through for an image filtering
project. Is there a way to export the formula directly from sklearn? Any
ideas are much appreciated.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com  Sun Jul 15 19:51:28 2018
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 16 Jul 2018 01:51:28 +0200
Subject: [scikit-learn] sample_weights in RandomForestRegressor
Message-ID:

Hello,

I am kind of confused about the use of the sample_weights parameter in the
fit() function of RandomForestRegressor.
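(For reference, the weights are passed per training sample at fit time and the learned
importances can be read afterwards from feature_importances_. A minimal sketch; the
random data and made-up weights below stand in for the real descriptors and
similarity-derived weights.)

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(709, 50)     # descriptors of the 709 training molecules
y_train = rng.rand(709)         # binding affinities
weights = rng.rand(709)         # e.g. max similarity to the blind-test set

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train, sample_weight=weights)   # weights enter the fit here

top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print(top10)    # indices of the most influential features under this weighting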
Here is my problem: I am trying to predict the binding affinity of small molecules to a protein. I have a training set of 709 molecules and a blind test set of 180 molecules. I want to find those features that are more important for the correct prediction of the binding affinity of those 180 molecules of my blind test set. My rationale is that if I give more emphasis to the similar molecules in the training set, then I will get higher importances for those features that have higher predictive ability for this specific blind test set of 180 molecules. To this end, I weighted the 709 training set molecules by their maximum similarity to the 180 molecules, selected only those features with high importance and trained a new RF with all 709 molecules. I got some results but I am not satisfied. Is this the right way to use sample_weights in RF. I would appreciate any advice or suggested work flow. -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Mon Jul 16 10:54:55 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Mon, 16 Jul 2018 23:54:55 +0900 Subject: [scikit-learn] sample_weights in RandomForestRegressor In-Reply-To: References: Message-ID: Dear Thomas, Your strategy for model development is built on the assumption that the SAR (structure-activity relationship) is a continuous manifold constructed for your compound descriptors. However, SARs for many proteins in drug discovery or chemical biology are not continuous (consider kinase inhibitors). Therefore, you must make an assessment of the training data SAR to check for the prevalence of activity cliffs. There are at least two ways you can go about this: (1) Simply compute all pairwise similarities by your choice of descriptor+metric, then identify where there are pairs (e.g., MACCS-Tanimoto > 0.7) with large activity differences (e.g., K_i or IC50 difference of more than 10/50/100-fold; again, the biology of your problem determines the right values). (2) Perform many repetitions of train-test splitting on the 709 reference molecules, look at the distribution of your evaluation metric, and see if there is a limit in your ability to predict. If you are hitting a wall in terms of predictability (metric performance), it's a likely sign there is an activity cliff, and no amount of machine learning is going to be able to overcome this. Further, trace the predictability of individual compounds to identify those which consistently are predicted wrong. If you combine this with analysis (1), you can know exactly which of your chemistries are unmodelable. If you find that there are no activity cliffs in your dataset, then your application of the assumption that chemical similarity implies biological endpoint similarity will hold, and your experimental design is validated because of the presence of a continuous manifold. However, if you do have activity cliffs, then as awesome as sklearn is, it still cannot make the computational chemistry any better. Hope this helps you contextualize your work. Don't hesitate to contact me if I can be of consultation. Sincerely, J.B. 
Brown Kyoto University Graduate School of Medicine 2018-07-16 8:51 GMT+09:00 Thomas Evangelidis : > ?? > Hello, > > I am kind of confused about the use of sample_weights parameter in the > fit() function of RandomForestRegressor. Here is my problem: > > I am trying to predict the binding affinity of small molecules to a > protein. I have a training set of 709 molecules and a blind test set of 180 > molecules. I want to find those features that are more important for the > correct prediction of the binding affinity of those 180 molecules of my > blind test set. My rationale is that if I give more emphasis to the > similar molecules in the training set, then I will get higher importances > for those features that have higher predictive ability for this specific > blind test set of 180 molecules. To this end, I weighted the 709 training > set molecules by their maximum similarity to the 180 molecules, selected > only those features with high importance and trained a new RF with all 709 > molecules. I got some results but I am not satisfied. Is this the right way > to use sample_weights in RF. I would appreciate any advice or suggested > work flow. > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishekb2209 at gmail.com Mon Jul 16 23:19:21 2018 From: abhishekb2209 at gmail.com (Abhishek Babuji) Date: Mon, 16 Jul 2018 23:19:21 -0400 Subject: [scikit-learn] Would love to contribute to this library that I fell in love with. I have a question! FIRST TIMER Message-ID: TO WHOM IT MAY CONCERN, I have just learned Python to a level that I can say I'm comfortable with it. I have also picked up and learned Git and GitHub, and so now I'm ready to make my contribution to this library. I'm really enthusiastic but since this is my first time, I'd like to know a few things! *Must I know the underlying implementation of something to contribute code to fix it?* Explanation: Let's say, someone, tags some issue as 'first timers' and 'easy', and you want to take a look at it, see and contribute code/fix the code. Should I know the implementation of what the fixed code is supposed to do? or will this be explained when the issue is brought up? I have gone over issues in your GitHub. but I don't think I've seen enough examples. I don't seem to find this in the contributor guide. If someone could help me understand the level of depth that I must know scikit-learn to be able to contribute, I would then begin working towards it! Because I have used it a lot in my Machine Learning projects, so I'm not sure where I stand. Example: "The shovel doesn't work! Fix it! It is supposed to be able to dig through mud" My dilemma: I found an immovable rock in the mud that the shovel is not being able to dig through.. so I'm stuck. Guess I shouldn't have volunteered to help. Just on a side note, to all scikit-learn's contributors, you're doing God's work. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From seralouk at hotmail.com Fri Jul 20 05:35:18 2018 From: seralouk at hotmail.com (serafim loukas) Date: Fri, 20 Jul 2018 09:35:18 +0000 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem Message-ID: Dear Scikit-learn community, I have a 3 class classification problem and I would like to plot the average ROC across Folds. There is an example in scikit-learn website but only for binary classification problems (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html). I want to do the same but in the case of 3 classes. I have tried to use `clf = OneVsRestClassifier(LinearDiscriminantAnalysis())`but I am having a hard time to make it work. Any help would be appreciated, Makis -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Jul 20 10:44:00 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 20 Jul 2018 10:44:00 -0400 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem In-Reply-To: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> References: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> Message-ID: <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> Please stay on the mailing list. There is no single roc curve for a 3 class problem. So what do you want to plot? On 07/20/2018 10:40 AM, serafim loukas wrote: > Hello Andy, > > > Thank you for your response. > > What I want to do is to plot the average(mean) ROC across Folds for a > 3-class case. > I have managed to do so for the binary case and I am trying to make it > work for the mutlti-class case but with no luck. > > There is an example in the documentation > (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html) > for the binary case. > I want to do the same ( plot the mean ROC and the confidence interval > for my 3-class problem). > > Here is also my SO question about this: > https://stackoverflow.com/questions/51442818/average-roc-curve-across-folds-for-multi-class-classification-case-in-sklearn?with > some code included. > > > > Best, > Makis > > > >> On 20 Jul 2018, at 16:34, Andreas Mueller > > wrote: >> >> Hi Makis. >> What do you mean by a roc curve for multi-class? >> You can have one curve per class using OVR or one curve per pair of >> classes. >> That doesn't need the OneVsRestClassifier, it's more a matter of >> evaluation. >> >> Cheers, >> Andy >> >> On 07/20/2018 05:35 AM, serafim loukas wrote: >>> Dear Scikit-learn community, >>> >>> >>> I have a 3 class classification problem and I would like to plot the >>> average ROC across Folds. >>> There is an example in scikit-learn website but only for binary >>> classification problems >>> (http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html). >>> >>> I want to do the same but in the case of 3 classes. I have tried to >>> use `clf = OneVsRestClassifier(LinearDiscriminantAnalysis())`but I >>> am having a hard time to make it work. >>> >>> >>> Any help would be?appreciated, >>> Makis >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Sat Jul 21 10:02:02 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) 
Date: Sat, 21 Jul 2018 23:02:02 +0900 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem In-Reply-To: <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> References: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> Message-ID: Hello Makis, 2018-07-20 23:44 GMT+09:00 Andreas Mueller : > There is no single roc curve for a 3 class problem. So what do you want to > plot? > > On 07/20/2018 10:40 AM, serafim loukas wrote: > > What I want to do is to plot the average(mean) ROC across Folds for a > 3-class case. > > The prototypical ROC curve uses True Positive Rate and False Positive Rate for its axes, so it is for 2-class problems, and not for 3+-class problems, as Andy mentioned. Perhaps you are wanting the mean and confidence intervals of the n-class Cohen Kappa metric as estimated by either many folds of cross validation, or you want to evaluate your classifier by repeated subsampling experiments and Kappa value distribution/histogram? Hope this helps, J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at hotmail.com Sat Jul 21 10:20:39 2018 From: seralouk at hotmail.com (serafim loukas) Date: Sat, 21 Jul 2018 14:20:39 +0000 Subject: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem In-Reply-To: References: <43CF60D7-EDFC-42AD-86AE-F5733D18FF6B@hotmail.com> <459e3dfc-14fb-1af6-02ff-6bccd093e66d@gmail.com> Message-ID: Hello J.B, I could simply create some ROC curves as shown in the scikit-learn documentation by selecting only 2 classes and then repeating by selecting other pair of classes (in total I have 3 classes so this would result in 3 different ROC figures). An alternative would be I would like to plot the mean and confidence intervals of the 3-class Cohen Kappa metric as estimated by KFolds (k=5) cross-validation. Any tips about this ? Cheers, Makis On 21 Jul 2018, at 16:02, Brown J.B. via scikit-learn > wrote: Hello Makis, 2018-07-20 23:44 GMT+09:00 Andreas Mueller >: There is no single roc curve for a 3 class problem. So what do you want to plot? On 07/20/2018 10:40 AM, serafim loukas wrote: What I want to do is to plot the average(mean) ROC across Folds for a 3-class case. The prototypical ROC curve uses True Positive Rate and False Positive Rate for its axes, so it is for 2-class problems, and not for 3+-class problems, as Andy mentioned. Perhaps you are wanting the mean and confidence intervals of the n-class Cohen Kappa metric as estimated by either many folds of cross validation, or you want to evaluate your classifier by repeated subsampling experiments and Kappa value distribution/histogram? Hope this helps, J.B. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
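(One way to get the fold-wise kappa values asked about above, sketched on a stand-in
dataset: iris replaces the real 3-class data, LinearDiscriminantAnalysis is the classifier
mentioned earlier in the thread, and the splitter choice is illustrative.)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

kappa = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                        scoring=make_scorer(cohen_kappa_score), cv=cv)

print(kappa.mean(), kappa.std())           # mean kappa and spread across folds
print(np.percentile(kappa, [2.5, 97.5]))   # a rough interval over the 5 folds

With only 5 folds the interval is coarse; repeating the cross-validation with different
random_state values gives a smoother distribution to summarize.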
URL: From benoit.presles at u-bourgogne.fr Tue Jul 24 07:07:22 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 13:07:22 +0200 Subject: [scikit-learn] RFE with logistic regression Message-ID: Dear scikit-learn users, I am using the recursive feature elimination (RFE) tool from sklearn to rank my features: from sklearn.linear_model import LogisticRegression classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) from sklearn.feature_selection import RFE rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) rfe.fit(X, y) ranking = rfe.ranking_ print(ranking) 1. The first problem I have is when I execute the above code multiple times, I don't get the same results. 2. When I change the solver to 'sag' or 'saga' (classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it seems that I get the same results at each run but the ranking is not the same between these two solvers. 3. With C=1, it seems I have the same results at each run for the solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't get the same results between the different solvers. Thanks for your help, Best regards, Ben From stuart at stuartreynolds.net Tue Jul 24 12:16:57 2018 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 24 Jul 2018 09:16:57 -0700 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: liblinear regularizes the intercept (which is a questionable thing to do and a poor choice of default in sklearn). The other solvers do not. On Tue, Jul 24, 2018 at 4:07 AM, Beno?t Presles wrote: > Dear scikit-learn users, > > I am using the recursive feature elimination (RFE) tool from sklearn to rank > my features: > > from sklearn.linear_model import LogisticRegression > classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) > from sklearn.feature_selection import RFE > rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) > rfe.fit(X, y) > ranking = rfe.ranking_ > print(ranking) > > 1. The first problem I have is when I execute the above code multiple times, > I don't get the same results. > > 2. When I change the solver to 'sag' or 'saga' (classifier_RFE = > LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it > seems that I get the same results at each run but the ranking is not the > same between these two solvers. > > 3. With C=1, it seems I have the same results at each run for the > solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't > get the same results between the different solvers. > > > Thanks for your help, > Best regards, > Ben > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at sebastianraschka.com Tue Jul 24 12:40:34 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Tue, 24 Jul 2018 11:40:34 -0500 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: Agreed. But then the setting is c=1e9 in this context (where C is the inverse regularization strength), so the regularization effect should be very small. Probably shouldn't matter much for convex optimization, but I would still try to a) set the random_state to some fixed value b) make sure that .n_iter_ < .max_iter to see if that results in more consistency. 
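(A minimal sketch of those two checks on synthetic data; the solver, C and data sizes
here are arbitrary illustrations, not a recommendation.)

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = LogisticRegression(C=1e9, max_iter=10000, solver="saga",
                         random_state=0)            # (a) fix the seed
rfe = RFE(estimator=clf, n_features_to_select=1, step=1).fit(X, y)

clf.fit(X, y)                 # RFE fits clones internally; fit clf itself to inspect it
print(clf.n_iter_, "<=", clf.max_iter)              # (b) confirm convergence
print(rfe.ranking_)                                 # should now be repeatable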
Best, Sebastian > On Jul 24, 2018, at 11:16 AM, Stuart Reynolds wrote: > > liblinear regularizes the intercept (which is a questionable thing to > do and a poor choice of default in sklearn). > The other solvers do not. > > On Tue, Jul 24, 2018 at 4:07 AM, Beno?t Presles > wrote: >> Dear scikit-learn users, >> >> I am using the recursive feature elimination (RFE) tool from sklearn to rank >> my features: >> >> from sklearn.linear_model import LogisticRegression >> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) >> from sklearn.feature_selection import RFE >> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) >> rfe.fit(X, y) >> ranking = rfe.ranking_ >> print(ranking) >> >> 1. The first problem I have is when I execute the above code multiple times, >> I don't get the same results. >> >> 2. When I change the solver to 'sag' or 'saga' (classifier_RFE = >> LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it >> seems that I get the same results at each run but the ranking is not the >> same between these two solvers. >> >> 3. With C=1, it seems I have the same results at each run for the >> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't >> get the same results between the different solvers. >> >> >> Thanks for your help, >> Best regards, >> Ben >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From benoit.presles at u-bourgogne.fr Tue Jul 24 14:07:02 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 20:07:02 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: I did the same tests as before adding fit_intercept=False and: 1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time. 2. When I change the solver to 'sag' (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, solver='sag')), it seems that I get the same ranking at each run. This is not the case with the 'saga' solver. The ranking is not the same between the solvers. 3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers. How can I get reproducible and consistent results? Thanks for your help, Best regards, Ben Le 24/07/2018 ? 18:16, Stuart Reynolds a ?crit?: > liblinear regularizes the intercept (which is a questionable thing to > do and a poor choice of default in sklearn). > The other solvers do not. > > On Tue, Jul 24, 2018 at 4:07 AM, Beno?t Presles > wrote: >> Dear scikit-learn users, >> >> I am using the recursive feature elimination (RFE) tool from sklearn to rank >> my features: >> >> from sklearn.linear_model import LogisticRegression >> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000) >> from sklearn.feature_selection import RFE >> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1) >> rfe.fit(X, y) >> ranking = rfe.ranking_ >> print(ranking) >> >> 1. The first problem I have is when I execute the above code multiple times, >> I don't get the same results. >> >> 2. 
When I change the solver to 'sag' or 'saga' (classifier_RFE = >> LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it >> seems that I get the same results at each run but the ranking is not the >> same between these two solvers. >> >> 3. With C=1, it seems I have the same results at each run for the >> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't >> get the same results between the different solvers. >> >> >> Thanks for your help, >> Best regards, >> Ben >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Tue Jul 24 14:33:24 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 24 Jul 2018 14:33:24 -0400 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: Message-ID: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> On 07/24/2018 02:07 PM, Beno?t Presles wrote: > I did the same tests as before adding fit_intercept=False and: > > 1. I have got the same problem as before, i.e. when I execute the RFE > multiple times I don't get the same ranking each time. > > 2. When I change the solver to 'sag' > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, > fit_intercept=False, solver='sag')), it seems that I get the same > ranking at each run. This is not the case with the 'saga' solver. > The ranking is not the same between the solvers. > > 3. With C=1, it seems that I have the same results at each run for all > solvers (liblinear, sag and saga), however the ranking is not the same > between the solvers. > > > How can I get reproducible and consistent results? > Did you scale your data? If not, saga and sag will basically fail. From benoit.presles at u-bourgogne.fr Tue Jul 24 14:43:27 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 20:43:27 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> Message-ID: I did the same tests as before adding random_state=0 and: 1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time. 2. When I change the solver to 'sag' or 'saga' (LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, random_state=0, solver='sag')), it seems that I get the same results at each run but the ranking is not the same between these two solvers. 3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers. Thanks for your help, Ben PS1: I checked and n_iter_ seems to be always lower than max_iter. PS2: my data is scaled, I am using "StandardScaler". Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > > > On 07/24/2018 02:07 PM, Beno?t Presles wrote: >> I did the same tests as before adding fit_intercept=False and: >> >> 1. I have got the same problem as before, i.e. when I execute the RFE >> multiple times I don't get the same ranking each time. >> >> 2. 
When I change the solver to 'sag' or 'saga'
(LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False,
random_state=0, solver='sag')), it seems that I get the same results at each
run but the ranking is not the same between these two solvers.

3. With C=1, it seems that I have the same results at each run for all
solvers (liblinear, sag and saga), however the ranking is not the same
between the solvers.

Thanks for your help,
Ben


PS1: I checked and n_iter_ seems to be always lower than max_iter.
PS2: my data is scaled, I am using "StandardScaler".



Le 24/07/2018 à 20:33, Andreas Mueller a écrit :
>
>
> On 07/24/2018 02:07 PM, Benoît Presles wrote:
>> I did the same tests as before adding fit_intercept=False and:
>>
>> 1. I have got the same problem as before, i.e. when I execute the RFE
>> multiple times I don't get the same ranking each time.
>>
>> 2. When I change the solver to 'sag'
>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000,
>> fit_intercept=False, solver='sag')), it seems that I get the same
>> ranking at each run. This is not the case with the 'saga' solver.
>> The ranking is not the same between the solvers.
>>
>> 3. With C=1, it seems that I have the same results at each run for
>> all solvers (liblinear, sag and saga), however the ranking is not the
>> same between the solvers.
>>
>>
>> How can I get reproducible and consistent results?
>>
> Did you scale your data? If not, saga and sag will basically fail.
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From mail at sebastianraschka.com  Tue Jul 24 14:26:26 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Tue, 24 Jul 2018 13:26:26 -0500
Subject: [scikit-learn] RFE with logistic regression
In-Reply-To:
References:
Message-ID: <29EE1F14-4D1A-435D-93A1-5FC890F99447@sebastianraschka.com>

In addition to checking .n_iter_ and fixing the random seed as I suggested,
maybe also try normalizing the features (e.g., z-scores via the
StandardScaler) to see if that stabilizes the training.

Sent from my iPhone

> On Jul 24, 2018, at 1:07 PM, Benoît Presles wrote:
>
> I did the same tests as before adding fit_intercept=False and:
>
> 1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time.
>
> 2. When I change the solver to 'sag' (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, solver='sag')), it seems that I get the same ranking at each run. This is not the case with the 'saga' solver.
> The ranking is not the same between the solvers.
>
> 3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers.
>
>
> How can I get reproducible and consistent results?
>
>
> Thanks for your help,
> Best regards,
> Ben
>
>
>
>> Le 24/07/2018 à 18:16, Stuart Reynolds a écrit :
>> liblinear regularizes the intercept (which is a questionable thing to
>> do and a poor choice of default in sklearn).
>> The other solvers do not.
>>
>> On Tue, Jul 24, 2018 at 4:07 AM, Benoît Presles
>> wrote:
>>> Dear scikit-learn users,
>>>
>>> I am using the recursive feature elimination (RFE) tool from sklearn to rank
>>> my features:
>>>
>>> from sklearn.linear_model import LogisticRegression
>>> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000)
>>> from sklearn.feature_selection import RFE
>>> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1)
>>> rfe.fit(X, y)
>>> ranking = rfe.ranking_
>>> print(ranking)
>>>
>>> 1. The first problem I have is when I execute the above code multiple times,
>>> I don't get the same results.
>>>
>>> 2. When I change the solver to 'sag' or 'saga' (classifier_RFE =
>>> LogisticRegression(C=1e9, verbose=1, max_iter=10000), solver='sag'), it
>>> seems that I get the same results at each run but the ranking is not the
>>> same between these two solvers.
>>>
>>> 3. With C=1, it seems I have the same results at each run for the
>>> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't
>>> get the same results between the different solvers.
>>> >>> >>> Thanks for your help, >>> Best regards, >>> Ben >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Tue Jul 24 15:34:31 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 24 Jul 2018 21:34:31 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> Message-ID: <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: > 3. With C=1, it seems that I have the same results at each run for all > solvers (liblinear, sag and saga), however the ranking is not the same > between the solvers. Your problem is probably ill-conditioned, hence the specific weights on the features are not stable. There isn't a good answer to ordering features, they are degenerate. In general, I would avoid RFE, it is a hack, and can easily lead to these problems. Ga?l > Thanks for your help, > Ben > PS1: I checked and n_iter_ seems to be always lower than max_iter. > PS2: my data is scaled, I am using "StandardScaler". > Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > > On 07/24/2018 02:07 PM, Beno?t Presles wrote: > > > I did the same tests as before adding fit_intercept=False and: > > > 1. I have got the same problem as before, i.e. when I execute the > > > RFE multiple times I don't get the same ranking each time. > > > 2. When I change the solver to 'sag' > > > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, > > > fit_intercept=False, solver='sag')), it seems that I get the same > > > ranking at each run. This is not the case with the 'saga' solver. > > > The ranking is not the same between the solvers. > > > 3. With C=1, it seems that I have the same results at each run for > > > all solvers (liblinear, sag and saga), however the ranking is not > > > the same between the solvers. > > > How can I get reproducible and consistent results? > > Did you scale your data? If not, saga and sag will basically fail. 
> > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From benoit.presles at u-bourgogne.fr Tue Jul 24 17:33:30 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 24 Jul 2018 23:33:30 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> Message-ID: <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> So you think that I cannot get reproducible and consistent results with this method ? If you would avoid RFE, which method do you suggest to find the best features ? Ben Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: > On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: >> 3. With C=1, it seems that I have the same results at each run for all >> solvers (liblinear, sag and saga), however the ranking is not the same >> between the solvers. > Your problem is probably ill-conditioned, hence the specific weights on > the features are not stable. There isn't a good answer to ordering > features, they are degenerate. > > In general, I would avoid RFE, it is a hack, and can easily lead to these > problems. > > Ga?l > >> Thanks for your help, >> Ben > >> PS1: I checked and n_iter_ seems to be always lower than max_iter. >> PS2: my data is scaled, I am using "StandardScaler". > > >> Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > >>> On 07/24/2018 02:07 PM, Beno?t Presles wrote: >>>> I did the same tests as before adding fit_intercept=False and: >>>> 1. I have got the same problem as before, i.e. when I execute the >>>> RFE multiple times I don't get the same ranking each time. >>>> 2. When I change the solver to 'sag' >>>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, >>>> fit_intercept=False, solver='sag')), it seems that I get the same >>>> ranking at each run. This is not the case with the 'saga' solver. >>>> The ranking is not the same between the solvers. >>>> 3. With C=1, it seems that I have the same results at each run for >>>> all solvers (liblinear, sag and saga), however the ranking is not >>>> the same between the solvers. > >>>> How can I get reproducible and consistent results? >>> Did you scale your data? If not, saga and sag will basically fail. 
>>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn From bertrand.thirion at inria.fr Tue Jul 24 17:44:58 2018 From: bertrand.thirion at inria.fr (bthirion) Date: Tue, 24 Jul 2018 23:44:58 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> Message-ID: <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> Univariate screening is somewhat hackish too, but much more stable -- and cheap. Best, Bertrand On 24/07/2018 23:33, Beno?t Presles wrote: > So you think that I cannot get reproducible and consistent results > with this method ? > If you would avoid RFE, which method do you suggest to find the best > features ? > > Ben > > > Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: >> On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: >>> 3. With C=1, it seems that I have the same results at each run for all >>> solvers (liblinear, sag and saga), however the ranking is not the same >>> between the solvers. >> Your problem is probably ill-conditioned, hence the specific weights on >> the features are not stable. There isn't a good answer to ordering >> features, they are degenerate. >> >> In general, I would avoid RFE, it is a hack, and can easily lead to >> these >> problems. >> >> Ga?l >> >>> Thanks for your help, >>> Ben >> >>> PS1: I checked and n_iter_ seems to be always lower than max_iter. >>> PS2: my data is scaled, I am using "StandardScaler". >> >> >>> Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: >> >>>> On 07/24/2018 02:07 PM, Beno?t Presles wrote: >>>>> I did the same tests as before adding fit_intercept=False and: >>>>> 1. I have got the same problem as before, i.e. when I execute the >>>>> RFE multiple times I don't get the same ranking each time. >>>>> 2. When I change the solver to 'sag' >>>>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, >>>>> fit_intercept=False, solver='sag')), it seems that I get the same >>>>> ranking at each run. This is not the case with the 'saga' solver. >>>>> The ranking is not the same between the solvers. >>>>> 3. With C=1, it seems that I have the same results at each run for >>>>> all solvers (liblinear, sag and saga), however the ranking is not >>>>> the same between the solvers. >> >>>>> How can I get reproducible and consistent results? >>>> Did you scale your data? If not, saga and sag will basically fail. 
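As a concrete illustration of the univariate screening suggested above, a
minimal sketch (synthetic data, and k=3 is an arbitrary choice here): each
feature is scored on its own with an ANOVA F-test and the k best are kept,
which does not depend on a fitted estimator's weights.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Hypothetical stand-in for the real data.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    screen = SelectKBest(score_func=f_classif, k=3)
    X_reduced = screen.fit_transform(X, y)

    print(screen.scores_)                    # one F-score per feature
    print(screen.get_support(indices=True))  # indices of the retained features
    print(X_reduced.shape)                   # (200, 3)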
>>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From prat2 at umbc.edu Tue Jul 24 20:33:31 2018 From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu) Date: Tue, 24 Jul 2018 20:33:31 -0400 Subject: [scikit-learn] Help with Pull Request( Checks failing) Message-ID: Hi everyone, I submitted my first PR few hours back and I see that two tests failed. Would really appreciate if anyone can help me with how to fix these/ what I am doing wrong. Thank you ! -------------- next part -------------- An HTML attachment was scrubbed... URL: From prat2 at umbc.edu Tue Jul 24 20:34:08 2018 From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu) Date: Tue, 24 Jul 2018 20:34:08 -0400 Subject: [scikit-learn] Help with Pull Request( Checks failing) In-Reply-To: References: Message-ID: This is the link to the PR - https://github.com/scikit-learn/scikit-learn/pull/11670 On Tue, Jul 24, 2018 at 8:33 PM, Prathusha Jonnagaddla Subramanyam Naidu < prat2 at umbc.edu> wrote: > Hi everyone, > I submitted my first PR few hours back and I see that two tests > failed. Would really appreciate if anyone can help me with how to fix > these/ what I am doing wrong. > > Thank you ! > -- Regards, Prathusha JS Naidu Graduate Student Department of CSEE UMBC -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Tue Jul 24 21:06:07 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Tue, 24 Jul 2018 20:06:07 -0500 Subject: [scikit-learn] Help with Pull Request( Checks failing) In-Reply-To: References: Message-ID: I am not a core dev, but I think I can see what's wrong there (mostly Flake8 issues). Let me comment about that over there. > On Jul 24, 2018, at 7:34 PM, Prathusha Jonnagaddla Subramanyam Naidu wrote: > > This is the link to the PR - https://github.com/scikit-learn/scikit-learn/pull/11670 > > On Tue, Jul 24, 2018 at 8:33 PM, Prathusha Jonnagaddla Subramanyam Naidu wrote: > Hi everyone, > I submitted my first PR few hours back and I see that two tests failed. Would really appreciate if anyone can help me with how to fix these/ what I am doing wrong. > > Thank you ! > > > > -- > Regards, > Prathusha JS Naidu > Graduate Student > Department of CSEE > UMBC > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Wed Jul 25 00:29:22 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 25 Jul 2018 14:29:22 +1000 Subject: [scikit-learn] Would love to contribute to this library that I fell in love with. I have a question! FIRST TIMER In-Reply-To: References: Message-ID: Hi Abishek, In case you can't tell from the response, this is not a straightforward question to answer. I hope you have looked at our contributor guidelines: http://scikit-learn.org/dev/developers/contributing.html. We encourage contributors to start with changes that focus on things like documentation, or that involve simple changes to the code. 
In any case, we can try to help you navigate the code or the process of fixing a specific issue. Some issues require a deeper understanding of the implementation than others, and contributors should advance to those over time. We look forward to your contributions. Joel On 17 July 2018 at 13:19, Abhishek Babuji wrote: > TO WHOM IT MAY CONCERN, > > I have just learned Python to a level that I can say I'm comfortable with > it. I have also picked up and learned Git and GitHub, and so now I'm ready > to make my contribution to this library. > > I'm really enthusiastic but since this is my first time, I'd like to know > a few things! > > *Must I know the underlying implementation of something to contribute code > to fix it?* > > Explanation: Let's say, someone, tags some issue as 'first timers' and > 'easy', and you want to take a look at it, see and contribute code/fix the > code. > > Should I know the implementation of what the fixed code is supposed to do? > or will this be explained when the issue is brought up? I have gone over > issues in your GitHub. but I don't think I've seen enough examples. I don't > seem to find this in the contributor guide. > > If someone could help me understand the level of depth that I must know > scikit-learn to be able to contribute, I would then begin working towards > it! Because I have used it a lot in my Machine Learning projects, so I'm > not sure where I stand. > > Example: "The shovel doesn't work! Fix it! It is supposed to be able to > dig through mud" > My dilemma: I found an immovable rock in the mud that the shovel is not > being able to dig through.. so I'm stuck. Guess I shouldn't have > volunteered to help. > > Just on a side note, to all scikit-learn's contributors, you're doing > God's work. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Wed Jul 25 06:36:55 2018 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Wed, 25 Jul 2018 12:36:55 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> Message-ID: <1c2d3fb1-6c5a-8070-2412-1648406dd047@u-bourgogne.fr> Do you think the problems I have can come from correlated features? Indeed, in my dataset I have some highly correlated features. Do you think this could explain why I don't get reproducible and consistent results? Thanks for your help, Ben Le 24/07/2018 ? 23:44, bthirion a ?crit?: > Univariate screening is somewhat hackish too, but much more stable -- > and cheap. > Best, > > Bertrand > > On 24/07/2018 23:33, Beno?t Presles wrote: >> So you think that I cannot get reproducible and consistent results >> with this method ? >> If you would avoid RFE, which method do you suggest to find the best >> features ? >> >> Ben >> >> >> Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: >>> On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: >>>> 3. With C=1, it seems that I have the same results at each run for all >>>> solvers (liblinear, sag and saga), however the ranking is not the same >>>> between the solvers. 
>>> Your problem is probably ill-conditioned, hence the specific weights on >>> the features are not stable. There isn't a good answer to ordering >>> features, they are degenerate. >>> >>> In general, I would avoid RFE, it is a hack, and can easily lead to >>> these >>> problems. >>> >>> Ga?l >>> >>>> Thanks for your help, >>>> Ben >>> >>>> PS1: I checked and n_iter_ seems to be always lower than max_iter. >>>> PS2: my data is scaled, I am using "StandardScaler". >>> >>> >>>> Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: >>> >>>>> On 07/24/2018 02:07 PM, Beno?t Presles wrote: >>>>>> I did the same tests as before adding fit_intercept=False and: >>>>>> 1. I have got the same problem as before, i.e. when I execute the >>>>>> RFE multiple times I don't get the same ranking each time. >>>>>> 2. When I change the solver to 'sag' >>>>>> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, >>>>>> fit_intercept=False, solver='sag')), it seems that I get the same >>>>>> ranking at each run. This is not the case with the 'saga' solver. >>>>>> The ranking is not the same between the solvers. >>>>>> 3. With C=1, it seems that I have the same results at each run for >>>>>> all solvers (liblinear, sag and saga), however the ranking is not >>>>>> the same between the solvers. >>> >>>>>> How can I get reproducible and consistent results? >>>>> Did you scale your data? If not, saga and sag will basically fail. >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Wed Jul 25 07:50:04 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 25 Jul 2018 13:50:04 +0200 Subject: [scikit-learn] RFE with logistic regression In-Reply-To: <1c2d3fb1-6c5a-8070-2412-1648406dd047@u-bourgogne.fr> References: <40c1ae87-6f5e-14b2-86e8-4458f2f17753@gmail.com> <20180724193431.tng3k5tgijqxnhkf@phare.normalesup.org> <57006ee7-a454-5934-90d8-5dc82663ef2f@u-bourgogne.fr> <9ebbda6d-05dd-1282-2e3b-148c74bce0cb@inria.fr> <1c2d3fb1-6c5a-8070-2412-1648406dd047@u-bourgogne.fr> Message-ID: <20180725115004.aaqhqbi65mbifr2r@phare.normalesup.org> On Wed, Jul 25, 2018 at 12:36:55PM +0200, Beno?t Presles wrote: > Do you think the problems I have can come from correlated features? Indeed, > in my dataset I have some highly correlated features. Yes, in general selecting features conditionally on others is very hard when features are highly correlated. > Do you think this could explain why I don't get reproducible and consistent > results? Yes. > Thanks for your help, > Ben > Le 24/07/2018 ? 23:44, bthirion a ?crit?: > > Univariate screening is somewhat hackish too, but much more stable -- > > and cheap. > > Best, > > Bertrand > > On 24/07/2018 23:33, Beno?t Presles wrote: > > > So you think that I cannot get reproducible and consistent results > > > with this method ? 
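A cheap way to check whether a design matrix is in that highly correlated,
ill-conditioned regime, sketched here on synthetic data (the real,
standardized X would go in its place):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler

    # Hypothetical stand-in for the real design matrix.
    X, _ = make_classification(n_samples=200, n_features=10, n_informative=3,
                               n_redundant=6, random_state=0)
    X = StandardScaler().fit_transform(X)

    corr = np.corrcoef(X, rowvar=False)
    off_diag = np.abs(corr - np.diag(np.diag(corr)))
    print(off_diag.max())     # close to 1 -> some features are nearly collinear
    print(np.linalg.cond(X))  # very large -> ill-conditioned, weights unstable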
> > > If you would avoid RFE, which method do you suggest to find the best > > > features ? > > > Ben > > > Le 24/07/2018 ? 21:34, Gael Varoquaux a ?crit?: > > > > On Tue, Jul 24, 2018 at 08:43:27PM +0200, Beno?t Presles wrote: > > > > > 3. With C=1, it seems that I have the same results at each run for all > > > > > solvers (liblinear, sag and saga), however the ranking is not the same > > > > > between the solvers. > > > > Your problem is probably ill-conditioned, hence the specific weights on > > > > the features are not stable. There isn't a good answer to ordering > > > > features, they are degenerate. > > > > In general, I would avoid RFE, it is a hack, and can easily lead > > > > to these > > > > problems. > > > > Ga?l > > > > > Thanks for your help, > > > > > Ben > > > > > PS1: I checked and n_iter_ seems to be always lower than max_iter. > > > > > PS2: my data is scaled, I am using "StandardScaler". > > > > > Le 24/07/2018 ? 20:33, Andreas Mueller a ?crit?: > > > > > > On 07/24/2018 02:07 PM, Beno?t Presles wrote: > > > > > > > I did the same tests as before adding fit_intercept=False and: > > > > > > > 1. I have got the same problem as before, i.e. when I execute the > > > > > > > RFE multiple times I don't get the same ranking each time. > > > > > > > 2. When I change the solver to 'sag' > > > > > > > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=10000, > > > > > > > fit_intercept=False, solver='sag')), it seems that I get the same > > > > > > > ranking at each run. This is not the case with the 'saga' solver. > > > > > > > The ranking is not the same between the solvers. > > > > > > > 3. With C=1, it seems that I have the same results at each run for > > > > > > > all solvers (liblinear, sag and saga), however the ranking is not > > > > > > > the same between the solvers. > > > > > > > How can I get reproducible and consistent results? > > > > > > Did you scale your data? If not, saga and sag will basically fail. > > > > > > _______________________________________________ > > > > > > scikit-learn mailing list > > > > > > scikit-learn at python.org > > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > > > > scikit-learn mailing list > > > > > scikit-learn at python.org > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From drraph at gmail.com Thu Jul 26 01:05:21 2018 From: drraph at gmail.com (Raphael C) Date: Thu, 26 Jul 2018 06:05:21 +0100 Subject: [scikit-learn] What is the FeatureAgglomeration algorithm? Message-ID: Hi, I am trying to work out what, in precise mathematical terms, [FeatureAgglomeration][1] does and would love some help. 
Here is some example code:

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    for S in ['ward', 'average', 'complete']:
        FA = FeatureAgglomeration(linkage=S)
        print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]])))

This outputs:

    [[  6.33333333 -50.        ]
     [  2.           0.        ]]
    [[  6.33333333 -50.        ]
     [  2.           0.        ]]
    [[  6.33333333 -50.        ]
     [  2.           0.        ]]

Is it possible to say mathematically how these values have been computed?

Also, what exactly does linkage do and why doesn't it seem to make any
difference which option you choose?

Raphael

  [1]: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.FeatureAgglomeration.html

PS I also asked at
https://stackoverflow.com/questions/51526616/what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Thu Jul 26 01:19:45 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 26 Jul 2018 07:19:45 +0200
Subject: [scikit-learn] What is the FeatureAgglomeration algorithm?
In-Reply-To: 
References: 
Message-ID: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org>

FeatureAgglomeration uses the Ward, complete linkage, or average linkage,
algorithms, depending on the choice of "linkage". These are well
documented in the literature, or on wikipedia.

Gaël

On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote:
> Hi,

> I am trying to work out what, in precise mathematical terms,
> [FeatureAgglomeration][1] does and would love some help. Here is some example
> code:

>     import numpy as np
>     from sklearn.cluster import FeatureAgglomeration
>     for S in ['ward', 'average', 'complete']:
>         FA = FeatureAgglomeration(linkage=S)
>         print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]])))

> This outputs:

>     [[  6.33333333 -50.        ]
>      [  2.           0.        ]]
>     [[  6.33333333 -50.        ]
>      [  2.           0.        ]]
>     [[  6.33333333 -50.        ]
>      [  2.           0.        ]]

> Is it possible to say mathematically how these values have been computed?

> Also, what exactly does linkage do and why doesn't it seem to make any
> difference which option you choose?

> Raphael

>   [1]: http://scikit-learn.org/stable/modules/generated/
> sklearn.cluster.FeatureAgglomeration.html

> PS I also asked at
> https://stackoverflow.com/questions/51526616/
> what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make

> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux

From drraph at gmail.com Thu Jul 26 01:25:44 2018
From: drraph at gmail.com (Raphael C)
Date: Thu, 26 Jul 2018 06:25:44 +0100
Subject: [scikit-learn] What is the FeatureAgglomeration algorithm?
In-Reply-To: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org>
References: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org>
Message-ID: 

Is it expected that all three linkages options should give the same result
in my toy example?

Raphael

On Thu, 26 Jul 2018 at 06:20 Gael Varoquaux
wrote:

> FeatureAgglomeration uses the Ward, complete linkage, or average linkage,
> algorithms, depending on the choice of "linkage".
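On the question of how the transformed values are computed, a short sketch
based on the estimator's documented defaults (worth double-checking against
the reference for your version): the features, i.e. the columns, are
clustered hierarchically, and transform then pools each cluster of features
with pooling_func, which defaults to np.mean. With the default n_clusters=2
the first feature ends up alone in one cluster and the last three in the
other, so the two output columns are just the per-cluster means:

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration

    X = np.array([[-50., 6., 6., 7.],
                  [  0., 1., 2., 3.]])

    # Defaults: n_clusters=2, linkage='ward', pooling_func=np.mean.
    fa = FeatureAgglomeration(n_clusters=2)
    Xt = fa.fit_transform(X)

    print(fa.labels_)             # cluster of each original feature, e.g. [1 0 0 0]
    print(Xt)                     # [[  6.33333333 -50.], [  2.  0.]]
    print(X[:, 1:].mean(axis=1))  # [ 6.33333333  2.] -> the pooled (mean) cluster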
These are well > documented in the literature, or on wikipedia. > > Ga?l > > On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote: > > Hi, > > > I am trying to work out what, in precise mathematical terms, > > [FeatureAgglomeration][1] does and would love some help. Here is some > example > > code: > > > > import numpy as np > > from sklearn.cluster import FeatureAgglomeration > > for S in ['ward', 'average', 'complete']: > > FA = FeatureAgglomeration(linkage=S) > > print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]]))) > > > This outputs: > > > > > > [[ 6.33333333 -50. ] > > [ 2. 0. ]] > > [[ 6.33333333 -50. ] > > [ 2. 0. ]] > > [[ 6.33333333 -50. ] > > [ 2. 0. ]] > > > Is it possible to say mathematically how these values have been computed? > > > Also, what exactly does linkage do and why doesn't it seem to make any > > difference which option you choose? > > > Raphael > > > > [1]: http://scikit-learn.org/stable/modules/generated/ > > sklearn.cluster.FeatureAgglomeration.html > > > PS I also asked at > > https://stackoverflow.com/questions/51526616/ > > > what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Thu Jul 26 01:38:54 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 26 Jul 2018 07:38:54 +0200 Subject: [scikit-learn] What is the FeatureAgglomeration algorithm? In-Reply-To: References: <20180726051945.vm2bg6ar63kdqzcx@phare.normalesup.org> Message-ID: <05e6d50a-fcf9-436a-b3aa-35d9f97eb195@normalesup.org> No. ?Sent from my phone. Please forgive typos and briefness.? On Jul 26, 2018, 07:28, at 07:28, Raphael C wrote: >Is it expected that all three linkages options should give the same >result >in my toy example? > >Raphael > >On Thu, 26 Jul 2018 at 06:20 Gael Varoquaux > >wrote: > >> FeatureAgglomeration uses the Ward, complete linkage, or average >linkage, >> algorithms, depending on the choice of "linkage". These are well >> documented in the literature, or on wikipedia. >> >> Ga?l >> >> On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote: >> > Hi, >> >> > I am trying to work out what, in precise mathematical terms, >> > [FeatureAgglomeration][1] does and would love some help. Here is >some >> example >> > code: >> >> >> > import numpy as np >> > from sklearn.cluster import FeatureAgglomeration >> > for S in ['ward', 'average', 'complete']: >> > FA = FeatureAgglomeration(linkage=S) >> > print(FA.fit_transform(np.array([[-50,6,6,7,], >[0,1,2,3]]))) >> >> > This outputs: >> >> > >> >> > [[ 6.33333333 -50. ] >> > [ 2. 0. ]] >> > [[ 6.33333333 -50. ] >> > [ 2. 0. ]] >> > [[ 6.33333333 -50. ] >> > [ 2. 0. ]] >> >> > Is it possible to say mathematically how these values have been >computed? >> >> > Also, what exactly does linkage do and why doesn't it seem to make >any >> > difference which option you choose? 
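In the two-sample toy example the geometry is degenerate, one far-away
feature against three nearly identical ones, so every linkage choice ends up
cutting the tree into the same two clusters. On a less trivial matrix the
linkage generally does change the grouping; a small sketch with arbitrary
random data (the exact labels will vary):

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration

    rng = np.random.RandomState(0)
    X = rng.rand(30, 8)  # 30 samples, 8 features, no special structure

    for linkage in ['ward', 'average', 'complete']:
        fa = FeatureAgglomeration(n_clusters=3, linkage=linkage)
        fa.fit(X)
        # labels_ gives the cluster each original feature is merged into;
        # on data like this the three linkages typically disagree.
        print(linkage, fa.labels_)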
>> >> > Raphael >> >> >> > [1]: http://scikit-learn.org/stable/modules/generated/ >> > sklearn.cluster.FeatureAgglomeration.html >> >> > PS I also asked at >> > https://stackoverflow.com/questions/51526616/ >> > >> >what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make >> >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Gael Varoquaux >> Senior Researcher, INRIA Parietal >> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> Phone: ++ 33-1-69-08-79-68 >> http://gael-varoquaux.info >http://twitter.com/GaelVaroquaux >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From trenton.bricken at duke.edu Thu Jul 26 15:01:41 2018 From: trenton.bricken at duke.edu (Trenton Bricken) Date: Thu, 26 Jul 2018 19:01:41 +0000 Subject: [scikit-learn] f_classif function confusion Message-ID: I am very confused by the f_classif function using feature.selection.f_classif() The function under the User Guide says that it can be used for classification tasks and under the documentation it claims to use ANOVA. However, ANOVA takes a categorical input and continuous output. Why can we provide it with a continuous input and categorical output here? Also looking at the source code, there are warnings for not using ANOVA if your feature is not normally distributed. I think these should be more visible to warn the unaware before they start using this method for feature analysis. Thank you in advance for any help and explanations that can be provided about why f_classif can be used with categorical outputs. Trenton -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajkiranvsgo at gmail.com Sat Jul 28 23:49:05 2018 From: rajkiranvsgo at gmail.com (Rajkiran Veldur) Date: Sun, 29 Jul 2018 09:19:05 +0530 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions Message-ID: Hello Team, I have been following scikit-learn closely these days as I have been working on different machine learning algorithms. Thank you for making everything so simple. Your documents could be followed even by novice. Now, when I was working with spectral clustering, I found your example of *Segmenting the picture of Lena in regions *intuitive and wanted to try it. However, scipy has removed the scipy.misc.lena() module from their library, due to licensing issues. So, I request you to please update the code with any other image. Regards, Rajkiran Veldur -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakevdp at cs.washington.edu Sun Jul 29 00:51:59 2018 From: jakevdp at cs.washington.edu (Jacob Vanderplas) Date: Sat, 28 Jul 2018 21:51:59 -0700 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions In-Reply-To: References: Message-ID: Hi Rajkiran, It sounds like you found an example from an old version of the scikit-learn documentation. 
After scipy removed that image, the example you're referring to was updated to this one: http://scikit-learn.org/stable/auto_examples/cluster/plot_face_segmentation.html Best, Jake Jake VanderPlas Senior Data Science Fellow Director of Open Software University of Washington eScience Institute On Sat, Jul 28, 2018 at 8:49 PM, Rajkiran Veldur wrote: > Hello Team, > > I have been following scikit-learn closely these days as I have been > working on different machine learning algorithms. Thank you for making > everything so simple. Your documents could be followed even by novice. > > Now, when I was working with spectral clustering, I found your example of *Segmenting > the picture of Lena in regions *intuitive and wanted to try it. > > However, scipy has removed the scipy.misc.lena() module from their > library, due to licensing issues. > > So, I request you to please update the code with any other image. > > Regards, > Rajkiran Veldur > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sun Jul 29 01:40:54 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 29 Jul 2018 07:40:54 +0200 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions In-Reply-To: References: Message-ID: <8168f979-d374-4d1e-a84f-d18aeb3197dd@normalesup.org> You are looking at an old version of the documentation. In the up to date documentation, the picture has been replaced: http://scikit-learn.org/stable/auto_examples/cluster/plot_face_segmentation.html ?Sent from my phone. Please forgive typos and briefness.? On Jul 29, 2018, 05:51, at 05:51, Rajkiran Veldur wrote: > Hello Team, > >I have been following scikit-learn closely these days as I have been >working on different machine learning algorithms. Thank you for making >everything so simple. Your documents could be followed even by novice. > >Now, when I was working with spectral clustering, I found your example >of *Segmenting >the picture of Lena in regions *intuitive and wanted to try it. > >However, scipy has removed the scipy.misc.lena() module from their >library, >due to licensing issues. > >So, I request you to please update the code with any other image. > >Regards, >Rajkiran Veldur > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajkiranvsgo at gmail.com Sun Jul 29 06:00:57 2018 From: rajkiranvsgo at gmail.com (Rajkiran Veldur) Date: Sun, 29 Jul 2018 15:30:57 +0530 Subject: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions In-Reply-To: References: Message-ID: Hi Jacob, Thanks for the update. That was real quick and helpful. Regards, Rajkiran Veldur On Sun, Jul 29, 2018 at 10:21 AM, Jacob Vanderplas < jakevdp at cs.washington.edu> wrote: > Hi Rajkiran, > It sounds like you found an example from an old version of the > scikit-learn documentation. 
>
> After scipy removed that image, the example you're referring to was
> updated to this one: http://scikit-learn.org/stable/auto_examples/cluster/
> plot_face_segmentation.html
>
> Best,
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Sat, Jul 28, 2018 at 8:49 PM, Rajkiran Veldur
> wrote:
>
>> Hello Team,
>>
>> I have been following scikit-learn closely these days as I have been
>> working on different machine learning algorithms. Thank you for making
>> everything so simple. Your documents could be followed even by novice.
>>
>> Now, when I was working with spectral clustering, I found your example
>> of *Segmenting the picture of Lena in regions *intuitive and wanted to
>> try it.
>>
>> However, scipy has removed the scipy.misc.lena() module from their
>> library, due to licensing issues.
>>
>> So, I request you to please update the code with any other image.
>>
>> Regards,
>> Rajkiran Veldur
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From prat2 at umbc.edu Mon Jul 30 16:12:29 2018
From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu)
Date: Mon, 30 Jul 2018 16:12:29 -0400
Subject: [scikit-learn] Dependency issues
Message-ID: 

Hi everyone,
I updated the version of SCIKIT_IMAGE used in circle build to 0.14.0.
This is the error that I got

    UnsatisfiableError: The following specifications were found to be in conflict:
      - numpy=1.8.2
      - scikit-image=0.14.0 -> scipy[version='>=0.17']
    Use "conda info " to see the dependencies for each package.

So I updated scipy version to 1.1.0 and I get this error now

    UnsatisfiableError: The following specifications were found to be in conflict:
      - pandas=0.13.1
      - scipy=1.1.0
    Use "conda info " to see the dependencies for each package.

Should I update the versions of everything ? Or am I doing something wrong ?

-- 
Regards,
Prathusha JS Naidu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com Tue Jul 31 01:00:14 2018
From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=)
Date: Tue, 31 Jul 2018 12:00:14 +0700
Subject: [scikit-learn] Dependency issues
In-Reply-To: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From shantanubhattacharya at yahoo.com Tue Jul 31 19:49:15 2018
From: shantanubhattacharya at yahoo.com (Shantanu Bhattacharya)
Date: Tue, 31 Jul 2018 23:49:15 +0000 (UTC)
Subject: [scikit-learn] Query about an algorithm
References: <246651338.109874.1533080955512.ref@mail.yahoo.com>
Message-ID: <246651338.109874.1533080955512@mail.yahoo.com>

Hello,

I am new to this mailing list. I would like to understand the algorithms
provided. Is second order gradient descent with hessian error matrix
supported by this library? I went through the documentation, but did not
find it. Are you able to confirm or direct me to some place that might
have it?

Look forward to your thoughts

Kind regards
Shantanu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: